“Sequence Duplication Levels” module still fails after pre-processing Illumina data












Why are the sequence duplication levels still high after I trimmed my reads with Trimmomatic? I am using the following Trimmomatic operations: HEADCROP:19 TRAILING:20 MINLEN:66.

How can I solve this problem? Thank you.










illumina data-preprocessing trimming fastqc






edited 52 mins ago by Daniel Standage















asked 2 hours ago by yy97




Why do you think this is a problem to begin with?
– Devon Ryan
2 hours ago
















I thought the cross sign (X) meant some kind of error that should be eliminated with a suitable pre-processing technique (like trimming to solve an adapter content error)? I am new to Illumina, thank you for advising :)
– yy97
2 hours ago










What type of sequencing libraries are these? Whole genome, RNA-Seq, whole exome, targeted sequencing? What genome? Also, how complex is the library you are sequencing?
– Bioathlete
1 hour ago














2 Answers
FastQC assumes that every sample comes from whole genome sequencing and flags a module as failed if the data differ too much from that assumption. As a result, essentially all RNA-seq, ChIP-seq, and ATAC-seq samples fail one module or another. This is completely expected and not a cause for concern. Concern yourself primarily with whether all of your samples are similar in their metrics.
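One way to check whether samples are similar in their metrics is to compare the per-module pass/warn/fail flags across samples. The following is only a minimal sketch, not part of the original answer. It assumes the usual layout of the `fastqc_data.txt` file inside each FastQC output archive, where every module starts with a line such as `>>Sequence Duplication Levels<TAB>fail` and ends with `>>END_MODULE`. The function names are hypothetical.

```python
def module_statuses(fastqc_data_text):
    """Return {module name: status} parsed from the text of a fastqc_data.txt file."""
    statuses = {}
    for line in fastqc_data_text.splitlines():
        # Module headers look like ">>Module Name<TAB>pass"; skip the closing marker.
        if line.startswith(">>") and not line.startswith(">>END_MODULE"):
            name, _, status = line[2:].partition("\t")
            statuses[name] = status.strip()
    return statuses


def inconsistent_modules(per_sample):
    """Return the modules whose status is not identical in every sample.

    per_sample maps a sample name to the dict returned by module_statuses().
    """
    modules = set()
    for statuses in per_sample.values():
        modules.update(statuses)
    differing = {}
    for module in sorted(modules):
        seen = {sample: st.get(module, "missing") for sample, st in per_sample.items()}
        if len(set(seen.values())) > 1:
            differing[module] = seen
    return differing
```

A module that fails in every sample of, say, an RNA-seq experiment is the expected case described above. A module that fails in only one sample is the kind of outlier worth a closer look.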

















answered 2 hours ago by Devon Ryan























To answer your direct question, there are a few reasons why there might be high levels of sequence duplication. From the FastQC help:

"The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module."





• As @DevonRyan mentioned, with certain sequencing protocols such as RNA-Seq, it isn't uncommon for two reads to come from exactly the same location. This isn't a problem with the RNA-Seq data, with Trimmomatic, or with FastQC. This kind of data simply violates the module's assumption, so the failure can be ignored in those circumstances.

• PCR duplicates are another possible cause. They can give the false impression of high coverage at a particular locus when in fact it's just a single observed read that has been duplicated many times (see here for more details). PCR duplicates can usually be detected and removed if your analysis involves mapping to a reference genome, but whether this is actually a problem you need to fix depends on what type of data you have and what types of analysis you want to do.

• Large numbers of adapter dimers or rRNA may be present in your sample.
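To get a rough feel for how much exact-sequence duplication a file contains, something like the sketch below can help. It is a deliberate simplification and not FastQC's own algorithm, and the function names are hypothetical. It also cannot tell PCR duplicates apart from legitimate biological duplicates; it only counts identical read sequences.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record.

    `lines` can be an open text file or any iterable of lines.
    """
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def duplication_summary(sequences):
    """Summarise exact-sequence duplication in an iterable of read sequences."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "distinct": 0, "percent_remaining_if_deduplicated": 0.0}
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        # Share of reads that would be left if each distinct sequence were kept once.
        "percent_remaining_if_deduplicated": 100.0 * distinct / total,
    }
```

For an uncompressed file this could be called as `duplication_summary(read_fastq_sequences(open("reads.fastq")))`, where `reads.fastq` is a placeholder name. Holding every distinct sequence in memory is fine for a quick look at a subsample, but not for a full run.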


But I think it's also important to address how quality control (QC) is run. It can be tempting to run and re-run QC tools like Trimmomatic until all errors go away, but to be blunt, these tools cannot think for you. For example, it's possible to get rid of most adapters by aggressively cropping/trimming both ends of each read, but you'll likely throw away a lot of good data that way. You may want to look into Trimmomatic's ILLUMINACLIP operation instead. It may also be tempting to crop/trim reads aggressively if there are compositional biases near the beginning or end of the read. In fact, random hexamer priming can cause the Per-base Sequence Content module to fail on almost any RNA-Seq sample. Again, from the FastQC help (emphasis mine):

"Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module."




In other words, it's best to determine what can cause each FastQC module to fail, investigate whether this is actually a problem for your data set (referring to the documentation as needed), and make a deliberate QC plan that addresses the issues that need attention.
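As an illustration of the ILLUMINACLIP suggestion above, here is a hedged sketch of how a Trimmomatic call with adapter clipping might be assembled. Several things are assumptions rather than facts from the question: the reads are treated as single-end, the jar path, file names, and adapter FASTA are placeholders, and the `2:30:10` clipping thresholds are illustrative values only. TRAILING and MINLEN default to the values given in the question. HEADCROP is left out purely to show the alternative to a fixed crop.

```python
def trimmomatic_se_command(jar, fastq_in, fastq_out, adapters_fasta,
                           trailing_quality=20, min_length=66):
    """Build a single-end Trimmomatic command as an argument list (nothing is run)."""
    return [
        "java", "-jar", jar, "SE", "-phred33",
        fastq_in, fastq_out,
        # ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome threshold>:<simple threshold>
        f"ILLUMINACLIP:{adapters_fasta}:2:30:10",
        f"TRAILING:{trailing_quality}",
        f"MINLEN:{min_length}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real ones. Comparing FastQC reports before and after such a run shows whether the Adapter Content module changes, which is the module adapter clipping can actually address.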

















answered 54 mins ago by Daniel Standage





















