Why does this human bam file only have one copy of each chromosome?












As we know, in human DNA one copy of each chromosome comes from the mother and the other copy comes from the father, so every chromosome is present in two copies. So if we extract the exome sequence from this DNA, the exome should also contain two copies of each chromosome (one from the mother and one from the father). But in both the fastq file and the bam file of my WES data, I always find only one copy of each chromosome.



Q1: Where is the other gene copy in the sequence, or have I missed something?



Q2: How can I check the ploidy and find out whether there are two copies of each chromosome in WES data? How can I do this if my WES data are mapped to the reference and stored in a bam file?










      bam sequencing fastq exome






      edited 55 mins ago









      conchoecia

      asked 2 hours ago









      Lot_to_learn

          1 Answer
          The maternal and paternal copies of a chromosome are called haplotypes. Many metazoans, not just humans, are diploid and receive one maternal and one paternal set of chromosomes through sexual reproduction.



          Response to Q1



          Your question, in other words, is: Why do bam files not differentiate between haplotypes?



          Your question gets at something fundamental about how most people do "genomics" today. When most people assemble a genome or create a reference genome, they produce a fasta file that contains a single sequence for each chromosome. This is not biologically accurate for a diploid organism, which carries two distinct sequences per chromosome. Most reference genomes report only a chimeric consensus of the two. Such reference genomes are haploid, or haplotype-collapsed.



          This is where your question comes in: you have mapped reads that are biologically derived from two different haplotypes to a reference genome that contains only one consensus sequence, which (inadequately) represents both haplotypes. As a result, at any single locus in the bam file there will be mapped reads from both haplotypes. If your reference fasta file instead contained the exact sequence of both haplotypes, and you mapped the reads to it and looked only at primary alignments, the reads would mostly map to their haplotype of origin.
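
As a rough illustration of reads from both haplotypes piling up at one locus, the sketch below counts the bases seen at a single position. The pileup column is made-up data and `allele_fractions` is a hypothetical helper; in a real analysis the bases would be read from the bam file (for example from `samtools mpileup` output). Two alleles at roughly equal frequency suggest a heterozygous site, i.e. one where the two haplotypes differ.

```python
from collections import Counter


def allele_fractions(bases):
    """Return the fraction of reads supporting each base at one position."""
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Hypothetical pileup column: the bases observed in 20 reads that cover
# one reference position (illustrative data, not taken from a real bam file).
pileup_column = "A" * 11 + "G" * 9

fractions = allele_fractions(pileup_column)
for base, fraction in sorted(fractions.items()):
    print(f"{base}: {fraction:.2f}")
```

Both alleles are present in the same alignment even though the reference holds only one sequence for that position.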



          This leads to another topic called phasing, in which sequencing data are used to determine which of the polymorphisms belong together on each haplotype. Phasing has limitations because it relies on properly detecting variant sites. Software such as GATK can find single nucleotide polymorphisms (SNPs) if there is good sequencing coverage, but detecting insertions and deletions is much more difficult. This gives a very SNP-biased view of the haplotype differences in any given genome. After variant sites are found, phasing software such as HapCUT2 determines which variants fall on which haplotype and outputs blocks of variants that are thought to belong to the same haplotype.
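
The idea behind read-based phasing can be sketched with a toy example. This is not how any particular phasing tool is implemented; it only assumes that some reads span two heterozygous sites, and it takes the two most frequently co-observed allele pairs as the two haplotypes. The read data and the function name `phase_two_sites` are hypothetical.

```python
from collections import Counter


def phase_two_sites(read_alleles):
    """Return the two best-supported (site1, site2) allele pairs.

    read_alleles holds one tuple per read spanning both sites:
    (allele seen at site 1, allele seen at site 2).
    """
    counts = Counter(read_alleles)
    return [pair for pair, _ in counts.most_common(2)]


# Hypothetical reads covering two heterozygous SNPs (A/G and T/C).
# The last read carries a sequencing error and supports neither haplotype.
reads = [
    ("A", "T"), ("A", "T"), ("G", "C"), ("A", "T"),
    ("G", "C"), ("G", "C"), ("A", "C"),
]

haplotypes = phase_two_sites(reads)
print(haplotypes)
```

Here the reads support A with T on one haplotype and G with C on the other. Real phasing tools have to chain many such sites together and cope with errors and gaps in coverage, which is why the output comes as blocks rather than whole chromosomes.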



          Phasing alone is not enough to reconstruct the exact sequence of both haplotypes, because read mapping cannot detect all variants. The gold standard of the future is diploid de novo genome assembly, in which each haplotype is assembled independently. This is an active area of research for people who develop genome assemblers and is tightly linked to advances in PacBio and Oxford Nanopore long reads. The paper on trio canu is a good start for learning about one successful diploid assembly technique.



          Response to Q2



          If you want to check the ploidy of all of the chromosomes, you need at least 10X coverage of whole-genome shotgun data and tools like smudgeplot and genomescope.



          If you are trying to check for local duplications or whole-chromosome duplications in something like a cancer sample, you would also need whole-genome shotgun data. WES data do not give reliable information on ploidy on their own, since the read count at a given locus depends strongly on how efficiently that region was captured during exome enrichment, not only on the actual quantity of homologous chromosomal DNA for the region in question.



          If you are trying to use a bam file to look for evidence of a chromosomal duplication, or of duplication of a locus in a cancer sample, you would look for an increase in whole-genome shotgun coverage at that locus compared to a known normal locus. For example, if all chromosomes but one have a coverage of 22 and one chromosome has an average coverage of 33, this is evidence of a trisomy, since 33/22 = 1.5 = 3/2. This logic can be applied to smaller regions as well (barring repetitive regions, paralogs, et cetera).
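
The coverage comparison can be written as a small calculation. This is a minimal sketch under stated assumptions: the per-chromosome mean coverages are made-up numbers matching the example (in practice they could come from a tool such as `samtools coverage`), most chromosomes are assumed to be diploid so that the median coverage stands for two copies, and `estimate_copy_number` is a hypothetical helper.

```python
from statistics import median


def estimate_copy_number(mean_coverage, baseline_ploidy=2):
    """Scale each chromosome's mean coverage to an estimated copy number.

    Assumes the median coverage corresponds to baseline_ploidy copies.
    """
    baseline = median(mean_coverage.values())
    return {
        chrom: baseline_ploidy * cov / baseline
        for chrom, cov in mean_coverage.items()
    }


# Illustrative whole-genome shotgun coverages (not real data).
coverage = {"chr1": 22.0, "chr2": 22.0, "chr3": 22.0, "chr21": 33.0}

copies = estimate_copy_number(coverage)
for chrom, n in copies.items():
    print(f"{chrom}: ~{n:.1f} copies")
```

With these numbers chr21 comes out at about three copies and the others at two. Sex chromosomes need separate treatment, since a single X in a male sample is expected to show half the autosomal coverage.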






                edited 47 mins ago

























                answered 1 hour ago









                conchoecia
