Completing the human reference genome one chromosome at a time
Implications for understanding human disease and genetic variation
A person’s genome influences much about their life: it can affect how they look, how they act, and their susceptibility to genetic diseases, among other things. One reason we know so much about genetic variation and its influence on a person is that we have a reference genome for comparison. The first draft of the human reference genome was published in 2001 as part of the Human Genome Project, and the full sequence was declared “finished” in 2003 (1). This $2.7 billion, 13-year feat was taken on by an international consortium of scientists who painstakingly assembled a genome from the DNA of several volunteers to a standard that the scientific community considered “finished”: a nearly continuous sequence with no more than one error in 100,000 bases (2).
The first human reference genome has served as a resource for the scientific community by advancing our understanding of disease susceptibility, prevention, and treatment (3). For example, genome-wide association studies (GWAS) have been used to screen the genome for single nucleotide polymorphisms (SNPs), single-base changes in the genetic code that can be implicated as risk factors for disease. Moreover, the field of pharmacogenomics has allowed clinicians to tailor treatment to individual patients based on their genetic information, a core component of precision medicine. Although the 2003 reference covered 99% of the gene-containing regions of the human genome with 99.99% accuracy, two decades later it still has shortcomings (4). Since the human genome consists of ~3 billion bases on a single set of chromosomes, 99% coverage and 99.99% accuracy still leave room for thousands of gaps and inaccuracies in the assembly. Until recently, no single human chromosome had been sequenced from end to end.
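The scale of what 99% and 99.99% leave behind can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it uses the rounded figures quoted above and assumes, for simplicity, that both percentages apply uniformly to all ~3 billion bases, which the original figures (stated for gene-containing regions) do not claim.

```python
# Back-of-the-envelope estimate using the rounded figures quoted in the text.
# Assumption (not from the source): both rates apply uniformly genome-wide.
GENOME_SIZE = 3_000_000_000      # ~3 billion bases in one set of chromosomes
BASES_PER_ERROR = 10_000         # 99.99% accuracy = 1 wrong base per 10,000
BASES_PER_UNCOVERED = 100        # 99% coverage = 1 missing base per 100


def expected_count(genome_size: int, bases_per_event: int) -> int:
    """Expected number of events when one occurs every `bases_per_event` bases."""
    return genome_size // bases_per_event


print(expected_count(GENOME_SIZE, BASES_PER_ERROR))      # 300000
print(expected_count(GENOME_SIZE, BASES_PER_UNCOVERED))  # 30000000
```

Even under these generous simplifications, small percentages translate into hundreds of thousands of potentially incorrect bases and tens of millions of missing ones.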
Improving the human reference genome (currently GRCh38) has important implications for patient health. Consider Chromosome 21, for example: it is involved in diseases ranging from Down Syndrome (affecting 1 in 700 live births) to various cancers (5). However, the original reference sequence of Chromosome 21 has several gaps that together comprise about 100,000 base pairs (5). These gaps could hold key information about the causes of numerous genetic diseases and traits, and this is only one of the 23 pairs of chromosomes that make up the human genome.
Sequencing technologies have come a long way since the days of the Human Genome Project. Advances in sequencing have outpaced improvements in computation, with the cost per genome falling faster than the pace expected for gains in computational power (6). With these innovations in hand, scientists have set out once more to fill the gaps in the human genome and correct the inaccuracies in the latest reference. In a recent publication in Nature, a multicenter research team led by Dr. Adam Phillippy from the National Human Genome Research Institute reached a major milestone by sequencing the human X chromosome from “telomere to telomere”, in other words, from end to end (Figure) (7).
What is so hard about sequencing a complete chromosome?
Since sequencing technologies today cannot read the entire human genome at once, we must rely on computers to put the pieces together. These pieces, called sequencing “reads”, are sequences of base pairs corresponding to either part or the entirety of a DNA fragment. Read length varies greatly with the sequencing method: some massively parallel sequencing technologies (pyrosequencing and reversible terminator from Thermo Fisher and Illumina, respectively) produce reads that are only about 150-200 base pairs long, whereas Sanger sequencing usually generates reads of about 500 base pairs and can reach ~800 base pairs. Real-time sequencing platforms (PacBio and Oxford Nanopore) are long-read technologies that can sequence fragments of more than one hundred thousand base pairs.
Most technologies identify individual bases with good accuracy; the problem arises when assembling repetitive sequences or complex structural rearrangements (e.g., segmental duplications, inversions, deletions). Repetitive regions in particular are challenging because of both computational and sequencing errors. Computationally, when merging reads to form a contig (a consensus sequence of DNA formed from overlapping sequencing reads), it is difficult to discern where one repeat ends and the next begins. Assembly algorithms may erroneously conclude that two adjacent repeat copies are the same sequence and merge them, leaving fewer repeats in the assembly than truly exist at that position in the genome. From a sequencing perspective, a repetitive sequence gives the same signal over and over, whether that signal is a fluorescent signal (Sanger), a change in ionic current (Nanopore and ion semiconductor sequencing), or light pulses emitted from nucleotides (PacBio). As a result, it can be difficult to tell whether a repeated signal is an error or a genuinely distinct sequence.
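The repeat-collapse problem can be illustrated with a toy example. The sketch below is not how any real assembler works: it uses made-up sequences (the repeat unit and flanks are hypothetical), assumes error-free reads, and ignores read depth, which real tools do use as a noisy hint about copy number. It simply shows that when reads are shorter than the repeat array, a region with five repeat copies and one with eight produce exactly the same set of reads.

```python
# Toy illustration with hypothetical sequences; reads are idealized (error-free).
UNIT = "ACGTTGCA"        # hypothetical 8-bp repeat unit
LEFT = "GGGATTC"         # hypothetical unique sequence left of the array
RIGHT = "CTTAAGG"        # hypothetical unique sequence right of the array


def make_region(copies: int) -> str:
    """Unique flank + tandem repeat array with `copies` units + unique flank."""
    return LEFT + UNIT * copies + RIGHT


def distinct_reads(region: str, read_len: int) -> set:
    """Every distinct substring of length `read_len` found in the region."""
    return {region[i:i + read_len] for i in range(len(region) - read_len + 1)}


short_len = 12  # shorter than either repeat array
same = distinct_reads(make_region(5), short_len) == distinct_reads(make_region(8), short_len)
print(same)  # True: the reads alone cannot tell 5 copies from 8
```

With only the read sequences to go on, an assembler has no way to choose between the two regions, which is why repeat arrays tend to be collapsed into fewer copies.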
These problems manifest when trying to assemble the centromere, the central region of the chromosome that is important for cell division. The centromere is composed of repetitive DNA arrays that can span megabases (Mbp). On the X chromosome, the 3.1 Mbp centromere is essentially one large repetitive array called DXZ1. To assemble this huge repetitive region, the scientists took advantage of the “ultra-long reads” produced by PacBio and Oxford Nanopore, manually identifying reads that span large repeat regions and are anchored by unique “marker” regions. To supplement these long reads, they used another technique, called optical mapping, to validate that the structure of the repeat regions was correct.
The scientists have equated this challenge to solving a puzzle. In this puzzle of assembling the human genome, one does not know how many pieces there are; most pieces are identical and there are just a few unique features in the overall image. If the pieces were small, this puzzle would be next to impossible to assemble. By using ultra-long reads (PacBio/Nanopore), however, the scientists increased the size of the puzzle pieces so that there were fewer pieces to assemble in the first place. They then identified unique, or “marker”, features in the image that anchor the assembly of the larger pieces. Once the researchers thought they had put the puzzle together correctly, they used optical mapping to verify that they had the correct number of pieces in the correct orientation. By using larger pieces, identifying unique features, and isolating misassembled regions with optical maps, a once un-assemblable puzzle was finally assembled.
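The value of a long read anchored by unique markers can be shown with a small sketch. This is a toy under stated assumptions, not the consortium's method: the sequences are invented, the repeat copies are perfectly identical, and the markers are placed on either side of the array for simplicity. The function name `copies_spanned` is hypothetical. The point is only that a read reaching from one marker to the other lets the copies be counted directly, while a read lying entirely inside the array cannot be placed.

```python
# Toy illustration with hypothetical sequences and perfect repeat copies.
UNIT = "ACGTTGCA"            # hypothetical repeat unit
LEFT_MARKER = "GGGATTC"      # hypothetical unique marker before the array
RIGHT_MARKER = "CTTAAGG"     # hypothetical unique marker after the array


def copies_spanned(read: str):
    """Return the repeat copy number if the read spans both markers, else None."""
    start = read.find(LEFT_MARKER)
    end = read.rfind(RIGHT_MARKER)
    if start == -1 or end == -1 or end < start + len(LEFT_MARKER):
        return None  # the read is not anchored on both sides
    array = read[start + len(LEFT_MARKER):end]
    copies, remainder = divmod(len(array), len(UNIT))
    if remainder != 0 or array != UNIT * copies:
        return None  # not a clean array of whole repeat units
    return copies


long_read = "TT" + LEFT_MARKER + UNIT * 8 + RIGHT_MARKER + "AA"
print(copies_spanned(long_read))   # 8
print(copies_spanned(UNIT * 3))    # None: no markers, so the read is unanchored
```

In this picture, longer reads are the larger puzzle pieces and the markers are the few unique features that fix where each piece belongs.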
How does this impact the future of genetics and genomics?
To this day, there are still thousands of gaps and inaccuracies throughout the human reference genome, and these may harbor variants associated with human health and disease. The completion of the first human chromosome assembly from end to end represents a landmark achievement in genetics. It will give scientists access to new information that could improve our understanding of disease pathology and better inform the development of therapeutics. With another 23 chromosomes left to sequence from end to end (including Y, which is only present in males), the Telomere-to-Telomere Consortium has set out to release a truly complete reference genome for research applications, biotechnology, and clinical care. Beyond human genetics, the techniques established in this study could revolutionize how scientists reconstruct the most challenging, repetitive assemblies across all branches of the tree of life. Ultimately, having a better reference genome means better informing all of the fields that use it.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860–921.
- International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931–45.
- Bloss CS, Jeste DV, Schork NJ. Genomics for disease treatment and prevention. Psychiatr Clin North Am. 2011 Mar;34(1):147–66.
- 20 years later, genomicists remember the draft human genome sequence [Internet]. [cited 2020 Aug 11]. Available from: https://www.genome.gov/aboutnhgri/Director/genomics-landscape/July-2-2020-twenty-years-later-genomicists-rememberannouncement-of-draft-human-genome-sequence
- Hattori M, Fujiyama A, Taylor TD, Watanabe H, Yada T, Park HS, et al. The DNA sequence of human chromosome 21. Nature. 2000 May 18;405(6784):311–9.
- Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). [cited 2020 Aug 10]; Available from: http://www.genome.gov/sequencingcostsdata
- Miga KH, Koren S, Rhie A, Vollger MR, Gershman A, Bzikadze A, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020 Jul 14;
- Trost B, Engchuan W, Nguyen CM, Thiruvahindrapuram B, Dolzhenko E, Backstrom I, et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020 Jul 27;
- López Castel A, Cleary JD, Pearson CE. Repeat instability as the basis for human diseases and as a potential target for therapy. Nat Rev Mol Cell Biol. 2010 Mar;11(3):165–
About the Author
Julie Lake is a post-baccalaureate fellow in the National Institute of Neurological Disorders and Stroke. She graduated from the University of California, Berkeley in May 2020 with a B.S. in Microbial Biology and a minor in Data Science. Her current research focuses on understanding the genetics of neurodegenerative diseases such as Parkinson’s disease and Lewy body dementia.