Researchers rely on the human reference genome as a baseline to identify genetic differences between individuals, which are crucial for understanding human physiology, disease, and evolution. In this study, we focused on the implications of the first-ever complete human reference genome, which improves the identification of genetic variation and ushers in a new era in genetics.
published on Jul 26, 2023
In 2001, the Human Genome Project produced the first-ever sequence of the human genetic code, known as the human reference genome. Researchers like us rely on the reference genome to identify genetic differences between individuals, or genetic variants. Identifying these variants can help us determine ancestry and genetic disease risk, and lead to the development of personalized medical treatments.
Although the human reference genome has enabled a myriad of genomic discoveries, it suffers from shortcomings due to the limitations of the original technology used for its construction. When building a genome, the first step is to fragment DNA into small pieces, the sequences of which can be determined by a DNA sequencing machine. Then, scientists use software to assemble these pieces, similar to solving a very large jigsaw puzzle. However, many regions of the genome are repetitive, like large regions of blue sky in a jigsaw puzzle. If the puzzle pieces in these areas are too small, the way they fit together is ambiguous even to the most sophisticated software. As a result, errors affected several million letters in the human reference genome, and the most repetitive regions of the genome (~8%) remained completely unresolved.
Recent technological advancements have enabled the production of longer reads, analogous to larger puzzle pieces, which span many of these repetitive regions and extend into unique sequences on either side. Using these longer reads, the Telomere-to-Telomere (T2T) Consortium produced the first complete human reference genome, resolving the entire sequence of the genome without gaps. In our study, which accompanied the release of the T2T genome, we focused on the implications of a complete human reference genome for the future study of human genetics, evolution, and disease.
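The jigsaw analogy can be made concrete with a toy example. The sketch below is only an illustration, not the assembly software used by the T2T Consortium: the sequences, the names `GENOME` and `REPEAT`, and the helper `count_placements` are all made up for this example. It counts how many places a read could fit in a small genome that contains a repeated stretch.

```python
def count_placements(genome: str, read: str) -> int:
    """Count the positions in `genome` where `read` matches exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        # Continue searching one position further so overlapping matches count too.
        start = genome.find(read, start + 1)
    return count


# Toy genome (hypothetical): unique flank, three copies of a repeat, unique flank.
REPEAT = "ACGTTG"
GENOME = "TTCAGGC" + REPEAT * 3 + "GATTCCA"

# A short read that lies entirely inside the repeat (a small "blue sky" piece).
short_read = REPEAT
# A long read that spans the whole repeat and reaches unique sequence on both sides.
long_read = "GGC" + REPEAT * 3 + "GAT"

short_placements = count_placements(GENOME, short_read)
long_placements = count_placements(GENOME, long_read)

print("short read fits in", short_placements, "places")
print("long read fits in", long_placements, "place(s)")
```

The short read fits equally well at every copy of the repeat, so its true origin cannot be determined, while the long read is anchored by the unique sequence at its ends and fits in only one place.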
No two individuals have the same DNA sequence, and the differences between our genomes are crucial for understanding human physiology, diseases, and evolution. To identify these genetic variants, researchers compare an individual's DNA sequence to that of a reference genome in a process known as variant calling. However, gaps and errors in the reference can produce inaccurate variant calls. By testing the T2T reference genome on a DNA sequencing dataset of more than 3,200 individuals from five continents, we demonstrated that fixing these reference errors improved variant calling. One way we measured these improvements was by counting the variants detected in 100% of the individuals in our dataset but absent from the reference genome. Such variants often mark regions where the reference genome contains errors or poorly represents the human population. The number of these variants went down when we used the T2T reference genome. Thus, our data support the conclusion that the new reference improves the accuracy of analyses of human variation.
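The counting measure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name `count_universal_variants`, the variant IDs, and the genotype values are invented toy data. Each variant maps to one number per individual, giving how many copies of the non-reference letter that person carries (0, 1, or 2).

```python
from typing import Dict, List


def count_universal_variants(genotypes: Dict[str, List[int]]) -> int:
    """Count variants for which every individual carries at least one
    non-reference copy, i.e. variants seen in 100% of the sample set."""
    return sum(
        1
        for calls in genotypes.values()
        if calls and all(copies > 0 for copies in calls)
    )


# Made-up calls for four individuals against a hypothetical older reference.
calls_old_reference = {
    "var1": [1, 2, 1, 2],  # everyone differs from the reference: suspicious
    "var2": [2, 2, 2, 2],  # everyone differs from the reference: suspicious
    "var3": [0, 1, 0, 2],  # ordinary variant, present in only some people
}

# Made-up calls against a hypothetical corrected reference, where the
# position behind "var1" now matches what people actually carry.
calls_new_reference = {
    "var2": [1, 2, 2, 1],
    "var3": [0, 1, 0, 2],
}

universal_old = count_universal_variants(calls_old_reference)
universal_new = count_universal_variants(calls_new_reference)

print("universal variants, older reference:", universal_old)
print("universal variants, corrected reference:", universal_new)
```

A variant that every single person carries says more about the reference than about the people, which is why a lower count suggests a more accurate reference.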
The T2T genome also contains nearly 200 million letters of novel sequence previously not represented in the reference—about as much DNA as a typical human chromosome! Even more excitingly, across our diverse sample set, we detected about 1.5 million variants in these regions. No one has ever analyzed most of these variants, so even though we don't yet know their biological effects, they hold considerable potential for future discoveries.
These improvements in genetic variant detection will make the T2T reference genome a key resource for biomedical research and personalized medicine for years to come. Accurately identifying genetic variation allows researchers like us to study the potential role of these variants in traits and disease. A common approach for understanding these effects is to compare the genetic makeup of people with and without a disease, a study design known as a genome-wide association study. In our work, we demonstrated that the T2T reference genome improved variant detection across hundreds of medically relevant genes, enabling better clinical genetic testing.
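To give a feel for how a comparison between people with and without a disease works at a single variant, here is a minimal sketch. It assumes a plain Pearson chi-square test on allele counts, which is one simple textbook choice and not necessarily what any particular study uses; the function name `chi_square_2x2` and all counts are invented for illustration.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Here rows are cases and controls, and columns are counts of the
    variant letter and the reference letter.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Made-up allele counts at one variant (100 people per group, 2 copies each).
cases_variant, cases_reference = 60, 140
controls_variant, controls_reference = 40, 160

statistic = chi_square_2x2(
    cases_variant, cases_reference, controls_variant, controls_reference
)
print(f"chi-square statistic: {statistic:.2f}")

# Identical frequencies in both groups give a statistic of zero.
no_association = chi_square_2x2(50, 150, 50, 150)
```

A larger statistic means the variant letter is distributed more unevenly between the two groups. A real genome-wide study repeats a test like this across millions of variants and corrects for the number of tests performed, so a variant that is missed or miscalled because of a reference error can never be tested at all.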
Our work is just the first step toward uncovering the genetic insights that we can gain from a complete human genome sequence. Most importantly, it has become clear that just one genome is insufficient for representing the diversity of the human species. The T2T Consortium is continuing to work on T2T assemblies for several individuals from Africa, Asia, and the Americas, which will reveal more insights into genetic diversity within and across human populations. We hope this project represents the beginning of a new era of genomics, where diverse and high-quality genome assemblies are readily available for research and medicine.
Aganezov, S., Yan, S. M., Soto, D. C., Kirsche, M., Zarate, S., Avdeyev, P., Taylor, D. J., Shafin, K., Shumate, A., Xiao, C., Wagner, J., McDaniel, J., Olson, N. D., Sauria, M. E. G., Vollger, M. R., Rhie, A., Meredith, M., Martin, S., Lee, J., … Schatz, M. C. (2022). A complete reference genome improves analysis of human genetic variation. Science, 376(6588). https://doi.org/10.1126/science.abl3533