The human genome consists of over three billion base pairs of DNA, organised into the 23 chromosome pairs that carry a human being’s genetic code. The initial draft sequence of the human genome was unveiled by Celera Genomics and the International Human Genome Sequencing Consortium in February 2001.
The publication paved the way for substantial advances in the scientific understanding of human biology and disease. But this initial draft was still imperfect. Some information was incorrect, and some sections of the genome were left unsequenced.
In 2019, the Telomere-to-Telomere Consortium (T2T Consortium) was founded, bringing together genomics experts with many different subspecialties to assemble and characterise the remaining 8% of the genome. Established by US National Human Genome Research Institute computational biologist Adam Phillippy and University of California, Santa Cruz geneticist Karen Miga, T2T eventually expanded to a team of over 100 scientists.
Now, the consortium claims to have sequenced and assembled the entirety of the human genome, patching up parts that were missed during the initial sequencing two decades ago.
“We’ve revealed the final 8% of the human genome and will be sure to find new discoveries within this significant fraction of the genome,” says Phillippy. “As a first pass, we have released a set of five additional companion papers that begin to analyse this new sequence. In particular, we focus on the newly uncovered segmental duplications, centromeric satellite arrays, variants, transposable elements, and epigenetic profiles.”
Their work was released to the public as a pre-print in May, meaning it has yet to be peer-reviewed. Miga has said that she won’t consider the announcement official until the paper is finally published in a medical journal.
The current official human reference genome, GRCh38, was released by the Genome Reference Consortium in December 2013. This version contained only around 250 gaps, whereas the first draft in 2001 had around 150,000, but those 250 gaps still accounted for around 8% of the genome overall.
If the T2T Consortium’s model (T2T-CHM13) is adopted instead, it will have major implications for the entire field of genomics.
How did the T2T Consortium complete its work?
Current industry-standard DNA sequencers, made by Illumina, decode small fragments of DNA, which software then reassembles into a full sequence. This works well for most of the genome, but not in parts where the DNA code is made up of long, repeating patterns.
If a computer has to patch together small fragments, it’s hard for the system to put them back together in the right order when they all appear to be identical.
University of California, Berkeley fellow Nicolas Altemose, one of T2T’s key researchers, says: “DNA sequencing technologies only let us determine the sequence of relatively small fragments of DNA, so to sequence a whole genome, we have to shred it into smaller pieces, sequence those smaller pieces, then stitch them back together.
“This is relatively easy in parts of the genome with totally unique sequences, but it becomes really difficult in regions where the same DNA sequence is found repeated over and over.
“These repetitive regions are akin to the blue sky pieces in a jigsaw puzzle, as they lack distinguishing features that help us place them exactly, making them the most challenging regions to put together.”
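The jigsaw problem can be illustrated with a minimal Python sketch. The sequence below is invented for illustration, and the `placements` helper is hypothetical rather than part of any real assembler; it simply lists every position where an error-free read could fit.

```python
def placements(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # step by 1 to allow overlapping hits
    return positions


# Toy sequence (invented): unique flank + six copies of a repeat unit + unique flank.
genome = "TTGACCA" + "ACGT" * 6 + "GGATCCT"

flank_read = genome[2:10]   # 8-base read overlapping the unique left flank
repeat_read = "ACGT" * 2    # 8-base read lying wholly inside the repeat

print(placements(genome, flank_read))   # candidate positions for the flank read
print(placements(genome, repeat_read))  # candidate positions for the repeat read
```

In this toy case the read that touches the unique flank fits in exactly one place, while the read taken from inside the repeat fits equally well in five. Real assemblers also have to cope with sequencing errors and far longer repeats, which this sketch ignores.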
Altemose was the lead author for the team that explored the largest components of the formerly missing regions of the genome: tandemly repeated sequences found in and around the centromere of each chromosome. These sequences constitute about 6% of the genome and some of them play essential roles in cell division.
Rather than chopping the genome into tiny pieces and putting it back together again, the T2T researchers sequenced T2T-CHM13 using technologies developed by two private DNA sequencing companies: Pacific Biosciences (PacBio) and Oxford Nanopore.
The PacBio and Oxford Nanopore technologies do not cut the DNA up into tiny pieces. Instead, PacBio’s tech uses lasers to repeatedly examine the same sequence of DNA, creating highly accurate readouts. Meanwhile, the Oxford Nanopore technology runs DNA molecules through tiny holes, resulting in a very long sequence. These platforms allowed the researchers to sequence much larger fragments of DNA at a time, simplifying the puzzle.
Phillippy says: “Until recently, DNA sequencing methods did not have the necessary combination of accuracy and read length to successfully sequence and assemble the most repetitive parts of the genome.
“Now that we can sequence 20,000 or more base pairs per read, with very high accuracy, we were able to overcome the remaining challenges.”
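Why read length matters can be shown with a scaled-down Python sketch. Everything here is an assumption made for illustration: the sequence is invented, the reads are error-free, and the lengths are tens of bases rather than the 20,000 or more Phillippy describes. The hypothetical `unique_fraction` function measures what share of all possible reads of a given length occur only once in the sequence.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of all error-free reads of length `read_len` that occur exactly once."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


# Toy sequence (invented): a 24-base tandem repeat between two unique flanks.
genome = "TTGACCA" + "ACGT" * 6 + "GGATCCT"

for read_len in (4, 8, 16, 30):
    print(read_len, round(unique_fraction(genome, read_len), 2))
```

In this toy sequence, some 4-base reads are ambiguous, whereas every 30-base read, being longer than the 24-base repeat, includes flanking sequence and can be placed uniquely.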
The DNA the T2T researchers sequenced came not from a person but from a hydatidiform mole, a growth that occurs in a woman’s uterus when a sperm fertilises an egg that does not have a nucleus. This meant it contained two copies of the same 23 chromosomes, rather than two differing sets, making the computational effort of creating the DNA sequence simpler.
All in all, the researchers added or fixed over 200 million base pairs in the reference genome, finding that the human genome is approximately 3.05 billion base pairs long.
The new sequence includes five chromosome arms that were entirely missing from past reference sequences and the centromeric satellite arrays for all chromosomes. These newly uncovered sequences can now be investigated to better understand their function and potential associations with disease.
What does this mean for the clinical lab?
Altemose says: “With the T2T assembly, the scales have fallen from our eyes and we have finally been able to observe the detailed structure of vast regions of the genome that were previously very poorly characterised. This has revealed new insights into how large repetitive regions evolve in the human genome, along with the discovery of new repeat families and new genes within these regions.
“Excitingly, this assembly also opens up these formerly missing regions to be studied using modern experimental approaches that can reveal how they vary across people and how they function within human cells.”
This new approach could have a significant impact on genomic projects in clinical labs when it comes to whole genome sequencing for individual patients, particularly those suspected of having or already diagnosed with a rare genetic disease.
Currently, DNA from regions missing from the reference assembly is still sequenced during this process, but the resulting reads either fail to align or align incorrectly to their closest match in the reference.
Altemose says: “This can lead to false positive variant calls in the assembled regions of the genome, and it omits potentially clinically relevant variation in the missing parts of the genome assembly. One team within the T2T Consortium explored how the new T2T assembly improves variant calling using large sequencing datasets from many individuals.”
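The mechanism behind these false positives can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the consortium's variant-calling pipeline: the two gene copies are invented sequences, `N` stands in for placeholder bases, and the hypothetical `best_alignment` function only slides the read along the reference without allowing gaps.

```python
def best_alignment(reference: str, read: str) -> tuple[int, list[int]]:
    """Slide `read` along `reference` (no gaps) and return the start position
    with the fewest mismatches, plus the mismatching reference coordinates."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best_start, best_mismatches = 0, None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [start + i for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches


# Invented sequences: two near-identical gene copies differing at two bases.
gene_a = "ATGGCGTACCTGAAGTTC"
gene_b = "ATGGCGTTCCTGAAGATC"
spacer = "N" * 10  # placeholder bases, never matched by a read

incomplete_ref = spacer + gene_a + spacer          # copy B missing from the reference
complete_ref = incomplete_ref + gene_b + spacer    # copy B present

read = gene_b  # a healthy read from copy B, carrying no real variants

print(best_alignment(incomplete_ref, read))  # read forced onto copy A
print(best_alignment(complete_ref, read))    # read lands on its true origin
```

Against the incomplete reference, the read is forced onto copy A and its two natural differences look like variants. Against the complete reference, it matches copy B exactly and no variants are reported.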
The researchers saw a particularly large drop in false positive variant calls in 269 medically relevant genes in and around some of the newly improved regions of the genome.
Phillippy says: “This will clearly improve the accuracy of studies involving these genes. However, looking long-term, I am most excited about the potential discovery of new genes and disease associations within the extra 200 million bases of sequence we have added.”
Even though T2T-CHM13 has yet to be adopted as the official reference genome, some researchers have already started using it in their work. Children’s Mercy Hospital director of molecular oncology Dr Midhat Farooqui has begun to use the genome for his research into rare childhood diseases, lining up DNA from his patients against now-filled gaps to search for previously undetected mutations.
Phillippy says: “It will take years of research to understand the potential function or effect of these sequences, but those findings will help complete our understanding of the genome and eventually make their way into the clinic.”