Breaking the unknowns of the human reference genome

The publication of designs of the human genome in 2001 was a milestone¹^,². Scientists were able to study long stretches of each human chromosome for the first time, base by base. As such, researchers could begin to understand how individual genes were arranged and how the surrounding non-protein-coding DNA was structured and organized. Despite these amazing advancements, the design genomes were still incomplete, with over 150 million bases missing³. Technological advancements in the intervening years allowed researchers to contribute to the design, finally achieving full sequencing of a chromosome⁴^,⁵ in 2020. As a result, new and uncharacterized parts of the human genome are beginning to surface, ushering in a new exciting period of biological discovery.

What exactly was included in the design genomes? The original design featured many previously unexplored intergenic regions. It also included the vast majority of genes. The International Human Genome Sequencing Consortium¹ initially estimated the genome contained 30,000-40,000 protein-coding genes, although an updated genome had been published⁶ in 2004, along with improved gene prediction approaches⁷, led the number to be revised to about 20,000. The 2004 genome yielded a high-resolution map of 2.85 billion nucleotides of euchromatin – the loosely packed regions of DNA that are enriched in genes and make up about 92% of the human genome.

The reference genome launched the scientific community into an era of genome exploration, shifting focus from single genes to more complete, genome-wide studies. However, gaps remained on each of the 23 pairs of human chromosomes, estimated to contain over 150 megabases of unknown sequence³ (Figure 1). The largest gaps were at sites enriched with highly repetitive DNA or sequences for which many are nearly identical copies. These sections were originally difficult to clone, sequence, and assemble correctly. As a result, the human genome project intentionally underrepresented these repetitive sequences. Although researchers had a very basic idea of the nature of sequences in these regions, the high-resolution genomic organization of the regions remained elusive.

**Figure 1 | Filling in the missing sequence in the human genome.** a, The 2001 Human Genome Design¹^,² covered most of the gene-rich DNA, which is loosely packed in the nucleus. But many gaps remained in densely packed regions rich in repetitive DNA sequences, which are often not transcribed (the overall size of the gaps is exaggerated here, for easy interpretation). bThanks to advances in sequencing and bioinformatics, researchers can now study all of these missing sequences. These include the telomere and subtelomeric regions that shield chromosomes; centromeric structures essential for cell division; and in particular short and highly repetitive chromosome arms called acrocentric arms. Regions in which DNA is duplicated, either in a single location or in a segmented manner, can now also be analyzed.

Early attempts to close the gaps used long read sequences to span repeating sequences – but such readings were initially highly error prone. New opportunities arose in the 2010s, thanks to advances in the ability to read longer stretches of sequence (described, for example, in Refs. 8 and 9), along with the development of scalable bioinformatics tools. Sequence reads from tens to hundreds of kilobases allowed the study of the genomic organization of many medium gaps. This provided insights into some subtelomeric regions⁹ repeat-rich DNA adjacent to the telomere structures covering the ends of chromosomes. It also enabled the study of the first centromere satellite array¹⁰, in which short sequences are repeated in succession for about 300 kilobases. A subset of segmental duplications (sequence pieces that share 90-100% of their bases and are found in multiple locations) have also been resolved, many of which contain genes previously missing from the reference genome⁹^,¹¹. However, many of the largest multi-megabase repeating regions remained stubborn.

In recent years, the combination of both ultra-long reads⁹ and very accurate long read data¹² has turned out to be a game-changer for solving these regions¹³^,¹⁴, which for the first time reveals extremely long tracts of tandem repeats and regions enriched with segmental duplications. By breaking through these technological barriers, scientists are now discovering extensive repeat-rich regions that can span millions of bases and form the very short arms of chromosomes.

Researchers don’t yet fully understand why parts of the human genome are organized in this way. But gaining such insight will no doubt be valuable, as these repeat-rich sequences are often positioned in locations critical to life. For example, long stretches of ribosomal DNA (rDNA) repeats encode RNA components of the protein synthesizing machinery of the cell and play an important role in nuclear organization.¹⁵. And the repetitive DNA of structures called centromeres is essential for proper chromosome segregation during cell division¹⁶.

These large pieces of repetitive DNA come with different sets of rules, in terms of their genomic organization and evolution. They are also subject to different epigenetic regulation (molecular modifications of DNA and associated proteins that do not change the underlying DNA sequence), which makes repetitive DNA different from euchromatin in terms of its organization, replication timing and transcriptional activity.¹⁷^–¹⁹. Many genome-wide tools and datasets cannot yet fully capture all of this information from extremely repetitive regions of DNA, and so scientists do not yet have a full picture of which transcription factors bind to them, how these regions are spatially organized in the nucleus, or how parts of our genome change during development and in disease states. Now, just like the genome’s initial release decades ago, researchers are faced with a new, unexplored functional landscape in the human genome. Access to this information will drive technology and innovation to encompass these repeat regions, further broadening our understanding of genomic biology.

Over the past year, scientists have used extremely long and highly accurate sequence reads to reconstruct entire human chromosomes, from telomere to telomere⁴^,⁵. Last year also saw the release of an almost completely human reference genome from an effective ‘haploid’ human cell line, with only five remaining gaps marking the locations of rDNA arrays (go.nature.com/3rgz93y). In this line, cells have two identical chromosome pairs, which simplifies the challenge of repeated assembly compared to typical human cells (which are diploid, with different chromosomes inherited from the mother and father). These maps together provide the first high-resolution glimpse of centromere regions, segmental duplications, subtelomeric repeats, and each of the five acrocentric chromosomes, which have very short arms made up almost entirely of highly repetitive DNA at one end.

It’s tempting to think scientists are finally approaching the finish line. However, a single genome assembly, even if it is complete with near-perfect sequence accuracy, is an insufficient reference to study the sequence variation that exists in the human population. Existing maps mapping diversity across the euchromatic parts of the genome need to be expanded to fully capture repetitive regions, where copy number and the organization of repetition differ from person to person. To do this, strategies must be developed for routine production and analysis of whole human diploid genomes. The ambitious goal of arriving at a more complete and comprehensive reference of humanity will no doubt improve our understanding of genome structure and its role in human disease, and will tie in with the promise and legacy of the Human Genome Project.

Source

Share this:

Related