Computer algorithm deciphers the most difficult genome segments

Charlotte Edwards 6 April 2018 (Last Updated April 6th, 2018 16:14)

Researchers from Columbia University have developed a computer algorithm that has the potential to decipher the hardest-to-read sections of the genome.

Geneticists continue to look for ways to decipher the mysteries hidden in DNA. Credit: Mehmet Pinarci.

DNA shapes almost everything in the human body, from growth and development to ageing and disease susceptibility, yet attempts to decode it face many challenges. The algorithm created by the Columbia researchers could help solve several of them.

Dr Richard Mann, principal investigator at Columbia’s Mortimer B. Zuckerman Mind Brain Behavior Institute and a senior author of the study, said: “The genomes of even simple organisms such as the fruit fly contain 120 million letters worth of DNA, much of which has yet to be decoded because the cues it provides have been too subtle for existing tools to pick up.

“But our new algorithm lets us sweep through these millions of lines of genetic code and pick up even the faintest signals, resulting in a much more complete picture.”

The researchers believe the algorithm could be critical to finding new ways to reduce the risk of diseases and disorders such as schizophrenia, Parkinson’s disease and autism.

One of the biggest challenges in deciphering the genome is the mystery of Hox genes.

Mann explained: “Hox genes are the body’s master architects. They drive some of the earliest and most critical aspects of growth and differentiation, such as where in a developing embryo the head and limbs should be positioned.

“Hox genes do this by producing proteins called transcription factors, which bind to DNA sequences in order to turn large cohorts of genes on or off; like flipping thousands of switches in exactly the right order.”

Decades of Hox gene research have highlighted a paradox: although each individual Hox gene guides a different feature of growth, the Hox transcription factors all bind strongly to the same set of easily identifiable DNA sequences.

In 2015, Mann and his team discovered that Hox transcription factors also bind at many other locations, called low-affinity sites. They theorised that these sites were key to how Hox transcription factors drive one aspect of development rather than another, but they needed a way to identify the sites across the genome.

The researchers joined forces with the lab of Dr Harmen Bussemaker, a professor in Columbia’s Department of Biological Sciences and Systems Biology. Together, the two labs developed a genetic sequencing method called SELEX-seq, which can systematically characterise all Hox binding sites.

However, the approach had limitations: the same DNA fragment had to be sequenced over and over again, which was akin to running the same paragraph through a search engine’s translation tool multiple times and still getting only 10% of the words translated accurately.

To address this problem, Bussemaker and his team developed the new computer algorithm that is able to explain, for the first time, the behaviour of all DNA sequences in the SELEX-seq experiment. They called this algorithm No Read Left Behind (NRLB).
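The core idea behind an approach like NRLB, fitting a quantitative binding model to every sequencing read so that even weak, low-affinity sites contribute signal instead of being discarded, can be illustrated with a toy sketch. The energy values, the motif, and the function name below are hypothetical illustrations, not taken from the published algorithm:

```python
import math

# Hypothetical position-specific binding-energy matrix for a 4-base motif.
# Lower energy means tighter binding; the optimal site here is "ACGT".
PWM = [
    {"A": 0.0, "C": 1.2, "G": 1.5, "T": 0.8},
    {"A": 1.1, "C": 0.0, "G": 1.3, "T": 0.9},
    {"A": 1.4, "C": 1.0, "G": 0.0, "T": 1.2},
    {"A": 0.7, "C": 1.1, "G": 1.3, "T": 0.0},
]

def binding_affinity(seq):
    """Relative affinity of the best binding site in `seq`.

    Every position in the read is scored, so weak (low-affinity)
    sites still register a nonzero affinity instead of being
    thrown away -- the principle of leaving no read behind.
    """
    k = len(PWM)
    best_energy = min(
        sum(PWM[i][seq[p + i]] for i in range(k))
        for p in range(len(seq) - k + 1)
    )
    # Boltzmann-style weight: optimal site scores 1.0,
    # weaker sites score between 0 and 1.
    return math.exp(-best_energy)

strong = binding_affinity("AAACGTAA")  # contains the optimal site
weak = binding_affinity("TTTTTTTT")    # only low-affinity sites
```

Real methods fit such models statistically over millions of reads; this sketch only shows why scoring every sequence, rather than keeping only strongly enriched ones, recovers the faint low-affinity signal Mann describes.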

Bussemaker said: “In simple terms, NRLB allows us to cover the entire spectrum of binding sites, from the highest to the lowest affinity, with a much greater degree of sensitivity and accuracy than any existing method, including state-of-the-art deep learning algorithms.

“Building on that foundation, we now hope to develop more in-depth biological and computational models to help answer the most complicated questions about the genome.”