Can an AI predict the language of viral mutation?

Viruses lead a rather repetitive existence. They enter a cell, hijack its machinery to turn it into a viral copy machine, and those copies head on to other cells, armed with instructions to do the same. So it goes, time and time again. But sometimes things get mixed up amid all this copying and pasting. Mutations arise in the copies. Sometimes a mutation means that an amino acid is not made and a vital protein does not fold, so that version of the virus goes into the garbage can of evolutionary history. Sometimes the mutation does nothing at all, because different sequences that encode the same proteins make up for the error. But every once in a while, a mutation works out perfectly. The changes do not affect the virus’s ability to exist; instead, they bring about a useful change, such as making the virus unrecognizable to a person’s immune system. If that allows the virus to evade antibodies generated by previous infections or by a vaccine, that mutant variant is said to have ‘escaped’.

Scientists are always on the lookout for signs of possible escape. That’s true of SARS-CoV-2, as new strains emerge and scientists investigate what the genetic changes could mean for a long-lasting vaccine. (So far, everything looks fine.) It’s also what confounds researchers studying flu and HIV, which routinely evade our immune defenses. So in an effort to see what might come next, researchers create hypothetical mutants in the lab and test whether they can evade antibodies taken from recent patients or vaccine recipients. But the genetic code offers too many possibilities to test every evolutionary branch the virus might take over time. It is a matter of keeping up.

Last winter, Brian Hie, a computational biologist at MIT and a fan of John Donne’s lyric poetry, was thinking about this problem when he hit on an analogy: What if we thought of viral sequences the way we think of written language? Every viral sequence has a sort of grammar, he reasoned, a set of rules it has to follow in order to be that particular virus. When mutations violate that grammar, the virus reaches an evolutionary dead end. In virological terms, it lacks ‘fitness’. Also like language, from the immune system’s perspective, a sequence can be said to have a kind of semantics. There are some sequences the immune system can interpret, and thus stop the virus with antibodies and other defenses, and some it cannot. So a viral escape can be thought of as a change that preserves the sequence’s grammar but changes its meaning.

The analogy had a simple, almost too simple, elegance. But for Hie it was also practical. In recent years, AI systems have become very good at modeling the principles of grammar and semantics in human language. They do this by training on datasets of billions of words, arranged in sentences and paragraphs, from which the system derives patterns. In this way, without being told any specific rules, the system learns where the commas should go and how a clause should be structured. It can also be said to intuit the meaning of certain strings, words and phrases, from the many contexts in which they occur in the dataset. It’s patterns, all the way down. That’s how the most advanced language models, such as OpenAI’s GPT-3, can learn to produce perfectly grammatical prose that manages to stay reasonably on topic.

An advantage of this idea is that it is generalizable. For a machine learning model, a sequence is a sequence, whether arranged in sonnets or amino acids. According to Jeremy Howard, an AI researcher at the University of San Francisco and a language model expert, applying such models to biological sequences can be beneficial. With enough data from, for example, genetic sequences of viruses known to be infectious, the model will implicitly learn something about how infectious viruses are structured. “That model will contain a lot of advanced and complex knowledge,” he says. Hie knew this was the case. His graduate advisor, computer scientist Bonnie Berger, had previously done similar work with another of her lab members, using AI to predict protein folding patterns.
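
To make the "a sequence is a sequence" idea concrete, here is a minimal sketch of a model picking up a kind of grammar from examples alone. It is a toy, not the method used by Hie or by GPT-3: a simple bigram counter stands in for a neural language model, and the training sequences and the names `train_bigram` and `log_likelihood` are invented for illustration.

```python
import math
from collections import defaultdict

# One-letter codes for the 20 standard amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def train_bigram(sequences):
    """Count how often each residue is followed by each other residue."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def log_likelihood(counts, seq):
    """Sum of log P(next | prev), with add-one smoothing over ALPHABET."""
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        row = counts[prev]
        denom = sum(row.values()) + len(ALPHABET)
        total += math.log((row[nxt] + 1) / denom)
    return total

# Hypothetical training data: made-up protein fragments, not real viral sequences.
TRAINING = [
    "MKVLAAGMKVLAAG",
    "MKVLSAGMKVLAAG",
    "MKVLAAGMKILAAG",
]

model = train_bigram(TRAINING)

wild_type = "MKVLAAG"
mutant_seen = "MKILAAG"    # V -> I, a substitution that appears in the training data
mutant_unseen = "MKWLAAG"  # V -> W, a substitution the model has never observed

scores = {
    name: log_likelihood(model, seq)
    for name, seq in [
        ("wild_type", wild_type),
        ("mutant_seen", mutant_seen),
        ("mutant_unseen", mutant_unseen),
    ]
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The model is never told which substitutions are allowed. It simply scores the familiar variant as more plausible than the unfamiliar one, a crude stand-in for the grammar in the analogy. Capturing the semantics side, whether the immune system would still recognize the sequence, would take a richer model than this sketch.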
