Last month we mentioned the challenges of studying blue whales in a laboratory setting. If we want to know what a particular gene or protein does in an animal that large, we often have to rely on inference based on what a comparable gene or protein does in an animal we can study more readily, like a mouse. And how do we know which protein is comparable, or homologous, to use the more technical term? Typically the search for homologous proteins starts with a sequence similarity scan, a check of an entire library of proteins from various species to see which have a large fraction of the same amino acids at the same locations. That’s straightforward enough when the match is at the level of 80%, but at just 20% similarity it’s a much bigger challenge. Or at least it was, until now.
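To make that notion of sequence similarity concrete, here is a minimal sketch (purely illustrative, not from the paper or any particular tool) of how percent identity might be computed for two protein sequences that have already been aligned to the same length. Real scanning tools also handle alignment, gaps, and scoring matrices, which this toy version ignores.

```python
# Toy illustration: percent identity between two pre-aligned protein sequences.
# Real homology scans also handle alignment, gaps, and substitution scoring;
# this only counts matching positions in sequences of equal length.

def percent_identity(seq_a: str, seq_b: str) -> float:
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Two short made-up peptide fragments (one-letter amino acid codes)
print(percent_identity("MKTLLVAGAV", "MKSLLVAGAL"))  # 80.0
```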
Shortly after two species have diverged, most of their proteins will be identical in sequence. As time passes, one amino acid can be substituted for another without a substantive change in function or structure. Some amino acids are very similar to each other chemically. For example, several of them carry a net positive or negative electrical charge in water at physiological pH. Oppositely charged amino acids attract, and these electrical attractions can help give a protein its shape. For that purpose, one negatively charged amino acid can be as good as the next. Similarities in size or affinity for water molecules provide additional opportunities for substitutions with minimal impact. After many generations, this creates something like a protein of Theseus conundrum: how many amino acids can be replaced before it’s no longer the same protein?
In practice, there isn’t a universal numerical answer to that question. Just one or two different amino acids out of hundreds can render a functional protein inactive. On the other hand, there are examples where ~80% of amino acids have changed yet the structure and function remain similar enough to be considered equivalent. So what we really want to know is how similar the structure and function of two proteins are. But figuring out the function of a protein is what got us started down this path in the first place! If we could readily determine that directly, we would be less concerned about sequence differences. Likewise, determining the structure of a protein is labor-intensive and does not lend itself to high-throughput pipelines, at least not yet. But sequences of genes, and by extension proteins, are available in the hundreds of millions.
So we have to infer something about structure and function similarity from sequence similarity while taking into account that they are not linearly related. Presumably that requires understanding the relationships between amino acids. There are 20 that occur commonly, so a simple guess might be that we need to learn 20×20 pairwise relationships. But as you might guess, context matters. And the relevant context might not be in the immediate vicinity within the linear sequence. Think of garden path sentences like the popular “The horse raced past the barn fell.” For most readers, the correct reading of ‘raced’ doesn’t become clear until you get to the ‘fell’ at the end. Likewise, the ‘meaning’ of an arginine might depend on a threonine 40 positions away in the sequence. So instead of a two-dimensional 20×20 space, we are talking about a much higher number of dimensions and a far greater number of possible relationships. Practically speaking, that is a challenge both to learn and to employ in a search.
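Here is a toy contrast between those two views (again just an illustration of the idea, not anyone’s actual model): a context-free table scores amino acid pairs the same way everywhere, while a context-dependent representation lets the very same residue mean different things in different surroundings.

```python
# Toy contrast: context-free pairwise scores vs. context-dependent views.

# Context-free view: one score per amino acid pair, 20 x 20 entries in all.
# These particular numbers are placeholders; real substitution matrices such
# as BLOSUM62 are derived from observed alignments.
pair_score = {("D", "E"): 2, ("D", "K"): -1, ("R", "T"): -1}

# Context-dependent view: the same residue is represented differently
# depending on its neighbors, crudely simulated here by carrying along a
# window of surrounding residues. Protein language models learn far richer
# versions of this, with context reaching much farther along the chain.
def residue_with_context(seq: str, i: int, window: int = 2) -> str:
    return seq[max(0, i - window): i + window + 1]

seq1 = "AAARGGG"
seq2 = "TTTRCCC"
print(seq1[3], seq2[3])                  # R R -- identical on their own
print(residue_with_context(seq1, 3))     # AARGG
print(residue_with_context(seq2, 3))     # TTRCC
```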
Enter GPT-3, ChatGPT and their technological cousins. They have to solve a similar challenge to process language and work out whether the horse or the barn fell. A key breakthrough was identifying a compression scheme that could preserve the essential details from that high-dimensional space and represent them using fewer dimensions. The same techniques could be applied to protein sequence data. And that’s exactly what Mesih Kilinc, Kejue Jia and Robert L. Jernigan did, as they report in a paper published a couple of weeks ago. Using the large language model approach, they were able to implement a search procedure that ran very efficiently (i.e., in seconds) to find homologous proteins with just 20% sequence similarity from a database with millions of sequences. Essentially what the compression scheme must be doing is taking the amino acid identity information and extracting the properties, like size or charge, that are relevant in different contexts. This allows the search to find proteins with similar properties even when the amino acid identities are not the same, and thus to pick out the proteins which are actually homologous from all the sequences that share just 20% amino acid identity.
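To give a flavor of how an embedding-based search like this can run in seconds, here is a generic sketch. It is not the authors’ code, and the embeddings are random stand-ins for whatever a protein language model would actually produce; the point is that once every sequence is compressed into a fixed-length vector, finding candidate homologs reduces to fast linear algebra rather than position-by-position sequence comparison.

```python
import numpy as np

# Generic sketch of embedding-based homolog search (not the authors' method).
# Assume a protein language model has already mapped every database sequence
# to a fixed-length vector; random vectors stand in for those embeddings here.
rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 256)).astype(np.float32)   # toy database
database /= np.linalg.norm(database, axis=1, keepdims=True)     # unit length

def nearest_homolog_candidates(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k database proteins whose embeddings are most
    similar to the query, using cosine similarity (a dot product after
    normalization). This dense linear algebra step is what stays fast even
    when the database holds millions of sequences."""
    q = query_vec / np.linalg.norm(query_vec)
    similarities = database @ q
    return np.argsort(similarities)[-k:][::-1]

query = rng.normal(size=256).astype(np.float32)  # embedding of a new protein
print(nearest_homolog_candidates(query))
```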
As the paper notes, other groups are also applying the large language model approach to our vast biological sequence repositories to answer other useful questions. For example, it is not an easy task to infer the three dimensional structure of a protein from its linear sequence of amino acids. These new techniques have demonstrated some facility with this challenge also, which makes sense since it is a related problem. Just as the full potential of GPT-3 and other language models has yet to be realized when it comes to human languages, there are likely plenty of other applications in biology that will be discovered in the weeks and years to come.
While we wait for that to happen (and maybe some of you are not merely waiting but actually working toward that end), I’d like to reflect for a moment on why this is even a challenge in the first place. We might imagine a world in which protein structure and function are not so forgiving. It could be the case that there was only one sequence that could achieve a given structure, or only a very small number. It could also be the case that there were many such sequences, but the number of differences between any two of them was so big that you couldn’t incrementally go from one to another. Yet the world we live in is apparently one where numerous solutions to the same problem exist, and they are close enough that it is possible to get from one to the next incrementally. I choose to see this as a form of grace, a way to allow for significant diversity and individuality without a cost to flourishing.
Andy has worn many hats in his life. He knows this is a dreadfully clichéd notion, but since it is also literally true he uses it anyway. Among his current metaphorical hats: husband of one wife, father of two teenagers, reader of science fiction and science fact, enthusiast of contemporary symphonic music, and chief science officer. Previous metaphorical hats include: comp bio postdoc, molecular biology grad student, InterVarsity chapter president (that one came with a literal hat), music store clerk, house painter, and mosquito trapper. Among his more unique literal hats: British bobby, captain’s hats (of varying levels of authenticity) of several specific vessels, a deerstalker from 221B Baker St, and a railroad engineer’s cap. His monthly Science in Review is drawn from his weekly Science Corner posts — Wednesdays, 8am (Eastern) on the Emerging Scholars Network Blog. His book Faith across the Multiverse is available from Hendrickson.