Proteins are natural molecules that perform critical cellular functions in the body and play a part in virtually every disease. Characterizing proteins can reveal mechanisms of disease, including ways to slow or reverse it, while creating proteins can lead to the development of entirely new drugs and therapies.
However, the current process of designing proteins in the laboratory is expensive in terms of both computation and human labor. It requires coming up with a protein structure that performs a specific task in the body, and then finding a protein sequence (the sequence of amino acids that make up a protein) that is likely to "fold" into that structure. (Proteins must fold correctly into a three-dimensional shape in order to perform their intended function.)
It doesn’t have to be this complicated.
This week, Microsoft launched EvoDiff, a general framework that the company claims can generate "high-fidelity" and "diverse" proteins given a protein sequence. Unlike other protein generation frameworks, EvoDiff does not require any structural information about the target protein, eliminating what is usually the most laborious step.
Kevin Yang, a senior researcher at Microsoft, said that after EvoDiff is open sourced, it can be used to create enzymes for new treatments and drug delivery methods, as well as new enzymes for industrial chemical reactions.
"Our vision is that EvoDiff will expand the capabilities of protein engineering beyond the structure-function paradigm toward programmable, sequence-first design," Yang, one of EvoDiff's co-creators, told TechCrunch in an email interview. "With EvoDiff, we demonstrated that we may not actually need structure, but rather 'the protein sequence is all you need', to controllably design new proteins."
At the heart of the EvoDiff framework is a 640-million-parameter model trained on data from all different species and functional classes of proteins. (Parameters are what the AI model learns from the training data and essentially define the model's skill at handling the problem -- in this case, generating proteins.) The data for training the model comes from the OpenFold dataset of sequence alignments and from UniRef50, a subset of the UniProt dataset, a database of protein sequence and functional information maintained by the UniProt consortium.
EvoDiff is a diffusion model whose architecture is similar to that of many modern image generation models such as Stable Diffusion and DALL-E 2. EvoDiff learns to gradually subtract noise from a starting protein that consists almost entirely of noise, moving it slowly, step by step, closer to a protein sequence.
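The step-by-step denoising described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not EvoDiff's actual method or code: it assumes "noise" means masked positions, and `toy_denoiser` is a hypothetical stand-in that picks residues at random where a trained model would predict them from the rest of the sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # placeholder for a position that is still "noise"


def toy_denoiser(sequence, position, rng):
    """Hypothetical stand-in for a trained network.

    A real model would predict the residue from the partially
    denoised sequence; this one simply picks a residue at random.
    """
    return rng.choice(AMINO_ACIDS)


def generate(length, seed=0):
    """Start from an all-noise sequence and resolve one position per step."""
    rng = random.Random(seed)
    sequence = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)  # assumption: positions are denoised in random order
    for position in order:
        sequence[position] = toy_denoiser(sequence, position, rng)
    return "".join(sequence)


protein = generate(30)
```

Each pass through the loop removes a little more noise, so the output only becomes a complete sequence at the final step. With a random stand-in the result is of course not a meaningful protein; the point is the shape of the loop.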
Diffusion models have been increasingly used in areas beyond image generation, from designing novel proteins (such as EvoDiff), to composing music, and even synthesizing speech.
"If there's one takeaway [from EvoDiff], I think it's that we can - and should - generate proteins from sequence because we enable versatility, scale and modularity," Ava Amini, another EvoDiff co-contributor and a senior researcher at Microsoft, said via email. "Our diffusion framework gives us the ability to do this and also allows us to control how these proteins are designed to achieve specific functional goals."
To Amini's point, EvoDiff not only creates new proteins but also fills "gaps" in existing protein designs. For example, if a certain part of a protein binds to another protein, the model can generate a sequence of the protein's amino acids around that part that meets a series of criteria.
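Gap filling of this kind can be pictured as generation with some positions held fixed. The sketch below is illustrative only and does not reflect EvoDiff's interface: the template string is made up, `toy_infill` is a hypothetical helper, and the random choice stands in for a model that would condition on the fixed residues.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # marks a gap the generator is allowed to fill


def toy_infill(template, seed=0):
    """Fill every MASK position in `template`, leaving fixed residues untouched."""
    rng = random.Random(seed)
    sequence = list(template)
    open_positions = [i for i, residue in enumerate(sequence) if residue == MASK]
    rng.shuffle(open_positions)
    for position in open_positions:
        # Hypothetical stand-in for a model conditioned on the fixed residues.
        sequence[position] = rng.choice(AMINO_ACIDS)
    return "".join(sequence)


# Made-up example: a fixed 10-residue segment flanked by two 5-residue gaps.
template = "#####HEAGAWGHEE#####"
designed = toy_infill(template)
```

The fixed segment passes through unchanged while the surrounding positions are generated, which mirrors the idea of designing a sequence around a part of the protein that must be preserved.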
Because EvoDiff designs proteins in "sequence space" rather than structure space, it can also synthesize "disordered proteins" that never fold into a final three-dimensional structure. Like normally functioning proteins, disordered proteins play important roles in biology and disease, such as enhancing or reducing the activity of other proteins.
It's important to point out that the research behind EvoDiff hasn't been peer-reviewed -- at least not yet. Sarah Alamdari, a Microsoft data scientist involved in the project, admitted that "there is still a lot of scaling work to be done" before the framework can be put into commercial use.
"This is just a 640 million-parameter model, and if we scaled it up to billions of parameters, we might see an improvement in the quality of the generation," Alamdari said via email. "While we demonstrated some coarse-grained strategies, to achieve finer control, we would like EvoDiff to be conditioned on text, chemical information, or other means to specify the desired features."
Next, the EvoDiff team plans to test the proteins the model generates in the lab to see whether they work. If they do, the team will start work on the next generation of the framework.