Machine learning helps predict protein functions
Proteins are an integral part of all living organisms. They are made up of a sequence of building blocks called amino acids. This sequence determines their function, which can range from defining cell structure to regulating metabolism. Scientists can modify a protein sequence and experimentally test if and how this change alters its function. However, there are too many possible changes in the amino acid sequence to test them all in the laboratory. Instead, researchers build very complex computer models that predict the function of proteins based on their amino acid sequence. This is essential for engineering proteins with new functions. Scientists have now combined several machine learning approaches to create a simple predictive model that often performs better than established complex methods.
Natural proteins perform many crucial functions in sustaining life. But scientists can also engineer natural proteins for desired purposes such as gene editing and the synthesis of valuable chemicals. This new combined modeling approach to predict protein function will aid in the design and engineering of novel proteins. This approach will allow scientists to easily redesign proteins for a wide range of applications such as new enzymes to convert plant matter into biofuels or bioproducts or to create new biomaterials.
Scientists have several approaches to predicting the functional properties of a given protein that use the amino acid sequence of the protein to build a computer model. Scientists create such models using both classical statistical methods and modern machine learning computational approaches. One such statistical method, called regression analysis, relates a given amino acid sequence to an experimentally measured functional property of a protein. To increase the amount of data available to make functional predictions for a protein, researchers include evolutionarily related protein sequences as additional input. In general, these evolutionarily related proteins are likely to share the property of the protein of interest, although often without direct experimental evidence. The researchers use a machine learning modeling approach based on the statistical properties of these sequences. In the study presented here, the researchers combined regression analysis and evolutionary data to come up with a simple and effective machine learning approach. Researchers have found that this simple combination approach is competitive with, and often outperforms, more sophisticated methods.
Partial support was provided by the Department of Energy Office of Science, Office of Biological and Environmental Research, Genomic Science Program, Lawrence Livermore National Laboratory’s Secure Biosystems Design Scientific Focus Area, Chan Zuckerberg Investigator Program, and C3. have. This material is also based on work supported by the National Library of Medicine of the National Institutes of Health and the National Science Foundation Graduate Research Fellowship Program.