Proteins perform numerous functions in living organisms. They take part in repairing, building, catalyzing, and signaling, and overall they maintain the proper functioning and life of cells. Advancements in genome sequencing have resulted in an abundance of protein sequences (Durairaj et al. 2023). For example, UniProtKB 2024_03 contains around 245.5 million sequence records. However, functional characterization lags behind, with only a subset of proteins being annotated manually or through automated pipelines (Vu and Jung 2021). In the example above, only around 570 thousand sequences are manually reviewed and annotated (Swiss-Prot), while the rest (almost 245 million sequences) sit in the unreviewed TrEMBL section. This means that curated functional knowledge covers only about 0.2% of all protein sequences in the database! Many of the remaining sequences, including those with unknown functions or from undiscovered protein families, still lack a reliable annotation.
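To make the 0.2% figure concrete, here is a small sketch of the arithmetic. It uses only the rounded record counts quoted above; the variable names are ours.

```python
# Rounded record counts quoted above for UniProtKB release 2024_03.
total_records = 245_500_000     # all UniProtKB sequence records
reviewed_records = 570_000      # manually reviewed records (Swiss-Prot)

# Everything that is not reviewed falls into the unreviewed section (TrEMBL).
unreviewed_records = total_records - reviewed_records
reviewed_fraction = reviewed_records / total_records

print(f"Unreviewed records: {unreviewed_records:,}")   # 244,930,000
print(f"Reviewed share: {reviewed_fraction:.2%}")       # 0.23%
```

Because the inputs are rounded, the result is only good to about one significant digit, which is why the text says "about 0.2%".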
Tools for protein function prediction and annotation are essential for several reasons:
Moreover, these tools aid in comparative genomics, evolutionary studies, and understanding the mechanisms underlying diseases, boosting molecular biology, genetics, drug discovery, and biotechnology (Idhaya, Suruliandi, and Raja 2024; de Crécy-Lagard et al. 2022). Anyone involved in biological research, from academic scientists to pharmaceutical companies, can benefit from using these tools.
ProtNLM (Protein Natural Language Model) is integrated into UniProt’s Automatic Annotation pipeline, where it classifies and annotates unreviewed records in UniProtKB by predicting protein names from amino acid sequences. You can find ProtNLM’s predictions on the UniProt website under “Names & Taxonomy” → “Protein names” → “Recommended name” (see, for instance, A0A2Z4IEP2, UniProt’s own example). Overall, ProtNLM has already annotated around 49 million previously unannotated protein sequences in the UniProt database.
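If you would rather read the recommended name programmatically than click through the website, something like the sketch below may work. Treat it as an assumption-laden illustration: the endpoint `https://rest.uniprot.org/uniprotkb/<accession>.json` and the `proteinDescription` / `recommendedName` / `fullName` / `value` field path reflect our understanding of the UniProt REST API and should be checked against its documentation. The helper names `recommended_name` and `fetch_entry` are ours, not part of any UniProt library.

```python
import json
import urllib.request

# Assumed UniProt REST endpoint for a single UniProtKB entry in JSON format.
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def recommended_name(entry):
    """Return the recommended full name from a parsed entry, or None if absent.

    The field path used here is an assumption about the JSON layout.
    """
    description = entry.get("proteinDescription", {})
    full_name = description.get("recommendedName", {}).get("fullName", {})
    return full_name.get("value")


def fetch_entry(accession):
    """Download one UniProtKB entry and parse it into a dict (needs network)."""
    url = UNIPROT_ENTRY_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


# Usage (requires internet access), with the accession mentioned above:
# entry = fetch_entry("A0A2Z4IEP2")
# print(recommended_name(entry))
```

The parsing step is kept separate from the download so that it can be tried on a saved JSON file without touching the network.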
ProtNLM was developed by Google Research in 2023. It uses a transformer sequence-to-sequence model to generate textual descriptions for proteins, much like assigning titles to images or documents. Others have typically treated this task as a classification problem (with a fixed set of possible outputs) rather than a captioning problem (where new annotations are possible), or have started from structure (e.g. DeepFRI) rather than from sequence alone. ProtNLM is still a work in progress, as stated at the very beginning of the preprint. The main challenges are the ambiguity that arises when a single protein can carry multiple names and the difficulty of verifying proposed descriptions without external evidence. The model is trained on UniProt data that has been filtered for quality, and it is validated through both automated and manual evaluation. Recent improvements include leveraging additional information, such as organism and secondary structure, and employing ensemble models for better performance. Acknowledging the potential for errors, UniProt encourages user feedback to support continuous improvement and accuracy assessment.
We tried ProtNLM in practice with three naturally occurring sequences (insulin, hemoglobin, and collagen) and three completely random ones (the corresponding protein for each sequence is written in gold). The snapshots show the ProtNLM output as returned by the 310.ai copilot. Each result consists of a predicted name (an annotation for the protein sequence) and a score (the confidence in the prediction). The higher the score, the more likely the annotation is to be accurate. You can see that the scores for the random sequences are much lower than those for the existing proteins. You can also notice that the predicted names are very close to each other for the natural sequences but vary quite a lot for the random ones. Taken together, the two findings suggest that a high score combined with many similar predictions in the top 10 reflects a more confident result. By the way, the cute robot icon was found on flaticon.com.
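The two signals described above, the top score and the agreement among the top predictions, can be summarized with a few lines of code. This is a minimal sketch of our own, not part of ProtNLM or the 310.ai copilot. It assumes the predictions are available as (name, score) pairs, and it counts agreement by exact name match after lowercasing, which is a simplification of "very close to each other". The example values are made-up placeholders, not real ProtNLM output.

```python
from collections import Counter


def summarize_predictions(predictions):
    """Summarize a list of (name, score) pairs, e.g. a top-10 list.

    Returns the best-scoring name, its score, the most frequent name,
    and the share of predictions that agree with that most frequent name.
    """
    if not predictions:
        raise ValueError("no predictions given")
    # Normalize case and whitespace so trivial differences do not split votes.
    names = [" ".join(name.lower().split()) for name, _ in predictions]
    top_name, top_score = max(predictions, key=lambda pair: pair[1])
    consensus_name, count = Counter(names).most_common(1)[0]
    return {
        "top_name": top_name,
        "top_score": top_score,
        "consensus_name": consensus_name,
        "agreement": count / len(names),
    }


# Made-up placeholder values for illustration only (NOT real ProtNLM output).
natural_like = [("Protein A", 0.9), ("Protein A", 0.8),
                ("protein a", 0.7), ("Protein A-like", 0.2)]
random_like = [("Name W", 0.05), ("Name X", 0.04),
               ("Name Y", 0.03), ("Name Z", 0.02)]

natural_summary = summarize_predictions(natural_like)
random_summary = summarize_predictions(random_like)
print(natural_summary["top_score"], natural_summary["agreement"])  # 0.9 0.75
print(random_summary["top_score"], random_summary["agreement"])    # 0.05 0.25
```

A high top score together with a high agreement value corresponds to the pattern we saw for the natural sequences, while low values for both correspond to the pattern for the random ones. A fuzzier name comparison would be needed to treat near-identical names as the same prediction.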
You can try the copilot to annotate your own sequences. It’s easy, fast, and convenient thanks to the web-based, chat-style interface. You can also find a tutorial on how to use it on YouTube.