A review on protein language models

Protein language models apply techniques from natural language processing to biology, with transformers now being trained directly on protein data. Protein sequences, built from an alphabet of 20 amino acids, dictate the structure and function of proteins, much as words form sentences. Encoder models such as TCR-BERT and ProtTrans generate protein embeddings for a range of downstream tasks. Inverse folding models such as ESM-IF predict sequences from protein structures, paving the way for efficient protein design. Recent work such as ProGen has produced functional designed proteins that rival natural ones. Larger models such as ESM-2 are pushing the boundaries of protein structure prediction, hinting at exciting possibilities in protein science.
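As a concrete illustration of the embedding use case, here is a minimal sketch of extracting a per-protein embedding from ESM-2, assuming the Hugging Face transformers library and the publicly released facebook/esm2_t6_8M_UR50D checkpoint; the sequence is an arbitrary example, not from the review.

```python
# Minimal sketch: per-protein embedding from ESM-2 via Hugging Face
# transformers, using the smallest public checkpoint (an assumption,
# not the review's setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over residue positions, skipping the <cls> and <eos> tokens
# the tokenizer adds, to get one fixed-length vector per protein.
embedding = outputs.last_hidden_state[0, 1:-1].mean(dim=0)
print(embedding.shape)  # hidden size of the model (320 for this checkpoint)
```

Such a pooled vector can then feed a simple classifier or regressor for the kinds of downstream tasks the review describes.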

https://www.apoorva-srinivasan.com/plms/
