Genome modeling and design across all domains of life with Evo 2
Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y. Ng, Jaspreet Pannu, Christopher Ré, Jonathan C. Schmok, John St. John, Jeremy Sullivan, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, Anthony B. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Thomas McGrath, Kimberly Powell, Dave P. Burke, Hani Goodarzi, Patrick D. Hsu, Brian L. Hie
bioRxiv, 2025
Evo 2 is a 40B-parameter genomic foundation model that predicts the functional impact of genetic variation, autonomously learns biological features, and generates novel genomic sequences across all domains of life.
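As an illustration of how a genomic language model can score variants, here is a minimal sketch of likelihood-ratio scoring: compare the log-likelihood of the variant sequence to the reference. The `GenomicLM` class is a toy stand-in, not the Evo 2 architecture or API.

```python
# Minimal sketch of likelihood-based variant effect scoring; GenomicLM is
# a hypothetical stand-in model, not Evo 2.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

class GenomicLM(nn.Module):
    """Toy autoregressive DNA model (placeholder for a real genomic LM)."""
    def __init__(self, vocab_size=4, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                     # tokens: [B, L]
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)                        # logits: [B, L, V]

def sequence_log_likelihood(model, seq: str) -> float:
    tokens = torch.tensor([[VOCAB[b] for b in seq]])
    logits = model(tokens[:, :-1])                 # predict each next token
    logp = torch.log_softmax(logits, dim=-1)
    target = tokens[:, 1:]
    return logp.gather(-1, target.unsqueeze(-1)).sum().item()

# Variant effect score: log-likelihood ratio of variant vs. reference.
model = GenomicLM().eval()
ref, alt = "ACGTACGGTACT", "ACGTACAGTACT"
with torch.no_grad():
    delta = sequence_log_likelihood(model, alt) - sequence_log_likelihood(model, ref)
print(f"variant score (more negative suggests more deleterious): {delta:.3f}")
```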
All-Atom Protein Generation with Latent Diffusion
Amy X. Lu, Wilson Yan, Sarah A. Robinson, Simon Kelow, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Richard Bonneau, Pieter Abbeel, Nathan Frey
bioRxiv, 2024
PLAID is a multimodal protein generation model that generates all-atom protein structures from function and organism prompts, yet requires only sequence data for training.
Protein Language Model Fitness Is a Matter of Preference
Cade Gordon, Amy X. Lu, Pieter Abbeel
International Conference on Learning Representations (ICLR), 2025
Using a one-pass pseudolikelihood algorithm and influence functions, we find that protein language models capture artifacts of training data selection rather than the true fitness landscape.
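A minimal sketch of the one-pass idea: rather than masking each position in turn (which costs L forward passes), read per-position log-probabilities from a single unmasked pass. The `mlm` stand-in and shapes below are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of one-pass pseudolikelihood scoring with a masked protein LM.
import torch

def pseudo_log_likelihood(mlm, tokens: torch.Tensor) -> float:
    """Score a sequence with a single forward pass.

    Naive pseudolikelihood masks each position in turn (L passes); the
    one-pass variant reads the model's marginals at every position from
    one unmasked pass.
    """
    with torch.no_grad():
        logits = mlm(tokens.unsqueeze(0))[0]       # [L, vocab]
    logp = torch.log_softmax(logits, dim=-1)
    return logp[torch.arange(len(tokens)), tokens].sum().item()

# Toy stand-in model and a length-8 "protein" over a 20-letter vocabulary.
mlm = lambda x: torch.randn(x.shape[0], x.shape[1], 20)
seq = torch.randint(0, 20, (8,))
print(pseudo_log_likelihood(mlm, seq))
```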
Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
Amy X. Lu, Wilson Yan, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, Nathan Frey
bioRxiv, 2024
CHEAP is a joint embedding of protein sequence and structure that can be obtained from sequence alone, and it yields insights into the compressibility, tokenizability, and mechanistic interpretability of protein folding models.
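One way to probe compressibility is to train a bottleneck autoencoder over model embeddings and watch reconstruction error; the sketch below uses a linear bottleneck and random stand-in embeddings, with dimensions that are illustrative assumptions, not CHEAP's actual architecture.

```python
# Minimal sketch of probing embedding compressibility with a linear
# bottleneck autoencoder; sizes and data are hypothetical.
import torch
import torch.nn as nn

d_model, d_bottleneck = 1024, 64                  # hypothetical sizes
encoder = nn.Linear(d_model, d_bottleneck)
decoder = nn.Linear(d_bottleneck, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

emb = torch.randn(256, d_model)                   # stand-in pLM embeddings
for step in range(100):
    recon = decoder(encoder(emb))
    loss = nn.functional.mse_loss(recon, emb)     # low loss => compressible
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"reconstruction MSE at {d_bottleneck}/{d_model} dims: {loss.item():.4f}")
```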
Self-Supervised Contrastive Learning of Protein Representations by Mutual Information Maximization
Amy X. Lu, Haoran Zhang, Marzyeh Ghassemi, Alan Moses
Machine Learning for Computational Biology (MLCB), 2020
CPCProt uses contrastive learning to produce parameter-efficient protein embeddings that perform competitively with large language models.
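The standard objective behind contrastive mutual information maximization is InfoNCE; here is a minimal PyTorch sketch, with batch and embedding sizes that are illustrative rather than CPCProt's actual configuration.

```python
# Sketch of the InfoNCE objective used for contrastive MI maximization.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.1):
    """queries[i] and keys[i] are embeddings of matched (positive) views;
    all other pairs in the batch serve as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature               # [B, B] similarity matrix
    labels = torch.arange(len(q))                  # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```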
TOPH: Adapting A Contrastive Question-Answering Framework for Protein Search
Ron Boger*, Amy X. Lu*, Seyone Chithrananda*, Kevin Yang, Petr Skopintsev, Ben Adler, Eric Wallace, Peter Yoon, Pieter Abbeel, Jennifer Doudna
ICML Workshop on Computational Biology, 2023
We present a protein semantic similarity search method for RNA-guided endonuclease discovery, inspired by dense retrieval methods in open-domain question answering, and introduce a new dataset of CRISPR-Cas and evolutionarily related nucleases.
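A minimal sketch of the dense-retrieval pattern this adapts: embed a query protein and rank a database by cosine similarity. The embeddings below are random stand-ins for a real encoder's output, not TOPH's model.

```python
# Sketch of dense-retrieval-style protein search over embeddings.
import numpy as np

def search(query_emb: np.ndarray, db_embs: np.ndarray, top_k: int = 5):
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = db @ q                                # cosine similarity
    idx = np.argsort(-scores)[:top_k]              # highest-scoring hits
    return idx, scores[idx]

db = np.random.randn(10_000, 256)                  # stand-in nuclease embeddings
hits, scores = search(np.random.randn(256), db)
print(hits, scores)
```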
Pretraining strategies for effective promoter-driven gene expression prediction
Aniketh Janardhan Reddy, Michael H. Herschl, Sathvik Kolli, Amy X. Lu, Xinyang Geng, Aviral Kumar, Patrick D. Hsu, Sergey Levine, Nilah M. Ioannidis
bioRxiv, 2023
Pretraining and transfer learning strategies for improving model-based design of promoters for cell-type-specific expression.
Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics
Sathvik Kolli, Amy X. Lu, Xinyang Geng, Aviral Kumar, Sergey Levine
ICLR Workshop on Machine Learning for Drug Discovery (MLDD), 2022
Strategies for data curation, model training, optimization, and evaluation heuristics for data-driven design of de novo proteins.
Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning
Alex X Lu, Amy X. Lu, Iva Pritišanac, Taraneh Zarin, Julie D Forman-Kay, Alan M Moses
PLOS Computational Biology, 2022
Reverse Homology is a self-supervised method that uses contrastive learning to capture evolutionary information and discover molecular features of intrinsically disordered regions.
Learned embeddings from deep learning to visualize and predict protein sets
Christian Dallago, Konstantin Schütze, Michael Heinzinger, Tobias Olenyi, Maria Littmann, Amy X. Lu, Kevin K Yang, Seonwoo Min, Sungroh Yoon, James T Morton, Burkhard Rost
Current Protocols, 2021
Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning
Amy X. Lu, Alex X. Lu, Alan Moses
Machine Learning for Computational Biology (MLCB), 2020
We outline how viewing evolution as natural sequence augmentation for contrastive learning recapitulates comparative genomics and maximizes the mutual information between sequence and function.
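A minimal sketch of the augmentation idea: homologs from the same family serve as the two "views" of a positive pair, which can then feed a contrastive loss such as the InfoNCE sketch above. The family data here is a hypothetical placeholder.

```python
# Sketch of phylogenetic augmentation: homologs of the same protein family
# act as positive pairs for contrastive learning; the data is illustrative.
import random

families = {
    "fam_A": ["MKLV...", "MKIV...", "MRLV..."],    # homologous sequences
    "fam_B": ["GSHM...", "GAHM..."],
}

def sample_positive_pair(families):
    fam = random.choice(list(families))
    seq1, seq2 = random.sample(families[fam], 2)   # two homologs = one positive pair
    return seq1, seq2

print(sample_positive_pair(families))
```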
Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings
Haoran Zhang*, Amy X. Lu*, Mohamed Abdalla, Matthew McDermott, Marzyeh Ghassemi
ACM Conference on Health, Inference, and Learning (CHIL), 2020
We apply fairness definitions to quantify cross-group bias in BERT embeddings pretrained on clinical notes, and find statistically significant differences in downstream classifier performance across groups.
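One such fairness definition can be computed directly as a recall gap (an equal-opportunity difference) between two groups; a minimal sketch on synthetic labels follows, with the group assignments and data being placeholders.

```python
# Sketch of one fairness metric for cross-group bias: the gap in recall
# between two groups on synthetic labels and predictions.
import numpy as np

def recall(y_true, y_pred):
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

y_true = np.random.randint(0, 2, 1000)
y_pred = np.random.randint(0, 2, 1000)
group = np.random.randint(0, 2, 1000)              # e.g., two demographic groups

gap = (recall(y_true[group == 0], y_pred[group == 0])
       - recall(y_true[group == 1], y_pred[group == 1]))
print(f"recall gap between groups: {gap:+.3f}")    # significance testable by permutation
```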
The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers
Alex X Lu, Amy X. Lu, Wiebke Schormann, Marzyeh Ghassemi, David Andrews, Alan Moses
Neural Information Processing Systems (NeurIPS), 2019
Introduces the COOS-7 dataset to benchmark the capacity of feature learning methods to generalize under natural distribution shifts in microscopy images.