Publications
See Google Scholar for an up-to-date list of works.

Genome modeling and design across all domains of life with Evo 2
bioRxiv, 2025
Evo 2 is a 40B parameter genomic foundation model capable of predicting functional impacts of genetic variations, autonomously learning biological features, and generating novel genomic sequences across all domains of life.

Protein Language Model Fitness Is a Matter of Preference
International Conference on Learning Representations (ICLR), 2025
Using influence functions enabled by a one-pass pseudolikelihood algorithm, we find that pLMs capture artifacts of training data selection rather than the true fitness landscape.

Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
bioRxiv, 2024
CHEAP is a joint embedding of protein sequence and structure that can be obtained from sequence alone; it unveils insights into the compressibility, tokenizability, and mechanistic interpretability of protein folding models.

Self-Supervised Contrastive Learning of Protein Representations by Mutual Information Maximization
Machine Learning for Computational Biology (MLCB), 2020
CPCProt uses contrastive learning to produce parameter-efficient protein embeddings that perform competitively with large language models.

TOPH: Adapting A Contrastive Question-Answering Framework for Protein Search
ICML Workshop on Computational Biology, 2023
We present a protein semantic similarity search method for RNA-guided endonuclease discovery, inspired by dense retrieval methods in open-domain question answering, and introduce a new dataset of CRISPR-Cas and evolutionarily related nucleases.

Pretraining strategies for effective promoter-driven gene expression prediction
bioRxiv, 2023
Pretraining and transfer learning strategies for improving model-based design of promoters for cell type-specific expression.

Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics
ICLR Workshop on Machine Learning for Drug Discovery (MLDD), 2022
Strategies for data curation, model training, optimization, and evaluation heuristics for data-driven design of de novo proteins.

Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning
PLOS Computational Biology, 2022
Reverse Homology is a self-supervised method that uses contrastive learning to capture evolutionary information and discover molecular features of intrinsically disordered regions.

Learned embeddings from deep learning to visualize and predict protein sets
Current Protocols, 2021

Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning
Machine Learning for Computational Biology (MLCB), 2020
We outline how viewing evolution as natural sequence augmentation for contrastive learning recapitulates comparative genomics and maximizes the mutual information between sequence and function.

Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings
ACM Conference on Health, Inference, and Learning (CHIL), 2020
We apply fairness definitions to quantify the cross-group bias in BERT embeddings pretrained on medical notes, and find statistically significant differences in classifier performance.

The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers
Neural Information Processing Systems (NeurIPS), 2019
Introduces the COOS-7 dataset to benchmark the ability of feature learning methods to generalize to natural distribution shifts in microscopy images.