Hi, I'm Seyone, an undergrad student at Berkeley studying computer science & bioengineering. I'm interested in using computation as a mechanism for understanding how biology works — and how we can engineer it to improve human health. My research goal is to build computational tools for the design, engineering & interpretation of biological systems.
Beyond research, I'm involved on campus with Machine Learning at Berkeley, where I lead the research committee. We organize reading groups, mentor homegrown research projects, support members with funding to attend conferences, and create high-quality technical blog posts, among other things. I also co-organize the BioML seminar series with Samarth Jajoo, bringing leaders at the cutting edge of computation and biology in industry, academia and policy to Berkeley. I am especially interested in mentoring and encouraging younger students who want to explore research, startups, and biotech more broadly.
If you'd like to chat, feel free to email me! I'm always excited to talk all things science. You can also find me on Twitter and GitHub.
Presently, I'm interested in machine learning for single-cell genomics & protein biology.
At Berkeley, I work in the laboratory of Jennifer Doudna with Ron Boger on problems like model-guided CRISPR engineering, structure-conditioned biological sequence design, and tools for proteomic search. I am also a research intern at Microsoft Research New England on the BioML team, advised by Kevin Yang and Judith Amores. Previously, I was a research intern at Dyno Therapeutics, working on machine learning for protein engineering. In high school, I worked in the laboratory of Prof. Alan Aspuru-Guzik at the University of Toronto, and hacked on open-source software at DeepChem, co-creating the ChemBERTa suite of chemical language models. My publications are viewable on Google Scholar.
TOPH: Adapting A Contrastive Question-Answering Framework for Protein Search
Ron Boger*, Amy X. Lu*, Seyone Chithrananda*, Kevin Yang, Petr Skopintsev, Ben Adler, Eric Wallace, Peter Yoon, Pieter Abbeel, Jennifer Doudna
ICML Workshop on Computational Biology (2023)
We present a protein semantic similarity search method for RNA-guided endonuclease discovery, inspired by dense retrieval methods in open-domain question answering, and introduce a new dataset of CRISPR-Cas and evolutionarily related nucleases.
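To make the dense retrieval idea concrete, here is a minimal sketch of embedding-based similarity search: every protein is represented by a vector, and a query is answered by ranking database entries by cosine similarity. The names, the 3-dimensional embeddings, and the helper functions are hypothetical and purely illustrative; this is not the TOPH implementation, and a real system would obtain embeddings from a trained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, database, k=2):
    """Return the k (name, score) pairs most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; a learned encoder would produce
# much higher-dimensional vectors for each protein.
database = {
    "nuclease_A": [0.9, 0.1, 0.0],
    "nuclease_B": [0.7, 0.7, 0.0],
    "unrelated_C": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
hits = top_k(query, database, k=2)
```

In this toy setup the two nuclease-like entries rank above the unrelated one because their vectors point in a direction closer to the query.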
A Benchmark Framework for Evaluating Structure-to-Sequence Models for Protein Design
Jeffrey Chan, Seyone Chithrananda, David Brookes, Sam Sinai
NeurIPS ML for Structural Biology Workshop (2022)
We built a benchmark for evaluating autoregressive structure-based protein design models (ESM-IF1, ProteinMPNN) on the GB1 dataset, comparing log-likelihoods and joint sampling probabilities on highly epistatic regions of the protein. Alongside Jeffrey and David, I developed a model distillation approach for improving the generative capabilities of ESM.
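As a rough illustration of how a sequence log-likelihood is computed under an autoregressive model, the sketch below sums the log-probability of each residue given the residues before it. The `toy_model` function is a hypothetical stand-in with a two-letter alphabet; it ignores structure conditioning entirely and is not ESM-IF1, ProteinMPNN, or the benchmark's code.

```python
import math


def toy_model(prefix):
    """Hypothetical next-residue distribution given the prefix so far.

    It favors repeating the previous residue, so the probability at one
    position depends on the choice made at the position before it.
    """
    if not prefix:
        return {"A": 0.5, "G": 0.5}
    last = prefix[-1]
    other = "G" if last == "A" else "A"
    return {last: 0.8, other: 0.2}


def log_likelihood(sequence, model):
    """Sum of log p(residue_i | residues before i) over the whole sequence."""
    total = 0.0
    for i, residue in enumerate(sequence):
        probs = model(sequence[:i])
        total += math.log(probs[residue])
    return total


ll_aag = log_likelihood("AAG", toy_model)  # log(0.5) + log(0.8) + log(0.2)
ll_aga = log_likelihood("AGA", toy_model)  # log(0.5) + log(0.2) + log(0.2)
```

Because each term is conditioned on the prefix, the score of a variant reflects interactions between positions rather than independent per-site preferences, which is the property that matters when scoring epistatic regions.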
ChemBERTa-2: Towards Chemical Foundation Models
Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
ELLIS ML for Molecule Discovery Workshop, arXiv (2021, 2022)
Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks.
A continuation of the ChemBERTa project: we scaled language model pretraining from 10M to 77M molecules and developed a new pretraining task, multi-task regression. I also worked on the open-source integration of tokenizers and models within the DeepChem library.
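For readers unfamiliar with how a SMILES string becomes model input, here is a minimal sketch of regex-based tokenization, one common way to split SMILES into chemically meaningful tokens. The pattern and the `tokenize_smiles` helper are simplified assumptions for illustration; they are not the tokenizer used in ChemBERTa-2 or shipped in DeepChem.

```python
import re

# Simplified pattern (an assumption, not an exhaustive SMILES grammar):
# bracket atoms first, then two-letter halogens, two-digit ring closures,
# single-letter atoms, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


alanine_tokens = tokenize_smiles("C[C@H](N)C(=O)O")
halide_tokens = tokenize_smiles("ClCCBr")
```

Keeping bracket atoms such as `[C@H]` and halogens such as `Cl` as single tokens avoids splitting one atom across several vocabulary entries.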
Assigning confidence to molecular property prediction
AkshatKumar Nigam, Robert Pollice, Matthew Hurley, Riley Hickman, Matteo Aldeghi, Naruki Yoshikawa, Seyone Chithrananda, Vincent Voelz, Alan Aspuru-Guzik
Expert Opinion on Drug Discovery (2021)
Introduction: Computational modeling has rapidly advanced over the last decades. Recently, machine learning has emerged as a powerful and cost-effective strategy to learn from existing datasets and perform predictions on unseen molecules. Accordingly, the explosive rise of data-driven techniques raises an important question: What confidence can be assigned to molecular property predictions and what techniques can be used?
Areas covered: The authors discuss popular strategies for predicting molecular properties, their corresponding uncertainty sources and methods to quantify uncertainty. First, the authors’ considerations for assessing confidence begin with dataset bias and size, data-driven property prediction and feature design. Next, the authors discuss property simulation via computations of binding affinity in detail. Lastly, they investigate how these uncertainties propagate to generative models, as they are usually coupled with property predictors.
Expert opinion: Computational techniques are paramount to reduce the prohibitive cost of brute-force experimentation during exploration. The authors believe that assessing uncertainty in property prediction models is essential whenever closed-loop drug design campaigns relying on high-throughput virtual screening are deployed. Accordingly, considering sources of uncertainty leads to better-informed validations, more reliable predictions and more realistic expectations of the entire workflow. Overall, this increases confidence in the predictions and, ultimately, accelerates drug design.
I was part of the team contributing a review of the sources of uncertainty in molecular property prediction. My contribution focused on uncertainty in data inputs, such as experimental noise, docking errors, and errors arising from molecular representations like SMILES.
ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
NeurIPS ML for Molecules Workshop (2020)
GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. However, in NLP, transformers have become the de-facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining.
Lead author on ChemBERTa, one of the first language models for molecular property prediction. Developed the open-source project, which has amassed ~1M model downloads and ~110 citations to date.
Analysis of human hematopoietic cells using Scanpy and Dynamo
I analyzed raw hematopoiesis data from the Cell paper, ‘Mapping transcriptomic vector fields of single cells’. First, I used clustering tools such as the Leiden algorithm in Scanpy to find marker genes by analyzing the most differentially expressed genes in each cluster with regard to its cell type. I then used dynamo to compute RNA velocity from the scRNA-seq data to better understand cell fate transitions in the dataset, and ran in silico perturbation tests to investigate cell fate outcomes after perturbing key regulators like GATA1. Class project for BIOENG 190/290 (cross-listed graduate course).
My homepage layout uses Kian Fazi's wonderful website template (also inspired by Jon Barron).
I've been very privileged to learn from some incredible mentors over the years, from both research and other experiences. Many of them took me on when I had nothing to offer them, and I can't thank them enough. In no particular order:
- Dr. Bharath Ramsundar ~ Deep Forest Sciences
- Dr. Angelica Parente ~ Sutter Hill Ventures
- Prof. Tyler Cowen ~ Mercatus Centre/Emergent Ventures
- Prof. Alan Aspuru-Guzik ~ University of Toronto
- Prof. Eric Lander ~ Broad Institute of MIT & Harvard
- Dr. Jeffrey Chan & Dr. David Brookes ~ Dyno Therapeutics
- Navid Nathoo & Nadeem Nathoo ~ The Knowledge Society
- Arjun Sripathy ~ Tesla Autopilot/Berkeley