Seyone Chithrananda (/say-on/)

Hi, I'm Seyone, an undergrad student at Berkeley studying computer science & bioengineering. I'm interested in using computation as a mechanism for understanding how biology works — and how we can engineer it to improve human health. My research goal is to build computational tools for the design, engineering & interpretation of biological systems.

If you'd like to chat, feel free to email me!! I'm always excited to talk all things science. You can also find me on Twitter and GitHub.

Blog

I'm starting to write again about science, philosophy, biotech and public policy. New posts will be added below; older ones are on my Medium.

Research

Presently, I'm interested in the intersection of machine learning with single-cell genomics & protein biology.

I'm a visiting undergrad researcher at the Broad Institute, advised by Eric Lander and Eeshit Dhaival Vashnav. This past summer, I was a research intern at Dyno Therapeutics, working on machine learning for protein engineering. Previously, I worked in the laboratory of Prof. Alan Aspuru-Guzik at the University of Toronto, and hacked on open-source software at DeepChem, co-creating the ChemBERTa suite of chemical language models. My publications are viewable on Google Scholar.

A Benchmark Framework for Evaluating Structure-to-Sequence Models for Protein Design
Jeffrey Chan, Seyone Chithrananda, David Brookes, Sam Sinai
NeurIPS ML for Structural Biology Workshop (2022)
[Presentation] / [Poster]

Preprint public soon. I built a benchmark for evaluating autoregressive structure-based protein design models (ESM-IF1, ProteinMPNN) on the GB1 dataset, comparing their log-likelihoods and joint sampling probabilities on highly epistatic regions of the protein. Alongside Jeffrey and David, I developed a model distillation approach for improving the generative capabilities of ESM.
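As a rough illustration of what scoring variants by log-likelihood means for an autoregressive model, here is a minimal Python sketch. It is not the benchmark's code: `toy_model`, the `NextResidueModel` interface, and the example variants are all hypothetical, and real structure-conditioned models such as ESM-IF1 or ProteinMPNN also condition on the backbone structure, which this toy omits.

```python
import math
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface: given the residues generated so far, return a
# probability for each possible next residue.
NextResidueModel = Callable[[str], Dict[str, float]]


def toy_model(prefix: str) -> Dict[str, float]:
    """Stand-in model that ignores the prefix and mildly prefers valine.

    Probabilities sum to 1: 0.24 for V plus 19 * 0.04 for the rest.
    """
    return {aa: (0.24 if aa == "V" else 0.04) for aa in AMINO_ACIDS}


def sequence_log_likelihood(model: NextResidueModel, sequence: str) -> float:
    """Sum of log p(residue_i | residues before i), the autoregressive factorization."""
    total = 0.0
    for i, residue in enumerate(sequence):
        probs = model(sequence[:i])
        total += math.log(probs[residue])
    return total


def rank_variants(model: NextResidueModel, variants: List[str]) -> List[str]:
    """Order variants from most to least likely under the model."""
    return sorted(
        variants,
        key=lambda seq: sequence_log_likelihood(model, seq),
        reverse=True,
    )


if __name__ == "__main__":
    # Illustrative four-residue variants, not taken from the GB1 data.
    for seq in rank_variants(toy_model, ["ADGA", "VDGV", "VVVV"]):
        print(seq, round(sequence_log_likelihood(toy_model, seq), 3))
```

In a benchmark setting, a ranking like this would be compared against measured fitness values for the same variants.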

ChemBERTa-2: Towards Chemical Foundation Models
Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
ELLIS ML for Molecule Discovery Workshop, arXiv (2021, 2022)
Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks. / [preprint] / [GitHub]

A continuation of the ChemBERTa project. We scaled language-model pre-training from 10M to 77M molecules and developed a novel pre-training task, multi-task regression. I also worked on open-source integration of the tokenizers and models into the DeepChem library.
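To give a feel for what tokenizing SMILES involves, here is a small Python sketch of a regex-based tokenizer. It is a simplified illustration and not the DeepChem or ChemBERTa tokenizer: the pattern and the function name `tokenize_smiles` are assumptions chosen for this example, and the pattern covers only common SMILES syntax.

```python
import re
from typing import List

# Simplified pattern (an assumption for this sketch): bracket atoms first,
# then two-letter halogens, two-digit ring closures, single-letter atoms,
# ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-/\\.@:~*$]"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Cl"))   # acetyl chloride
    print(tokenize_smiles("c1ccccc1"))   # benzene
    print(tokenize_smiles("[NH4+]"))     # ammonium
```

Matching `Cl` and `Br` before single letters keeps chlorine and bromine as one token each, and the round-trip check guards against silently dropping characters the pattern does not cover.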

Assigning confidence to molecular property prediction
AkshatKumar Nigam, Robert Pollice, Matthew Hurley, Riley Hickman, Matteo Aldeghi, Naruki Yoshikawa, Seyone Chithrananda, Vincent Voelz, Alan Aspuru-Guzik
Expert Opinion on Drug Discovery (2021)
Introduction: Computational modeling has rapidly advanced over the last decades. Recently, machine learning has emerged as a powerful and cost-effective strategy to learn from existing datasets and perform predictions on unseen molecules. Accordingly, the explosive rise of data-driven techniques raises an important question: What confidence can be assigned to molecular property predictions and what techniques can be used? Areas covered: The authors discuss popular strategies for predicting molecular properties, their corresponding uncertainty sources and methods to quantify uncertainty. First, the authors’ considerations for assessing confidence begin with dataset bias and size, data-driven property prediction and feature design. Next, the authors discuss property simulation via computations of binding affinity in detail. Lastly, they investigate how these uncertainties propagate to generative models, as they are usually coupled with property predictors. Expert opinion: Computational techniques are paramount to reduce the prohibitive cost of brute-force experimentation during exploration. The authors believe that assessing uncertainty in property prediction models is essential whenever closed-loop drug design campaigns relying on high-throughput virtual screening are deployed. Accordingly, considering sources of uncertainty leads to better-informed validations, more reliable predictions and more realistic expectations of the entire workflow. Overall, this increases confidence in the predictions and, ultimately, accelerates drug design. / [publication] / [preprint]

I was part of the team that contributed a review of the sources of uncertainty in molecular property prediction. My contribution focused on uncertainty in data inputs, such as experimental noise, docking errors, and errors arising from molecular representations like SMILES.

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
NeurIPS ML for Molecules Workshop, 2020
GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. However, in NLP, transformers have become the de-facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining. / [workshop paper] / [Talk]

Lead author on ChemBERTa, one of the first transformer language models for molecular property prediction. I developed it as an open-source project, which has amassed ~1M model downloads and ~110 citations to date.

Analysis of human hematopoietic cells using Scanpy and Dynamo
[Report]

I analyzed raw hematopoiesis data from the Cell paper, ‘Mapping transcriptomic vector fields of single cells’. First, I clustered the cells with the Leiden algorithm in Scanpy and identified marker genes by finding the most differentially expressed genes in each cluster and relating them to the corresponding cell type. I then used dynamo to compute RNA velocity on the scRNA-seq data to better understand cell fate transitions in the dataset, and ran in-silico perturbation tests to investigate cell fate outcomes after perturbing key regulators like GATA1. Class project for BIOENG 190/290 (cross-listed graduate course).

Acknowledgements

My homepage layout uses Kian Fazi's wonderful website template (also inspired by Jon Barron).