|
# Learning protein fitness models from evolutionary and assay-labelled data |
|
|
|
This repo is a collection of code and scripts for evaluating methods that combine evolutionary and assay-labelled data for protein fitness prediction. |
|
|
|
For more details, please see our pre-print [Combining evolutionary and assay-labelled data for protein fitness prediction](https://www.biorxiv.org/content/10.1101/2021.03.28.437402v1.abstract). |
|
|
|
## Contents |
|
- Repo contents |
|
- System requirements |
|
- Installation |
|
- Demo |
|
- Jackhmmer search |
|
- Fitness data |
|
- Density models |
|
- Predictors |
|
|
|
|
|
## Repo contents |
|
The repo has several main components:
|
- `data`: Processed protein fitness data. (Only one example data set is provided here due to GitHub repo size constraints. Please download all data sets from Dryad doi:10.6078/D1K71B.)
|
- `alignments`: Processed multiple sequence alignments. (Only one example alignment is provided here due to GitHub repo size constraints. Please download all alignments from Dryad doi:10.6078/D1K71B.) |
|
- `scripts`: Bash and Python scripts for data collection and data analysis. |
|
- `src`: Python code for training and evaluating the methods assessed in the paper, including the evaluation and comparison framework for the different predictors.
|
- `environment.yml`: Software dependencies for the conda environment.
|
|
|
The provided scripts write their outputs to the following directories:
|
- `inference`: Directory for intermediate files such as inferred sequence log likelihoods. |
|
- `results`: Directory for results as CSV files.
|
|
|
## System requirements |
|
|
|
### Hardware requirements |
|
Some of the methods, in particular the DeepSequence VAE, UniRep mLSTM, and ESM Transformer, require a GPU for training and inference. The GPU code in this repo has been tested on an NVIDIA Quadro RTX 8000 GPU.
|
|
|
Evaluating all of the methods, each with 20 random seeds, 19 data sets, and 10 training setups, would take a long time on a single core. Our evaluation code supports multiprocessing and has been tested on 32 cores.
|
|
|
Storing all intermediate files for all methods and all data sets requires approximately 100 GB of disk space.
|
|
|
### Software requirements |
|
The code has been tested on Ubuntu 18.04.5 LTS (Bionic Beaver) with conda 4.10.0 and Python 3.8.5. The (optional) Slurm scripts have been tested on Slurm 17.11.12. The list of software dependencies is provided in the `environment.yml` file.
|
|
|
## Installation |
|
|
|
1. Create the conda environment from the `environment.yml` file:
|
```
conda env create -f environment.yml
```
|
|
|
2. Activate the new conda environment: |
|
```
conda activate protein_fitness_prediction
```
|
|
|
3. Install the [plmc package](https://github.com/debbiemarkslab/plmc): |
|
```
cd $HOME   # or cd <directory_to_install_plmc>, and update scripts/plmc.sh accordingly
git clone https://github.com/debbiemarkslab/plmc.git
cd plmc
make all-openmp
```
|
|
|
The installation should finish in a few minutes. |
|
|
|
## Demo |
|
The one-hot linear model is the simplest example, as it only requires assay-labelled data. To evaluate it on the β-lactamase (BLAT_ECOLX) data with 240 training examples and 20 seeds on a single core:
|
```
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 onehot --n_seeds=20 --n_threads=1 --n_train=240
```
|
|
|
When the program finishes, the results from the 20 runs will be available in the file `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`.
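To summarize performance across the 20 runs, the results file can be aggregated with pandas. Below is a minimal sketch; the exact column names (we assume a per-run `spearman` column and a `predictor` column) are assumptions and should be checked against your copy of the results file:

```python
import pandas as pd

# Load the per-run evaluation results written by src/evaluate.py.
df = pd.read_csv("results/BLAT_ECOLX_Ranganathan2015-2500/results.csv")

# Inspect the available columns first; the names used below are assumptions.
print(df.columns.tolist())

# Hypothetical summary: mean and std of Spearman correlation across seeds.
if "spearman" in df.columns:
    print(df.groupby("predictor")["spearman"].agg(["mean", "std"]))
```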
|
|
|
As another example that involves both evolutionary and assay-labelled data, here we show how to evaluate the augmented Potts model on the same protein.
|
|
|
The multiple sequence alignments (MSAs) are available in the `alignments` directory for all proteins used in our assessment. For other proteins, MSAs can be retrieved by jackhmmer search (see the Jackhmmer search section).
|
|
|
From the MSA, first run plmc to estimate the couplings model:
|
```
bash scripts/plmc.sh BLAT_ECOLX BLAT_ECOLX_Ranganathan2015-2500
```
|
The resulting models are saved at `inference/BLAT_ECOLX_Ranganathan2015-2500/plmc`. |
|
|
|
Then, similarly to the one-hot linear model evaluation, run:
|
```
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 ev+onehot --n_seeds=20 --n_threads=1 --n_train=240
```
|
The evaluation should finish in a few minutes, and all results will be saved to `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`.
|
|
|
Here, `ev+onehot` refers to the augmented Potts model. Other models and data sets can be evaluated similarly, as long as the corresponding prerequisite files are present in the inference directory.
|
|
|
## Jackhmmer search |
|
1. Download UniRef100 in FASTA format from [UniProt](https://www.uniprot.org/downloads).
|
2. Index the UniRef100 FASTA file into SSI format with
```
esl-sfetch --index <seqdb>.fasta
```
|
3. Set the location of the FASTA file in `scripts/jackhmmer.sh`.
|
4. To run jackhmmer, use `scripts/jackhmmer.sh` to search the local FASTA file. In addition to running the jackhmmer search, the script implicitly calls the other file conversion scripts: it extracts target ids from the jackhmmer tabular output with `scripts/tblout2ids.py`, converts the FASTA output to a list of sequences with `scripts/fasta2txt.py`, and splits the sequences into train and validation sets with `scripts/randsplit.py`.
|
5. The outputs of the jackhmmer script will be in `jackhmmer/<dataset>/<run_id>`, where each iteration's alignment is saved as `iter-<N>.a2m` and the final alignment is saved as `alignment.a2m`. The full-length target sequences are saved as `target_seqs.fasta` and `target_seqs.txt`.
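To sanity-check a finished run, the final alignment can be inspected with a few lines of Python. A2M is a FASTA-like format, so counting header lines gives the number of aligned sequences (the dataset and run id in the path below are placeholders):

```python
from pathlib import Path

# Placeholder path; substitute your own <dataset> and <run_id>.
aln_path = Path("jackhmmer/BLAT_ECOLX/run1/alignment.a2m")

# A2M is FASTA-like: each sequence starts with a '>' header line.
n_seqs = sum(1 for line in aln_path.read_text().splitlines()
             if line.startswith(">"))
print(f"{n_seqs} sequences in {aln_path}")
```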
|
|
|
## Fitness data |
|
Each subdirectory of the `data` directory (e.g. `data/BLAT_ECOLX_Ranganathan2015-2500`) represents one data set of interest; this layout applies both to the example data set provided here and to all other data sets available on Dryad. Each subdirectory contains two key files.
- `wt.fasta` documents the WT sequence.
- `data.csv` contains three columns: `seq`, `log_fitness`, and `n_mut` (see the loading sketch below).
  `seq` is the mutated sequence and should have the same length as the WT sequence.
  `log_fitness` is the log enrichment ratio or another log-scale fitness value, where higher is better. Although referred to as `log_fitness` here, it corresponds to `fitness` in the paper.
  `n_mut` is the number of mutations separating the sequence from WT, where 0 indicates WT.
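As a concrete illustration of this layout, the sketch below loads the example data set and checks that `n_mut` matches the Hamming distance to the WT sequence; the consistency check is ours, not part of the repo's pipeline:

```python
import pandas as pd

# Read the WT sequence from the FASTA file (join lines, skip the header).
with open("data/BLAT_ECOLX_Ranganathan2015-2500/wt.fasta") as f:
    wt = "".join(line.strip() for line in f if not line.startswith(">"))

df = pd.read_csv("data/BLAT_ECOLX_Ranganathan2015-2500/data.csv")

# Every variant sequence should have the same length as WT.
assert (df["seq"].str.len() == len(wt)).all()

# n_mut should equal the Hamming distance to WT (0 for the WT row itself).
hamming = df["seq"].apply(lambda s: sum(a != b for a, b in zip(s, wt)))
assert (hamming == df["n_mut"]).all()
```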
|
|
|
## Density models |
|
|
|
### Potts model |
|
For learning a Potts model (EVmutation / plmc) from an MSA, see `scripts/plmc.sh`. The resulting couplings model files (saved to the inference directory) can be directly parsed by our corresponding `ev` and `ev+onehot` predictors.
|
|
|
### DeepSequence VAE |
|
1. Install the [DeepSequence package](https://github.com/debbiemarkslab/DeepSequence.git).
2. Set the DeepSequence package directory as `WORKING_DIR` in both `src/train_vae.py` and `src/inference_vae.py`.
|
3. Use `scripts/train_vae.sh` for training a VAE model from an MSA. |
|
4. For retrieving ELBOs from VAEs, see `scripts/inference_vae.sh`. |
|
5. The saved ELBO files in the inference directory can be parsed by the corresponding `vae` and `vae+onehot` predictors.
|
|
|
### ESM |
|
1. Follow the instructions from the [ESM repo](https://github.com/facebookresearch/esm.git) to download the pre-trained model weights.
|
2. Set the location of the downloaded pre-trained weights in `scripts/inference_esm.sh`.
|
3. To retrieve ESM Transformer approximate pseudo-log-likelihoods for sequences in a FASTA file, see `scripts/inference_esm.sh`. The results will be written to the inference directory and can be used by the `esm` and `esm+onehot` predictors.
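For intuition about what the inference script computes, here is a minimal sketch of scoring a sequence with a pre-trained ESM model via the `fair-esm` package, summing per-position log-probabilities of the observed residues from a single forward pass (a wild-type-marginal style approximation); the repo's script may use a different model or scoring variant:

```python
import torch
import esm

# Load a pre-trained ESM-1b model (downloads weights on first use).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Example sequence; in practice, iterate over the sequences in the FASTA file.
data = [("variant1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    logits = model(tokens)["logits"]
    log_probs = torch.log_softmax(logits, dim=-1)

# Sum log-probabilities of the observed tokens (position 0 is the BOS token).
seq_len = len(data[0][1])
score = sum(log_probs[0, i, tokens[0, i]].item() for i in range(1, seq_len + 1))
print(f"approximate pseudo-log-likelihood: {score:.2f}")
```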
|
|
|
### UniRep |
|
1. Download the pre-trained UniRep weights (1900-unit) from the [UniRep repo](https://github.com/churchlab/UniRep#obtaining-weight-files).
|
2. Set the location of the downloaded weights in `scripts/evotune_unirep.sh`.
|
3. Use the `scripts/evotune_unirep.sh` script to evotune the UniRep model with an MSA file as `seqspath`.
|
4. Use `scripts/inference_unirep.sh` to calculate log-likelihoods from an evotuned UniRep model.
|
|
|
## Predictors |
|
Each type of predictor is represented by a Python class in `src/predictors`. A predictor class represents a prediction strategy for protein fitness that depends on evolutionary data, assay-labelled data, or both. The base class, `BasePredictor`, is defined in `src/predictors/base_predictors.py`; all predictor classes inherit from it and implement the `train` and `predict` methods. The `JointPredictor` class is a meta-predictor that combines the features of multiple existing predictor classes and can be specified simply by the sub-predictor names. See `src/predictors/__init__.py` for a full list of implemented predictors.
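To illustrate the intended structure, here is a hypothetical toy predictor following the described interface; the actual constructor and method signatures of `BasePredictor` in `src/predictors/base_predictors.py` may differ, so treat this as a sketch rather than a drop-in class:

```python
import numpy as np

class MeanPredictor:
    """Toy baseline: predicts the mean training fitness for every sequence.

    Hypothetical sketch only; real predictors subclass BasePredictor and
    must match its actual signatures.
    """

    def train(self, train_seqs, train_labels):
        # "Training" just memorizes the mean assay-labelled fitness.
        self.mean_ = float(np.mean(train_labels))

    def predict(self, seqs):
        # Every sequence receives the same score (a sanity-check baseline).
        return np.full(len(seqs), self.mean_)
```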
|
|