File size: 8,121 Bytes
97b6013
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)

# Seq2Species: Neural Network Models for Species Classification

*A deep learning solution for read-level taxonomic classification with 16s.*

Recent improvements in sequencing technology have made possible large, public
databases of biological sequencing data, bringing about new data richness for
many important problems in bioinformatics. However, this growing availability of
data creates a need for analysis methods capable of efficiently handling these
large sequencing datasets. We on the [Genomics team in Google
Brain](https://ai.google/research/teams/brain/healthcare-biosciences) are
particularly interested in the class of problems which can be framed as
assigning meaningful labels to short biological sequences, and are exploring the
possiblity of creating a general deep learning solution for solving this class
of sequence-labeling problems. We are excited to share our initial progress in
this direction by releasing Seq2Species, an open-source neural network framework
for [TensorFlow](https://www.tensorflow.org/) for predicting read-level
taxonomic labels from genomic sequence. Our release includes all the code
necessary to train new Seq2Species models.

## About Seq2Species

Briefly, Seq2Species provides a framework for training deep neural networks to
predict database-derived labels directly from short reads of DNA. Thus far, our
research has focused predominantly on demonstrating the value of this deep
learning approach on the problem of determining the species of origin of
next-generation sequencing reads from [16S ribosomal
DNA](https://en.wikipedia.org/wiki/16S_ribosomal_RNA). We used this
Seq2Species framework to train depthwise separable convolutional neural networks
on short subsequences from the 16S genes of more than 13 thousand distinct
species. The resulting classification model assign species-level probabilities
to individual 16S reads.

For more information about the use cases we have explored, or for technical
details describing how Seq2Species work, please see our
[preprint](https://www.biorxiv.org/content/early/2018/06/22/353474).

## Installation

Training Seq2Species models requires installing the following dependencies:

* python 2.7

* protocol buffers

* numpy

* absl

### Dependencies

Detailed instructions for installing TensorFlow are available on the [Installing
TensorFlow](https://www.tensorflow.org/install/) website. Please follow the
full instructions for installing TensorFlow with GPU support. For most
users, the following command will suffice for continuing with CPU support only:
```bash
# For CPU
pip install --upgrade tensorflow
```

The TensorFlow installation should also include installation of the numpy and
absl libraries, which are two of TensorFlow's python dependencies. If
necessary, instructions for standalone installation are available:

* [numpy](https://scipy.org/install.html)

* [absl](https://github.com/abseil/abseil-py)

Information about protocol buffers, as well as download and installation
intructions for the protocol buffer (protobuf) compiler, are available on the [Google
Developers website](https://developers.google.com/protocol-buffers/). A typical
Ubuntu user can install this library using `apt-get`:
```bash
sudo apt-get install protobuf-compiler
```

### Clone

Now, clone `tensorflow/models` to start working with the code:
```bash
git clone https://github.com/tensorflow/models.git
```

### Protobuf Compilation

Seq2Species uses protobufs to store and save dataset and model metadata. Before
the framework can be used to build and train models, the protobuf libraries must
be compiled. This can be accomplished using the following command:
```bash
# From tensorflow/models/research
protoc seq2species/protos/seq2label.proto --python_out=.
```

### Testing the Installation

One can test that Seq2Species has been installed correctly by running the
following command:
```bash
python seq2species/run_training_test.py
```

## Usage Information

Input data to Seq2Species models should be [tf.train.Example protocol messages](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) stored in
[TFRecord format](https://www.tensorflow.org/versions/r1.0/api_guides/python/python_io#tfrecords_format_details).
Specifically, the input pipeline expects tf.train.Examples with a 'sequence' field
containing a genomic sequence as an upper-case string, as one field for each
target label (e.g. 'species'). There should also be an accompanying
Seq2LabelDatasetInfo text protobuf containing metadata about the input, including
the possible label values for each target.

Below, we give an example command that could be used to launch training for 1000
steps, assuming that appropriate data and metadata files are stored at
`${TFRECORD}` and `${DATASET_INFO}`:
```bash
python seq2species/run_training.py --train_files ${TFRECORD}
--metadata_path ${DATASET_INFO} --hparams 'train_steps=1000'
--logdir $HOME/seq2species
```
This will output [TensorBoard
summaries](https://www.tensorflow.org/guide/summaries_and_tensorboard), [TensorFlow
checkpoints](https://www.tensorflow.org/guide/variables#checkpoint_files), Seq2LabelModelInfo and
Seq2LabelExperimentMeasures metadata to the logdir `$HOME/seq2species`.

### Preprocessed Seq2Species Data

We have provided preprocessed data based on 16S reference sequences from the
[NCBI RefSeq Targeted Loci
Project](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) in a Seq2Species
bucket on Google Cloud Storage. After installing the
[Cloud SDK](https://cloud.google.com/sdk/install),
one can download those data (roughly 25 GB) to a local directory `${DEST}` using
the `gsutil` command:
```bash
BUCKET=gs://brain-genomics-public/research/seq2species
mkdir -p ${DEST}
gsutil -m cp ${BUCKET}/* ${DEST}
```

To check if the copy has completed successsfully, check the `${DEST}` directory:
```bash
ls -1 ${DEST}
```
which should produce:
```bash
ncbi_100bp_revcomp.dataset_info.pbtxt
ncbi_100bp_revcomp.tfrecord
```

The following command can be used to train a copy of one of our best-perfoming
deep neural network models for 100 base pair (bp) data. This command also
illustrates how to set hyperparameter values explicitly from the commandline.
The file `configuration.py` provides a full list of hyperparameters, their descriptions,
and their default values. Additional flags are described at the top of
`run_training.py`.
```bash
python seq2species/run_training.py \
--num_filters 3 \
--noise_rate 0.04 \
--train_files ${DEST}/ncbi_100bp_revcomp.tfrecord \
--metadata_path ${DEST}/ncbi_100bp_revcomp.dataset_info.pbtxt \
--logdir $HOME/seq2species \
--hparams 'filter_depths=[1,1,1],filter_widths=[5,9,13],grad_clip_norm=20.0,keep_prob=0.94017831318,
lr_decay=0.0655052811,lr_init=0.000469689635793,lrelu_slope=0.0125376069918,min_read_length=100,num_fc_layers=2,num_fc_units=2828,optimizer=adam,optimizer_hp=0.885769367218,pointwise_depths=[84,58,180],pooling_type=avg,train_steps=3000000,use_depthwise_separable=true,weight_scale=1.18409526348'
```

### Visualization

[TensorBoard](https://github.com/tensorflow/tensorboard) can be used to
visualize training curves and other metrics stored in the summary files produced
by `run_training.py`. Use the following command to launch a TensorBoard instance
for the example model directory `$HOME/seq2species`:
```bash
tensorboard --logdir=$HOME/seq2species
```

## Contact

Any issues with the Seq2Species framework should be filed with the
[TensorFlow/models issue tracker](https://github.com/tensorflow/models/issues).
Questions regarding Seq2Species capabilities can be directed to
[[email protected]](mailto:[email protected]). This
code is maintained by [@apbusia](https://github.com/apbusia) and
[@depristo](https://github.com/depristo).