Update README.md
README.md
CHANGED
@@ -26,17 +26,17 @@ Enter a DNA sequence and the coordinates where you sampled it. (We can easily ex
 Prehaps we have a DNA sequence for which the highest genus probability is very low (this could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples), we can still examine the DNA embedding of the sequence in relation to known samples. The t-SNE plot shows the embedding space of the top k most common species in the dataset. Use the slider to choose k. We can see clear group distinctions between species.
 
 ## BarcodeBERT DNA Embeddings.
-The model we used to train the DNA embeddings is BarcodeBERT
+The model we used to train the DNA embeddings is [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest). We followed the preprocessing steps outlined by the BarcodeBERT approach (https://arxiv.org/pdf/2311.02401).
 
 ## Classification Model Performance and Baseline Comparison of Ecological Layer Data Inclusion
-We trained a fine-tuned single fully connected linear layer on our DNA embeddings with ecological layer data and without ecological layer data to predict genuses. We found that the
-| Model | Test Accuracy |
+We trained a fine-tuned single fully connected linear layer on our DNA embeddings with ecological layer data and without ecological layer data to predict genuses. We found that the model achieved a test accuracy of 82%.
+<!-- | Model | Test Accuracy |
 | ----------- | ------------- |
 | DNA Only | % |
 | DNA & Env | % |
-
+-->
 Our results can be validated with the
-BarcodeBERT-Finetuned-Amazon-DNAOnly
+[BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) and [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, testing splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon), [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine tuning script on our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).
 
 ## Future Work and Downstream Tasks
 We describe interesting avenues for future work that were not in scope due to time constraints.
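
For readers who want a feel for the comparison described in the README text above (a single fully connected linear layer trained on DNA embeddings with and without ecological layer features), here is a minimal sketch. It is not the project's `fine_tune.py`. The data is synthetic, and the function names, dimensions, learning rate, and epoch count are all assumptions chosen for illustration. The accuracies it prints say nothing about the real models.

```python
import numpy as np


def train_linear_classifier(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit one fully connected layer (softmax regression) by gradient descent."""
    n_samples, n_features = features.shape
    weights = np.zeros((n_features, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n_samples  # gradient of mean cross-entropy
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def accuracy(features, labels, weights, bias):
    """Fraction of samples whose highest-scoring class matches the label."""
    preds = (features @ weights + bias).argmax(axis=1)
    return float((preds == labels).mean())


# Synthetic stand-ins (assumption): 3 hypothetical genera, 8-dim "DNA
# embeddings", and 2 "ecological layer" features per sample.
rng = np.random.default_rng(0)
num_genera, dna_dim, env_dim = 3, 8, 2
labels = rng.integers(0, num_genera, size=300)
dna_emb = rng.normal(size=(300, dna_dim)) + 2.0 * np.eye(num_genera, dna_dim)[labels]
env = rng.normal(size=(300, env_dim)) + 0.5 * labels[:, None]

train, test = slice(0, 240), slice(240, 300)


def evaluate(features):
    """Train on the first 240 rows and report accuracy on the last 60."""
    w, b = train_linear_classifier(features[train], labels[train], num_genera)
    return accuracy(features[test], labels[test], w, b)


acc_dna_only = evaluate(dna_emb)
acc_dna_env = evaluate(np.hstack([dna_emb, env]))  # concatenate both inputs
print(f"DNA only:  {acc_dna_only:.2f}")
print(f"DNA & Env: {acc_dna_env:.2f}")
```

Concatenating the ecological features onto the embedding is only one way to combine the two inputs. The README does not say how the actual models do it.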