Update README.md
README.md
CHANGED
@@ -28,25 +28,25 @@ Perhaps we have a DNA sequence for which the highest genus probability is very l
  ## BarcodeBERT DNA Embeddings.
  The model we used to train the DNA embeddings is [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest). We followed the preprocessing steps outlined by the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).

- ## Classification Model Performance
  We fine-tuned a single fully connected linear layer on our DNA embeddings, with and without ecological layer data, to predict genera. We found that the model achieved a test accuracy of 82%.
  <!-- | Model | Test Accuracy |
  | ----------- | ------------- |
  | DNA Only | % |
  | DNA & Env | % |
  -->
- Our results can be validated with the
- [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) and [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, testing splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon), [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine tuning script on our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).

  ## Future Work and Downstream Tasks
  We describe interesting avenues for future work that were not in scope due to time constraints.

  Future Tasks for the DNA Identifier Tool:
- - The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinate, and the other showing how the user's sequence embedding is positioned in the space and show
- - Add legends with images of each genus
  - Include a CSV upload task to process many DNA sequences at once
  - Include more ecological layers such as layers pertaining to soil properties
- -
  - BOLD has other interesting columns such as depth, elevation, habitat (text description), and others, but they are extremely sparse (>90%). We could add these as optional features to our Tool and update our model to handle these inputs
  - Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus

  ## BarcodeBERT DNA Embeddings.
  The model we used to train the DNA embeddings is [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest). We followed the preprocessing steps outlined by the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).

+ ## Classification Model Performance
+ <!-- Baseline Comparison of Ecological Layer Data Inclusion -->
  We fine-tuned a single fully connected linear layer on our DNA embeddings, with and without ecological layer data, to predict genera. We found that the model achieved a test accuracy of 82%.
  <!-- | Model | Test Accuracy |
  | ----------- | ------------- |
  | DNA Only | % |
  | DNA & Env | % |
  -->
+ Our results can be validated with the [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) model, the test splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon) and [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine-tuning script in our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).

  ## Future Work and Downstream Tasks
  We describe interesting avenues for future work that were not in scope due to time constraints.

  Future Tasks for the DNA Identifier Tool:
+ - The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinate, and the other showing how the user's sequence embedding is positioned in the space and showing its nearest species clusters
  - Include a CSV upload task to process many DNA sequences at once
  - Include more ecological layers such as layers pertaining to soil properties
+ - Add legends with images of each genus
+ - Compare more models for the genus classification task, like the [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) model
  - BOLD has other interesting columns such as depth, elevation, habitat (text description), and others, but they are extremely sparse (>90%). We could add these as optional features to our Tool and update our model to handle these inputs
  - Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus
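
The README describes the classifier as a single fully connected linear layer applied to DNA embeddings, with and without ecological layer data. The following is a minimal sketch of that idea, not the project's actual code (which lives in the linked `fine_tune.py`). The dimensions, the names `LinearHead` and `build_features`, and the random untrained weights are all assumptions made for illustration; training is omitted entirely.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """One fully connected layer: logits = W x + b (hypothetical helper)."""

    def __init__(self, in_dim, n_classes, seed=0):
        rng = random.Random(seed)
        # Random, untrained weights; a real head would be learned from data.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(n_classes)
        ]
        self.bias = [0.0] * n_classes

    def predict_proba(self, features):
        if len(features) != len(self.weights[0]):
            raise ValueError("feature length does not match the layer input size")
        logits = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)


def build_features(dna_embedding, eco_layers=None):
    """Append ecological layer values to the DNA embedding when provided."""
    return list(dna_embedding) + list(eco_layers or [])


# Toy sizes, chosen only for the example (the real sizes are not stated here).
EMB_DIM, ECO_DIM, N_GENERA = 8, 3, 4

dna_only_head = LinearHead(EMB_DIM, N_GENERA)
dna_env_head = LinearHead(EMB_DIM + ECO_DIM, N_GENERA)

embedding = [0.1 * i for i in range(EMB_DIM)]  # stand-in for a BarcodeBERT embedding
eco = [0.5, -1.2, 0.3]                         # stand-in for ecological layer values

probs_dna = dna_only_head.predict_proba(build_features(embedding))
probs_env = dna_env_head.predict_proba(build_features(embedding, eco))
predicted_genus_index = max(range(N_GENERA), key=probs_env.__getitem__)
```

The only difference between the two variants in this sketch is the input width: the "DNA & Env" head sees the embedding concatenated with the ecological values, while the "DNA Only" head sees the embedding alone.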