jennzhuge committed · Commit 111e4de · verified · 1 Parent(s): fb3175f

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -28,25 +28,25 @@ Perhaps we have a DNA sequence for which the highest genus probability is very l
  ## BarcodeBERT DNA Embeddings.
  The model we used to train the DNA embeddings is [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest). We followed the preprocessing steps outlined by the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).
 
- ## Classification Model Performance and Baseline Comparison of Ecological Layer Data Inclusion
+ ## Classification Model Performance
+ <!-- Baseline Comparison of Ecological Layer Data Inclusion -->
  We trained a fine-tuned single fully connected linear layer on our DNA embeddings with ecological layer data and without ecological layer data to predict genuses. We found that the model achieved a test accuracy of 82%.
  <!-- | Model | Test Accuracy |
  | ----------- | ------------- |
  | DNA Only | % |
  | DNA & Env | % |
  -->
- Our results can be validated with the
- [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) and [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, testing splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon), [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine tuning script on our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).
+ Our results can be validated with the [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, testing splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon), [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine tuning script on our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).
 
  ## Future Work and Downstream Tasks
  We describe interesting avenues for future work that were not in scope due to time constraints.
 
  Future Tasks for the DNA Identifier Tool:
- - The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinate, and the other showing how the user's sequence embedding is positioned in the space and show it's nearest species clusters
- - Add legends with images of each genus
+ - The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinate, and the other showing how the user's sequence embedding is positioned in the space and show its nearest species clusters
  - Include a CSV upload task to process many DNA sequences at once
  - Include more ecological layers such as layers pertaining to soil properties
- - Compare different models for the genus classification task
+ - Add legends with images of each genus
+ - Compare more models for the genus classification task, like the [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) model
  - BOLD has very interesting other columns such as depth, elevation, habitat (text description), and others, but they are extremely sparse (>90%). We could add these as optional features to our Tool and update our model to handle these inputs
  - Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus
 
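
The README text in this diff describes a single fully connected layer trained on DNA embeddings, with and without ecological layer data. As a rough illustration of that setup (not the project's `fine_tune.py`, which is not shown here), the sketch below trains a softmax linear layer on made-up toy vectors in plain Python. The names `LinearClassifier`, `build_features`, `fit`, and `accuracy`, the feature dimensions, and all sample values are assumptions chosen only for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_features(dna_embedding, eco_layers=None):
    """Concatenate a DNA embedding with optional ecological layer values."""
    features = list(dna_embedding)
    if eco_layers is not None:
        features.extend(eco_layers)
    return features


class LinearClassifier:
    """Hypothetical single fully connected layer: logits = W x + b."""

    def __init__(self, n_features, n_classes, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.01, 0.01) for _ in range(n_features)]
                  for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def logits(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def predict(self, x):
        z = self.logits(x)
        return max(range(len(z)), key=z.__getitem__)

    def train_step(self, x, label, lr):
        # Cross-entropy gradient w.r.t. logits is (probs - one_hot(label)).
        probs = softmax(self.logits(x))
        for k, p in enumerate(probs):
            grad = p - (1.0 if k == label else 0.0)
            for j, xj in enumerate(x):
                self.W[k][j] -= lr * grad * xj
            self.b[k] -= lr * grad


def fit(model, xs, ys, epochs=50, lr=0.5):
    """Plain per-sample gradient descent over the toy data."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            model.train_step(x, y, lr)
    return model


def accuracy(model, xs, ys):
    correct = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return correct / len(ys)


# Toy data (invented): 3-dim "DNA embeddings", 1 ecological value, 2 genera.
dna = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.3]]
eco = [[0.9], [0.8], [0.1], [0.2]]
labels = [0, 0, 1, 1]

x_dna_only = [build_features(d) for d in dna]
x_dna_env = [build_features(d, e) for d, e in zip(dna, eco)]

dna_only_model = fit(LinearClassifier(3, 2), x_dna_only, labels)
dna_env_model = fit(LinearClassifier(4, 2), x_dna_env, labels)

print("DNA only accuracy:", accuracy(dna_only_model, x_dna_only, labels))
print("DNA & Env accuracy:", accuracy(dna_env_model, x_dna_env, labels))
```

The two models differ only in input width, which mirrors the "DNA Only" versus "DNA & Env" rows of the commented-out table. The toy accuracies printed here say nothing about the reported 82% test accuracy.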