jennzhuge committed · commit 08031ca (verified) · parent 8260f31

Update README.md

Perhaps we have a DNA sequence for which the highest genus probability is very low (possibly because scientists have not managed to directly sample any specimens of that genus, so our training dataset, BOLD, contains no examples). Even then, we can still examine the DNA embedding of the sequence in relation to known samples. The t-SNE plot shows the embedding space of the top k most common species in the dataset; use the slider to choose k. We can see clear group distinctions between species.
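The t-SNE projection behind this plot can be sketched with scikit-learn. This is a minimal illustration on random vectors standing in for the real BarcodeBERT embeddings; the array shape and `perplexity` value here are assumptions, not our production settings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for BarcodeBERT embeddings: 200 samples, 768-dim.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))

# Project to 2-D for plotting; perplexity must be smaller than n_samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(embeddings)  # shape (200, 2), one point per sample
```

Each row of `coords` is then scattered and colored by species to produce the plot.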

## BarcodeBERT DNA Embeddings
We generated the DNA embeddings with [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the `nucraw` column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest), following the preprocessing steps outlined in the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).
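BarcodeBERT tokenizes barcode sequences into non-overlapping k-mers before embedding them. A rough sketch of that preprocessing step; the choice of k = 4, the mapping of ambiguous bases to `N`, and dropping the trailing fragment are assumptions for illustration, not the exact procedure from the paper:

```python
def to_kmers(seq: str, k: int = 4) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mer tokens.

    Assumptions: characters outside ACGT become 'N', and a trailing
    fragment shorter than k is dropped; the real pipeline may differ.
    """
    seq = "".join(c if c in "ACGT" else "N" for c in seq.upper())
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print(to_kmers("ACGTACGTAC"))  # ['ACGT', 'ACGT'] — trailing 'AC' dropped
```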

## Classification Model Performance and Baseline Comparison of Ecological Layer Data Inclusion
We fine-tuned a single fully connected linear layer on our DNA embeddings, both with and without ecological layer data, to predict genera. The model achieved a test accuracy of 82%.
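The classifier head is just one linear layer over the embedding, optionally concatenated with the ecological features. A minimal NumPy sketch on synthetic data; the dimensions, learning rate, and data below are all made up for illustration, and the real model trains on our BOLD embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, emb_dim, env_dim, n_genera = 256, 768, 8, 10

# Synthetic stand-ins for DNA embeddings, ecological features, and genus labels.
dna = rng.normal(size=(n, emb_dim))
env = rng.normal(size=(n, env_dim))
y = rng.integers(0, n_genera, size=n)

# "DNA & Env" variant concatenates features; "DNA Only" would use `dna` alone.
X = np.concatenate([dna, env], axis=1)

# Single fully connected layer trained with softmax cross-entropy.
W = np.zeros((X.shape[1], n_genera))
b = np.zeros(n_genera)
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0          # gradient of cross-entropy w.r.t. logits
    W -= 0.01 * X.T @ p / n
    b -= 0.01 * p.mean(axis=0)

acc = (np.argmax(X @ W + b, axis=1) == y).mean()
```

Training the same head on `dna` alone gives the "DNA Only" baseline for comparison.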
<!-- | Model | Test Accuracy |
| ----------- | ------------- |
| DNA Only | % |
| DNA & Env | % |
-->
Our results can be validated with the [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) and [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, the test splits of the [BOLD-Embeddings-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon) and [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the [fine-tuning script](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py) in our repo.
 
## Future Work and Downstream Tasks
  We describe interesting avenues for future work that were not in scope due to time constraints.
 