jennzhuge committed · Commit 1218439 · verified · 1 Parent(s): 199c1b0

Update README.md

  1. README.md +24 -9
README.md CHANGED
@@ -14,32 +14,47 @@ This tool is intended to help conservationists/biologists identify unmatched eDN
14
 
15
  Conserving and monitoring biodiversity is crucial but challenging, especially in remote and densely vegetated areas. Current methods like camera traps and bioacoustic monitoring require processing huge stores of video/audio feeds. Additionally, satellite imagery analysis is challenging in areas with constant cloud cover or dense canopy cover, which may obscure the true conditions on the ground. Found in various states of decay within water, soil, or sediment, DNA can last from a few hours in temperate waters to millennia in cold, dry permafrost. These so-called environmental DNA (eDNA) samples allow for the direct extraction of DNA without any traces of the organism itself, offering a much less labor-intensive way to monitor biodiversity, provided the DNA sequences can be identified.
16
 
17
- The current method to identify eDNA sequences involves searching for a match within 2-3% difference in an incomplete reference library comprising direct specimen DNA samples (BOLD). However, many species are elusive or live in inaccessible regions, making direct sampling infeasible. We attempt to overcome the limitations of traditional species identification by using ecological layer data and eDNA. We hypothesize that, besides pure DNA similarity, knowledge about the area in which a sequence was found can give clues as to what the sequence could be. We introduce the largest DNA barcode model, trained on a global dataset of over five million sequences gathered from the Barcode of Life Data System ([Ratnasingham and Hebert, 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890991/)), and a comprehensive dataset from the Amazon rainforest, including DNA sequences and ecological layer data describing the coordinates where each sequence was sampled. The layers we use describe Annual Mean Air Temperature, Temperature Seasonality, Annual Precipitation, Precipitation Seasonality, Human Footprint, Elevation, and Population Density.
18
 
19
  Our findings show that our DNA model clusters species effectively in the embedding space with modest training. Additionally, incorporating ecological layer data improves accuracy in genus classification tasks. This integration of eDNA and ecological layers offers a scalable method for advanced biodiversity analysis and conservation.
20
 
21
 
22
- ## Genus Prediction
23
- Enter a DNA sequence and the coordinates where you sampled it. (We can easily extend this tool to handle multiple DNA sequences with CSV upload.)
24
- Our tool will output the top three most probable genera that your sample belongs to, based on the DNA and the environmental factors of the sample location. You can also see the top three most probable genera based on DNA similarity alone.
25
 
26
- ## DNA Embedding Space Visualization
27
- Suppose we have a DNA sequence for which the highest genus probability is very low (this could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples). We can still examine the DNA embedding of the sequence in relation to known samples. The left t-SNE plot shows the embedding space of the top N most common species in the area surrounding the given coordinate. We can see clear group distinctions between species. The right t-SNE plot shows how the sample sequence embedding is positioned in the space and identifies the nearest species clusters.
28
 
29
- # Future Work and Downstream Tasks
30
  We describe interesting avenues for future work that were not in scope due to time constraints.
31
 
32
  Future Tasks for the DNA Identifier Tool:
 
 
33
  - Include a CSV upload task to process many DNA sequences at once
34
  - Include more ecological layers, such as layers pertaining to soil properties
 
35
  - BOLD has other very interesting columns such as depth, elevation, and habitat (text description), but they are extremely sparse (>90%). We could add these as optional features to our tool and update our model to handle these inputs
36
  - Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus
37
 
38
  Potential downstream tasks include:
39
- - Identifying invasive species
40
  - Reclassifying wrongly classified species, e.g. a red panda is called a panda, but it's actually more genetically similar to a raccoon
 
41
 
42
  # Thank You
43
- This tool was developed as part of the GainForest EcoHackathon AI for Biodiversity Track. Thank you very much to the GainForest team and all mentors for such an engaging and fun EcoHackathon! <3 Lofi Amazon Beats
44
 
45
  <!-- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference -->
 
14
 
15
  Conserving and monitoring biodiversity is crucial but challenging, especially in remote and densely vegetated areas. Current methods like camera traps and bioacoustic monitoring require processing huge stores of video/audio feeds. Additionally, satellite imagery analysis is challenging in areas with constant cloud cover or dense canopy cover, which may obscure the true conditions on the ground. Found in various states of decay within water, soil, or sediment, DNA can last from a few hours in temperate waters to millennia in cold, dry permafrost. These so-called environmental DNA (eDNA) samples allow for the direct extraction of DNA without any traces of the organism itself, offering a much less labor-intensive way to monitor biodiversity, provided the DNA sequences can be identified.
16
 
17
+ The current method to identify eDNA sequences involves searching for a match within 2-3% difference in an incomplete reference library comprising direct specimen DNA samples (BOLD). However, many species are elusive or live in inaccessible regions, making direct sampling infeasible. We attempt to overcome the limitations of traditional species identification by using ecological layer data and eDNA. We hypothesize that, besides pure DNA similarity, knowledge about the area in which a sequence was found can give clues as to what the sequence could be. We introduce the largest DNA barcode model, trained on a global dataset of over five million sequences gathered from the Barcode of Life Data System ([Ratnasingham and Hebert, 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890991/)), and a comprehensive dataset from the Amazon rainforest, including DNA sequences and ecological layer data describing the coordinates where each sequence was sampled (https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon). The layers we use describe Annual Mean Air Temperature, Temperature Seasonality, Annual Precipitation, Precipitation Seasonality, Human Footprint, Elevation, and Population Density.
18
 
19
  Our findings show that our DNA model clusters species effectively in the embedding space with modest training. Additionally, incorporating ecological layer data improves accuracy in genus classification tasks. This integration of eDNA and ecological layers offers a scalable method for advanced biodiversity analysis and conservation.
20
 
21
 
22
+ ### Genus Prediction
23
+ Enter a DNA sequence and the coordinates where you sampled it. (We can easily extend this tool to handle multiple DNA sequences with CSV upload.) You can choose between two methods to predict the most probable genus. 'Cosine' calculates the cosine similarity between the embedding of your unidentified eDNA sequence and the embeddings of existing labelled sequences to determine the most probable genera; this method is not aware of environmental data. 'fine_tuned_model' outputs the predictions of a model trained on DNA embeddings and ecological layer data. A plot of the most probable genera is shown.
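
As a rough illustration of the 'Cosine' option, the sketch below ranks genera by cosine similarity between a query embedding and labelled reference embeddings. It is not the tool's actual code: the function names, the toy 3-dimensional vectors, the placeholder genus labels, and the choice to score each genus by its best-matching reference are all assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_genera_by_cosine(query, reference, top_n=3):
    """Rank genera by their best-matching reference embedding.

    `reference` is a list of (genus, embedding) pairs. Scoring a genus by
    its single best match is an assumption of this sketch.
    """
    best = defaultdict(lambda: -1.0)
    for genus, embedding in reference:
        score = cosine_similarity(query, embedding)
        best[genus] = max(best[genus], score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy data with placeholder labels; real embeddings are much larger.
reference = [
    ("genus_a", [1.0, 0.0, 0.0]),
    ("genus_a", [0.9, 0.1, 0.0]),
    ("genus_b", [0.0, 1.0, 0.0]),
    ("genus_c", [0.0, 0.0, 1.0]),
]
for genus, score in top_genera_by_cosine([0.8, 0.2, 0.0], reference):
    print(f"{genus}: {score:.3f}")
```

Because this ranking uses only the embeddings, it matches the statement above that the method is not aware of environmental data.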
 
24
 
25
+ ### DNA Embedding Space Visualization
26
+ Suppose we have a DNA sequence for which the highest genus probability is very low (this could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples). We can still examine the DNA embedding of the sequence in relation to known samples. The t-SNE plot shows the embedding space of the top k most common species in the dataset. Use the slider to choose k. We can see clear group distinctions between species.
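
The "top k most common species" selection can be sketched as a simple frequency filter applied before any projection. This is a minimal illustration with made-up labels and 2-dimensional vectors, not the tool's implementation; the function name `filter_top_k_species` is hypothetical.

```python
from collections import Counter


def filter_top_k_species(labels, embeddings, k):
    """Keep only the samples whose species is among the k most frequent."""
    top_species = {name for name, _ in Counter(labels).most_common(k)}
    kept = [(lab, emb) for lab, emb in zip(labels, embeddings) if lab in top_species]
    kept_labels = [lab for lab, _ in kept]
    kept_embeddings = [emb for _, emb in kept]
    return kept_labels, kept_embeddings


# Toy data with placeholder species labels.
labels = ["sp_a", "sp_b", "sp_a", "sp_c", "sp_b", "sp_a"]
embeddings = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.5, 0.5], [0.8, 0.9], [0.1, 0.1]]

kept_labels, kept_embeddings = filter_top_k_species(labels, embeddings, k=2)
print(kept_labels)
```

The filtered embeddings would then be passed to a t-SNE implementation (for example scikit-learn's `TSNE`) to obtain the 2D coordinates for the plot, with k taken from the slider.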
27
 
28
+ ## BarcodeBERT DNA Embeddings
29
+ We used BarcodeBERT (https://github.com/Kari-Genomics-Lab/BarcodeBERT) to produce the DNA embeddings. We trained the model on the 'nucraw' column of DNA sequences from the latest release of the BOLD Database (http://www.boldsystems.org/index.php/datapackages/Latest), following the preprocessing steps outlined by the BarcodeBERT approach (https://arxiv.org/pdf/2311.02401).
30
+
31
+ ## Classification Model Performance and Baseline Comparison of Ecological Layer Data Inclusion
32
+ We fine-tuned a single fully connected linear layer on our DNA embeddings, once with and once without ecological layer data, to predict genera. We found that the inclusion of ecological layer data improved our accuracy by _____.
33
+ | Model | Validation Accuracy |
34
+ | ----------- | ----------- |
35
+ | DNA Only | % |
36
+ | DNA & Env | % |
37
+ Our results can be validated with the
38
+ BarcodeBERT-Finetuned-Amazon-DNAOnly (https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) and BarcodeBERT-Finetuned-Amazon (https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) models, the test splits of the BOLD-Embeddings-Amazon (https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Amazon) and BOLD-Embeddings-Ecolayers-Amazon (https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) datasets, and the fine-tuning script in our repo (https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).
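
To make the classifier in this section concrete, here is a minimal sketch of a forward pass through a single linear layer over a DNA embedding concatenated with ecological layer values. It is not the linked `fine_tune.py`: the softmax output, the assumption that ecological values are already normalized, the toy weights, and every name in the code are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_genus(dna_embedding, eco_features, weights, bias, genera, top_n=3):
    """Single linear layer over concatenated DNA and ecological features.

    `weights` has one row per genus; each row must be as long as the
    concatenated input vector.
    """
    x = list(dna_embedding) + list(eco_features)
    for row in weights:
        if len(row) != len(x):
            raise ValueError("weight row length must match input length")
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    ranked = sorted(zip(genera, probs), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy example: 2-dimensional embedding plus one (assumed normalized) ecological value.
genera = ["genus_a", "genus_b", "genus_c"]
weights = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.5, -1.0],
]
bias = [0.0, 0.0, 0.0]

for genus, prob in predict_genus([1.0, 0.0], [0.5], weights, bias, genera):
    print(f"{genus}: {prob:.3f}")
```

Under these assumptions, a DNA-only baseline is the same layer with an empty `eco_features` list and correspondingly narrower weight rows.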
39
+
40
+ ## Future Work and Downstream Tasks
41
  We describe interesting avenues for future work that were not in scope due to time constraints.
42
 
43
  Future Tasks for the DNA Identifier Tool:
44
+ - The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinate, and the other showing how the user's sequence embedding is positioned in that space, along with its nearest species clusters
45
+ - Add legends with images of each genus
46
  - Include a CSV upload task to process many DNA sequences at once
47
  - Include more ecological layers, such as layers pertaining to soil properties
48
+ - Compare different models for the genus classification task
49
  - BOLD has other very interesting columns such as depth, elevation, and habitat (text description), but they are extremely sparse (>90%). We could add these as optional features to our tool and update our model to handle these inputs
50
  - Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus
51
 
52
  Potential downstream tasks include:
53
+ - Identifying invasive species from highly confident DNA matches of a sequence seen far from its native territory.
54
  - Reclassifying wrongly classified species, e.g. a red panda is called a panda, but it's actually more genetically similar to a raccoon
55
+ - Investigating how environmental factors affect DNA sequences, e.g. mutations.
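
The invasive-species idea in the list above could be prototyped as a distance check: flag a sample when the genus match is highly confident but the sample lies far from every known occurrence. The sketch below is only one possible reading of that idea. The thresholds, the representation of native territory as a list of known coordinates, and all names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_possible_invasive(sample_coord, confidence, known_coords,
                         min_confidence=0.95, min_distance_km=1000.0):
    """Flag confident matches that are far from every known occurrence.

    Both thresholds are arbitrary placeholders for this sketch.
    """
    if confidence < min_confidence or not known_coords:
        return False
    nearest = min(haversine_km(*sample_coord, *coord) for coord in known_coords)
    return nearest >= min_distance_km


# Toy coordinates: a sample in the Amazon basin against two sets of known occurrences.
sample = (-3.0, -60.0)
print(is_possible_invasive(sample, 0.99, [(-3.1, -60.2)]))
print(is_possible_invasive(sample, 0.99, [(48.0, 2.0)]))
```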
56
 
57
  # Thank You
58
+ This tool was developed as part of the GainForest EcoHackathon: AI for Biodiversity Track. Thank you very much to the GainForest team and all mentors for such an engaging and fun EcoHackathon! <3 Lofi Amazon Beats
59
 
60
  <!-- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference -->