yurakuratov committed
Commit 36b0b34
Parent(s): 3d0d8f6
add bibtex
README.md CHANGED
@@ -48,3 +48,18 @@ GENA-LM (BigBird-base T2T) model is trained in a masked language model (MLM) fashion
 - 32k Vocabulary size, tokenizer trained on DNA data.
 
 We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling SNPs (human mutations). Pre-training was performed for 1,070,000 iterations with a batch size of 256.
+
+## Citation
+```
+@article{GENA_LM,
+  author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
+  title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
+  elocation-id = {2023.06.12.544594},
+  year = {2023},
+  doi = {10.1101/2023.06.12.544594},
+  publisher = {Cold Spring Harbor Laboratory},
+  URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
+  eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
+  journal = {bioRxiv}
+}
+```
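The MLM pre-training objective mentioned in the diff context above can be illustrated with a minimal sketch of BERT-style token masking. This is a generic, hypothetical illustration — the 15% masking rate and 80/10/10 split are the conventional BERT defaults, and `MASK_ID` is an assumed special-token id; GENA-LM's actual masking code and hyperparameters are not part of this commit.

```python
import random

MASK_ID = 4          # hypothetical id for the [MASK] special token
VOCAB_SIZE = 32_000  # matches the 32k vocabulary size mentioned above

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~mask_prob of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns (masked_ids, labels); labels is -100 at positions the loss ignores."""
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID
            elif r < 0.9:
                masked[i] = rng.randrange(VOCAB_SIZE)
            # else: position is selected but the token is left unchanged

    return masked, labels

ids = list(range(100, 164))  # a toy 64-token sequence
masked, labels = mask_for_mlm(ids)
print(sum(l != -100 for l in labels))  # number of positions the MLM loss covers
```

The `-100` sentinel follows the common PyTorch convention of ignoring those positions in the cross-entropy loss, so only the selected ~15% of tokens contribute to the MLM objective.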