yurakuratov committed on
Commit 36b0b34 · 1 Parent(s): 3d0d8f6

add bibtex

Files changed (1)
  1. README.md +15 -0
README.md CHANGED
@@ -48,3 +48,18 @@ GENA-LM (BigBird-base T2T) model is trained in a masked language model (MLM) fas
  - 32k vocabulary size, tokenizer trained on DNA data.
 
  We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling SNPs (human mutations). Pre-training was performed for 1,070,000 iterations with a batch size of 256.
+
+ ## Citation
+ ```
+ @article{GENA_LM,
+ author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
+ title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
+ elocation-id = {2023.06.12.544594},
+ year = {2023},
+ doi = {10.1101/2023.06.12.544594},
+ publisher = {Cold Spring Harbor Laboratory},
+ URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
+ eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
+ journal = {bioRxiv}
+ }
+ ```
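
For context on the checkpoint described in the README above, here is a minimal sketch of loading the model and its 32k DNA tokenizer with the Hugging Face `transformers` library. The `AIRI-Institute/gena-lm-bigbird-base-t2t` repository id and the `trust_remote_code=True` flag are assumptions for illustration and are not stated in this diff.

```python
# Minimal sketch: load the DNA tokenizer and BigBird-base encoder.
# The repository id below is an assumption; adjust it to the checkpoint you actually use.
from transformers import AutoTokenizer, AutoModel

repo_id = "AIRI-Institute/gena-lm-bigbird-base-t2t"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)  # assumed flag, in case the repo ships custom model code

# Tokenize a raw DNA string and run a forward pass to get token-level hidden states.
dna = "ATGCATGCATGCATGCATGCATGC"
inputs = tokenizer(dna, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size)
```

The forward pass only produces contextual embeddings; task-specific heads (e.g., for sequence classification) would be added on top of these hidden states.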