antichronology committed on
Commit
e00d5c4
·
verified ·
1 Parent(s): 0d59a3f

Update README.md

Files changed (1)
  1. README.md +77 -3
README.md CHANGED
@@ -1,3 +1,77 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - biology
5
+ - medical
6
+ metrics:
7
+ - pearsonr
8
+ ---
9
+
10
+ ## Model Overview
11
+ Orthrus is a model for predicting properties of mature RNA. It uses a Mamba encoder backbone, a variant of state-space models designed for long-sequence data such as RNA.
12
+
13
+ Two versions of Orthrus are available:
14
+
15
+ - 4-track base version: Encodes the mRNA sequence with a simplified one-hot approach.
16
+ - 6-track large version: Adds biological context by including splice site indicators and coding sequence markers, which are crucial for accurately predicting mRNA properties such as RNA half-life and ribosome load, and for tasks such as exon junction detection.
17
+
18
+ **Why the Mamba Backbone?**
19
+ The Mamba architecture is an extension of the S4 (structured state-space) model family, which excels at handling long sequences such as mRNAs that can exceed 12,000 nucleotides. This makes it a good fit for RNA property prediction for several reasons:
20
+
21
+ - _Efficient Memory Usage:_ Unlike transformers, whose memory use scales quadratically with sequence length, the Mamba backbone scales linearly, making it computationally efficient for long sequences.
22
+ - _Variable Context Filtering:_ RNA sequences often contain functionally relevant motifs separated by variable spacing. The Mamba model can selectively focus on these important elements.
23
+ - _Selective Context Compression:_ Genomic sequences often have uneven information density, with critical regulatory elements scattered across regions of varying importance. The Mamba model selectively compresses less informative regions while preserving the context of key functional areas.
24
+
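As a rough, back-of-envelope illustration of the memory point above (not a measurement of Orthrus or of any particular transformer), the sketch below compares the number of entries in a naive `L x L` attention score matrix with an `L x d` activation tensor that grows linearly in sequence length. The hidden size of 512 and the 4 bytes per value are assumptions chosen purely for illustration, and memory-efficient attention kernels can avoid materializing the full score matrix.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one naive (seq_len x seq_len) attention score matrix."""
    return seq_len * seq_len


def linear_activation_entries(seq_len: int, hidden_dim: int) -> int:
    """Entries in one (seq_len x hidden_dim) activation tensor."""
    return seq_len * hidden_dim


SEQ_LEN = 12_000       # long mRNA length mentioned in the text
HIDDEN_DIM = 512       # hypothetical hidden size, for illustration only
BYTES_PER_VALUE = 4    # assumes float32 storage

quadratic = attention_score_entries(SEQ_LEN)
linear = linear_activation_entries(SEQ_LEN, HIDDEN_DIM)

print(f"naive attention scores: {quadratic * BYTES_PER_VALUE / 1e6:.1f} MB per head per layer")
print(f"linear activations:     {linear * BYTES_PER_VALUE / 1e6:.1f} MB per layer")
```

Doubling the sequence length quadruples the first number but only doubles the second, which is the scaling difference the first bullet describes.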
25
+ ## Using Orthrus
26
+
27
+ Orthrus was trained on full mature RNA sequences, which makes its usage different from models like DNABERT or Enformer that operate on arbitrary DNA segments.
28
+ If you pass an incomplete fragment of a spliced RNA, the input will be out of distribution.
29
+
30
+ To generate embeddings using Orthrus for spliced mature RNA sequences, follow the steps below:
31
+
32
+ ### Generating Embeddings
33
+
34
+ #### 4-Track Model
35
+
36
+ The 4-track model requires only a one-hot encoded sequence of your mRNA. This representation captures the basic nucleotide information of the mRNA sequence.
37
+
38
+ Here is example code:
39
+ ```
40
+ # Sequence for short mRNA
41
+
42
+ # One hot encode function
43
+
44
+ # Load Orthrus
45
+
46
+ # Generate embedding
47
+ ```
48
+
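The one-hot step outlined above could look roughly like the following. This is a minimal sketch in plain Python, not the Orthrus implementation: the function name `one_hot_encode`, the A, C, G, T channel order, and the `(length, 4)` layout are all assumptions, so check the Orthrus repository for the order and tensor shape the model actually expects before feeding it anything.

```python
# Hypothetical channel order; the order Orthrus expects is not confirmed here.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq: str) -> list[list[float]]:
    """Return a (len(seq), 4) nested list; unknown bases such as N stay all-zero."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    channel_of = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = [[0.0] * len(NUCLEOTIDES) for _ in seq]
    for pos, base in enumerate(seq):
        channel = channel_of.get(base)
        if channel is not None:
            encoded[pos][channel] = 1.0
    return encoded


example = one_hot_encode("AUGCN")
for base, row in zip("AUGCN", example):
    print(base, row)
```

The nested list can be converted to an array or tensor with whichever numerical library you use to run the model.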
49
+ #### 6-Track Model (Recommended)
50
+ The 6-track model offers a more detailed representation by incorporating additional biological context, including splice site and coding sequence information.
51
+
52
+ We'll use GenomeKit, a library built by the folks at Deep Genomics, to extract DNA sequences and build 4- and 6-track representations of mRNA transcripts, which serve as input to Orthrus. GenomeKit makes it easy to work with genomic data, such as sequences and annotations, by providing tools to access and manipulate reference genomes and variants efficiently.
53
+
54
+ For more details, you can refer to the [GenomeKit documentation](https://deepgenomics.github.io/GenomeKit/api.html).
55
+
56
+ To install it:
57
+ ```
58
+ mamba install "genomekit>=6.0.0"
59
+ # we now want to download the genome annotations and the 2bit genome files
60
+ wget -O starter_build.sh https://raw.githubusercontent.com/deepgenomics/GenomeKit/main/starter/build.sh
61
+ chmod +x starter_build.sh
62
+ ./starter_build.sh
63
+ ```
64
+
65
+ We can now generate six-track encodings for any transcript!
66
+ ```
67
+ # import six hot encoding function
68
+
69
+ # import Genome, Interval, instantiate Genome
70
+
71
+ # Load Orthrus 6 track
72
+
73
+ # Generate embedding
74
+
75
+ ```
76
+
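To make the idea of the two extra tracks concrete, here is a minimal sketch of how a six-track encoding might be assembled once the exon lengths and coding sequence bounds of a transcript are known. It does not call GenomeKit or Orthrus. The function name `six_track_encode`, the channel order, and the marking convention (track 5 flags the first nucleotide of each codon, track 6 flags the last nucleotide of each exon, including the final one) are assumptions made for illustration, so compare against the encoding function shipped with Orthrus before relying on it.

```python
def six_track_encode(seq, exon_lengths, cds_start, cds_end):
    """Build a (len(seq), 6) nested list: 4 nucleotide tracks, 1 CDS track, 1 splice track.

    seq          -- spliced transcript sequence (RNA or DNA alphabet)
    exon_lengths -- exon lengths in transcript order; must sum to len(seq)
    cds_start    -- 0-based transcript offset of the first CDS base
    cds_end      -- 0-based transcript offset one past the last CDS base
    """
    if any(length <= 0 for length in exon_lengths):
        raise ValueError("exon lengths must be positive")
    if sum(exon_lengths) != len(seq):
        raise ValueError("exon lengths must sum to the sequence length")
    if not 0 <= cds_start <= cds_end <= len(seq):
        raise ValueError("CDS bounds must lie within the sequence")

    channel_of = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order
    seq = seq.upper().replace("U", "T")
    tracks = [[0.0] * 6 for _ in seq]

    # Tracks 0-3: one-hot nucleotides (unknown bases stay all-zero).
    for pos, base in enumerate(seq):
        channel = channel_of.get(base)
        if channel is not None:
            tracks[pos][channel] = 1.0

    # Track 4 (assumed convention): first nucleotide of each codon in the CDS.
    for pos in range(cds_start, cds_end, 3):
        tracks[pos][4] = 1.0

    # Track 5 (assumed convention): last nucleotide of each exon.
    exon_end = 0
    for length in exon_lengths:
        exon_end += length
        tracks[exon_end - 1][5] = 1.0

    return tracks


# Toy transcript: 2-nt 5' UTR, CDS "AUG AAA UAA", 2-nt 3' UTR, two exons of 6 and 7 nt.
example = six_track_encode("GGAUGAAAUAACC", exon_lengths=[6, 7], cds_start=2, cds_end=11)
codon_starts = [i for i, row in enumerate(example) if row[4] == 1.0]
exon_ends = [i for i, row in enumerate(example) if row[5] == 1.0]
print(codon_starts, exon_ends)
```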
77
+ Alternatively, this information can be extracted from genePred files available for download from the UCSC Genome Browser [here](https://genome.ucsc.edu/cgi-bin/hgTables).
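
If you go the genePred route, the genomic coordinates need to be converted into transcript coordinates first. The sketch below assumes the basic 10-column genePred layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds) with 0-based, half-open coordinates. Table Browser exports may prepend a `bin` column or append extra fields, in which case the column indices need adjusting. The function name `genepred_to_transcript_coords` is hypothetical, and the example record is made up.

```python
def genepred_to_transcript_coords(line: str):
    """Return (name, exon_lengths, cds_start, cds_end) in 5'-to-3' transcript coordinates."""
    fields = line.rstrip("\n").split("\t")
    name, strand = fields[0], fields[2]
    cds_start_g, cds_end_g = int(fields[5]), int(fields[6])
    # exonStarts / exonEnds are comma-separated and usually end with a trailing comma.
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    exons = list(zip(starts, ends))

    def exonic_bases_before(genomic_pos: int) -> int:
        """Count exonic bases located before genomic_pos on the plus strand."""
        return sum(max(0, min(genomic_pos, end) - start) for start, end in exons)

    exon_lengths = [end - start for start, end in exons]
    transcript_len = sum(exon_lengths)
    cds_start = exonic_bases_before(cds_start_g)
    cds_end = exonic_bases_before(cds_end_g)

    if strand == "-":
        # Minus-strand transcripts read right to left, so flip exons and CDS offsets.
        exon_lengths.reverse()
        cds_start, cds_end = transcript_len - cds_end, transcript_len - cds_start

    return name, exon_lengths, cds_start, cds_end


# Made-up record: two exons (10 nt and 20 nt) with an 18-nt CDS spanning the junction.
plus_line = "tx1\tchr1\t+\t100\t220\t103\t211\t2\t100,200,\t110,220,"
minus_line = plus_line.replace("\t+\t", "\t-\t")
plus_result = genepred_to_transcript_coords(plus_line)
minus_result = genepred_to_transcript_coords(minus_line)
print(plus_result)
print(minus_result)
```

The returned exon lengths and CDS offsets are the kind of annotation a six-track encoder needs alongside the spliced sequence itself.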