API

dscript.alphabets
class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)
    Bases: object

    From Bepler & Berger.

    Parameters:
        chars (byte str) – List of characters in the alphabet
        encoding (np.ndarray) – Mapping of characters to numbers [default: encoding]
        mask (bool) – Set encoding mask [default: False]
        missing (int) – Number to use for a value outside the alphabet [default: 255]
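The encoding semantics described above can be sketched independently of dscript with a byte-level lookup table. The helper name `encode` and the toy `ACGT` alphabet below are assumptions for illustration only; characters outside the alphabet map to `missing` (255 by default), as documented.

```python
import numpy as np

def encode(seq, chars=b"ACGT", missing=255):
    """Map each byte of `seq` to its index in `chars`; bytes outside
    the alphabet map to the `missing` value."""
    # 256-entry lookup table initialized to `missing`.
    table = np.full(256, missing, dtype=np.uint8)
    for i, c in enumerate(chars):
        table[c] = i
    return table[np.frombuffer(seq, dtype=np.uint8)]
```

A masked or amino-acid alphabet (as in Uniprot21 below) would use the same table-lookup pattern with a different character set.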
class dscript.alphabets.Uniprot21(mask=False)
    Bases: dscript.alphabets.Alphabet

    Uniprot 21 amino acid encoding. From Bepler & Berger.
dscript.fasta
dscript.fasta.parse(f, comment='#')
    Parse a file in .fasta format.

    Parameters:
        f (_io.TextIOWrapper) – Input file object
        comment (str) – Character used for comments

    Returns:
        names, sequences

    Return type:
        list[str], list[str]
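A minimal sketch of the documented behavior (parallel lists of record names and sequences, with comment lines skipped); `parse_fasta` is an independent illustration, not the library's source:

```python
from io import StringIO

def parse_fasta(f, comment="#"):
    """Return parallel lists of record names and sequences from a
    .fasta-format file object, skipping blank and comment lines."""
    names, sequences = [], []
    for line in f:
        line = line.strip()
        if not line or line.startswith(comment):
            continue
        if line.startswith(">"):
            names.append(line[1:])   # record name without the '>' prefix
            sequences.append("")
        else:
            sequences[-1] += line    # continuation of the current sequence
    return names, sequences

names, seqs = parse_fasta(StringIO(">seq1\nMKT\nLLV\n# a comment\n>seq2\nGAV\n"))
```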
dscript.language_model
dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')
    Embed all files in a directory in .fasta format using the pre-trained language model from Bepler & Berger.

    Parameters:
        directory (str) – Input directory (.fasta format)
        outputPath (str) – Output embedding file (.h5 format)
        device (int) – Compute device to use for embeddings [default: 0]
        verbose (bool) – Print embedding progress [default: False]
        extension (str) – Extension of all files to read in [default: '.seq']
dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)
    Embed sequences using the pre-trained language model from Bepler & Berger.

    Parameters:
        fastaPath (str) – Input sequence file (.fasta format)
        outputPath (str) – Output embedding file (.h5 format)
        device (int) – Compute device to use for embeddings [default: 0]
        verbose (bool) – Print embedding progress [default: False]
dscript.language_model.lm_embed(sequence, use_cuda=False)
    Embed a single sequence using the pre-trained language model from Bepler & Berger.

    Parameters:
        sequence (str) – Input sequence to be embedded
        use_cuda (bool) – Whether to generate embeddings using a GPU device [default: False]

    Returns:
        Embedded sequence

    Return type:
        torch.Tensor
dscript.pretrained
dscript.pretrained.get_pretrained(version='human_v1')
    Get a pre-trained model object. See the documentation for the most up-to-date list.

    lm_v1 – Language model from Bepler & Berger.
    human_v1 – Human-trained model from the D-SCRIPT manuscript.

    Default: human_v1

    Parameters:
        version (str) – Version of pre-trained model to get

    Returns:
        Pre-trained model

    Return type:
        dscript.models.*
dscript.pretrained.get_state_dict(version='human_v1', verbose=True)
    Download a pre-trained model if it does not already exist on the local device.

    Parameters:
        version (str) – Version of trained model to download [default: 'human_v1']
        verbose (bool) – Print model download status on stdout [default: True]

    Returns:
        Path to the state dictionary for the pre-trained model

    Return type:
        str
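The download-if-missing behavior can be illustrated with a small caching helper. Everything here is an assumption for illustration: the cache directory layout, the `.pt` filename, and the `fetch` callback (which stands in for the actual download logic) are not dscript's real internals.

```python
import os
import tempfile

def get_cached(version, fetch, cache_dir=None, verbose=True):
    """Return the local path for `version`, calling `fetch(path)` to
    produce the file only when it is not already cached."""
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(), "dscript_models")
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{version}.pt")
    if not os.path.exists(path):
        if verbose:
            print(f"Downloading {version} to {path}")
        fetch(path)  # e.g. urllib.request.urlretrieve(url, path)
    elif verbose:
        print(f"Using cached {version}")
    return path
```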
dscript.utils
class dscript.utils.PairedDataset(X0, X1, Y)
    Bases: torch.utils.data.dataset.Dataset

    Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

    Parameters:
        X0 – List of first items in each pair
        X1 – List of second items in each pair
        Y – List of labels
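The indexing contract of such a paired dataset can be sketched without PyTorch; the real class subclasses torch.utils.data.Dataset, but the `__len__`/`__getitem__` protocol it must satisfy is the same. The alignment `assert` is an assumption added for illustration.

```python
class PairedDataset:
    """Pairs (X0[i], X1[i]) with label Y[i]; the three lists must align."""

    def __init__(self, X0, X1, Y):
        assert len(X0) == len(X1) == len(Y), "X0, X1, Y must have equal length"
        self.X0, self.X1, self.Y = X0, X1, Y

    def __len__(self):
        return len(self.X0)

    def __getitem__(self, i):
        # A DataLoader draws one (first item, second item, label) triple per index.
        return self.X0[i], self.X1[i], self.Y[i]
```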
dscript.utils.RBF(D, sigma=None)
    Convert a distance matrix into a similarity matrix using the Radial Basis Function (RBF) kernel.

    \(RBF(x, x') = \exp\left(\frac{-(x - x')^{2}}{2\sigma^{2}}\right)\)

    Parameters:
        D (np.ndarray) – Distance matrix
        sigma (float) – Bandwidth of the RBF kernel [default: \(\sqrt{\max(D)}\)]

    Returns:
        Similarity matrix

    Return type:
        np.ndarray
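A NumPy sketch consistent with the documented formula and default bandwidth (not the library's source); each distance d becomes exp(-d² / (2σ²)), so zero distance maps to similarity 1:

```python
import numpy as np

def rbf(D, sigma=None):
    """Convert a distance matrix D into a similarity matrix with the
    RBF kernel exp(-D**2 / (2 * sigma**2)); sigma defaults to sqrt(max(D))."""
    D = np.asarray(D, dtype=float)
    if sigma is None:
        sigma = np.sqrt(D.max())  # documented default bandwidth
    return np.exp(-(D ** 2) / (2 * sigma ** 2))
```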
dscript.utils.gpu_mem(device)
    Get current memory usage for a GPU.

    Parameters:
        device (int) – GPU device number

    Returns:
        memory used, memory total

    Return type:
        int, int
dscript.utils.log(msg, file=sys.stderr)
    Log a datetime-stamped message to a file.

    Parameters:
        msg – Message to log
        file – Writable file object to log the message to [default: sys.stderr]
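A minimal sketch of a datetime-stamped logger with the same signature; the exact timestamp format is an assumption:

```python
import sys
from datetime import datetime

def log(msg, file=sys.stderr):
    """Write a datetime-stamped message to a writable file object."""
    file.write(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {msg}\n")
    file.flush()  # make the message visible immediately, e.g. during training
```

Because `file` accepts any writable object, the same call works with an open log file or an in-memory buffer.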