API

dscript.alphabets

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)

Bases: object

From Bepler & Berger.

Parameters
  • chars (byte str) – List of characters in alphabet

  • encoding (np.ndarray) – Mapping of characters to numbers [default: None]

  • mask (bool) – Set encoding mask [default: False]

  • missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)

Decode numeric encoding to byte string of this alphabet

Parameters

x (np.ndarray) – Numeric encoding

Returns

Amino acid string

Return type

byte str

encode(x)

Encode a byte string into alphabet indices

Parameters

x (byte str) – Amino acid string

Returns

Numeric encoding

Return type

np.ndarray
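The encode/decode round trip described above can be illustrated with a small stand-in. This is a minimal sketch, not the dscript implementation: the class name SketchAlphabet is hypothetical, it returns plain Python lists instead of np.ndarray, and it assumes that characters map to their position in chars while anything outside the alphabet maps to missing.

```python
class SketchAlphabet:
    """Hypothetical stand-in for an alphabet class; not the dscript code."""

    def __init__(self, chars, missing=255):
        self.chars = bytes(chars)
        self.missing = missing
        # 256-entry lookup table: every byte maps to `missing`
        # unless it appears in `chars` (assumed behaviour).
        table = bytearray([missing] * 256)
        for index, byte in enumerate(self.chars):
            table[byte] = index
        self.table = bytes(table)

    def encode(self, x):
        """Map a byte string to a list of alphabet indices."""
        return list(x.translate(self.table))

    def decode(self, indices):
        """Map alphabet indices back to a byte string.

        Indices equal to `missing` are not handled in this sketch.
        """
        return bytes(self.chars[i] for i in indices)


alphabet = SketchAlphabet(b"ARND")
encoded = alphabet.encode(b"DAN")      # indices into b"ARND"
decoded = alphabet.decode(encoded)     # back to the original bytes
unknown = alphabet.encode(b"AZ")       # b"Z" is outside the alphabet
```

A fixed lookup table keeps encoding a single pass over the input, and reserving one value for out-of-alphabet characters makes them easy to detect downstream.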

class dscript.alphabets.Uniprot21(mask=False)

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta

dscript.fasta.parse(f, comment='#')

Parse a file in .fasta format.

Parameters
  • f (_io.TextIOWrapper) – Input file object

  • comment (str) – Character used for comments

Returns

names, sequences

Return type

list[str], list[str]
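To make the expected input and output shapes concrete, here is a minimal sketch of a parser with the same call shape. The function name parse_fasta_sketch is hypothetical, and the handling of comment lines, blank lines, and multi-line sequences is an assumption about typical .fasta parsing rather than a description of what dscript.fasta.parse does internally.

```python
import io


def parse_fasta_sketch(f, comment="#"):
    """Return (names, sequences) from a text file object in .fasta format."""
    names, sequences = [], []
    name, chunks = None, []
    for line in f:
        line = line.strip()
        # Skip blank lines and comment lines (assumed behaviour).
        if not line or line.startswith(comment):
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                names.append(name)
                sequences.append("".join(chunks))
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if name is not None:
        names.append(name)
        sequences.append("".join(chunks))
    return names, sequences


handle = io.StringIO("# example\n>seq1\nMKT\nAYI\n>seq2\nGSH\n")
names, sequences = parse_fasta_sketch(handle)
```

The two returned lists are parallel: names[i] is the header for sequences[i].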

dscript.fasta.parse_directory(directory, extension='.seq')

Parse all files in a directory ending with extension.

Parameters
  • directory (str) – Input directory

  • extension (str) – Extension of all files to read in

Returns

names, sequences

Return type

list[str], list[str]

dscript.fasta.write(nam, seq, f)

Write a file in .fasta format.

Parameters
  • nam (list[str]) – List of names

  • seq (list[str]) – List of sequences

  • f (_io.TextIOWrapper) – Output file object
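As an illustration of the parallel-list convention, the sketch below writes one record per name and sequence pair. The function name write_fasta_sketch is hypothetical, and writing each sequence on a single unwrapped line is an assumption, not a statement about the exact output of dscript.fasta.write.

```python
import io


def write_fasta_sketch(names, sequences, f):
    """Write parallel lists of names and sequences as .fasta records."""
    for name, sequence in zip(names, sequences):
        # One header line starting with '>' followed by the sequence.
        f.write(f">{name}\n{sequence}\n")


buffer = io.StringIO()
write_fasta_sketch(["seq1", "seq2"], ["MKTAYI", "GSH"], buffer)
fasta_text = buffer.getvalue()
```

Any writable text file object works in place of the in-memory buffer used here.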

dscript.language_model

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')

Embed all files in a directory in .fasta format using pre-trained language model from Bepler & Berger.

Parameters
  • directory (str) – Input directory (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

  • extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters
  • fastaPath (str) – Input sequence file (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters
  • sequence (str) – Input sequence to be embedded

  • use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.pretrained

dscript.pretrained.get_pretrained(version='human_v1')

Get pre-trained model object.

See the documentation for the most up-to-date list.

  • lm_v1 - Language model from Bepler & Berger.

  • human_v1 - Human-trained model from the D-SCRIPT manuscript.

Default: human_v1

Parameters

version (str) – Version of pre-trained model to get

Returns

Pre-trained model

Return type

dscript.models.*

dscript.pretrained.get_state_dict(version='human_v1', verbose=True)

Download a pre-trained model if it does not already exist on the local device.

Parameters
  • version (str) – Version of trained model to download [default: human_v1]

  • verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for pre-trained language model

Return type

str

dscript.utils

class dscript.utils.PairedDataset(X0, X1, Y)

Bases: torch.utils.data.dataset.Dataset

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters
  • X0 – List of first item in the pair

  • X1 – List of second item in the pair

  • Y – List of labels
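The indexing contract a PyTorch data loader expects from such a dataset can be sketched without PyTorch. The class below is a hypothetical plain-Python stand-in (PairedListDataset is not part of dscript and does not subclass torch.utils.data.Dataset); it assumes the three lists are parallel and that item i is the tuple (X0[i], X1[i], Y[i]).

```python
class PairedListDataset:
    """Hypothetical stand-in showing the map-style dataset protocol."""

    def __init__(self, X0, X1, Y):
        # The three lists are assumed to be parallel, one entry per pair.
        if not (len(X0) == len(X1) == len(Y)):
            raise ValueError("X0, X1 and Y must have the same length")
        self.X0, self.X1, self.Y = X0, X1, Y

    def __len__(self):
        # Number of pairs in the dataset.
        return len(self.X0)

    def __getitem__(self, i):
        # One sample: first item, second item, label.
        return self.X0[i], self.X1[i], self.Y[i]


pairs = PairedListDataset(["MKT", "GSH"], ["AYI", "LLV"], [1, 0])
first = pairs[0]
```

A map-style dataset only needs __len__ and __getitem__, which is why a data loader can batch and shuffle it by index.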

dscript.utils.RBF(D, sigma=None)

Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.

\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)

Parameters
  • D (np.ndarray) – Distance matrix

  • sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
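The kernel formula and the default bandwidth can be checked by hand on a tiny matrix. The sketch below is not the dscript implementation: rbf_sketch is a hypothetical name, it uses nested lists instead of np.ndarray, and it assumes each entry of D is the distance \(|x - x'|\), so the similarity is \(\exp(-D^{2} / 2\sigma^{2})\).

```python
import math


def rbf_sketch(D, sigma=None):
    """Convert a distance matrix (nested lists) into RBF similarities."""
    if sigma is None:
        # Default bandwidth from the parameter description: sqrt(max(D)).
        # An all-zero matrix would give sigma = 0 and is not handled here.
        sigma = math.sqrt(max(max(row) for row in D))
    denom = 2.0 * sigma ** 2
    return [[math.exp(-(d ** 2) / denom) for d in row] for row in D]


# Two points at distance 2: sigma = sqrt(2), so 2 * sigma^2 = 4
# and the off-diagonal similarity is exp(-4 / 4) = exp(-1).
D = [[0.0, 2.0], [2.0, 0.0]]
S = rbf_sketch(D)
```

Zero distance always maps to a similarity of 1, and similarity decays toward 0 as distance grows relative to sigma.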

dscript.utils.collate_paired_sequences(args)

Collate function for PyTorch data loader.

dscript.utils.gpu_mem(device)

Get current memory usage for GPU.

Parameters

device (int) – GPU device number

Returns

memory used, memory total

Return type

int, int

dscript.utils.log(msg, file=sys.stderr)

Log datetime-stamped message to file

Parameters
  • msg – Message to log

  • file – Writable file object to log message to [default: standard error]
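A datetime-stamped logger of this shape can be sketched in a few lines. The function name log_sketch and the bracketed timestamp format are assumptions for illustration; the exact output format of dscript.utils.log is not specified here.

```python
import io
import sys
from datetime import datetime


def log_sketch(msg, file=sys.stderr):
    """Write a timestamped message to a writable file object."""
    # The timestamp layout is an assumed, illustrative format.
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{stamp}] {msg}", file=file)
    # Flush so progress messages appear promptly in long-running jobs.
    file.flush()


buffer = io.StringIO()
log_sketch("embedding started", file=buffer)
line = buffer.getvalue()
```

Defaulting to standard error keeps log messages separate from any results written to standard output.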

dscript.utils.plot_PR_curve(y, phat, saveFile=None)

Plot precision-recall curve.

Parameters
  • y (np.ndarray) – Labels

  • phat (np.ndarray) – Predicted probabilities

  • saveFile (str) – File for plot of curve to be saved to

dscript.utils.plot_ROC_curve(y, phat, saveFile=None)

Plot receiver operating characteristic curve.

Parameters
  • y (np.ndarray) – Labels

  • phat (np.ndarray) – Predicted probabilities

  • saveFile (str) – File for plot of curve to be saved to