eachanjohnson committed
Commit efe6d99 · 1 Parent(s): bb77f03

Sat Oct 12 17:25:16 UTC 2024 :: HF Spaces deployment

LICENSE DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) [year] [fullname]
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
README.md CHANGED
@@ -1,87 +1,38 @@
- # ⬢⬢⬢ schemist
-
- ![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/schemist/python-publish.yml)
- ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/schemist)
- ![PyPI](https://img.shields.io/pypi/v/schemist)
  [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/scbirlab/chem-converter)

- Cleaning, collating, and augmenting chemical datasets.
-
- - [Installation](#installation)
- - [Command-line usage](#command-line-usage)
- - [Python API](#python-api)
- - [Documentation](#documentation)
-
- ## Installation
-
- ### The easy way
-
- Install the pre-compiled version from PyPI:
-
- ```bash
- pip install schemist
- ```
-
- ### From source
-
- Clone the repository, then `cd` into it. Then run:
-
- ```bash
- pip install -e .
- ```
-
- ## Command-line usage
-
- **schemist** provides command-line utilities. The list of commands can be checked like so:
-
- ```bash
- $ schemist --help
- usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...
-
- Tools for cleaning, collating, and augmenting chemical datasets.
-
- options:
-   -h, --help            show this help message and exit
-   --version, -v         show program's version number and exit
-
- Sub-commands:
-   {clean,convert,featurize,collate,dedup,enumerate,react,split}
-                         Use these commands to specify the tool you want to use.
-     clean               Clean and normalize SMILES column of a table.
-     convert             Convert between string representations of chemical structures.
-     featurize           Calculate features of chemical structures.
-     collate             Collect disparate tables or SDF files of libraries into a single table.
-     dedup               Deduplicate chemical structures and retain references.
-     enumerate           Enumerate bio-chemical structures within length and sequence constraints.
-     react               React compounds in silico in indicated columns using a named reaction.
-     split               Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
- ```
-
- Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
-
- All commands (except `collate`) take a named column from the input table containing a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.
-
- The tools complete specific tasks and can easily be composed into analysis pipelines: the TSV table output goes to `stdout` by default, so one tool can be piped into another.
-
- To get help for a specific command, do
-
- ```bash
- schemist <command> --help
- ```
-
- For the Python API, [see below](#python-api).
-
-
- ## Python API
-
- **schemist** can be imported into Python to help build custom analyses.
-
- ```python
- >>> import schemist as sch
- ```
-
- ## Documentation
-
- Full API documentation is at [ReadTheDocs](https://schemist.readthedocs.org).
+ ---
+ title: Chemical string format converter
+ emoji: ⚗️
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: "5.0.2"
+ app_file: app.py
+ pinned: false
+ short_description: Trivial batch interconversion of 1D chemical formats.
+ ---
+ # Chemical string format converter
+
  [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/scbirlab/chem-converter)
+
+ Trivial batch interconversion of 1D chemical formats.
+
+ Frontend for [schemist](https://github.com/scbirlab/schemist) to allow interconversion from:
+
+ - SMILES
+ - SELFIES
+ - Amino acid sequences
+ - HELM
+
+ to...
+
+ - Structure image
+ - SMILES
+ - SELFIES
+ - InChI
+ - InChIKey
+ - Name
+ - cLogP
+ - TPSA
+ - molecular weight
+ - charge
+
+ ... and several others!
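
The conversions listed in the new Space README above are driven by `convert_string_representation`, defined in `schemist/converting.py` (deleted from this repo further down; it lives on in the schemist package). A minimal sketch of the equivalent Python call, assuming schemist is installed — names are taken from the deleted module, outputs are illustrative:

```python
from schemist.converting import convert_string_representation

# Convert a short peptide, written as an amino-acid sequence, into
# SMILES and InChIKey in one call; an iterable of output
# representations returns a dict keyed by representation name.
out = convert_string_representation(
    ["ACDE"],
    input_representation="aa_seq",
    output_representation=["smiles", "inchikey"],
)
print(out["smiles"], out["inchikey"])
```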
app/app.py → app.py RENAMED
File without changes
app/README.md DELETED
@@ -1,38 +0,0 @@
- ---
- title: Chemical string format converter
- emoji: ⚗️
- colorFrom: blue
- colorTo: green
- sdk: gradio
- sdk_version: 5.0.2
- app_file: app.py
- pinned: false
- short_description: Trivial batch interconversion of 1D chemical formats.
- ---
- # Chemical string format converter
-
- [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/scbirlab/chem-converter)
-
- Trivial batch interconversion of 1D chemical formats.
-
- Frontend for [schemist](https://github.com/scbirlab/schemist) to allow interconversion from:
-
- - SMILES
- - SELFIES
- - Amino acid sequences
- - HELM
-
- to...
-
- - Structure image
- - SMILES
- - SELFIES
- - InChI
- - InChIKey
- - Name
- - cLogP
- - TPSA
- - molecular weight
- - charge
-
- ... and several others!
docs/requirements.txt DELETED
@@ -1,8 +0,0 @@
- myst_parser
- matplotlib
- numpy
- openpyxl==3.1.0
- pandas
- scipy
- sphinx_rtd_theme
- ./
docs/source/conf.py DELETED
@@ -1,45 +0,0 @@
- # Configuration file for the Sphinx documentation builder.
- #
- # For the full list of built-in configuration values, see the documentation:
- # https://www.sphinx-doc.org/en/master/usage/configuration.html
-
- # -- Project information -----------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
-
- project = 'schemist'
- copyright = '2024, Eachan Johnson'
- author = 'Eachan Johnson'
- release = '0.0.1'
-
- # -- General configuration ---------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
-
- extensions = ['sphinx.ext.doctest',
-               'sphinx.ext.autodoc',
-               'sphinx.ext.autosummary',
-               'sphinx.ext.napoleon',
-               'sphinx.ext.viewcode',
-               'myst_parser']
-
- myst_enable_extensions = [
-     "amsmath",
-     "dollarmath",
- ]
-
- source_suffix = {
-     '.rst': 'restructuredtext',
-     '.txt': 'markdown',
-     '.md': 'markdown',
- }
-
- templates_path = ['_templates']
- exclude_patterns = []
-
- # -- Options for HTML output -------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
-
- html_theme = 'sphinx_rtd_theme'
- html_static_path = []
docs/source/index.md DELETED
@@ -1,25 +0,0 @@
- # ⬢⬢⬢ schemist
-
- ![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/schemist/python-publish.yml)
- ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/schemist)
- ![PyPI](https://img.shields.io/pypi/v/schemist)
-
- Organizing and processing tables of chemical structures.
-
- ```{toctree}
- :maxdepth: 2
- :caption: Contents:
-
- installation
- usage
- python
- modules
- ```
-
- ## Issues, problems, suggestions
-
- Add to the [issue tracker](https://github.com/scbirlab/schemist/issues).
-
- ## Source
-
- View source at [GitHub](https://github.com/scbirlab/schemist).
docs/source/installation.md DELETED
@@ -1,17 +0,0 @@
- # Installation
-
- ## The easy way
-
- Install the pre-compiled version from PyPI:
-
- ```bash
- $ pip install schemist
- ```
-
- ## From source
-
- Clone the [repository](https://github.com/scbirlab/schemist), then `cd` into it. Then run:
-
- ```bash
- pip install -e .
- ```
docs/source/modules.rst DELETED
@@ -1,7 +0,0 @@
- schemist
- ========
-
- .. toctree::
-    :maxdepth: 4
-
-    schemist
docs/source/schemist.rst DELETED
@@ -1,109 +0,0 @@
- schemist package
- ================
-
- Submodules
- ----------
-
- schemist.cleaning module
- ------------------------
-
- .. automodule:: schemist.cleaning
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.cli module
- -------------------
-
- .. automodule:: schemist.cli
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.collating module
- -------------------------
-
- .. automodule:: schemist.collating
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.converting module
- --------------------------
-
- .. automodule:: schemist.converting
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.features module
- ------------------------
-
- .. automodule:: schemist.features
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.generating module
- --------------------------
-
- .. automodule:: schemist.generating
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.io module
- ------------------
-
- .. automodule:: schemist.io
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.rest\_lookup module
- ----------------------------
-
- .. automodule:: schemist.rest_lookup
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.splitting module
- -------------------------
-
- .. automodule:: schemist.splitting
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.tables module
- ----------------------
-
- .. automodule:: schemist.tables
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.typing module
- ----------------------
-
- .. automodule:: schemist.typing
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.utils module
- ---------------------
-
- .. automodule:: schemist.utils
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- Module contents
- ---------------
-
- .. automodule:: schemist
-    :members:
-    :undoc-members:
-    :show-inheritance:
docs/source/usage.md DELETED
@@ -1,55 +0,0 @@
- # Usage
-
- **schemist** has a variety of utilities which can be used from the command line or the [Python API](#python-api).
-
- ## Command-line usage
-
- **schemist** provides command-line utilities. The list of commands can be checked like so:
-
- ```bash
- $ schemist --help
- usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...
-
- Tools for cleaning, collating, and augmenting chemical datasets.
-
- options:
-   -h, --help            show this help message and exit
-   --version, -v         show program's version number and exit
-
- Sub-commands:
-   {clean,convert,featurize,collate,dedup,enumerate,react,split}
-                         Use these commands to specify the tool you want to use.
-     clean               Clean and normalize SMILES column of a table.
-     convert             Convert between string representations of chemical structures.
-     featurize           Calculate features of chemical structures.
-     collate             Collect disparate tables or SDF files of libraries into a single table.
-     dedup               Deduplicate chemical structures and retain references.
-     enumerate           Enumerate bio-chemical structures within length and sequence constraints.
-     react               React compounds in silico in indicated columns using a named reaction.
-     split               Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
- ```
-
- Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
-
- All commands (except `collate`) take a named column from the input table containing a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.
-
- The tools complete specific tasks and can easily be composed into analysis pipelines: the TSV table output goes to `stdout` by default, so one tool can be piped into another.
-
- To get help for a specific command, do
-
- ```bash
- schemist <command> --help
- ```
-
- For the Python API, [see below](#python-api).
-
-
- ## Python API
-
- You can access the underlying functions of `schemist` to support custom analyses or develop other tools.
-
- ```python
- >>> import schemist as sch
- ```
pyproject.toml DELETED
@@ -1,61 +0,0 @@
- [project]
- name = "schemist"
- version = "0.0.1"
- authors = [
-     { name="Eachan Johnson", email="[email protected]" },
- ]
- description = "Organizing and processing tables of chemical structures."
- readme = "README.md"
- requires-python = ">=3.8"
- license = {file = "LICENSE"}
- keywords = ["science", "chemistry", "SMILES", "SELFIES", "cheminformatics"]
-
- classifiers = [
-     "Development Status :: 3 - Alpha",
-
-     # Indicate who your project is intended for
-     "Intended Audience :: Science/Research",
-     "Topic :: Scientific/Engineering :: Chemistry",
-
-     "License :: OSI Approved :: MIT License",
-
-     "Programming Language :: Python :: 3.8",
-     "Programming Language :: Python :: 3.9",
-     "Programming Language :: Python :: 3.10",
-     "Programming Language :: Python :: 3.11",
-     "Programming Language :: Python :: 3 :: Only",
- ]
-
- dependencies = [
-     "carabiner-tools[pd]>=0.0.3.post1",
-     "datamol",
-     "descriptastorus==2.6.1",
-     "nemony",
-     "openpyxl==3.1.0",
-     "pandas",
-     "rdkit",
-     "requests",
-     "selfies",
- ]
-
- [project.urls]
- "Homepage" = "https://github.com/scbirlab/schemist"
- "Repository" = "https://github.com/scbirlab/schemist.git"
- "Bug Tracker" = "https://github.com/scbirlab/schemist/issues"
- "Documentation" = "https://readthedocs.org/schemist"
-
- [project.scripts]  # Optional
- schemist = "schemist.cli:main"
-
- [tool.setuptools]
- packages = ["schemist"]
- # If there are data files included in your packages that need to be
- # installed, specify them here.
- # package-data = {"" = ["*.yml"]}
-
- [build-system]
- # These are the assumed default build requirements from pip:
- # https://pip.pypa.io/en/stable/reference/pip/#pep-517-and-518-support
- requires = ["setuptools>=43.0.0", "wheel"]
- build-backend = "setuptools.build_meta"
app/requirements.txt → requirements.txt RENAMED
File without changes
schemist/__init__.py DELETED
@@ -1,3 +0,0 @@
- from importlib.metadata import version
-
- __version__ = version("schemist")
schemist/cleaning.py DELETED
@@ -1,27 +0,0 @@
- """Chemical structure cleaning routines."""
-
- from carabiner.decorators import vectorize
-
- from datamol import sanitize_smiles
- import selfies as sf
-
- @vectorize
- def clean_smiles(smiles: str,
-                  *args, **kwargs) -> str:
-     """Sanitize a SMILES string or list of SMILES strings."""
-     return sanitize_smiles(smiles, *args, **kwargs)
-
-
- @vectorize
- def clean_selfies(selfies: str,
-                   *args, **kwargs) -> str:
-     """Sanitize a SELFIES string or list of SELFIES strings."""
-     # The selfies package exposes encoder()/decoder(), not encode()/decode():
-     # round-trip SELFIES -> SMILES, sanitize, then re-encode.
-     return sf.encoder(sanitize_smiles(sf.decoder(selfies), *args, **kwargs))
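
A small usage sketch for the helpers above, assuming schemist and its dependencies (`datamol`, `selfies`, `carabiner`) are installed. Because of `@vectorize`, each function accepts either a single string or a collection; exact output formatting depends on carabiner's `vectorize`:

```python
from schemist.cleaning import clean_smiles

# One SMILES string in, one sanitized (canonicalized) SMILES out,
# e.g. 'c1ccccc1' for benzene.
print(clean_smiles("C1=CC=CC=C1"))

# A list in, element-wise output back; wrap in list() in case the
# vectorized form returns an iterator.
print(list(clean_smiles(["CCO", "c1ccncc1"])))
```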
schemist/cli.py DELETED
@@ -1,535 +0,0 @@
- """Command-line interface for schemist."""
-
- from typing import Any, Dict, Iterable, List, Optional
-
- from argparse import FileType, Namespace
- from collections import Counter, defaultdict
- from functools import partial
- import os
- import sys
- from tempfile import NamedTemporaryFile, TemporaryDirectory
-
- from carabiner import pprint_dict, upper_and_lower
- from carabiner.cliutils import clicommand, CLIOption, CLICommand, CLIApp
- from carabiner.itertools import tenumerate
- from carabiner.pd import get_formats, write_stream
-
- from . import __version__
- from .collating import collate_inventory, deduplicate_file
- from .converting import _TO_FUNCTIONS, _FROM_FUNCTIONS
- from .generating import AA, REACTIONS
- from .io import _mutate_df_stream
- from .tables import (converter, cleaner, featurizer, assign_groups,
-                      _assign_splits, splitter, _peptide_table, reactor)
- from .splitting import _SPLITTERS, _GROUPED_SPLITTERS
-
- def _option_parser(x: Optional[List[str]]) -> Dict[str, Any]:
-     options = {}
-     try:
-         for opt in x:
-             try:
-                 key, value = opt.split('=')
-             except ValueError:
-                 raise ValueError(f"Option {opt} is misformatted. It should be in the format keyword=value.")
-             # Coerce to int, then float, otherwise leave as a string.
-             try:
-                 value = int(value)
-             except ValueError:
-                 try:
-                     value = float(value)
-                 except ValueError:
-                     pass
-             options[key] = value
-     except TypeError:
-         # x is None when no --options flag was given.
-         pass
-     return options
-
-
- def _sum_tally(tallies: Iterable[Counter],
-                message: str = "Error counts",
-                use_length: bool = False):
-     total_tally = Counter()
-     for tally in tallies:
-         if use_length:
-             total_tally.update({key: len(value) for key, value in tally.items()})
-         else:
-             total_tally.update(tally)
-     if len(tallies) == 0:
-         raise ValueError("Nothing generated!")
-     pprint_dict(total_tally, message=message)
-     return total_tally
-
-
- @clicommand(message="Cleaning file with the following parameters")
- def _clean(args: Namespace) -> None:
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(cleaner,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        prefix=args.prefix),
-                                       file_format=args.format)
-     _sum_tally(error_tallies)
-     return None
-
-
- @clicommand(message="Converting between string representations with the following parameters")
- def _convert(args: Namespace) -> None:
-     options = _option_parser(args.options)
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(converter,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        output_representation=args.to,
-                                                        prefix=args.prefix,
-                                                        options=options),
-                                       file_format=args.format)
-     _sum_tally(error_tallies)
-     return None
-
-
- @clicommand(message="Adding features to files with the following parameters")
- def _featurize(args: Namespace) -> None:
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(featurizer,
-                                                        feature_type=args.feature,
-                                                        column=args.column,
-                                                        ids=args.id,
-                                                        input_representation=args.representation,
-                                                        prefix=args.prefix),
-                                       file_format=args.format)
-     _sum_tally(error_tallies)
-     return None
-
-
- @clicommand(message="Splitting table with the following parameters")
- def _split(args: Namespace) -> None:
-     split_type = args.type.casefold()
-
-     if split_type in _GROUPED_SPLITTERS:
-         chunk_processor, aggregator = _GROUPED_SPLITTERS[split_type]
-
-         with TemporaryDirectory() as dir:
-             with NamedTemporaryFile("w", dir=dir, delete=False) as f:
-                 group_idxs = _mutate_df_stream(input_file=args.input,
-                                                output_file=f,
-                                                function=partial(assign_groups,
-                                                                 grouper=chunk_processor,
-                                                                 group_name=split_type,
-                                                                 column=args.column,
-                                                                 input_representation=args.representation),
-                                                file_format=args.format)
-                 f.close()
-             new_group_idx = defaultdict(list)
-
-             # Offset each chunk's row indexes so they are global.
-             totals = 0
-             for group_idx in group_idxs:
-                 these_totals = 0
-                 for key, value in group_idx.items():
-                     these_totals += len(value)
-                     new_group_idx[key] += [idx + totals for idx in value]
-                 totals += these_totals
-
-             group_idx = aggregator(new_group_idx,
-                                    train=args.train,
-                                    test=args.test)
-
-             split_tallies = _mutate_df_stream(input_file=f.name,
-                                               output_file=args.output,
-                                               function=partial(_assign_splits,
-                                                                split_idx=group_idx,
-                                                                use_df_index=True),
-                                               file_format=args.format)
-             if os.path.exists(f.name):
-                 os.remove(f.name)
-
-     else:
-         split_tallies = _mutate_df_stream(input_file=args.input,
-                                           output_file=args.output,
-                                           function=partial(splitter,
-                                                            split_type=args.type,
-                                                            column=args.column,
-                                                            input_representation=args.representation,
-                                                            train=args.train,
-                                                            test=args.test,
-                                                            set_seed=args.seed),
-                                           file_format=args.format)
-
-     _sum_tally(split_tallies,
-                message="Split counts")
-     return None
-
-
- @clicommand(message="Collating files with the following parameters")
- def _collate(args: Namespace) -> None:
-     root_dir = args.data_dir or '.'
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(collate_inventory,
-                                                        root_dir=root_dir,
-                                                        drop_unmapped=not args.keep_extra_columns,
-                                                        catalog_smiles_column=args.column,
-                                                        id_column_name=args.id_column,
-                                                        id_n_digits=args.digits,
-                                                        id_prefix=args.prefix),
-                                       file_format=args.format)
-     _sum_tally(error_tallies,
-                message="Collated chemicals:")
-     return None
-
-
- @clicommand(message="Deduplicating chemical structures with the following parameters")
- def _dedup(args: Namespace) -> None:
-     report, deduped_df = deduplicate_file(args.input,
-                                           format=args.format,
-                                           column=args.column,
-                                           input_representation=args.representation,
-                                           index_columns=args.indexes)
-
-     if args.prefix is not None and 'inchikey' in deduped_df:
-         deduped_df = deduped_df.rename(columns={'inchikey': f'{args.prefix}inchikey'})
-
-     write_stream(deduped_df,
-                  output=args.output,
-                  format=args.format)
-     pprint_dict(report, message="Finished deduplicating:")
-     return None
-
-
- @clicommand(message="Enumerating peptides with the following parameters")
- def _enum(args: Namespace) -> None:
-     tables = _peptide_table(max_length=args.max_length,
-                             min_length=args.min_length,
-                             n=args.number,
-                             indexes=args.slice,
-                             set_seed=args.seed,
-                             prefix=args.prefix,
-                             suffix=args.suffix,
-                             d_aa_only=args.d_aa_only,
-                             include_d_aa=args.include_d_aa,
-                             generator=True)
-
-     dAA_use = any(aa.islower() for aa in args.prefix + args.suffix)
-     dAA_use = dAA_use or args.include_d_aa or args.d_aa_only
-
-     tallies, error_tallies = [], []
-     options = _option_parser(args.options)
-     _converter = partial(converter,
-                          column='peptide_sequence',
-                          input_representation='minihelm' if dAA_use else 'aa_seq',  ## affects performance
-                          output_representation=args.to,
-                          options=options)
-
-     for i, table in tenumerate(tables):
-         _err_tally, df = _converter(table)
-         tallies.append({"Number of peptides": df.shape[0]})
-         error_tallies.append(_err_tally)
-         write_stream(df,
-                      output=args.output,
-                      format=args.format,
-                      mode='w' if i == 0 else 'a',
-                      header=i == 0)
-
-     _sum_tally(tallies,
-                message="Enumerated peptides")
-     _sum_tally(error_tallies,
-                message="Conversion errors")
-     return None
-
-
- @clicommand(message="Reacting peptides with the following parameters")
- def _react(args: Namespace) -> None:
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(reactor,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        reaction=args.reaction,
-                                                        product_name=args.name),
-                                       file_format=args.format)
-     _sum_tally(error_tallies)
-     return None
-
-
- def main() -> None:
-     inputs = CLIOption('input',
-                        default=sys.stdin,
-                        type=FileType('r'),
-                        nargs='?',
-                        help='Input columnar Excel, CSV or TSV file. Default: STDIN.')
-     representation = CLIOption('--representation', '-r',
-                                type=str,
-                                default='SMILES',
-                                choices=upper_and_lower(_FROM_FUNCTIONS),
-                                help='Chemical representation to use for input.')
-     column = CLIOption('--column', '-c',
-                        default='smiles',
-                        type=str,
-                        help='Column to use as input string representation.')
-     prefix = CLIOption('--prefix', '-p',
-                        default=None,
-                        type=str,
-                        help='Prefix to add to new column name. Default: no prefix')
-     to = CLIOption('--to', '-2',
-                    type=str,
-                    default='SMILES',
-                    nargs='*',
-                    choices=upper_and_lower(_TO_FUNCTIONS),
-                    help='Format to convert to.')
-     options = CLIOption('--options', '-x',
-                         type=str,
-                         default=None,
-                         nargs='*',
-                         help='Options to pass to converter, in the format '
-                              '"keyword1=value1 keyword2=value2"')
-     output = CLIOption('--output', '-o',
-                        type=FileType('w'),
-                        default=sys.stdout,
-                        help='Output file. Default: STDOUT')
-     formatting = CLIOption('--format', '-f',
-                            type=str,
-                            default=None,
-                            choices=upper_and_lower(get_formats()),
-                            help='Override file extensions for input and output. '
-                                 'Default: infer from file extension.')
-
-     ## featurize
-     id_feat = CLIOption('--id', '-i',
-                         type=str,
-                         default=None,
-                         nargs='*',
-                         help='Columns to retain in output table. Default: use all')
-     feature = CLIOption('--feature', '-t',
-                         type=str,
-                         default='2d',
-                         choices=['2d', 'fp'],  ## TODO: implement 3d
-                         help='Which feature type to generate.')
-
-     ## split
-     type_ = CLIOption('--type', '-t',
-                       type=str,
-                       default='random',
-                       choices=upper_and_lower(_SPLITTERS),
-                       help='Which split type to use.')
-     train = CLIOption('--train', '-a',
-                       type=float,
-                       default=1.,
-                       help='Proportion of data to use for training.')
-     test = CLIOption('--test', '-b',
-                      type=float,
-                      default=0.,
-                      help='Proportion of data to use for testing.')
-
-     ## collate
-     data_dir = CLIOption('--data-dir', '-d',
-                          type=str,
-                          default=None,
-                          help='Directory containing data files. '
-                               'Default: current directory')
-     id_column = CLIOption('--id-column', '-s',
-                           default=None,
-                           type=str,
-                           help='If provided, add a structure ID column with this name. '
-                                'Default: don\'t add structure IDs')
-     prefix_collate = CLIOption('--prefix', '-p',
-                                default='ID-',
-                                type=str,
-                                help='Prefix to add to structure IDs. '
-                                     'Default: no prefix')
-     digits = CLIOption('--digits', '-n',
-                        default=8,
-                        type=int,
-                        help='Number of digits in structure IDs.')
-     keep_extra_columns = CLIOption('--keep-extra-columns', '-x',
-                                    action='store_true',
-                                    help='Whether to keep columns not mentioned in the catalog. '
-                                         'Default: drop extra columns.')
-     keep_invalid_smiles = CLIOption('--keep-invalid-smiles', '-y',
-                                     action='store_true',
-                                     help='Whether to keep rows with invalid SMILES. '
-                                          'Default: drop invalid rows.')
-
-     ## dedup
-     indexes = CLIOption('--indexes', '-x',
-                         type=str,
-                         default=None,
-                         nargs='*',
-                         help='Columns to retain and collapse (if multiple values per unique structure). '
-                              'Default: retain no other columns than structure and InchiKey.')
-     drop_inchikey = CLIOption('--drop-inchikey', '-d',
-                               action='store_true',
-                               help='Whether to drop the calculated InchiKey column. '
-                                    'Default: keep InchiKey.')
-
-     ### enum
-     max_length = CLIOption('--max-length', '-l',
-                            type=int,
-                            help='Maximum length of enumerated peptide. '
-                                 'Required.')
-     min_length = CLIOption('--min-length', '-m',
-                            type=int,
-                            default=None,
-                            help='Minimum length of enumerated peptide. '
-                                 'Default: same as maximum, i.e. all peptides same length.')
-     number_to_gen = CLIOption('--number', '-n',
-                               type=float,
-                               default=None,
-                               help='Number of peptides to sample from all possible '
-                                    'within the constraints. If less than 1, sample '
-                                    'that fraction of all possible. If greater than 1, '
-                                    'sample that number. '
-                                    'Default: return all peptides.')
-     slicer = CLIOption('--slice', '-z',
-                        type=str,
-                        default=None,
-                        nargs='*',
-                        help='Subset of (possibly sampled) population to return, in the format <stop> '
-                             'or <start> <stop> [<step>]. If "x" is used for <stop>, then it runs to the end. '
-                             'For example, 1000 gives the first 1000, 2 600 gives items 2-600, and '
-                             '3 500 2 gives every other from 3 to 500. Default: return all.')
-     alphabet = CLIOption('--alphabet', '-b',
-                          type=str,
-                          default=''.join(AA),
-                          help='Alphabet to use in sampling.')
-     suffix = CLIOption('--suffix', '-s',
-                        type=str,
-                        default='',
-                        help='Sequence to add to end. Lowercase for D-amino acids. '
-                             'Default: no suffix.')
-     set_seed = CLIOption('--seed', '-e',
-                          type=int,
-                          default=None,
-                          help='Seed to use for reproducible randomness. '
-                               'Default: don\'t enable reproducibility.')
-     d_aa_only = CLIOption('--d-aa-only', '-a',
-                           action='store_true',
-                           help='Whether to only use D-amino acids. '
-                                'Default: don\'t include.')
-     include_d_aa = CLIOption('--include-d-aa', '-y',
-                              action='store_true',
-                              help='Whether to include D-amino acids in enumeration. '
-                                   'Default: don\'t include.')
-
-     ## reaction
-     name = CLIOption('--name', '-n',
-                      type=str,
-                      default=None,
-                      help='Name of column for product. '
-                           'Default: same as reaction name.')
-     reaction_opt = CLIOption('--reaction', '-x',
-                              type=str,
-                              nargs='*',
-                              choices=list(REACTIONS),
-                              default='N_to_C_cyclization',
-                              help='Reaction(s) to apply.')
-
-     clean = CLICommand('clean',
-                        description='Clean and normalize SMILES column of a table.',
-                        main=_clean,
-                        options=[output, formatting, inputs, representation, column, prefix])
-     convert = CLICommand('convert',
-                          description='Convert between string representations of chemical structures.',
-                          main=_convert,
-                          options=[output, formatting, inputs, representation, column, prefix, to, options])
-     featurize = CLICommand('featurize',
-                            description='Calculate features of chemical structures.',
-                            main=_featurize,
-                            options=[output, formatting, inputs, representation, column, prefix,
-                                     id_feat, feature])
-     collate = CLICommand('collate',
-                          description='Collect disparate tables or SDF files of libraries into a single table.',
-                          main=_collate,
-                          options=[output, formatting, inputs, representation,
-                                   data_dir, column.replace(default='input_smiles'), id_column, prefix_collate,
-                                   digits, keep_extra_columns, keep_invalid_smiles])
-     dedup = CLICommand('dedup',
-                        description='Deduplicate chemical structures and retain references.',
-                        main=_dedup,
-                        options=[output, formatting, inputs, representation, column, prefix,
-                                 indexes, drop_inchikey])
-     enum = CLICommand('enumerate',
-                       description='Enumerate bio-chemical structures within length and sequence constraints.',
-                       main=_enum,
-                       options=[output, formatting, to, options,
-                                alphabet, max_length, min_length, number_to_gen,
-                                slicer, set_seed,
-                                prefix.replace(default='',
-                                               help='Sequence to prepend. Lowercase for D-amino acids. '
-                                                    'Default: no prefix.'),
-                                suffix,
-                                type_.replace(default='aa',
-                                              choices=['aa'],
-                                              help='Type of bio sequence to enumerate. '
-                                                   'Default: %(default)s.'),
-                                d_aa_only, include_d_aa])
-     reaction = CLICommand('react',
-                           description='React compounds in silico in indicated columns using a named reaction.',
-                           main=_react,
-                           options=[output, formatting, inputs, representation, column, name,
-                                    reaction_opt])
-     split = CLICommand('split',
-                        description='Split table based on chosen algorithm, optionally taking account of chemical structure during splits.',
-                        main=_split,
-                        options=[output, formatting, inputs, representation, column, prefix,
-                                 type_, train, test, set_seed])
-
-     app = CLIApp("schemist",
-                  version=__version__,
-                  description="Tools for cleaning, collating, and augmenting chemical datasets.",
-                  commands=[clean, convert, featurize, collate, dedup, enum, reaction, split])
-
-     app.run()
-     return None
-
-
- if __name__ == "__main__":
-     main()
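
One behavior worth noting from `_option_parser` above: each `keyword=value` pair from `--options` is coerced to `int`, then `float`, then left as a string, and a missing flag falls through to an empty dict. A quick illustrative check (hypothetical values):

```python
from schemist.cli import _option_parser

# Numeric-looking values are coerced; everything else stays a string.
opts = _option_parser(["n=8", "cutoff=0.5", "prefix=ID-"])
assert opts == {"n": 8, "cutoff": 0.5, "prefix": "ID-"}

# Passing None (no --options flag) hits the TypeError branch -> {}.
assert _option_parser(None) == {}
```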
schemist/collating.py DELETED
@@ -1,317 +0,0 @@
- """Tools to collate chemical data files."""
-
- from typing import Callable, Dict, Iterable, List, Optional, Tuple, TextIO, Union
-
- from collections import Counter
- from functools import partial
- from glob import glob
- import os
-
- from carabiner.pd import read_table, resolve_delim
- from carabiner import print_err
- import numpy as np
- from pandas import DataFrame, concat
-
- from .converting import convert_string_representation, _FROM_FUNCTIONS
- from .io import FILE_READERS
-
- GROUPING_COLUMNS = ("filename", "file_format", "library_name", "string_representation")
- ESSENTIAL_COLUMNS = GROUPING_COLUMNS + ("compound_collection", "plate_id", "well_id")
-
- def _column_mapper(df: DataFrame,
-                    cols: Iterable[str]) -> Tuple[Callable, Dict]:
-
-     basic_map = {column: df[column].tolist()[0] for column in cols}
-     inv_basic_map = {value: key for key, value in basic_map.items()}
-
-     def column_mapper(x: DataFrame) -> DataFrame:
-
-         new_df = DataFrame()
-
-         for new_col, old_col in basic_map.items():
-
-             # old_col = str(old_col)
-
-             if old_col is None or str(old_col) in ('None', 'nan', 'NA'):
-                 new_df[new_col] = None
-             elif '+' in old_col:
-                 # "col1+col2": concatenate the contents of col1 and col2.
-                 splits = old_col.split('+')
-                 new_df[new_col] = x[splits[0]].str.cat([x[s].astype(str)
-                                                         for s in splits[1:]])
-             elif ';' in old_col:
-                 # "col;char;start:stop": split col on char, keep that slice.
-                 col, char, index = old_col.split(';')
-                 index = [int(i) for i in index.split(':')]
-                 if len(index) == 1:
-                     index = slice(index[0], index[0] + 1)
-                 else:
-                     index = slice(*index)
-                 try:
-                     new_df[new_col] = (x[col]
-                                        .str.split(char)
-                                        .map(lambda y: char.join(y[index] if y is not np.nan else []))
-                                        .str.strip())
-                 except TypeError as e:
-                     print_err(x[col].str.split(char))
-                     raise e
-             else:
-                 try:
-                     new_df[new_col] = x[old_col].copy()
-                 except KeyError:
-                     raise KeyError(f"Column {old_col} mapped to {new_col} is not in the input data: " + ", ".join(x.columns))
-
-         return new_df
-
-     return column_mapper, inv_basic_map
-
-
- def _check_catalog(catalog: DataFrame,
-                    catalog_smiles_column: str = 'input_smiles') -> None:
-
-     essential_columns = (catalog_smiles_column, ) + ESSENTIAL_COLUMNS
-     missing_essential_cols = [col for col in essential_columns
-                               if col not in catalog]
-
-     if len(missing_essential_cols) > 0:
-         print_err(catalog.columns.tolist())
-         raise KeyError("Missing required columns from catalog: " +
-                        ", ".join(missing_essential_cols))
-
-     return None
-
-
- def collate_inventory(catalog: DataFrame,
-                       root_dir: Optional[str] = None,
-                       drop_invalid: bool = True,
-                       drop_unmapped: bool = False,
-                       catalog_smiles_column: str = 'input_smiles',
-                       id_column_name: Optional[str] = None,
-                       id_n_digits: int = 8,
-                       id_prefix: str = '') -> Tuple[Counter, DataFrame]:
-
-     f"""Process a catalog of files containing chemical libraries into a uniform dataframe.
-
-     The catalog table needs to have columns {', '.join(ESSENTIAL_COLUMNS)}:
-
-     - filename is a glob pattern of files to collate
-     - file_format is one of {', '.join(FILE_READERS.keys())}
-     - smiles_column contains smiles strings
-
-     Other columns are optional and can have any name, but must contain the name or a pattern
-     matching a column (for tabular data) or field (for SDF data) in the files
-     of the `filename` column. In the output DataFrame, the named column data will be mapped.
-
-     Optional column contents can be either concatenated or split using the following
-     pattern:
-
-     - col1+col2: concatenates the contents of `col1` and `col2`
-     - col1;-;1:2 : splits the contents of `col1` on the `-` character, and takes splits 1-2 (0-indexed)
-
-     Parameters
-     ----------
-     catalog : pd.DataFrame
-         Table cataloging locations and format of data. Requires
-         columns {', '.join(ESSENTIAL_COLUMNS)}.
-     root_dir : str, optional
-         Path to look for data files. Default: current directory.
-     drop_invalid : bool, optional
-         Whether to drop rows containing invalid SMILES.
-
-     Returns
-     -------
-     pd.DataFrame
-         Collated chemical data.
-
-     """
-
-     root_dir = root_dir or '.'
-
-     _check_catalog(catalog, catalog_smiles_column)
-
-     nongroup_columns = [col for col in catalog
-                         if col not in GROUPING_COLUMNS]
-     loaded_dataframes = []
-     report = Counter({"invalid SMILES": 0,
-                       "rows processed": 0})
-
-     grouped_catalog = catalog.groupby(list(GROUPING_COLUMNS))
-     for (this_glob, this_filetype,
-          this_library_name, this_representation), filename_df in grouped_catalog:
-
-         print_err(f'\nProcessing {this_glob}:')
-
-         this_glob = glob(os.path.join(root_dir, this_glob))
-
-         these_filenames = sorted(f for f in this_glob
-                                  if not os.path.basename(f).startswith('~$'))
-         print_err('\t- ' + '\n\t- '.join(these_filenames))
-
-         column_mapper, mapped_cols = _column_mapper(filename_df,
-                                                     nongroup_columns)
-
-         reader = FILE_READERS.get(this_filetype, read_table)
-
-         for filename in these_filenames:
-
-             this_data0 = reader(filename)
-
-             if not drop_unmapped:
-                 unmapped_cols = {col: 'x_' + col.casefold().replace(' ', '_')
-                                  for col in this_data0 if col not in mapped_cols}
-                 this_data = this_data0[list(unmapped_cols)].rename(columns=unmapped_cols)
-                 this_data = concat([column_mapper(this_data0), this_data],
-                                    axis=1)
-             else:
-                 this_data = column_mapper(this_data0)
-
-             if this_representation.casefold() not in _FROM_FUNCTIONS:
-                 raise TypeError(' or '.join(set((this_representation, this_representation.casefold()))) +
-                                 " not a supported string representation. Try one of " + ", ".join(_FROM_FUNCTIONS))
-
-             this_converter = partial(convert_string_representation,
-                                      input_representation=this_representation.casefold())
-
-             this_data = (this_data
-                          .query('compound_collection != "NA"')
-                          .assign(library_name=this_library_name,
-                                  input_file_format=this_filetype,
-                                  input_string_representation=this_representation,
-                                  plate_id=lambda x: x['plate_id'].astype(str),
-                                  plate_loc=lambda x: x['library_name'].str.cat([x['compound_collection'], x['plate_id'].astype(str), x['well_id'].astype(str)], sep=':'),
-                                  canonical_smiles=lambda x: list(this_converter(x[catalog_smiles_column])),
-                                  is_valid_smiles=lambda x: [s is not None for s in x['canonical_smiles']]))
-
-             report.update({"invalid SMILES": (~this_data['is_valid_smiles']).sum(),
-                            "rows processed": this_data.shape[0]})
-
-             if drop_invalid:
-                 this_data = this_data.query('is_valid_smiles')
-
-             if id_column_name is not None:
-                 this_converter = partial(convert_string_representation,
-                                          output_representation='id',
-                                          options=dict(n=id_n_digits,
-                                                       prefix=id_prefix))
-                 this_data = this_data.assign(**{id_column_name: lambda x: list(this_converter(x['canonical_smiles']))})
-
-             loaded_dataframes.append(this_data)
-
-     collated_df = concat(loaded_dataframes, axis=0)
-
-     return report, collated_df
-
-
- def collate_inventory_from_file(catalog_path: Union[str, TextIO],
-                                 root_dir: Optional[str] = None,
-                                 format: Optional[str] = None,
-                                 *args, **kwargs) -> Tuple[Counter, DataFrame]:
-
-     f"""Process a catalog of files containing chemical libraries into a uniform dataframe.
-
-     The catalog table needs to have columns {', '.join(ESSENTIAL_COLUMNS)}:
-
-     - filename is a glob pattern of files to collate
-     - file_format is one of {', '.join(FILE_READERS.keys())}
-     - smiles_column contains smiles strings
-
-     Other columns are optional and can have any name, but must contain the name or a pattern
-     matching a column (for tabular data) or field (for SDF data) in the files
-     of the `filename` column. In the output DataFrame, the named column data will be mapped.
-
-     Optional column contents can be either concatenated or split using the following
-     pattern:
-
-     - col1+col2: concatenates the contents of `col1` and `col2`
-     - col1;-;1:2 : splits the contents of `col1` on the `-` character, and takes splits 1-2 (0-indexed)
-
-     Parameters
-     ----------
-     catalog_path : str
-         Path to catalog file in XLSX, TSV or CSV format. Requires
-         columns {', '.join(ESSENTIAL_COLUMNS)}.
-     format : str, optional
-         Format of catalog file. Default: infer from file extension.
-     root_dir : str, optional
-         Path to look for data files. Default: use directory containing
-         the catalog.
-
-     Returns
-     -------
-     pd.DataFrame
-         Collated chemical data.
-
-     """
-
-     root_dir = root_dir or os.path.dirname(catalog_path)
-
-     data_catalog = read_table(catalog_path, format=format)
-
-     return collate_inventory(catalog=data_catalog,
-                              root_dir=root_dir,
-                              *args, **kwargs)
-
-
- def deduplicate(df: DataFrame,
-                 column: str = 'smiles',
-                 input_representation: str = 'smiles',
-                 index_columns: Optional[List[str]] = None,
-                 drop_inchikey: bool = False) -> Tuple[Dict, DataFrame]:
-
-     index_columns = index_columns or []
-
-     inchikey_converter = partial(convert_string_representation,
-                                  input_representation=input_representation,
-                                  output_representation='inchikey')
-
-     df = df.assign(inchikey=lambda x: inchikey_converter(x[column]))
-
-     structure_columns = [column, 'inchikey']
-     df_unique = []
-
-     for (string_rep, inchikey), structure_df in df.groupby(structure_columns):
-
-         collapsed_indexes = {col: [';'.join(sorted(map(str, set(structure_df[col].tolist()))))]
-                              for col in structure_df if col in index_columns}
-         collapsed_indexes.update({column: [string_rep],
-                                   'inchikey': [inchikey],
-                                   'instance_count': [structure_df.shape[0]]})
-
-         df_unique.append(DataFrame(collapsed_indexes))
-
-     df_unique = concat(df_unique, axis=0)
-
-     if drop_inchikey:
-         df_unique = df_unique.drop(columns=['inchikey'])
-
-     report = {'starting rows:': df.shape[0],
-               'ending_rows': df_unique.shape[0]}
-
-     return report, df_unique
-
-
- def deduplicate_file(filename: Union[str, TextIO],
-                      format: Optional[str] = None,
-                      *args, **kwargs) -> Tuple[Dict, DataFrame]:
-
-     # Pass the format through so overrides from the CLI take effect.
-     table = read_table(filename, format=format)
-
-     return deduplicate(table, *args, **kwargs)
schemist/converting.py DELETED
@@ -1,369 +0,0 @@
1
- """Converting between chemical representation formats."""
2
-
3
- from typing import Any, Callable, Dict, Iterable, List, Optional, Union
4
-
5
- from functools import wraps
6
-
7
- from carabiner import print_err
8
- from carabiner.cast import cast, flatten
9
- from carabiner.decorators import return_none_on_error, vectorize
10
- from carabiner.itertools import batched
11
-
12
- from datamol import sanitize_smiles
13
- import nemony as nm
14
- from pandas import DataFrame
15
- from rdkit.Chem import (Crippen, Descriptors, rdMolDescriptors,
16
- Mol, MolFromInchi, MolFromHELM, MolFromSequence,
17
- MolFromSmiles, MolToInchi, MolToInchiKey,
18
- MolToSmiles)
19
- from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles
20
- from requests import Session
21
- import selfies as sf
22
-
23
- from .rest_lookup import _inchikey2pubchem_name_id, _inchikey2cactus_name
24
-
25
- @vectorize
26
- @return_none_on_error
27
- def _seq2mol(s: str) -> Union[Mol, None]:
28
-
29
- return MolFromSequence(s, sanitize=True)
30
-
31
-
32
- @vectorize
33
- @return_none_on_error
34
- def _helm2mol(s: str) -> Union[Mol, None]:
35
-
36
- return MolFromHELM(s, sanitize=True)
37
-
38
-
39
- def mini_helm2helm(s: str) -> List[str]:
40
-
41
- new_s = []
42
- token = ''
43
- between_sq_brackets = False
44
-
45
- for letter in s:
46
-
47
- if letter.islower() and not between_sq_brackets:
48
-
49
- letter = f"[d{letter.upper()}]"
50
-
51
- token += letter
52
-
53
- if letter == '[':
54
- between_sq_brackets = True
55
- elif letter == ']':
56
- between_sq_brackets = False
57
-
58
- if not between_sq_brackets:
59
- new_s.append(token)
60
- token = ''
61
-
62
- return "PEPTIDE1{{{inner_helm}}}$$$$".format(inner_helm='.'.join(new_s))
63
-
64
-
65
- @vectorize
66
- @return_none_on_error
67
- def _mini_helm2mol(s: str) -> Mol:
68
-
69
- s = mini_helm2helm(s)
70
-
71
- return MolFromHELM(s, sanitize=True)
72
-
73
-
74
- @vectorize
75
- @return_none_on_error
76
- def _inchi2mol(s: str) -> Mol:
77
-
78
- return MolFromInchi(s,
79
- sanitize=True,
80
- removeHs=True)
81
-
82
- @vectorize
83
- @return_none_on_error
84
- def _smiles2mol(s: str) -> Mol:
85
-
86
- return MolFromSmiles(sanitize_smiles(s))
87
-
88
-
89
- @vectorize
90
- @return_none_on_error
91
- def _selfies2mol(s: str) -> Mol:
92
-
93
- return MolFromSmiles(sf.decoder(s))
94
-
95
-
96
- @vectorize
97
- @return_none_on_error
98
- def _mol2clogp(m: Mol,
99
- **kwargs) -> float:
100
-
101
- return Crippen.MolLogP(m)
102
-
103
-
104
- @vectorize
105
- @return_none_on_error
106
- def _mol2nonstandard_inchikey(m: Mol,
107
- **kwargs) -> str:
108
-
109
- return MolToInchiKey(m,
110
- options="/FixedH /SUU /RecMet /KET /15T")
111
-
112
-
113
- @vectorize
114
- @return_none_on_error
115
- def _mol2hash(m: Mol,
116
- **kwargs) -> str:
117
-
118
- nonstandard_inchikey = _mol2nonstandard_inchikey(m)
119
-
120
- return nm.hash(nonstandard_inchikey)
121
-
122
-
123
- @vectorize
124
- @return_none_on_error
125
- def _mol2id(m: Mol,
126
- n: int = 8,
127
- prefix: str = '',
128
- **kwargs) -> str:
129
-
130
- return prefix + str(int(_mol2hash(m), 16))[:n]
131
-
132
-
133
- @vectorize
134
- @return_none_on_error
135
- def _mol2isomeric_canonical_smiles(m: Mol,
136
- **kwargs) -> str:
137
-
138
- return MolToSmiles(m,
139
- isomericSmiles=True,
140
- canonical=True)
141
-
142
-
143
- @vectorize
144
- @return_none_on_error
145
- def _mol2inchi(m: Mol,
146
- **kwargs) -> str:
147
-
148
- return MolToInchi(m)
149
-
150
-
151
- @vectorize
152
- @return_none_on_error
153
- def _mol2inchikey(m: Mol,
154
- **kwargs) -> str:
155
-
156
- return MolToInchiKey(m)
157
-
158
-
159
- @vectorize
160
- @return_none_on_error
161
- def _mol2random_smiles(m: Mol,
162
- **kwargs) -> str:
163
-
164
- return MolToSmiles(m,
165
- isomericSmiles=True,
166
- doRandom=True)
167
-
168
-
169
- @vectorize
170
- @return_none_on_error
171
- def _mol2mnemonic(m: Mol,
172
- **kwargs) -> str:
173
-
174
- nonstandard_inchikey = _mol2nonstandard_inchikey(m)
175
-
176
- return nm.encode(nonstandard_inchikey)
177
-
178
-
179
- @vectorize
180
- @return_none_on_error
181
- def _mol2mwt(m: Mol,
182
- **kwargs) -> float:
183
-
184
- return Descriptors.ExactMolWt(m)
185
-
186
-
187
- @vectorize
188
- @return_none_on_error
189
- def _mol2min_charge(m: Mol,
190
- **kwargs) -> float:
191
-
192
- return Descriptors.MinPartialCharge(m)
193
-
194
-
195
- @vectorize
196
- @return_none_on_error
197
- def _mol2max_charge(m: Mol,
198
- **kwargs) -> float:
199
-
200
- return Descriptors.MaxPartialCharge(m)
201
-
202
-
203
- @vectorize
204
- @return_none_on_error
205
- def _mol2tpsa(m: Mol,
206
- **kwargs) -> float:
207
-
208
- return rdMolDescriptors.CalcTPSA(m)
209
-
210
-
211
- def _mol2pubchem(m: Union[Mol, Iterable[Mol]],
212
- session: Optional[Session] = None,
213
- chunksize: int = 32) -> List[Dict[str, Union[None, int, str]]]:
214
-
215
- inchikeys = cast(_mol2inchikey(m), to=list)
216
- pubchem_ids = []
217
-
218
- for _inchikeys in batched(inchikeys, chunksize):
219
-
220
- these_ids = _inchikey2pubchem_name_id(_inchikeys,
221
- session=session)
222
- pubchem_ids += these_ids
223
-
-    return pubchem_ids
-
-
-@return_none_on_error
-def _mol2pubchem_id(m: Union[Mol, Iterable[Mol]],
-                    session: Optional[Session] = None,
-                    chunksize: int = 32,
-                    **kwargs) -> Union[str, List[str]]:
-
-    return flatten([val['pubchem_id']
-                    for val in _mol2pubchem(m,
-                                            session=session,
-                                            chunksize=chunksize)])
-
-
-@return_none_on_error
-def _mol2pubchem_name(m: Union[Mol, Iterable[Mol]],
-                      session: Optional[Session] = None,
-                      chunksize: int = 32,
-                      **kwargs) -> Union[str, List[str]]:
-
-    return flatten([val['pubchem_name']
-                    for val in _mol2pubchem(m,
-                                            session=session,
-                                            chunksize=chunksize)])
-
-
-@return_none_on_error
-def _mol2cactus_name(m: Union[Mol, Iterable[Mol]],
-                     session: Optional[Session] = None,
-                     **kwargs) -> Union[str, List[str]]:
-
-    return _inchikey2cactus_name(_mol2inchikey(m),
-                                 session=session)
-
-
-@vectorize
-@return_none_on_error
-def _mol2scaffold(m: Mol,
-                  chiral: bool = True,
-                  **kwargs) -> str:
-
-    return MurckoScaffoldSmiles(mol=m,
-                                includeChirality=chiral)
-
-
-@vectorize
-@return_none_on_error
-def _mol2selfies(m: Mol,
-                 **kwargs) -> str:
-
-    s = sf.encoder(_mol2isomeric_canonical_smiles(m))
-    # The selfies encoder signals failure with -1 rather than an exception.
-    return s if s != -1 else None
-
-
-_TO_FUNCTIONS = {"smiles": _mol2isomeric_canonical_smiles,
-                 "selfies": _mol2selfies,
-                 "inchi": _mol2inchi,
-                 "inchikey": _mol2inchikey,
-                 "nonstandard_inchikey": _mol2nonstandard_inchikey,
-                 "hash": _mol2hash,
-                 "mnemonic": _mol2mnemonic,
-                 "id": _mol2id,
-                 "scaffold": _mol2scaffold,
-                 "permuted_smiles": _mol2random_smiles,
-                 "pubchem_id": _mol2pubchem_id,
-                 "pubchem_name": _mol2pubchem_name,
-                 "cactus_name": _mol2cactus_name,
-                 "clogp": _mol2clogp,
-                 "tpsa": _mol2tpsa,
-                 "mwt": _mol2mwt,
-                 "min_charge": _mol2min_charge,
-                 "max_charge": _mol2max_charge}
-
-_FROM_FUNCTIONS = {"smiles": _smiles2mol,
-                   "selfies": _selfies2mol,
-                   "inchi": _inchi2mol,
-                   "aa_seq": _seq2mol,
-                   "helm": _helm2mol,
-                   "minihelm": _mini_helm2mol}
-
-
-def _x2mol(
-    strings: Union[Iterable[str], str],
-    input_representation: str = 'smiles'
-) -> Union[Mol, None, Iterable[Union[Mol, None]]]:
-
-    from_function = _FROM_FUNCTIONS[input_representation.casefold()]
-    return from_function(strings)
-
-
-def _mol2x(
-    mols: Union[Iterable[Mol], Mol],
-    output_representation: str = 'smiles',
-    **kwargs
-) -> Union[str, None, Iterable[Union[str, None]]]:
-
-    to_function = _TO_FUNCTIONS[output_representation.casefold()]
-    return to_function(mols, **kwargs)
-
-
-def convert_string_representation(
-    strings: Union[Iterable[str], str],
-    input_representation: str = 'smiles',
-    output_representation: Union[Iterable[str], str] = 'smiles',
-    **kwargs
-) -> Union[str, None, Iterable[Union[str, None]], Dict[str, Union[str, None, Iterable[Union[str, None]]]]]:
-
-    """Convert between string representations of chemical structures."""
-
-    mols = _x2mol(cast(strings, to=list), input_representation)
-
-    if not isinstance(output_representation, str) and isinstance(output_representation, Iterable):
-        mols = cast(mols, to=list)
-        outstrings = {rep_name: _mol2x(mols, rep_name, **kwargs)
-                      for rep_name in output_representation}
-    elif isinstance(output_representation, str):
-        outstrings = _mol2x(mols, output_representation, **kwargs)
-    else:
-        raise TypeError("Specified output representation must be a string or an iterable of strings.")
-
-    return outstrings
-
-
-def _convert_input_to_smiles(f: Callable) -> Callable:
-
-    """Decorate a function taking SMILES so that it accepts any supported input representation."""
-
-    @wraps(f)
-    def _f(
-        strings: Union[Iterable[str], str],
-        input_representation: str = 'smiles',
-        *args, **kwargs
-    ) -> Union[str, None, Iterable[Union[str, None]]]:
-
-        smiles = convert_string_representation(
-            cast(strings, to=list),
-            output_representation='smiles',
-            input_representation=input_representation,
-        )
-        return f(strings=smiles, *args, **kwargs)
-
-    return _f
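For orientation, here is a minimal usage sketch of the conversion API deleted above. The `schemist.converting` import path is inferred from the repo layout, and exact outputs depend on the installed RDKit and selfies versions; illustrative only.

```python
# Exercises convert_string_representation as defined above.
from schemist.converting import convert_string_representation

# A single output representation returns one value per input string.
inchikeys = convert_string_representation(
    ["CCC", "CCCO"],
    input_representation="smiles",
    output_representation="inchikey",
)

# An iterable of output representations returns a dict keyed by name.
multi = convert_string_representation(
    ["CCC", "CCCO"],
    output_representation=["inchikey", "scaffold", "mwt"],
)
```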
 
schemist/features.py DELETED
@@ -1,271 +0,0 @@
-"""Tools for generating chemical features."""
-
-from typing import Any, Callable, Iterable, List, Optional, Tuple, Union
-from functools import wraps
-
-from carabiner.cast import cast
-from descriptastorus.descriptors import MakeGenerator
-from pandas import DataFrame, Series
-import numpy as np
-from rdkit import RDLogger
-RDLogger.DisableLog('rdApp.*')
-from rdkit.Chem.AllChem import FingeprintGenerator64, GetMorganGenerator, Mol
-
-from .converting import _smiles2mol, _convert_input_to_smiles
-
-
-def _feature_matrix(f: Callable[[Any], DataFrame]) -> Callable[[Any], Union[DataFrame, Tuple[np.ndarray, np.ndarray]]]:
-
-    """Decorate a featurizer so that an optional prefix is prepended to non-meta column names."""
-
-    @wraps(f)
-    def _f(prefix: Optional[str] = None,
-           *args, **kwargs) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
-
-        feature_matrix = f(*args, **kwargs)
-
-        if prefix is not None and isinstance(feature_matrix, DataFrame):
-            new_cols = {col: f"{prefix}_{col}"
-                        for col in feature_matrix.columns
-                        if not col.startswith('_meta')}
-            feature_matrix = feature_matrix.rename(columns=new_cols)
-
-        return feature_matrix
-
-    return _f
-
-
-def _get_descriptastorus_features(
-    smiles: Iterable[str],
-    generator: str
-) -> Union[DataFrame, Tuple[np.ndarray, List[str]]]:
-
-    generator = MakeGenerator((generator, ))
-    features = list(map(generator.process, smiles))
-    return np.stack(features, axis=0), [col for col, _ in generator.GetColumns()]
-
-
-@_feature_matrix
-@_convert_input_to_smiles
-def calculate_2d_features(
-    strings: Union[Iterable[str], str],
-    normalized: bool = True,
-    histogram_normalized: bool = True,
-    return_dataframe: bool = False
-) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
-
-    """Calculate 2D features from string representation(s).
-
-    Parameters
-    ----------
-    strings : str
-        Input string representation(s).
-    input_representation : str
-        Representation type.
-    normalized : bool, optional
-        Whether to return normalized features. Default: `True`.
-    histogram_normalized : bool, optional
-        Whether to return histogram-normalized features (faster). Default: `True`.
-    return_dataframe : bool, optional
-        Whether to return a Pandas DataFrame instead of a NumPy array. Default: `False`.
-
-    Returns
-    -------
-    DataFrame, Tuple of numpy Arrays
-        If `return_dataframe = True`, a DataFrame with named feature columns, and
-        the final column, called `"meta_feature_valid"`, being the validity indicator.
-        Otherwise returns a tuple of arrays, the first being the matrix of
-        features and the second being the vector of validity indicators.
-
-    Examples
-    --------
-    >>> features, validity = calculate_2d_features(strings='CCC')
-    >>> features[:,:3]
-    array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05]])
-    >>> validity
-    array([1.])
-    >>> features, validity = calculate_2d_features(strings=['CCC', 'CCCO'])
-    >>> features[:,:3]
-    array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05],
-           [7.38891722e-01, 6.00042003e-04, 5.00035002e-05]])
-    >>> validity
-    array([1., 1.])
-    >>> calculate_2d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
-    CCC     True
-    CCCO    True
-    Name: meta_feature_valid, dtype: bool
-
-    """
-
-    if normalized:
-        if histogram_normalized:
-            generator_name = "RDKit2DHistogramNormalized"
-        else:
-            generator_name = "RDKit2DNormalized"
-    else:
-        generator_name = "RDKit2D"
-
-    strings = cast(strings, to=list)
-    feature_matrix, columns = _get_descriptastorus_features(
-        strings,
-        generator=generator_name,
-    )
-
-    if return_dataframe:
-        feature_matrix = DataFrame(
-            feature_matrix,
-            index=strings,
-            columns=columns,
-        )
-        feature_matrix = (
-            feature_matrix
-            .rename(columns={f"{generator_name}_calculated": "meta_feature_valid0"})
-            .assign(meta_feature_type=generator_name,
-                    meta_feature_valid=lambda x: (x['meta_feature_valid0'] == 1.))
-            .drop(columns=['meta_feature_valid0'])
-        )
-        return feature_matrix
-    else:
-        # Column 0 from descriptastorus is the "calculated" (validity) flag.
-        return feature_matrix[:,1:], feature_matrix[:,0]
-
-
-def _fast_fingerprint(generator: FingeprintGenerator64,
-                      mol: Mol,
-                      to_np: bool = True) -> Optional[Union[str, np.ndarray]]:
-
-    try:
-        fp_string = generator.GetFingerprint(mol).ToBitString()
-    except Exception:
-        return None
-    else:
-        if to_np:
-            return np.frombuffer(fp_string.encode(), 'u1') - ord('0')
-        else:
-            return fp_string
-
-
-@_feature_matrix
-@_convert_input_to_smiles
-def calculate_fingerprints(
-    strings: Union[Iterable[str], str],
-    fp_type: str = 'morgan',
-    radius: int = 2,
-    chiral: bool = True,
-    on_bits: bool = True,
-    return_dataframe: bool = False
-) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
-
-    """Calculate the binary fingerprint of string representation(s).
-
-    Only Morgan fingerprints are supported.
-
-    Parameters
-    ----------
-    strings : str
-        Input string representation(s).
-    input_representation : str
-        Representation type.
-    fp_type : str, optional
-        Which fingerprint type to calculate. Default: `'morgan'`.
-    radius : int, optional
-        Atom radius for fingerprints. Default: `2`.
-    chiral : bool, optional
-        Whether to take chirality into account. Default: `True`.
-    on_bits : bool, optional
-        Whether to return the non-zero indices instead of the full binary vector. Default: `True`.
-    return_dataframe : bool, optional
-        Whether to return a Pandas DataFrame instead of a NumPy array. Default: `False`.
-
-    Returns
-    -------
-    DataFrame, Tuple of numpy Arrays
-        If `return_dataframe = True`, a DataFrame with named feature columns, and
-        the final column, called `"meta_feature_valid"`, being the validity indicator.
-        Otherwise returns a tuple of arrays, the first being the matrix of
-        features and the second being the vector of validity indicators.
-
-    Raises
-    ------
-    NotImplementedError
-        If `fp_type` is not `'morgan'`.
-
-    Examples
-    --------
-    >>> bits, validity = calculate_fingerprints(strings='CCC')
-    >>> bits.tolist()
-    ['80;294;1057;1344']
-    >>> sum(validity)  # doctest: +NORMALIZE_WHITESPACE
-    1
-    >>> bits, validity = calculate_fingerprints(strings=['CCC', 'CCCO'])
-    >>> bits.tolist()
-    ['80;294;1057;1344', '80;222;294;473;794;807;1057;1277']
-    >>> sum(validity)  # doctest: +NORMALIZE_WHITESPACE
-    2
-    >>> np.sum(calculate_fingerprints(strings=['CCC', 'CCCO'], on_bits=False)[0], axis=-1)
-    array([4, 8])
-    >>> calculate_fingerprints(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
-    CCC     True
-    CCCO    True
-    Name: meta_feature_valid, dtype: bool
-
-    """
-
-    if fp_type.casefold() == 'morgan':
-        generator_class = GetMorganGenerator
-    else:
-        raise NotImplementedError(f"Fingerprint type {fp_type} not supported!")
-
-    fp_generator = generator_class(radius=radius,
-                                   includeChirality=chiral)
-    strings = cast(strings, to=list)
-    mols = (_smiles2mol(s) for s in strings)
-    fp_strings = (_fast_fingerprint(fp_generator, mol, to_np=on_bits)
-                  for mol in mols)
-
-    if on_bits:
-        # Compact representation: semicolon-joined indices of the set bits.
-        fingerprints = (map(str, np.flatnonzero(fp_string).tolist())
-                        for fp_string in fp_strings)
-        fingerprints = [';'.join(fp) for fp in fingerprints]
-        validity = [len(fp) > 0 for fp in fingerprints]
-    else:
-        # Full binary vector; failures become all -1 so validity can be checked.
-        fingerprints = [np.array([int(digit) for digit in fp_string])
-                        if fp_string is not None
-                        else (-np.ones((fp_generator.GetOptions().fpSize, )))
-                        for fp_string in fp_strings]
-        validity = [np.all(fp >= 0) for fp in fingerprints]
-
-    feature_matrix = np.stack(fingerprints, axis=0)
-
-    if return_dataframe:
-        if feature_matrix.ndim == 1:  # on_bits only
-            feature_matrix = DataFrame(
-                feature_matrix,
-                columns=['fp_bits'],
-                index=strings,
-            )
-        else:
-            feature_matrix = DataFrame(feature_matrix,
-                                       columns=[f"fp_{i}" for i, _ in enumerate(feature_matrix.T)])
-        return feature_matrix.assign(meta_feature_type=fp_type.casefold(),
-                                     meta_feature_valid=validity)
-    else:
-        return feature_matrix, validity
-
-
-_FEATURE_CALCULATORS = {
-    "2d": calculate_2d_features,
-    "fp": calculate_fingerprints,
-}
-
-
-def calculate_feature(
-    feature_type: str,
-    return_dataframe: bool = False,
-    *args, **kwargs) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
-
-    """Calculate the binary fingerprint or descriptor vector of string representation(s)."""
-
-    featurizer = _FEATURE_CALCULATORS[feature_type]
-    # Pass return_dataframe through explicitly: it is captured by this
-    # wrapper's signature and would otherwise never reach the featurizer.
-    return featurizer(*args, return_dataframe=return_dataframe, **kwargs)
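A short sketch of how the featurization entry point above might be called. It assumes descriptastorus and RDKit are installed and that the module is importable as `schemist.features`; column names follow the `meta_feature_*` convention defined above.

```python
# Illustrative only: both calculators return (features, validity) arrays by
# default, or a DataFrame when return_dataframe=True.
from schemist.features import calculate_feature

fp_df = calculate_feature(
    feature_type="fp",   # Morgan fingerprint bits
    strings=["CCC", "CCCO"],
    return_dataframe=True,
)
desc_df = calculate_feature(
    feature_type="2d",   # descriptastorus RDKit2D descriptors
    strings=["CCC", "CCCO"],
    return_dataframe=True,
)
assert fp_df["meta_feature_valid"].all()
```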
 
schemist/generating.py DELETED
@@ -1,262 +0,0 @@
-"""Tools for enumerating compounds. Currently only works with peptides."""
-
-from typing import Callable, Iterable, Optional, Tuple, Union
-
-from functools import partial
-from itertools import chain, islice, product, repeat
-from math import ceil, expm1, floor
-from random import choice, choices, random, seed
-
-from carabiner import print_err
-from carabiner.decorators import vectorize, return_none_on_error
-from carabiner.random import sample_iter
-from rdkit.Chem import Mol, rdChemReactions
-import numpy as np
-
-from .converting import (_x2mol, _mol2x,
-                         _convert_input_to_smiles)
-
-AA = tuple('GALVITSMCPFYWHKRDENQ')
-dAA = tuple(aa.casefold() for aa in AA)
-
-REACTIONS = {
-    'N_to_C_cyclization': '([N;H1:5][C:1][C:2](=[O:6])[O:3].[N;H2:4][C:7][C:8](=[O:9])[N;H1:10])>>[N;H1:5][C:1][C:2](=[O:6])[N;H1:4][C:7][C:8](=[O:9])[N;H1:10].[O;H2:3]',
-    'cysteine_to_chloroacetyl_cyclization': '([N;H1:5][C:2](=[O:6])[C:1][Cl:3].[S;H1:4][C;H2:7][C:8])>>[N;H1:5][C:2](=[O:6])[C:1][S:4][C;H2:7][C:8]',
-    'cysteine_to_N_cyclization': '([N;H1:5][C:2](=[O:6])[C:1][N;H2:3].[S;H1:4][C;H2:7][C:8])>>[N;H1:5][C:2](=[O:6])[C:1][S:4][C;H2:7][C:8].[N;H3:3]',
-}
-
-
-def _get_alphabet(alphabet: Optional[Iterable[str]] = None,
-                  d_aa_only: bool = False,
-                  include_d_aa: bool = False) -> Tuple[str, ...]:
-
-    alphabet = alphabet or AA
-    # Lower-case the supplied alphabet (not the global AA) to get the D-amino acids.
-    alphabet_lower = tuple(set(aa.casefold() for aa in alphabet))
-
-    if d_aa_only:
-        alphabet = alphabet_lower
-    elif include_d_aa:
-        alphabet = tuple(set(chain(alphabet, alphabet_lower)))
-
-    return alphabet
-
-
-def all_peptides_of_one_length(length: int,
-                               alphabet: Optional[Iterable[str]] = None,
-                               d_aa_only: bool = False,
-                               include_d_aa: bool = False) -> Iterable[str]:
-
-    """Generate every peptide sequence of a given length."""
-
-    alphabet = _get_alphabet(alphabet=alphabet,
-                             d_aa_only=d_aa_only,
-                             include_d_aa=include_d_aa)
-
-    return (''.join(peptide)
-            for peptide in product(alphabet, repeat=length))
-
-
-def all_peptides_in_length_range(max_length: int,
-                                 min_length: int = 1,
-                                 by: int = 1,
-                                 alphabet: Optional[Iterable[str]] = None,
-                                 d_aa_only: bool = False,
-                                 include_d_aa: bool = False,
-                                 *args, **kwargs) -> Iterable[str]:
-
-    """Generate every peptide sequence within a length range."""
-
-    length_range = range(*sorted([min_length, max_length + 1]), by)
-    peptide_maker = partial(all_peptides_of_one_length,
-                            alphabet=alphabet,
-                            d_aa_only=d_aa_only,
-                            include_d_aa=include_d_aa,
-                            *args, **kwargs)
-
-    return chain.from_iterable(peptide_maker(length=length)
-                               for length in length_range)
-
-
-def _number_of_peptides(max_length: int,
-                        min_length: int = 1,
-                        by: int = 1,
-                        alphabet: Optional[Iterable[str]] = None,
-                        d_aa_only: bool = False,
-                        include_d_aa: bool = False):
-
-    alphabet = _get_alphabet(alphabet=alphabet,
-                             d_aa_only=d_aa_only,
-                             include_d_aa=include_d_aa)
-    n_peptides = [len(alphabet) ** length
-                  for length in range(*sorted([min_length, max_length + 1]), by)]
-
-    return n_peptides
-
-
-def _naive_sample_peptides_in_length_range(max_length: int,
-                                           min_length: int = 1,
-                                           by: int = 1,
-                                           n: Optional[Union[float, int]] = None,
-                                           alphabet: Optional[Iterable[str]] = None,
-                                           d_aa_only: bool = False,
-                                           include_d_aa: bool = False,
-                                           set_seed: Optional[int] = None):
-
-    alphabet = _get_alphabet(alphabet=alphabet,
-                             d_aa_only=d_aa_only,
-                             include_d_aa=include_d_aa)
-    n_peptides = _number_of_peptides(max_length=max_length,
-                                     min_length=min_length,
-                                     by=by,
-                                     alphabet=alphabet,
-                                     d_aa_only=d_aa_only,
-                                     include_d_aa=include_d_aa)
-    lengths = list(range(*sorted([min_length, max_length + 1]), by))
-    # Weight each length by its number of sequences, so that longer peptides
-    # are drawn proportionately more often.
-    weight_per_length = [count / min(n_peptides) for count in n_peptides]
-    weighted_lengths = list(chain.from_iterable(repeat(l, ceil(w))
-                                                for l, w in zip(lengths, weight_per_length)))
-
-    lengths_sample = (choice(weighted_lengths) for _ in range(n))
-    return (''.join(choices(list(alphabet), k=k)) for k in lengths_sample)
-
-
-def sample_peptides_in_length_range(max_length: int,
-                                    min_length: int = 1,
-                                    by: int = 1,
-                                    n: Optional[Union[float, int]] = None,
-                                    alphabet: Optional[Iterable[str]] = None,
-                                    d_aa_only: bool = False,
-                                    include_d_aa: bool = False,
-                                    naive_sampling_cutoff: float = 5e-3,
-                                    reservoir_sampling: bool = True,
-                                    indexes: Optional[Iterable[int]] = None,
-                                    set_seed: Optional[int] = None,
-                                    *args, **kwargs) -> Iterable[str]:
-
-    """Sample peptide sequences within a length range, avoiding full enumeration
-    when the expected number of sampling collisions is negligible."""
-
-    seed(set_seed)
-
-    alphabet = _get_alphabet(alphabet=alphabet,
-                             d_aa_only=d_aa_only,
-                             include_d_aa=include_d_aa)
-
-    n_peptides = sum(len(alphabet) ** length
-                     for length in range(*sorted([min_length, max_length + 1]), by))
-    if n is None:
-        n_requested = n_peptides
-    elif n >= 1.:
-        n_requested = min(floor(n), n_peptides)
-    else:
-        # n < 1 is interpreted as a fraction of all possible peptides.
-        n_requested = floor(n * n_peptides)
-
-    frac_requested = n_requested / n_peptides
-
-    # Approximation of the birthday problem.
-    p_any_collision = -expm1(-n_requested * (n_requested - 1.) / (2. * n_peptides))
-    n_collisions = n_requested * (1. - ((n_peptides - 1.) / n_peptides) ** (n_requested - 1.))
-    frac_collisions = n_collisions / n_requested
-
-    print_err(f"Sampling {n_requested} ({frac_requested * 100.} %) peptides from "
-              f"length {min_length} to {max_length} ({n_peptides} combinations). "
-              f"Probability of collision if drawing randomly is {p_any_collision}, "
-              f"with {n_collisions} ({100. * frac_collisions} %) collisions on average.")
-
-    if frac_collisions < naive_sampling_cutoff and n_peptides > 2e9:
-        print_err("> Executing naive sampling.")
-        peptides = _naive_sample_peptides_in_length_range(max_length, min_length, by,
-                                                          n=n_requested,
-                                                          alphabet=alphabet,
-                                                          d_aa_only=d_aa_only,
-                                                          include_d_aa=include_d_aa)
-    else:
-        print_err("> Executing exhaustive sampling.")
-        all_peptides = all_peptides_in_length_range(max_length, min_length, by,
-                                                    alphabet=alphabet,
-                                                    d_aa_only=d_aa_only,
-                                                    include_d_aa=include_d_aa,
-                                                    *args, **kwargs)
-
-        if n is None:
-            peptides = all_peptides
-        elif n >= 1.:
-            if reservoir_sampling:
-                peptides = sample_iter(all_peptides, k=n_requested,
-                                       shuffle_output=False)
-            else:
-                peptides = (pep for pep in all_peptides
-                            if random() <= frac_requested)
-        else:
-            peptides = (pep for pep in all_peptides
-                        if random() <= n)
-
-    if indexes is not None:
-        # Coerce up to three slice parameters (start, stop, step) to ints.
-        indexes = (int(ix) if (isinstance(ix, str) and ix.isdigit()) or isinstance(ix, (int, float))
-                   else None
-                   for ix in islice(indexes, 3))
-        indexes = [ix if (ix is None or ix >= 0) else None
-                   for ix in indexes]
-
-        if len(indexes) > 1:
-            if n is not None and n >= 1. and indexes[0] > n:
-                raise ValueError(f"Minimum slice ({indexes[0]}) is higher than number of items ({n}).")
-
-        peptides = islice(peptides, *indexes)
-
-    return peptides
-
-
-def _reactor(smarts: str) -> Callable[[Mol], Union[Mol, None]]:
-
-    rxn = rdChemReactions.ReactionFromSmarts(smarts)
-    reaction_function = rxn.RunReactants
-
-    @vectorize
-    @return_none_on_error
-    def reactor(s: Mol) -> Mol:
-        # Take the first product of the first product set.
-        return reaction_function([s])[0][0]
-
-    return reactor
-
-
-@_convert_input_to_smiles
-def react(strings: Union[str, Iterable[str]],
-          reaction: str = 'N_to_C_cyclization',
-          output_representation: str = 'smiles',
-          **kwargs) -> Union[str, Iterable[str]]:
-
-    """Apply a named reaction in silico to the input structures."""
-
-    try:
-        _this_reaction = REACTIONS[reaction]
-    except KeyError:
-        raise KeyError(f"Reaction {reaction} is not available. Try: " +
-                       ", ".join(list(REACTIONS)))
-
-    reactor = _reactor(_this_reaction)
-    mols = _x2mol(strings)
-    mols = reactor(mols)
-
-    return _mol2x(mols,
-                  output_representation=output_representation,
-                  **kwargs)
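A sketch of how the enumeration and reaction helpers above fit together (module path `schemist.generating` assumed from the repo layout; sampled sequences depend on the seed and installed versions, so outputs are not shown):

```python
# Illustrative only.
from schemist.generating import sample_peptides_in_length_range, react

peptides = list(sample_peptides_in_length_range(
    max_length=4,
    min_length=3,
    n=10,
    set_seed=42,
))

# Head-to-tail cyclization of the sampled peptides, returned as SMILES.
cyclics = list(react(
    peptides,
    input_representation="aa_seq",
    reaction="N_to_C_cyclization",
))
```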
 
schemist/io.py DELETED
@@ -1,149 +0,0 @@
-"""Tools to facilitate input and output."""
-
-from typing import Any, Callable, List, Optional, TextIO, Tuple, Union
-
-from collections import defaultdict
-from functools import partial
-from string import printable
-from tempfile import NamedTemporaryFile
-from xml.etree import ElementTree
-
-from carabiner import print_err
-from carabiner.cast import cast
-from carabiner.itertools import tenumerate
-from carabiner.pd import read_table, write_stream
-from pandas import DataFrame, read_excel
-from rdkit.Chem import SDMolSupplier
-
-from .converting import _mol2isomeric_canonical_smiles
-
-
-def _mutate_df_stream(input_file: Union[str, TextIO],
-                      output_file: Union[str, TextIO],
-                      function: Callable[[DataFrame], Tuple[Any, DataFrame]],
-                      file_format: Optional[str] = None,
-                      chunksize: int = 1000) -> List[Any]:
-
-    carries = []
-    for i, chunk in tenumerate(read_table(input_file,
-                                          format=file_format,
-                                          progress=False,
-                                          chunksize=chunksize)):
-        result = function(chunk)
-        try:
-            carry, df = result
-        except ValueError:
-            # The function returned a bare DataFrame with nothing to carry.
-            df = result
-            carry = 0
-
-        write_stream(df,
-                     output=output_file,
-                     format=file_format,
-                     header=i == 0,
-                     mode='w' if i == 0 else 'a')
-        carries.append(carry)
-
-    return carries
-
-
-def read_weird_xml(filename: Union[str, TextIO],
-                   header: bool = True,
-                   namespace: str = '{urn:schemas-microsoft-com:office:spreadsheet}') -> DataFrame:
-
-    """Read a Microsoft Office XML spreadsheet into a DataFrame, dropping non-printable characters."""
-
-    with cast(filename, TextIO, mode='r') as f:
-        xml_string = ''.join(filter(printable.__contains__, f.read()))
-
-    try:
-        root = ElementTree.fromstring(xml_string)
-    except Exception as e:
-        # Debugging aid: show the characters at a location known to cause trouble.
-        print_err('\n!!! ' + xml_string.split('\n')[1184][377:380])
-        raise e
-
-    for i, row in enumerate(root.iter(f'{namespace}Row')):
-        this_row = [datum.text for datum in row.iter(f'{namespace}Data')]
-
-        if i == 0:
-            if header:
-                heading = this_row
-                df = {colname: [] for colname in heading}
-            else:
-                heading = [f'X{j}' for j, _ in enumerate(this_row)]
-                df = {colname: [datum] for colname, datum in zip(heading, this_row)}
-        else:
-            for colname, datum in zip(heading, this_row):
-                df[colname].append(datum)
-
-    return DataFrame(df)
-
-
-def read_sdf(filename: Union[str, TextIO]) -> DataFrame:
-
-    """Read an SDF file into a DataFrame of properties, plus a canonical SMILES column."""
-
-    filename = cast(filename, str)
-
-    # Round-trip through a temporary file so that undecodable bytes are replaced.
-    with open(filename, 'r', errors='replace') as f:
-        with NamedTemporaryFile("w") as o:
-            o.write(f.read())
-            o.seek(0)
-
-            df = defaultdict(list)
-            for i, mol in enumerate(SDMolSupplier(o.name)):
-                if mol is None:
-                    continue
-
-                propdict = mol.GetPropsAsDict()
-                propdict['SMILES'] = _mol2isomeric_canonical_smiles(mol)
-
-                for colname in propdict:
-                    df[colname].append(propdict[colname])
-                # Pad columns absent from this record.
-                for colname in df:
-                    if colname not in propdict:
-                        df[colname].append(None)
-
-    col_lengths = {col: len(val) for col, val in df.items()}
-    if len(set(col_lengths.values())) > 1:
-        raise ValueError("Column lengths not all the same:\n\t" +
-                         '\n\t'.join(f"{key}:{val}" for key, val in col_lengths.items()))
-
-    return DataFrame(df)
-
-
-FILE_READERS = {
-    'bad_xml': read_weird_xml,
-    'xlsx': partial(read_excel, engine='openpyxl'),
-    'sdf': read_sdf,
-}
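A sketch of how these readers might be dispatched via the `FILE_READERS` mapping defined above. The file names are hypothetical, and the import path is assumed from the repo layout.

```python
# Illustrative only: "library.sdf" and "export.xml" are placeholder names.
from schemist.io import FILE_READERS

sdf_df = FILE_READERS["sdf"]("library.sdf")      # properties plus a SMILES column
xml_df = FILE_READERS["bad_xml"]("export.xml")   # quirky Office-XML spreadsheet
print(sdf_df.shape, xml_df.shape)
```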
 
schemist/rest_lookup.py DELETED
@@ -1,118 +0,0 @@
-"""Tools for querying PubChem and the NCI Cactus resolver."""
-
-from typing import Dict, Iterable, List, Optional, Union
-
-from time import sleep
-from xml.etree import ElementTree
-
-from carabiner import print_err
-from carabiner.cast import cast
-from carabiner.decorators import vectorize
-from requests import Response, Session
-
-_PUBCHEM_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/{inchikey}/property/{get}/{format}"
-_CACTUS_URL = "https://cactus.nci.nih.gov/chemical/structure/{inchikey}/{get}"
-
-_OVERLOAD_CODES = {500, 501, 503, 504}
-
-
-def _url_request(inchikeys: Union[str, Iterable[str]],
-                 url: str,
-                 session: Optional[Session] = None,
-                 **kwargs) -> Response:
-
-    if session is None:
-        session = Session()
-
-    inchikeys = cast(inchikeys, to=list)
-    return session.get(url.format(inchikey=','.join(inchikeys), **kwargs))
-
-
-def _inchikey2pubchem_name_id(inchikeys: Union[str, Iterable[str]],
-                              session: Optional[Session] = None,
-                              counter: int = 0,
-                              max_tries: int = 10,
-                              namespace: str = "{http://pubchem.ncbi.nlm.nih.gov/pug_rest}") -> List[Dict[str, Union[None, int, str]]]:
-
-    # Ensure list semantics: a bare string would otherwise be iterated per character.
-    inchikeys = cast(inchikeys, to=list)
-    r = _url_request(inchikeys, url=_PUBCHEM_URL,
-                     session=session,
-                     get="Title,InchiKey", format="XML")
-
-    if r.status_code == 200:
-        root = ElementTree.fromstring(r.text)
-        compounds = root.iter(f'{namespace}Properties')
-
-        result_dict = dict()
-        for cmpd in compounds:
-            cmpd_dict = {child.tag.split(namespace)[1]: child.text
-                         for child in cmpd}
-            try:
-                inchikey, name, pcid = cmpd_dict['InChIKey'], cmpd_dict['Title'], cmpd_dict['CID']
-            except KeyError:
-                print_err(cmpd_dict)
-            else:
-                result_dict[inchikey] = {'pubchem_name': name.casefold(),
-                                         'pubchem_id': pcid}
-
-        print_err(f'PubChem: Looked up InchiKeys: {",".join(inchikeys)}')
-
-        return [result_dict.get(inchikey, {'pubchem_name': None, 'pubchem_id': None})
-                for inchikey in inchikeys]
-
-    elif r.status_code in _OVERLOAD_CODES and counter < max_tries:
-        sleep(1.)  # back off before retrying an overloaded server
-        return _inchikey2pubchem_name_id(inchikeys,
-                                         session=session,
-                                         counter=counter + 1,
-                                         max_tries=max_tries,
-                                         namespace=namespace)
-    else:
-        print_err(f'PubChem: InchiKey {",".join(inchikeys)} gave status {r.status_code}')
-        return [{'pubchem_name': None, 'pubchem_id': None}
-                for _ in range(len(inchikeys))]
-
-
-@vectorize
-def _inchikey2cactus_name(inchikey: str,
-                          session: Optional[Session] = None,
-                          counter: int = 0,
-                          max_tries: int = 10) -> Optional[str]:
-
-    r = _url_request(inchikey, url=_CACTUS_URL,
-                     session=session,
-                     get="names")
-
-    if r.status_code == 200:
-        # The first line of the response is the preferred name.
-        return r.text.split('\n')[0].casefold()
-    elif r.status_code in _OVERLOAD_CODES and counter < max_tries:
-        sleep(1.)
-        return _inchikey2cactus_name(inchikey,
-                                     session=session,
-                                     counter=counter + 1,
-                                     max_tries=max_tries)
-    else:
-        print_err(f'Cactus: InchiKey {inchikey} gave status {r.status_code}')
-        return None
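These lookups are private helpers (leading underscores) with built-in retry on server overload. A round-trip sketch, assuming network access and service availability (the InChIKey shown is aspirin's):

```python
# Illustrative only.
from requests import Session

from schemist.rest_lookup import _inchikey2pubchem_name_id, _inchikey2cactus_name

with Session() as session:
    records = _inchikey2pubchem_name_id(
        ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N"], session=session,
    )
    name = _inchikey2cactus_name("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", session=session)

print(records[0]["pubchem_name"], name)
```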
 
schemist/splitting.py DELETED
@@ -1,205 +0,0 @@
-"""Tools for splitting tabular datasets, optionally based on chemical features."""
-
-from typing import Dict, Iterable, List, Optional, Tuple, Union
-from collections import defaultdict
-from math import ceil
-from random import random, seed
-
-try:
-    from itertools import batched
-except ImportError:
-    from carabiner.itertools import batched
-
-from tqdm.auto import tqdm
-
-from .converting import convert_string_representation, _convert_input_to_smiles
-from .typing import DataSplits
-
-
-def _train_test_val_sizes(total: int,
-                          train: float = 1.,
-                          test: float = 0.) -> Tuple[int]:
-
-    n_train = int(ceil(train * total))
-    n_test = int(ceil(test * total))
-    n_val = total - n_train - n_test
-
-    return n_train, n_test, n_val
-
-
-def _random_chunk(strings: str,
-                  train: float = 1.,
-                  test: float = 0.,
-                  carry: Optional[Dict[str, List[int]]] = None,
-                  start_from: int = 0) -> Dict[str, List[int]]:
-
-    carry = carry or defaultdict(list)
-    train_test: float = train + test
-
-    for i, _ in enumerate(strings):
-        random_number: float = random()
-        if random_number < train:
-            key = 'train'
-        elif random_number < train_test:
-            key = 'test'
-        else:
-            key = 'validation'
-        carry[key].append(start_from + i)
-
-    return carry
-
-
-def split_random(strings: Union[str, Iterable[str]],
-                 train: float = 1.,
-                 test: float = 0.,
-                 chunksize: Optional[int] = None,
-                 set_seed: Optional[int] = None,
-                 *args, **kwargs) -> DataSplits:
-
-    """Split data into train, test, and validation sets at random."""
-
-    if set_seed is not None:
-        seed(set_seed)
-
-    if chunksize is None:
-        idx = _random_chunk(strings=strings,
-                            train=train,
-                            test=test)
-    else:
-        idx = defaultdict(list)
-        for i, chunk in enumerate(batched(strings, chunksize)):
-            idx = _random_chunk(strings=chunk,
-                                train=train,
-                                test=test,
-                                carry=idx,
-                                start_from=i * chunksize)
-
-    seed(None)
-
-    return DataSplits(**idx)
-
-
-@_convert_input_to_smiles
-def _scaffold_chunk(strings: str,
-                    carry: Optional[Dict[str, List[int]]] = None,
-                    start_from: int = 0) -> Dict[str, List[int]]:
-
-    carry = carry or defaultdict(list)
-    these_scaffolds = convert_string_representation(strings=strings,
-                                                    output_representation='scaffold')
-
-    for j, scaff in enumerate(these_scaffolds):
-        carry[scaff].append(start_from + j)
-
-    return carry
-
-
-def _scaffold_aggregator(scaffold_sets: Dict[str, List[int]],
-                         train: float = 1.,
-                         test: float = 0.,
-                         progress: bool = False) -> DataSplits:
-
-    # Assign whole scaffold groups, largest first, to train, then test, then validation.
-    scaffold_sets = {key: sorted(value)
-                     for key, value in scaffold_sets.items()}
-    scaffold_sets = sorted(scaffold_sets.items(),
-                           key=lambda x: (len(x[1]), x[1][0]),
-                           reverse=True)
-    nrows = sum(len(idx) for _, idx in scaffold_sets)
-    n_train, n_test, n_val = _train_test_val_sizes(nrows, train, test)
-    idx = defaultdict(list)
-
-    iterator = tqdm(scaffold_sets) if progress else scaffold_sets
-    for _, scaffold_idx in iterator:
-        if (len(idx['train']) + len(scaffold_idx)) > n_train:
-            if (len(idx['test']) + len(scaffold_idx)) > n_test:
-                key = 'validation'
-            else:
-                key = 'test'
-        else:
-            key = 'train'
-        idx[key] += scaffold_idx
-
-    return DataSplits(**idx)
-
-
-def split_scaffold(strings: Union[str, Iterable[str]],
-                   train: float = 1.,
-                   test: float = 0.,
-                   chunksize: Optional[int] = None,
-                   progress: bool = True,
-                   *args, **kwargs) -> DataSplits:
-
-    """Split data into train, test, and validation sets such that Murcko scaffolds
-    do not overlap between sets."""
-
-    if chunksize is None:
-        scaffold_sets = _scaffold_chunk(strings)
-    else:
-        scaffold_sets = defaultdict(list)
-        for i, chunk in enumerate(batched(strings, chunksize)):
-            scaffold_sets = _scaffold_chunk(chunk,
-                                            carry=scaffold_sets,
-                                            start_from=i * chunksize)
-
-    return _scaffold_aggregator(scaffold_sets,
-                                train=train, test=test,
-                                progress=progress)
-
-
-_SPLITTERS = {'scaffold': split_scaffold,
-              'random': split_random}
-
-_GROUPED_SPLITTERS = {'scaffold': (_scaffold_chunk, _scaffold_aggregator)}
-
-assert all(_type in _SPLITTERS
-           for _type in _GROUPED_SPLITTERS)  # Should never fail!
-
-
-def split(split_type: str,
-          *args, **kwargs) -> DataSplits:
-
-    """Dispatch to the named splitting strategy."""
-
-    splitter = _SPLITTERS[split_type]
-    return splitter(*args, **kwargs)
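A quick sketch of the splitting API above (import path `schemist.splitting` assumed; the exact assignment of rows depends on the scaffold grouping):

```python
# Illustrative only.
from schemist.splitting import split

smiles = ["CCC", "CCCO", "c1ccccc1", "c1ccccc1O", "CCN"]
splits = split("scaffold", strings=smiles, train=0.6, test=0.2, progress=False)
print(splits.train, splits.test, splits.validation)  # row indices per set
```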
 
schemist/tables.py DELETED
@@ -1,228 +0,0 @@
-"""Tools for processing tabular data."""
-
-from typing import Any, Callable, Dict, Iterable, List, Mapping, Optional, Tuple, Union
-from functools import partial
-
-try:
-    from itertools import batched
-except ImportError:
-    from carabiner.itertools import batched
-
-from carabiner.cast import cast
-from pandas import DataFrame, concat
-
-from .cleaning import clean_smiles, clean_selfies
-from .converting import convert_string_representation
-from .features import calculate_feature
-from .generating import sample_peptides_in_length_range, react
-from .splitting import split
-from .typing import DataSplits
-
-
-def _get_column_values(df: DataFrame,
-                       column: Union[str, List[str]]):
-
-    try:
-        column_values = df[column]
-    except KeyError:
-        raise KeyError(f"Column {column} does not appear to be in the data: {', '.join(df.columns)}")
-    else:
-        return column_values
-
-
-def _get_error_tally(df: DataFrame,
-                     cols: Union[str, List[str]]) -> Dict[str, int]:
-
-    cols = cast(cols, to=list)
-    try:
-        # Boolean columns: count missing or False entries.
-        tally = {col: (df[col].isna() | ~df[col]).sum() for col in cols}
-    except TypeError:
-        # Non-boolean columns: count missing entries only.
-        tally = {col: df[col].isna().sum() for col in cols}
-
-    return tally
-
-
-def converter(df: DataFrame,
-              column: str = 'smiles',
-              input_representation: str = 'smiles',
-              output_representation: Union[str, Iterable[str]] = 'smiles',
-              prefix: Optional[str] = None,
-              options: Optional[Mapping[str, Any]] = None) -> Tuple[Dict[str, int], DataFrame]:
-
-    """Convert a column of structure representations, returning an error tally and the augmented table."""
-
-    prefix = prefix or ''
-    options = options or {}
-
-    column_values = _get_column_values(df, column)
-    output_representation = cast(output_representation, to=list)
-    converters = convert_string_representation(
-        column_values,
-        output_representation=output_representation,
-        input_representation=input_representation,
-        **options,
-    )
-    converted = {f"{prefix}{conversion_name}": cast(conversion, to=list)
-                 for conversion_name, conversion in converters.items()}
-    df = df.assign(**converted)
-
-    return _get_error_tally(df, list(converted)), df
-
-
-def cleaner(df: DataFrame,
-            column: str = 'smiles',
-            input_representation: str = 'smiles',
-            prefix: Optional[str] = None) -> Tuple[Dict[str, int], DataFrame]:
-
-    """Clean and normalize a column of SMILES or SELFIES strings."""
-
-    if input_representation.casefold() == 'smiles':
-        cleaner = clean_smiles
-    elif input_representation.casefold() == 'selfies':
-        cleaner = clean_selfies
-    else:
-        raise ValueError(f"Representation {input_representation} is not supported for cleaning.")
-
-    prefix = prefix or ''
-    new_column = f"{prefix}{column}"
-    df = df.assign(**{new_column: lambda x: cast(cleaner(_get_column_values(x, column)), to=list)})
-
-    return _get_error_tally(df, new_column), df
-
-
-def featurizer(df: DataFrame,
-               feature_type: str,
-               column: str = 'smiles',
-               ids: Optional[Union[str, List[str]]] = None,
-               input_representation: str = 'smiles',
-               prefix: Optional[str] = None) -> Tuple[Dict[str, int], DataFrame]:
-
-    """Calculate a feature matrix from a column of structure representations."""
-
-    if ids is None:
-        ids = df.columns.tolist()
-    else:
-        ids = cast(ids, to=list)
-
-    feature_df = calculate_feature(feature_type=feature_type,
-                                   strings=_get_column_values(df, column),
-                                   prefix=prefix,
-                                   input_representation=input_representation,
-                                   return_dataframe=True)
-
-    if len(ids) > 0:
-        df = concat([df[ids], feature_df], axis=1)
-
-    return _get_error_tally(feature_df, 'meta_feature_valid'), df
-
-
-def assign_groups(df: DataFrame,
-                  grouper: Callable[[Union[str, Iterable[str]]], Dict[str, Tuple[int]]],
-                  group_name: str = 'group',
-                  column: str = 'smiles',
-                  input_representation: str = 'smiles',
-                  *args, **kwargs) -> Tuple[Dict[str, Tuple[int]], DataFrame]:
-
-    """Assign each row to a group defined by a grouping function over a structure column."""
-
-    group_idx = grouper(strings=_get_column_values(df, column),
-                        input_representation=input_representation,
-                        *args, **kwargs)
-
-    inv_group_idx = {i: group for group, idx in group_idx.items() for i in idx}
-    groups = [inv_group_idx[i] for i in range(len(inv_group_idx))]
-
-    return group_idx, df.assign(**{group_name: groups})
-
-
-def _assign_splits(df: DataFrame,
-                   split_idx: DataSplits,
-                   use_df_index: bool = False) -> Tuple[Dict[str, int], DataFrame]:
-
-    row_index = df.index if use_df_index else tuple(range(df.shape[0]))
-    df = df.assign(**{f'is_{key}': [i in getattr(split_idx, key) for i in row_index]
-                      for key in split_idx._fields})
-    split_counts = {key: sum(df[f'is_{key}'].values) for key in split_idx._fields}
-
-    return split_counts, df
-
-
-def splitter(df: DataFrame,
-             split_type: str = 'random',
-             column: str = 'smiles',
-             input_representation: str = 'smiles',
-             *args, **kwargs) -> Tuple[Dict[str, int], DataFrame]:
-
-    """Split rows of a table into train, test, and validation sets."""
-
-    split_idx = split(split_type=split_type,
-                      strings=_get_column_values(df, column),
-                      input_representation=input_representation,
-                      *args, **kwargs)
-
-    return _assign_splits(df, split_idx=split_idx)
-
-
-def reactor(df: DataFrame,
-            column: str = 'smiles',
-            reaction: Union[str, Iterable[str]] = 'N_to_C_cyclization',
-            prefix: Optional[str] = None,
-            *args, **kwargs) -> Tuple[Dict[str, int], DataFrame]:
-
-    """React a column of structures in silico using one or more named reactions."""
-
-    prefix = prefix or ''
-    reactors = {col: partial(react, reaction=col)
-                for col in cast(reaction, to=list)}
-    column_values = _get_column_values(df, column)
-    new_columns = {f"{prefix}{col}": list(_reactor(strings=column_values, *args, **kwargs))
-                   for col, _reactor in reactors.items()}
-    df = df.assign(**new_columns)
-
-    # Tally errors on the prefixed columns actually added to the table.
-    return _get_error_tally(df, list(new_columns)), df
-
-
-def _peptide_table(max_length: int,
-                   min_length: Optional[int] = None,
-                   by: int = 1,
-                   n: Optional[Union[float, int]] = None,
-                   prefix: str = '',
-                   suffix: str = '',
-                   generator: bool = False,
-                   batch_size: int = 1000,
-                   *args, **kwargs) -> Union[DataFrame, Iterable]:
-
-    min_length = min_length or max_length
-    peptides = sample_peptides_in_length_range(max_length=max_length,
-                                               min_length=min_length,
-                                               by=by,
-                                               n=n,
-                                               *args, **kwargs)
-
-    if generator:
-        return (DataFrame(dict(peptide_sequence=[f"{prefix}{pep}{suffix}" for pep in peps]))
-                for peps in batched(peptides, batch_size))
-    else:
-        peps = [f"{prefix}{pep}{suffix}" for pep in peptides]
-        return DataFrame(dict(peptide_sequence=peps))
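A sketch of the table-level pipeline these functions support: each takes a DataFrame and returns an error tally alongside the transformed table (import path `schemist.tables` assumed; illustrative only).

```python
from pandas import DataFrame

from schemist.tables import converter, splitter

df = DataFrame({"smiles": ["CCC", "CCCO", "c1ccccc1"]})
errors, df = converter(df, output_representation=["inchikey", "scaffold"])
counts, df = splitter(df, split_type="random", train=0.8, test=0.2)
print(errors, counts)
```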
 
schemist/typing.py DELETED
@@ -1,7 +0,0 @@
-"""Types used in schemist."""
-
-from collections import namedtuple
-
-DataSplits = namedtuple('DataSplits',
-                        ['train', 'test', 'validation'],
-                        defaults=[tuple(), tuple(), tuple()])
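Since `DataSplits` is a plain namedtuple of row indices with empty-tuple defaults, downstream code can rely on all three fields existing:

```python
from schemist.typing import DataSplits

splits = DataSplits(train=(0, 1, 3), test=(2,))
print(splits.train, splits.test, splits.validation)  # (0, 1, 3) (2,) ()
```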
 
schemist/utils.py DELETED
@@ -1 +0,0 @@
-"""Miscellaneous utilities for schemist."""
 
 
test/data/AmpC_screen_table_10k.csv.gz DELETED
Binary file (171 kB)
 
test/tests.py DELETED
@@ -1,6 +0,0 @@
-import doctest
-
-import schemist as sch
-
-if __name__ == '__main__':
-    doctest.testmod(sch)
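Note that `doctest.testmod` only collects doctests from the namespace of the module it is given, so the suite above exercises the package's top level. A sketch that would also walk the submodules shown in this commit (module list assumed; illustrative only):

```python
import doctest

import schemist.converting
import schemist.features
import schemist.splitting

for module in (schemist.converting, schemist.features, schemist.splitting):
    # Prints one summary TestResults line per module.
    print(module.__name__, doctest.testmod(module, verbose=False))
```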