Spaces: Sleeping
eachanjohnson committed
Commit efe6d99 · 1 Parent(s): bb77f03
Sat Oct 12 17:25:16 UTC 2024 :: HF Spaces deployment
- LICENSE +0 -21
- README.md +30 -79
- app/app.py → app.py +0 -0
- app/README.md +0 -38
- docs/requirements.txt +0 -8
- docs/source/conf.py +0 -45
- docs/source/index.md +0 -25
- docs/source/installation.md +0 -17
- docs/source/modules.rst +0 -7
- docs/source/schemist.rst +0 -109
- docs/source/usage.md +0 -55
- pyproject.toml +0 -61
- app/requirements.txt → requirements.txt +0 -0
- schemist/__init__.py +0 -3
- schemist/cleaning.py +0 -27
- schemist/cli.py +0 -535
- schemist/collating.py +0 -317
- schemist/converting.py +0 -369
- schemist/features.py +0 -271
- schemist/generating.py +0 -262
- schemist/io.py +0 -149
- schemist/rest_lookup.py +0 -118
- schemist/splitting.py +0 -205
- schemist/tables.py +0 -228
- schemist/typing.py +0 -7
- schemist/utils.py +0 -1
- test/data/AmpC_screen_table_10k.csv.gz +0 -0
- test/tests.py +0 -6
LICENSE
DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) [year] [fullname]
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
README.md
CHANGED
@@ -1,87 +1,38 @@
-
- 
- 
- 
[](https://huggingface.co/spaces/scbirlab/chem-converter)

-
- - [Command-line usage](#command-line-usage)
- - [Python API](#python-api)
- - [Documentation](#documentation)

-
-
-
- ```bash
- pip install schemist
- ```
-
- ### From source
-
- Clone the repository, then `cd` into it. Then run:
-
- ```bash
- pip install -e .
- ```
-
- ## Command-line usage
-
- **schemist** provides command-line utilities. The list of commands can be checked like so:
-
- ```bash
- $ schemist --help
- usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...
-
- Tools for cleaning, collating, and augmenting chemical datasets.
-
- options:
-   -h, --help            show this help message and exit
-   --version, -v         show program's version number and exit
-
- Sub-commands:
-   {clean,convert,featurize,collate,dedup,enumerate,react,split}
-                         Use these commands to specify the tool you want to use.
-     clean               Clean and normalize SMILES column of a table.
-     convert             Convert between string representations of chemical structures.
-     featurize           Convert between string representations of chemical structures.
-     collate             Collect disparate tables or SDF files of libraries into a single table.
-     dedup               Deduplicate chemical structures and retain references.
-     enumerate           Enumerate bio-chemical structures within length and sequence constraints.
-     react               React compounds in silico in indicated columns using a named reaction.
-     split               Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
- ```
-
- Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
-
- All commands (except `collate`) take from the input table a named column with a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.
-
- The tools complete specific tasks which
- can be easily composed into analysis pipelines, because the TSV table output goes to
- `stdout` by default so they can be piped from one tool to another.
-
- To get help for a specific command, do
-
- ```bash
- schemist <command> --help
- ```
-
- For the Python API, [see below](#python-api).
-
-
- ## Python API
-
- **schemist** can be imported into Python to help make custom analyses.
-
- ```python
- >>> import schemist as sch
- ```
-
- ## Documentation
-
- Full API documentation is at [ReadTheDocs](https://schemist.readthedocs.org).
+ ---
+ title: Chemical string format converter
+ emoji: ⚗️
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: "5.0.2"
+ app_file: app.py
+ pinned: false
+ short_description: Trivial batch interconversion of 1D chemical formats.
+ ---
+ # Chemical string format converter

[](https://huggingface.co/spaces/scbirlab/chem-converter)

+ Trivial batch interconversion of 1D chemical formats.
+
+ Frontend for [schemist](https://github.com/scbirlab/schemist) to allow interconversion from:
+
+ - SMILES
+ - SELFIES
+ - Amino acid sequences
+ - HELM
+
+ to...
+
+ - Structure image
+ - SMILES
+ - SELFIES
+ - InChI
+ - InChIKey
+ - Name
+ - cLogP
+ - TPSA
+ - molecular weight
+ - charge
+
+ ... and several others!
app/app.py → app.py
RENAMED
File without changes
app/README.md
DELETED
@@ -1,38 +0,0 @@
- ---
- title: Chemical string format converter
- emoji: ⚗️
- colorFrom: blue
- colorTo: green
- sdk: gradio
- sdk_version: 5.0.2
- app_file: app.py
- pinned: false
- short_description: Trivial batch interconversion of 1D chemical formats.
- ---
- # Chemical string format converter
-
- [](https://huggingface.co/spaces/scbirlab/chem-converter)
-
- Trivial batch interconversion of 1D chemical formats.
-
- Frontend for [schemist](https://github.com/scbirlab/schemist) to allow interconversion from:
-
- - SMILES
- - SELFIES
- - Amino acid sequences
- - HELM
-
- to...
-
- - Structure image
- - SMILES
- - SELFIES
- - InChI
- - InChIKey
- - Name
- - cLogP
- - TPSA
- - molecular weight
- - charge
-
- ... and several others!
docs/requirements.txt
DELETED
@@ -1,8 +0,0 @@
- myst_parser
- matplotlib
- numpy
- openpyxl==3.1.0
- pandas
- scipy
- sphinx_rtd_theme
- ./
docs/source/conf.py
DELETED
@@ -1,45 +0,0 @@
- # Configuration file for the Sphinx documentation builder.
- #
- # For the full list of built-in configuration values, see the documentation:
- # https://www.sphinx-doc.org/en/master/usage/configuration.html
-
- # -- Project information -----------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
-
- project = 'schemist'
- copyright = '2024, Eachan Johnson'
- author = 'Eachan Johnson'
- release = '0.0.1'
-
- # -- General configuration ---------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
-
- extensions = ['sphinx.ext.doctest',
-               'sphinx.ext.autodoc',
-               'sphinx.ext.autosummary',
-               'sphinx.ext.napoleon',
-               'sphinx.ext.viewcode',
-               'myst_parser']
-
- myst_enable_extensions = [
-     "amsmath",
-     "dollarmath",
- ]
-
- source_suffix = {
-     '.rst': 'restructuredtext',
-     '.txt': 'markdown',
-     '.md': 'markdown',
- }
-
-
- templates_path = ['_templates']
- exclude_patterns = []
-
-
- # -- Options for HTML output -------------------------------------------------
- # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
-
- html_theme = 'sphinx_rtd_theme'
- html_static_path = []
docs/source/index.md
DELETED
@@ -1,25 +0,0 @@
- # ⬢⬢⬢ schemist
-
- 
- 
- 
-
- Organizing and processing tables of chemical structures.
-
- ```{toctree}
- :maxdepth: 2
- :caption: Contents:
-
- installation
- usage
- python
- modules
- ```
-
- ## Issues, problems, suggestions
-
- Add to the [issue tracker](https://www.github.com/schemist/issues).
-
- ## Source
-
- View source at [GitHub](https://github.com/scbirlab/schemist).
docs/source/installation.md
DELETED
@@ -1,17 +0,0 @@
- # Installation
-
- ## The easy way
-
- Install the pre-compiled version from GitHub:
-
- ```bash
- $ pip install schemist
- ```
-
- ## From source
-
- Clone the [repository](https://www.github.com/schemist), then `cd` into it. Then run:
-
- ```bash
- pip install -e .
- ```
docs/source/modules.rst
DELETED
@@ -1,7 +0,0 @@
- schemist
- ========
-
- .. toctree::
-    :maxdepth: 4
-
-    schemist
docs/source/schemist.rst
DELETED
@@ -1,109 +0,0 @@
- schemist package
- ================
-
- Submodules
- ----------
-
- schemist.cleaning module
- ------------------------
-
- .. automodule:: schemist.cleaning
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.cli module
- -------------------
-
- .. automodule:: schemist.cli
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.collating module
- -------------------------
-
- .. automodule:: schemist.collating
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.converting module
- --------------------------
-
- .. automodule:: schemist.converting
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.features module
- ------------------------
-
- .. automodule:: schemist.features
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.generating module
- --------------------------
-
- .. automodule:: schemist.generating
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.io module
- ------------------
-
- .. automodule:: schemist.io
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.rest\_lookup module
- ----------------------------
-
- .. automodule:: schemist.rest_lookup
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.splitting module
- -------------------------
-
- .. automodule:: schemist.splitting
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.tables module
- ----------------------
-
- .. automodule:: schemist.tables
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.typing module
- ----------------------
-
- .. automodule:: schemist.typing
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- schemist.utils module
- ---------------------
-
- .. automodule:: schemist.utils
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
- Module contents
- ---------------
-
- .. automodule:: schemist
-    :members:
-    :undoc-members:
-    :show-inheritance:
docs/source/usage.md
DELETED
@@ -1,55 +0,0 @@
- # Usage
-
- **schemist** has a variety of utilities which can be used through the command-line or the [Python API](#python-api).
-
- ## Command-line usage
-
- **schemist** provides command-line utilities. The list of commands can be checked like so:
-
- ```bash
- $ schemist --help
- usage: schemist [-h] [--version] {clean,convert,featurize,collate,dedup,enumerate,react,split} ...
-
- Tools for cleaning, collating, and augmenting chemical datasets.
-
- options:
-   -h, --help            show this help message and exit
-   --version, -v         show program's version number and exit
-
- Sub-commands:
-   {clean,convert,featurize,collate,dedup,enumerate,react,split}
-                         Use these commands to specify the tool you want to use.
-     clean               Clean and normalize SMILES column of a table.
-     convert             Convert between string representations of chemical structures.
-     featurize           Convert between string representations of chemical structures.
-     collate             Collect disparate tables or SDF files of libraries into a single table.
-     dedup               Deduplicate chemical structures and retain references.
-     enumerate           Enumerate bio-chemical structures within length and sequence constraints.
-     react               React compounds in silico in indicated columns using a named reaction.
-     split               Split table based on chosen algorithm, optionally taking account of chemical structure during splits.
- ```
-
- Each command is designed to work on large data files in a streaming fashion, so that the entire file is not held in memory at once. One caveat is that the scaffold-based splits are very slow with tables of millions of rows.
-
- All commands (except `collate`) take from the input table a named column with a SMILES, SELFIES, amino-acid sequence, HELM, or InChI representation of compounds.
-
- The tools complete specific tasks which
- can be easily composed into analysis pipelines, because the TSV table output goes to
- `stdout` by default so they can be piped from one tool to another.
-
- To get help for a specific command, do
-
- ```bash
- schemist <command> --help
- ```
-
- For the Python API, [see below](#python-api).
-
- ## Python API
-
- You can access the underlying functions of `schemist` to help custom analyses or develop other tools.
-
- ```python
- >>> import schemist as sch
- ```
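The streaming behaviour described in the deleted usage docs above (chunks of the table are processed and written out as they are read, so the whole file is never held in memory) can be sketched in plain Python. This is an illustrative stand-in, not schemist's actual `_mutate_df_stream`; the helper name and chunk size are assumptions:

```python
import csv
import io

def mutate_stream(reader, writer, function, chunk_size=1000):
    """Apply `function` to successive chunks of rows, writing each
    result before reading the next chunk, so memory use stays bounded."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            writer.writerows(function(chunk))
            chunk = []
    if chunk:  # flush the final, partial chunk
        writer.writerows(function(chunk))

# Toy run: upper-case a "smiles" column, two rows at a time.
src = io.StringIO("smiles\nccco\nc1ccccc1\ncco\n")
dst = io.StringIO()
reader = csv.DictReader(src)
writer = csv.DictWriter(dst, fieldnames=["smiles"])
writer.writeheader()
mutate_stream(reader, writer,
              lambda rows: [{"smiles": r["smiles"].upper()} for r in rows],
              chunk_size=2)
```

Because each tool writes as it reads, piping one tool's `stdout` into the next composes naturally, which is what the TSV-to-`stdout` default above is for.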
pyproject.toml
DELETED
@@ -1,61 +0,0 @@
- [project]
- name = "schemist"
- version = "0.0.1"
- authors = [
-   { name="Eachan Johnson", email="[email protected]" },
- ]
- description = "Organizing and processing tables of chemical structures."
- readme = "README.md"
- requires-python = ">=3.8"
- license = {file = "LICENSE"}
- keywords = ["science", "chemistry", "SMILES", "SELFIES", "cheminformatics"]
-
- classifiers = [
-
-   "Development Status :: 3 - Alpha",
-
-   # Indicate who your project is intended for
-   "Intended Audience :: Science/Research",
-   "Topic :: Scientific/Engineering :: Chemistry",
-
-   "License :: OSI Approved :: MIT License",
-
-   "Programming Language :: Python :: 3.8",
-   "Programming Language :: Python :: 3.9",
-   "Programming Language :: Python :: 3.10",
-   "Programming Language :: Python :: 3.11",
-   "Programming Language :: Python :: 3 :: Only",
- ]
-
- dependencies = [
-   "carabiner-tools[pd]>=0.0.3.post1",
-   "datamol",
-   "descriptastorus==2.6.1",
-   "nemony",
-   "openpyxl==3.1.0",
-   "pandas",
-   "rdkit",
-   "requests",
-   "selfies",
- ]
-
- [project.urls]
- "Homepage" = "https://github.com/scbirlab/schemist"
- "Repository" = "https://github.com/scbirlab/schemist.git"
- "Bug Tracker" = "https://github.com/scbirlab/schemist/issues"
- "Documentation" = "https://readthedocs.org/schemist"
-
- [project.scripts]  # Optional
- schemist = "schemist.cli:main"
-
- [tool.setuptools]
- packages = ["schemist"]
- # If there are data files included in your packages that need to be
- # installed, specify them here.
- # package-data = {"" = ["*.yml"]}
-
- [build-system]
- # These are the assumed default build requirements from pip:
- # https://pip.pypa.io/en/stable/reference/pip/#pep-517-and-518-support
- requires = ["setuptools>=43.0.0", "wheel"]
- build-backend = "setuptools.build_meta"
app/requirements.txt → requirements.txt
RENAMED
File without changes
schemist/__init__.py
DELETED
@@ -1,3 +0,0 @@
- from importlib.metadata import version
-
- __version__ = version("schemist")
schemist/cleaning.py
DELETED
@@ -1,27 +0,0 @@
- """Chemical structure cleaning routines."""
-
- from carabiner.decorators import vectorize
-
- from datamol import sanitize_smiles
- import selfies as sf
-
- @vectorize
- def clean_smiles(smiles: str,
-                  *args, **kwargs) -> str:
-
-     """Sanitize a SMILES string or list of SMILES strings.
-
-     """
-
-     return sanitize_smiles(smiles, *args, **kwargs)
-
-
- @vectorize
- def clean_selfies(selfies: str,
-                   *args, **kwargs) -> str:
-
-     """Sanitize a SELFIES string or list of SELFIES strings.
-
-     """
-
-     return sf.encode(sanitize_smiles(sf.decode(selfies), *args, **kwargs))
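`clean_smiles` and `clean_selfies` in the deleted module above rely on `carabiner`'s `@vectorize` so that one function accepts either a single string or a list. A minimal sketch of that decorator's behaviour, with a toy whitespace-stripping sanitizer standing in for `datamol.sanitize_smiles` (both stand-ins are assumptions, not the real implementations):

```python
from functools import wraps

def vectorize(f):
    """Toy version of carabiner.decorators.vectorize: map f over
    lists/tuples, call it directly on a single value."""
    @wraps(f)
    def wrapper(x, *args, **kwargs):
        if isinstance(x, (list, tuple)):
            return [f(item, *args, **kwargs) for item in x]
        return f(x, *args, **kwargs)
    return wrapper

@vectorize
def clean_smiles(smiles: str) -> str:
    # Stand-in "sanitizer": just strip surrounding whitespace.
    return smiles.strip()

single = clean_smiles(" CCO ")               # → "CCO"
many = clean_smiles([" CCO ", "c1ccccc1 "])  # → ["CCO", "c1ccccc1"]
```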
schemist/cli.py
DELETED
@@ -1,535 +0,0 @@
- """Command-line interface for schemist."""
-
- from typing import Any, Dict, List, Optional
-
- from argparse import FileType, Namespace
- from collections import Counter, defaultdict
- from functools import partial
- import os
- import sys
- from tempfile import NamedTemporaryFile, TemporaryDirectory
-
- from carabiner import pprint_dict, upper_and_lower
- from carabiner.cliutils import clicommand, CLIOption, CLICommand, CLIApp
- from carabiner.itertools import tenumerate
- from carabiner.pd import get_formats, write_stream
-
- from . import __version__
- from .collating import collate_inventory, deduplicate_file
- from .converting import _TO_FUNCTIONS, _FROM_FUNCTIONS
- from .generating import AA, REACTIONS
- from .io import _mutate_df_stream
- from .tables import (converter, cleaner, featurizer, assign_groups,
-                      _assign_splits, splitter, _peptide_table, reactor)
- from .splitting import _SPLITTERS, _GROUPED_SPLITTERS
-
- def _option_parser(x: Optional[List[str]]) -> Dict[str, Any]:
-
-     options = {}
-
-     try:
-         for opt in x:
-
-             try:
-                 key, value = opt.split('=')
-             except ValueError:
-                 raise ValueError(f"Option {opt} is misformatted. It should be in the format keyword=value.")
-
-             try:
-                 value = int(value)
-             except ValueError:
-                 try:
-                     value = float(value)
-                 except ValueError:
-                     pass
-
-             options[key] = value
-
-     except TypeError:
-
-         pass
-
-     return options
-
-
- def _sum_tally(tallies: Counter,
-                message: str = "Error counts",
-                use_length: bool = False):
-
-     total_tally = Counter()
-
-     for tally in tallies:
-
-         if use_length:
-             total_tally.update({key: len(value) for key, value in tally.items()})
-         else:
-             total_tally.update(tally)
-
-     if len(tallies) == 0:
-         raise ValueError(f"Nothing generated!")
-
-     pprint_dict(total_tally, message=message)
-
-     return total_tally
-
-
- @clicommand(message="Cleaning file with the following parameters")
- def _clean(args: Namespace) -> None:
-
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(cleaner,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        prefix=args.prefix),
-                                       file_format=args.format)
-
-     _sum_tally(error_tallies)
-
-     return None
-
-
- @clicommand(message="Converting between string representations with the following parameters")
- def _convert(args: Namespace) -> None:
-
-     options = _option_parser(args.options)
-
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(converter,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        output_representation=args.to,
-                                                        prefix=args.prefix,
-                                                        options=options),
-                                       file_format=args.format)
-
-     _sum_tally(error_tallies)
-
-     return None
-
-
- @clicommand(message="Adding features to files with the following parameters")
- def _featurize(args: Namespace) -> None:
-
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(featurizer,
-                                                        feature_type=args.feature,
-                                                        column=args.column,
-                                                        ids=args.id,
-                                                        input_representation=args.representation,
-                                                        prefix=args.prefix),
-                                       file_format=args.format)
-
-     _sum_tally(error_tallies)
-
-     return None
-
-
- @clicommand(message="Splitting table with the following parameters")
- def _split(args: Namespace) -> None:
-
-     split_type = args.type.casefold()
-
-     if split_type in _GROUPED_SPLITTERS:
-
-         chunk_processor, aggregator = _GROUPED_SPLITTERS[split_type]
-
-         with TemporaryDirectory() as dir:
-
-             with NamedTemporaryFile("w", dir=dir, delete=False) as f:
-
-                 group_idxs = _mutate_df_stream(input_file=args.input,
-                                                output_file=f,
-                                                function=partial(assign_groups,
-                                                                 grouper=chunk_processor,
-                                                                 group_name=split_type,
-                                                                 column=args.column,
-                                                                 input_representation=args.representation),
-                                                file_format=args.format)
-                 f.close()
-                 new_group_idx = defaultdict(list)
-
-                 totals = 0
-                 for group_idx in group_idxs:
-                     these_totals = 0
-                     for key, value in group_idx.items():
-                         these_totals += len(value)
-                         new_group_idx[key] += [idx + totals for idx in value]
-                     totals += these_totals
-
-                 group_idx = aggregator(new_group_idx,
-                                        train=args.train,
-                                        test=args.test)
-
-                 split_tallies = _mutate_df_stream(input_file=f.name,
-                                                   output_file=args.output,
-                                                   function=partial(_assign_splits,
-                                                                    split_idx=group_idx,
-                                                                    use_df_index=True),
-                                                   file_format=args.format)
-             if os.path.exists(f.name):
-                 os.remove(f.name)
-
-     else:
-
-         split_tallies = _mutate_df_stream(input_file=args.input,
-                                           output_file=args.output,
-                                           function=partial(splitter,
-                                                            split_type=args.type,
-                                                            column=args.column,
-                                                            input_representation=args.representation,
-                                                            train=args.train,
-                                                            test=args.test,
-                                                            set_seed=args.seed),
-                                           file_format=args.format)
-
-     _sum_tally(split_tallies,
-                message="Split counts")
-
-     return None
-
-
- @clicommand(message="Collating files with the following parameters")
- def _collate(args: Namespace) -> None:
-
-     root_dir = args.data_dir or '.'
-
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(collate_inventory,
-                                                        root_dir=root_dir,
-                                                        drop_unmapped=not args.keep_extra_columns,
-                                                        catalog_smiles_column=args.column,
-                                                        id_column_name=args.id_column,
-                                                        id_n_digits=args.digits,
-                                                        id_prefix=args.prefix),
-                                       file_format=args.format)
-
-     _sum_tally(error_tallies,
-                message="Collated chemicals:")
-
-     return None
-
-
- @clicommand(message="Deduplicating chemical structures with the following parameters")
- def _dedup(args: Namespace) -> None:
-
-     report, deduped_df = deduplicate_file(args.input,
-                                           format=args.format,
-                                           column=args.column,
-                                           input_representation=args.representation,
-                                           index_columns=args.indexes)
-
-     if args.prefix is not None and 'inchikey' in deduped_df:
-         deduped_df = deduped_df.rename(columns={'inchikey': f'{args.prefix}inchikey'})
-
-     write_stream(deduped_df,
-                  output=args.output,
-                  format=args.format)
-
-     pprint_dict(report, message="Finished deduplicating:")
-
-     return None
-
-
- @clicommand(message="Enumerating peptides with the following parameters")
- def _enum(args: Namespace) -> None:
-
-     tables = _peptide_table(max_length=args.max_length,
-                             min_length=args.min_length,
-                             n=args.number,
-                             indexes=args.slice,
-                             set_seed=args.seed,
-                             prefix=args.prefix,
-                             suffix=args.suffix,
-                             d_aa_only=args.d_aa_only,
-                             include_d_aa=args.include_d_aa,
-                             generator=True)
-
-     dAA_use = any(aa.islower() for aa in args.prefix + args.suffix)
-     dAA_use = dAA_use or args.include_d_aa or args.d_aa_only
-
-     tallies, error_tallies = [], []
-     options = _option_parser(args.options)
-     _converter = partial(converter,
-                          column='peptide_sequence',
-                          input_representation='minihelm' if dAA_use else 'aa_seq', ## affects performance
-                          output_representation=args.to,
-                          options=options)
-
-     for i, table in tenumerate(tables):
-
-         _err_tally, df = _converter(table)
-
-         tallies.append({"Number of peptides": df.shape[0]})
-         error_tallies.append(_err_tally)
-
-         write_stream(df,
-                      output=args.output,
-                      format=args.format,
-                      mode='w' if i == 0 else 'a',
-                      header=i == 0)
-
-     _sum_tally(tallies,
-                message="Enumerated peptides")
-     _sum_tally(error_tallies,
-                message="Conversion errors")
-
-     return None
-
-
- @clicommand(message="Reacting peptides with the following parameters")
- def _react(args: Namespace) -> None:
-
-     error_tallies = _mutate_df_stream(input_file=args.input,
-                                       output_file=args.output,
-                                       function=partial(reactor,
-                                                        column=args.column,
-                                                        input_representation=args.representation,
-                                                        reaction=args.reaction,
-                                                        product_name=args.name),
-                                       file_format=args.format)
-
-     _sum_tally(error_tallies)
-
-     return None
-
-
- def main() -> None:
-
-     inputs = CLIOption('input',
-                        default=sys.stdin,
-                        type=FileType('r'),
-                        nargs='?',
-                        help='Input columnar Excel, CSV or TSV file. Default: STDIN.')
-     representation = CLIOption('--representation', '-r',
-                                type=str,
-                                default='SMILES',
-                                choices=upper_and_lower(_FROM_FUNCTIONS),
-                                help='Chemical representation to use for input. ')
-     column = CLIOption('--column', '-c',
-                        default='smiles',
-                        type=str,
-                        help='Column to use as input string representation. ')
-     prefix = CLIOption('--prefix', '-p',
-                        default=None,
-                        type=str,
-                        help='Prefix to add to new column name. Default: no prefix')
-     to = CLIOption('--to', '-2',
-                    type=str,
-                    default='SMILES',
-                    nargs='*',
-                    choices=upper_and_lower(_TO_FUNCTIONS),
-                    help='Format to convert to.')
-     options = CLIOption('--options', '-x',
-                         type=str,
-                         default=None,
-                         nargs='*',
-                         help='Options to pass to converter, in the format '
-                              '"keyword1=value1 keyword2=value2"')
-     output = CLIOption('--output', '-o',
-                        type=FileType('w'),
-                        default=sys.stdout,
-                        help='Output file. Default: STDOUT')
-     formatting = CLIOption('--format', '-f',
-                            type=str,
-                            default=None,
-                            choices=upper_and_lower(get_formats()),
-                            help='Override file extensions for input and output. '
-                                 'Default: infer from file extension.')
-
-     ## featurize
-     id_feat = CLIOption('--id', '-i',
-                         type=str,
-                         default=None,
-                         nargs='*',
-                         help='Columns to retain in output table. Default: use all')
-     feature = CLIOption('--feature', '-t',
-                         type=str,
-                         default='2d',
-                         choices=['2d', 'fp'], ## TODO: implement 3d
-                         help='Which feature type to generate.')
-
-     ## split
-     type_ = CLIOption('--type', '-t',
-                       type=str,
-                       default='random',
-                       choices=upper_and_lower(_SPLITTERS),
-                       help='Which split type to use.')
-     train = CLIOption('--train', '-a',
-                       type=float,
-                       default=1.,
-                       help='Proportion of data to use for training. ')
-     test = CLIOption('--test', '-b',
-                      type=float,
-                      default=0.,
-                      help='Proportion of data to use for testing. ')
-
-     ## collate
-     data_dir = CLIOption('--data-dir', '-d',
-                          type=str,
-                          default=None,
-                          help='Directory containing data files. '
-                               'Default: current directory')
-     id_column = CLIOption('--id-column', '-s',
-                           default=None,
-                           type=str,
-                           help='If provided, add a structure ID column with this name. '
-                                'Default: don\'t add structure IDs')
|
381 |
-
prefix_collate = CLIOption('--prefix', '-p',
|
382 |
-
default='ID-',
|
383 |
-
type=str,
|
384 |
-
help='Prefix to add to structure IDs. '
|
385 |
-
'Default: no prefix')
|
386 |
-
digits = CLIOption('--digits', '-n',
|
387 |
-
default=8,
|
388 |
-
type=int,
|
389 |
-
help='Number of digits in structure IDs. ')
|
390 |
-
keep_extra_columns = CLIOption('--keep-extra-columns', '-x',
|
391 |
-
action='store_true',
|
392 |
-
help='Whether to keep columns not mentioned in the catalog. '
|
393 |
-
'Default: drop extra columns.')
|
394 |
-
keep_invalid_smiles = CLIOption('--keep-invalid-smiles', '-y',
|
395 |
-
action='store_true',
|
396 |
-
help='Whether to keep rows with invalid SMILES. '
|
397 |
-
'Default: drop invalid rows.')
|
398 |
-
|
399 |
-
## dedup
|
400 |
-
indexes = CLIOption('--indexes', '-x',
|
401 |
-
type=str,
|
402 |
-
default=None,
|
403 |
-
nargs='*',
|
404 |
-
help='Columns to retain and collapse (if multiple values per unique structure). '
|
405 |
-
'Default: retain no other columns than structure and InchiKey.')
|
406 |
-
drop_inchikey = CLIOption('--drop-inchikey', '-d',
|
407 |
-
action='store_true',
|
408 |
-
help='Whether to drop the calculated InchiKey column. '
|
409 |
-
'Default: keep InchiKey.')
|
410 |
-
|
411 |
-
### enum
|
412 |
-
max_length = CLIOption('--max-length', '-l',
|
413 |
-
type=int,
|
414 |
-
help='Maximum length of enumerated peptide. '
|
415 |
-
'Required.')
|
416 |
-
min_length = CLIOption('--min-length', '-m',
|
417 |
-
type=int,
|
418 |
-
default=None,
|
419 |
-
help='Minimum length of enumerated peptide. '
|
420 |
-
'Default: same as maximum, i.e. all peptides same length.')
|
421 |
-
number_to_gen = CLIOption('--number', '-n',
|
422 |
-
type=float,
|
423 |
-
default=None,
|
424 |
-
help='Number of peptides to sample from all possible '
|
425 |
-
'within the constraints. If less than 1, sample '
|
426 |
-
'that fraction of all possible. If greater than 1, '
|
427 |
-
'sample that number. '
|
428 |
-
'Default: return all peptides.')
|
429 |
-
slicer = CLIOption('--slice', '-z',
|
430 |
-
type=str,
|
431 |
-
default=None,
|
432 |
-
nargs='*',
|
433 |
-
help='Subset of (possibly sampled) population to return, in the format <stop> '
|
434 |
-
'or <start> <stop> [<step>]. If "x" is used for <stop>, then it runs to the end. '
|
435 |
-
'For example, 1000 gives the first 1000, 2 600 gives items 2-600, and '
|
436 |
-
'3 500 2 gives every other from 3 to 500. Default: return all.')
|
437 |
-
alphabet = CLIOption('--alphabet', '-b',
|
438 |
-
type=str,
|
439 |
-
default=''.join(AA),
|
440 |
-
help='Alphabet to use in sampling.')
|
441 |
-
suffix = CLIOption('--suffix', '-s',
|
442 |
-
type=str,
|
443 |
-
default='',
|
444 |
-
help='Sequence to add to end. Lowercase for D-amino acids. '
|
445 |
-
'Default: no suffix.')
|
446 |
-
set_seed = CLIOption('--seed', '-e',
|
447 |
-
type=int,
|
448 |
-
default=None,
|
449 |
-
help='Seed to use for reproducible randomness. '
|
450 |
-
'Default: don\'t enable reproducibility.')
|
451 |
-
d_aa_only = CLIOption('--d-aa-only', '-a',
|
452 |
-
action='store_true',
|
453 |
-
help='Whether to only use D-amino acids. '
|
454 |
-
'Default: don\'t include.')
|
455 |
-
include_d_aa = CLIOption('--include-d-aa', '-y',
|
456 |
-
action='store_true',
|
457 |
-
help='Whether to include D-amino acids in enumeration. '
|
458 |
-
'Default: don\'t include.')
|
459 |
-
|
460 |
-
## reaction
|
461 |
-
name = CLIOption('--name', '-n',
|
462 |
-
type=str,
|
463 |
-
default=None,
|
464 |
-
help='Name of column for product. '
|
465 |
-
'Default: same as reaction name.')
|
466 |
-
reaction_opt = CLIOption('--reaction', '-x',
|
467 |
-
type=str,
|
468 |
-
nargs='*',
|
469 |
-
choices=list(REACTIONS),
|
470 |
-
default='N_to_C_cyclization',
|
471 |
-
help='Reaction(s) to apply.')
|
472 |
-
|
473 |
-
clean = CLICommand('clean',
|
474 |
-
description='Clean and normalize SMILES column of a table.',
|
475 |
-
main=_clean,
|
476 |
-
options=[output, formatting, inputs, representation, column, prefix])
|
477 |
-
convert = CLICommand('convert',
|
478 |
-
description='Convert between string representations of chemical structures.',
|
479 |
-
main=_convert,
|
480 |
-
options=[output, formatting, inputs, representation, column, prefix, to, options])
|
481 |
-
featurize = CLICommand('featurize',
|
482 |
-
description='Convert between string representations of chemical structures.',
|
483 |
-
main=_featurize,
|
484 |
-
options=[output, formatting, inputs, representation, column, prefix,
|
485 |
-
id_feat, feature])
|
486 |
-
collate = CLICommand('collate',
|
487 |
-
description='Collect disparate tables or SDF files of libraries into a single table.',
|
488 |
-
main=_collate,
|
489 |
-
options=[output, formatting, inputs, representation,
|
490 |
-
data_dir, column.replace(default='input_smiles'), id_column, prefix_collate,
|
491 |
-
digits, keep_extra_columns, keep_invalid_smiles])
|
492 |
-
dedup = CLICommand('dedup',
|
493 |
-
description='Deduplicate chemical structures and retain references.',
|
494 |
-
main=_dedup,
|
495 |
-
options=[output, formatting, inputs, representation, column, prefix,
|
496 |
-
indexes, drop_inchikey])
|
497 |
-
enum = CLICommand('enumerate',
|
498 |
-
description='Enumerate bio-chemical structures within length and sequence constraints.',
|
499 |
-
main=_enum,
|
500 |
-
options=[output, formatting, to, options,
|
501 |
-
alphabet, max_length, min_length, number_to_gen,
|
502 |
-
slicer, set_seed,
|
503 |
-
prefix.replace(default='',
|
504 |
-
help='Sequence to prepend. Lowercase for D-amino acids. '
|
505 |
-
'Default: no prefix.'),
|
506 |
-
suffix,
|
507 |
-
type_.replace(default='aa',
|
508 |
-
choices=['aa'],
|
509 |
-
help='Type of bio sequence to enumerate. '
|
510 |
-
'Default: %(default)s.'),
|
511 |
-
d_aa_only, include_d_aa])
|
512 |
-
reaction = CLICommand('react',
|
513 |
-
description='React compounds in silico in indicated columns using a named reaction.',
|
514 |
-
main=_react,
|
515 |
-
options=[output, formatting, inputs, representation, column, name,
|
516 |
-
reaction_opt])
|
517 |
-
split = CLICommand('split',
|
518 |
-
description='Split table based on chosen algorithm, optionally taking account of chemical structure during splits.',
|
519 |
-
main=_split,
|
520 |
-
options=[output, formatting, inputs, representation, column, prefix,
|
521 |
-
type_, train, test, set_seed])
|
522 |
-
|
523 |
-
app = CLIApp("schemist",
|
524 |
-
version=__version__,
|
525 |
-
description="Tools for cleaning, collating, and augmenting chemical datasets.",
|
526 |
-
commands=[clean, convert, featurize, collate, dedup, enum, reaction, split])
|
527 |
-
|
528 |
-
app.run()
|
529 |
-
|
530 |
-
return None
|
531 |
-
|
532 |
-
|
533 |
-
if __name__ == "__main__":
|
534 |
-
|
535 |
-
main()
|
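The `--slice` help above describes a small spec language for subsetting the enumerated population. A minimal sketch of how such a spec could be parsed into a Python `slice` object (the `parse_slice` helper is hypothetical, written only to illustrate the format in the help text, and is not part of schemist):

```python
def parse_slice(spec):
    """Parse a --slice spec like ["1000"] or ["3", "500", "2"] into a slice.

    A single value is the stop ("the first 1000"); the literal "x" means
    run to the end; two or three values are start, stop, and optional step.
    """
    parts = [None if p == "x" else int(p) for p in spec]
    if len(parts) == 1:
        return slice(parts[0])  # first N items
    return slice(*parts)

print(parse_slice(["1000"]))           # slice(None, 1000, None)
print(parse_slice(["3", "500", "2"]))  # slice(3, 500, 2)
```

The resulting `slice` can then be applied directly to any sequence of candidates, e.g. `candidates[parse_slice(args.slice)]`.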
schemist/collating.py
DELETED
@@ -1,317 +0,0 @@
"""Tools to collate chemical data files."""

from typing import Callable, Dict, Iterable, List, Optional, Tuple, TextIO, Union

from collections import Counter
from functools import partial
from glob import glob
import os

from carabiner.pd import read_table, resolve_delim
from carabiner import print_err
import numpy as np
from pandas import DataFrame, concat

from .converting import convert_string_representation, _FROM_FUNCTIONS
from .io import FILE_READERS

GROUPING_COLUMNS = ("filename", "file_format", "library_name", "string_representation")
ESSENTIAL_COLUMNS = GROUPING_COLUMNS + ("compound_collection", "plate_id", "well_id")


def _column_mapper(df: DataFrame,
                   cols: Iterable[str]) -> Tuple[Callable, Dict]:

    basic_map = {column: df[column].tolist()[0] for column in cols}
    inv_basic_map = {value: key for key, value in basic_map.items()}

    def column_mapper(x: DataFrame) -> DataFrame:

        new_df = DataFrame()

        for new_col, old_col in basic_map.items():

            if old_col is None or str(old_col) in ('None', 'nan', 'NA'):
                new_df[new_col] = None
            elif '+' in old_col:
                splits = old_col.split('+')
                new_df[new_col] = x[splits[0]].str.cat([x[s].astype(str)
                                                        for s in splits[1:]])
            elif ';' in old_col:
                col, char, index = old_col.split(';')
                index = [int(i) for i in index.split(':')]

                if len(index) == 1:
                    index = slice(index[0], index[0] + 1)
                else:
                    index = slice(*index)

                try:
                    new_df[new_col] = (x[col]
                                       .str.split(char)
                                       .map(lambda y: char.join(y[index] if y is not np.nan else []))
                                       .str.strip())
                except TypeError as e:
                    print_err(x[col].str.split(char))
                    raise e
            else:
                try:
                    new_df[new_col] = x[old_col].copy()
                except KeyError:
                    raise KeyError(f"Column {old_col} mapped to {new_col} is not in the input data: " + ", ".join(x.columns))

        return new_df

    return column_mapper, inv_basic_map


def _check_catalog(catalog: DataFrame,
                   catalog_smiles_column: str = 'input_smiles') -> None:

    essential_columns = (catalog_smiles_column, ) + ESSENTIAL_COLUMNS
    missing_essential_cols = [col for col in essential_columns
                              if col not in catalog]

    if len(missing_essential_cols) > 0:
        print_err(catalog.columns.tolist())
        raise KeyError("Missing required columns from catalog: " +
                       ", ".join(missing_essential_cols))

    return None


def collate_inventory(catalog: DataFrame,
                      root_dir: Optional[str] = None,
                      drop_invalid: bool = True,
                      drop_unmapped: bool = False,
                      catalog_smiles_column: str = 'input_smiles',
                      id_column_name: Optional[str] = None,
                      id_n_digits: int = 8,
                      id_prefix: str = '') -> Tuple[Counter, DataFrame]:

    f"""Process a catalog of files containing chemical libraries into a uniform dataframe.

    The catalog table needs to have columns {', '.join(ESSENTIAL_COLUMNS)}:

    - filename is a glob pattern of files to collate
    - file_format is one of {', '.join(FILE_READERS.keys())}
    - smiles_column contains SMILES strings

    Other columns are optional and can have any name, but must contain the name or a pattern
    matching a column (for tabular data) or field (for SDF data) in the files
    of the `filename` column. In the output DataFrame, the named column data will be mapped.

    Optional column contents can be either concatenated or split using the following
    patterns:

    - col1+col2: concatenates the contents of `col1` and `col2`
    - col1;-;1:2 : splits the contents of `col1` on the `-` character, and takes splits 1-2 (0-indexed)

    Parameters
    ----------
    catalog : pd.DataFrame
        Table cataloging locations and format of data. Requires
        columns {', '.join(ESSENTIAL_COLUMNS)}.
    root_dir : str, optional
        Path to look for data files. Default: current directory.
    drop_invalid : bool, optional
        Whether to drop rows containing invalid SMILES.

    Returns
    -------
    pd.DataFrame
        Collated chemical data.

    """

    root_dir = root_dir or '.'

    _check_catalog(catalog, catalog_smiles_column)

    nongroup_columns = [col for col in catalog
                        if col not in GROUPING_COLUMNS]
    loaded_dataframes = []
    report = Counter({"invalid SMILES": 0,
                      "rows processed": 0})

    grouped_catalog = catalog.groupby(list(GROUPING_COLUMNS))
    for (this_glob, this_filetype,
         this_library_name, this_representation), filename_df in grouped_catalog:

        print_err(f'\nProcessing {this_glob}:')

        this_glob = glob(os.path.join(root_dir, this_glob))

        these_filenames = sorted(f for f in this_glob
                                 if not os.path.basename(f).startswith('~$'))
        print_err('\t- ' + '\n\t- '.join(these_filenames))

        column_mapper, mapped_cols = _column_mapper(filename_df,
                                                    nongroup_columns)

        reader = FILE_READERS.get(this_filetype, read_table)

        for filename in these_filenames:

            this_data0 = reader(filename)

            if not drop_unmapped:
                unmapped_cols = {col: 'x_' + col.casefold().replace(' ', '_')
                                 for col in this_data0 if col not in mapped_cols}
                this_data = this_data0[list(unmapped_cols)].rename(columns=unmapped_cols)
                this_data = concat([column_mapper(this_data0), this_data],
                                   axis=1)
            else:
                this_data = column_mapper(this_data0)

            if this_representation.casefold() not in _FROM_FUNCTIONS:
                raise TypeError(' or '.join(sorted({this_representation, this_representation.casefold()})) +
                                " is not a supported string representation. Try one of " + ", ".join(_FROM_FUNCTIONS))

            this_converter = partial(convert_string_representation,
                                     input_representation=this_representation.casefold())

            this_data = (this_data
                         .query('compound_collection != "NA"')
                         .assign(library_name=this_library_name,
                                 input_file_format=this_filetype,
                                 input_string_representation=this_representation,
                                 plate_id=lambda x: x['plate_id'].astype(str),
                                 plate_loc=lambda x: x['library_name'].str.cat([x['compound_collection'], x['plate_id'].astype(str), x['well_id'].astype(str)], sep=':'),
                                 canonical_smiles=lambda x: list(this_converter(x[catalog_smiles_column])),
                                 is_valid_smiles=lambda x: [s is not None for s in x['canonical_smiles']]))

            report.update({"invalid SMILES": (~this_data['is_valid_smiles']).sum(),
                           "rows processed": this_data.shape[0]})

            if drop_invalid:
                this_data = this_data.query('is_valid_smiles')

            if id_column_name is not None:
                this_converter = partial(convert_string_representation,
                                         output_representation='id',
                                         options=dict(n=id_n_digits,
                                                      prefix=id_prefix))
                this_data = this_data.assign(**{id_column_name: lambda x: list(this_converter(x['canonical_smiles']))})

            loaded_dataframes.append(this_data)

    collated_df = concat(loaded_dataframes, axis=0)

    return report, collated_df


def collate_inventory_from_file(catalog_path: Union[str, TextIO],
                                root_dir: Optional[str] = None,
                                format: Optional[str] = None,
                                *args, **kwargs) -> Tuple[Counter, DataFrame]:

    f"""Process a catalog of files containing chemical libraries into a uniform dataframe.

    The catalog table needs to have columns {', '.join(ESSENTIAL_COLUMNS)}:

    - filename is a glob pattern of files to collate
    - file_format is one of {', '.join(FILE_READERS.keys())}
    - smiles_column contains SMILES strings

    Other columns are optional and can have any name, but must contain the name or a pattern
    matching a column (for tabular data) or field (for SDF data) in the files
    of the `filename` column. In the output DataFrame, the named column data will be mapped.

    Optional column contents can be either concatenated or split using the following
    patterns:

    - col1+col2: concatenates the contents of `col1` and `col2`
    - col1;-;1:2 : splits the contents of `col1` on the `-` character, and takes splits 1-2 (0-indexed)

    Parameters
    ----------
    catalog_path : str
        Path to catalog file in XLSX, TSV or CSV format. Requires
        columns {', '.join(ESSENTIAL_COLUMNS)}.
    format : str, optional
        Format of catalog file. Default: infer from file extension.
    root_dir : str, optional
        Path to look for data files. Default: use directory containing
        the catalog.

    Returns
    -------
    pd.DataFrame
        Collated chemical data.

    """

    root_dir = root_dir or os.path.dirname(catalog_path)

    data_catalog = read_table(catalog_path, format=format)

    return collate_inventory(catalog=data_catalog,
                             root_dir=root_dir,
                             *args, **kwargs)


def deduplicate(df: DataFrame,
                column: str = 'smiles',
                input_representation: str = 'smiles',
                index_columns: Optional[List[str]] = None,
                drop_inchikey: bool = False) -> Tuple[Dict[str, int], DataFrame]:

    index_columns = index_columns or []

    inchikey_converter = partial(convert_string_representation,
                                 input_representation=input_representation,
                                 output_representation='inchikey')

    df = df.assign(inchikey=lambda x: inchikey_converter(x[column]))

    structure_columns = [column, 'inchikey']
    df_unique = []

    for (string_rep, inchikey), structure_df in df.groupby(structure_columns):

        collapsed_indexes = {col: [';'.join(sorted(map(str, set(structure_df[col].tolist()))))]
                             for col in structure_df if col in index_columns}
        collapsed_indexes.update({column: [string_rep],
                                  'inchikey': [inchikey],
                                  'instance_count': [structure_df.shape[0]]})

        df_unique.append(DataFrame(collapsed_indexes))

    df_unique = concat(df_unique, axis=0)

    if drop_inchikey:
        df_unique = df_unique.drop(columns=['inchikey'])

    report = {'starting_rows': df.shape[0],
              'ending_rows': df_unique.shape[0]}

    return report, df_unique


def deduplicate_file(filename: Union[str, TextIO],
                     format: Optional[str] = None,
                     *args, **kwargs) -> Tuple[Dict[str, int], DataFrame]:

    table = read_table(filename, format=format)

    return deduplicate(table, *args, **kwargs)
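The catalog mini-language documented above (`col1+col2` for concatenation, `col;char;start:stop` for split-and-rejoin) can be illustrated on a plain dict without pandas. This is a simplified sketch of the same dispatch logic, not the schemist implementation itself:

```python
def map_value(row: dict, spec: str) -> str:
    """Resolve one catalog column spec against a single row of strings."""
    if "+" in spec:
        # "col1+col2": concatenate the named columns in order.
        return "".join(row[c] for c in spec.split("+"))
    if ";" in spec:
        # "col;char;start:stop": split on char, keep a 0-indexed range, rejoin.
        col, char, index = spec.split(";")
        idx = [int(i) for i in index.split(":")]
        sl = slice(idx[0], idx[0] + 1) if len(idx) == 1 else slice(*idx)
        return char.join(row[col].split(char)[sl]).strip()
    # Plain column name.
    return row[spec]

row = {"plate": "P1", "well": "A01", "name": "LIB-0001-X"}
print(map_value(row, "plate+well"))  # P1A01
print(map_value(row, "name;-;1"))    # 0001
```

In the real `_column_mapper` the same three branches operate on whole pandas columns via `Series.str.cat` and `Series.str.split` rather than on scalar strings.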
schemist/converting.py
DELETED
@@ -1,369 +0,0 @@
"""Converting between chemical representation formats."""

from typing import Any, Callable, Dict, Iterable, List, Optional, Union

from functools import wraps

from carabiner import print_err
from carabiner.cast import cast, flatten
from carabiner.decorators import return_none_on_error, vectorize
from carabiner.itertools import batched

from datamol import sanitize_smiles
import nemony as nm
from pandas import DataFrame
from rdkit.Chem import (Crippen, Descriptors, rdMolDescriptors,
                        Mol, MolFromInchi, MolFromHELM, MolFromSequence,
                        MolFromSmiles, MolToInchi, MolToInchiKey,
                        MolToSmiles)
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles
from requests import Session
import selfies as sf

from .rest_lookup import _inchikey2pubchem_name_id, _inchikey2cactus_name


@vectorize
@return_none_on_error
def _seq2mol(s: str) -> Union[Mol, None]:
    return MolFromSequence(s, sanitize=True)


@vectorize
@return_none_on_error
def _helm2mol(s: str) -> Union[Mol, None]:
    return MolFromHELM(s, sanitize=True)


def mini_helm2helm(s: str) -> str:

    new_s = []
    token = ''
    between_sq_brackets = False

    for letter in s:

        if letter.islower() and not between_sq_brackets:
            letter = f"[d{letter.upper()}]"

        token += letter

        if letter == '[':
            between_sq_brackets = True
        elif letter == ']':
            between_sq_brackets = False

        if not between_sq_brackets:
            new_s.append(token)
            token = ''

    return "PEPTIDE1{{{inner_helm}}}$$$$".format(inner_helm='.'.join(new_s))


@vectorize
@return_none_on_error
def _mini_helm2mol(s: str) -> Mol:
    s = mini_helm2helm(s)
    return MolFromHELM(s, sanitize=True)


@vectorize
@return_none_on_error
def _inchi2mol(s: str) -> Mol:
    return MolFromInchi(s,
                        sanitize=True,
                        removeHs=True)


@vectorize
@return_none_on_error
def _smiles2mol(s: str) -> Mol:
    return MolFromSmiles(sanitize_smiles(s))


@vectorize
@return_none_on_error
def _selfies2mol(s: str) -> Mol:
    return MolFromSmiles(sf.decoder(s))


@vectorize
@return_none_on_error
def _mol2clogp(m: Mol, **kwargs) -> float:
    return Crippen.MolLogP(m)


@vectorize
@return_none_on_error
def _mol2nonstandard_inchikey(m: Mol, **kwargs) -> str:
    return MolToInchiKey(m,
                         options="/FixedH /SUU /RecMet /KET /15T")


@vectorize
@return_none_on_error
def _mol2hash(m: Mol, **kwargs) -> str:
    nonstandard_inchikey = _mol2nonstandard_inchikey(m)
    return nm.hash(nonstandard_inchikey)


@vectorize
@return_none_on_error
def _mol2id(m: Mol,
            n: int = 8,
            prefix: str = '',
            **kwargs) -> str:
    return prefix + str(int(_mol2hash(m), 16))[:n]


@vectorize
@return_none_on_error
def _mol2isomeric_canonical_smiles(m: Mol, **kwargs) -> str:
    return MolToSmiles(m,
                       isomericSmiles=True,
                       canonical=True)


@vectorize
@return_none_on_error
def _mol2inchi(m: Mol, **kwargs) -> str:
    return MolToInchi(m)


@vectorize
@return_none_on_error
def _mol2inchikey(m: Mol, **kwargs) -> str:
    return MolToInchiKey(m)


@vectorize
@return_none_on_error
def _mol2random_smiles(m: Mol, **kwargs) -> str:
    return MolToSmiles(m,
                       isomericSmiles=True,
                       doRandom=True)


@vectorize
@return_none_on_error
def _mol2mnemonic(m: Mol, **kwargs) -> str:
    nonstandard_inchikey = _mol2nonstandard_inchikey(m)
    return nm.encode(nonstandard_inchikey)


@vectorize
@return_none_on_error
def _mol2mwt(m: Mol, **kwargs) -> float:
    return Descriptors.ExactMolWt(m)


@vectorize
@return_none_on_error
def _mol2min_charge(m: Mol, **kwargs) -> float:
    return Descriptors.MinPartialCharge(m)


@vectorize
@return_none_on_error
def _mol2max_charge(m: Mol, **kwargs) -> float:
    return Descriptors.MaxPartialCharge(m)


@vectorize
@return_none_on_error
def _mol2tpsa(m: Mol, **kwargs) -> float:
    return rdMolDescriptors.CalcTPSA(m)


def _mol2pubchem(m: Union[Mol, Iterable[Mol]],
                 session: Optional[Session] = None,
                 chunksize: int = 32) -> List[Dict[str, Union[None, int, str]]]:

    inchikeys = cast(_mol2inchikey(m), to=list)
    pubchem_ids = []

    for _inchikeys in batched(inchikeys, chunksize):
        these_ids = _inchikey2pubchem_name_id(_inchikeys,
                                              session=session)
        pubchem_ids += these_ids

    return pubchem_ids


@return_none_on_error
def _mol2pubchem_id(m: Union[Mol, Iterable[Mol]],
                    session: Optional[Session] = None,
                    chunksize: int = 32,
                    **kwargs) -> Union[str, List[str]]:
    return flatten([val['pubchem_id']
                    for val in _mol2pubchem(m,
                                            session=session,
                                            chunksize=chunksize)])


@return_none_on_error
def _mol2pubchem_name(m: Union[Mol, Iterable[Mol]],
                      session: Optional[Session] = None,
                      chunksize: int = 32,
                      **kwargs) -> Union[str, List[str]]:
    return flatten([val['pubchem_name']
                    for val in _mol2pubchem(m,
                                            session=session,
                                            chunksize=chunksize)])


@return_none_on_error
def _mol2cactus_name(m: Union[Mol, Iterable[Mol]],
                     session: Optional[Session] = None,
                     **kwargs) -> Union[str, List[str]]:
    return _inchikey2cactus_name(_mol2inchikey(m),
                                 session=session)


@vectorize
@return_none_on_error
def _mol2scaffold(m: Mol,
                  chiral: bool = True,
                  **kwargs) -> str:
    return MurckoScaffoldSmiles(mol=m,
                                includeChirality=chiral)


@vectorize
@return_none_on_error
def _mol2selfies(m: Mol, **kwargs) -> str:
    s = sf.encoder(_mol2isomeric_canonical_smiles(m))
    return s if s != -1 else None


_TO_FUNCTIONS = {"smiles": _mol2isomeric_canonical_smiles,
                 "selfies": _mol2selfies,
                 "inchi": _mol2inchi,
                 "inchikey": _mol2inchikey,
                 "nonstandard_inchikey": _mol2nonstandard_inchikey,
                 "hash": _mol2hash,
                 "mnemonic": _mol2mnemonic,
                 "id": _mol2id,
                 "scaffold": _mol2scaffold,
                 "permuted_smiles": _mol2random_smiles,
                 "pubchem_id": _mol2pubchem_id,
                 "pubchem_name": _mol2pubchem_name,
                 "cactus_name": _mol2cactus_name,
                 "clogp": _mol2clogp,
                 "tpsa": _mol2tpsa,
                 "mwt": _mol2mwt,
                 "min_charge": _mol2min_charge,
                 "max_charge": _mol2max_charge}

_FROM_FUNCTIONS = {"smiles": _smiles2mol,
                   "selfies": _selfies2mol,
                   "inchi": _inchi2mol,
                   "aa_seq": _seq2mol,
                   "helm": _helm2mol,
                   "minihelm": _mini_helm2mol}


def _x2mol(
    strings: Union[Iterable[str], str],
    input_representation: str = 'smiles'
) -> Union[Mol, None, Iterable[Union[Mol, None]]]:

    from_function = _FROM_FUNCTIONS[input_representation.casefold()]
    return from_function(strings)


def _mol2x(
    mols: Union[Iterable[Mol], Mol],
    output_representation: str = 'smiles',
    **kwargs
) -> Union[str, None, Iterable[Union[str, None]]]:

    to_function = _TO_FUNCTIONS[output_representation.casefold()]

    return to_function(mols, **kwargs)


def convert_string_representation(
    strings: Union[Iterable[str], str],
    input_representation: str = 'smiles',
    output_representation: Union[Iterable[str], str] = 'smiles',
    **kwargs
) -> Union[str, None, Iterable[Union[str, None]], Dict[str, Union[str, None, Iterable[Union[str, None]]]]]:
|
332 |
-
|
333 |
-
"""Convert between string representations of chemical structures.
|
334 |
-
|
335 |
-
"""
|
336 |
-
|
337 |
-
mols = _x2mol(cast(strings, to=list), input_representation)
|
338 |
-
# print_err(mols)
|
339 |
-
|
340 |
-
if not isinstance(output_representation, str) and isinstance(output_representation, Iterable):
|
341 |
-
mols = cast(mols, to=list)
|
342 |
-
outstrings = {rep_name: _mol2x(mols, rep_name, **kwargs)
|
343 |
-
for rep_name in output_representation}
|
344 |
-
elif isinstance(output_representation, str):
|
345 |
-
outstrings = _mol2x(mols, output_representation, **kwargs)
|
346 |
-
else:
|
347 |
-
raise TypeError(f"Specified output representation must be a string or iterable")
|
348 |
-
# print_err(outstrings)
|
349 |
-
|
350 |
-
return outstrings
|
351 |
-
|
352 |
-
|
353 |
-
def _convert_input_to_smiles(f: Callable) -> Callable:
|
354 |
-
|
355 |
-
@wraps(f)
|
356 |
-
def _f(
|
357 |
-
strings: Union[Iterable[str], str],
|
358 |
-
input_representation: str = 'smiles',
|
359 |
-
*args, **kwargs
|
360 |
-
) -> Union[str, None, Iterable[Union[str, None]]]:
|
361 |
-
|
362 |
-
smiles = convert_string_representation(
|
363 |
-
cast(strings, to=list),
|
364 |
-
output_representation='smiles',
|
365 |
-
input_representation=input_representation
|
366 |
-
)
|
367 |
-
return f(strings=smiles, *args, **kwargs)
|
368 |
-
|
369 |
-
return _f
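The `_FROM_FUNCTIONS`/`_TO_FUNCTIONS` dicts implement a hub-and-spoke dispatch: every input format is parsed to a common intermediate (`Mol`), then rendered to any output format, so adding a format costs one function, not N pairwise converters. A minimal sketch of the same pattern, with toy string "formats" standing in for the chemistry (the names here are illustrative, not part of the library):

```python
# Hub-and-spoke conversion via dispatch tables: parse to a common
# intermediate, then render. The "formats" are toy stand-ins.
from typing import Callable, Dict

# parse: format -> intermediate (here, a plain lowercase string)
FROM_FUNCTIONS: Dict[str, Callable[[str], str]] = {
    "lower": lambda s: s,
    "upper": lambda s: s.lower(),
}

# render: intermediate -> format
TO_FUNCTIONS: Dict[str, Callable[[str], str]] = {
    "lower": lambda s: s,
    "upper": lambda s: s.upper(),
}

def convert(string: str,
            input_format: str = "lower",
            output_format: str = "lower") -> str:
    intermediate = FROM_FUNCTIONS[input_format.casefold()](string)
    return TO_FUNCTIONS[output_format.casefold()](intermediate)

print(convert("ABC", input_format="upper", output_format="lower"))  # abc
```

Each converter only needs to know about the intermediate, which is why `convert_string_representation` can also fan one parse out to several output representations at once.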
schemist/features.py
DELETED
@@ -1,271 +0,0 @@
```python
"""Tools for generating chemical features."""

from typing import Any, Callable, Iterable, List, Optional, Tuple, Union
from functools import wraps

from carabiner.cast import cast
from descriptastorus.descriptors import MakeGenerator
from pandas import DataFrame, Series
import numpy as np
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
from rdkit.Chem.AllChem import FingeprintGenerator64, GetMorganGenerator, Mol

from .converting import _smiles2mol, _convert_input_to_smiles


def _feature_matrix(f: Callable[[Any], DataFrame]) -> Callable[[Any], Union[DataFrame, Tuple[np.ndarray, np.ndarray]]]:

    @wraps(f)
    def _f(prefix: Optional[str] = None,
           *args, **kwargs) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
        feature_matrix = f(*args, **kwargs)

        if prefix is not None and isinstance(feature_matrix, DataFrame):
            new_cols = {col: f"{prefix}_{col}"
                        for col in feature_matrix.columns
                        if not col.startswith('_meta')}
            feature_matrix = feature_matrix.rename(columns=new_cols)

        return feature_matrix

    return _f


def _get_descriptastorus_features(
    smiles: Iterable[str],
    generator: str
) -> Union[DataFrame, Tuple[np.ndarray, List[str]]]:
    generator = MakeGenerator((generator, ))
    features = list(map(generator.process, smiles))
    return np.stack(features, axis=0), [col for col, _ in generator.GetColumns()]


@_feature_matrix
@_convert_input_to_smiles
def calculate_2d_features(
    strings: Union[Iterable[str], str],
    normalized: bool = True,
    histogram_normalized: bool = True,
    return_dataframe: bool = False
) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
    """Calculate 2D features from string representation.

    Parameters
    ----------
    strings : str
        Input string representation(s).
    input_representation : str
        Representation type.
    normalized : bool, optional
        Whether to return normalized features. Default: `True`.
    histogram_normalized : bool, optional
        Whether to return histogram-normalized features (faster). Default: `True`.
    return_dataframe : bool, optional
        Whether to return a Pandas DataFrame instead of a NumPy array. Default: `False`.

    Returns
    -------
    DataFrame, Tuple of numpy Arrays
        If `return_dataframe = True`, a DataFrame with named feature columns, with
        the final column, `"meta_feature_valid"`, being the validity indicator.
        Otherwise returns a tuple of arrays, the first being the matrix of
        features and the second being the vector of validity indicators.

    Examples
    --------
    >>> features, validity = calculate_2d_features(strings='CCC')
    >>> features[:,:3]
    array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05]])
    >>> validity
    array([1.])
    >>> features, validity = calculate_2d_features(strings=['CCC', 'CCCO'])
    >>> features[:,:3]
    array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05],
           [7.38891722e-01, 6.00042003e-04, 5.00035002e-05]])
    >>> validity
    array([1., 1.])
    >>> calculate_2d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
    CCC     True
    CCCO    True
    Name: meta_feature_valid, dtype: bool

    """
    if normalized:
        if histogram_normalized:
            generator_name = "RDKit2DHistogramNormalized"
        else:
            generator_name = "RDKit2DNormalized"
    else:
        generator_name = "RDKit2D"

    strings = cast(strings, to=list)
    feature_matrix, columns = _get_descriptastorus_features(
        strings,
        generator=generator_name,
    )

    if return_dataframe:
        feature_matrix = DataFrame(
            feature_matrix,
            index=strings,
            columns=columns,
        )
        feature_matrix = (
            feature_matrix
            .rename(columns={f"{generator_name}_calculated": "meta_feature_valid0"})
            .assign(meta_feature_type=generator_name,
                    meta_feature_valid=lambda x: (x['meta_feature_valid0'] == 1.))
            .drop(columns=['meta_feature_valid0'])
        )
        return feature_matrix
    else:
        return feature_matrix[:, 1:], feature_matrix[:, 0]


def _fast_fingerprint(generator: FingeprintGenerator64,
                      mol: Mol,
                      to_np: bool = True) -> Union[str, np.ndarray]:
    try:
        fp_string = generator.GetFingerprint(mol).ToBitString()
    except Exception:
        return None
    else:
        if to_np:
            return np.frombuffer(fp_string.encode(), 'u1') - ord('0')
        else:
            return fp_string


@_feature_matrix
@_convert_input_to_smiles
def calculate_fingerprints(
    strings: Union[Iterable[str], str],
    fp_type: str = 'morgan',
    radius: int = 2,
    chiral: bool = True,
    on_bits: bool = True,
    return_dataframe: bool = False
) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
    """Calculate the binary fingerprint of string representation(s).

    Only Morgan fingerprints are allowed.

    Parameters
    ----------
    strings : str
        Input string representation(s).
    input_representation : str
        Representation type.
    fp_type : str, optional
        Which fingerprint type to calculate. Default: `'morgan'`.
    radius : int, optional
        Atom radius for fingerprints. Default: `2`.
    chiral : bool, optional
        Whether to take chirality into account. Default: `True`.
    on_bits : bool, optional
        Whether to return the non-zero indices instead of the full binary vector. Default: `True`.
    return_dataframe : bool, optional
        Whether to return a Pandas DataFrame instead of a NumPy array. Default: `False`.

    Returns
    -------
    DataFrame, Tuple of numpy Arrays
        If `return_dataframe = True`, a DataFrame with named feature columns, with
        the final column, `"meta_feature_valid"`, being the validity indicator.
        Otherwise returns a tuple of arrays, the first being the matrix of
        features and the second being the vector of validity indicators.

    Raises
    ------
    NotImplementedError
        If `fp_type` is not `'morgan'`.

    Examples
    --------
    >>> bits, validity = calculate_fingerprints(strings='CCC')
    >>> bits.tolist()
    ['80;294;1057;1344']
    >>> sum(validity)  # doctest: +NORMALIZE_WHITESPACE
    1
    >>> bits, validity = calculate_fingerprints(strings=['CCC', 'CCCO'])
    >>> bits.tolist()
    ['80;294;1057;1344', '80;222;294;473;794;807;1057;1277']
    >>> sum(validity)  # doctest: +NORMALIZE_WHITESPACE
    2
    >>> np.sum(calculate_fingerprints(strings=['CCC', 'CCCO'], on_bits=False)[0], axis=-1)
    array([4, 8])
    >>> calculate_fingerprints(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
    CCC     True
    CCCO    True
    Name: meta_feature_valid, dtype: bool

    """
    if fp_type.casefold() == 'morgan':
        generator_class = GetMorganGenerator
    else:
        raise NotImplementedError(f"Fingerprint type {fp_type} not supported!")

    fp_generator = generator_class(radius=radius,
                                   includeChirality=chiral)
    strings = cast(strings, to=list)
    mols = (_smiles2mol(s) for s in strings)
    fp_strings = (_fast_fingerprint(fp_generator, mol, to_np=on_bits)
                  for mol in mols)

    if on_bits:
        fingerprints = (map(str, np.flatnonzero(fp_string).tolist())
                        for fp_string in fp_strings)
        fingerprints = [';'.join(fp) for fp in fingerprints]
        validity = [len(fp) > 0 for fp in fingerprints]
    else:
        fingerprints = [np.array([int(digit) for digit in fp_string])
                        if fp_string is not None
                        else (-np.ones((fp_generator.GetOptions().fpSize, )))
                        for fp_string in fp_strings]
        validity = [np.all(fp >= 0) for fp in fingerprints]

    feature_matrix = np.stack(fingerprints, axis=0)

    if return_dataframe:
        if feature_matrix.ndim == 1:  # on_bits only
            feature_matrix = DataFrame(
                feature_matrix,
                columns=['fp_bits'],
                index=strings,
            )
        else:
            feature_matrix = DataFrame(feature_matrix,
                                       columns=[f"fp_{i}" for i, _ in enumerate(feature_matrix.T)])
        return feature_matrix.assign(meta_feature_type=fp_type.casefold(),
                                     meta_feature_valid=validity)
    else:
        return feature_matrix, validity


_FEATURE_CALCULATORS = {
    "2d": calculate_2d_features,
    "fp": calculate_fingerprints,
}


def calculate_feature(
    feature_type: str,
    return_dataframe: bool = False,
    *args, **kwargs) -> Union[DataFrame, Tuple[np.ndarray, np.ndarray]]:
    """Calculate the binary fingerprint or descriptor vector of string representation(s)."""
    featurizer = _FEATURE_CALCULATORS[feature_type]
    return featurizer(*args, **kwargs)
```
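`_fast_fingerprint` turns RDKit's `'0101...'` bit string into a 0/1 vector with a `np.frombuffer` trick: it views the ASCII bytes directly and subtracts `ord('0')`, avoiding a per-character `int()` call. A self-contained sketch of just that conversion, using a hand-written bit string rather than a real fingerprint:

```python
import numpy as np

def bitstring_to_array(fp_string: str) -> np.ndarray:
    """Convert a '0101...' bit string to a uint8 0/1 vector.

    np.frombuffer reinterprets the encoded ASCII bytes ('0' == 48,
    '1' == 49) as uint8, so subtracting ord('0') yields 0s and 1s.
    """
    return np.frombuffer(fp_string.encode(), 'u1') - ord('0')

bits = bitstring_to_array("01101000")       # toy stand-in for a fingerprint
print(bits.tolist())                        # [0, 1, 1, 0, 1, 0, 0, 0]
print(np.flatnonzero(bits).tolist())        # [1, 2, 4] -> the "on bits"
```

`np.flatnonzero` then gives the sparse "on bits" representation that `calculate_fingerprints` joins with `';'` when `on_bits=True`.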
schemist/generating.py
DELETED
@@ -1,262 +0,0 @@
```python
"""Tools for enumerating compounds. Currently only works with peptides."""

from typing import Callable, Iterable, Optional, Tuple, Union

from functools import partial
from itertools import chain, islice, product, repeat
from math import ceil, expm1, floor
from random import choice, choices, random, seed

from carabiner import print_err
from carabiner.decorators import vectorize, return_none_on_error
from carabiner.random import sample_iter
from rdkit.Chem import Mol, rdChemReactions
import numpy as np

from .converting import (_x2mol, _mol2x,
                         _convert_input_to_smiles)

AA = tuple('GALVITSMCPFYWHKRDENQ')
dAA = tuple(aa.casefold() for aa in AA)

REACTIONS = {'N_to_C_cyclization': '([N;H1:5][C:1][C:2](=[O:6])[O:3].[N;H2:4][C:7][C:8](=[O:9])[N;H1:10])>>[N;H1:5][C:1][C:2](=[O:6])[N;H1:4][C:7][C:8](=[O:9])[N;H1:10].[O;H2:3]',
             'cysteine_to_chloroacetyl_cyclization': '([N;H1:5][C:2](=[O:6])[C:1][Cl:3].[S;H1:4][C;H2:7][C:8])>>[N;H1:5][C:2](=[O:6])[C:1][S:4][C;H2:7][C:8]',
             'cysteine_to_N_cyclization': '([N;H1:5][C:2](=[O:6])[C:1][N;H2:3].[S;H1:4][C;H2:7][C:8])>>[N;H1:5][C:2](=[O:6])[C:1][S:4][C;H2:7][C:8].[N;H3:3]'}


def _get_alphabet(alphabet: Optional[Iterable[str]] = None,
                  d_aa_only: bool = False,
                  include_d_aa: bool = False) -> Tuple[str]:
    alphabet = alphabet or AA
    # lower-case (D-amino acid) versions of the supplied alphabet
    alphabet_lower = tuple(set(aa.casefold() for aa in alphabet))

    if d_aa_only:
        alphabet = alphabet_lower
    elif include_d_aa:
        alphabet = tuple(set(chain(alphabet, alphabet_lower)))

    return alphabet


def all_peptides_of_one_length(length: int,
                               alphabet: Optional[Iterable[str]] = None,
                               d_aa_only: bool = False,
                               include_d_aa: bool = False) -> Iterable[str]:
    """Lazily enumerate every peptide sequence of a given length."""
    alphabet = _get_alphabet(alphabet=alphabet,
                             d_aa_only=d_aa_only,
                             include_d_aa=include_d_aa)

    return (''.join(peptide)
            for peptide in product(alphabet, repeat=length))


def all_peptides_in_length_range(max_length: int,
                                 min_length: int = 1,
                                 by: int = 1,
                                 alphabet: Optional[Iterable[str]] = None,
                                 d_aa_only: bool = False,
                                 include_d_aa: bool = False,
                                 *args, **kwargs) -> Iterable[str]:
    """Lazily enumerate every peptide sequence in a range of lengths."""
    length_range = range(*sorted([min_length, max_length + 1]), by)
    peptide_maker = partial(all_peptides_of_one_length,
                            alphabet=alphabet,
                            d_aa_only=d_aa_only,
                            include_d_aa=include_d_aa,
                            *args, **kwargs)

    return chain.from_iterable(peptide_maker(length=length)
                               for length in length_range)


def _number_of_peptides(max_length: int,
                        min_length: int = 1,
                        by: int = 1,
                        alphabet: Optional[Iterable[str]] = None,
                        d_aa_only: bool = False,
                        include_d_aa: bool = False):
    alphabet = _get_alphabet(alphabet=alphabet,
                             d_aa_only=d_aa_only,
                             include_d_aa=include_d_aa)
    n_peptides = [len(alphabet) ** length
                  for length in range(*sorted([min_length, max_length + 1]), by)]

    return n_peptides


def _naive_sample_peptides_in_length_range(max_length: int,
                                           min_length: int = 1,
                                           by: int = 1,
                                           n: Optional[Union[float, int]] = None,
                                           alphabet: Optional[Iterable[str]] = None,
                                           d_aa_only: bool = False,
                                           include_d_aa: bool = False,
                                           set_seed: Optional[int] = None):
    alphabet = _get_alphabet(alphabet=alphabet,
                             d_aa_only=d_aa_only,
                             include_d_aa=include_d_aa)
    n_peptides = _number_of_peptides(max_length=max_length,
                                     min_length=min_length,
                                     by=by,
                                     alphabet=alphabet,
                                     d_aa_only=d_aa_only,
                                     include_d_aa=include_d_aa)
    lengths = list(range(*sorted([min_length, max_length + 1]), by))
    # weight lengths by the number of sequences they contain so that
    # sampling is uniform over sequences, not over lengths
    weight_per_length = [n / min(n_peptides) for n in n_peptides]
    weighted_lengths = list(chain.from_iterable(repeat(l, ceil(w)) for l, w in zip(lengths, weight_per_length)))

    lengths_sample = (choice(weighted_lengths) for _ in range(n))
    return (''.join(choices(list(alphabet), k=k)) for k in lengths_sample)


def sample_peptides_in_length_range(max_length: int,
                                    min_length: int = 1,
                                    by: int = 1,
                                    n: Optional[Union[float, int]] = None,
                                    alphabet: Optional[Iterable[str]] = None,
                                    d_aa_only: bool = False,
                                    include_d_aa: bool = False,
                                    naive_sampling_cutoff: float = 5e-3,
                                    reservoir_sampling: bool = True,
                                    indexes: Optional[Iterable[int]] = None,
                                    set_seed: Optional[int] = None,
                                    *args, **kwargs) -> Iterable[str]:
    """Sample peptide sequences from a range of lengths."""
    seed(set_seed)

    alphabet = _get_alphabet(alphabet=alphabet,
                             d_aa_only=d_aa_only,
                             include_d_aa=include_d_aa)

    n_peptides = sum(len(alphabet) ** length
                     for length in range(*sorted([min_length, max_length + 1]), by))
    if n is None:
        n_requested = n_peptides
    elif n >= 1.:
        n_requested = min(floor(n), n_peptides)
    elif n < 1.:
        n_requested = floor(n * n_peptides)

    frac_requested = n_requested / n_peptides

    # approximation of the birthday problem
    p_any_collision = -expm1(-n_requested * (n_requested - 1.) / (2. * n_peptides))
    n_collisions = n_requested * (1. - ((n_peptides - 1.) / n_peptides) ** (n_requested - 1.))
    frac_collisions = n_collisions / n_requested

    print_err(f"Sampling {n_requested} ({frac_requested * 100.} %) peptides from "
              f"length {min_length} to {max_length} ({n_peptides} combinations). "
              f"Probability of collision if drawing randomly is {p_any_collision}, "
              f"with {n_collisions} ({100. * frac_collisions} %) collisions on average.")

    if frac_collisions < naive_sampling_cutoff and n_peptides > 2e9:
        print_err("> Executing naive sampling.")
        peptides = _naive_sample_peptides_in_length_range(max_length, min_length, by,
                                                          n=n_requested,
                                                          alphabet=alphabet,
                                                          d_aa_only=d_aa_only,
                                                          include_d_aa=include_d_aa)
    else:
        print_err("> Executing exhaustive sampling.")
        all_peptides = all_peptides_in_length_range(max_length, min_length, by,
                                                    alphabet=alphabet,
                                                    d_aa_only=d_aa_only,
                                                    include_d_aa=include_d_aa,
                                                    *args, **kwargs)

        if n is None:
            peptides = all_peptides
        elif n >= 1.:
            if reservoir_sampling:
                peptides = sample_iter(all_peptides, k=n_requested,
                                       shuffle_output=False)
            else:
                peptides = (pep for pep in all_peptides
                            if random() <= frac_requested)
        elif n < 1.:
            peptides = (pep for pep in all_peptides
                        if random() <= n)

    if indexes is not None:
        indexes = (int(ix) if (isinstance(ix, str) and ix.isdigit()) or isinstance(ix, int) or isinstance(ix, float)
                   else None
                   for ix in islice(indexes, 3))
        indexes = [ix if (ix is None or ix >= 0) else None
                   for ix in indexes]

        if len(indexes) > 1:
            if n is not None and n >= 1. and indexes[0] > n:
                raise ValueError(f"Minimum slice ({indexes[0]}) is higher than number of items ({n}).")

        peptides = islice(peptides, *indexes)

    return peptides


def _reactor(smarts: str) -> Callable[[Mol], Union[Mol, None]]:
    rxn = rdChemReactions.ReactionFromSmarts(smarts)
    reaction_function = rxn.RunReactants

    @vectorize
    @return_none_on_error
    def reactor(s: Mol) -> Mol:
        return reaction_function([s])[0][0]

    return reactor


@_convert_input_to_smiles
def react(strings: Union[str, Iterable[str]],
          reaction: str = 'N_to_C_cyclization',
          output_representation: str = 'smiles',
          **kwargs) -> Union[str, Iterable[str]]:
    """Apply a named intramolecular reaction to the input structure(s)."""
    try:
        _this_reaction = REACTIONS[reaction]
    except KeyError:
        raise KeyError(f"Reaction {reaction} is not available. Try: " +
                       ", ".join(list(REACTIONS)))

    reactor = _reactor(_this_reaction)
    mols = _x2mol(strings)
    mols = reactor(mols)

    return _mol2x(mols,
                  output_representation=output_representation,
                  **kwargs)
```
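The enumeration in `all_peptides_of_one_length` is a lazy Cartesian product: every length-`k` sequence over an alphabet of size `a`, giving `a ** k` candidates without materialising them. A stdlib-only sketch of the same idea, on a toy three-letter alphabet (not the real 20-residue `AA` tuple):

```python
from itertools import chain, product

def peptides_of_length(length, alphabet):
    """Lazily yield every sequence of `length` residues over `alphabet`."""
    return (''.join(p) for p in product(alphabet, repeat=length))

def peptides_in_range(min_length, max_length, alphabet):
    """Chain the per-length generators over an inclusive length range."""
    return chain.from_iterable(
        peptides_of_length(k, alphabet) for k in range(min_length, max_length + 1)
    )

alphabet = ('G', 'A', 'C')  # toy 3-letter alphabet
dipeptides = list(peptides_of_length(2, alphabet))
print(len(dipeptides))   # 9 == 3 ** 2
print(dipeptides[:3])    # ['GG', 'GA', 'GC']

# lengths 1..3: 3 + 9 + 27 = 39 sequences in total
print(sum(1 for _ in peptides_in_range(1, 3, alphabet)))  # 39
```

Because both helpers return generators, downstream samplers (reservoir or rejection) can draw from the stream without ever holding all `a ** k` sequences in memory, which is what makes the exhaustive branch of `sample_peptides_in_length_range` feasible.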
schemist/io.py
DELETED
@@ -1,149 +0,0 @@
```python
"""Tools to facilitate input and output."""

from typing import Any, Callable, List, Optional, TextIO, Tuple, Union

from collections import defaultdict
from functools import partial
from string import printable
from tempfile import NamedTemporaryFile
from xml.etree import ElementTree

from carabiner import print_err
from carabiner.cast import cast
from carabiner.itertools import tenumerate
from carabiner.pd import read_table, write_stream

from pandas import DataFrame, read_excel
from rdkit.Chem import SDMolSupplier

from .converting import _mol2isomeric_canonical_smiles


def _mutate_df_stream(input_file: Union[str, TextIO],
                      output_file: Union[str, TextIO],
                      function: Callable[[DataFrame], Tuple[Any, DataFrame]],
                      file_format: Optional[str] = None,
                      chunksize: int = 1000) -> List[Any]:
    carries = []

    for i, chunk in tenumerate(read_table(input_file,
                                          format=file_format,
                                          progress=False,
                                          chunksize=chunksize)):
        result = function(chunk)

        try:
            carry, df = result
        except ValueError:
            df = result
            carry = 0

        write_stream(df,
                     output=output_file,
                     format=file_format,
                     header=i == 0,
                     mode='w' if i == 0 else 'a')

        carries.append(carry)

    return carries


def read_weird_xml(filename: Union[str, TextIO],
                   header: bool = True,
                   namespace: str = '{urn:schemas-microsoft-com:office:spreadsheet}') -> DataFrame:
    """Read a Microsoft Office spreadsheet XML export into a DataFrame."""
    with cast(filename, TextIO, mode='r') as f:
        # drop non-printable characters that would break the XML parser
        xml_string = ''.join(filter(printable.__contains__, f.read()))

    try:
        root = ElementTree.fromstring(xml_string)
    except Exception as e:
        print_err('\n!!! ' + xml_string.split('\n')[1184][377:380])
        raise e

    for i, row in enumerate(root.iter(f'{namespace}Row')):
        this_row = [datum.text for datum in row.iter(f'{namespace}Data')]

        if i == 0:
            if header:
                heading = this_row
                df = {colname: [] for colname in heading}
            else:
                heading = [f'X{j}' for j, _ in enumerate(this_row)]
                df = {colname: [datum] for colname, datum in zip(heading, this_row)}
        else:
            for colname, datum in zip(heading, this_row):
                df[colname].append(datum)

    return DataFrame(df)


def read_sdf(filename: Union[str, TextIO]):
    """Read an SDF file into a DataFrame of molecule properties plus SMILES."""
    filename = cast(filename, str)

    with open(filename, 'r', errors='replace') as f:
        with NamedTemporaryFile("w") as o:
            o.write(f.read())
            o.seek(0)

            df = defaultdict(list)

            for i, mol in enumerate(SDMolSupplier(o.name)):
                if mol is None:
                    continue

                propdict = mol.GetPropsAsDict()
                propdict['SMILES'] = _mol2isomeric_canonical_smiles(mol)

                for colname in propdict:
                    df[colname].append(propdict[colname])

                for colname in df:
                    if colname not in propdict:
                        df[colname].append(None)

    col_lengths = {col: len(val) for col, val in df.items()}

    if len(set(col_lengths.values())) > 1:
        raise ValueError("Column lengths not all the same:\n\t" +
                         '\n\t'.join(f"{key}:{val}" for key, val in col_lengths.items()))

    return DataFrame(df)


FILE_READERS = {
    'bad_xml': read_weird_xml,
    'xlsx': partial(read_excel, engine='openpyxl'),
    'sdf': read_sdf
}
```
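`_mutate_df_stream` is a chunked read-transform-append pipeline: it reads a table in fixed-size chunks, applies a function that may return a "carry" value alongside the transformed chunk, writes the header only once, and appends subsequent chunks. A pandas-only sketch of the same pattern, with carabiner's `read_table`/`write_stream` replaced by `pandas.read_csv`/`DataFrame.to_csv` (so the helper name and signature here are illustrative):

```python
# Chunked read -> transform -> append-write, sketched with pandas only.
import io
import pandas as pd

def mutate_csv_stream(input_buffer, output_buffer, function, chunksize=2):
    carries = []
    for i, chunk in enumerate(pd.read_csv(input_buffer, chunksize=chunksize)):
        carry, df = function(chunk)
        # write the header only for the first chunk, then append
        df.to_csv(output_buffer, index=False, header=(i == 0))
        carries.append(carry)
    return carries

src = io.StringIO("x\n1\n2\n3\n4\n5\n")
dst = io.StringIO()
# the transform doubles a column and carries back the chunk's row count
carries = mutate_csv_stream(src, dst, lambda df: (len(df), df.assign(x=df.x * 2)))
print(carries)  # [2, 2, 1]
```

Keeping only one chunk in memory at a time is what lets the CLI process tables far larger than RAM; the per-chunk carries can then be reduced (summed, concatenated) by the caller.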
schemist/rest_lookup.py
DELETED
@@ -1,118 +0,0 @@
"""Tools for querying PubChem."""

from typing import Dict, Iterable, List, Optional, Union
from time import sleep
from xml.etree import ElementTree

from carabiner import print_err
from carabiner.cast import cast
from carabiner.decorators import vectorize
from requests import Response, Session

_PUBCHEM_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/{inchikey}/property/{get}/{format}"
_CACTUS_URL = "https://cactus.nci.nih.gov/chemical/structure/{inchikey}/{get}"

_OVERLOAD_CODES = {500, 501, 503, 504}


def _url_request(inchikeys: Union[str, Iterable[str]],
                 url: str,
                 session: Optional[Session] = None,
                 **kwargs) -> Response:
    if session is None:
        session = Session()
    inchikeys = cast(inchikeys, to=list)
    return session.get(url.format(inchikey=','.join(inchikeys), **kwargs))


def _inchikey2pubchem_name_id(inchikeys: Union[str, Iterable[str]],
                              session: Optional[Session] = None,
                              counter: int = 0,
                              max_tries: int = 10,
                              namespace: str = "{http://pubchem.ncbi.nlm.nih.gov/pug_rest}") -> List[Dict[str, Union[None, int, str]]]:
    r = _url_request(inchikeys, url=_PUBCHEM_URL,
                     session=session,
                     get="Title,InchiKey", format="XML")
    if r.status_code == 200:
        root = ElementTree.fromstring(r.text)
        compounds = root.iter(f'{namespace}Properties')
        result_dict = dict()
        for cmpd in compounds:
            cmpd_dict = dict()
            for child in cmpd:
                cmpd_dict[child.tag.split(namespace)[1]] = child.text
            try:
                inchikey, name, pcid = cmpd_dict['InChIKey'], cmpd_dict['Title'], cmpd_dict['CID']
            except KeyError:
                print(cmpd_dict)
            else:
                result_dict[inchikey] = {'pubchem_name': name.casefold(),
                                         'pubchem_id': pcid}
        print_err(f'PubChem: Looked up InchiKeys: {",".join(inchikeys)}')
        result_list = [result_dict[inchikey]
                       if inchikey in result_dict
                       else {'pubchem_name': None, 'pubchem_id': None}
                       for inchikey in inchikeys]
        return result_list
    elif r.status_code in _OVERLOAD_CODES and counter < max_tries:
        sleep(1.)
        return _inchikey2pubchem_name_id(inchikeys,
                                         session=session,
                                         counter=counter + 1,
                                         max_tries=max_tries,
                                         namespace=namespace)
    else:
        print_err(f'PubChem: InchiKey {",".join(inchikeys)} gave status {r.status_code}')
        return [{'pubchem_name': None, 'pubchem_id': None}
                for _ in range(len(inchikeys))]


@vectorize
def _inchikey2cactus_name(inchikeys: str,
                          session: Optional[Session] = None,
                          counter: int = 0,
                          max_tries: int = 10):
    r = _url_request(inchikeys, url=_CACTUS_URL,
                     session=session,
                     get="names")
    if r.status_code == 200:
        return r.text.split('\n')[0].casefold()
    elif r.status_code in _OVERLOAD_CODES and counter < max_tries:
        sleep(1.)
        return _inchikey2cactus_name(inchikeys,
                                     session=session,
                                     counter=counter + 1,
                                     max_tries=max_tries)
    else:
        print_err(f'Cactus: InchiKey {",".join(inchikeys)} gave status {r.status_code}')
        return None
schemist/splitting.py
DELETED
@@ -1,205 +0,0 @@
"""Tools for splitting tabular datasets, optionally based on chemical features."""

from typing import Dict, Iterable, List, Optional, Tuple, Union
from collections import defaultdict
from math import ceil
from random import random, seed

try:
    from itertools import batched
except ImportError:
    from carabiner.itertools import batched

from tqdm.auto import tqdm

from .converting import convert_string_representation, _convert_input_to_smiles
from .typing import DataSplits


def _train_test_val_sizes(total: int,
                          train: float = 1.,
                          test: float = 0.) -> Tuple[int]:
    n_train = int(ceil(train * total))
    n_test = int(ceil(test * total))
    n_val = total - n_train - n_test
    return n_train, n_test, n_val


def _random_chunk(strings: str,
                  train: float = 1.,
                  test: float = 0.,
                  carry: Optional[Dict[str, List[int]]] = None,
                  start_from: int = 0) -> Dict[str, List[int]]:
    carry = carry or defaultdict(list)
    train_test: float = train + test
    for i, _ in enumerate(strings):
        random_number: float = random()
        if random_number < train:
            key = 'train'
        elif random_number < train_test:
            key = 'test'
        else:
            key = 'validation'
        carry[key].append(start_from + i)
    return carry


def split_random(strings: Union[str, Iterable[str]],
                 train: float = 1.,
                 test: float = 0.,
                 chunksize: Optional[int] = None,
                 set_seed: Optional[int] = None,
                 *args, **kwargs) -> DataSplits:
    """Randomly assign each string's index to the train, test, or validation split."""
    if set_seed is not None:
        seed(set_seed)
    if chunksize is None:
        idx = _random_chunk(strings=strings,
                            train=train,
                            test=test)
    else:
        idx = defaultdict(list)
        for i, chunk in enumerate(batched(strings, chunksize)):
            idx = _random_chunk(strings=chunk,
                                train=train,
                                test=test,
                                carry=idx,
                                start_from=i * chunksize)
    seed(None)
    return DataSplits(**idx)


@_convert_input_to_smiles
def _scaffold_chunk(strings: str,
                    carry: Optional[Dict[str, List[int]]] = None,
                    start_from: int = 0) -> Dict[str, List[int]]:
    carry = carry or defaultdict(list)
    these_scaffolds = convert_string_representation(strings=strings,
                                                    output_representation='scaffold')
    for j, scaff in enumerate(these_scaffolds):
        carry[scaff].append(start_from + j)
    return carry


def _scaffold_aggregator(scaffold_sets: Dict[str, List[int]],
                         train: float = 1.,
                         test: float = 0.,
                         progress: bool = False) -> DataSplits:
    scaffold_sets = {key: sorted(value)
                     for key, value in scaffold_sets.items()}
    scaffold_sets = sorted(scaffold_sets.items(),
                           key=lambda x: (len(x[1]), x[1][0]),
                           reverse=True)
    nrows = sum(len(idx) for _, idx in scaffold_sets)
    n_train, n_test, n_val = _train_test_val_sizes(nrows, train, test)
    idx = defaultdict(list)
    iterator = tqdm(scaffold_sets) if progress else scaffold_sets
    for _, scaffold_idx in iterator:
        if (len(idx['train']) + len(scaffold_idx)) > n_train:
            if (len(idx['test']) + len(scaffold_idx)) > n_test:
                key = 'validation'
            else:
                key = 'test'
        else:
            key = 'train'
        idx[key] += scaffold_idx
    return DataSplits(**idx)


def split_scaffold(strings: Union[str, Iterable[str]],
                   train: float = 1.,
                   test: float = 0.,
                   chunksize: Optional[int] = None,
                   progress: bool = True,
                   *args, **kwargs) -> DataSplits:
    """Assign indices to train, test, or validation so that no scaffold spans two splits."""
    if chunksize is None:
        scaffold_sets = _scaffold_chunk(strings)
    else:
        scaffold_sets = defaultdict(list)
        for i, chunk in enumerate(batched(strings, chunksize)):
            scaffold_sets = _scaffold_chunk(chunk,
                                            carry=scaffold_sets,
                                            start_from=i * chunksize)
    return _scaffold_aggregator(scaffold_sets,
                                train=train, test=test,
                                progress=progress)


_SPLITTERS = {#'simpd': split_simpd,
              'scaffold': split_scaffold,
              'random': split_random}

# _SPLIT_SUPERTYPES = {'scaffold': 'grouped',
#                      'random': 'independent'}

_GROUPED_SPLITTERS = {'scaffold': (_scaffold_chunk, _scaffold_aggregator)}

assert all(_type in _SPLITTERS
           for _type in _GROUPED_SPLITTERS)  ## Should never fail!


def split(split_type: str,
          *args, **kwargs) -> DataSplits:
    """Dispatch to the named splitter ('random' or 'scaffold')."""
    splitter = _SPLITTERS[split_type]
    return splitter(*args, **kwargs)
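_scaffold_aggregator above fills train first, then test, then validation, always moving a whole scaffold's rows together so no group is divided across partitions. The same greedy fill in a self-contained sketch — greedy_grouped_split and the toy groups are hypothetical names for illustration:

```python
from collections import defaultdict
from math import ceil

def greedy_grouped_split(groups, train=0.8, test=0.1):
    # Assign whole groups to partitions, largest group first, so that no
    # group's members end up divided between train, test, and validation.
    total = sum(len(idx) for idx in groups.values())
    n_train = int(ceil(train * total))
    n_test = int(ceil(test * total))
    out = defaultdict(list)
    for idx in sorted(groups.values(), key=len, reverse=True):
        if len(out["train"]) + len(idx) <= n_train:
            key = "train"        # still room in train
        elif len(out["test"]) + len(idx) <= n_test:
            key = "test"         # train would overflow: try test
        else:
            key = "validation"   # everything else spills here
        out[key].extend(idx)
    return dict(out)

splits = greedy_grouped_split(
    {"a": [0, 1, 2, 3], "b": [4, 5], "c": [6], "d": [7, 8, 9]},
    train=0.6, test=0.2,
)
```

With 10 rows and a 60/20 target, group "a" and then "b" fill train, "c" lands in test, and "d" (too big for either quota) spills into validation.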
schemist/tables.py
DELETED
@@ -1,228 +0,0 @@
"""Tools for processing tabular data."""

from typing import Any, Callable, Dict, Iterable, List, Mapping, Optional, Tuple, Union
from functools import partial

try:
    from itertools import batched
except ImportError:
    from carabiner.itertools import batched

from carabiner.cast import cast
from pandas import DataFrame, concat

from .cleaning import clean_smiles, clean_selfies
from .converting import convert_string_representation
from .features import calculate_feature
from .generating import sample_peptides_in_length_range, react
from .splitting import split
from .typing import DataSplits


def _get_column_values(df: DataFrame,
                       column: Union[str, List[str]]):
    try:
        column_values = df[column]
    except KeyError:
        raise KeyError(f"Column {column} does not appear to be in the data: {', '.join(df.columns)}")
    else:
        return column_values


def _get_error_tally(df: DataFrame,
                     cols: Union[str, List[str]]) -> Dict[str, int]:
    cols = cast(cols, to=list)
    try:
        tally = {col: (df[col].isna() | ~df[col]).sum() for col in cols}
    except TypeError:
        tally = {col: df[col].isna().sum() for col in cols}
    return tally


def converter(df: DataFrame,
              column: str = 'smiles',
              input_representation: str = 'smiles',
              output_representation: Union[str, Iterable[str]] = 'smiles',
              prefix: Optional[str] = None,
              options: Optional[Mapping[str, Any]] = None) -> Tuple[Dict[str, int], DataFrame]:
    """Convert a column of structure representations into one or more other representations."""
    prefix = prefix or ''
    options = options or {}
    column_values = _get_column_values(df, column)
    output_representation = cast(output_representation, to=list)
    converters = convert_string_representation(
        column_values,
        output_representation=output_representation,
        input_representation=input_representation,
        **options,
    )
    converted = {f"{prefix}{conversion_name}": cast(conversion, to=list)
                 for conversion_name, conversion in converters.items()}
    df = df.assign(**converted)
    return _get_error_tally(df, list(converted)), df


def cleaner(df: DataFrame,
            column: str = 'smiles',
            input_representation: str = 'smiles',
            prefix: Optional[str] = None) -> Tuple[Dict[str, int], DataFrame]:
    """Clean a column of SMILES or SELFIES strings."""
    if input_representation.casefold() == 'smiles':
        cleaner = clean_smiles
    elif input_representation.casefold() == 'selfies':
        cleaner = clean_selfies
    else:
        raise ValueError(f"Representation {input_representation} is not supported for cleaning.")
    prefix = prefix or ''
    new_column = f"{prefix}{column}"
    df = df.assign(**{new_column: lambda x: cast(cleaner(_get_column_values(x, column)), to=list)})
    return _get_error_tally(df, new_column), df


def featurizer(df: DataFrame,
               feature_type: str,
               column: str = 'smiles',
               ids: Optional[Union[str, List[str]]] = None,
               input_representation: str = 'smiles',
               prefix: Optional[str] = None) -> Tuple[Dict[str, int], DataFrame]:
    """Calculate chemical features for a column and append them to the table."""
    if ids is None:
        ids = df.columns.tolist()
    else:
        ids = cast(ids, to=list)
    feature_df = calculate_feature(feature_type=feature_type,
                                   strings=_get_column_values(df, column),
                                   prefix=prefix,
                                   input_representation=input_representation,
                                   return_dataframe=True)
    if len(ids) > 0:
        df = concat([df[ids], feature_df], axis=1)
    return _get_error_tally(feature_df, 'meta_feature_valid'), df


def assign_groups(df: DataFrame,
                  grouper: Callable[[Union[str, Iterable[str]]], Dict[str, Tuple[int]]],
                  group_name: str = 'group',
                  column: str = 'smiles',
                  input_representation: str = 'smiles',
                  *args, **kwargs) -> Tuple[Dict[str, Tuple[int]], DataFrame]:
    group_idx = grouper(strings=_get_column_values(df, column),
                        input_representation=input_representation,
                        *args, **kwargs)
    inv_group_idx = {i: group for group, idx in group_idx.items() for i in idx}
    groups = [inv_group_idx[i] for i in range(len(inv_group_idx))]
    return group_idx, df.assign(**{group_name: groups})


def _assign_splits(df: DataFrame,
                   split_idx: DataSplits,
                   use_df_index: bool = False) -> Tuple[Dict[str, int], DataFrame]:
    row_index = df.index if use_df_index else tuple(range(df.shape[0]))
    df = df.assign(**{f'is_{key}': [i in getattr(split_idx, key) for i in row_index]
                      for key in split_idx._fields})
    split_counts = {key: sum(df[f'is_{key}'].values) for key in split_idx._fields}
    return split_counts, df


def splitter(df: DataFrame,
             split_type: str = 'random',
             column: str = 'smiles',
             input_representation: str = 'smiles',
             *args, **kwargs) -> Tuple[Dict[str, int], DataFrame]:
    """Split a table's rows into train, test, and validation sets."""
    split_idx = split(split_type=split_type,
                      strings=_get_column_values(df, column),
                      input_representation=input_representation,
                      *args, **kwargs)
    return _assign_splits(df, split_idx=split_idx)


def reactor(df: DataFrame,
            column: str = 'smiles',
            reaction: Union[str, Iterable[str]] = 'N_to_C_cyclization',
            prefix: Optional[str] = None,
            *args, **kwargs) -> Tuple[Dict[str, int], DataFrame]:
    """Apply one or more named reactions to a column of structures."""
    prefix = prefix or ''
    reactors = {col: partial(react, reaction=col)
                for col in cast(reaction, to=list)}
    column_values = _get_column_values(df, column)
    new_columns = {f"{prefix}{col}": list(_reactor(strings=column_values, *args, **kwargs))
                   for col, _reactor in reactors.items()}
    df = df.assign(**new_columns)
    return _get_error_tally(df, reaction), df


def _peptide_table(max_length: int,
                   min_length: Optional[int] = None,
                   by: int = 1,
                   n: Optional[Union[float, int]] = None,
                   prefix: str = '',
                   suffix: str = '',
                   generator: bool = False,
                   batch_size: int = 1000,
                   *args, **kwargs) -> Union[DataFrame, Iterable]:
    min_length = min_length or max_length
    peptides = sample_peptides_in_length_range(max_length=max_length,
                                               min_length=min_length,
                                               by=by,
                                               n=n,
                                               *args, **kwargs)
    if generator:
        return (DataFrame(dict(peptide_sequence=[f"{prefix}{pep}{suffix}" for pep in peps]))
                for peps in batched(peptides, batch_size))
    else:
        peps = [f"{prefix}{pep}{suffix}"
                for pep in peptides]
        return DataFrame(dict(peptide_sequence=peps))
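assign_groups above receives a {group: row indices} mapping from the grouper and inverts it into one label per row before calling DataFrame.assign. That inversion step isolated as a stdlib-only sketch — groups_to_labels is a hypothetical name:

```python
def groups_to_labels(group_idx):
    # Invert {group: [row indices]} into a dense per-row label list,
    # mirroring the inv_group_idx step in assign_groups above.
    inv = {i: group for group, idx in group_idx.items() for i in idx}
    return [inv[i] for i in range(len(inv))]

labels = groups_to_labels({"scaffold_A": [0, 2], "scaffold_B": [1]})
```

Note this assumes the index sets cover 0..n-1 exactly once; a missing row index would raise KeyError in the final comprehension.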
schemist/typing.py
DELETED
@@ -1,7 +0,0 @@
"""Types used in schemist."""

from collections import namedtuple

DataSplits = namedtuple('DataSplits',
                        ['train', 'test', 'validation'],
                        defaults=[tuple(), tuple(), tuple()])
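Because DataSplits declares defaults, any field omitted at construction falls back to an empty tuple, so a split with no validation rows is still a valid DataSplits. A quick check of that behaviour:

```python
from collections import namedtuple

# Same definition as schemist/typing.py above.
DataSplits = namedtuple("DataSplits",
                        ["train", "test", "validation"],
                        defaults=[tuple(), tuple(), tuple()])

# 'validation' is omitted and falls back to its default empty tuple.
splits = DataSplits(train=(0, 1, 2), test=(3,))
```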
schemist/utils.py
DELETED
@@ -1 +0,0 @@
"""Miscellaneous utilities for schemist."""
test/data/AmpC_screen_table_10k.csv.gz
DELETED
Binary file (171 kB)
test/tests.py
DELETED
@@ -1,6 +0,0 @@
import doctest
import schemist as sch

if __name__ == '__main__':
    doctest.testmod(sch)
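The test script above relies on doctest.testmod collecting `>>>` examples from docstrings and running them as tests. A self-contained illustration of that mechanism — the double function is hypothetical, and the parser/runner pair is used here so the example does not depend on module lookup:

```python
import doctest

def double(x):
    """Return twice x.

    >>> double(21)
    42
    """
    return x * 2

# Parse the >>> examples out of the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {"double": double}, "double", None, 0)
runner = doctest.DocTestRunner(verbose=False)
runner.run(test)
```

After running, runner.tries counts the examples executed and runner.failures counts mismatches between actual and expected output.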