rr-ss committed on
Commit 3290550 · verified · 1 Parent(s): f929fd2

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -1,35 +1,5 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.bcool filter=lfs diff=lfs merge=lfs -text
+ *.mcool filter=lfs diff=lfs merge=lfs -text
+ doc/Polaris.png filter=lfs diff=lfs merge=lfs -text
+ doc/logo.png filter=lfs diff=lfs merge=lfs -text
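Entries like the ones added above are what `git lfs track` writes into `.gitattributes`. A minimal sketch of registering the new patterns, assuming Git-LFS is installed:

```bash
# Each command appends a "filter=lfs diff=lfs merge=lfs -text" rule to .gitattributes
git lfs track "*.bcool"
git lfs track "*.mcool"
git lfs track "doc/Polaris.png"
git lfs track "doc/logo.png"
git add .gitattributes
```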
.gitignore ADDED
@@ -0,0 +1,6 @@
+ build/*
+ dist/*
+ docs/*
+ polaris.egg-info/*
+ polaris/**/__pycache__/
+ requirements.txt
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 ai4nucleome
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,123 @@
- ---
- license: mit
- ---
+ <img src="./doc/logo.png" alt="Polaris" title="Polaris" width="400">
+
+ # A Versatile Framework for Chromatin Loop Annotation in Bulk and Single-cell Hi-C Data
+
+ <a href="https://github.com/ai4nucleome/Polaris/releases/latest">
+ <img src="https://img.shields.io/badge/Polaris-v1.0.0-green">
+ <img src="https://img.shields.io/badge/platform-Linux%20%7C%20Mac%20-green">
+ <img src="https://img.shields.io/badge/Language-python3-green">
+ <!-- <img src="https://img.shields.io/badge/dependencies-tested-green"> -->
+ </a>
+
+ 🌟 **Polaris** is a versatile and efficient command-line tool for rapid and accurate chromatin loop detection from contact maps generated by various assays, including bulk Hi-C, scHi-C, Micro-C, and DNA SPRITE. Polaris is particularly well suited to **sparse scHi-C data and low-coverage datasets**.
+
+ <div style="text-align: center;">
+     <img src="./doc/Polaris.png" alt="Polaris Model" title="Polaris Model" width="600">
+ </div>
+
+ - Usage examples for single-cell and bulk Hi-C loop annotation are in the [**example folder**](https://github.com/ai4nucleome/Polaris/tree/master/example).
+ - The scripts and data to **reproduce our analysis** can be found at [**Polaris Reproducibility**](https://zenodo.org/records/14294273).
+
+ > ❗️**NOTE❗️:** We suggest users run Polaris on **GPU**.
+ > You can run Polaris on CPU for loop annotation, but it is much slower than on GPU.
+
+ > ❗️**NOTE❗️:** If you encounter a `CUDA OUT OF MEMORY` error, please:
+ > - Check your GPU's status and available memory.
+ > - Reduce the `--batchsize` parameter. (The default value of 128 requires approximately 36GB of CUDA memory; setting it to 24 reduces the requirement to less than 10GB.)
+
+ ## Documentation
+ 📝 **Extensive documentation** can be found at [Polaris Doc](https://nucleome-polaris.readthedocs.io/en/latest/).
+
+ ## Installation
+ Polaris is developed and tested on Linux machines with Python 3.9 and relies on several libraries, including PyTorch and SciPy.
+ We **strongly recommend** that you install Polaris in a virtual environment.
+
+ We suggest using [conda](https://anaconda.org/) to create a virtual environment for it (it should also work without conda, i.e. with pip). You can run the command snippets below to install Polaris:
+
+ ```bash
+ git clone https://github.com/ai4nucleome/Polaris.git
+ cd Polaris
+ conda create -n polaris python=3.9
+ conda activate polaris
+ ```
+ -------
+ ### ❗️Important Note❗️: Downloading Polaris Network Weights
+
+ The Polaris repository uses Git Large File Storage (Git-LFS) to host its pre-trained model weight files. A standard `git clone` **will not** automatically download these large files unless Git-LFS is installed and configured.
+
+ To resolve this, please follow one of the methods below:
+
+ #### Method 1: Manual Download via Browser
+
+ 1. Directly download the pre-trained model weights (`sft_loop.pt`) from the [Polaris model directory](https://github.com/ai4nucleome/Polaris/blob/master/polaris/model/sft_loop.pt).
+ 2. Save the file to the directory:
+ ```bash
+ Polaris/polaris/model/
+ ```
+ #### Method 2: Install Git-LFS
+ 1. Install Git-LFS by following the official instructions: [Git-LFS Installation Guide](https://git-lfs.com/).
+
+ 2. After installation, either:
+
+    Re-clone the repository:
+
+    ```bash
+    git clone https://github.com/ai4nucleome/Polaris.git
+    ```
+    OR, if the repository is already cloned, run:
+
+    ```bash
+    git lfs pull
+    ```
+    This ensures all large files, including the model weights, are retrieved.
+ ----------
+
+ Install [PyTorch](https://pytorch.org/get-started/locally/) as described on their website. Depending on your CUDA version, it might be the following command:
+
+ ```bash
+ pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
+ ```
+ Install Polaris:
+ ```bash
+ pip install --use-pep517 --editable .
+ ```
+ If this fails, please try `python setup.py build` and `python setup.py install` first.
+
+ The installation requires network access to download libraries. It usually finishes within 5 minutes; it takes longer if network access is slow and/or unstable.
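+
+ To verify the installation, ask the CLI for its help text (this assumes the `polaris` entry point was installed into the active environment):
+
+ ```bash
+ # Should print the top-level usage with the `loop` and `util` subcommands
+ polaris --help
+ ```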
+
+ ## Quick Start for Loop Annotation
+ ```bash
+ polaris loop pred -i [input mcool file] -o [output path of annotated loops]
+ ```
+ It outputs predicted loops from the input contact map at 5 kb resolution.
+ ### Output format
+ The output contains tab-separated fields as follows:
+ ```
+ Chr1 Start1 End1 Chr2 Start2 End2 Score
+ ```
+ | Field         | Detail                                               |
+ |:-------------:|:----------------------------------------------------:|
+ | Chr1/Chr2     | chromosome names                                     |
+ | Start1/Start2 | start genomic coordinates                            |
+ | End1/End2     | end genomic coordinates (i.e. End1 = Start1 + resol) |
+ | Score         | Polaris's loop score [0~1]                           |
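+
+ For downstream filtering, the `.bedpe` output can be loaded with pandas. A minimal sketch (the column names are illustrative, and the 0.9 cutoff is only an example, not a recommended default):
+
+ ```python
+ import pandas as pd
+
+ cols = ["chr1", "start1", "end1", "chr2", "start2", "end2", "score"]
+ loops = pd.read_csv("loops.bedpe", sep="\t", header=None, names=cols)
+
+ # Keep only high-confidence calls, e.g. score > 0.9
+ confident = loops[loops["score"] > 0.9]
+ print(f"{len(confident)}/{len(loops)} loops pass the cutoff")
+ ```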
+
+ ## Citation
+ Yusen Hou, Audrey Baguette, Mathieu Blanchette*, & Yanlin Zhang*. __A versatile tool for chromatin loop annotation in bulk and single-cell Hi-C data__. _bioRxiv_, 2024. [Paper](https://doi.org/10.1101/2024.12.24.630215)
+ <br>
+ ```
+ @article{Hou2024Polaris,
+   title   = {A versatile tool for chromatin loop annotation in bulk and single-cell Hi-C data},
+   author  = {Yusen Hou and Audrey Baguette and Mathieu Blanchette and Yanlin Zhang},
+   journal = {bioRxiv},
+   year    = {2024},
+ }
+ ```
+
+ ## 📩 Contact
+ A GitHub issue is preferable for all problems related to using Polaris.
+
+ For other concerns, please email Yusen Hou or Yanlin Zhang ([email protected], [email protected]).
doc/Polaris.png ADDED

Git LFS Details

  • SHA256: e61b54271dfe016eaa2a86dd0f0f91082712bd6f0dc4ae4aa2ec75d9fe303e9f
  • Pointer size: 132 Bytes
  • Size of remote file: 3.4 MB
doc/logo.png ADDED

Git LFS Details

  • SHA256: 21dfbcd2a0795688fae5209771c3cf749633c7fae25738d59964146528e53b37
  • Pointer size: 131 Bytes
  • Size of remote file: 742 kB
doc/world_logo.jpg ADDED
example/APA/APA_example.ipynb ADDED
@@ -0,0 +1,69 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# APA Analysis"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "After detecting chromatin loops using Polaris, we can use Aggregated Peak Analysis (APA) to visualize the results and assess their quality. This approach allows us to aggregate the detected loops across different genomic regions and observe their behavior in a visually intuitive way."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "You can run the following command to get a quick check of loops detected by Polaris."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "%%bash \n",
+     "\n",
+     "polaris util pileup --savefig \"./GM12878_250M_chr151617_loops.pileup.png\" --p2ll True \"../loop_annotation/GM12878_250M_chr151617_loops.bedpe\" \"../loop_annotation/GM12878_250M.bcool\""
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "The result will be saved at `\"./GM12878_250M_chr151617_loops.pileup.png\"`."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": []
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "polaris",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.9.20"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
example/APA/GM12878_250M_chr151617_loops.pileup.png ADDED
example/CLI_walkthrough.ipynb ADDED
@@ -0,0 +1,171 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Polaris command line interface"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "If you type `polaris` at the command line with no arguments or with `--help`, you'll get the following quick reference of available subcommands."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Usage: polaris [OPTIONS] COMMAND [ARGS]...\n",
+       "\n",
+       "  Polaris\n",
+       "\n",
+       "  A Versatile Tool for Chromatin Loop Annotation in Bulk and Single-cell Hi-C\n",
+       "  Data\n",
+       "\n",
+       "Options:\n",
+       "  --help  Show this message and exit.\n",
+       "\n",
+       "Commands:\n",
+       "  loop  Loop annotation.\n",
+       "  util  Utilities.\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris --help"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## polaris subcommand\n",
+     "\n",
+     "For more information about a specific subcommand, type `polaris <subcommand> --help` to display the help text."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Polaris loop"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 2,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Usage: polaris loop [OPTIONS] COMMAND [ARGS]...\n",
+       "\n",
+       "  Loop annotation.\n",
+       "\n",
+       "  Annotate loops from chromosomal contact maps.\n",
+       "\n",
+       "Options:\n",
+       "  --help  Show this message and exit.\n",
+       "\n",
+       "Commands:\n",
+       "  dev    *development function* Coming soon...\n",
+       "  pool   Call loops from loop candidates by clustering\n",
+       "  pred   Predict loops from input contact map directly\n",
+       "  score  Predict loop score for each pixel in the input contact map\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris loop --help"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Polaris util"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 6,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Usage: polaris util [OPTIONS] COMMAND [ARGS]...\n",
+       "\n",
+       "  Utilities.\n",
+       "\n",
+       "  Utilities for analysis and visualization.\n",
+       "\n",
+       "Options:\n",
+       "  --help  Show this message and exit.\n",
+       "\n",
+       "Commands:\n",
+       "  cool2bcool  covert a .mcool file to a .bcool file\n",
+       "  pileup      2D pileup contact maps around given foci\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris util --help"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Detailed Instructions\n",
+     "\n",
+     "For detailed instructions for each subcommand, please refer to [Polaris Doc](https://nucleome-polaris.readthedocs.io/en/latest/) and the tutorials:\n",
+     "- [Loop Annotation tutorial](https://github.com/ai4nucleome/Polaris/blob/master/example/loop_annotation/loop_annotation.ipynb)\n",
+     "- [Aggregated Peak Analysis tutorial](https://github.com/ai4nucleome/Polaris/blob/master/example/APA/APA.ipynb) "
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": []
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "polaris",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.9.20"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
example/README.md ADDED
@@ -0,0 +1,42 @@
+ # Example Use of Loop Annotation and APA
+
+ This folder contains two subfolders that showcase example results of **Polaris** on loop prediction and aggregated peak analysis.
+
+ You can re-run **Polaris** to reproduce these results by following the commands provided in the sections below.
+
+ > **Note:** If you encounter a `CUDA OUT OF MEMORY` error, please:
+ > - Check your GPU's status and available memory.
+ > - Reduce the `--batchsize` parameter. (The default value of 128 requires approximately 36GB of CUDA memory; setting it to 24 reduces the requirement to less than 10GB.)
+
+ ## Loop Prediction on GM12878 (250M Valid Read Pairs)
+
+ ```bash
+ polaris loop pred --chrom chr15,chr16,chr17 -i ./loop_annotation/GM12878_250M.bcool -o ./loop_annotation/GM12878_250M_chr151617_loops.bedpe
+ ```
+
+ The [loop_annotation](https://github.com/compbiodsa/Polaris/tree/master/example/loop_annotation) subfolder contains the results on bulk Hi-C data of GM12878 (250M valid read pairs).
+
+ ## APA of Loops Detected by Polaris
+
+ ```bash
+ polaris util pileup --savefig ./APA/GM12878_250M_chr151617_loops.pileup.png --p2ll True ./loop_annotation/GM12878_250M_chr151617_loops.bedpe ./loop_annotation/GM12878_250M.bcool
+ ```
+
+ The [APA](https://github.com/compbiodsa/Polaris/tree/master/example/APA) subfolder contains the Aggregate Peak Analysis result of loops detected on GM12878 (250M valid read pairs) by Polaris.
+
+ <div style="text-align: center;">
+     <figure>
+         <img src="./APA/GM12878_250M_chr151617_loops.pileup.png"
+              alt="GM12878_250M_chr151617_loops"
+              title="GM12878_250M_chr151617_loops"
+              width="150">
+         <figcaption>APA of loops on GM12878 (250M Valid Read Pairs)</figcaption>
+     </figure>
+ </div>
+
+ ---
+ - **Extensive documentation** can be found at [Polaris Documentation](https://nucleome-polaris.readthedocs.io/en/latest/).
+ - You can find more detailed tutorials in the **Jupyter Notebooks located within the respective subfolders**.
example/loop_annotation/GM12878_250M.bcool ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:962c8dbb130eb024d9d931cf50ace2f1adff8a84bdbf023f6d3770d27842212d
+ size 70088396
example/loop_annotation/GM12878_250M_chr151617_loop_score.bedpe ADDED
The diff for this file is too large to render. See raw diff
 
example/loop_annotation/GM12878_250M_chr151617_loops.bedpe ADDED
The diff for this file is too large to render. See raw diff
 
example/loop_annotation/GM12878_250M_chr151617_loops_method2.bedpe ADDED
The diff for this file is too large to render. See raw diff
 
example/loop_annotation/loop_annotation.ipynb ADDED
@@ -0,0 +1,262 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Loop Annotation"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Input files"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "Polaris requires a `.mcool` file as input. You can obtain `.mcool` files in the following ways:"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### 1. Download from the 4DN Database\n",
+     "\n",
+     "- Visit the [4DN Data Portal](https://data.4dnucleome.org/).\n",
+     "- Search for and download `.mcool` files suitable for your study."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### 2. Convert Files Using cooler\n",
+     "\n",
+     "If you have data in formats such as `.pairs` or `.cool`, you can convert them to `.mcool` format using the Python library [cooler](https://cooler.readthedocs.io/en/latest/index.html). Follow these steps:\n",
+     "\n",
+     "- **Install cooler**\n",
+     "\n",
+     "  Ensure you have installed cooler using the following command:\n",
+     "  ```bash\n",
+     "  pip install cooler\n",
+     "  ```\n",
+     "- **Convert .pairs to .cool**\n",
+     "\n",
+     "  If you are starting with a .pairs file (e.g., normalized contact data with columns for chrom1, pos1, chrom2, pos2), use this command to create a .cool file:\n",
+     "  ```bash\n",
+     "  cooler cload pairs --assembly <genome_version> -c1 chrom1 -p1 pos1 -c2 chrom2 -p2 pos2 <pairs_file> <resolution>.cool\n",
+     "  ```\n",
+     "  Replace `<genome_version>` with the appropriate genome assembly (e.g., hg38) and `<resolution>` with the desired bin size in base pairs.\n",
+     "- **Generate a Multiresolution .mcool File**\n",
+     "\n",
+     "  To convert a single-resolution .cool file into a multiresolution .mcool file, use the following command:\n",
+     "\n",
+     "  ```bash\n",
+     "  cooler zoomify <input.cool>\n",
+     "  ```"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "The resulting `.mcool` file can be directly used as input for Polaris."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Loop Annotation by Polaris"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "Polaris provides two methods to generate loop annotations for an input `.mcool` file. Both methods ultimately yield consistent loop results."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Method 1: polaris loop pred\n",
+     "\n",
+     "This is the simplest approach, allowing you to directly predict loops in a single step.\n",
+     "The command below will take approximately 30 seconds, depending on your device, to identify loops in GM12878 data (250M valid read pairs)."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 5,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "use gping cuda:0\n",
+       "\n",
+       "Analysing chroms: ['chr15', 'chr16', 'chr17']\n"
+      ]
+     },
+     {
+      "name": "stderr",
+      "output_type": "stream",
+      "text": [
+       "[analyzing chr17]: 100%|██████████| 3/3 [00:24<00:00,  8.31s/it]\n",
+       "[Runing clustering on chr15]: 100%|██████████| 3/3 [00:01<00:00,  1.87it/s]\n"
+      ]
+     },
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "1830 loops saved to GM12878_250M_chr151617_loops.bedpe\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris loop pred --chrom chr15,chr16,chr17 -i GM12878_250M.bcool -o GM12878_250M_chr151617_loops.bedpe "
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "> **Note:** If you encounter a `CUDA OUT OF MEMORY` error, please:\n",
+     "> - Check your GPU's status and available memory.\n",
+     "> - Reduce the `--batchsize` parameter. (The default value of 128 requires approximately 36GB of CUDA memory; setting it to 24 reduces the requirement to less than 10GB.)"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Method 2: polaris loop score and polaris loop pool"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "This method involves two steps: generating loop scores for each pixel in the contact map and clustering these scores to call loops.\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "**Step 1: Generate Loop Scores**\n",
+     "\n",
+     "Run the following command to calculate the loop score for each pixel in the input contact map and save the result in `GM12878_250M_chr151617_loop_score.bedpe`."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 6,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "use gping cuda:0\n",
+       "\n",
+       "Analysing chroms: ['chr15', 'chr16', 'chr17']\n"
+      ]
+     },
+     {
+      "name": "stderr",
+      "output_type": "stream",
+      "text": [
+       "[analyzing chr17]: 100%|██████████| 3/3 [00:34<00:00, 11.37s/it]\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris loop score --chrom chr15,chr16,chr17 -i GM12878_250M.bcool -o GM12878_250M_chr151617_loop_score.bedpe "
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "**Step 2: Call Loops from Loop Candidates**\n",
+     "\n",
+     "Use the following command to identify loops by clustering from the generated loop score file."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 7,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stderr",
+      "output_type": "stream",
+      "text": [
+       "[Runing clustering on chr16]: 100%|██████████| 3/3 [00:01<00:00,  1.72it/s]\n"
+      ]
+     },
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "1830 loops saved to GM12878_250M_chr151617_loops_method2.bedpe\n"
+      ]
+     }
+    ],
+    "source": [
+     "%%bash\n",
+     "\n",
+     "polaris loop pool -i GM12878_250M_chr151617_loop_score.bedpe -o GM12878_250M_chr151617_loops_method2.bedpe "
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "We can see that both methods ultimately yield a consistent loop count.\n",
+     "\n",
+     "Then we can perform [Aggregate Peak Analysis](https://github.com/ai4nucleome/Polaris/blob/master/example/APA/APA.ipynb) to visualize these results."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": []
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "polaris",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.9.20"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
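The conversion steps in the notebook above produce a multi-resolution `.mcool`. A minimal sketch for checking that the 5 kb layer Polaris reads is actually present, assuming `cooler` is installed and `sample.mcool` is a hypothetical file name:

```python
import cooler

# List every resolution group stored in the multi-resolution file
print(cooler.fileops.list_coolers("sample.mcool"))
# e.g. ['/resolutions/5000', '/resolutions/10000', ...]

# Open the 5 kb layer the same way `polaris loop pred` does internally
clr = cooler.Cooler("sample.mcool::/resolutions/5000")
print(clr.chromnames, clr.binsize)
```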
polaris/loop.py ADDED
@@ -0,0 +1,306 @@
+ import sys
+ import torch
+ import cooler
+ import click
+ import numpy as np
+ import pandas as pd
+ from importlib_resources import files
+
+ from torch import nn
+ from tqdm import tqdm
+ from torch.cuda.amp import autocast
+ from torch.utils.data import DataLoader
+
+ from sklearn.neighbors import KDTree
+ from polaris.model.polarisnet import polarisnet
+ from polaris.utils.util_data import centerPredCoolDataset
+
+ def rhoDelta(data, resol, dc, radius):
+
+     pos = data[[1, 4]].to_numpy() // resol
+     posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+     NNindexes, NNdists = posTree.query_radius(pos, r=radius, return_distance=True)
+     _l = []
+     for v in NNindexes:
+         _l.append(len(v))
+     _l = np.asarray(_l)
+     data = data[_l > 5].reset_index(drop=True)  # drop candidates with too few neighbors
+
+     if data.shape[0] != 0:
+         pos = data[[1, 4]].to_numpy() // resol
+         val = data[6].to_numpy()
+
+         try:
+             posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+             NNindexes, NNdists = posTree.query_radius(pos, r=dc, return_distance=True)
+         except ValueError as e:
+             if "Found array with 0 sample(s)" in str(e):
+                 print("#" * 88, '\n#')
+                 print("#\033[91m Error!!! The data is too sparse. Please increase the value of: [t]\033[0m\n#")
+                 print("#" * 88, '\n')
+                 sys.exit(1)
+             else:
+                 raise
+
+         rhos = []  # local density: Gaussian-weighted sum of neighbor scores
+         for i in range(len(NNindexes)):
+             rhos.append(np.dot(np.exp(-(NNdists[i] / dc) ** 2), val[NNindexes[i]]))
+         rhos = np.asarray(rhos)
+
+         _r = 100
+         _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+         deltas = rhos * 0  # distance to the nearest denser candidate
+         LargerNei = rhos * 0 - 1
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 deltas[i] = _dists[i][-1] + 1
+             else:
+                 LargerNei[i] = _indexes[i][idx[0]]
+                 deltas[i] = _dists[i][idx[0]]
+         failed = np.argwhere(LargerNei == -1).flatten()
+         while len(failed) > 1 and _r < 100000:
+             _r = _r * 10
+             _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+             for i in range(len(_indexes)):
+                 idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+                 if idx.shape[0] == 0:
+                     deltas[failed[i]] = _dists[i][-1] + 1
+                 else:
+                     LargerNei[failed[i]] = _indexes[i][idx[0]]
+                     deltas[failed[i]] = _dists[i][idx[0]]
+             failed = np.argwhere(LargerNei == -1).flatten()
+
+         data['rhos'] = rhos
+         data['deltas'] = deltas
+     else:
+         data['rhos'] = []
+         data['deltas'] = []
+
+     return data
+
+ def pool(data, dc, resol, mindelta, t, output, radius, refine=True):
+     ccs = set(data.iloc[:, 0])
+
+     if data.shape[0] == 0:
+         print("#" * 88, '\n#')
+         print("#\033[91m Error!!! The file is empty. Please check your file.\033[0m\n#")
+         print("#" * 88, '\n')
+         sys.exit(1)
+     data = data[data[6] > t].reset_index(drop=True)
+     data = data[data[4] - data[1] > 11 * resol].reset_index(drop=True)
+     if data.shape[0] == 0:
+         print("#" * 88, '\n#')
+         print("#\033[91m Error!!! The data is too sparse. Please decrease: [threshold] (minimum: 0.5).\033[0m\n#")
+         print("#" * 88, '\n')
+         sys.exit(1)
+     data[['rhos', 'deltas']] = 0
+     data = data.groupby([0]).apply(rhoDelta, resol=resol, dc=dc, radius=radius).reset_index(drop=True)
+     minrho = 0
+     targetData = data.reset_index(drop=True)
+
+     loopPds = []
+     chroms = tqdm(set(targetData[0]), dynamic_ncols=True)
+     for chrom in chroms:
+         chroms.desc = f"[Running clustering on {chrom}]"
+         data = targetData[targetData[0] == chrom].reset_index(drop=True)
+
+         pos = data[[1, 4]].to_numpy() // resol
+         posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+
+         rhos = data['rhos'].to_numpy()
+         deltas = data['deltas'].to_numpy()
+         centroid = np.argwhere((rhos > minrho) & (deltas > mindelta)).flatten()  # density-peak centers
+
+         _r = 100
+         _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+         LargerNei = rhos * 0 - 1
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 pass
+             else:
+                 LargerNei[i] = _indexes[i][idx[0]]
+
+         failed = np.argwhere(LargerNei == -1).flatten()
+         while len(failed) > 1 and _r < 100000:
+             _r = _r * 10
+             _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+             for i in range(len(_indexes)):
+                 idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+                 if idx.shape[0] == 0:
+                     pass
+                 else:
+                     LargerNei[failed[i]] = _indexes[i][idx[0]]
+             failed = np.argwhere(LargerNei == -1).flatten()
+
+         LargerNei = LargerNei.astype(int)
+         label = LargerNei * 0 - 1
+         for i in range(len(centroid)):
+             label[centroid[i]] = i
+         decreasingsortedIdxRhos = np.argsort(-rhos)
+         for i in decreasingsortedIdxRhos:
+             if label[i] == -1:
+                 label[i] = label[LargerNei[i]]
+
+         val = data[6].to_numpy()
+         refinedLoop = []
+         label = label.flatten()
+         for l in set(label):
+             idx = np.argwhere(label == l).flatten()
+             if len(idx) > 0:
+                 refinedLoop.append(idx[np.argmax(val[idx])])  # keep the best-scoring pixel per cluster
+         if refine:
+             loopPds.append(data.loc[refinedLoop])
+         else:
+             loopPds.append(data.loc[centroid])
+
+     loopPd = pd.concat(loopPds).sort_values(6, ascending=False)
+     loopPd[[1, 2, 4, 5]] = loopPd[[1, 2, 4, 5]].astype(int)
+     loopPd[[0, 1, 2, 3, 4, 5, 6]].to_csv(output, sep='\t', header=False, index=False)
+
+     ccs_ = set(loopPd.iloc[:, 0])
+     badc = ccs.difference(ccs_)
+
+     return len(loopPd), badc, ccs
+
+
+ @click.command()
+ @click.option('-b', '--batchsize', type=int, default=128, help='Batch size [128]')
+ @click.option('-C', '--cpu', type=bool, default=False, help='Use CPU [False]')
+ @click.option('-G', '--gpu', type=str, default=None, help='Comma-separated GPU indices [auto select]')
+ @click.option('-c', '--chrom', type=str, default=None, help='Comma separated chroms [all autosomes]')
+ @click.option('-nw', '--workers', type=int, default=16, help='Number of cpu threads [16]')
+ @click.option('-t', '--threshold', type=float, default=0.6, help='Loop Score Threshold [0.6]')
+ @click.option('-s', '--sparsity', type=float, default=0.9, help='Allowed sparsity of submatrices [0.9]')
+ @click.option('-md', '--max_distance', type=int, default=3000000, help='Max distance (bp) between contact pairs [3000000]')
+ @click.option('-r', '--resol', type=int, default=5000, help='Resolution [5000]')
+ @click.option('-dc', '--distance_cutoff', type=int, default=5, help='Distance cutoff for local density calculation in terms of bin. [5]')
+ @click.option('-R', '--radius', type=int, default=2, help='Radius threshold to remove outliers. [2]')
+ @click.option('-d', '--mindelta', type=float, default=5, help='Min distance allowed between two loops [5]')
+ @click.option('--raw', type=bool, default=False, help='Raw matrix or balanced matrix')
+ @click.option('-i', '--input', type=str, required=True, help='Hi-C contact map path')
+ @click.option('-o', '--output', type=str, required=True, help='.bedpe file path to save loops')
+ def pred(batchsize, cpu, gpu, chrom, threshold, sparsity, workers, max_distance, resol, distance_cutoff, radius, mindelta, input, output, raw, image=224):
+     """Predict loops from input contact map directly
+     """
+     print('\npolaris loop pred START :)')
+
+     center_size = image // 2  # only the center crop of each window is scored
+     start_idx = (image - center_size) // 2
+     end_idx = (image + center_size) // 2
+     slice_obj_pred = (slice(None), slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+     slice_obj_coord = (slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+
+     results = []
+
+     if cpu:
+         assert gpu is None, "\033[91m QAQ The CPU and GPU modes cannot be used simultaneously. Please check the command. \033[0m\n"
+         gpu = ['None']
+         device = torch.device("cpu")
+         print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+     else:
+         if torch.cuda.is_available():
+             if gpu is not None:
+                 print("Using the specified GPU: " + gpu)
+                 gpu = [int(i) for i in gpu.split(',')]
+                 device = torch.device(f"cuda:{gpu[0]}")
+             else:
+                 gpuIdx = torch.cuda.current_device()
+                 device = torch.device(gpuIdx)
+                 print("Automatically selected GPU: " + str(gpuIdx))
+                 gpu = [gpu]
+         else:
+             device = torch.device("cpu")
+             gpu = ['None']
+             cpu = True
+             print('GPU is not available!')
+             print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+
+     coolfile = cooler.Cooler(input + '::/resolutions/' + str(resol))
+     modelstate = str(files('polaris').joinpath('model/sft_loop.pt'))
+     _modelstate = torch.load(modelstate, map_location=device.type)
+     parameters = _modelstate['parameters']
+
+     if chrom is None:
+         chrom = coolfile.chromnames
+     else:
+         chrom = chrom.split(',')
+
+     # for rmchr in ['chrMT','MT','chrM','M','Y','chrY','X','chrX','chrW','W','chrZ','Z']:
+     #     if rmchr in chrom:
+     #         chrom.remove(rmchr)
+
+     print(f"Analysing chroms: {chrom}")
+
+     model = polarisnet(
+         image_size=parameters['image_size'],
+         in_channels=parameters['in_channels'],
+         out_channels=parameters['out_channels'],
+         embed_dim=parameters['embed_dim'],
+         depths=parameters['depths'],
+         channels=parameters['channels'],
+         num_heads=parameters['num_heads'],
+         drop=parameters['drop'],
+         drop_path=parameters['drop_path'],
+         pos_embed=parameters['pos_embed']
+     ).to(device)
+     model.load_state_dict(_modelstate['model_state_dict'])
+     if not cpu and len(gpu) > 1:
+         model = nn.DataParallel(model, device_ids=gpu)
+     model.eval()
+
+     print('\n********score START********')
+
+     badc = []
+     chrom_ = tqdm(chrom, dynamic_ncols=True)
+     for _chrom in chrom_:
+         test_data = centerPredCoolDataset(coolfile, _chrom, max_distance_bin=max_distance // resol, w=image, step=center_size, s=sparsity, raw=raw)
+         test_dataloader = DataLoader(test_data, batch_size=batchsize, shuffle=False, num_workers=workers, prefetch_factor=4, pin_memory=(gpu is not None))
+
+         chrom_.desc = f"[Analyzing {_chrom} with {len(test_data)} submatrices]"
+
+         if len(test_data) == 0:
+             badc.append(_chrom)
+
+         with torch.no_grad():
+             for X in test_dataloader:
+                 bin_i, bin_j, targetX = X
+                 bin_i = bin_i * resol
+                 bin_j = bin_j * resol
+                 with autocast():
+                     pred = torch.sigmoid(model(targetX.float().to(device)))[slice_obj_pred].flatten()
+                     loop = torch.nonzero(pred > threshold).flatten().cpu()
+                     prob = pred[loop].cpu().numpy().flatten().tolist()
+                     frag1 = bin_i[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+                     frag2 = bin_j[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+
+                 for i in range(len(frag1)):
+                     # if frag1[i] < frag2[i] and frag2[i]-frag1[i] > 11*resol and frag2[i]-frag1[i] < max_distance:
+                     if frag1[i] < frag2[i] and frag2[i] - frag1[i] < max_distance:
+                         results.append([_chrom, frag1[i], frag1[i] + resol,
+                                         _chrom, frag2[i], frag2[i] + resol,
+                                         prob[i]])
+     if len(badc) == len(chrom):
+         raise ValueError("score FAILED :(\nThe '-s' value needs to be increased for more sparse data.")
+     else:
+         print('********score FINISHED********')
+         if len(badc) > 0:
+             print(f"· But the sizes of {badc} are too small or their contact matrices are too sparse.\n· You may need to check the data or run these chr respectively by increasing -s.")
+     print('********pool START********')
+
+     df = pd.DataFrame(results)
+     loopNum, badcp, ccs = pool(df, distance_cutoff, resol, mindelta, threshold, output, radius)
+     if len(badcp) == len(ccs):
+         raise ValueError("pool FAILED :(\nPlease check input and mcool file to yield scoreFile. Or use higher '-s' value for more sparse mcool data.")
+     else:
+         print('********pool FINISHED********')
+         if len(badcp) > 0:
+             print(f"· But the loop scores of {badcp} are too sparse.\n· You may need to check the mcool data or re-run polaris loop score by increasing -s.")
+
+
+     print(f'\npolaris loop pred FINISHED :)\n{loopNum} loops saved to {output}')
+
+ if __name__ == '__main__':
+     pred()
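`rhoDelta` and `pool` above implement a density-peak style clustering: each candidate pixel gets a local density (`rhos`) and a distance to the nearest denser candidate (`deltas`), and points that score high on both become cluster centers. A self-contained toy sketch of that idea on synthetic 2-D points (illustrative only, not the Polaris implementation):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
# Two blobs of toy "loop candidates" in bin coordinates
pts = np.vstack([rng.normal(10, 1, (20, 2)), rng.normal(40, 1, (20, 2))])

tree = KDTree(pts, metric='chebyshev')
dc = 3  # local density radius, analogous to -dc above

# rho: Gaussian-weighted neighbor count within dc
_, dists = tree.query_radius(pts, r=dc, return_distance=True)
rho = np.array([np.exp(-(d / dc) ** 2).sum() for d in dists])

# delta: Chebyshev distance to the nearest point of higher density
order = np.argsort(-rho)
delta = np.empty(len(pts))
for rank, i in enumerate(order):
    if rank == 0:  # densest point: use the largest distance to any point
        delta[i] = np.abs(pts - pts[i]).max(axis=1).max()
    else:
        delta[i] = np.abs(pts[order[:rank]] - pts[i]).max(axis=1).min()

# Centers are simultaneously dense and far from any denser point
centers = np.argwhere((rho > rho.mean()) & (delta > 5)).flatten()
print(len(centers), "centers found")  # expect 2, one per blob
```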
polaris/loopDev.py ADDED
@@ -0,0 +1,200 @@
+ import torch
+ import click
+ import cooler
+ import warnings
+ import numpy as np
+ from torch import nn
+ from tqdm import tqdm
+ from torch.cuda.amp import autocast
+ from importlib_resources import files
+ from polaris.utils.util_loop import bedpewriter
+ from polaris.model.polarisnet import polarisnet
+ from scipy.sparse import coo_matrix
+ from scipy.sparse import SparseEfficiencyWarning
+ warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+ def getLocal(mat, i, jj, w, N):
+     # Extract a w x w window at (i, jj), zero-padding parts outside the N x N matrix
+     if i >= 0 and jj >= 0 and i + w <= N and jj + w <= N:
+         mat = mat[i:i + w, jj:jj + w].toarray()
+         # print(f"global: {mat.shape}")
+         return mat[None, ...]
+     # pad_width = ((up, down), (left, right))
+     slice_pos = [[i, i + w], [jj, jj + w]]
+     pad_width = [[0, 0], [0, 0]]
+     if i < 0:
+         pad_width[0][0] = -i
+         slice_pos[0][0] = 0
+     if jj < 0:
+         pad_width[1][0] = -jj
+         slice_pos[1][0] = 0
+     if i + w > N:
+         pad_width[0][1] = i + w - N
+         slice_pos[0][1] = N
+     if jj + w > N:
+         pad_width[1][1] = jj + w - N
+         slice_pos[1][1] = N
+     _mat = mat[slice_pos[0][0]:slice_pos[0][1], slice_pos[1][0]:slice_pos[1][1]].toarray()
+     padded_mat = np.pad(_mat, pad_width, mode='constant', constant_values=0)
+     # print(f"global: {padded_mat.shape}", slice_pos, pad_width)
+     return padded_mat[None, ...]
+
+ def upperCoo2symm(row, col, data, N=None):
+     # Mirror upper-triangular COO entries into a symmetric CSR matrix
+     if N:
+         shape = (N, N)
+     else:
+         shape = (row.max() + 1, col.max() + 1)
+
+     sparse_matrix = coo_matrix((data, (row, col)), shape=shape)
+     symm = sparse_matrix + sparse_matrix.T
+     diagVal = symm.diagonal(0) / 2  # the diagonal was added twice above
+     symm = symm.tocsr()
+     symm.setdiag(diagVal)
+     return symm
+
+ def processCoolFile(coolfile, cchrom):
+     extent = coolfile.extent(cchrom)
+     N = extent[1] - extent[0]
+     ccdata = coolfile.matrix(balance=True, sparse=True, as_pixels=True).fetch(cchrom)
+     ccdata['balanced'] = ccdata['balanced'].fillna(0)
+     ccdata['bin1_id'] -= extent[0]
+     ccdata['bin2_id'] -= extent[0]
+
+     ccdata['distance'] = ccdata['bin2_id'] - ccdata['bin1_id']
+     d_means = ccdata.groupby('distance')['balanced'].transform('mean')
+     ccdata['oe'] = ccdata['balanced'] / d_means  # observed/expected normalization
+     ccdata['oe'] = ccdata['oe'].fillna(0)
+     ccdata['oe'] = ccdata['oe'] / ccdata['oe'].max()
+     oeMat = upperCoo2symm(ccdata['bin1_id'].ravel(), ccdata['bin2_id'].ravel(), ccdata['oe'].ravel(), N)
+
+     return oeMat, N
+
+ @click.command()
+ @click.option('--batchsize', type=int, default=16, help='Batch size [16]')
+ @click.option('--cpu', type=bool, default=False, help='Use CPU [False]')
+ @click.option('--gpu', type=str, default=None, help='Comma-separated GPU indices [auto select]')
+ @click.option('--chrom', type=str, default=None, help='Comma separated chroms')
+ @click.option('--max_distance', type=int, default=3000000, help='Max distance (bp) between contact pairs')
+ @click.option('--resol', type=int, default=500, help='Resolution')
+ @click.option('--image', type=int, default=1024, help='Window size [1024]')
+ @click.option('--center_size', type=int, default=224, help='Center crop size [224]')
+ @click.option('-i', '--input', type=str, required=True, help='Hi-C contact map path')
+ @click.option('-o', '--output', type=str, required=True, help='.bedpe file path to save loop candidates')
+ def dev(batchsize, cpu, gpu, chrom, max_distance, resol, input, output, image, center_size):
+     """ *development function* Coming soon...
+     """
+     print('polaris loop dev START :) ')
+
+     # center_size = 224
+     # center_size = image // 2
+     start_idx = (image - center_size) // 2
+     end_idx = (image + center_size) // 2
+     slice_obj_pred = (slice(None), slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+     slice_obj_coord = (slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+
+     max_distance_bin = max_distance // resol
+
+     loopwriter = bedpewriter(output, resol, max_distance)
+
+     if cpu:
+         assert gpu is None, "\033[91m QAQ The CPU and GPU modes cannot be used simultaneously. Please check the command. \033[0m\n"
+         gpu = ['None']
+         device = torch.device("cpu")
+         print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+     else:
+         if torch.cuda.is_available():
+             if gpu is not None:
+                 print("Using the specified GPU: " + gpu)
+                 gpu = [int(i) for i in gpu.split(',')]
+                 device = torch.device(f"cuda:{gpu[0]}")
+             else:
+                 gpuIdx = torch.cuda.current_device()
+                 device = torch.device(gpuIdx)
+                 print("Automatically selected GPU: " + str(gpuIdx))
+                 gpu = [gpu]
+         else:
+             device = torch.device("cpu")
+             gpu = ['None']
+             cpu = True
+             print('GPU is not available!')
+             print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+
+     coolfile = cooler.Cooler(input + '::/resolutions/' + str(resol))
+     modelstate = str(files('polaris').joinpath('model/sft_loop.pt'))
+     _modelstate = torch.load(modelstate, map_location=device.type)
+     parameters = _modelstate['parameters']
+
+     if chrom is None:
+         chrom = coolfile.chromnames
+     else:
+         chrom = chrom.split(',')
+     for rmchr in ['chrMT', 'MT', 'chrM', 'M', 'Y', 'chrY', 'X', 'chrX']:  # 'Y','chrY','X','chrX'
+         if rmchr in chrom:
+             chrom.remove(rmchr)
+     print(f"\nAnalysing chroms: {chrom}")
+
+     model = polarisnet(
+         image_size=parameters['image_size'],
+         in_channels=parameters['in_channels'],
+         out_channels=parameters['out_channels'],
+         embed_dim=parameters['embed_dim'],
+         depths=parameters['depths'],
+         channels=parameters['channels'],
+         num_heads=parameters['num_heads'],
+         drop=parameters['drop'],
+         drop_path=parameters['drop_path'],
+         pos_embed=parameters['pos_embed']
+     ).to(device)
+     model.load_state_dict(_modelstate['model_state_dict'])
+     if not cpu and len(gpu) > 1:
+         model = nn.DataParallel(model, device_ids=gpu)
+     model.eval()
+
+     chrom = tqdm(chrom, dynamic_ncols=True)
+     for _chrom in chrom:
+         chrom.desc = f"[analyzing {_chrom}]"
+
+         oeMat, N = processCoolFile(coolfile, _chrom)
+         start_point = -(image - center_size) // 2
+         joffset = np.repeat(np.linspace(0, image, image, endpoint=False, dtype=int)[np.newaxis, :], image, axis=0)
+         ioffset = np.repeat(np.linspace(0, image, image, endpoint=False, dtype=int)[:, np.newaxis], image, axis=1)
+         data, i_list, j_list = [], [], []
+
+         for i in range(start_point, N - image - start_point, center_size):
+             for j in range(0, max_distance_bin, center_size):
+                 jj = j + i
+                 # if jj + w <= N and i + w <= N:
+                 _oeMat = getLocal(oeMat, i, jj, image, N)
+                 if np.sum(_oeMat == 0) <= (image * image * 0.9):
+                     data.append(_oeMat)
+                     i_list.append(i + ioffset)
+                     j_list.append(jj + joffset)
+
+             while len(data) >= batchsize or (i + center_size > N - image - start_point and len(data) > 0):
+                 bin_i = torch.tensor(np.stack(i_list[:batchsize], axis=0)).to(device)
+                 bin_j = torch.tensor(np.stack(j_list[:batchsize], axis=0)).to(device)
+                 targetX = torch.tensor(np.stack(data[:batchsize], axis=0)).to(device)
+                 bin_i = bin_i * resol
+                 bin_j = bin_j * resol
+
+                 data = data[batchsize:]
+                 i_list = i_list[batchsize:]
+                 j_list = j_list[batchsize:]
+
+                 print(targetX.shape)
+                 print(bin_i.shape)
+                 print(bin_j.shape)
+
+                 with torch.no_grad():
+                     with autocast():
+                         pred = torch.sigmoid(model(targetX.float().to(device)))[slice_obj_pred].flatten()
+                         loop = torch.nonzero(pred > 0.5).flatten().cpu()
+                         prob = pred[loop].cpu().numpy().flatten().tolist()
+                         frag1 = bin_i[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+                         frag2 = bin_j[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+
+                     loopwriter.write(_chrom, frag1, frag2, prob)
+
+
+ if __name__ == '__main__':
+     dev()
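`upperCoo2symm` above mirrors upper-triangular Hi-C pixels into a full symmetric matrix and halves the doubled diagonal. A quick sanity check on a toy 3×3 matrix (the `polaris.loopDev` import path is assumed from the file location above):

```python
import numpy as np
from polaris.loopDev import upperCoo2symm  # import path assumed from the file above

# Upper-triangular COO entries of a 3x3 contact matrix, diagonal included
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 2])
val = np.array([4.0, 1.0, 2.0, 6.0])

dense = upperCoo2symm(row, col, val, N=3).toarray()
print(dense)
assert np.allclose(dense, dense.T)                # symmetric
assert dense[0, 0] == 4.0 and dense[2, 0] == 1.0  # diagonal kept once, lower triangle filled
```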
polaris/loopPool.py ADDED
@@ -0,0 +1,178 @@
+ import sys
+ import click
+ import numpy as np
+ from sklearn.neighbors import KDTree
+ import pandas as pd
+ from tqdm import tqdm
+
+ def rhoDelta(data, resol, dc, radius):
+
+     pos = data[[1, 4]].to_numpy() // resol
+     posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+     NNindexes, NNdists = posTree.query_radius(pos, r=radius, return_distance=True)
+     _l = []
+     for v in NNindexes:
+         _l.append(len(v))
+     _l = np.asarray(_l)
+     data = data[_l > 5].reset_index(drop=True)  # drop candidates with too few neighbors
+
+     if data.shape[0] != 0:
+         pos = data[[1, 4]].to_numpy() // resol
+         val = data[6].to_numpy()
+
+         try:
+             posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+             NNindexes, NNdists = posTree.query_radius(pos, r=dc, return_distance=True)
+         except ValueError as e:
+             if "Found array with 0 sample(s)" in str(e):
+                 print("#" * 88, '\n#')
+                 print("#\033[91m Error!!! The data is too sparse. Please decrease the value of: [t]\033[0m\n#")
+                 print("#" * 88, '\n')
+                 sys.exit(1)
+             else:
+                 raise
+
+         rhos = []  # local density: Gaussian-weighted sum of neighbor scores
+         for i in range(len(NNindexes)):
+             rhos.append(np.dot(np.exp(-(NNdists[i] / dc) ** 2), val[NNindexes[i]]))
+         rhos = np.asarray(rhos)
+
+         _r = 100
+         _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+         deltas = rhos * 0  # distance to the nearest denser candidate
+         LargerNei = rhos * 0 - 1
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 deltas[i] = _dists[i][-1] + 1
+             else:
+                 LargerNei[i] = _indexes[i][idx[0]]
+                 deltas[i] = _dists[i][idx[0]]
+         failed = np.argwhere(LargerNei == -1).flatten()
+         while len(failed) > 1 and _r < 100000:
+             _r = _r * 10
+             _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+             for i in range(len(_indexes)):
+                 idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+                 if idx.shape[0] == 0:
+                     deltas[failed[i]] = _dists[i][-1] + 1
+                 else:
+                     LargerNei[failed[i]] = _indexes[i][idx[0]]
+                     deltas[failed[i]] = _dists[i][idx[0]]
+             failed = np.argwhere(LargerNei == -1).flatten()
+
+         data['rhos'] = rhos
+         data['deltas'] = deltas
+     else:
+         data['rhos'] = []
+         data['deltas'] = []
+     return data
+
+
+
+ @click.command()
+ @click.option('-dc', '--distance_cutoff', type=int, default=5, help='Distance cutoff for local density calculation in terms of bin. [5]')
+ @click.option('-t', '--threshold', type=float, default=0.6, help='Loop score threshold [0.6]')
+ @click.option('-r', '--resol', default=5000, help='Resolution [5000]')
+ @click.option('-R', '--radius', type=int, default=2, help='Radius threshold to remove outliers. [2]')
+ @click.option('-d', '--mindelta', type=float, default=5, help='Min distance allowed between two loops [5]')
+ @click.option('-i', '--candidates', type=str, required=True, help='Loop candidates file path')
+ @click.option('-o', '--output', type=str, required=True, help='.bedpe file path to save loops')
+ def pool(distance_cutoff, candidates, resol, mindelta, threshold, output, radius, refine=True):
+     """Call loops from loop candidates by clustering
+     """
+     print('\npolaris loop pool START :) ')
+
+     data = pd.read_csv(candidates, sep='\t', header=None)
+
+     ccs = set(data.iloc[:, 0])
+
+     if data.shape[0] == 0:
+         print("#" * 88, '\n#')
+         print("#\033[91m Error!!! The file is empty. Please check your file.\033[0m\n#")
+         print("#" * 88, '\n')
+         sys.exit(1)
+     data = data[data[6] > threshold].reset_index(drop=True)
+     data = data[data[4] - data[1] > 11 * resol].reset_index(drop=True)
+     if data.shape[0] == 0:
+         print("#" * 88, '\n#')
+         print("#\033[91m Error!!! The data is too sparse. Please decrease: [threshold] (minimum: 0.5).\033[0m\n#")
+         print("#" * 88, '\n')
+         sys.exit(1)
+     data[['rhos', 'deltas']] = 0
+     data = data.groupby([0]).apply(rhoDelta, resol=resol, dc=distance_cutoff, radius=radius).reset_index(drop=True)
+     minrho = 0
+     targetData = data.reset_index(drop=True)
+
+     loopPds = []
+     chroms = tqdm(set(targetData[0]), dynamic_ncols=True)
+     for chrom in chroms:
+         chroms.desc = f"[Running clustering on {chrom}]"
+         data = targetData[targetData[0] == chrom].reset_index(drop=True)
+
+         pos = data[[1, 4]].to_numpy() // resol
+         posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+
+         rhos = data['rhos'].to_numpy()
+         deltas = data['deltas'].to_numpy()
+         centroid = np.argwhere((rhos > minrho) & (deltas > mindelta)).flatten()  # density-peak centers
+
+         _r = 100
+         _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+         LargerNei = rhos * 0 - 1
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 pass
+             else:
+                 LargerNei[i] = _indexes[i][idx[0]]
+
+         failed = np.argwhere(LargerNei == -1).flatten()
+         while len(failed) > 1 and _r < 100000:
+             _r = _r * 10
+             _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+             for i in range(len(_indexes)):
+                 idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+                 if idx.shape[0] == 0:
+                     pass
+                 else:
+                     LargerNei[failed[i]] = _indexes[i][idx[0]]
+             failed = np.argwhere(LargerNei == -1).flatten()
+
+         LargerNei = LargerNei.astype(int)
+         label = LargerNei * 0 - 1
+         for i in range(len(centroid)):
+             label[centroid[i]] = i
+         decreasingsortedIdxRhos = np.argsort(-rhos)
+         for i in decreasingsortedIdxRhos:
+             if label[i] == -1:
+                 label[i] = label[LargerNei[i]]
+
+         val = data[6].to_numpy()
+         refinedLoop = []
+         label = label.flatten()
+         for l in set(label):
+             idx = np.argwhere(label == l).flatten()
+             if len(idx) > 0:
+                 refinedLoop.append(idx[np.argmax(val[idx])])  # keep the best-scoring pixel per cluster
+         if refine:
+             loopPds.append(data.loc[refinedLoop])
+         else:
+             loopPds.append(data.loc[centroid])
+
+     loopPd = pd.concat(loopPds).sort_values(6, ascending=False)
+     loopPd[[1, 2, 4, 5]] = loopPd[[1, 2, 4, 5]].astype(int)
+     loopPd[[0, 1, 2, 3, 4, 5, 6]].to_csv(output, sep='\t', header=False, index=False)
+
+     ccs_ = set(loopPd.iloc[:, 0])
+     badc = ccs.difference(ccs_)
+     if len(badc) == len(ccs):
+         raise ValueError("polaris loop pool FAILED :(\nPlease check input and mcool file to yield scoreFile. Or use higher '-s' value for more sparse mcool data.")
+     else:
+         print(f'\npolaris loop pool FINISHED :)\n{len(loopPd)} loops saved to {output}')
+         if len(badc) > 0:
+             print(f"But the loop scores of {badc} are too sparse.\nYou may need to check the mcool data or re-run polaris loop score by increasing -s.")
+
+
+ if __name__ == '__main__':
+     pool()
polaris/loopPool_proof_wang_duplicate.py.bak ADDED
@@ -0,0 +1,192 @@
+ import sys
+ import click
+ import numpy as np
+ from sklearn.neighbors import KDTree
+ import pandas as pd
+ from tqdm import tqdm
+
+ def rhoDelta(data,resol,dc,radius):
+
+     pos = data[[1, 4]].to_numpy() // resol
+     val = data[6].to_numpy()
+
+     try:
+         posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+         NNindexes, NNdists = posTree.query_radius(pos, r=dc, return_distance=True)
+     except ValueError as e:
+         if "Found array with 0 sample(s)" in str(e):
+             print("#"*88,'\n#')
+             print("#\033[91m Error!!! The data is too sparse. Please decrease the value of: [t]\033[0m\n#")
+             print("#"*88,'\n')
+             sys.exit(1)
+         else:
+             raise
+
+     rhos = []
+     for i in range(len(NNindexes)):
+         rhos.append(np.dot(np.exp(-(NNdists[i] / dc) ** 2), val[NNindexes[i]]))
+     rhos = np.asarray(rhos)
+
+     _r = 100
+     _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+     deltas = rhos * 0
+     LargerNei = rhos * 0 - 1
+     for i in range(len(_indexes)):
+         idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+         if idx.shape[0] == 0:
+             deltas[i] = _dists[i][-1] + 1
+         else:
+             LargerNei[i] = _indexes[i][idx[0]]
+             deltas[i] = _dists[i][idx[0]]
+     failed = np.argwhere(LargerNei == -1).flatten()
+     while len(failed) > 1 and _r < 100000:
+         _r = _r * 10
+         _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 deltas[failed[i]] = _dists[i][-1] + 1
+             else:
+                 LargerNei[failed[i]] = _indexes[i][idx[0]]
+                 deltas[failed[i]] = _dists[i][idx[0]]
+         failed = np.argwhere(LargerNei == -1).flatten()
+
+     data['rhos'] = rhos
+     data['deltas'] = deltas
+
+     return data
+
+
+
+ @click.command()
+ @click.option('-dc','--distance_cutoff', type=int, default=5, help='Distance cutoff for local density calculation in terms of bin. [5]')
+ @click.option('-t','--threshold', type=float, default=0.6, help='Loop score threshold [0.6]')
+ @click.option('-r','--resol', default=5000, help='resolution [5000]')
+ @click.option('-R','--radius', type=int, default=2, help='Radius threshold to remove outliers. [2]')
+ @click.option('-d','--mindelta', type=float, default=5, help='Min distance allowed between two loops [5]')
+ @click.option('-i','--candidates', type=str, required=True, help='Loop candidates file path')
+ @click.option('-o','--output', type=str, required=True, help='.bedpe file path to save loops')
+ def pool(distance_cutoff,candidates,resol,mindelta,threshold,output,radius,refine=True):
+     """Call loops from loop candidates by clustering
+     """
+     print('\npolaris loop pool START :) ')
+
+     data = pd.read_csv(candidates, sep='\t', header=None, comment='#')
+
+     print(data.head())
+     data[6] = 1
+     print(data.head())
+
+     data[[1,4]] = data[[1,4]]//resol*resol
+
+     print(data.head())
+
+     data = data.drop_duplicates().reset_index(drop=True)
+
+     ccs = set(data.iloc[:,0])
+
+     # if data.shape[0] == 0:
+     #     print("#"*88,'\n#')
+     #     print("#\033[91m Error!!! The file is empty. Please check your file.\033[0m\n#")
+     #     print("#"*88,'\n')
+     #     sys.exit(1)
+     # data = data[data[6] > threshold].reset_index(drop=True)
+     # data = data[data[4] - data[1] > 11*resol].reset_index(drop=True)
+     # if data.shape[0] == 0:
+     #     print("#"*88,'\n#')
+     #     print("#\033[91m Error!!! The data is too sparse. Please decrease: [threshold] (minimum: 0.5).\033[0m\n#")
+     #     print("#"*88,'\n')
+     #     sys.exit(1)
+     data[['rhos','deltas']] = 0
+
+     print(data.shape)
+
+     data = data.groupby([0]).apply(rhoDelta, resol=resol, dc=distance_cutoff, radius=radius).reset_index(drop=True)
+     minrho = 0
+     targetData = data.reset_index(drop=True)
+
+     print(data.shape)
+
+     loopPds = []
+     # chroms = tqdm(set(targetData[0]), dynamic_ncols=True)
+
+     rep = 0
+
+     chroms = set(targetData[0])
+     for chrom in chroms:
+         print(f"[Running clustering on {chrom}]")
+         # chroms.desc = f"[Running clustering on {chrom}]"
+         data = targetData[targetData[0]==chrom].reset_index(drop=True)
+
+         print(data.shape)
+
+         pos = data[[1, 4]].to_numpy() // resol
+         posTree = KDTree(pos, leaf_size=30, metric='chebyshev')
+
+         rhos = data['rhos'].to_numpy()
+         deltas = data['deltas'].to_numpy()
+         # centroid = np.argwhere((rhos > minrho) & (deltas > mindelta)).flatten()
+         centroid = np.argwhere((deltas > mindelta)).flatten()
+
+         print(centroid.shape)
+         rep += data.shape[0] - centroid.shape[0]
+
+         _r = 100
+         _indexes, _dists = posTree.query_radius(pos, r=_r, return_distance=True, sort_results=True)
+         LargerNei = rhos * 0 - 1
+         for i in range(len(_indexes)):
+             idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+             if idx.shape[0] == 0:
+                 pass
+             else:
+                 LargerNei[i] = _indexes[i][idx[0]]
+
+         failed = np.argwhere(LargerNei == -1).flatten()
+         while len(failed) > 1 and _r < 100000:
+             _r = _r * 10
+             _indexes, _dists = posTree.query_radius(pos[failed], r=_r, return_distance=True, sort_results=True)
+             for i in range(len(_indexes)):
+                 idx = np.argwhere(rhos[_indexes[i]] > rhos[_indexes[i][0]])
+                 if idx.shape[0] == 0:
+                     pass
+                 else:
+                     LargerNei[failed[i]] = _indexes[i][idx[0]]
+             failed = np.argwhere(LargerNei == -1).flatten()
+
+         LargerNei = LargerNei.astype(int)
+         label = LargerNei * 0 - 1
+         for i in range(len(centroid)):
+             label[centroid[i]] = i
+         decreasingsortedIdxRhos = np.argsort(-rhos)
+         for i in decreasingsortedIdxRhos:
+             if label[i] == -1:
+                 label[i] = label[LargerNei[i]]
+
+         val = data[6].to_numpy()
+         refinedLoop = []
+         label = label.flatten()
+         for l in set(label):
+             idx = np.argwhere(label == l).flatten()
+             if len(idx) > 0:
+                 refinedLoop.append(idx[np.argmax(val[idx])])
+         if refine:
+             loopPds.append(data.loc[refinedLoop])
+         else:
+             loopPds.append(data.loc[centroid])
+
+     loopPd = pd.concat(loopPds).sort_values(6, ascending=False)
+     loopPd[[1, 2, 4, 5]] = loopPd[[1, 2, 4, 5]].astype(int)
+     loopPd[[0,1,2,3,4,5,6]].to_csv(output, sep='\t', header=False, index=False)
+
+     ccs_ = set(loopPd.iloc[:,0])
+     badc = ccs.difference(ccs_)
+     if len(badc) == len(ccs):
+         raise ValueError("polaris loop pool FAILED :(\nPlease check input and mcool file to yield scoreFile. Or use higher '-s' value for more sparse mcool data.")
+     else:
+         print(f'\npolaris loop pool FINISHED :)\n{len(loopPd)} loops saved to {output}')
+         if len(badc) > 0:
+             print(f"But the loop scores of {badc} are too sparse.\nYou may need to check the mcool data or re-run polaris loop score by increasing -s.")
+     print(f"duplicate loops: {rep}")
+
+ if __name__ == '__main__':
+     pool()
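In this debug variant every candidate score is forced to 1, and `rep` accumulates, per chromosome, how many candidates fail the `deltas > mindelta` test, i.e. calls absorbed into a nearby higher-density peak; that is the duplicate count the filename alludes to. The counting step in isolation (toy numbers, illustrative only):

import numpy as np

deltas = np.array([9.0, 1.0, 7.0, 2.0])   # separation of four candidates, in bins
mindelta = 5
centroid = np.argwhere(deltas > mindelta).flatten()
rep = len(deltas) - len(centroid)         # candidates merged into a nearby peak
print(rep)  # 2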
polaris/loopScore.py ADDED
@@ -0,0 +1,129 @@
+ import torch
+ import cooler
+ import click
+ from torch import nn
+ from tqdm import tqdm
+ from torch.cuda.amp import autocast
+ from importlib_resources import files
+ from torch.utils.data import DataLoader
+ from polaris.utils.util_loop import bedpewriter
+ from polaris.model.polarisnet import polarisnet
+ from polaris.utils.util_data import centerPredCoolDataset
+
+ @click.command()
+ @click.option('-b','--batchsize', type=int, default=128, help='Batch size [128]')
+ @click.option('-C','--cpu', type=bool, default=False, help='Use CPU [False]')
+ @click.option('-G','--gpu', type=str, default=None, help='Comma-separated GPU indices [auto select]')
+ @click.option('-c','--chrom', type=str, default=None, help='Comma separated chroms [all autosomes]')
+ @click.option('-nw','--workers', type=int, default=16, help='Number of cpu threads [16]')
+ @click.option('-t','--threshold', type=float, default=0.5, help='Loop score threshold [0.5]')
+ @click.option('-s','--sparsity', type=float, default=0.9, help='Allowed sparsity of submatrices [0.9]')
+ @click.option('-md','--max_distance', type=int, default=3000000, help='Max distance (bp) between contact pairs [3000000]')
+ @click.option('-r','--resol', type=int, default=5000, help='Resolution [5000]')
+ @click.option('--raw', type=bool, default=False, help='Raw matrix or balanced matrix')
+ @click.option('-i','--input', type=str, required=True, help='Hi-C contact map path')
+ @click.option('-o','--output', type=str, required=True, help='.bedpe file path to save loop candidates')
+ def score(batchsize, cpu, gpu, chrom, workers, threshold, sparsity, max_distance, resol, input, output, raw, image=224):
+     """Predict loop score for each pixel in the input contact map
+     """
+     print('\npolaris loop score START :) ')
+
+     center_size = image // 2
+     start_idx = (image - center_size) // 2
+     end_idx = (image + center_size) // 2
+     slice_obj_pred = (slice(None), slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+     slice_obj_coord = (slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))
+
+     loopwriter = bedpewriter(output, resol, max_distance)
+
+     if cpu:
+         assert gpu is None, "\033[91m QAQ The CPU and GPU modes cannot be used simultaneously. Please check the command. \033[0m\n"
+         gpu = ['None']
+         device = torch.device("cpu")
+         print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+     else:
+         if torch.cuda.is_available():
+             if gpu is not None:
+                 print("Using the specified GPU: " + gpu)
+                 gpu = [int(i) for i in gpu.split(',')]
+                 device = torch.device(f"cuda:{gpu[0]}")
+             else:
+                 gpuIdx = torch.cuda.current_device()
+                 device = torch.device(gpuIdx)
+                 print("Automatically selected GPU: " + str(gpuIdx))
+                 gpu = [gpuIdx]  # keep the auto-selected index (was `[gpu]`, i.e. [None])
+         else:
+             device = torch.device("cpu")
+             gpu = ['None']
+             cpu = True
+             print('GPU is not available!')
+             print('Using CPU mode... (This may take significantly longer than using GPU mode.)')
+
+
+     coolfile = cooler.Cooler(input + '::/resolutions/' + str(resol))
+     modelstate = str(files('polaris').joinpath('model/sft_loop.pt'))
+     _modelstate = torch.load(modelstate, map_location=device.type)
+     parameters = _modelstate['parameters']
+
+     if chrom is None:
+         chrom = coolfile.chromnames
+     else:
+         chrom = chrom.split(',')
+
+     # for rmchr in ['chrMT','MT','chrM','M','Y','chrY','X','chrX','chrW','W','chrZ','Z']: # 'Y','chrY','X','chrX'
+     #     if rmchr in chrom:
+     #         chrom.remove(rmchr)
+
+     print(f"Analysing chroms: {chrom}")
+
+     model = polarisnet(
+         image_size=parameters['image_size'],
+         in_channels=parameters['in_channels'],
+         out_channels=parameters['out_channels'],
+         embed_dim=parameters['embed_dim'],
+         depths=parameters['depths'],
+         channels=parameters['channels'],
+         num_heads=parameters['num_heads'],
+         drop=parameters['drop'],
+         drop_path=parameters['drop_path'],
+         pos_embed=parameters['pos_embed']
+     ).to(device)
+     model.load_state_dict(_modelstate['model_state_dict'])
+     if not cpu and len(gpu) > 1:
+         model = nn.DataParallel(model, device_ids=gpu)
+     model.eval()
+
+     badc = []
+     chrom_ = tqdm(chrom, dynamic_ncols=True)
+     for _chrom in chrom_:
+         test_data = centerPredCoolDataset(coolfile, _chrom, max_distance_bin=max_distance//resol, w=image, step=center_size, s=sparsity, raw=raw)
+         test_dataloader = DataLoader(test_data, batch_size=batchsize, shuffle=False, num_workers=workers, prefetch_factor=4, pin_memory=(not cpu))  # pin memory only when feeding a GPU
+
+         chrom_.desc = f"[Analyzing {_chrom} with {len(test_data)} submatrices]"
+
+         if len(test_data) == 0:
+             badc.append(_chrom)
+
+         with torch.no_grad():
+             for X in test_dataloader:
+                 bin_i, bin_j, targetX = X
+                 bin_i = bin_i*resol
+                 bin_j = bin_j*resol
+                 with autocast():
+                     pred = torch.sigmoid(model(targetX.float().to(device)))[slice_obj_pred].flatten()
+                     loop = torch.nonzero(pred>threshold).flatten().cpu()
+                     prob = pred[loop].cpu().numpy().flatten().tolist()
+                     frag1 = bin_i[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+                     frag2 = bin_j[slice_obj_coord].flatten().cpu().numpy()[loop].flatten().tolist()
+
+                 loopwriter.write(_chrom, frag1, frag2, prob)
+
+     if len(badc) == len(chrom):
+         raise ValueError("polaris loop score FAILED :( \nThe '-s' value needs to be increased for more sparse data.")
+     else:
+         print(f'\npolaris loop score FINISHED :)\nLoopscore file saved at {output}')
+         if len(badc) > 0:
+             print(f"But the sizes of {badc} are too small or their contact matrices are too sparse.\nYou may need to check the data or run these chr respectively by increasing -s.")
+
+ if __name__ == '__main__':
+     score()
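A note on the slicing logic at the top of `score`: only the central `center_size x center_size` region of each 224 x 224 prediction is kept, and windows advance by `step=center_size`, so adjacent windows tile the contact map with their trusted centres while the less reliable borders are discarded. A standalone check of the arithmetic (illustrative only, not package code):

import torch

image = 224
center_size = image // 2                   # 112
start_idx = (image - center_size) // 2     # 56
end_idx = (image + center_size) // 2       # 168
crop = (slice(None), slice(None), slice(start_idx, end_idx), slice(start_idx, end_idx))

pred = torch.rand(2, 1, image, image)      # stand-in for a model output (B, C, H, W)
print(pred[crop].shape)                    # torch.Size([2, 1, 112, 112])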
polaris/model/polarisnet.py ADDED
@@ -0,0 +1,526 @@
+ import torch
+ import torch.nn as nn
+ from operator import itemgetter
+
+ from typing import Type, Callable, Tuple, Optional, Set, List, Union
+ from timm.models.layers import drop_path, trunc_normal_, Mlp, DropPath
+ from timm.models.efficientnet_blocks import SqueezeExcite, DepthwiseSeparableConv
+
+ def exists(val):
+
+     return val is not None
+
+ def map_el_ind(arr, ind):
+
+     return list(map(itemgetter(ind), arr))
+
+ def sort_and_return_indices(arr):
+
+     indices = [ind for ind in range(len(arr))]
+     arr = zip(arr, indices)
+     arr = sorted(arr)
+
+     return map_el_ind(arr, 0), map_el_ind(arr, 1)
+
+ def calculate_permutations(num_dimensions, emb_dim):
+     total_dimensions = num_dimensions + 2
+     axial_dims = [ind for ind in range(1, total_dimensions) if ind != emb_dim]
+
+     permutations = []
+
+     for axial_dim in axial_dims:
+         last_two_dims = [axial_dim, emb_dim]
+         dims_rest = set(range(0, total_dimensions)) - set(last_two_dims)
+         permutation = [*dims_rest, *last_two_dims]
+         permutations.append(permutation)
+
+     return permutations
+
+ class ChanLayerNorm(nn.Module):
+     def __init__(self, dim, eps = 1e-5):
+         super().__init__()
+         self.eps = eps
+         self.g = nn.Parameter(torch.ones(1, dim, 1, 1))
+         self.b = nn.Parameter(torch.zeros(1, dim, 1, 1))
+
+     def forward(self, x):
+
+         std = torch.var(x, dim = 1, unbiased = False, keepdim = True).sqrt()
+         mean = torch.mean(x, dim = 1, keepdim = True)
+         return (x - mean) / (std + self.eps) * self.g + self.b
+
+ class PreNorm(nn.Module):
+     def __init__(self, dim, fn):
+         super().__init__()
+         self.fn = fn
+         self.norm = nn.LayerNorm(dim)
+
+     def forward(self, x):
+
+         x = self.norm(x)
+
+         return self.fn(x)
+
+ class PermuteToFrom(nn.Module):
+
+     def __init__(self, permutation, fn):
+         super().__init__()
+
+         self.fn = fn
+         _, inv_permutation = sort_and_return_indices(permutation)
+         self.permutation = permutation
+         self.inv_permutation = inv_permutation
+
+     def forward(self, x, **kwargs):
+
+         axial = x.permute(*self.permutation).contiguous()
+         shape = axial.shape
+         *_, t, d = shape
+         axial = axial.reshape(-1, t, d)
+         axial = self.fn(axial, **kwargs)
+         axial = axial.reshape(*shape)
+         axial = axial.permute(*self.inv_permutation).contiguous()
+
+         return axial
+
+ class AxialPositionalEmbedding(nn.Module):
+     def __init__(self, dim, shape, emb_dim_index = 1):
+         super().__init__()
+         parameters = []
+         total_dimensions = len(shape) + 2
+         ax_dim_indexes = [i for i in range(1, total_dimensions) if i != emb_dim_index]
+
+         self.num_axials = len(shape)
+
+         for i, (axial_dim, axial_dim_index) in enumerate(zip(shape, ax_dim_indexes)):
+             shape = [1] * total_dimensions
+             shape[emb_dim_index] = dim
+             shape[axial_dim_index] = axial_dim
+             parameter = nn.Parameter(torch.randn(*shape))
+             setattr(self, f'param_{i}', parameter)
+
+     def forward(self, x):
+
+         for i in range(self.num_axials):
+             x = x + getattr(self, f'param_{i}')
+
+         return x
+
+ class SelfAttention(nn.Module):
+     def __init__(self, dim, heads, dim_heads=None, drop=0):
+         super().__init__()
+         self.dim_heads = (dim // heads) if dim_heads is None else dim_heads
+         dim_hidden = self.dim_heads * heads
+         self.drop_rate = drop
+         self.heads = heads
+         self.to_q = nn.Linear(dim, dim_hidden, bias = False)
+         self.to_kv = nn.Linear(dim, 2 * dim_hidden, bias = False)
+         self.to_out = nn.Linear(dim_hidden, dim)
+         self.proj_drop = DropPath(drop)
+
+     def forward(self, x, kv = None):
+         kv = x if kv is None else kv
+         q, k, v = (self.to_q(x), *self.to_kv(kv).chunk(2, dim=-1))
+         b, t, d, h, e = *q.shape, self.heads, self.dim_heads
+         merge_heads = lambda x: x.reshape(b, -1, h, e).transpose(1, 2).reshape(b * h, -1, e)
+
+         q, k, v = map(merge_heads, (q, k, v))
+         dots = torch.einsum('bie,bje->bij', q, k) * (e ** -0.5)
+         dots = dots.softmax(dim=-1)
+
+         out = torch.einsum('bij,bje->bie', dots, v)
+         out = out.reshape(b, h, -1, e).transpose(1, 2).reshape(b, -1, d)
+         out = self.to_out(out)
+         out = self.proj_drop(out)
+
+         return out
+
+ class AxialTransformerBlock(nn.Module):
+     def __init__(self,
+                  dim,
+                  axial_pos_emb_shape,
+                  pos_embed,
+                  heads = 8,
+                  dim_heads = None,
+                  drop = 0.,
+                  drop_path_rate=0.,
+                  ):
+         super().__init__()
+
+         dim_index = 1
+
+         permutations = calculate_permutations(2, dim_index)
+
+         self.pos_emb = AxialPositionalEmbedding(dim, axial_pos_emb_shape, dim_index) if pos_embed else nn.Identity()
+
+         self.height_attn, self.width_attn = nn.ModuleList([PermuteToFrom(permutation, PreNorm(dim, SelfAttention(dim, heads, dim_heads, drop=drop))) for permutation in permutations])
+
+         self.FFN = nn.Sequential(
+             ChanLayerNorm(dim),
+             nn.Conv2d(dim, dim * 4, 3, padding = 1),
+             nn.GELU(),
+             DropPath(drop),
+             nn.Conv2d(dim * 4, dim, 3, padding = 1),
+             DropPath(drop),
+
+             ChanLayerNorm(dim),
+             nn.Conv2d(dim, dim * 4, 3, padding = 1),
+             nn.GELU(),
+             DropPath(drop),
+             nn.Conv2d(dim * 4, dim, 3, padding = 1),
+             DropPath(drop),
+         )
+
+         self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
+
+     def forward(self, x):
+
+         x = self.pos_emb(x)
+         x = x + self.drop_path(self.height_attn(x))
+         x = x + self.drop_path(self.width_attn(x))
+         x = x + self.drop_path(self.FFN(x))
+
+         return x
+
+ def pair(t):
+
+     return t if isinstance(t, tuple) else (t, t)
+
+ def _gelu_ignore_parameters(*args, **kwargs) -> nn.Module:
+
+     activation = nn.GELU()
+
+     return activation
+
+ class DoubleConv(nn.Module):
+
+     def __init__(
+         self,
+         in_channels: int,
+         out_channels: int,
+         downscale: bool = False,
+         act_layer: Type[nn.Module] = nn.GELU,
+         norm_layer: Type[nn.Module] = nn.BatchNorm2d,
+         drop_path: float = 0.,
+     ) -> None:
+
+         super(DoubleConv, self).__init__()
+
+         self.drop_path_rate: float = drop_path
+
+         if act_layer == nn.GELU:
+             act_layer = _gelu_ignore_parameters
+
+         self.main_path = nn.Sequential(
+             norm_layer(in_channels),
+             nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=(1, 1)),
+             DepthwiseSeparableConv(in_chs=in_channels, out_chs=out_channels, stride=2 if downscale else 1,
+                                    act_layer=act_layer, norm_layer=norm_layer, drop_path_rate=drop_path),
+             SqueezeExcite(in_chs=out_channels, rd_ratio=0.25),
+             nn.Conv2d(in_channels=out_channels, out_channels=out_channels, kernel_size=(1, 1))
+         )
+
+         if downscale:
+             self.skip_path = nn.Sequential(
+                 nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2)),
+                 nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=(1, 1))
+             )
+         else:
+             self.skip_path = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=(1, 1))
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+
+         output = self.main_path(x)
+
+         if self.drop_path_rate > 0.:
+             output = drop_path(output, self.drop_path_rate, self.training)
+
+         x = output + self.skip_path(x)
+
+         return x
+
+
+ class DeconvModule(nn.Module):
+
+     def __init__(self,
+                  in_channels,
+                  out_channels,
+                  norm_layer=nn.BatchNorm2d,
+                  act_layer=nn.Mish,
+                  kernel_size=4,
+                  scale_factor=2):
+         super(DeconvModule, self).__init__()
+
+         assert (kernel_size - scale_factor >= 0) and\
+                (kernel_size - scale_factor) % 2 == 0,\
+                f'kernel_size should be greater than or equal to scale_factor '\
+                f'and (kernel_size - scale_factor) should be even numbers, '\
+                f'while the kernel size is {kernel_size} and scale_factor is '\
+                f'{scale_factor}.'
+
+         stride = scale_factor
+         padding = (kernel_size - scale_factor) // 2
+         deconv = nn.ConvTranspose2d(
+             in_channels,
+             out_channels,
+             kernel_size=kernel_size,
+             stride=stride,
+             padding=padding)
+
+         norm = norm_layer(out_channels)
+         activate = act_layer()
+         self.deconv_upsamping = nn.Sequential(deconv, norm, activate)
+
+     def forward(self, x):
+
+         out = self.deconv_upsamping(x)
+
+         return out
+
+ class Stage(nn.Module):
+
+     def __init__(self,
+                  image_size: int,
+                  depth: int,
+                  in_channels: int,
+                  out_channels: int,
+                  type_name: str,
+                  pos_embed: bool,
+                  num_heads: int = 32,
+                  drop: float = 0.,
+                  drop_path: Union[List[float], float] = 0.,
+                  act_layer: Type[nn.Module] = nn.GELU,
+                  norm_layer: Type[nn.Module] = nn.BatchNorm2d,
+                  ):
+         super().__init__()
+         self.type_name = type_name
+
+         if self.type_name == "encoder":
+
+             self.conv = DoubleConv(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 downscale=True,
+                 act_layer=act_layer,
+                 norm_layer=norm_layer,
+                 drop_path=drop_path[0],
+             )
+
+             self.blocks = nn.Sequential(*[
+                 AxialTransformerBlock(
+                     dim=out_channels,
+                     axial_pos_emb_shape=pair(image_size),
+                     heads = num_heads,
+                     drop = drop,
+                     drop_path_rate=drop_path[index],
+                     dim_heads = None,
+                     pos_embed=pos_embed
+                 )
+                 for index in range(depth)
+             ])
+
+         elif self.type_name == "decoder":
+
+             self.upsample = DeconvModule(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 norm_layer=norm_layer,
+                 act_layer=act_layer
+             )
+
+             self.conv = DoubleConv(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 downscale=False,
+                 act_layer=act_layer,
+                 norm_layer=norm_layer,
+                 drop_path=drop_path[0],
+             )
+
+             self.blocks = nn.Sequential(*[
+                 AxialTransformerBlock(
+                     dim=out_channels,
+                     axial_pos_emb_shape=pair(image_size),
+                     heads = num_heads,
+                     drop = drop,
+                     drop_path_rate=drop_path[index],
+                     dim_heads = None,
+                     pos_embed=pos_embed
+                 )
+                 for index in range(depth)
+             ])
+
+     def forward(self, x, skip=None):
+
+         if self.type_name == "encoder":
+             x = self.conv(x)
+             x = self.blocks(x)
+
+         elif self.type_name == "decoder":
+             x = self.upsample(x)
+             x = torch.cat([skip, x], dim=1)
+             x = self.conv(x)
+             x = self.blocks(x)
+
+         return x
+
+ class FinalExpand(nn.Module):
+     def __init__(
+         self,
+         in_channels,
+         embed_dim,
+         out_channels,
+         norm_layer,
+         act_layer,
+     ):
+         super().__init__()
+         self.upsample = DeconvModule(
+             in_channels=in_channels,
+             out_channels=embed_dim,
+             norm_layer=norm_layer,
+             act_layer=act_layer
+         )
+
+         self.conv = nn.Sequential(
+             nn.Conv2d(in_channels=embed_dim*2, out_channels=embed_dim, kernel_size=3, stride=1, padding=1),
+             act_layer(),
+             nn.Conv2d(in_channels=embed_dim, out_channels=embed_dim, kernel_size=3, stride=1, padding=1),
+             act_layer(),
+         )
+
+     def forward(self, skip, x):
+         x = self.upsample(x)
+         x = torch.cat([skip, x], dim=1)
+         x = self.conv(x)
+
+         return x
+
+ class polarisnet(nn.Module):
+     def __init__(
+         self,
+         image_size=224,
+         in_channels=1,
+         out_channels=1,
+         embed_dim=64,
+         depths=[2,2,2,2],
+         channels=[64,128,256,512],
+         num_heads = 16,
+         drop=0.,
+         drop_path=0.1,
+         act_layer=nn.GELU,
+         norm_layer=nn.BatchNorm2d,
+         pos_embed=False
+     ):
+
+         super(polarisnet, self).__init__()
+         self.num_stages = len(depths)
+         self.num_features = channels[-1]
+         self.embed_dim = channels[0]
+
+         self.conv_first = nn.Sequential(
+             nn.Conv2d(in_channels=in_channels, out_channels=embed_dim, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
+             act_layer(),
+             nn.Conv2d(in_channels=embed_dim, out_channels=embed_dim, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
+             act_layer(),
+         )
+
+         drop_path = torch.linspace(0.0, drop_path, sum(depths)).tolist()
+         encoder_stages = []
+
+         for index in range(self.num_stages):
+
+             encoder_stages.append(
+                 Stage(
+                     image_size=image_size//(pow(2,1+index)),
+                     depth=depths[index],
+                     in_channels=embed_dim if index == 0 else channels[index - 1],
+                     out_channels=channels[index],
+                     num_heads=num_heads,
+                     drop=drop,
+                     drop_path=drop_path[sum(depths[:index]):sum(depths[:index + 1])],
+                     act_layer=act_layer,
+                     norm_layer=norm_layer,
+                     type_name = "encoder",
+                     pos_embed=pos_embed
+                 )
+             )
+
+         self.encoder_stages = nn.ModuleList(encoder_stages)
+
+         decoder_stages = []
+
+         for index in range(self.num_stages-1):
+
+             decoder_stages.append(
+                 Stage(
+                     image_size=image_size//(pow(2,self.num_stages-index-1)),
+                     depth=depths[self.num_stages - index - 2],
+                     in_channels=channels[self.num_stages - index - 1],
+                     out_channels=channels[self.num_stages - index - 2],
+                     num_heads=num_heads,
+                     drop=drop,
+                     drop_path=drop_path[sum(depths[:(self.num_stages-2-index)]):sum(depths[:(self.num_stages-2-index) + 1])],
+                     act_layer=act_layer,
+                     norm_layer=norm_layer,
+                     type_name = "decoder",
+                     pos_embed=pos_embed
+                 )
+             )
+
+         self.decoder_stages = nn.ModuleList(decoder_stages)
+
+         self.norm = norm_layer(self.num_features)
+         self.norm_up = norm_layer(self.embed_dim)
+
+         self.up = FinalExpand(
+             in_channels=channels[0],
+             embed_dim=embed_dim,
+             out_channels=embed_dim,
+             norm_layer=norm_layer,
+             act_layer=act_layer
+         )
+
+         self.output = nn.Conv2d(embed_dim, out_channels, kernel_size=3, padding=1)
+
+     def encoder_forward(self, x: torch.Tensor) -> torch.Tensor:
+
+         outs = []
+         x = self.conv_first(x)
+
+         for stage in self.encoder_stages:
+             outs.append(x)
+             x = stage(x)
+
+         x = self.norm(x)
+
+         return x, outs
+
+     def decoder_forward(self, x: torch.Tensor, x_downsample: list) -> torch.Tensor:
+
+         for inx, stage in enumerate(self.decoder_stages):
+             x = stage(x, x_downsample[len(x_downsample)-1-inx])
+
+         x = self.norm_up(x)
+
+         return x
+
+     def up_x4(self, x: torch.Tensor, x_downsample: list):
+         x = self.up(x_downsample[0], x)
+         x = self.output(x)
+
+         return x
+
+     def forward(self, x):
+         x, x_downsample = self.encoder_forward(x)
+         x = self.decoder_forward(x, x_downsample)
+         x = self.up_x4(x, x_downsample)
+
+         return x
+
+ if __name__ == '__main__':
+     net = polarisnet(in_channels=1, embed_dim=64, pos_embed=True).cuda()
+
+     X = torch.randn(5, 1, 224, 224).cuda()
+     y = net(X)
+     print(y.shape)
+
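As a quick sanity check on the U-shaped layout above: with the default `depths=[2,2,2,2]`, each encoder `Stage` halves the spatial size via its downscaling `DoubleConv` before running its axial-attention blocks, and each decoder `Stage` doubles it back while concatenating the matching skip connection. The per-stage resolutions implied by the `image_size//(pow(2,1+index))` arguments, as a plain arithmetic sketch under the assumed defaults:

image_size, num_stages = 224, 4
for index in range(num_stages):
    side = image_size // pow(2, 1 + index)
    print(f"encoder stage {index}: {side} x {side}")  # 112, 56, 28, 14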
polaris/model/sft_loop.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cae9e9a28e5c3ff0d328934c066d275371d5301db084a914431198134f66ada2
+ size 547572280
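This entry is a Git LFS pointer rather than the weights themselves; the ~548 MB checkpoint arrives with `git lfs pull`. As `loopScore.py` above shows, the file deserialises to a dict holding `parameters` (the polarisnet hyper-parameters) and `model_state_dict`. A minimal inspection sketch, assuming the checkpoint has been pulled locally:

import torch

ckpt = torch.load('polaris/model/sft_loop.pt', map_location='cpu')
print(sorted(ckpt.keys()))   # expect 'model_state_dict' and 'parameters' among the keys
print(ckpt['parameters'])    # kwargs used to rebuild polarisnet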
polaris/polaris.py ADDED
@@ -0,0 +1,54 @@
+ # My code references the following repositories:
+ # RefHiC: https://github.com/BlanchetteLab/RefHiC (analysis code)
+ # Axial Attention: https://github.com/lucidrains/axial-attention (model architecture)
+ # Peakachu: https://github.com/tariks/peakachu (calculating intra reads)
+ # Thanks a lot for their implementations.
+ # --------------------------------------------------------
+
+ import click
+ from polaris.loopScore import score
+ from polaris.loopDev import dev
+ from polaris.loopPool import pool
+ from polaris.loop import pred
+ from polaris.utils.util_cool2bcool import cool2bcool
+ from polaris.utils.util_pileup import pileup
+ from polaris.utils.util_depth import depth
+
+ @click.group()
+ def cli():
+     '''
+     Polaris
+
+     A Versatile Framework for Chromatin Loop Annotation in Bulk and Single-cell Hi-C Data
+     '''
+     pass
+
+ @cli.group()
+ def loop():
+     '''Loop annotation.
+
+     \b
+     Annotate loops from chromosomal contact maps.
+     '''
+     pass
+
+ @cli.group()
+ def util():
+     '''Utilities.
+
+     \b
+     Utilities for analysis and visualization.'''
+     pass
+
+ loop.add_command(pred)
+ loop.add_command(score)
+ loop.add_command(dev)
+ loop.add_command(pool)
+
+ util.add_command(depth)
+ util.add_command(cool2bcool)
+ util.add_command(pileup)
+
+
+ if __name__ == '__main__':
+     cli()
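The group layout above yields nested subcommands such as `polaris loop score` and `polaris util cool2bcool`. A small smoke test using click's own test runner (assumes the package is importable in the current environment):

from click.testing import CliRunner
from polaris.polaris import cli

runner = CliRunner()
result = runner.invoke(cli, ['loop', '--help'])  # should list dev, pool, pred, score
print(result.output)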
polaris/utils/util_bcooler.py ADDED
@@ -0,0 +1,347 @@
+ import cooler
+ import numpy as np
+ from types import SimpleNamespace
+ import random
+ import sys
+
+ def shuffleIFWithCount(df):
+     shuf = df[['count','balanced']].sample(frac=1)
+     df[['count','balanced']] = shuf[['count','balanced']].to_numpy()
+     return df
+
+ def shuffleIF(df):
+     if len(df) < 10:
+         df = shuffleIFWithCount(df)
+         return df
+     min = np.min(df['bin1_id'])
+     max = np.max(df['bin1_id'])
+     distance = df['distance'].iloc[0]
+     bin1_id = np.random.randint(min, high=max, size=int(len(df)*1.5))
+     bin2_id = bin1_id + distance
+     pair_id = set(zip(bin1_id, bin2_id))
+     if len(pair_id) < len(df)-50:
+         bin1_id = np.random.randint(min, high=max, size=len(df))
+         bin2_id = bin1_id + distance
+         extra_pair_id = set(zip(bin1_id, bin2_id))
+         pair_id.update(extra_pair_id)
+     if len(pair_id) < len(df):
+         df = df.sample(len(pair_id))
+     pair_id = list(pair_id)
+     random.shuffle(pair_id)
+     pair_id = np.asarray(pair_id[:len(df)])
+     df['bin1_id'] = pair_id[:, 0]
+     df['bin2_id'] = pair_id[:, 1]
+     return df
+
+ class bandmatrix():
+     def __init__(self, pixels, extent, max_distance_bins=None, bins=None, info=None):
+         self.extent = extent
+         self.max_distance_bins = max_distance_bins
+         self.bmatrix = np.zeros((extent[1] - extent[0], max_distance_bins))
+         self.offset = extent[0]
+         self.bmatrix[pixels['bin1_id'] - self.offset, (pixels['bin2_id'] - pixels['bin1_id']).abs()] = pixels[
+             'balanced']
+         self.diag_mean = np.nanmean(self.bmatrix, axis=0)
+         np.nan_to_num(self.bmatrix, copy=False)
+
+         self.bins = bins
+         self.bp2bin = \
+             bins['start'].reset_index(drop=False).rename(columns={"start": "bp", "index": "bin"}).set_index(
+                 'bp').to_dict()[
+                 'bin']
+         self.resol = self.bins.iloc[0]['end'] - self.bins.iloc[0]['start']
+         self.info = info
+         self.bin2bias = np.zeros(self.extent[1] - self.extent[0] + 1)
+         if 'full_sum' in self.info:
+             self.totalRC = self.info['full_sum']
+         elif 'sum' in info:
+             self.totalRC = self.info['sum']
+         else:
+             self.totalRC = None
+         self.bin2bias = np.zeros((extent[1] - extent[0]))
+         for k, v in bins.to_dict()['weight'].items():
+             self.bin2bias[k - self.offset] = v
+         self.bin2bias = np.nan_to_num(self.bin2bias)
+
+         self.continousRows = {'start_bp': np.inf, 'end_bp': -1, 'O_matrix': None, 'OE_matrix': None, 'bias': None,
+                               'offset_bin': 0}
+         self.continousRows = SimpleNamespace(**self.continousRows)
+
+     def __bandedRows2fullRows(self, x):
+         """
+         converting rows in bandedMatrix to upper triangle (+ necessary lower triangle) fullMatrix
+         x????    x???x000
+         x@@xx    ?x@@xx00
+         x#xxx --> ?@x#xxx0
+         xxxxx    ?@#xxxxx
+         """
+         b, h, w = x.shape
+         output = np.zeros((b, h, h + w))
+         output[:b, :h, :w] = x
+         output = output.reshape(b, -1)[:, :-h].reshape(b, h, -1)[:, :, :h + w]
+         i_lower = np.tril_indices(h, -1)
+         for i in range(b):
+             output[i][i_lower] = output[i].swapaxes(-1, -2)[i_lower]
+         return output
+
+     def __relative_right_shift(self, x):
+         """
+         .........xxxxxx      xxxxxx0000000000
+         ........xxxxxx.      xxxxxx.000000000
+         .......xxxxxx..  ---> xxxxxx..00000000
+         ......xxxxxx...      xxxxxx...0000000
+         .....xxxxxx....      xxxxxx....000000
+         """
+         b, h, w = x.shape
+         output = np.zeros((b, h, 2 * w))
+         output[:b, :h, :w] = x
+         return output.reshape(b, -1)[:, :-h].reshape(b, h, -1)[:, :, h - 1:]
+
+     def __tril_block(self, top, left, bottom, right, type='o'):
+         """
+         fetch data in lower triangular part without main diagonal
+         Parameters:
+             top,left,bottom,right : block coords. left/right < 0
+             type : o [observe], oe [o/e], b [both]
+         """
+
+         if left >= 0 or right >= 0:
+             raise Exception("Trying to access data outside lower triangular part with tril_block")
+
+         height = bottom - top
+         top, bottom = top + left, bottom + right
+         left, right = -right, -left
+
+         if top < 0 or bottom > self.bmatrix.shape[0] - 1:
+             raise Exception("Accessing values outside the contact map ... valid region:" +
+                             str(10 * self.resol) + '~' + str((self.extent[1] - self.extent[0] - 10) * self.resol))
+
+         O = self.bmatrix[top:bottom + 1, left:right + 1]
+
+         if type == 'o':
+             out = self.__relative_right_shift(O[None].swapaxes(-1, 1)).swapaxes(-1, 1)[:, :height + 1, :]
+         elif type == 'oe':
+             OE = O / self.diag_mean[left:right + 1]
+             out = self.__relative_right_shift(OE[None].swapaxes(-1, 1)).swapaxes(-1, 1)[:, :height + 1, :]
+         else:
+             OE = O / self.diag_mean[left:right + 1]
+             out = np.concatenate((O[None], OE[None]))
+             out = self.__relative_right_shift(out.swapaxes(-1, 1)).swapaxes(-1, 1)[:, :height + 1, :]
+
+         return out[..., ::-1]
+
+     def rows(self, firstRow, lastRow, type='o', returnBias=False):
+         """
+         fetch rows [firstRow,lastRow] of contacts
+         Parameters
+         ----------
+         firstRow : inclusive first row in bp
+         lastRow : inclusive last row in bp
+         type : o [observe], oe [o/e], b [both]
+         returnBias : If true, return bias in an array for bins [first row,last row + max_distance_bins)
+         """
+         firstRow = firstRow // self.resol * self.resol
+         lastRow = lastRow // self.resol * self.resol
+         ORows = None
+         OERows = None
+         if firstRow < 0 or lastRow < 0 or firstRow > (self.extent[1] - self.extent[0]) * self.resol or lastRow > (
+                 self.extent[1] - self.extent[0]) * self.resol:
+             raise Exception("Accessing values outside the contact map ... valid region: 0 ~ "
+                             + str((self.extent[1] - self.extent[0]) * self.resol))
+
+         firstRowRelativeBin = self.bp2bin[firstRow] - self.offset
+         lastRowRelativeBin = self.bp2bin[lastRow] - self.offset
+         ORows = self.bmatrix[firstRowRelativeBin:lastRowRelativeBin + 1, :][None]
+
+         if type == 'o':
+             outRows = ORows
+         elif type == 'oe':
+             OERows = (ORows / self.diag_mean)
+             outRows = OERows
+         elif type == 'b':
+             OERows = (ORows / self.diag_mean)
+             outRows = np.concatenate((ORows, OERows), axis=0)
+
+         outRows = self.__bandedRows2fullRows(outRows)
+
+         if returnBias:
+             bias = self.bin2bias[firstRowRelativeBin:lastRowRelativeBin + self.max_distance_bins]
+             # print('bias.shape',bias.shape)
+             # p2ll = self.p2ll(output[-1,:,:],cw=3) # prefer to use obs to compute p2ll
+             return outRows, bias
+
+         return outRows
+
+     def __squareFromContinousRows(self, xCenter, yCenter, w, type='o', meta=True):
+         """
+         fetch a (2w+1)*(2w+1) square of contacts centered at (xCenter,yCenter) from continousRows efficiently
+         Parameters
+         ----------
+         xCenter : xCenter in bp
+         yCenter : yCenter in bp
+         w : block width = 2w+1, in bins
+         type : o [observe], oe [o/e], b [both]
+         """
+
+         if xCenter < self.continousRows.start_bp or xCenter > self.continousRows.end_bp:
+             print('miss')
+             rowStep = 1000
+             startRow_bp = np.max([0, xCenter // (rowStep * self.resol) * (rowStep - 2 * w) * self.resol])
+             endRow_bp = np.min(
+                 [startRow_bp + (rowStep + 2 * w) * self.resol, (self.extent[1] - self.offset - 1) * self.resol])
+             mat, bias = self.rows(startRow_bp, endRow_bp, type='b', returnBias=True)
+
+             self.continousRows.start_bp = startRow_bp
+             self.continousRows.end_bp = endRow_bp
+             self.continousRows.O_matrix = mat[0, :, :]
+             self.continousRows.OE_matrix = mat[1, :, :]
+             self.continousRows.bias = bias
+         else:
+             print('hit')
+
+         xCenterRelativeBin = (xCenter - self.continousRows.start_bp) // self.resol
+         yCenterRelativeBin = (yCenter - self.continousRows.start_bp) // self.resol
+
+         # = {'start_bp': v, 'end_bp': v, 'O_matrix': None, 'OE_matrix': None, 'bias':None, 'offset_bin': 0}
+         if type == 'o':
+             output = self.continousRows.O_matrix[xCenterRelativeBin - w:xCenterRelativeBin + w + 1,
+                      yCenterRelativeBin - w:yCenterRelativeBin + w + 1][None]
+         elif type == 'oe':
+             output = self.continousRows.OE_matrix[xCenterRelativeBin - w:xCenterRelativeBin + w + 1,
+                      yCenterRelativeBin - w:yCenterRelativeBin + w + 1][None]
+         else:
+             OEsquare = self.continousRows.OE_matrix[xCenterRelativeBin - w:xCenterRelativeBin + w + 1,
+                        yCenterRelativeBin - w:yCenterRelativeBin + w + 1][None]
+             Osquare = self.continousRows.O_matrix[xCenterRelativeBin - w:xCenterRelativeBin + w + 1,
+                       yCenterRelativeBin - w:yCenterRelativeBin + w + 1][None]
+             output = np.concatenate((Osquare, OEsquare))
+
+         if meta:
+             xBias = self.continousRows.bias[xCenterRelativeBin - w:xCenterRelativeBin + w + 1]
+             yBias = self.continousRows.bias[yCenterRelativeBin - w:yCenterRelativeBin + w + 1]
+             bias = np.concatenate((xBias, yBias))
+             p2ll, crk = self.p2ll(output[-1, :, :], cw=3)  # prefer to use obs to compute p2ll
+             return output, np.concatenate((bias, [self.totalRC, p2ll, yCenterRelativeBin, crk]))
+         return output
+
+     def p2ll(self, x, cw=3):
+         """
+         P2LL for a peak.
+         Parameters:
+             x : square matrix, peak and its surroundings
+             cw : lower-left corner width
+         """
+         c = x.shape[0] // 2
+         llcorner = x[-cw:, :cw].flatten()
+         if sum(llcorner) == 0:
+             return 0, np.sum(x[c,c] > x[c-1:c+2, c-1:c+2])
+         return x[c, c] / (sum(llcorner) / len(llcorner)), np.sum(x[c,c] > x[c-1:c+2, c-1:c+2])
+
+     def square(self, xCenter, yCenter, w, type='o', meta=True, cache=False):
+         """
+         fetch a (2w+1)*(2w+1) square of contacts centered at (xCenter,yCenter)
+         Parameters
+         ----------
+         xCenter : xCenter in bp
+         yCenter : yCenter in bp
+         w : block width = 2w+1, in bins
+         type : o [observe], oe [o/e], b [both]
+         """
+         # print(xCenter,yCenter)
+         tril = None
+         xCenter = xCenter // self.resol * self.resol
+         yCenter = yCenter // self.resol * self.resol
+         # if xCenter > yCenter:
+         #     tmp = xCenter
+         #     xCenter = yCenter
+         #     yCenter = tmp
+
+         # if xCenter - w * self.resol < 0 or yCenter - w * self.resol < 0 or \
+         #         xCenter + w * self.resol > (
+         #         self.extent[1] - self.extent[0] - 1) * self.resol or yCenter + w * self.resol > (
+         #         self.extent[1] - self.extent[0] - 1) * self.resol:
+         #     raise Exception("Accessing values outside the contact map ... valid region: 0 ~ "
+         #                     + str((self.extent[1] - self.extent[0]) * self.resol))
+
+         # if cache:
+         #     # print("cache")
+         #     return self.__squareFromContinousRows(xCenter, yCenter, w, type, meta)
+
+         xCenterRelativeBin = self.bp2bin[xCenter] - self.offset
+         yCenterRelativeBin = self.bp2bin[yCenter] - self.offset - xCenterRelativeBin
+
+         # if yCenterRelativeBin + 2 * w >= self.max_distance_bins:
+         #     raise Exception("max distance in this bcool file is ", self.max_distance_bins * self.resol)
+         topleft = [xCenterRelativeBin - w, yCenterRelativeBin - 2 * w]
+         bottomright = [xCenterRelativeBin + w, yCenterRelativeBin + 2 * w]
+
+         if topleft[1] < 0:
+             tril = (topleft[0], topleft[1], bottomright[0], -1)
+             topleft[1] = 0
+             tril_part = self.__tril_block(tril[0], tril[1], tril[2], tril[3], type)
+
+         Osquare = self.bmatrix[topleft[0]:bottomright[0] + 1, topleft[1]:bottomright[1] + 1]
+
+         if type == 'o':
+             Osquare = Osquare[None]
+             if tril is not None:
+                 Osquare = np.concatenate((tril_part, Osquare), axis=-1)
+             output = self.__relative_right_shift(Osquare)[:, :, :2 * w + 1]
+         elif type == 'oe':
+             OEsquare = (Osquare / self.diag_mean[topleft[1]:bottomright[1] + 1])[None]
+             if tril is not None:
+                 OEsquare = np.concatenate((tril_part, OEsquare), axis=-1)
+             output = self.__relative_right_shift(OEsquare)[:, :, :2 * w + 1]
+         else:
+             OEsquare = Osquare / self.diag_mean[topleft[1]:bottomright[1] + 1]
+             output = np.concatenate((Osquare[None], OEsquare[None]))
+             if tril is not None:
+                 output = np.concatenate((tril_part, output), axis=-1)
+             output = self.__relative_right_shift(output)[:, :, :2 * w + 1]
+         if meta:
+             xBias = self.bin2bias[self.bp2bin[xCenter] - self.offset - w:self.bp2bin[xCenter] - self.offset + w + 1]
+             yBias = self.bin2bias[self.bp2bin[yCenter] - self.offset - w:self.bp2bin[yCenter] - self.offset + w + 1]
+             bias = np.concatenate((xBias, yBias))
+
+             p2ll, crk = self.p2ll(output[-1, :, :], cw=3)  # prefer to use obs to compute p2ll
+             return output, np.concatenate((bias, [self.totalRC, p2ll, yCenterRelativeBin, crk]))
+         return output
+
+ class bcool(cooler.Cooler):
+     def __init__(self, store):
+         super().__init__(store)
+
+     def bchr(self, chrom, max_distance=None, annotate=True, decoy=False, restrictDecoy=False):
+         '''
+         get banded matrix for a given chrom
+         '''
+         balance = True
+         resol = self.info['bin-size']
+         if max_distance is not None and 'max_distance' in self.info and max_distance > self.info['max_distance']:
+             raise Exception("max distance in this bcool file is ", self.info['max_distance'])
+         else:
+             if 'max_distance' in self.info:
+                 max_distance = self.info['max_distance']
+             else:
+                 max_distance = 3000000
+         pixels = self.matrix(balance=balance, as_pixels=True).fetch(chrom)
+         pixels = pixels[(pixels['bin2_id']-pixels['bin1_id']).abs() < max_distance//resol].reset_index(drop=True)
+
+         if decoy:
+             np.random.seed(0)
+             pixels['distance'] = (pixels['bin2_id']-pixels['bin1_id']).abs()
+             if restrictDecoy:
+                 pixels = pixels.groupby('distance').apply(shuffleIFWithCount)
+             else:
+                 pixels = pixels.groupby('distance').apply(shuffleIF)
+
+
+         if annotate:
+             bins = self.bins().fetch(chrom)
+             info = self.info
+         else:
+             bins = None
+             info = None
+         extent = self.extent(chrom)
+         bmatrix = bandmatrix(pixels, extent, max_distance // resol, bins, info)
+         return bmatrix
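`bandmatrix` stores only the band of contacts with 0 <= j - i < max_distance_bins, indexed as (row, diagonal offset) instead of (row, column); the private shift helpers then realign band rows into square windows on demand. The storage idea in isolation, as a toy numpy sketch (not the package API):

import numpy as np

N, K = 6, 3                                  # bins, band width in bins
full = np.triu(np.arange(1.0, N * N + 1).reshape(N, N))
band = np.zeros((N, K))
for i in range(N):
    for d in range(K):                       # d = j - i, the diagonal offset
        if i + d < N:
            band[i, d] = full[i, i + d]
print(band[2, 1] == full[2, 3])              # True: (row=2, col=3) lives at offset 1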
polaris/utils/util_cool2bcool.py ADDED
@@ -0,0 +1,88 @@
+ # Modified from RefHiC: https://github.com/BlanchetteLab/RefHiC (analysis code)
+ # --------------------------------------------------------------
+
+ import click
+ import cooler
+ import h5py
+ from cooler.create._create import write_pixels, write_indexes, index_bins, index_pixels, prepare_pixels, PIXEL_DTYPES, _set_h5opts, write_info
+ from cooler.util import get_meta
+ import posixpath
+
+ @click.command()
+ @click.option('-u', type=int, default=3000000, help='distance upper bound [bp] [default=3000000]')
+ @click.option('--resol', default=None, help='comma separated resols for output')
+ @click.argument('mcool', type=str, required=True)
+ @click.argument('bcool', type=str, required=True)
+ def cool2bcool(mcool, bcool, u, resol):
+     '''convert a .mcool file to a .bcool file'''
+     h5opts = _set_h5opts(None)
+     copy = ['bins', 'chroms']
+     Ofile = h5py.File(bcool, 'w')
+     Ifile = h5py.File(mcool, 'r')
+
+     if resol is None:
+         resols = [r.split('/')[-1] for r in cooler.fileops.list_coolers(mcool)]
+     else:
+         resols = resol.split(',')
+     # copy bins and chroms
+     for grp in Ifile:
+         Ofile.create_group(grp)
+         for subgrp in Ifile[grp]:
+             if subgrp in resols:
+                 Ofile[grp].create_group(subgrp)
+                 for ssubgrp in Ifile[grp][subgrp]:
+                     if ssubgrp in copy:
+                         Ofile.copy(Ifile[grp + '/' + subgrp + '/' + ssubgrp], grp + '/' + subgrp + '/' + ssubgrp)
+     Ofile.flush()
+     Ifile.close()
+
+     for group_path in ['/resolutions/'+str(r) for r in resols]:
+         c = cooler.Cooler(mcool + '::' + group_path)
+         nnz_src = c.info['nnz']
+         n_bins = c.info['nbins']
+         n_chroms = c.info['nchroms']
+         bins = c.bins()[:]
+         pixels = []
+         info = c.info
+         info['subformat'] = 'bcool'
+         info['max_distance'] = u
+         info['full_nnz'] = info['nnz']
+         info['full_sum'] = info['sum']
+
+         # collect pixels
+         for lo, hi in cooler.util.partition(0, nnz_src, nnz_src // 100):
+             pixel = c.pixels(join=False)[lo:hi].reset_index(drop=True)
+             bins1 = bins.iloc[pixel['bin1_id']][['chrom', 'start']].reset_index(drop=True)
+             bins2 = bins.iloc[pixel['bin2_id']][['chrom', 'start']].reset_index(drop=True)
+             pixel = pixel[
+                 (bins1['chrom'] == bins2['chrom']) & ((bins1['start'] - bins2['start']).abs() < u)].reset_index(
+                 drop=True)
+             pixels.append(pixel)
+
+         columns = list(pixels[0].columns.values)
+         meta = get_meta(columns, dict(PIXEL_DTYPES), default_dtype=float)
+
+         # write pixels
+         with h5py.File(bcool, "r+") as f:
+             h5 = f[group_path]
+             grp = h5.create_group("pixels")
+             max_size = n_bins * (n_bins - 1) // 2 + n_bins
+             prepare_pixels(grp, n_bins, max_size, meta.columns, dict(meta.dtypes), h5opts)
+
+         target = posixpath.join(group_path, 'pixels')
+         nnz, ncontacts = write_pixels(bcool, target, columns, pixels, h5opts, lock=None)
+         info['nnz'] = nnz
+         info['sum'] = ncontacts
+
+         # write indexes
+         with h5py.File(bcool, "r+") as f:
+             h5 = f[group_path]
+             grp = h5.create_group("indexes")
+             chrom_offset = index_bins(h5["bins"], n_chroms, n_bins)
+             bin1_offset = index_pixels(h5["pixels"], n_bins, nnz)
+             write_indexes(grp, chrom_offset, bin1_offset, h5opts)
+             write_info(h5, info)
+
+
+ if __name__ == '__main__':
+     cool2bcool()
+ cool2bcool()
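A quick way to exercise this command without a shell is click's test runner; a minimal sketch, assuming sample.mcool contains 5 kb and 10 kb resolutions (file names are placeholders):

from click.testing import CliRunner
from polaris.utils.util_cool2bcool import cool2bcool

runner = CliRunner()
# keep contacts within 3 Mb and only the listed resolutions
result = runner.invoke(cool2bcool, ['-u', '3000000', '--resol', '5000,10000',
                                    'sample.mcool', 'sample.bcool'])
print(result.output)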
polaris/utils/util_data.py ADDED
@@ -0,0 +1,159 @@
+ import random
+ import warnings
+ import numpy as np
+ from scipy.sparse import coo_matrix, SparseEfficiencyWarning
+ from torch.utils.data import Dataset
+ warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+ def getLocal(mat, i, jj, w, N):
+     # fast path: the window lies entirely inside the matrix
+     if i >= 0 and jj >= 0 and i + w <= N and jj + w <= N:
+         mat = mat[i:i+w, jj:jj+w].toarray()
+         return mat[None, ...]
+     # otherwise clip the slice to the matrix and zero-pad the overhang
+     # pad_width = ((up, down), (left, right))
+     slice_pos = [[i, i+w], [jj, jj+w]]
+     pad_width = [[0, 0], [0, 0]]
+     if i < 0:
+         pad_width[0][0] = -i
+         slice_pos[0][0] = 0
+     if jj < 0:
+         pad_width[1][0] = -jj
+         slice_pos[1][0] = 0
+     if i + w > N:
+         pad_width[0][1] = i + w - N
+         slice_pos[0][1] = N
+     if jj + w > N:
+         pad_width[1][1] = jj + w - N
+         slice_pos[1][1] = N
+     _mat = mat[slice_pos[0][0]:slice_pos[0][1], slice_pos[1][0]:slice_pos[1][1]].toarray()
+     padded_mat = np.pad(_mat, pad_width, mode='constant', constant_values=0)
+     return padded_mat[None, ...]
+
+ def upperCoo2symm(row, col, data, N=None):
+     if N:
+         shape = (N, N)
+     else:
+         shape = (row.max() + 1, col.max() + 1)
+
+     sparse_matrix = coo_matrix((data, (row, col)), shape=shape)
+     symm = sparse_matrix + sparse_matrix.T
+     diagVal = symm.diagonal(0) / 2  # the diagonal was added twice
+     symm = symm.tocsr()
+     symm.setdiag(diagVal)
+     return symm
+
+ def shuffleIFWithCount(df):
+     shuffled_df = df.copy()
+     shuffled_df[['oe', 'balanced']] = df[['oe', 'balanced']].sample(frac=1).reset_index(drop=True)
+     return shuffled_df
+
+ def shuffleIF(df):
+     if len(df) < 10:
+         df = shuffleIFWithCount(df)
+         return df
+     lo = np.min(df['bin1_id'])
+     hi = np.max(df['bin1_id'])
+     distance = df['distance'].iloc[0]
+     bin1_id = np.random.randint(lo, high=hi, size=int(len(df) * 1.5))
+     bin2_id = bin1_id + distance
+     pair_id = set(zip(bin1_id, bin2_id))
+     if len(pair_id) < len(df) - 50:
+         bin1_id = np.random.randint(lo, high=hi, size=len(df))
+         bin2_id = bin1_id + distance
+         extra_pair_id = set(zip(bin1_id, bin2_id))
+         pair_id.update(extra_pair_id)
+     if len(pair_id) < len(df):
+         df = df.sample(len(pair_id))
+     pair_id = list(pair_id)
+     random.shuffle(pair_id)
+     pair_id = np.asarray(pair_id[:len(df)])
+     df['bin1_id'] = pair_id[:, 0]
+     df['bin2_id'] = pair_id[:, 1]
+     return df
+
+ class centerPredCoolDataset(Dataset):
+     def __init__(self, coolfile, cchrom, step=224, w=224, max_distance_bin=600, decoy=False, restrictDecoy=False, s=0.9, raw=False):
+         '''
+         Args:
+             step (int): stride of the sliding window, and also the size of the center crop to predict
+         '''
+         self.s = s
+         oeMat, decoyOeMat, N = self._processCoolFile(coolfile, cchrom, decoy=decoy, restrictDecoy=restrictDecoy, raw=raw)
+         self.data, self.i, self.j = self._prepare_data(oeMat, N, step, w, max_distance_bin, decoyOeMat)
+         del oeMat, decoyOeMat
+
+     def _prepare_data(self, oeMat, N, step, w, max_distance_bin, decoyOeMat=None):
+         center_crop_size = step
+         start_point = -(w - center_crop_size) // 2
+         data, i_list, j_list = [], [], []
+         joffset = np.repeat(np.linspace(0, w, w, endpoint=False, dtype=int)[np.newaxis, :], w, axis=0)
+         ioffset = np.repeat(np.linspace(0, w, w, endpoint=False, dtype=int)[:, np.newaxis], w, axis=1)
+
+         for i in range(start_point, N - w - start_point, step):
+             _data, _i_list, _j_list = self._process_window(oeMat, i, step, w, N, joffset, ioffset, max_distance_bin, decoyOeMat)
+             data.extend(_data)
+             i_list.extend(_i_list)
+             j_list.extend(_j_list)
+
+         return data, i_list, j_list
+
+     def _process_window(self, oeMat, i, step, w, N, joffset, ioffset, max_distance_bin, decoyOeMat=None):
+         data, i_list, j_list = [], [], []
+         for j in range(0, max_distance_bin, step):
+             jj = j + i
+             _oeMat = getLocal(oeMat, i, jj, w, N)
+             # skip windows that are mostly empty
+             if np.sum(_oeMat == 0) <= (w * w * self.s):
+                 if decoyOeMat is not None:
+                     _decoyOeMat = getLocal(decoyOeMat, i, jj, w, N)
+                     data.append(np.vstack((_oeMat, _decoyOeMat)))
+                 else:
+                     data.append(_oeMat)
+
+                 i_list.append(i + ioffset)
+                 j_list.append(jj + joffset)
+         return data, i_list, j_list
+
+     def _processCoolFile(self, coolfile, cchrom, decoy=False, restrictDecoy=False, raw=False):
+         extent = coolfile.extent(cchrom)
+         N = extent[1] - extent[0]
+         if raw:
+             ccdata = coolfile.matrix(balance=False, sparse=True, as_pixels=True).fetch(cchrom)
+             v = 'count'
+         else:
+             ccdata = coolfile.matrix(balance=True, sparse=True, as_pixels=True).fetch(cchrom)
+             v = 'balanced'
+         ccdata['bin1_id'] -= extent[0]
+         ccdata['bin2_id'] -= extent[0]
+
+         ccdata['distance'] = ccdata['bin2_id'] - ccdata['bin1_id']
+         d_means = ccdata.groupby('distance')[v].transform('mean')
+         ccdata[v] = ccdata[v].fillna(0)
+
+         # observed/expected normalization, scaled to [0, 1]
+         ccdata['oe'] = ccdata[v] / d_means
+         ccdata['oe'] = ccdata['oe'].fillna(0)
+         ccdata['oe'] = ccdata['oe'] / ccdata['oe'].max()
+         oeMat = upperCoo2symm(ccdata['bin1_id'].ravel(), ccdata['bin2_id'].ravel(), ccdata['oe'].ravel(), N)
+
+         decoyMat = None
+         if decoy:
+             decoydata = ccdata.copy(deep=True)
+             np.random.seed(0)
+             if restrictDecoy:
+                 decoydata = decoydata.groupby('distance').apply(shuffleIF)
+             else:
+                 decoydata = decoydata.groupby('distance').apply(shuffleIFWithCount)
+
+             decoyMat = upperCoo2symm(decoydata['bin1_id'].ravel(), decoydata['bin2_id'].ravel(), decoydata['oe'].ravel(), N)
+
+         return oeMat, decoyMat, N
+
+     def __len__(self):
+         return len(self.data)
+
+     def __getitem__(self, idx):
+         return self.i[idx], self.j[idx], self.data[idx]
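A minimal sketch of feeding this dataset to a PyTorch DataLoader; 'sample.mcool' and 'chr1' are placeholders, and the window settings follow the defaults above:

import cooler
from torch.utils.data import DataLoader
from polaris.utils.util_data import centerPredCoolDataset

c = cooler.Cooler('sample.mcool::resolutions/5000')
ds = centerPredCoolDataset(c, 'chr1', step=224, w=224, max_distance_bin=600)
loader = DataLoader(ds, batch_size=8, shuffle=False)
for i_idx, j_idx, window in loader:
    # i_idx/j_idx carry per-pixel bin coordinates; window is the O/E crop
    print(window.shape)  # (8, 1, 224, 224); a second channel is stacked when decoy=True
    break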
polaris/utils/util_depth.py ADDED
@@ -0,0 +1,57 @@
+ import click
+ import cooler
+ import numpy as np
+ from tqdm import tqdm
+ from multiprocessing import Pool
+
+ np.seterr(divide='ignore', invalid='ignore')
+
+ def process_chrom(args):
+     chrom_name, input_file, resol, mindis, exclude_self = args
+     try:
+         C = cooler.Cooler(f"{input_file}::resolutions/{resol}")
+         pixels = C.matrix(
+             balance=False, sparse=True, as_pixels=True).fetch(chrom_name)
+         bin_diff = pixels['bin2_id'] - pixels['bin1_id']
+         min_diff = max(mindis, 1) if exclude_self else mindis
+         mask = bin_diff >= min_diff
+         return pixels[mask]['count'].sum()
+     except Exception as e:
+         print(f"Error processing {chrom_name}: {e}")
+         return 0
+
+ @click.command()
+ @click.option('-c', '--chrom', type=str, default=None, help='Comma-separated chroms [default: all chromosomes]')
+ @click.option('-md', '--mindis', type=int, default=0, help='Min genomic distance in bins [0]')
+ @click.option('-r', '--resol', type=int, required=True, help='Resolution (bp)')
+ @click.option('-i', '--input', type=str, required=True, help='mcool file path')
+ @click.option('--exclude-self', is_flag=True, help='Exclude bin_diff=0 contacts')
+ def depth(input, resol, mindis, chrom, exclude_self):
+     """Calculate intra-chromosomal contacts with bin distance >= mindis"""
+     print('\n[polaris] Depth calculation START')
+
+     try:
+         C = cooler.Cooler(f"{input}::resolutions/{resol}")
+     except ValueError:
+         available_res = cooler.fileops.list_coolers(input)
+         raise ValueError(f"Resolution {resol} not found. Available: {available_res}")
+
+     chrom_list = chrom.split(',') if chrom else C.chromnames
+     invalid_chroms = [c for c in chrom_list if c not in C.chromnames]
+     if invalid_chroms:
+         raise ValueError(f"Invalid chromosomes: {invalid_chroms}. Valid: {C.chromnames}")
+
+     # process chromosomes in parallel
+     with Pool(processes=min(len(chrom_list), 4)) as pool:
+         args_list = [(chrom, input, resol, mindis, exclude_self) for chrom in chrom_list]
+         results = list(tqdm(pool.imap(process_chrom, args_list), total=len(chrom_list), dynamic_ncols=True))
+         total_contacts = sum(results)
+
+     print("\n[polaris] Depth calculation FINISHED")
+     print(f"File: {input} (res={resol}bp)")
+     print(f"Chromosomes: {chrom_list}")
+     print(f"Minimum bin distance: {mindis}{', exclude self' if exclude_self else ''}")
+     print(f"Total intra contacts: {total_contacts:,}")
+
+ if __name__ == '__main__':
+     depth()
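The same click test-runner pattern works here; a sketch with placeholder paths:

from click.testing import CliRunner
from polaris.utils.util_depth import depth

runner = CliRunner()
result = runner.invoke(depth, ['-i', 'sample.mcool', '-r', '5000',
                               '-c', 'chr1,chr2', '--exclude-self'])
print(result.output)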
polaris/utils/util_loop.py ADDED
@@ -0,0 +1,12 @@
+ class bedpewriter():
+     def __init__(self, file_path, resol, max_distance):
+         self.f = open(file_path, 'w')
+         self.resol = resol
+         self.max_distance = max_distance
+
+     def write(self, chrom, x, y, prob):
+         for i in range(len(x)):
+             # if x[i] < y[i] and y[i]-x[i] > 11*self.resol and y[i] - x[i] < self.max_distance:
+             if x[i] < y[i] and y[i] - x[i] < self.max_distance:
+                 self.f.write(chrom + '\t' + str(x[i]) + '\t' + str(x[i] + self.resol)
+                              + '\t' + chrom + '\t' + str(y[i]) + '\t' + str(y[i] + self.resol)
+                              + '\t' + str(prob[i]) + '\n')
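A minimal sketch of writing two illustrative loop calls with bedpewriter; note the class keeps its file handle open, so the caller should close it:

from polaris.utils.util_loop import bedpewriter

writer = bedpewriter('loops.bedpe', resol=5000, max_distance=3000000)
writer.write('chr1', x=[100000, 250000], y=[400000, 900000], prob=[0.98, 0.87])
writer.f.close()  # bedpewriter never closes the handle itself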
polaris/utils/util_pileup.py ADDED
@@ -0,0 +1,95 @@
+ import numpy as np
+ import click
+ import pandas as pd
+ from polaris.utils.util_bcooler import bcool
+ from matplotlib import pylab as plt
+ from matplotlib.colors import LinearSegmentedColormap
+ cmap = LinearSegmentedColormap.from_list('wr', ["w", "r"], N=256)
+
+ def p2LL(x, cw=3):
+     """
+     P2LL for a peak.
+     Parameters:
+         x : square matrix, the peak and its surroundings
+         cw : lower-left corner width
+     """
+     c = x.shape[0] // 2
+     llcorner = x[-cw:, :cw].flatten()
+     if sum(llcorner) == 0:
+         return 0, np.sum(x[c, c] > x[c-1:c+2, c-1:c+2])
+     return x[c, c] / (sum(llcorner) / len(llcorner)), np.sum(x[c, c] > x[c-1:c+2, c-1:c+2])
+
+ @click.command()
+ @click.option('-w', type=int, default=10, help="window size (bins): (2w+1)x(2w+1) [10]")
+ @click.option('--savefig', type=str, default=None, help="save pileup plot to file [FOCI_pileup.png]")
+ @click.option('--p2ll', type=bool, default=False, help="compute P2LL [False]")
+ @click.option('--mindistance', type=int, default=None, help="min distance (bins); closer foci are skipped, bedpe foci only [2w+1]")
+ @click.option('--maxdistance', type=int, default=1e9, help="max distance (bins); farther foci are skipped, bedpe foci only [1e9]")
+ @click.option('--resol', type=int, default=5000, help="resolution [5000]")
+ @click.option('--oe', type=bool, default=True, help="O/E normalized [True]")
+ @click.argument('foci', type=str, required=True)
+ @click.argument('mcool', type=str, required=True)
+ def pileup(w, savefig, p2ll, mindistance, resol, maxdistance, foci, mcool, oe):
+     '''2D pileup of contact maps around given foci
+
+     \b
+     FOCI format: bedpe file containing loops
+     \f
+     :param w:
+     :param savefig:
+     :param p2ll:
+     :param mindistance:
+     :param resol:
+     :param maxdistance:
+     :param foci:
+     :param mcool:
+     :param oe:
+     :return:
+     '''
+     if mindistance is None:
+         mindistance = 2 * w + 1
+     if savefig is None:
+         savefig = foci + '_pileup.png'
+     bcoolFile = bcool(mcool + '::/resolutions/' + str(resol))
+     pileup = np.zeros((2 * w + 1, 2 * w + 1))
+     if '.bedpe' in foci:
+         filetype = 'bedpe'
+     else:
+         filetype = 'bed'
+     if oe:
+         oeType = 'oe'
+     else:
+         oeType = 'o'
+
+     foci = pd.read_csv(foci, sep='\t', header=None)
+
+     # keep bedpe foci within the requested distance range
+     if filetype == 'bedpe':
+         foci = foci[foci[4] - foci[1] > mindistance * resol]
+         foci = foci[foci[4] - foci[1] < maxdistance * resol]
+
+     chroms = list(set(foci[0]))
+
+     n = 0
+     for chrom in chroms:
+         fociChr = foci[foci[0] == chrom]
+         X = list(fociChr[1])
+         if filetype == 'bedpe':
+             Y = list(fociChr[4])
+         else:
+             Y = X.copy()
+         bmatrix = bcoolFile.bchr(chrom, decoy=False)
+
+         for x, y in zip(X, Y):
+             mat, meta = bmatrix.square(x, y, w, oeType)
+             pileup += mat[0, :, :]
+             n += 1
+     pileup /= n
+     plt.figure(figsize=(2, 2))
+     plt.imshow(pileup, cmap=cmap)
+     plt.xticks([])
+     plt.yticks([])
+     if p2ll:
+         plt.title('P2LL=' + "{:.2f}".format(p2LL(pileup)[0]), fontsize=12)
+     plt.savefig(savefig, dpi=600)
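And a sketch of driving the pileup command programmatically; loops.bedpe and sample.mcool are placeholders:

from click.testing import CliRunner
from polaris.utils.util_pileup import pileup

runner = CliRunner()
result = runner.invoke(pileup, ['-w', '10', '--resol', '5000', '--p2ll', 'True',
                                'loops.bedpe', 'sample.mcool'])
# writes the pileup plot to loops.bedpe_pileup.png by default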
polaris/version.py ADDED
@@ -0,0 +1 @@
+ __version__ = '1.0.0'
setup.py ADDED
@@ -0,0 +1,56 @@
+ # This code references the following repositories:
+ # RefHiC: https://github.com/BlanchetteLab/RefHiC (analysis code)
+ # Axial Attention: https://github.com/lucidrains/axial-attention (model architecture)
+ # Peakachu: https://github.com/tariks/peakachu (calculating intra reads)
+ # Many thanks for their implementations.
+
+ """
+ Setup script for Polaris.
+
+ A Versatile Framework for Chromatin Loop Annotation in Bulk and Single-cell Hi-C Data.
+ """
+
+ from setuptools import setup, find_packages
+
+ with open("README.md", "r") as readme:
+     long_des = readme.read()
+
+ setup(
+     name='polaris',
+     version='1.0.1',
+     author="Yusen HOU, Audrey Baguette, Mathieu Blanchette*, Yanlin Zhang*",
+     author_email="[email protected]",
+     description="A Versatile Framework for Chromatin Loop Annotation in Bulk and Single-cell Hi-C Data",
+     long_description=long_des,
+     long_description_content_type="text/markdown",
+     url="https://github.com/ai4nucleome/Polaris",
+     packages=['polaris'],
+     include_package_data=True,
+     install_requires=[
+         'setuptools==75.1.0',
+         'appdirs==1.4.4',
+         'click==8.0.1',
+         'cooler==0.8.11',
+         'matplotlib==3.8.0',
+         'numpy==1.22.4',
+         'pandas==1.3.0',
+         'scikit-learn==1.4.2',
+         'scipy==1.7.3',
+         'torch==2.2.2',
+         'timm==0.6.12',
+         'tqdm==4.65.0',
+     ],
+     entry_points={
+         'console_scripts': [
+             'polaris = polaris.polaris:cli',
+         ],
+     },
+     classifiers=[
+         "Programming Language :: Python :: 3",
+         "License :: OSI Approved :: MIT License",
+         "Intended Audience :: Science/Research",
+         "Topic :: Scientific/Engineering :: Bio-Informatics",
+         "Operating System :: OS Independent",
+     ],
+     python_requires='>=3.9',
+ )
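The console_scripts entry point above means that, once the package is installed, the `polaris` command dispatches to the cli callable in polaris/polaris.py; an equivalent programmatic call:

from polaris.polaris import cli

cli()  # same as running `polaris` on the command line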