KeTuTu committed on
Commit
2999286
1 Parent(s): 4a7d26d

Upload 46 files

This is the first raw knowledge base of OmicVerse

ovrawm/t_anno_trans.txt ADDED
@@ -0,0 +1,171 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Celltype annotation transfer in multi-omics
5
+ #
6
+ # In the field of multi-omics research, transferring cell type annotations from one data modality to another is a crucial step. For instance, when annotating cell types in single-cell ATAC sequencing (scATAC-seq) data, it's often desirable to leverage the cell type labels already annotated in single-cell RNA sequencing (scRNA-seq) data. This process involves integrating information from both scRNA-seq and scATAC-seq data modalities.
7
+ #
8
+ # GLUE is a prominent algorithm used for cross-modality integration, allowing researchers to combine data from different omics modalities effectively. However, GLUE does not inherently provide a method for transferring cell type labels from scRNA-seq to scATAC-seq data. To address this limitation, an approach was implemented in the omicverse platform using K-nearest neighbor (KNN) graphs.
9
+ #
10
+ # Because GLUE places both modalities in a shared embedding (`X_glue`), the transfer does not need a separate graph for each modality. A KNN classifier is trained on the scRNA-seq cells in this embedding, and each scATAC-seq cell is then assigned a cell type label based on the labels of its K nearest scRNA-seq neighbors, together with an uncertainty score for the prediction.
11
+ #
12
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1aIMmSgyIw-PGjJ65WvMgz4Ob3EtoK_UV?usp=sharing
13
+
14
+ # In[3]:
15
+
16
+
17
+ import omicverse as ov
18
+ import matplotlib.pyplot as plt
19
+ import scanpy as sc
20
+ ov.ov_plot_set()
21
+
22
+
23
+ # ## Loading the data preprocessed with GLUE
24
+ #
25
+ # Here, we use two output files from the GLUE cross-modal integration. Both contain the `obsm['X_glue']` embedding, and the RNA data has already been annotated.
26
+
27
+ # In[4]:
28
+
29
+
30
+ rna=sc.read("data/analysis_lymph/rna-emb.h5ad")
31
+ atac=sc.read("data/analysis_lymph/atac-emb.h5ad")
32
+
33
+
34
+ # We can visualize the integration produced by GLUE in a two-dimensional embedding (computed here with MDE)
35
+
36
+ # In[5]:
37
+
38
+
39
+ import scanpy as sc
40
+ combined=sc.concat([rna,atac],merge='same')
41
+ combined
42
+
43
+
44
+ # In[6]:
45
+
46
+
47
+ combined.obsm['X_mde']=ov.utils.mde(combined.obsm['X_glue'])
48
+
49
+
50
+ # We can see that the two layers are correctly aligned
51
+
52
+ # In[8]:
53
+
54
+
55
+ ov.utils.embedding(combined,
56
+ basis='X_mde',
57
+ color='domain',
58
+ title='Layers',
59
+ show=False,
60
+ palette=ov.utils.red_color,
61
+ frameon='small'
62
+ )
63
+
64
+
65
+ # And the RNA modality has an already annotated cell type label on it
66
+
67
+ # In[22]:
68
+
69
+
70
+ ov.utils.embedding(rna,
71
+ basis='X_mde',
72
+ color='major_celltype',
73
+ title='Cell type',
74
+ show=False,
75
+ #palette=ov.utils.red_color,
76
+ frameon='small'
77
+ )
78
+
79
+
80
+ # ## Celltype transfer
81
+ #
82
+ # We train a k-nearest-neighbour (KNN) classifier using the `X_glue` features
83
+
84
+ # In[13]:
85
+
86
+
87
+ knn_transformer=ov.utils.weighted_knn_trainer(
88
+ train_adata=rna,
89
+ train_adata_emb='X_glue',
90
+ n_neighbors=15,
91
+ )
92
+
93
+
94
+ # In[14]:
95
+
96
+
97
+ labels,uncert=ov.utils.weighted_knn_transfer(
98
+ query_adata=atac,
99
+ query_adata_emb='X_glue',
100
+ label_keys='major_celltype',
101
+ knn_model=knn_transformer,
102
+ ref_adata_obs=rna.obs,
103
+ )
104
+
105
+
106
+ # We copy the predictions of the KNN classifier into `atac.obs`. The `_unc` column stores the uncertainty of each prediction: higher uncertainty indicates a less reliable transfer and suggests that the cell may have a mixed (double-fate) signature or belong to some other cell type.
107
+
108
+ # In[15]:
109
+
110
+
111
+ atac.obs["transf_celltype"]=labels.loc[atac.obs.index,"major_celltype"]
112
+ atac.obs["transf_celltype_unc"]=uncert.loc[atac.obs.index,"major_celltype"]
113
+
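The weighted KNN transfer used above can be illustrated with a small, self-contained sketch. It is not the omicverse implementation: the function name, the Gaussian kernel, and the definition of uncertainty as one minus the winning label's share of the vote are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def knn_label_transfer(ref_points, ref_labels, query_points, k=3):
    """Hypothetical helper: weighted KNN vote in a shared embedding.

    Returns one predicted label and one uncertainty per query point.
    """
    labels, uncertainties = [], []
    for q in query_points:
        # Distances from the query cell to every reference cell; keep the k closest
        nearest = sorted(
            (math.dist(q, r), lab) for r, lab in zip(ref_points, ref_labels)
        )[:k]
        # Gaussian kernel (an assumption): closer neighbours get larger weights
        sigma = (sum(d for d, _ in nearest) / len(nearest)) or 1.0
        votes = defaultdict(float)
        for d, lab in nearest:
            votes[lab] += math.exp(-((d / sigma) ** 2))
        total = sum(votes.values())
        best = max(votes, key=votes.get)
        labels.append(best)
        # Uncertainty: share of the vote that did NOT go to the winning label
        uncertainties.append(1.0 - votes[best] / total)
    return labels, uncertainties


# Toy "RNA" reference cells (annotated) and "ATAC" query cells in a 2D embedding
rna_emb = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
rna_labels = ["T cell", "T cell", "B cell", "B cell"]
atac_emb = [(0.1, 0.0), (1.8, 0.0), (3.9, 0.0)]

pred, unc = knn_label_transfer(rna_emb, rna_labels, atac_emb, k=3)
for label, u in zip(pred, unc):
    print(label, round(u, 3))
```

The middle query cell sits between the two reference populations, so its neighbours disagree and its uncertainty is higher than that of the other two cells.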
114
+
115
+ # In[24]:
116
+
117
+
118
+ atac.obs["major_celltype"]=atac.obs["transf_celltype"].copy()
119
+
120
+
121
+ # In[27]:
122
+
123
+
124
+ ov.utils.embedding(atac,
125
+ basis='X_umap',
126
+ color=['transf_celltype_unc','transf_celltype'],
127
+ #title='Cell type Un',
128
+ show=False,
129
+ palette=ov.palette()[11:],
130
+ frameon='small'
131
+ )
132
+
133
+
134
+ # ## Visualization
135
+ #
136
+ # After transferring the annotation, we can merge atac and rna and check on the embedding plot whether the cell types are consistent across the two modalities.
137
+
138
+ # In[28]:
139
+
140
+
141
+ import scanpy as sc
142
+ combined1=sc.concat([rna,atac],merge='same')
143
+ combined1
144
+
145
+
146
+ # In[29]:
147
+
148
+
149
+ combined1.obsm['X_mde']=ov.utils.mde(combined1.obsm['X_glue'])
150
+
151
+
152
+ # The transferred annotation agrees well with the RNA annotation, suggesting that the KNN classifier we constructed can effectively transfer cell type labels from RNA to ATAC.
153
+
154
+ # In[31]:
155
+
156
+
157
+ ov.utils.embedding(combined1,
158
+ basis='X_mde',
159
+ color=['domain','major_celltype'],
160
+ title=['Layers','Cell type'],
161
+ show=False,
162
+ palette=ov.palette()[11:],
163
+ frameon='small'
164
+ )
165
+
166
+
167
+ # In[ ]:
168
+
169
+
170
+
171
+
ovrawm/t_aucell.txt ADDED
@@ -0,0 +1,93 @@
1
+ import omicverse as ov
2
+ import scanpy as sc
3
+ import scvelo as scv
4
+
5
+ ov.utils.ov_plot_set()
6
+
7
+ ov.utils.download_pathway_database()
8
+ ov.utils.download_geneid_annotation_pair()
9
+
10
+ adata = scv.datasets.pancreas()
11
+ adata
12
+
13
+ adata.X.max()
14
+
15
+ sc.pp.normalize_total(adata, target_sum=1e4)
16
+ sc.pp.log1p(adata)
17
+
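For readers unfamiliar with the two preprocessing calls above, the following minimal sketch shows the arithmetic they perform on a dense count matrix: each cell is rescaled to a common total, then `log(1 + x)` is applied. The function name is hypothetical, and the sketch ignores scanpy's extra options and sparse-matrix handling.

```python
import math


def normalize_total_log1p(counts, target_sum=1e4):
    """Hypothetical helper: scale each cell (row) to target_sum, then log1p."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left at zero (an assumption for this sketch)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


# Two cells with the same composition but a 10x difference in sequencing depth
raw_counts = [[1, 3, 0], [10, 30, 0]]
norm = normalize_total_log1p(raw_counts)
print(norm[0])
print(norm[1])
```

Both cells end up with identical values, which is the point of depth normalization: differences in total counts no longer dominate the comparison.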
18
+ adata.X.max()
19
+
20
+ pathway_dict=ov.utils.geneset_prepare('genesets/GO_Biological_Process_2021.txt',organism='Mouse')
21
+
22
+ ## Assess one geneset
23
+ geneset_name='response to vitamin (GO:0033273)'
24
+ ov.single.geneset_aucell(adata,
25
+ geneset_name=geneset_name,
26
+ geneset=pathway_dict[geneset_name])
27
+ sc.pl.embedding(adata,
28
+ basis='umap',
29
+ color=["{}_aucell".format(geneset_name)])
30
+
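The `geneset_aucell` call above scores how strongly a gene set is represented among each cell's most highly expressed genes. The sketch below illustrates that idea with a simplified recovery-curve AUC for a single cell. It is an assumption-laden illustration, not the AUCell algorithm as implemented in omicverse: the function name, the `top_fraction` parameter, and the normalisation are hypothetical, and ties are broken by input order.

```python
def recovery_auc(expression, gene_set, top_fraction=0.05):
    """Hypothetical helper: normalised area under the gene-set recovery curve.

    expression: dict mapping gene name -> expression value for ONE cell.
    """
    # Rank genes from highest to lowest expression
    ranked = sorted(expression, key=lambda g: expression[g], reverse=True)
    n_top = max(1, int(round(top_fraction * len(ranked))))
    targets = set(gene_set) & set(ranked)
    if not targets:
        return 0.0
    hits, area = 0, 0
    for gene in ranked[:n_top]:
        if gene in targets:
            hits += 1
        area += hits  # recovery curve height at this rank
    # Best case: all recoverable target genes sit at the very top of the ranking
    k = min(len(targets), n_top)
    max_area = k * (k + 1) // 2 + (n_top - k) * k
    return area / max_area


# Toy cell with 10 genes; g0 is the most expressed, g9 the least
cell = {f"g{i}": 10 - i for i in range(10)}
score_top = recovery_auc(cell, ["g0", "g1"], top_fraction=0.5)
score_mixed = recovery_auc(cell, ["g0", "g4"], top_fraction=0.5)
score_bottom = recovery_auc(cell, ["g8", "g9"], top_fraction=0.5)
print(score_top, score_mixed, score_bottom)
```

A gene set concentrated at the top of the ranking gets a score near 1, while one found only among lowly expressed genes gets 0.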
31
+ ## Assess more than one geneset
32
+ geneset_names=['response to vitamin (GO:0033273)','response to vitamin D (GO:0033280)']
33
+ ov.single.pathway_aucell(adata,
34
+ pathway_names=geneset_names,
35
+ pathways_dict=pathway_dict)
36
+ sc.pl.embedding(adata,
37
+ basis='umap',
38
+ color=[i+'_aucell' for i in geneset_names])
39
+
40
+ ## Assess a test geneset
41
+ ov.single.geneset_aucell(adata,
42
+ geneset_name='Sox',
43
+ geneset=['Sox17', 'Sox4', 'Sox7', 'Sox18', 'Sox5'])
44
+ sc.pl.embedding(adata,
45
+ basis='umap',
46
+ color=["Sox_aucell"])
47
+
48
+ ## Assess all pathways
49
+ adata_aucs=ov.single.pathway_aucell_enrichment(adata,
50
+ pathways_dict=pathway_dict,
51
+ num_workers=8)
52
+
53
+ adata_aucs.obs=adata[adata_aucs.obs.index].obs
54
+ adata_aucs.obsm=adata[adata_aucs.obs.index].obsm
55
+ adata_aucs.obsp=adata[adata_aucs.obs.index].obsp
56
+ adata_aucs
57
+
58
+ adata_aucs.write_h5ad('data/pancreas_auce.h5ad',compression='gzip')
59
+
60
+ adata_aucs=sc.read('data/pancreas_auce.h5ad')
61
+
62
+ sc.pl.embedding(adata_aucs,
63
+ basis='umap',
64
+ color=geneset_names)
65
+
66
+ #adata_aucs.uns['log1p']['base']=None
67
+ sc.tl.rank_genes_groups(adata_aucs, 'clusters', method='t-test',n_genes=100)
68
+ sc.pl.rank_genes_groups_dotplot(adata_aucs,groupby='clusters',
69
+ cmap='Spectral_r',
70
+ standard_scale='var',n_genes=3)
71
+
72
+ degs = sc.get.rank_genes_groups_df(adata_aucs, group='Beta', key='rank_genes_groups', log2fc_min=2,
73
+ pval_cutoff=0.05)['names'].squeeze()
74
+ degs
75
+
76
+ import matplotlib.pyplot as plt
77
+ #fig, axes = plt.subplots(4,3,figsize=(12,9))
78
+ axes=sc.pl.embedding(adata_aucs,ncols=3,
79
+ basis='umap',show=False,return_fig=True,wspace=0.55,hspace=0.65,
80
+ color=['clusters']+degs.values.tolist(),
81
+ title=[ov.utils.plot_text_set(i,3,20)for i in ['clusters']+degs.values.tolist()])
82
+
83
+ axes.tight_layout()
84
+
85
+ adata.uns['log1p']['base']=None
86
+ sc.tl.rank_genes_groups(adata, 'clusters', method='t-test',n_genes=100)
87
+
88
+ res=ov.single.pathway_enrichment(adata,pathways_dict=pathway_dict,organism='Mouse',
89
+ group_by='clusters',plot=True)
90
+
91
+ ax=ov.single.pathway_enrichment_plot(res,plot_title='Enrichment',cmap='Reds',
92
+ xticklabels=True,cbar=False,square=True,vmax=10,
93
+ yticklabels=True,cbar_kws={'label': '-log10(qvalue)','shrink': 0.5,})
ovrawm/t_bulk_combat.txt ADDED
@@ -0,0 +1,205 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Batch correction in Bulk RNA-seq or microarray data
5
+ #
6
+ # Variability in datasets is not only the product of biological processes: it is also the product of technical biases (Lander et al, 1999). ComBat is one of the most widely used tools for correcting these technical biases, known as batch effects.
7
+ #
8
+ # pyComBat (Behdenna et al, 2020) is a new Python implementation of ComBat (Johnson et al, 2007), a software widely used for the adjustment of batch effects in microarray data. While the mathematical framework is strictly the same, pyComBat:
9
+ #
10
+ # - has similar results in terms of batch effects correction;
11
+ # - is as fast or faster than the R implementation of ComBat and;
12
+ # - offers new tools for the community to participate in its development.
13
+ #
14
+ # Paper: [pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods](https://doi.org/10.1101/2020.03.17.995431)
15
+ #
16
+ # Code: https://github.com/epigenelabs/pyComBat
17
+ #
18
+ # Colab_Reproducibility:https://colab.research.google.com/drive/121bbIiI3j4pTZ3yA_5p8BRkRyGMMmNAq?usp=sharing
19
+
20
+ # In[7]:
21
+
22
+
23
+ import anndata
24
+ import pandas as pd
25
+ import omicverse as ov
+ import matplotlib.pyplot as plt  # needed by the boxplot cells below
26
+ ov.ov_plot_set()
27
+
28
+
29
+ # ## Loading dataset
30
+ #
31
+ # This minimal usage example illustrates how to use pyComBat in a default setting, and shows some results on ovarian cancer data, freely available on NCBI’s [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/), namely:
32
+ #
33
+ # - GSE18520
34
+ # - GSE66957
35
+ # - GSE69428
36
+ #
37
+ # The corresponding expression files are available on [GitHub](https://github.com/epigenelabs/pyComBat/tree/master/data).
38
+
39
+ # In[15]:
40
+
41
+
42
+ dataset_1 = pd.read_pickle("data/combat/GSE18520.pickle")
43
+ adata1=anndata.AnnData(dataset_1.T)
44
+ adata1.obs['batch']='1'
45
+ adata1
46
+
47
+
48
+ # In[16]:
49
+
50
+
51
+ dataset_2 = pd.read_pickle("data/combat/GSE66957.pickle")
52
+ adata2=anndata.AnnData(dataset_2.T)
53
+ adata2.obs['batch']='2'
54
+ adata2
55
+
56
+
57
+ # In[17]:
58
+
59
+
60
+ dataset_3 = pd.read_pickle("data/combat/GSE69428.pickle")
61
+ adata3=anndata.AnnData(dataset_3.T)
62
+ adata3.obs['batch']='3'
63
+ adata3
64
+
65
+
66
+ # We use the concat function to join the three datasets together and take the intersection for the same genes
67
+
68
+ # In[18]:
69
+
70
+
71
+ adata=anndata.concat([adata1,adata2,adata3],merge='same')
72
+ adata
73
+
74
+
75
+ # ## Removing batch effect
76
+
77
+ # In[31]:
78
+
79
+
80
+ ov.bulk.batch_correction(adata,batch_key='batch')
81
+
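To give an intuition for what batch correction does to a single gene, here is a deliberately simplified location/scale adjustment. It is not ComBat: ComBat additionally shrinks the per-batch estimates with empirical Bayes, which this sketch omits. The function name and the choice of pooled within-batch standard deviation are assumptions.

```python
import math
from statistics import mean, pstdev


def adjust_gene(values, batches):
    """Hypothetical helper: align per-batch mean and spread for ONE gene."""
    grand_mean = mean(values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    stats = {b: (mean(g), pstdev(g)) for b, g in groups.items()}
    # Pooled within-batch standard deviation, weighted by batch size
    pooled_sd = math.sqrt(
        sum(len(g) * stats[b][1] ** 2 for b, g in groups.items()) / len(values)
    )
    adjusted = []
    for v, b in zip(values, batches):
        mu, sd = stats[b]
        z = (v - mu) / sd if sd > 0 else 0.0  # standardise within the batch
        adjusted.append(grand_mean + z * pooled_sd)
    return adjusted


# One gene measured in two batches with a large technical offset
expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batch = ["1", "1", "1", "2", "2", "2"]
corrected = adjust_gene(expr, batch)
print([round(x, 3) for x in corrected])
```

After adjustment both batches share the same mean, while the ordering of samples within each batch is preserved.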
82
+
83
+ # ## Saving results
84
+ #
85
+ # Raw datasets
86
+
87
+ # In[70]:
88
+
89
+
90
+ raw_data=adata.to_df().T
91
+ raw_data.head()
92
+
93
+
94
+ # Batch-corrected datasets
95
+
96
+ # In[71]:
97
+
98
+
99
+ removing_data=adata.to_df(layer='batch_correction').T
100
+ removing_data.head()
101
+
102
+
103
+ # save
104
+
105
+ # In[ ]:
106
+
107
+
108
+ raw_data.to_csv('raw_data.csv')
109
+ removing_data.to_csv('removing_data.csv')
110
+
111
+
112
+ # You can also save the adata object
113
+
114
+ # In[ ]:
115
+
116
+
117
+ adata.write_h5ad('adata_batch.h5ad',compression='gzip')
118
+ #adata=ov.read('adata_batch.h5ad')
119
+
120
+
121
+ # ## Compare the dataset before and after correction
122
+ #
123
+ # We specify three different colours for three different datasets
124
+
125
+ # In[51]:
126
+
127
+
128
+ color_dict={
129
+ '1':ov.utils.red_color[1],
130
+ '2':ov.utils.blue_color[1],
131
+ '3':ov.utils.green_color[1],
132
+ }
133
+
134
+
135
+ # In[57]:
136
+
137
+
138
+ fig,ax=plt.subplots( figsize = (20,4))
139
+ bp=plt.boxplot(adata.to_df().T,patch_artist=True)
140
+ for i,batch in zip(range(adata.shape[0]),adata.obs['batch']):
141
+     bp['boxes'][i].set_facecolor(color_dict[batch])
142
+ ax.axis(False)
143
+ plt.show()
144
+
145
+
146
+ # In[58]:
147
+
148
+
149
+ fig,ax=plt.subplots( figsize = (20,4))
150
+ bp=plt.boxplot(adata.to_df(layer='batch_correction').T,patch_artist=True)
151
+ for i,batch in zip(range(adata.shape[0]),adata.obs['batch']):
152
+     bp['boxes'][i].set_facecolor(color_dict[batch])
153
+ ax.axis(False)
154
+ plt.show()
155
+
156
+
157
+ # In addition to using boxplots to observe the effect of batch removal, we can also use PCA to observe the effect of batch removal
158
+
159
+ # In[59]:
160
+
161
+
162
+ adata.layers['raw']=adata.X.copy()
163
+
164
+
165
+ # We first calculate the PCA on the raw dataset
166
+
167
+ # In[60]:
168
+
169
+
170
+ ov.pp.pca(adata,layer='raw',n_pcs=50)
171
+ adata
172
+
173
+
174
+ # We then calculate the PCA on the batch_correction dataset
175
+
176
+ # In[61]:
177
+
178
+
179
+ ov.pp.pca(adata,layer='batch_correction',n_pcs=50)
180
+ adata
181
+
182
+
183
+ # In[62]:
184
+
185
+
186
+ ov.utils.embedding(adata,
187
+ basis='raw|original|X_pca',
188
+ color='batch',
189
+ frameon='small')
190
+
191
+
192
+ # In[63]:
193
+
194
+
195
+ ov.utils.embedding(adata,
196
+ basis='batch_correction|original|X_pca',
197
+ color='batch',
198
+ frameon='small')
199
+
200
+
201
+ # In[ ]:
202
+
203
+
204
+
205
+
ovrawm/t_cellanno.txt ADDED
@@ -0,0 +1,378 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Celltype auto annotation with SCSA
5
+ # Single-cell transcriptomics allows the analysis of thousands of cells in a single experiment and the identification of novel cell types, states and dynamics in a variety of tissues and organisms. Standard experimental protocols and analytical workflows have been developed to create single-cell transcriptomic maps from tissues.
6
+ #
7
+ # This tutorial focuses on how to interpret this data to identify cell types, states, and other biologically relevant patterns with the goal of creating annotated cell maps.
8
+ #
9
+ # Paper: [SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data](https://doi.org/10.3389/fgene.2020.00490)
10
+ #
11
+ # Code: https://github.com/bioinfo-ibms-pumc/SCSA
12
+ #
13
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1BC6hPS0CyBhNu0BYk8evu57-ua1bAS0T?usp=sharing
14
+ #
15
+ # <div class="admonition warning">
16
+ # <p class="admonition-title">Note</p>
17
+ # <p>
18
+ # SCSA is not suitable for annotating rare cell types
19
+ # </p>
20
+ # </div>
21
+ #
22
+ # ![scsa](https://www.frontiersin.org/files/Articles/524690/fgene-11-00490-HTML/image_m/fgene-11-00490-g001.jpg)
23
+
24
+ # In[1]:
25
+
26
+
27
+ import omicverse as ov
28
+ print(f'omicverse version:{ov.__version__}')
29
+ import scanpy as sc
30
+ print(f'scanpy version:{sc.__version__}')
31
+ ov.ov_plot_set()
32
+
33
+
34
+ # ## Loading data
35
+ #
36
+ # The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
37
+ #
38
+
39
+ # In[2]:
40
+
41
+
42
+ # !mkdir data
43
+ # !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
44
+ # !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
45
+ # !mkdir write
46
+
47
+
48
+ # Read in the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5-based file format: `.h5ad`.
49
+
50
+ # In[3]:
51
+
52
+
53
+ adata = sc.read_10x_mtx(
54
+ 'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
55
+ var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
56
+ cache=True) # write a cache file for faster subsequent reading
57
+
58
+
59
+ # ## Data preprocessing
60
+ #
61
+ # `ov.single.scanpy_lazy` can preprocess raw scRNA-seq data in a single call: it filters doublets, normalizes counts per cell, applies log1p, extracts highly variable genes, and clusters the cells.
62
+ #
63
+ # Here we instead run the preprocessing step by step; please refer to our [preprocess chapter](https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_preprocess/) for a detailed explanation of each step.
64
+ #
65
+ # We stored the raw counts in `count` layers, and the raw data in `adata.raw.to_adata()`.
66
+
67
+ # In[4]:
68
+
69
+
70
+ #adata=ov.single.scanpy_lazy(adata)
71
+
72
+ #quality control
73
+ adata=ov.pp.qc(adata,
74
+ tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
75
+ #normalize and calculate highly variable genes (HVGs)
76
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
77
+
78
+ #save the whole genes and filter the non-HVGs
79
+ adata.raw = adata
80
+ adata = adata[:, adata.var.highly_variable_features]
81
+
82
+ #scale the adata.X
83
+ ov.pp.scale(adata)
84
+
85
+ #Dimensionality Reduction
86
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
87
+
88
+ #Neighbourhood graph construction
89
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
90
+ use_rep='scaled|original|X_pca')
91
+
92
+ #clusters
93
+ sc.tl.leiden(adata)
94
+
95
+ #Dimensionality Reduction for visualization(X_mde=X_umap+GPU)
96
+ adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
97
+ adata
98
+
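The quality-control thresholds passed to `ov.pp.qc` above can be read as a simple per-cell filter. The sketch below assumes that `mito_perc` is a maximum mitochondrial fraction and that `nUMIs` and `detected_genes` are minimum values; that reading, the field names, and whether the bounds are inclusive are assumptions for illustration, not a description of omicverse internals.

```python
def passes_qc(cell, max_mito_frac=0.05, min_umis=500, min_genes=250):
    """Hypothetical helper: keep a cell only if it meets all three thresholds."""
    return (
        cell["mito_frac"] <= max_mito_frac
        and cell["n_umis"] >= min_umis
        and cell["n_genes"] >= min_genes
    )


# Toy per-cell QC metrics
cells = [
    {"id": "c1", "mito_frac": 0.02, "n_umis": 2400, "n_genes": 900},  # good cell
    {"id": "c2", "mito_frac": 0.12, "n_umis": 1800, "n_genes": 700},  # high mito
    {"id": "c3", "mito_frac": 0.03, "n_umis": 300, "n_genes": 260},   # too few UMIs
    {"id": "c4", "mito_frac": 0.01, "n_umis": 650, "n_genes": 120},   # too few genes
]
kept = [c["id"] for c in cells if passes_qc(c)]
print(kept)
```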
99
+
100
+ # ## Automatic cell annotation
101
+ #
102
+ # We create a pySCSA object from `adata` and set a few parameters so that the annotation runs correctly.
103
+ #
104
+ # For normal tissue, set `celltype='normal'` and `target='cellmarker'` or `'panglaodb'`.
105
+ #
106
+ # For cancer samples, set `celltype='cancer'` and `target='cancersea'`.
107
+ #
108
+ # <div class="admonition note">
109
+ # <p class="admonition-title">Note</p>
110
+ # <p>
111
+ # SCSA needs its database before annotation. The database is downloaded automatically, but the download can fail because of network errors.
112
+ # </p>
113
+ # </div>
114
+ #
115
+ # The database can also be downloaded manually from [figshare](https://figshare.com/ndownloader/files/41369037) or [Google Drive](https://drive.google.com/drive/folders/1pqyuCp8mTXDFRGUkX8iDdPAg45JHvheF?usp=sharing). In that case, set the parameter `model_path` to the path of the downloaded file.
116
+
117
+ # In[5]:
118
+
119
+
120
+ scsa=ov.single.pySCSA(adata=adata,
121
+ foldchange=1.5,
122
+ pvalue=0.01,
123
+ celltype='normal',
124
+ target='cellmarker',
125
+ tissue='All',
126
+ model_path='temp/pySCSA_2023_v2_plus.db'
127
+ )
128
+
129
+
130
+ # The clustering above used the Leiden algorithm, so we set `clustertype='leiden'`; change it if you used Louvain. Here we annotate all clusters (`cluster='all'`). To annotate only some of the clusters, pass them in the format `'[1]'`, `'[1,2,3]'`, `'[...]'`.
131
+ #
132
+ # `rank_rep` controls whether `sc.tl.rank_genes_groups(adata, clustertype, method='wilcoxon')` is run first. If `rank_genes_groups` is already present in `adata.uns`, `rank_rep` can be set to `False`.
133
+
134
+ # In[6]:
135
+
136
+
137
+ anno=scsa.cell_anno(clustertype='leiden',
138
+ cluster='all',rank_rep=True)
139
+
140
+
141
+ # We can query only the better annotated results
142
+
143
+ # In[7]:
144
+
145
+
146
+ scsa.cell_auto_anno(adata,key='scsa_celltype_cellmarker')
147
+
148
+
149
+ # We can also use `panglaodb` as target to annotate the celltype
150
+
151
+ # In[8]:
152
+
153
+
154
+ scsa=ov.single.pySCSA(adata=adata,
155
+ foldchange=1.5,
156
+ pvalue=0.01,
157
+ celltype='normal',
158
+ target='panglaodb',
159
+ tissue='All',
160
+ model_path='temp/pySCSA_2023_v2_plus.db'
161
+
162
+ )
163
+
164
+
165
+ # In[9]:
166
+
167
+
168
+ res=scsa.cell_anno(clustertype='leiden',
169
+ cluster='all',rank_rep=True)
170
+
171
+
172
+ # We can query only the better annotated results
173
+
174
+ # In[10]:
175
+
176
+
177
+ scsa.cell_anno_print()
178
+
179
+
180
+ # In[11]:
181
+
182
+
183
+ scsa.cell_auto_anno(adata,key='scsa_celltype_panglaodb')
184
+
185
+
186
+ # Here, we introduce the dimensionality reduction visualisation function `ov.utils.embedding`, which is similar to `scanpy.pl.embedding`, except that when we set `frameon='small'`, we scale the axes to the bottom-left corner and scale the colourbar to the bottom-right corner.
187
+ #
188
+ # - adata: the anndata object
189
+ # - basis: the visualized embedding stored in adata.obsm
190
+ # - color: the visualized obs/var
191
+ # - legend_loc: the location of the legend; if set to None, the legend is drawn on the right.
192
+ # - frameon: it can be set `small`, False or None
193
+ # - legend_fontoutline: the outline in the text of legend.
194
+ # - palette: the colours used for the different categories. omicverse provides several preset palettes, including `ov.utils.palette()`, `ov.utils.red_color`, `ov.utils.blue_color`, `ov.utils.green_color`, and `ov.utils.orange_color`, which can help you achieve a more attractive visualisation.
195
+
196
+ # In[12]:
197
+
198
+
199
+ ov.utils.embedding(adata,
200
+ basis='X_mde',
201
+ color=['leiden','scsa_celltype_cellmarker','scsa_celltype_panglaodb'],
202
+ legend_loc='on data',
203
+ frameon='small',
204
+ legend_fontoutline=2,
205
+ palette=ov.utils.palette()[14:],
206
+ )
207
+
208
+
209
+ # If you want to draw stacked histograms of cell type proportions, you first need to colour the groups you intend to draw using `ov.utils.embedding`. Then use `ov.utils.plot_cellproportion` to specify the groups you want to plot, and you can see a plot of cell proportions in the different groups
210
+
211
+ # In[13]:
212
+
213
+
214
+ #Arbitrarily designate the first 1000 cells as group B and the rest as group A (for illustration only)
215
+ adata.obs['group']='A'
216
+ adata.obs.loc[adata.obs.index[:1000],'group']='B'
217
+ #Colored
218
+ ov.utils.embedding(adata,
219
+ basis='X_mde',
220
+ color=['group'],
221
+ frameon='small',legend_fontoutline=2,
222
+ palette=ov.utils.red_color,
223
+ )
224
+
225
+
226
+ # In[14]:
227
+
228
+
229
+ ov.utils.plot_cellproportion(adata=adata,celltype_clusters='scsa_celltype_cellmarker',
230
+ visual_clusters='group',
231
+ visual_name='group',figsize=(2,4))
232
+
233
+
234
+ # We also provide another way to visualise the embedding together with the cell types, using `ov.utils.plot_embedding_celltype`
235
+
236
+ # In[15]:
237
+
238
+
239
+ ov.utils.plot_embedding_celltype(adata,figsize=None,basis='X_mde',
240
+ celltype_key='scsa_celltype_cellmarker',
241
+ title=' Cell type',
242
+ celltype_range=(2,6),
243
+ embedding_range=(4,10),)
244
+
245
+
246
+ # We calculated the ratio of observed to expected cell numbers (Ro/e) for each cluster in different tissues to quantify the tissue preference of each cluster (Guo et al., 2018; Zhang et al., 2018). The expected cell numbers for each combination of cell clusters and tissues were obtained from the chi-square test. A cluster was identified as being enriched in a specific tissue if Ro/e > 1.
247
+ #
248
+ # The Ro/e function was written by `Haihao Zhang`.
249
+
250
+ # In[16]:
251
+
252
+
253
+ roe=ov.utils.roe(adata,sample_key='group',cell_type_key='scsa_celltype_cellmarker')
254
+
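The Ro/e statistic described above can be computed by hand from a contingency table: the expected count for a (cell type, group) pair is the product of the row and column totals divided by the grand total, as in a chi-square test. The sketch below is a minimal illustration with a hypothetical function name; it is not the `ov.utils.roe` implementation.

```python
from collections import Counter


def ro_e(groups, cell_types):
    """Hypothetical helper: observed / expected cell counts per (cell type, group)."""
    n = len(groups)
    observed = Counter(zip(cell_types, groups))
    type_totals = Counter(cell_types)
    group_totals = Counter(groups)
    return {
        # Expected count under independence: row total * column total / grand total
        (ct, g): observed[(ct, g)] / (type_totals[ct] * group_totals[g] / n)
        for ct in type_totals
        for g in group_totals
    }


# Toy data: 8 cells, two groups, two cell types
toy_groups = ["A"] * 4 + ["B"] * 4
toy_types = ["T cell", "T cell", "T cell", "B cell",
             "T cell", "B cell", "B cell", "B cell"]
ratios = ro_e(toy_groups, toy_types)
for key, value in sorted(ratios.items()):
    print(key, value)
```

Here T cells are enriched in group A (Ro/e = 1.5) and depleted in group B (Ro/e = 0.5), and the reverse holds for B cells.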
255
+
256
+ # In[40]:
257
+
258
+
259
+ import seaborn as sns
260
+ import matplotlib.pyplot as plt
261
+ fig, ax = plt.subplots(figsize=(2,4))
262
+
263
+ transformed_roe = roe.copy()
264
+ transformed_roe = transformed_roe.applymap(
265
+ lambda x: '+++' if x >= 2 else ('++' if x >= 1.5 else ('+' if x >= 1 else '+/-')))
266
+
267
+ sns.heatmap(roe, annot=transformed_roe, cmap='RdBu_r', fmt='',
268
+ cbar=True, ax=ax,vmin=0.5,vmax=1.5,cbar_kws={'shrink':0.5})
269
+ plt.xticks(fontsize=12)
270
+ plt.yticks(fontsize=12)
271
+
272
+ plt.xlabel('Group',fontsize=13)
273
+ plt.ylabel('Cell type',fontsize=13)
274
+ plt.title('Ro/e',fontsize=13)
275
+
276
+
277
+ # ## Manual cell annotation
278
+ #
279
+ # To assess the accuracy of the automatic annotation, we manually annotate the clusters using marker genes and compare the result with pySCSA.
280
+ #
281
+ # We first need to prepare a dictionary of marker genes
282
+
283
+ # In[38]:
284
+
285
+
286
+ res_marker_dict={
287
+ 'Megakaryocyte':['ITGA2B','ITGB3'],
288
+ 'Dendritic cell':['CLEC10A','IDO1'],
289
+ 'Monocyte' :['S100A8','S100A9','LST1',],
290
+ 'Macrophage':['CSF1R','CD68'],
291
+ 'B cell':['MS4A1','CD79A','MZB1',],
292
+ 'NK/NKT cell':['GNLY','KLRD1'],
293
+ 'CD8+T cell':['CD8A','CD8B'],
294
+ 'Treg':['CD4','CD40LG','IL7R','FOXP3','IL2RA'],
295
+ 'CD4+T cell':['PTPRC','CD3D','CD3E'],
296
+
297
+ }
298
+
299
+
300
+ # We then calculate the expression of the marker genes in each cluster and the fraction of cells expressing them
301
+
302
+ # In[39]:
303
+
304
+
305
+ sc.tl.dendrogram(adata,'leiden')
306
+ sc.pl.dotplot(adata, res_marker_dict, 'leiden',
307
+ dendrogram=True,standard_scale='var')
308
+
309
+
310
+ # Based on the dotplot, we name each cluster using `ov.single.scanpy_cellanno_from_dict`
311
+
312
+ # In[40]:
313
+
314
+
315
+ # create a dictionary to map cluster to annotation label
316
+ cluster2annotation = {
317
+ '0': 'T cell',
318
+ '1': 'T cell',
319
+ '2': 'Monocyte',#Germ-cell(Oid)
320
+ '3': 'B cell',#Germ-cell(Oid)
321
+ '4': 'T cell',
322
+ '5': 'Macrophage',
323
+ '6': 'NKT cells',
324
+ '7': 'T cell',
325
+ '8':'Monocyte',
326
+ '9':'Dendritic cell',
327
+ '10':'Megakaryocyte',
328
+
329
+ }
330
+ ov.single.scanpy_cellanno_from_dict(adata,anno_dict=cluster2annotation,
331
+ clustertype='leiden')
332
+
333
+
334
+ # ## Comparing pySCSA with the manual annotation
335
+ #
336
+ # The automatic annotation is almost identical to the manual one. The only difference is between monocytes and macrophages, and because the earlier pySCSA output offered `monocyte|macrophage` as an option, pySCSA can be considered to perform well on the pbmc3k data
337
+
338
+ # In[52]:
339
+
340
+
341
+ ov.utils.embedding(adata,
342
+ basis='X_mde',
343
+ color=['major_celltype','scsa_celltype_cellmarker'],
344
+ legend_loc='on data', frameon='small',legend_fontoutline=2,
345
+ palette=ov.utils.palette()[14:],
346
+ )
347
+
348
+
349
+ # We can use `get_celltype_marker` to obtain the marker of each celltype
350
+
351
+ # In[42]:
352
+
353
+
354
+ marker_dict=ov.single.get_celltype_marker(adata,clustertype='scsa_celltype_cellmarker')
355
+ marker_dict.keys()
356
+
357
+
358
+ # In[43]:
359
+
360
+
361
+ marker_dict['B cell']
362
+
363
+
364
+ # ## The tissue name in database
365
+ #
366
+ # For annotation of cell types in specific tissues, we can query the tissues available in the database using `get_model_tissue`.
367
+
368
+ # In[44]:
369
+
370
+
371
+ scsa.get_model_tissue()
372
+
373
+
374
+ # In[ ]:
375
+
376
+
377
+
378
+
ovrawm/t_cellfate.txt ADDED
@@ -0,0 +1,218 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Identify the driver regulators of cell fate decisions
5
+ # CEFCON is a computational tool for deciphering driver regulators of cell fate decisions from single-cell RNA-seq data. It takes a prior gene interaction network and expression profiles from scRNA-seq data associated with a given developmental trajectory as inputs, and consists of three main components, including cell-lineage-specific gene regulatory network (GRN) construction, driver regulator identification and regulon-like gene module (RGM) identification.
6
+ #
7
+ # Check out [(Wang et al., Nature Communications, 2023)](https://www.nature.com/articles/s41467-023-44103-3) for the detailed methods and applications.
8
+ #
9
+ # Code: [https://github.com/WPZgithub/CEFCON](https://github.com/WPZgithub/CEFCON)
10
+ #
11
+
12
+ # In[1]:
13
+
14
+
15
+ import omicverse as ov
16
+ #print(f"omicverse version: {ov.__version__}")
17
+ import scanpy as sc
18
+ #print(f"scanpy version: {sc.__version__}")
19
+ import pandas as pd
20
+ from tqdm.auto import tqdm
21
+ ov.plot_set()
22
+
23
+
24
+ # # Data loading and processing
25
+ # Here, we use the mouse hematopoiesis data provided by [Nestorowa et al. (2016, Blood).](https://doi.org/10.1182/blood-2016-05-716480)
26
+ #
27
+ # **The scRNA-seq data requires processing to extract lineage information for the CEFCON analysis.** Please refer to the [original notebook](https://github.com/WPZgithub/CEFCON/blob/e74d2d248b88fb3349023d1a97d3cc8a52cc4060/notebooks/preprocessing_nestorowa16_data.ipynb) for detailed instructions on preprocessing scRNA-seq data.
28
+
29
+ # In[2]:
30
+
31
+
32
+ adata = ov.single.mouse_hsc_nestorowa16()
33
+ adata
34
+
35
+
36
+ # CEFCON fully exploits an available global and **context-free gene interaction network** as prior knowledge, from which we extract the cell-lineage-specific gene interactions according to the gene expression profiles derived from scRNA-seq data associated with a given developmental trajectory.
37
+ #
38
+ # You can download the prior network from [zenodo](https://zenodo.org/records/8013900). **CEFCON only provides prior networks for human and mouse data analysis**. For other species, you should provide the prior network manually.
39
+ #
40
+ # The authors of CEFCON provide several prior networks there; of these, 'nichenet' yields the best results.
41
+
42
+ # In[3]:
43
+
44
+
45
+ prior_network = ov.single.load_human_prior_interaction_network(dataset='nichenet')
46
+
47
+
48
+ # **When analysing human scRNA-seq data, do not run the following step. Running it may change the gene symbols and result in errors.**
49
+ #
50
+ #
51
+ #
52
+ #
53
+
54
+ # In[4]:
55
+
56
+
57
+ # Convert the gene symbols of the prior gene interaction network to the mouse gene symbols
58
+ prior_network = ov.single.convert_human_to_mouse_network(prior_network,server_name='asia')
59
+ prior_network
60
+
61
+
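+ # The symbol conversion above can be pictured with a minimal sketch. It assumes the prior network is an edge list of (source, target) gene pairs and uses a small, hypothetical ortholog table (`HUMAN_TO_MOUSE`). It is only an illustration of the idea, not how `convert_human_to_mouse_network` is implemented.

```python
# Hypothetical ortholog table for illustration only; a real conversion
# would look orthologs up in a database rather than in a hard-coded dict.
HUMAN_TO_MOUSE = {"GATA1": "Gata1", "TAL1": "Tal1", "KLF1": "Klf1"}


def convert_edges(edges, mapping):
    """Map both endpoints of each edge; drop edges with an unmapped gene."""
    converted = []
    for source, target in edges:
        if source in mapping and target in mapping:
            converted.append((mapping[source], mapping[target]))
    return converted


human_edges = [("GATA1", "KLF1"), ("TAL1", "GATA1"), ("GATA1", "UNMAPPED")]
mouse_edges = convert_edges(human_edges, HUMAN_TO_MOUSE)
print(mouse_edges)
```

+ # In this sketch, edges containing a gene without an ortholog are dropped; whether the real function does the same is an assumption.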
62
+ # In[12]:
63
+
64
+
65
+ prior_network.to_csv('result/combined_network_Mouse.txt.gz',sep='\t')
66
+
67
+
68
+ # Alternatively, you can directly specify the file path of the input prior interaction network and import the specified file.
69
+
70
+ # In[3]:
71
+
72
+
73
+ #prior_network = './Reference_Networks/combined_network_Mouse.txt'
74
+ prior_network=ov.read('result/combined_network_Mouse.txt.gz',index_col=0)
75
+
76
+
77
+ # # Training CEFCON model
78
+ #
79
+ # We recommend using GUROBI to solve the integer linear programming (ILP) problem when identifying driver genes. GUROBI is a commercial solver that requires a license to run. Thankfully, it provides free licenses for academic use, as well as trial licenses outside academia. Once the license is in place, you need to install the `gurobipy` package.
80
+ #
81
+ # If difficulties arise while using GUROBI, the non-commercial solver SCIP is employed as an alternative. However, SCIP is not guaranteed to find a solution.
82
+ #
83
+ # **By default, the program verifies the availability of GUROBI. If GUROBI is not accessible, it automatically switches the solver to SCIP.**
84
+ #
85
+
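+ # The fallback described above can be sketched as follows. This is a minimal illustration, assuming that GUROBI availability can be approximated by checking whether the `gurobipy` package is importable. `choose_solver` is a hypothetical helper, not part of omicverse or CEFCON, and it does not check for a valid license.

```python
import importlib.util


def choose_solver(preferred="GUROBI"):
    """Return 'GUROBI' only if gurobipy can be found, otherwise 'SCIP'."""
    if preferred == "GUROBI" and importlib.util.find_spec("gurobipy") is not None:
        return "GUROBI"
    # Fall back to the non-commercial solver.
    return "SCIP"


solver = choose_solver()
print(f"Selected solver: {solver}")
```

+ # The returned string could then be passed as the `solver` argument when constructing the CEFCON object.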
86
+ # In[4]:
87
+
88
+
89
+ CEFCON_obj = ov.single.pyCEFCON(adata, prior_network, repeats=5, solver='GUROBI')
90
+ CEFCON_obj
91
+
92
+
93
+ # Construct cell-lineage-specific GRNs
94
+
95
+ # In[5]:
96
+
97
+
98
+ CEFCON_obj.preprocess()
99
+
100
+
101
+ # Lineage-by-lineage computation:
102
+
103
+ # In[6]:
104
+
105
+
106
+ CEFCON_obj.train()
107
+
108
+
109
+ # In[9]:
110
+
111
+
112
+ # Identify driver regulators for each lineage
113
+ CEFCON_obj.predicted_driver_regulators()
114
+
115
+
116
+ # We can find out the driver regulators identified by CEFCON.
117
+
118
+ # In[10]:
119
+
120
+
121
+ CEFCON_obj.cefcon_results_dict['E_pseudotime'].driver_regulator.head()
122
+
123
+
124
+ # In[11]:
125
+
126
+
127
+ CEFCON_obj.predicted_RGM()
128
+
129
+
130
+ # # Downstream analysis
131
+
132
+ # In[12]:
133
+
134
+
135
+ CEFCON_obj.cefcon_results_dict['E_pseudotime']
136
+
137
+
138
+ # In[13]:
139
+
140
+
141
+ lineage = 'E_pseudotime'
142
+ result = CEFCON_obj.cefcon_results_dict[lineage]
143
+
144
+
145
+ # Plot gene embedding clusters
146
+
147
+ # In[20]:
148
+
149
+
150
+ gene_ad=sc.AnnData(result.gene_embedding)
151
+ sc.pp.neighbors(gene_ad, n_neighbors=30, use_rep='X')
152
+ # Higher resolutions lead to more communities, while lower resolutions lead to fewer communities.
153
+ sc.tl.leiden(gene_ad, resolution=1)
154
+ sc.tl.umap(gene_ad, n_components=2, min_dist=0.3)
155
+
156
+
157
+ # In[27]:
158
+
159
+
160
+ ov.utils.embedding(gene_ad,basis='X_umap',legend_loc='on data',
161
+ legend_fontsize=8, legend_fontoutline=2,
162
+ color='leiden',frameon='small',title='Leiden clustering using CEFCON\nderived gene embeddings')
163
+
164
+
165
+ # Plot influence scores of driver regulators
166
+
167
+ # In[40]:
168
+
169
+
170
+ import matplotlib.pyplot as plt
171
+ import seaborn as sns
172
+ data_for_plot = result.driver_regulator[result.driver_regulator['is_driver_regulator']]
173
+ data_for_plot = data_for_plot[0:20]
174
+
175
+ plt.figure(figsize=(2, 20 * 0.2))
176
+ sns.set_theme(style='ticks', font_scale=0.5)
177
+
178
+ ax = sns.barplot(x='influence_score', y=data_for_plot.index, data=data_for_plot, orient='h',
179
+ palette=sns.color_palette(f"ch:start=.5,rot=-.5,reverse=1,dark=0.4", n_colors=20))
180
+ ax.set_title(result.name)
181
+ ax.set_xlabel('Influence score')
182
+ ax.set_ylabel('Driver regulators')
183
+
184
+ ax.spines['left'].set_position(('outward', 10))
185
+ ax.spines['bottom'].set_position(('outward', 10))
186
+ plt.xticks(fontsize=12)
187
+ plt.yticks(fontsize=12)
188
+
189
+ plt.grid(False)
190
+ # Set the visibility of the spines
191
+ ax.spines['top'].set_visible(False)
192
+ ax.spines['right'].set_visible(False)
193
+ ax.spines['bottom'].set_visible(True)
194
+ ax.spines['left'].set_visible(True)
195
+
196
+ plt.title('E_pseudotime',fontsize=12)
197
+ plt.xlabel('Influence score',fontsize=12)
198
+ plt.ylabel('Driver regulon',fontsize=12)
199
+
200
+ sns.despine()
201
+
202
+
203
+ # In[41]:
204
+
205
+
206
+ result.plot_driver_genes_Venn()
207
+
208
+
209
+ # Plot heat map of the activity matrix of RGMs
210
+
211
+ # In[42]:
212
+
213
+
214
+ adata_lineage = adata[adata.obs_names[adata.obs[result.name].notna()],:]
215
+
216
+ result.plot_RGM_activity_heatmap(cell_label=adata_lineage.obs['cell_type_finely'],
217
+ type='out',col_cluster=True,bbox_to_anchor=(1.48, 0.25))
218
+
ovrawm/t_cellfate_gene.txt ADDED
@@ -0,0 +1,468 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Timing-associated genes analysis with cellfategenie
5
+ #
6
+ # In single-cell analysis, we often study the underlying temporal state of each cell, which we call pseudotime. Identifying the genes associated with pseudotime is key to unravelling models of dynamic gene regulation. Traditional analyses use either correlation coefficients or gene dynamics model fitting. The correlation coefficient approach is biased towards genes that change at the beginning and end of the time series, and the gene dynamics model requires RNA velocity information. Identifying pseudotime-associated genes without this bias, and without additional dependency information, remains a challenge in current pseudotime analyses.
7
+ #
8
+ # Here, we developed CellFateGenie, which first removes potential noise from the data through metacells, and then constructs an adaptive ridge regression model to find the minimum set of genes needed to fit the pseudotime. CellFateGenie has similar accuracy to gene dynamics models while eliminating the preference for the start and end of the time series.
9
+ #
10
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1Q1Sk5lGCBGBWS5Bs2kncAq9ZbjaDzSR4?usp=sharing
11
+
12
+ # In[1]:
13
+
14
+
15
+ import omicverse as ov
16
+ import scvelo as scv
17
+ import matplotlib.pyplot as plt
18
+ ov.ov_plot_set()
19
+
20
+
21
+ # ## Data preprocessing
22
+ #
23
+ # We use the dentate gyrus dataset from scvelo to demonstrate the timing-associated genes analysis. First, we use `ov.pp.qc` and `ov.pp.preprocess` to preprocess the dataset.
24
+ #
25
+ # Then we use `ov.pp.scale` and `ov.pp.pca` to compute the principal components of the data.
26
+
27
+ # In[18]:
28
+
29
+
30
+ adata = scv.datasets.dentategyrus()
31
+ adata
32
+
33
+
34
+ # In[19]:
35
+
36
+
37
+ adata=ov.pp.qc(adata,
38
+ tresh={'mito_perc': 0.15, 'nUMIs': 500, 'detected_genes': 250},
39
+ )
40
+
41
+
42
+ # In[20]:
43
+
44
+
45
+ ov.utils.store_layers(adata,layers='counts')
46
+ adata
47
+
48
+
49
+ # In[21]:
50
+
51
+
52
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',
53
+ n_HVGs=2000)
54
+
55
+
56
+ # In[22]:
57
+
58
+
59
+ adata.raw = adata
60
+ adata = adata[:, adata.var.highly_variable_features]
61
+ adata
62
+
63
+
64
+ # In[23]:
65
+
66
+
67
+ ov.pp.scale(adata)
68
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
69
+
70
+ adata.obsm["X_mde_pca"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
71
+
72
+
73
+ # In[24]:
74
+
75
+
76
+ adata=adata.raw.to_adata()
77
+
78
+
79
+ # In[25]:
80
+
81
+
82
+ fig, ax = plt.subplots(figsize=(3,3))
83
+ ov.utils.embedding(adata,
84
+ basis='X_mde_pca',frameon='small',
85
+ color=['clusters'],show=False,ax=ax)
86
+
87
+
88
+ # ## Metacell calculation
89
+ #
90
+ # To reduce the noise in the raw dataset and improve the accuracy of the regression model, we use `SEACells` to compute metacells.
91
+
92
+ # In[451]:
93
+
94
+
95
+ import SEACells
96
+ adata=adata[adata.obs['clusters']!='Endothelial']
97
+ model = SEACells.core.SEACells(adata,
98
+ build_kernel_on='scaled|original|X_pca',
99
+ n_SEACells=200,
100
+ n_waypoint_eigs=10,
101
+ convergence_epsilon = 1e-5)
102
+
103
+
104
+ # In[452]:
105
+
106
+
107
+ model.construct_kernel_matrix()
108
+ M = model.kernel_matrix
109
+ # Initialize archetypes
110
+ model.initialize_archetypes()
111
+
112
+
113
+ # In[453]:
114
+
115
+
116
+ model.fit(min_iter=10, max_iter=50)
117
+
118
+
119
+ # The model may stop early; we can use `model.step` to force it to run additional iterations. Usually, about 100 iterations are enough to obtain good metacells.
120
+
121
+ # In[454]:
122
+
123
+
124
+ # Check for convergence
125
+ get_ipython().run_line_magic('matplotlib', 'inline')
126
+ model.plot_convergence()
127
+
128
+
129
+ # In[489]:
130
+
131
+
132
+ # You can force the model to run additional iterations step-wise using the .step() function
133
+ print(f'Run for {len(model.RSS_iters)} iterations')
134
+ for _ in range(10):
135
+ model.step()
136
+ print(f'Run for {len(model.RSS_iters)} iterations')
137
+
138
+
139
+ # In[490]:
140
+
141
+
142
+ # Check for convergence
143
+ get_ipython().run_line_magic('matplotlib', 'inline')
144
+ model.plot_convergence()
145
+
146
+
147
+ # In[491]:
148
+
149
+
150
+ get_ipython().run_line_magic('matplotlib', 'inline')
151
+ SEACells.plot.plot_2D(adata, key='X_mde_pca', colour_metacells=False,
152
+ figsize=(4,4),cell_size=20,title='Dentategyrus Metacells',
153
+ )
154
+
155
+
156
+ # Note that the shape of the raw AnnData is not consistent with that of the HVG AnnData, so we reset `adata.raw`.
157
+
158
+ # In[492]:
159
+
160
+
161
+ adata.raw=adata.copy()
162
+
163
+
164
+ # And we use `SEACells.core.summarize_by_soft_SEACell` to get the normalized metacells
165
+
166
+ # In[493]:
167
+
168
+
169
+ SEACell_soft_ad = SEACells.core.summarize_by_soft_SEACell(adata, model.A_,
170
+ celltype_label='clusters',
171
+ summarize_layer='raw', minimum_weight=0.05)
172
+ SEACell_soft_ad
173
+
174
+
175
+ # We visualized the metacells with PCA and UMAP
176
+
177
+ # In[494]:
178
+
179
+
180
+ import scanpy as sc
181
+ SEACell_soft_ad.raw=SEACell_soft_ad.copy()
182
+ sc.pp.highly_variable_genes(SEACell_soft_ad, n_top_genes=2000, inplace=True)
183
+ SEACell_soft_ad=SEACell_soft_ad[:,SEACell_soft_ad.var.highly_variable]
184
+
185
+
186
+ # In[495]:
187
+
188
+
189
+ ov.pp.scale(SEACell_soft_ad)
190
+ ov.pp.pca(SEACell_soft_ad,layer='scaled',n_pcs=50)
191
+ sc.pp.neighbors(SEACell_soft_ad, use_rep='scaled|original|X_pca')
192
+ sc.tl.umap(SEACell_soft_ad)
193
+
194
+
195
+ # We can also reuse the cluster colours of the original AnnData object.
196
+
197
+ # In[496]:
198
+
199
+
200
+ SEACell_soft_ad.obs['celltype']=SEACell_soft_ad.obs['celltype'].astype('category')
201
+ SEACell_soft_ad.obs['celltype']=SEACell_soft_ad.obs['celltype'].cat.reorder_categories(adata.obs['clusters'].cat.categories)
202
+ SEACell_soft_ad.uns['celltype_colors']=adata.uns['clusters_colors']
203
+
204
+
205
+ # In[15]:
206
+
207
+
208
+ import matplotlib.pyplot as plt
209
+ fig, ax = plt.subplots(figsize=(3,3))
210
+ ov.utils.embedding(SEACell_soft_ad,
211
+ basis='X_umap',
212
+ color=["celltype"],
213
+ title='Meta Celltype',
214
+ frameon='small',
215
+ legend_fontsize=12,
216
+ #palette=ov.utils.palette()[11:],
217
+ ax=ax,
218
+ show=False)
219
+
220
+
221
+ # ## Pseudotime calculation
222
+ #
223
+ # Accurately calculating pseudotime on metacells is another challenge. Here we use pyVIA to calculate the pseudotime. Since the metacell dataset has only 200 cells, the default parameters of pyVIA may not give a proper pseudotime result, so we adjust the relevant parameters manually.
224
+ #
225
+ # We need to set `jac_std_global`, `too_big_factor` and `knn` manually. If you know the cells of origin, setting `root_user` is helpful too.
226
+
227
+ # In[ ]:
228
+
229
+
230
+ v0 = ov.single.pyVIA(adata=SEACell_soft_ad,adata_key='scaled|original|X_pca',
231
+ adata_ncomps=50, basis='X_umap',
232
+ clusters='celltype',knn=10, root_user=['nIPC','Neuroblast'],
233
+ dataset='group',
234
+ random_seed=112,is_coarse=True,
235
+ preserve_disconnected=True,
236
+ piegraph_arrow_head_width=0.05,piegraph_edgeweight_scalingfactor=2.5,
237
+ gene_matrix=SEACell_soft_ad.X,velo_weight=0.5,
238
+ edgebundle_pruning_twice=False, edgebundle_pruning=0.15,
239
+ jac_std_global=0.05,too_big_factor=0.05,
240
+ cluster_graph_pruning_std=1,
241
+ time_series=False,
242
+ )
243
+
244
+ v0.run()
245
+
246
+
247
+ # In[500]:
248
+
249
+
250
+ v0.get_pseudotime(SEACell_soft_ad)
251
+
252
+
253
+ # In[17]:
254
+
255
+
256
+ #v0.get_pseudotime(SEACell_soft_ad)
257
+ import matplotlib.pyplot as plt
258
+ fig, ax = plt.subplots(figsize=(3,3))
259
+ ov.utils.embedding(SEACell_soft_ad,
260
+ basis='X_umap',
261
+ color=["pt_via"],
262
+ title='Pseudotime',
263
+ frameon='small',
264
+ cmap='Reds',
265
+ #size=40,
266
+ legend_fontsize=12,
267
+ #palette=ov.utils.palette()[11:],
268
+ ax=ax,
269
+ show=False)
270
+
271
+
272
+ # Now we save the metacell results for downstream analysis.
273
+
274
+ # In[502]:
275
+
276
+
277
+ SEACell_soft_ad.write_h5ad('data/tutorial_meta_den.h5ad',compression='gzip')
278
+
279
+
280
+ # In[2]:
281
+
282
+
283
+ SEACell_soft_ad=ov.utils.read('data/tutorial_meta_den.h5ad')
284
+
285
+
286
+ # ## Timing-associated genes analysis
287
+ #
288
+ # We have encapsulated the CellFateGenie algorithm in omicverse, so the analysis can be run directly from omicverse.
289
+
290
+ # In[3]:
291
+
292
+
293
+ cfg_obj=ov.single.cellfategenie(SEACell_soft_ad,pseudotime='pt_via')
294
+ cfg_obj.model_init()
295
+
296
+
297
+ # We use Adaptive Threshold Regression (ATR) to find the smallest set of genes that achieves the same accuracy as the regression model built on all genes.
298
+
299
+ # In[4]:
300
+
301
+
302
+ cfg_obj.ATR(stop=500,flux=0.01)
303
+
304
+
305
+ # In[5]:
306
+
307
+
308
+ fig,ax=cfg_obj.plot_filtering(color='#5ca8dc')
309
+ ax.set_title('Dentategyrus Metacells\nCellFateGenie')
310
+
311
+
312
+ # In[6]:
313
+
314
+
315
+ res=cfg_obj.model_fit()
316
+
317
+
318
+ # ## Visualization
319
+ #
320
+ # We provide a series of functions to visualize the result. We can use `plot_color_fitting` to observe how the different cell types transition along the pseudotime.
321
+
322
+ # In[7]:
323
+
324
+
325
+ cfg_obj.plot_color_fitting(type='raw',cluster_key='celltype')
326
+
327
+
328
+ # In[8]:
329
+
330
+
331
+ cfg_obj.plot_color_fitting(type='filter',cluster_key='celltype')
332
+
333
+
334
+ # ## Kendall's tau test
335
+ #
336
+ # We can further narrow down the gene set selected by the regression coefficients. We use Kendall's tau test to calculate the trend significance of each gene.
337
+
338
+ # In[9]:
339
+
340
+
341
+ kt_filter=cfg_obj.kendalltau_filter()
342
+ kt_filter.head()
343
+
344
+
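+ # To make the trend test concrete, here is a minimal pure-Python sketch of Kendall's tau (the tau-a variant, which counts tied pairs as neither concordant nor discordant). The toy vectors and the `kendall_tau` helper are illustrative assumptions; `kendalltau_filter` may use a different variant, and it also reports p-values, which this sketch omits.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


pseudotime = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
rising_gene = [1.0, 1.5, 1.4, 2.2, 2.9, 3.5]  # mostly increases with pseudotime
flat_gene = [2.0, 1.0, 2.5, 1.2, 2.4, 1.1]    # no clear trend

tau_rising = kendall_tau(pseudotime, rising_gene)
tau_flat = kendall_tau(pseudotime, flat_gene)
print(tau_rising, tau_flat)
```

+ # A gene whose expression rises (or falls) consistently along pseudotime gets a tau close to 1 (or -1), while a gene without a trend stays near 0.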
345
+ # In[10]:
346
+
347
+
348
+ var_name=kt_filter.loc[kt_filter['pvalue']<kt_filter['pvalue'].mean()].index.tolist()
349
+ gt_obj=ov.single.gene_trends(SEACell_soft_ad,'pt_via',var_name)
350
+ gt_obj.calculate(n_convolve=10)
351
+
352
+
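+ # The `n_convolve` argument suggests that expression is smoothed along pseudotime before the trend is drawn. A minimal sketch of that idea, assuming a plain moving average over cells ordered by pseudotime. The helper `moving_average` and the toy data are hypothetical, and the exact smoothing used by `gene_trends` may differ.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (no edge padding)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# (pseudotime, expression) pairs for one gene, in arbitrary cell order.
cells = [(0.9, 4.0), (0.1, 1.0), (0.5, 2.0), (0.3, 3.0)]

# Order the expression values by pseudotime, then smooth them.
ordered = [expression for _, expression in sorted(cells)]
trend = moving_average(ordered, window=2)
print(trend)
```

+ # A larger window gives a smoother but coarser trend; with only 200 metacells, a window of about 10 keeps most of the temporal detail.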
353
+ # In[11]:
354
+
355
+
356
+ print(f"Dimension: {len(var_name)}")
357
+
358
+
359
+ # In[12]:
360
+
361
+
362
+ fig,ax=gt_obj.plot_trend(color=ov.utils.blue_color[3])
363
+ ax.set_title(f'Dentategyrus meta\nCellfategenie',fontsize=13)
364
+
365
+
366
+ # In[14]:
367
+
368
+
369
+ g=ov.utils.plot_heatmap(SEACell_soft_ad,var_names=var_name,
370
+ sortby='pt_via',col_color='celltype',
371
+ n_convolve=10,figsize=(1,6),show=False)
372
+
373
+ g.fig.set_size_inches(2, 6)
374
+ g.fig.suptitle('CellFateGenie',x=0.25,y=0.83,
375
+ horizontalalignment='left',fontsize=12,fontweight='bold')
376
+ g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(),fontsize=12)
377
+ plt.show()
378
+
379
+
380
+ # ## Fate Genes
381
+ #
382
+ # Unlike traditional pseudotime analyses, CellFateGenie can also identify key genes/gene sets during fate transitions.
383
+
384
+ # In[26]:
385
+
386
+
387
+ gt_obj.cal_border_cell(SEACell_soft_ad,'pt_via','celltype')
388
+
389
+
390
+ # In[27]:
391
+
392
+
393
+ bordgene_dict=gt_obj.get_multi_border_gene(SEACell_soft_ad,'celltype',
394
+ threshold=0.5)
395
+
396
+
397
+ # We use `Granule immature` and `Granule mature` to calculate the fate-related genes.
398
+
399
+ # In[30]:
400
+
401
+
402
+ gt_obj.get_border_gene(SEACell_soft_ad,'celltype','Granule immature','Granule mature',
403
+ threshold=0.5)
404
+
405
+
406
+ # In comparison to the `get_border_gene` function, the `get_special_border_gene` function serves the purpose of extracting exclusive fate information from two specific cell types. However, it operates with a higher degree of stringency.
407
+
408
+ # In[33]:
409
+
410
+
411
+ gt_obj.get_special_border_gene(SEACell_soft_ad,'celltype','Granule immature','Granule mature')
412
+
413
+
414
+ # We can visualize these genes.
415
+
416
+ # In[36]:
417
+
418
+
419
+ import matplotlib.pyplot as plt
420
+ g=ov.utils.plot_heatmap(SEACell_soft_ad,
421
+ var_names=gt_obj.get_border_gene(SEACell_soft_ad,'celltype','Granule immature','Granule mature'),
422
+ sortby='pt_via',col_color='celltype',yticklabels=True,
423
+ n_convolve=10,figsize=(1,6),show=False)
424
+
425
+ g.fig.set_size_inches(2, 4)
426
+ g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(),fontsize=12)
427
+ plt.show()
428
+
429
+
430
+ # Similarly, we can use `get_special_kernel_gene` or `get_kernel_gene` to obtain cell-type-specific genes.
431
+
432
+ # In[37]:
433
+
434
+
435
+ gt_obj.get_special_kernel_gene(SEACell_soft_ad,'celltype','Granule immature')
436
+
437
+
438
+ # In[42]:
439
+
440
+
441
+ gt_obj.get_kernel_gene(SEACell_soft_ad,
442
+ 'celltype','Granule immature',
443
+ threshold=0.3,
444
+ num_gene=10)
445
+
446
+
447
+ # In[43]:
448
+
449
+
450
+ import matplotlib.pyplot as plt
451
+ g=ov.utils.plot_heatmap(SEACell_soft_ad,
452
+ var_names=gt_obj.get_kernel_gene(SEACell_soft_ad,
453
+ 'celltype','Granule immature',
454
+ threshold=0.3,
455
+ num_gene=10),
456
+ sortby='pt_via',col_color='celltype',yticklabels=True,
457
+ n_convolve=10,figsize=(1,6),show=False)
458
+
459
+ g.fig.set_size_inches(2, 4)
460
+ g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(),fontsize=12)
461
+ plt.show()
462
+
463
+
464
+ # In[ ]:
465
+
466
+
467
+
468
+
ovrawm/t_cellfate_genesets.txt ADDED
@@ -0,0 +1,181 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Timing-associated geneset analysis with cellfategenie
5
+ #
6
+ # In single-cell analysis, we often study the underlying temporal state of each cell, which we call pseudotime. Identifying the genes associated with pseudotime is key to unravelling models of dynamic gene regulation. Traditional analyses use either correlation coefficients or gene dynamics model fitting. The correlation coefficient approach is biased towards genes that change at the beginning and end of the time series, and the gene dynamics model requires RNA velocity information. Identifying pseudotime-associated genes without this bias, and without additional dependency information, remains a challenge in current pseudotime analyses.
7
+ #
8
+ # Here, we developed CellFateGenie, which first removes potential noise from the data through metacells, and then constructs an adaptive ridge regression model to find the minimum set of genes needed to fit the pseudotime. CellFateGenie has similar accuracy to gene dynamics models while eliminating the preference for the start and end of the time series.
9
+ #
10
+ # We also provide AUCell to score the gene sets in `adata`.
11
+ #
12
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1upcKKZHsZMS78eOliwRAddbaZ9ICXSrc?usp=sharing
13
+
14
+ # In[ ]:
15
+
16
+
17
+ import omicverse as ov
18
+ import scvelo as scv
19
+ import matplotlib.pyplot as plt
20
+ ov.ov_plot_set()
21
+
22
+
23
+ # ## Data preprocessing
24
+ #
25
+ # We use the dentate gyrus dataset from scvelo to demonstrate the timing-associated genes analysis. First, we use `ov.pp.qc` and `ov.pp.preprocess` to preprocess the dataset.
26
+ #
27
+ # Then we use `ov.pp.scale` and `ov.pp.pca` to compute the principal components of the data.
28
+
29
+ # In[2]:
30
+
31
+
32
+ adata=ov.read('data/tutorial_meta_den.h5ad')
33
+ adata=adata.raw.to_adata()
34
+ adata
35
+
36
+
37
+ # ## Gene set evaluation
38
+
39
+ # In[3]:
40
+
41
+
42
+ import omicverse as ov
43
+ pathway_dict=ov.utils.geneset_prepare('../placenta/genesets/GO_Biological_Process_2021.txt',organism='Mouse')
44
+ len(pathway_dict.keys())
45
+
46
+
47
+ # In[ ]:
48
+
49
+
50
+ # Assess all pathways
51
+ adata_aucs=ov.single.pathway_aucell_enrichment(adata,
52
+ pathways_dict=pathway_dict,
53
+ num_workers=8)
54
+
55
+
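+ # To give an intuition for what an AUCell-style score measures, here is a simplified sketch for a single cell. It ranks genes by expression, walks down the top-ranked fraction, and computes the area under the recovery curve of the gene set, normalised by the best achievable area. The helper `aucell_score`, the toy expression values and the `top_fraction` default are assumptions for illustration; the implementation behind `pathway_aucell_enrichment` handles ties, thresholds and normalisation in its own way.

```python
def aucell_score(expression, gene_set, top_fraction=0.05):
    """Normalised area under the recovery curve for one cell."""
    # Rank genes from highest to lowest expression.
    ranked = sorted(expression, key=expression.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))

    hits = 0
    area = 0
    for gene in ranked[:cutoff]:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank

    # Best case: all gene-set members sit at the very top of the ranking.
    max_hits = min(len(gene_set), cutoff)
    max_area = sum(min(rank, max_hits) for rank in range(1, cutoff + 1))
    return area / max_area if max_area else 0.0


# Toy cell: g0 is the most expressed gene, g9 the least.
cell = {f"g{i}": 10 - i for i in range(10)}

score_top = aucell_score(cell, {"g0", "g1"}, top_fraction=0.5)
score_bottom = aucell_score(cell, {"g8", "g9"}, top_fraction=0.5)
score_mixed = aucell_score(cell, {"g4", "g9"}, top_fraction=0.5)
print(score_top, score_bottom, score_mixed)
```

+ # Gene sets whose members are among the most highly expressed genes of a cell score close to 1, and gene sets found only at the bottom of the ranking score 0.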
56
+ # In[11]:
57
+
58
+
59
+ adata_aucs.obs=adata[adata_aucs.obs.index].obs
60
+ adata_aucs.obsm=adata[adata_aucs.obs.index].obsm
61
+ adata_aucs.obsp=adata[adata_aucs.obs.index].obsp
62
+ adata_aucs.uns=adata[adata_aucs.obs.index].uns
63
+
64
+ adata_aucs
65
+
66
+
67
+ # ## Timing-associated genes analysis
68
+ #
69
+ # We have encapsulated the CellFateGenie algorithm in omicverse, so the analysis can be run directly from omicverse.
70
+
71
+ # In[12]:
72
+
73
+
74
+ cfg_obj=ov.single.cellfategenie(adata_aucs,pseudotime='pt_via')
75
+ cfg_obj.model_init()
76
+
77
+
78
+ # We use Adaptive Threshold Regression (ATR) to find the smallest set of features (here, gene sets) that achieves the same accuracy as the regression model built on all of them.
79
+
80
+ # In[13]:
81
+
82
+
83
+ cfg_obj.ATR(stop=500)
84
+
85
+
86
+ # In[14]:
87
+
88
+
89
+ fig,ax=cfg_obj.plot_filtering(color='#5ca8dc')
90
+ ax.set_title('Dentategyrus Metacells\nCellFateGenie')
91
+
92
+
93
+ # In[15]:
94
+
95
+
96
+ res=cfg_obj.model_fit()
97
+
98
+
99
+ # ## Visualization
100
+ #
101
+ # We provide a series of functions to visualize the result. We can use `plot_color_fitting` to observe how the different cell types transition along the pseudotime.
102
+
103
+ # In[16]:
104
+
105
+
106
+ cfg_obj.plot_color_fitting(type='raw',cluster_key='celltype')
107
+
108
+
109
+ # In[17]:
110
+
111
+
112
+ cfg_obj.plot_color_fitting(type='filter',cluster_key='celltype')
113
+
114
+
115
+ # ## Kendall's tau test
116
+ #
117
+ # We can further narrow down the gene sets selected by the regression coefficients. We use Kendall's tau test to calculate the trend significance of each gene set.
118
+
119
+ # In[18]:
120
+
121
+
122
+ kt_filter=cfg_obj.kendalltau_filter()
123
+ kt_filter.head()
124
+
125
+
126
+ # In[21]:
127
+
128
+
129
+ var_name=kt_filter.loc[kt_filter['pvalue']<kt_filter['pvalue'].mean()].index.tolist()
130
+ gt_obj=ov.single.gene_trends(adata_aucs,'pt_via',var_name)
131
+ gt_obj.calculate(n_convolve=10)
132
+
133
+
134
+ # In[22]:
135
+
136
+
137
+ print(f"Dimension: {len(var_name)}")
138
+
139
+
140
+ # In[23]:
141
+
142
+
143
+ fig,ax=gt_obj.plot_trend(color=ov.utils.blue_color[3])
144
+ ax.set_title(f'Dentategyrus meta\nCellfategenie',fontsize=13)
145
+
146
+
147
+ # In[25]:
148
+
149
+
150
+ g=ov.utils.plot_heatmap(adata_aucs,var_names=var_name,
151
+ sortby='pt_via',col_color='celltype',
152
+ n_convolve=10,figsize=(1,6),show=False)
153
+
154
+ g.fig.set_size_inches(2, 6)
155
+ g.fig.suptitle('CellFateGenie',x=0.25,y=0.83,
156
+ horizontalalignment='left',fontsize=12,fontweight='bold')
157
+ g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(),fontsize=12)
158
+ plt.show()
159
+
160
+
161
+ # In[32]:
162
+
163
+
164
+ gw_obj1=ov.utils.geneset_wordcloud(adata=adata_aucs[:,var_name],
165
+ cluster_key='celltype',pseudotime='pt_via',figsize=(3,6))
166
+ gw_obj1.get()
167
+
168
+
169
+ # In[33]:
170
+
171
+
172
+ g=gw_obj1.plot_heatmap(figwidth=6,cmap='RdBu_r')
173
+ plt.suptitle('CellFateGenie',x=0.18,y=0.95,
174
+ horizontalalignment='left',fontsize=12,fontweight='bold')
175
+
176
+
177
+ # In[ ]:
178
+
179
+
180
+
181
+
ovrawm/t_cellphonedb.txt ADDED
@@ -0,0 +1,439 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Cell interaction with CellPhoneDB
5
+ #
6
+ # CellPhoneDB is a publicly available repository of curated receptors, ligands and their interactions in HUMAN. CellPhoneDB can be used to search for a particular ligand/receptor, or interrogate your own single-cell transcriptomics data.
7
+ #
8
+ # We made three improvements in integrating the CellPhoneDB algorithm in OmicVerse:
9
+ #
10
+ # - We added a tutorial on running the analysis starting from any `anndata` object.
11
+ # - We added prettier heatmaps, chord diagrams and network diagrams for visualising relationships between cells.
12
+ # - We added visualisation of ligand receptor proteins in different groups
13
+ #
14
+ # Paper: [Single-cell reconstruction of the early maternal–fetal interface in humans](https://www.nature.com/articles/s41586-018-0698-6)
15
+ #
16
+ # Code: https://github.com/ventolab/CellphoneDB
17
+ #
18
+ # This notebook will demonstrate how to use CellPhoneDB on scRNA data and visualize it.
19
+
20
+ # In[1]:
21
+
22
+
23
+ import scanpy as sc
24
+ import matplotlib.pyplot as plt
25
+ import pandas as pd
26
+ import numpy as np
27
+ import omicverse as ov
28
+ import os
29
+
30
+ ov.plot_set()
31
+ #print(f'cellphonedb version{cellphonedb.__version__}')
32
+
33
+
34
+ # ## The EVT Data
35
+ #
36
+ # The EVT data have already been annotated with cell types; they can be downloaded from the CellPhoneDB tutorial.
37
+ #
38
+ # Download: https://github.com/ventolab/CellphoneDB/blob/master/notebooks/data_tutorial.zip
39
+ #
40
+
41
+ # In[2]:
42
+
43
+
44
+ adata=sc.read('data/cpdb/normalised_log_counts.h5ad')
45
+ adata=adata[adata.obs['cell_labels'].isin(['eEVT','iEVT','EVT_1','EVT_2','DC','dNK1','dNK2','dNK3',
46
+ 'VCT','VCT_CCC','VCT_fusing','VCT_p','GC','SCT'])]
47
+ adata
48
+
49
+
50
+ # In[3]:
51
+
52
+
53
+ ov.pl.embedding(adata,
54
+ basis='X_umap',
55
+ color='cell_labels',
56
+ frameon='small',
57
+ palette=ov.pl.red_color+ov.pl.blue_color+ov.pl.green_color+ov.pl.orange_color+ov.pl.purple_color)
58
+
59
+
60
+ # In[4]:
61
+
62
+
63
+ adata.X.max()
64
+
65
+
66
+ # We can clearly see that the maximum value of the data is a floating-point number less than 10. A non-integer maximum indicates that the data have been normalised, and a maximum below 10 indicates that they have been log-transformed. Note that the input data must not be `scaled`.
67
+
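+ # The reasoning above can be turned into a quick sanity check. This is a heuristic sketch only: the helper `looks_log_normalised` and its `max_allowed=10.0` cutoff are assumptions taken from the rule of thumb in the text, not a CellPhoneDB or omicverse function.

```python
def looks_log_normalised(values, max_allowed=10.0):
    """Heuristic: non-negative, not all integers, and a small maximum."""
    values = list(values)
    # Raw counts are integers, so at least one fractional value is expected.
    has_fraction = any(float(v) != int(v) for v in values)
    # Scaled (z-scored) data contain negative values.
    non_negative = min(values) >= 0
    return non_negative and has_fraction and max(values) < max_allowed


log_normalised = [0.0, 1.39, 2.71, 6.9]
raw_counts = [0, 3, 120, 7]
scaled = [-1.2, 0.3, 2.1]

print(looks_log_normalised(log_normalised))
print(looks_log_normalised(raw_counts))
print(looks_log_normalised(scaled))
```

+ # For a real `adata`, the values of `adata.X` would need to be flattened first; the check cannot prove that data are log-normalised, it only flags inputs that clearly are not.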
68
+ # ## Export the anndata object
69
+ #
70
+ # As the input to CellPhoneDB only requires the expression matrix and cell type, we extracted only the expression matrix and cell type from adata for the next step of analysis
71
+
72
+ # In[5]:
73
+
74
+
75
+ sc.pp.filter_cells(adata, min_genes=200)
76
+ sc.pp.filter_genes(adata, min_cells=3)
77
+ adata1=sc.AnnData(adata.X,obs=pd.DataFrame(index=adata.obs.index),
78
+ var=pd.DataFrame(index=adata.var.index))
79
+ adata1.write_h5ad('data/cpdb/norm_log.h5ad',compression='gzip')
80
+ adata1
81
+
82
+
83
+ # ## Export the meta info of cells
84
+ #
85
+ # We construct a `DataFrame` object to export the meta info of cells. In the EVT adata object, the cell types are stored in `obs['cell_labels']`.
86
+
87
+ # In[6]:
88
+
89
+
90
+ # Export the metadata
91
+ df_meta = pd.DataFrame(data={'Cell':list(adata[adata1.obs.index].obs.index),
92
+ 'cell_type':[ i for i in adata[adata1.obs.index].obs['cell_labels']]
93
+ })
94
+ df_meta.set_index('Cell', inplace=True)
95
+ df_meta.to_csv('data/cpdb/meta.tsv', sep = '\t')
96
+
97
+
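+ # For reference, the two-column layout written to `meta.tsv` can also be produced with the standard library alone. A minimal sketch, assuming the same `Cell` and `cell_type` column names as above; the barcodes and the `write_meta` helper are made up for illustration.

```python
import csv
import io


def write_meta(barcodes, labels, handle):
    """Write one 'Cell<TAB>cell_type' row per barcode."""
    if len(barcodes) != len(labels):
        raise ValueError("barcodes and labels must have the same length")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Cell", "cell_type"])
    writer.writerows(zip(barcodes, labels))


# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_meta(["AAAC-1", "AAAG-1"], ["eEVT", "dNK1"], buffer)
meta_text = buffer.getvalue()
print(meta_text)
```

+ # Every barcode in the metadata must also appear in the expression matrix, which is why the tutorial builds both from the same filtered `adata1.obs.index`.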
98
+ # ## Cell interaction analysis
99
+ #
100
+ # Now that we have prepared the cell meta info `meta.tsv` and the scRNA-seq matrix `norm_log.h5ad`, we can use CellPhoneDB to calculate the interactions between the cell types in the scRNA-seq data.
101
+ #
102
+ # Importantly, to avoid a series of bugs, we use absolute paths for the CellPhoneDB analysis. We use `os.getcwd()` to get the current working directory.
103
+
104
+ # In[7]:
105
+
106
+
107
+ import os
108
+ os.getcwd()
109
+
110
+
111
+ # Another thing to note is that we need to download the `cellphonedb.zip` file from `cellphonedb-data` for further analysis. I have placed it in the `data/CPDB` directory, but you can place it in any path you are interested in
112
+ #
113
+ # Downloads: https://github.com/ventolab/cellphonedb-data
114
+
115
+ # In[8]:
116
+
117
+
118
+ cpdb_file_path = '/Users/fernandozeng/Desktop/analysis/cellphonedb-data/cellphonedb.zip'
119
+ meta_file_path = os.getcwd()+'/data/cpdb/meta.tsv'
120
+ counts_file_path = os.getcwd()+'/data/cpdb/norm_log.h5ad'
121
+ microenvs_file_path = None
122
+ active_tf_path = None
123
+ out_path =os.getcwd()+'/data/cpdb/test_cellphone'
124
+
125
+
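+ # The same absolute paths can be built with `pathlib`, which avoids manual string concatenation. A minimal sketch, assuming the `data/cpdb` layout used above and that the current working directory is the project root.

```python
from pathlib import Path

# Path.cwd() is always absolute, so every path derived from it is too.
base_dir = Path.cwd() / "data" / "cpdb"

meta_file_path = str(base_dir / "meta.tsv")
counts_file_path = str(base_dir / "norm_log.h5ad")
out_path = str(base_dir / "test_cellphone")

print(meta_file_path)
print(counts_file_path)
print(out_path)
```

+ # The paths are converted back to strings on the assumption that the downstream call expects plain string paths, as in the cell above.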
126
+ # Now we run `cpdb_statistical_analysis_method` to predict the cell-cell interactions in the scRNA-seq data.
127
+
128
+ # In[9]:
129
+
130
+
131
+ from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
132
+
133
+ cpdb_results = cpdb_statistical_analysis_method.call(
134
+ cpdb_file_path = cpdb_file_path, # mandatory: CellphoneDB database zip file.
135
+ meta_file_path = meta_file_path, # mandatory: tsv file defining barcodes to cell label.
136
+ counts_file_path = counts_file_path, # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
137
+ counts_data = 'hgnc_symbol', # defines the gene annotation in counts matrix.
138
+ active_tfs_file_path = active_tf_path, # optional: defines cell types and their active TFs.
139
+ microenvs_file_path = microenvs_file_path, # optional (default: None): defines cells per microenvironment.
140
+ score_interactions = True, # optional: whether to score interactions or not.
141
+ iterations = 1000, # denotes the number of shufflings performed in the analysis.
142
+ threshold = 0.1, # defines the min % of cells expressing a gene for this to be employed in the analysis.
143
+ threads = 10, # number of threads to use in the analysis.
144
+ debug_seed = 42, # debug random seed; use a negative value (e.g. -1) to disable it.
145
+ result_precision = 3, # Sets the rounding for the mean values in significant_means.
146
+ pvalue = 0.05, # P-value threshold to employ for significance.
147
+ subsampling = False, # To enable subsampling the data (geometric sketching).
148
+ subsampling_log = False, # (mandatory) enable subsampling log1p for non log-transformed data inputs.
149
+ subsampling_num_pc = 100, # Number of components to subsample via geometric sketching (default: 100).
150
+ subsampling_num_cells = 1000, # Number of cells to subsample (integer) (default: 1/3 of the dataset).
151
+ separator = '|', # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
152
+ debug = False, # Saves all intermediate tables employed during the analysis in pkl format.
153
+ output_path = out_path, # Path to save results.
154
+ output_suffix = None # Replaces the timestamp in the output file names with a user-defined string (default: None).
155
+ )
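The `iterations` argument above controls how many label shufflings are used to build a null distribution. The idea can be sketched in plain Python as follows. This is a simplified illustration of a label-permutation test under stated assumptions (one ligand, one receptor, one sender/receiver pair, and the interaction mean taken as the average of the two group means); it is not CellPhoneDB's implementation, and `permutation_pvalue` is a hypothetical name.

```python
import random


def permutation_pvalue(ligand, receptor, labels, sender, receiver,
                       iterations=1000, seed=42):
    """Estimate how often shuffled cell labels give a mean at least as high as observed."""

    def interaction_mean(current_labels):
        lig = [x for x, lab in zip(ligand, current_labels) if lab == sender]
        rec = [x for x, lab in zip(receptor, current_labels) if lab == receiver]
        return (sum(lig) / len(lig) + sum(rec) / len(rec)) / 2

    rng = random.Random(seed)
    observed = interaction_mean(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break the link between cells and cell types
        if interaction_mean(shuffled) >= observed:
            hits += 1
    return observed, hits / iterations


# Toy data: the ligand is expressed only in type A, the receptor only in type B.
labels = ["A"] * 10 + ["B"] * 10
ligand = [5.0] * 10 + [0.0] * 10
receptor = [0.0] * 10 + [4.0] * 10
observed, pval = permutation_pvalue(ligand, receptor, labels, "A", "B")
print(observed, pval)
```

More iterations give a finer-grained p-value at the cost of run time, which is why the real analysis is run with multiple threads.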
156
+
157
+
158
+ # In[10]:
159
+
160
+
161
+ ov.utils.save(cpdb_results,'data/cpdb/gex_cpdb_test.pkl')
162
+
163
+
164
+ # In[5]:
165
+
166
+
167
+ cpdb_results=ov.utils.load('data/cpdb/gex_cpdb_test.pkl')
168
+
169
+
170
+ # ## Network of celltype analysis
171
+ #
172
+ # Note that we use ov for all downstream analysis, starting with the cell network analysis. We provide the `ov.single.cpdb_network_cal` function to extract the interactions, and the `ov.single.cpdb_plot_network` function for visualization.
173
+
174
+ # In[6]:
175
+
176
+
177
+ interaction=ov.single.cpdb_network_cal(adata = adata,
178
+ pvals = cpdb_results['pvalues'],
179
+ celltype_key = "cell_labels",)
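Conceptually, an edge table like `interaction['interaction_edges']` can be obtained by counting, for each ordered pair of cell types, how many interactions are significant. The sketch below shows that counting step on plain dictionaries. It is an assumption about the general idea, not omicverse's code: the `COUNT` column name, the 0.05 cutoff and the function name are all hypothetical, while `SOURCE` and `TARGET` match the columns used later in this tutorial.

```python
from collections import Counter


def count_significant_edges(pvalue_rows, separator="|", alpha=0.05):
    """Count significant interactions per 'source|target' column (illustrative only)."""
    counts = Counter()
    for row in pvalue_rows:
        for column, p in row.items():
            if separator not in column:
                continue  # skip annotation columns
            if p < alpha:
                source, target = column.split(separator, 1)
                counts[(source, target)] += 1
    return [{"SOURCE": s, "TARGET": t, "COUNT": n}
            for (s, t), n in sorted(counts.items())]


# Each row stands for one ligand-receptor interaction.
rows = [
    {"dNK1|EVT_1": 0.01, "EVT_1|dNK1": 0.40},
    {"dNK1|EVT_1": 0.03, "EVT_1|dNK1": 0.02},
]
edges = count_significant_edges(rows)
print(edges)
```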
180
+
181
+
182
+ # In[7]:
183
+
184
+
185
+ interaction['interaction_edges'].head()
186
+
187
+
188
+ # In[8]:
189
+
190
+
191
+ ov.plot_set()
192
+
193
+
194
+ # In[9]:
195
+
196
+
197
+ fig, ax = plt.subplots(figsize=(4,4))
198
+ ov.pl.cpdb_heatmap(adata,interaction['interaction_edges'],celltype_key='cell_labels',
199
+ fontsize=11,
200
+ ax=ax,legend_kws={'fontsize':12,'bbox_to_anchor':(5, -0.9),'loc':'center left',})
201
+
202
+
203
+ # In[10]:
204
+
205
+
206
+ fig, ax = plt.subplots(figsize=(2,4))
207
+ ov.pl.cpdb_heatmap(adata,interaction['interaction_edges'],celltype_key='cell_labels',
208
+ source_cells=['EVT_1','EVT_2','dNK1','dNK2','dNK3'],
209
+ ax=ax,legend_kws={'fontsize':12,'bbox_to_anchor':(5, -0.9),'loc':'center left',})
210
+
211
+
212
+ # In[11]:
213
+
214
+
215
+ fig=ov.pl.cpdb_chord(adata,interaction['interaction_edges'],celltype_key='cell_labels',
216
+ count_min=60,fontsize=12,padding=50,radius=100,save=None,)
217
+ fig.show()
218
+
219
+
220
+ # In[12]:
221
+
222
+
223
+ fig, ax = plt.subplots(figsize=(4,4))
224
+ ov.pl.cpdb_network(adata,interaction['interaction_edges'],celltype_key='cell_labels',
225
+ counts_min=60,
226
+ nodesize_scale=5,
227
+ ax=ax)
228
+
229
+
230
+ # In[13]:
231
+
232
+
233
+ fig, ax = plt.subplots(figsize=(4,4))
234
+ ov.pl.cpdb_network(adata,interaction['interaction_edges'],celltype_key='cell_labels',
235
+ counts_min=60,
236
+ nodesize_scale=5,
237
+ source_cells=['EVT_1','EVT_2','dNK1','dNK2','dNK3'],
238
+ ax=ax)
239
+
240
+
241
+ # In[14]:
242
+
243
+
244
+ fig, ax = plt.subplots(figsize=(4,4))
245
+ ov.pl.cpdb_network(adata,interaction['interaction_edges'],celltype_key='cell_labels',
246
+ counts_min=60,
247
+ nodesize_scale=5,
248
+ target_cells=['EVT_1','EVT_2','dNK1','dNK2','dNK3'],
249
+ ax=ax)
250
+
251
+
252
+ # In[15]:
253
+
254
+
255
+ ov.single.cpdb_plot_network(adata=adata,
256
+ interaction_edges=interaction['interaction_edges'],
257
+ celltype_key='cell_labels',
258
+ nodecolor_dict=None,title='EVT Network',
259
+ edgeswidth_scale=25,nodesize_scale=10,
260
+ pos_scale=1,pos_size=10,figsize=(6,6),
261
+ legend_ncol=3,legend_bbox=(0.8,0.2),legend_fontsize=10)
262
+
263
+
264
+ # Sometimes we do not want to analyse the whole network, and a sub-network is more useful. We can extract the sub-network from the full network.
265
+ #
266
+ # We need to extract the sub-interactions first. Here we assume that we are interested in the five cell types `['EVT_1','EVT_2','dNK1','dNK2','dNK3']`.
267
+
268
+ # In[16]:
269
+
270
+
271
+ sub_i=interaction['interaction_edges']
272
+ sub_i=sub_i.loc[sub_i['SOURCE'].isin(['EVT_1','EVT_2','dNK1','dNK2','dNK3'])]
273
+ sub_i=sub_i.loc[sub_i['TARGET'].isin(['EVT_1','EVT_2','dNK1','dNK2','dNK3'])]
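The two `.isin` filters above keep an edge only when both its source and its target belong to the chosen cell types. The same logic on a plain list of records looks like the sketch below; the toy edges and the `DC1` label are made up for illustration.

```python
def subset_edges(edges, cell_types):
    """Keep edges whose SOURCE and TARGET are both in cell_types."""
    keep = set(cell_types)
    return [e for e in edges if e["SOURCE"] in keep and e["TARGET"] in keep]


edges = [
    {"SOURCE": "EVT_1", "TARGET": "dNK1", "COUNT": 12},
    {"SOURCE": "EVT_1", "TARGET": "DC1", "COUNT": 7},   # target outside the subset
    {"SOURCE": "DC1", "TARGET": "dNK2", "COUNT": 3},    # source outside the subset
]
sub = subset_edges(edges, ["EVT_1", "EVT_2", "dNK1", "dNK2", "dNK3"])
print(sub)
```

Filtering on only one of the two columns would instead keep every edge that touches the subset, including edges to cell types that are later dropped from the AnnData object.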
274
+
275
+
276
+ # Then, we extract the corresponding sub-AnnData object
277
+
278
+ # In[17]:
279
+
280
+
281
+ sub_adata=adata[adata.obs['cell_labels'].isin(['EVT_1','EVT_2','dNK1','dNK2','dNK3'])]
282
+ sub_adata
283
+
284
+
285
+ # Now we plot the sub-interaction network between the cells in scRNA-seq
286
+
287
+ # In[18]:
288
+
289
+
290
+ ov.single.cpdb_plot_network(adata=sub_adata,
291
+ interaction_edges=sub_i,
292
+ celltype_key='cell_labels',
293
+ nodecolor_dict=None,title='Sub-EVT Network',
294
+ edgeswidth_scale=25,nodesize_scale=1,
295
+ pos_scale=1,pos_size=10,figsize=(5,5),
296
+ legend_ncol=3,legend_bbox=(0.8,0.2),legend_fontsize=10)
297
+
298
+
299
+ # In[19]:
300
+
301
+
302
+ fig=ov.pl.cpdb_chord(sub_adata,sub_i,celltype_key='cell_labels',
303
+ count_min=10,fontsize=12,padding=60,radius=100,save=None,)
304
+ fig.show()
305
+
306
+
307
+ # In[20]:
308
+
309
+
310
+ fig, ax = plt.subplots(figsize=(4,4))
311
+ ov.pl.cpdb_network(sub_adata,sub_i,celltype_key='cell_labels',
312
+ counts_min=10,
313
+ nodesize_scale=5,
314
+ ax=ax)
315
+
316
+
317
+ # In[21]:
318
+
319
+
320
+ fig, ax = plt.subplots(figsize=(3,3))
321
+ ov.pl.cpdb_heatmap(sub_adata,sub_i,celltype_key='cell_labels',
322
+ ax=ax,legend_kws={'fontsize':12,'bbox_to_anchor':(5, -0.9),'loc':'center left',})
323
+
324
+
325
+ # ## Extracting the ligand-receptor pairs
326
+ #
327
+ # We can set EVT as the ligand or the receptor side to extract the ligand-receptor proteins from the CellPhoneDB results.
328
+
329
+ #
330
+ #
331
+ # The most important step is to extract the results for the cell types of interest. Here we use ov's functions `ov.single.cpdb_exact_target` and `ov.single.cpdb_exact_source` to keep the interactions with EVT (`eEVT`, `iEVT`) as the target and dNK cells as the source.
332
+
333
+ # In[22]:
334
+
335
+
336
+ sub_means=ov.single.cpdb_exact_target(cpdb_results['means'],['eEVT','iEVT'])
337
+ sub_means=ov.single.cpdb_exact_source(sub_means,['dNK1','dNK2','dNK3'])
338
+ sub_means.head()
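CellPhoneDB result tables name each cell-type pair column as `cellA|cellB` (see the `separator` argument above), so selecting by source and target amounts to filtering column names. The sketch below shows one way to do that. It assumes the first label is the source and the second the target, which is an assumption about the convention rather than something confirmed here, and `select_pair_columns` is a hypothetical helper, not the omicverse implementation.

```python
def select_pair_columns(columns, sources=None, targets=None, separator="|"):
    """Keep annotation columns plus the pair columns matching the given sources/targets."""
    selected = []
    for col in columns:
        if separator not in col:
            selected.append(col)  # annotation column such as gene_a or gene_b
            continue
        source, target = col.split(separator, 1)
        if sources is not None and source not in sources:
            continue
        if targets is not None and target not in targets:
            continue
        selected.append(col)
    return selected


columns = ["gene_a", "gene_b", "dNK1|eEVT", "eEVT|dNK1", "dNK2|iEVT", "dNK1|dNK2"]
kept = select_pair_columns(columns,
                           sources={"dNK1", "dNK2", "dNK3"},
                           targets={"eEVT", "iEVT"})
print(kept)
```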
339
+
340
+
341
+ # In[23]:
342
+
343
+
344
+ ov.pl.cpdb_interacting_heatmap(adata=adata,
345
+ celltype_key='cell_labels',
346
+ means=cpdb_results['means'],
347
+ pvalues=cpdb_results['pvalues'],
348
+ source_cells=['dNK1','dNK2','dNK3'],
349
+ target_cells=['eEVT','iEVT'],
350
+ plot_secret=True,
351
+ min_means=3,
352
+ nodecolor_dict=None,
353
+ ax=None,
354
+ figsize=(2,6),
355
+ fontsize=10,)
356
+
357
+
358
+ # Sometimes we care about the expression of the ligand in SOURCE and of the receptor in TARGET. We provide another function to show these expression levels.
359
+
360
+ # In[24]:
361
+
362
+
363
+ ov.pl.cpdb_group_heatmap(adata=adata,
364
+ celltype_key='cell_labels',
365
+ means=cpdb_results['means'],
366
+ cmap={'Target':'Blues','Source':'Reds'},
367
+ source_cells=['dNK1','dNK2','dNK3'],
368
+ target_cells=['eEVT','iEVT'],
369
+ plot_secret=True,
370
+ min_means=3,
371
+ nodecolor_dict=None,
372
+ ax=None,
373
+ figsize=(2,6),
374
+ fontsize=10,)
375
+
376
+
377
+ # We can also build Ligand, Receptor, SOURCE, and TARGET into a regulatory network, which is interesting.
378
+
379
+ # In[25]:
380
+
381
+
382
+ ov.pl.cpdb_interacting_network(adata=adata,
383
+ celltype_key='cell_labels',
384
+ means=cpdb_results['means'],
385
+ source_cells=['dNK1','dNK2','dNK3'],
386
+ target_cells=['eEVT','iEVT'],
387
+ means_min=1,
388
+ means_sum_min=1,
389
+ nodecolor_dict=None,
390
+ ax=None,
391
+ figsize=(6,6),
392
+ fontsize=10)
393
+
394
+
395
+ # Sometimes we want to analyse the pathway enrichment or function of the ligand-receptor pairs, so we need to extract the ligand-receptor pairs from the significant interactions filtered above. omicverse provides the function `ov.single.cpdb_interaction_filtered` for this.
396
+
397
+ # In[40]:
398
+
399
+
400
+ sub_means=sub_means.loc[~sub_means['gene_a'].isnull()]
401
+ sub_means=sub_means.loc[~sub_means['gene_b'].isnull()]
402
+ enrichr_genes=sub_means['gene_a'].tolist()+sub_means['gene_b'].tolist()
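Concatenating `gene_a` and `gene_b` can leave the same gene in the list several times, because one gene may take part in many interactions. If a duplicate-free list is preferred for enrichment, a small order-preserving clean-up like the one below could be applied. `unique_genes` is a hypothetical helper and the gene symbols are only example values.

```python
def unique_genes(gene_a, gene_b):
    """Merge two gene lists, dropping missing values and duplicates, keeping first-seen order."""
    seen = set()
    ordered = []
    for gene in list(gene_a) + list(gene_b):
        if gene is None or gene == "" or gene != gene:  # gene != gene is True only for NaN
            continue
        if gene not in seen:
            seen.add(gene)
            ordered.append(gene)
    return ordered


genes = unique_genes(["HLA-G", None, "CD44"], ["KIR2DL4", "CD44", float("nan")])
print(genes)
```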
403
+
404
+
405
+ # You can find a tutorial on enrichment analysis in the Bulk chapter of the tutorials:
406
+ #
407
+ # https://omicverse.readthedocs.io/en/latest/Tutorials-bulk/t_deg/ or https://starlitnightly.github.io/omicverse/Tutorials-bulk/t_deg/
408
+
409
+ # In[ ]:
410
+
411
+
412
+ pathway_dict=ov.utils.geneset_prepare('genesets/GO_Biological_Process_2023.txt',organism='Human')
413
+
414
+
415
+ # In[14]:
416
+
417
+
418
+ #deg_genes=dds.result.loc[dds.result['sig']!='normal'].index.tolist()
419
+ enr=ov.bulk.geneset_enrichment(gene_list=enrichr_genes,
420
+ pathways_dict=pathway_dict,
421
+ pvalue_type='auto',
422
+ organism='human')
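Gene set enrichment of this kind is commonly based on an over-representation test. As a sketch of that general idea (not a description of what `ov.bulk.geneset_enrichment` computes internally, which is not shown here), the hypergeometric tail probability can be written with the standard library. All argument names are illustrative.

```python
from math import comb


def hypergeom_pvalue(overlap, gene_list_size, pathway_size, background_size):
    """P(X >= overlap) when gene_list_size genes are drawn without replacement."""
    upper = min(gene_list_size, pathway_size)
    total = comb(background_size, gene_list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, gene_list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# 3 of 5 query genes fall in a 10-gene pathway, out of 100 background genes.
p = hypergeom_pvalue(overlap=3, gene_list_size=5, pathway_size=10, background_size=100)
print(p)
```

In practice the p-values are also corrected for multiple testing across pathways, which this sketch leaves out.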
423
+
424
+
425
+ # In[20]:
426
+
427
+
428
+ ov.plot_set()
429
+ ov.bulk.geneset_plot(enr,figsize=(2,4),fig_title='GO-Bio(EVT)',
430
+ cax_loc=[2, 0.45, 0.5, 0.02],num=8,
431
+ bbox_to_anchor_used=(-0.25, -13),custom_ticks=[10,100],
432
+ cmap='Greens')
433
+
434
+
435
+ # In[ ]:
436
+
437
+
438
+
439
+
ovrawm/t_cluster.txt ADDED
@@ -0,0 +1,312 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Clustering space
5
+ #
6
+ # In this tutorial, we will explore how to run the Supervised clustering, unsupervised clustering, and amortized Latent Dirichlet Allocation (LDA) model implementation in `omicverse` with `GaussianMixture`,`Leiden/Louvain` and `MiRA`.
7
+ #
8
+ # In scRNA-seq data analysis, we describe the cellular structure of our dataset by finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we group cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem.
9
+ #
10
+ # LDA is a topic modelling method first introduced in the natural language processing field. By treating each cell as a document and each gene expression count as a word, we can carry over the method to the single-cell biology field.
11
+ #
12
+ # Below, we will train the model over a dataset, plot the topics over a UMAP of the reference set, and inspect the topics for characteristic gene sets.
13
+ #
14
+ # ## Preprocess data
15
+ #
16
+ # As an example, we use the dentate gyrus neurogenesis dataset, which comprises multiple heterogeneous subpopulations.
17
+ #
18
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1d_szq-y-g7O0C5rJgK22XC7uWTUNrYpK?usp=sharing
19
+
20
+ # In[1]:
21
+
22
+
23
+ import omicverse as ov
24
+ import scanpy as sc
25
+ import scvelo as scv
26
+ ov.plot_set()
27
+
28
+
29
+ # In[2]:
30
+
31
+
32
+ import scvelo as scv
33
+ adata=scv.datasets.dentategyrus()
34
+ adata
35
+
36
+
37
+ # In[3]:
38
+
39
+
40
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,)
41
+ adata.raw = adata
42
+ adata = adata[:, adata.var.highly_variable_features]
43
+ ov.pp.scale(adata)
44
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
45
+
46
+
47
+ # Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells. In our experience, often a rough estimate of the number of PCs does fine.
48
+
49
+ # In[4]:
50
+
51
+
52
+ ov.utils.plot_pca_variance_ratio(adata)
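Besides reading the plot by eye, a rough estimate of the number of PCs can be taken from the cumulative explained variance. The sketch below assumes you already have the per-PC variance ratios as a plain list; the 0.9 target and the function name are arbitrary choices for illustration.

```python
def n_pcs_for_variance(variance_ratio, target=0.9):
    """Return the smallest number of leading PCs whose cumulative ratio reaches target."""
    cumulative = 0.0
    for i, ratio in enumerate(variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return i
    return len(variance_ratio)  # target never reached: use all available PCs


ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125]
n = n_pcs_for_variance(ratios, target=0.9)
print(n)
```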
53
+
54
+
55
+ # ## Unsupervised clustering
56
+ #
57
+ # The Leiden algorithm is an improved version of the Louvain algorithm, which outperformed other clustering methods for single-cell RNA-seq data analysis ([Du et al., 2018, Freytag et al., 2018, Weber and Robinson, 2016]). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred.
58
+ #
59
+ # We, therefore, propose to use the Leiden algorithm [Traag et al., 2019] on single-cell k-nearest-neighbour (KNN) graphs to cluster single-cell datasets.
60
+ #
61
+ # Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset.
62
+ #
63
+ # Here, we set `method='leiden'` to cluster the cells using `Leiden`
64
+ #
65
+
66
+ # In[5]:
67
+
68
+
69
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
70
+ use_rep='scaled|original|X_pca')
71
+ ov.utils.cluster(adata,method='leiden',resolution=1)
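The statement that Leiden compares the number of links inside a cluster with the number expected overall is the idea behind modularity. The sketch below computes plain modularity for a tiny unweighted graph. It is a simplified illustration under that assumption: the quality function actually optimized by the clustering call above may differ, for example through the `resolution` parameter.

```python
def modularity(edges, community):
    """Modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ] for an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # number of edges with both ends in community c
    degree_sum = {}  # total degree of the nodes in community c
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single bridge edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph at the bridge scores higher than putting every node in one cluster, which is the behaviour the prose describes.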
72
+
73
+
74
+ # In[6]:
75
+
76
+
77
+ ov.utils.embedding(adata,basis='X_umap',
78
+ color=['clusters','leiden'],
79
+ frameon='small',wspace=0.5)
80
+
81
+
82
+ # We can also set `method='louvain'` to cluster the cells using `Louvain`
83
+
84
+ # In[7]:
85
+
86
+
87
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
88
+ use_rep='scaled|original|X_pca')
89
+ ov.utils.cluster(adata,method='louvain',resolution=1)
90
+
91
+
92
+ # In[8]:
93
+
94
+
95
+ ov.utils.embedding(adata,basis='X_umap',
96
+ color=['clusters','louvain'],
97
+ frameon='small',wspace=0.5)
98
+
99
+
100
+ # ## Supervised clustering
101
+ #
102
+ # In addition to the graph-based unsupervised methods above, we can also try model-based clustering with a Gaussian mixture model (GMM). Note that a GMM does not use cell labels either; unlike Leiden or Louvain, however, it requires the number of clusters to be specified in advance.
103
+ #
104
+ # Gaussian mixture models can be used to cluster unlabeled data in much the same way as k-means. There are, however, a couple of advantages to using Gaussian mixture models over k-means.
105
+ #
106
+ # First and foremost, k-means does not account for variance. By variance, we are referring to the width of the bell shape curve.
107
+ #
108
+ # The second difference between k-means and Gaussian mixture models is that the former performs hard classification whereas the latter performs soft classification. In other words, k-means tells us which data point belongs to which cluster but won’t provide us with the probabilities that a given data point belongs to each of the possible clusters.
109
+ #
110
+ # Here, we set `method='GMM'` to cluster the cells using `GaussianMixture`. `n_components` is the number of mixture components (i.e. clusters), and `covariance_type` sets the covariance structure of the Gaussian mixture model (`full`, `tied`, `diag` and `spherical` covariance matrices are supported). More arguments can be found in [`sklearn.mixture.GaussianMixture`](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)
111
+ #
112
+
113
+ # In[9]:
114
+
115
+
116
+ ov.utils.cluster(adata,use_rep='scaled|original|X_pca',
117
+ method='GMM',n_components=21,
118
+ covariance_type='full',tol=1e-9, max_iter=1000, )
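To make the hard versus soft distinction concrete, the sketch below computes the posterior membership probabilities (responsibilities) of one point under a one-dimensional, two-component mixture with fixed parameters. The parameters are made-up values, and this is only the assignment step, not the fitting procedure.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component for the point x."""
    dens = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(dens)
    return [d / total for d in dens]


# A point exactly between two equally weighted components is split evenly.
soft = responsibilities(1.0, weights=[0.5, 0.5], means=[0.0, 2.0], stds=[1.0, 1.0])
# A point close to the first component is assigned to it with high probability.
near = responsibilities(0.2, weights=[0.5, 0.5], means=[0.0, 2.0], stds=[1.0, 1.0])
hard = near.index(max(near))  # the hard label is just the most probable component
print(soft, near, hard)
```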
119
+
120
+
121
+ # In[10]:
122
+
123
+
124
+ ov.utils.embedding(adata,basis='X_umap',
125
+ color=['clusters','gmm_cluster'],
126
+ frameon='small',wspace=0.5)
127
+
128
+
129
+ # ## Latent Dirichlet Allocation (LDA) model implementation
130
+ #
131
+ # Topic models, like Latent Dirichlet Allocation (LDA), have traditionally been used to decompose a corpus of text into topics - or themes - composed of words that often appear together in documents. Documents, in turn, are modeled as a mixture of topics based on the words they contain.
132
+ #
133
+ # MIRA extends these ideas to single-cell genomics data, where topics are groups of genes that are co-expressed or cis-regulatory elements that are co-accessible, and cells are a mixture of these regulatory modules.
134
+ #
135
+ # Here, we used `ov.utils.LDA_topic` to construct the model of MiRA.
136
+ #
137
+ # Particularly, and at a minimum, we must tell the model
138
+ #
139
+ # - feature_type: what type of features we are working with (either “expression” or “accessibility”)
140
+ # - highly_variable_key: which .var key to find our highly variable genes
141
+ # - counts_layer: which layer to get the raw counts from.
142
+ # - categorical_covariates, continuous_covariates: Technical variables influencing the generative process of the data. For example, a categorical technical factor may be the cells’ batch of origin, as shown here. A continuous technical factor might be the % of mitochondrial reads. For unbatched data, ignore these parameters.
143
+ # - learning_rate: for larger datasets, the default of 1e-3, 0.1 usually works well.
144
+
145
+ # In[11]:
146
+
147
+
148
+ LDA_obj=ov.utils.LDA_topic(adata,feature_type='expression',
149
+ highly_variable_key='highly_variable_features',
150
+ layers='counts',batch_key=None,learning_rate=1e-3)
151
+
152
+
153
+ # This method works by instantiating a special version of the CODAL model with far too many topics, which are gradually pruned if a topic is not needed to describe the data. The function returns the maximum contribution of each topic to any cell in the dataset. The predicted number of topics is given by the elbow of the maximum contribution curve, minus 1. A rule of thumb is that the last valid topic to include in the model is followed by a drop-off, after which all subsequent topics hover between 0 and 0.05 maximum contribution.
154
+ #
155
+ # We set `NUM_TOPICS` to six to try.
156
+
157
+ # In[12]:
158
+
159
+
160
+ LDA_obj.plot_topic_contributions(6)
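The rule of thumb for reading the contribution plot can be approximated numerically: topics whose maximum contribution stays near zero are treated as pruned. The sketch below simply counts the topics above a small floor. The 0.05 floor and the input list are illustrative assumptions, and the result is a starting point to check against the plot rather than a replacement for it.

```python
def count_supported_topics(max_contributions, floor=0.05):
    """Count topics whose maximum contribution to any cell exceeds floor."""
    return sum(1 for value in max_contributions if value > floor)


# Hypothetical maximum contributions for six candidate topics.
contributions = [0.9, 0.8, 0.6, 0.4, 0.03, 0.01]
n_topics = count_supported_topics(contributions)
print(n_topics)
```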
161
+
162
+
163
+ # We can observe that 13 topics are above the threshold line, so we set the new NUM_TOPIC to 13
164
+
165
+ # In[13]:
166
+
167
+
168
+ LDA_obj.predicted(13)
169
+
170
+
171
+ # One can plot the distribution of topics across cells to see how the latent space reflects changes in cell state:
172
+
173
+ # In[14]:
174
+
175
+
176
+ ov.plot_set()
177
+ ov.utils.embedding(adata, basis='X_umap',color = LDA_obj.model.topic_cols, cmap='BuPu', ncols=4,
178
+ add_outline=True, frameon='small',)
179
+
180
+
181
+ # In[15]:
182
+
183
+
184
+ ov.utils.embedding(adata,basis='X_umap',
185
+ color=['clusters','LDA_cluster'],
186
+ frameon='small',wspace=0.5)
187
+
188
+
189
+ # Here we propose another way to assign clusters. We use the cells with an LDA topic weight greater than 0.4 as a primitive class and train a random forest classifier on them. We then use this classifier to classify the cells with an LDA topic weight below the threshold, to get a more accurate clustering.
190
+
191
+ # In[20]:
192
+
193
+
194
+ LDA_obj.get_results_rfc(adata,use_rep='scaled|original|X_pca',
195
+ LDA_threshold=0.4,num_topics=13)
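The two-stage idea (label the confident cells from their dominant topic, then let a classifier trained on them label the rest) can be sketched without any dependencies. To keep the example self-contained, a nearest-centroid rule is swapped in for the random forest, so this shows the workflow only and not the classifier that `get_results_rfc` uses. All names and data are hypothetical.

```python
def refine_with_centroids(embedding, topic_weights, threshold=0.4):
    """Label confident cells by dominant topic, then assign the rest to the nearest centroid."""
    # 1) Confident cells: the dominant topic weight exceeds the threshold.
    labels = []
    for weights in topic_weights:
        best = max(range(len(weights)), key=weights.__getitem__)
        labels.append(best if weights[best] > threshold else None)

    # 2) "Train": one centroid per label, computed from the confident cells only.
    sums, counts = {}, {}
    for vec, lab in zip(embedding, labels):
        if lab is None:
            continue
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

    # 3) "Predict": give each remaining cell the label of its closest centroid.
    def nearest(vec):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])))

    return [lab if lab is not None else nearest(vec)
            for vec, lab in zip(embedding, labels)]


embedding = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1], [0.4, 0.3]]
topic_weights = [[0.9, 0.1, 0.0], [0.8, 0.1, 0.1], [0.1, 0.9, 0.0],
                 [0.1, 0.8, 0.1], [0.35, 0.33, 0.32]]  # last cell is ambiguous
refined = refine_with_centroids(embedding, topic_weights)
print(refined)
```

The ambiguous last cell has no topic above the threshold, so it inherits the label of the confident cells it sits closest to in the embedding.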
196
+
197
+
198
+ # In[21]:
199
+
200
+
201
+ ov.utils.embedding(adata,basis='X_umap',
202
+ color=['LDA_cluster_rfc','LDA_cluster_clf'],
203
+ frameon='small',wspace=0.5)
204
+
205
+
206
+ # ## cNMF
207
+ #
208
+ # More details can be found at https://starlitnightly.github.io/omicverse/Tutorials-single/t_cnmf/
209
+
210
+ # In[32]:
211
+
212
+
213
+ adata.X.toarray()
214
+
215
+
216
+ # In[ ]:
217
+
218
+
219
+ import numpy as np
220
+ ## Initialize the cnmf object that will be used to run analyses
221
+ cnmf_obj = ov.single.cNMF(adata,components=np.arange(5,11), n_iter=20, seed=14, num_highvar_genes=2000,
222
+ output_dir='example_dg1/cNMF', name='dg_cNMF')
223
+ ## Launch worker 0; `total_workers` sets how many workers the factorization jobs are split across (4 here)
224
+ cnmf_obj.factorize(worker_i=0, total_workers=4)
225
+ cnmf_obj.combine(skip_missing_files=True)
226
+ cnmf_obj.k_selection_plot(close_fig=False)
227
+
228
+
229
+ # In[35]:
230
+
231
+
232
+ selected_K = 7
233
+ density_threshold = 2.00
234
+ cnmf_obj.consensus(k=selected_K,
235
+ density_threshold=density_threshold,
236
+ show_clustering=True,
237
+ close_clustergram_fig=False)
238
+ result_dict = cnmf_obj.load_results(K=selected_K, density_threshold=density_threshold)
239
+ cnmf_obj.get_results(adata,result_dict)
240
+
241
+
242
+ # In[36]:
243
+
244
+
245
+ ov.pl.embedding(adata, basis='X_umap',color=result_dict['usage_norm'].columns,
246
+ use_raw=False, ncols=3, vmin=0, vmax=1,frameon='small')
247
+
248
+
249
+ # In[40]:
250
+
251
+
252
+ cnmf_obj.get_results_rfc(adata,result_dict,
253
+ use_rep='scaled|original|X_pca',
254
+ cNMF_threshold=0.5)
255
+
256
+
257
+ # In[41]:
258
+
259
+
260
+ ov.pl.embedding(
261
+ adata,
262
+ basis="X_umap",
263
+ color=['cNMF_cluster_rfc','cNMF_cluster_clf'],
264
+ frameon='small',
265
+ #title="Celltypes",
266
+ #legend_loc='on data',
267
+ legend_fontsize=14,
268
+ legend_fontoutline=2,
269
+ #size=10,
270
+ #legend_loc=True,
271
+ add_outline=False,
272
+ #add_outline=True,
273
+ outline_color='black',
274
+ outline_width=1,
275
+ show=False,
276
+ )
277
+
278
+
279
+ # ## Evaluating the clustering space
280
+ #
281
+ # Rand index adjusted for chance. The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.
282
+
283
+ # In[42]:
284
+
285
+
286
+ from sklearn.metrics.cluster import adjusted_rand_score
287
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])
288
+ print('Leiden, Adjusted rand index = %.2f' %ARI)
289
+
290
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['louvain'])
291
+ print('Louvain, Adjusted rand index = %.2f' %ARI)
292
+
293
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['gmm_cluster'])
294
+ print('GMM, Adjusted rand index = %.2f' %ARI)
295
+
296
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['LDA_cluster'])
297
+ print('LDA, Adjusted rand index = %.2f' %ARI)
298
+
299
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['LDA_cluster_rfc'])
300
+ print('LDA_rfc, Adjusted rand index = %.2f' %ARI)
301
+
302
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['LDA_cluster_clf'])
303
+ print('LDA_clf, Adjusted rand index = %.2f' %ARI)
304
+
305
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['cNMF_cluster_rfc'])
306
+ print('cNMF_rfc, Adjusted rand index = %.2f' %ARI)
307
+
308
+ ARI = adjusted_rand_score(adata.obs['clusters'], adata.obs['cNMF_cluster_clf'])
309
+ print('cNMF_clf, Adjusted rand index = %.2f' %ARI)
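For readers who want to see what the adjusted Rand index measures, the sketch below computes it from pair counts using only the standard library. It follows the usual contingency-table formula and is meant as an illustration; for real analyses the `sklearn` function used above remains the better choice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (assumes at least two samples)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every sample in a single cluster
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed labels
worse = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # every pair disagrees
print(same, worse)
```

The score ignores label names, so a clustering that matches the reference up to renaming still scores 1.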
310
+
311
+
312
+ # We can find that the LDA topic model is the most effective among the above clustering algorithms, but it also takes the longest computation time and requires GPU resources. We notice that the Gaussian mixture model is second only to the LDA topic model. The GMM will be a great choice in omicverse's future clustering algorithms for spatial transcriptomics.
ovrawm/t_cluster_space.txt ADDED
@@ -0,0 +1,399 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Spatial clustering and denoising expressions
5
+ #
6
+ # Spatial clustering, which shares an analogy with single-cell clustering, has expanded the scope of tissue physiology studies from cell-centroid to structure-centroid with spatially resolved transcriptomics (SRT) data.
7
+ #
8
+ # Here, we present four spatial clustering methods in OmicVerse.
9
+ #
10
+ # We made four improvements in integrating the `GraphST`, `BINARY`, `CAST` and `STAGATE` algorithms in OmicVerse:
11
+ # - We removed the preprocessing that comes with `GraphST` and used the preprocessing consistent with all SRTs in OmicVerse
12
+ # - We optimised the dimensional display of `GraphST`, and PCA is considered a self-contained computational step.
13
+ # - We implemented `mclust` using Python, removing the R language dependency.
14
+ # - We provided a unified interface, `ov.space.clusters`, so that users can run all of these methods through a single function call.
15
+ #
16
+ # If you found this tutorial helpful, please cite `GraphST`, `BINARY`, `CAST`, `STAGATE` and `OmicVerse`:
17
+ #
18
+ # - Long, Y., Ang, K.S., Li, M. et al. Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST. Nat Commun 14, 1155 (2023). https://doi.org/10.1038/s41467-023-36796-3
19
+ # - Lin S, Cui Y, Zhao F, Yang Z, Song J, Yao J, et al. Complete spatially resolved gene expression is not necessary for identifying spatial domains. Cell Genomics. 2024;4:100565.
20
+ # - Tang, Z., Luo, S., Zeng, H. et al. Search and match across spatial omics samples at single-cell resolution. Nat Methods 21, 1818–1829 (2024). https://doi.org/10.1038/s41592-024-02410-7
21
+ # - Dong, K., Zhang, S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat Commun 13, 1739 (2022). https://doi.org/10.1038/s41467-022-29439-6
22
+ #
23
+ #
24
+
25
+ # In[1]:
26
+
27
+
28
+ import omicverse as ov
29
+ #print(f"omicverse version: {ov.__version__}")
30
+ import scanpy as sc
31
+ #print(f"scanpy version: {sc.__version__}")
32
+ ov.plot_set()
33
+
34
+
35
+ # ## Preprocess data
36
+ #
37
+ # Here we present our re-analysis of sample 151676 of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. have manually annotated the DLPFC layers and white matter (WM) based on morphological features and gene markers.
38
+ #
39
+ # This tutorial demonstrates how to identify spatial domains on 10x Visium data using STAGATE. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provide it at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
40
+
41
+ # In[2]:
42
+
43
+
44
+ adata = sc.read_visium(path='data', count_file='151676_filtered_feature_bc_matrix.h5')
45
+ adata.var_names_make_unique()
46
+
47
+
48
+ # <div class="admonition warning">
49
+ # <p class="admonition-title">Note</p>
50
+ # <p>
51
+ # In omicverse versions later than `1.6.0`, we introduced `prost`, a dedicated module for computing spatially variable genes (SVGs), to replace scanpy's HVGs. If you want to use scanpy's HVGs, you can set mode=`scanpy` in `ov.space.svg` or use the following code.
52
+ # </p>
53
+ # </div>
54
+ #
55
+ # ```python
56
+ # #adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
57
+ # #adata.raw = adata
58
+ # #adata = adata[:, adata.var.highly_variable_features]
59
+ # ```
60
+
61
+ # In[3]:
62
+
63
+
64
+ sc.pp.calculate_qc_metrics(adata, inplace=True)
65
+ adata = adata[:,adata.var['total_counts']>100]
66
+ adata=ov.space.svg(adata,mode='prost',n_svgs=3000,target_sum=1e4,platform="visium",)
67
+ adata
68
+
69
+
70
+ # In[5]:
71
+
72
+
73
+ adata.write('data/cluster_svg.h5ad',compression='gzip')
74
+
75
+
76
+ # In[2]:
77
+
78
+
79
+ adata=ov.read('data/cluster_svg.h5ad',compression='gzip')
80
+
81
+
82
+ # (Optional) We read the ground truth area of our spatial data
83
+ #
84
+ # This step is not mandatory. In this tutorial it only serves to demonstrate the accuracy of our clustering; in your own tasks there is often no ground truth.
85
+
86
+ # In[3]:
87
+
88
+
89
+ # read the annotation
90
+ import pandas as pd
91
+ import os
92
+ Ann_df = pd.read_csv(os.path.join('data', '151676_truth.txt'), sep='\t', header=None, index_col=0)
93
+ Ann_df.columns = ['Ground Truth']
94
+ adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
95
+ sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
96
+
97
+
98
+ # ## Method1: GraphST
99
+ #
100
+ # GraphST was rated as one of the best spatial clustering algorithms in Nature Methods (2024.04), so we first call GraphST for spatial domain identification in OmicVerse.
101
+
102
+ # In[4]:
103
+
104
+
105
+ methods_kwargs={}
106
+ methods_kwargs['GraphST']={
107
+ 'device':'cuda:0',
108
+ 'n_pcs':30
109
+ }
110
+
111
+ adata=ov.space.clusters(adata,
112
+ methods=['GraphST'],
113
+ methods_kwargs=methods_kwargs,
114
+ lognorm=1e4)
115
+
116
+
117
+ # In[11]:
118
+
119
+
120
+ ov.utils.cluster(adata,use_rep='graphst|original|X_pca',method='mclust',n_components=10,
121
+ modelNames='EEV', random_state=112,
122
+ )
123
+ adata.obs['mclust_GraphST'] = ov.utils.refine_label(adata, radius=50, key='mclust')
124
+ adata.obs['mclust_GraphST']=adata.obs['mclust_GraphST'].astype('category')
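Label refinement of this kind smooths the cluster labels in space: each spot takes the most common label among its spatial neighbours. The sketch below shows that majority vote on toy coordinates. It assumes the neighbourhood is the k nearest spots, which is an assumption about how the `radius` argument is interpreted, and `refine_labels` here is a hypothetical stand-in, not `ov.utils.refine_label`.

```python
from collections import Counter


def refine_labels(coords, labels, n_neighbors=3):
    """Replace each spot's label with the majority label of its nearest neighbours."""
    refined = []
    for i, (xi, yi) in enumerate(coords):
        # Squared distances to every other spot, nearest first (ties broken by index).
        dists = sorted(
            ((xj - xi) ** 2 + (yj - yi) ** 2, j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        # Votes use the original labels, so the result does not depend on visiting order.
        votes = Counter(labels[j] for _, j in dists[:n_neighbors])
        refined.append(votes.most_common(1)[0][0])
    return refined


# Five spots on a line; the middle one carries an isolated label.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
labels = ["L1", "L1", "WM", "L1", "L1"]
smoothed = refine_labels(coords, labels)
print(smoothed)
```

A larger neighbourhood gives smoother domains but can erase thin layers, so the value is worth checking against the tissue image.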
125
+
126
+
127
+ # In[12]:
128
+
129
+
130
+ res=ov.space.merge_cluster(adata,groupby='mclust_GraphST',use_rep='graphst|original|X_pca',
131
+ threshold=0.2,plot=True)
132
+
133
+
134
+ # In[13]:
135
+
136
+
137
+ sc.pl.spatial(adata, color=['mclust_GraphST','mclust_GraphST_tree','mclust','Ground Truth'])
138
+
139
+
140
+ # We can also use `mclust_R` to cluster the spatial domains, but this method requires `rpy2` to be installed first.
141
+ #
142
+ # Using the R mclust algorithm requires the rpy2 package and the mclust package. See https://pypi.org/project/rpy2/ and https://cran.r-project.org/web/packages/mclust/index.html for details.
143
+
144
+ # In[14]:
145
+
146
+
147
+ ov.utils.cluster(adata,use_rep='graphst|original|X_pca',method='mclust_R',n_components=10,
148
+ random_state=42,
149
+ )
150
+ adata.obs['mclust_R_GraphST'] = ov.utils.refine_label(adata, radius=30, key='mclust_R')
151
+ adata.obs['mclust_R_GraphST']=adata.obs['mclust_R_GraphST'].astype('category')
152
+ res=ov.space.merge_cluster(adata,groupby='mclust_R_GraphST',use_rep='graphst|original|X_pca',
153
+ threshold=0.2,plot=True)
154
+
155
+
156
+ # In[15]:
157
+
158
+
159
+ sc.pl.spatial(adata, color=['mclust_R_GraphST','mclust_R_GraphST_tree','mclust','Ground Truth'])
160
+
161
+
162
+ # ## Method2: BINARY
163
+ #
164
+ # BINARY outperforms existing methods across various SRT data types while using significantly less input information.
165
+ #
166
+ # If your data is very large, or very sparse, I believe BINARY would be a great choice.
167
+
168
+ # In[3]:
169
+
170
+
171
+ methods_kwargs={}
172
+ methods_kwargs['BINARY']={
173
+ 'use_method':'KNN',
174
+ 'cutoff':6,
175
+ 'obs_key':'BINARY_sample',
176
+ 'use_list':None,
177
+ 'pos_weight':10,
178
+ 'device':'cuda:0',
179
+ 'hidden_dims':[512, 30],
180
+ 'n_epochs': 1000,
181
+ 'lr': 0.001,
182
+ 'key_added': 'BINARY',
183
+ 'gradient_clipping': 5,
184
+ 'weight_decay': 0.0001,
185
+ 'verbose': True,
186
+ 'random_seed':0,
187
+ 'lognorm':1e4,
188
+ 'n_top_genes':2000,
189
+ }
190
+ adata=ov.space.clusters(adata,
191
+ methods=['BINARY'],
192
+ methods_kwargs=methods_kwargs)
193
+
194
+
195
+ # If you want to use R's `mclust`, you can use `ov.utils.cluster`.
196
+ #
197
+ # But you need to install `rpy2` and `mclust` first.
198
+
199
+ # In[4]:
200
+
201
+
202
+ ov.utils.cluster(adata,use_rep='BINARY',method='mclust_R',n_components=10,
203
+ random_state=42,
204
+ )
205
+ adata.obs['mclust_BINARY'] = ov.utils.refine_label(adata, radius=30, key='mclust_R')
206
+ adata.obs['mclust_BINARY']=adata.obs['mclust_BINARY'].astype('category')
207
+
208
+
209
+ # In[5]:
210
+
211
+
212
+ res=ov.space.merge_cluster(adata,groupby='mclust_BINARY',use_rep='BINARY',
213
+ threshold=0.01,plot=True)
214
+
215
+
216
+ # In[6]:
217
+
218
+
219
+ sc.pl.spatial(adata, color=['mclust_BINARY','mclust_BINARY_tree','mclust','Ground Truth'])
220
+
221
+
222
+ # In[10]:
223
+
224
+
225
+ ov.utils.cluster(adata,use_rep='BINARY',method='mclust',n_components=10,
226
+ modelNames='EEV', random_state=42,
227
+ )
228
+ adata.obs['mclustpy_BINARY'] = ov.utils.refine_label(adata, radius=30, key='mclust')
229
+ adata.obs['mclustpy_BINARY']=adata.obs['mclustpy_BINARY'].astype('category')
230
+
231
+
232
+ # In[13]:
233
+
234
+
235
+ adata.obs['mclustpy_BINARY']=adata.obs['mclustpy_BINARY'].astype('category')
236
+ res=ov.space.merge_cluster(adata,groupby='mclustpy_BINARY',use_rep='BINARY',
237
+ threshold=0.013,plot=True)
238
+
239
+
240
+ # In[14]:
241
+
242
+
243
+ sc.pl.spatial(adata, color=['mclustpy_BINARY','mclustpy_BINARY_tree','mclust','Ground Truth'])
244
+ #adata.obs['mclust_BINARY'] = ov.utils.refine_label(adata, radius=30, key='mclust')
245
+ #adata.obs['mclust_BINARY']=adata.obs['mclust_BINARY'].astype('category')
246
+
247
+
248
+ # ## Method3: STAGATE
249
+ #
250
+ # STAGATE is designed for spatial clustering and expression denoising of spatially resolved transcriptomics (ST) data.
251
+ #
252
+ # STAGATE learns low-dimensional latent embeddings from both spatial information and gene expression via a graph attention auto-encoder. The method adopts an attention mechanism in the middle layer of the encoder and decoder, which adaptively learns the edge weights of the spatial neighbor network and uses them to update each spot's representation by aggregating information from its neighbors. The latent embeddings and the reconstructed expression profiles can be used for downstream tasks such as spatial domain identification, visualization, spatial trajectory inference, data denoising and 3D expression domain extraction.
253
+ #
254
+ # Dong, Kangning, and Shihua Zhang. “Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder.” Nature Communications 13.1 (2022): 1-12.
255
+ #
256
+ #
257
+ # Here, we use `ov.space.clusters` with `methods=['STAGATE']` to construct and train the STAGATE model.
258
+ #
259
+
260
+ # In[12]:
261
+
262
+
263
+ methods_kwargs={}
264
+ methods_kwargs['STAGATE']={
265
+ 'num_batch_x':3,'num_batch_y':2,
266
+ 'spatial_key':['X','Y'],'rad_cutoff':200,
267
+ 'num_epoch':1000,'lr':0.001,
268
+ 'weight_decay':1e-4,'hidden_dims':[512, 30],
269
+ 'device':'cuda:0',
270
+ #'n_top_genes':2000,
271
+ }
272
+
273
+ adata=ov.space.clusters(adata,
274
+ methods=['STAGATE'],
275
+ methods_kwargs=methods_kwargs)
276
+
277
+
278
+ # In[36]:
279
+
280
+
281
+ ov.utils.cluster(adata,use_rep='STAGATE',method='mclust_R',n_components=10,
282
+ random_state=112,
283
+ )
284
+ adata.obs['mclust_R_STAGATE'] = ov.utils.refine_label(adata, radius=30, key='mclust_R')
285
+ adata.obs['mclust_R_STAGATE']=adata.obs['mclust_R_STAGATE'].astype('category')
286
+ res=ov.space.merge_cluster(adata,groupby='mclust_R_STAGATE',use_rep='STAGATE',
287
+ threshold=0.005,plot=True)
288
+
289
+
290
+ # In[37]:
291
+
292
+
293
+ sc.pl.spatial(adata, color=['mclust_R_STAGATE','mclust_R_STAGATE_tree','mclust_R','Ground Truth'])
294
+
295
+
296
+ # ### Denoising
297
+
298
+ # In[52]:
299
+
300
+
301
+ adata.var.sort_values('PI',ascending=False).head(5)
302
+
303
+
304
+ # In[53]:
305
+
306
+
307
+ plot_gene = 'MBP'
308
+ import matplotlib.pyplot as plt
309
+ fig, axs = plt.subplots(1, 2, figsize=(8, 4))
310
+ sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
311
+ sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
312
+
313
+
314
+ # ## Method4: CAST
315
+ #
316
+ # CAST would be a great algorithm if your spatial transcriptome is at single-cell resolution and in multiple slices.
317
+
318
+ # In[38]:
319
+
320
+
321
+ methods_kwargs={}
322
+ methods_kwargs['CAST']={
323
+ 'output_path_t':'result/CAST_gas/output',
324
+ 'device':'cuda:0',
325
+ 'gpu_t':0
326
+ }
327
+ adata=ov.space.clusters(adata,
328
+ methods=['CAST'],
329
+ methods_kwargs=methods_kwargs)
330
+
331
+
332
+ # In[39]:
333
+
334
+
335
+ ov.utils.cluster(adata,use_rep='X_cast',method='mclust',n_components=10,
336
+ modelNames='EEV', random_state=42,
337
+ )
338
+ adata.obs['mclust_CAST'] = ov.utils.refine_label(adata, radius=50, key='mclust')
339
+ adata.obs['mclust_CAST']=adata.obs['mclust_CAST'].astype('category')
340
+
341
+
342
+ # In[40]:
343
+
344
+
345
+ res=ov.space.merge_cluster(adata,groupby='mclust_CAST',use_rep='X_cast',
346
+ threshold=0.1,plot=True)
347
+
348
+
349
+ # In[41]:
350
+
351
+
352
+ sc.pl.spatial(adata, color=['mclust_CAST','mclust_CAST_tree','mclust','Ground Truth'])
353
+
354
+
355
+ # In[42]:
356
+
357
+
358
+ adata
359
+
360
+
361
+ # ## Evaluate the clusters
362
+ #
363
+ # We use the adjusted Rand index (ARI) to score each clustering against the ground-truth labels.
364
+ #
365
+ # While it appears that STAGATE works best, note that this is only on this dataset.
366
+ # - If your data is spot-level resolution, GraphST, BINARY and STAGATE would be good algorithms to use
367
+ # - BINARY and CAST would be good algorithms if your data is NanoString or other single-cell resolution
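# As a rough illustration of what the ARI measures, the sketch below computes it from pair counts in pure Python. The function name `adjusted_rand_index` and the toy labels are hypothetical; the tutorial itself relies on `sklearn.metrics.cluster.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Count how many items fall into each (true, predicted) cell and each margin.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs that share a cell, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # agreement of a perfect match
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (assumed): cluster names differ, but the grouping is identical.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# One predicted cluster splits a true domain, so the score drops below 1.
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

# The score depends only on which spots are grouped together, not on how the clusters are named, which is why differently numbered mclust outputs can be compared directly with the manual layers.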
368
+
369
+ # In[50]:
370
+
371
+
372
+ from sklearn.metrics.cluster import adjusted_rand_score
373
+
374
+ obs_df = adata.obs.dropna()
375
+ #GraphST
376
+ ARI = adjusted_rand_score(obs_df['mclust_GraphST'], obs_df['Ground Truth'])
377
+ print('mclust_GraphST: Adjusted rand index = %.2f' %ARI)
378
+
379
+ ARI = adjusted_rand_score(obs_df['mclust_R_GraphST'], obs_df['Ground Truth'])
380
+ print('mclust_R_GraphST: Adjusted rand index = %.2f' %ARI)
381
+
382
+ ARI = adjusted_rand_score(obs_df['mclust_R_STAGATE'], obs_df['Ground Truth'])
383
+ print('mclust_R_STAGATE: Adjusted rand index = %.2f' %ARI)
384
+
385
+ ARI = adjusted_rand_score(obs_df['mclust_BINARY'], obs_df['Ground Truth'])
386
+ print('mclust_BINARY: Adjusted rand index = %.2f' %ARI)
387
+
388
+ ARI = adjusted_rand_score(obs_df['mclustpy_BINARY'], obs_df['Ground Truth'])
389
+ print('mclustpy_BINARY: Adjusted rand index = %.2f' %ARI)
390
+
391
+ ARI = adjusted_rand_score(obs_df['mclust_CAST'], obs_df['Ground Truth'])
392
+ print('mclust_CAST: Adjusted rand index = %.2f' %ARI)
393
+
394
+
395
+ # In[ ]:
396
+
397
+
398
+
399
+
ovrawm/t_cnmf.txt ADDED
@@ -0,0 +1,331 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Consensus Non-negative Matrix factorization (cNMF)
5
+ #
6
+ # cNMF is an analysis pipeline for inferring gene expression programs from single-cell RNA-Seq (scRNA-Seq) data.
7
+ #
8
+ # It takes a count matrix (N cells x G genes) as input and produces a (K x G) matrix of gene expression programs (GEPs) and an (N x K) matrix specifying the usage of each program in each cell. You can read more about the method on [GitHub](https://github.com/dylkot/cNMF); below we walk through an example on the dentategyrus dataset.
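# To make the matrix shapes concrete, here is a minimal pure-Python sketch of a single NMF run with multiplicative updates. It is not the cNMF implementation: cNMF repeats NMF over many random seeds and builds a consensus from the runs. The helper names (`nmf`, `matmul`, `transpose`, `reconstruction_error`) and the toy count matrix are assumptions for illustration.

```python
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize X (N x G) into usage W (N x K) and programs H (K x G)."""
    rng = random.Random(seed)
    n, g = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(g)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(g)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def reconstruction_error(X, W, H):
    R = matmul(W, H)
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))


# Toy count matrix (assumed): 4 cells x 3 genes with two obvious programs.
X = [[5, 0, 1], [4, 0, 1], [0, 3, 2], [0, 4, 2]]
usage, programs = nmf(X, k=2)
print(len(usage), len(usage[0]))        # N x K
print(len(programs), len(programs[0]))  # K x G
```

# Every entry of both factors stays non-negative, which is what lets each row of the K x G matrix be read as an additive expression program.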
9
+
10
+ # In[1]:
11
+
12
+
13
+ import scanpy as sc
14
+ import omicverse as ov
15
+ ov.plot_set()
16
+
17
+
18
+ # ## Loading dataset
19
+ #
20
+ # Here, we use the dentategyrus dataset as an example for cNMF.
21
+
22
+ # In[2]:
23
+
24
+
25
+ import scvelo as scv
26
+ adata=scv.datasets.dentategyrus()
27
+
28
+
29
+ # In[3]:
30
+
31
+
32
+ get_ipython().run_cell_magic('time', '', "adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)\nadata\n")
33
+
34
+
35
+ # In[23]:
36
+
37
+
38
+ ov.pp.scale(adata)
39
+ ov.pp.pca(adata)
40
+
41
+
42
+ # In[4]:
43
+
44
+
45
+ import matplotlib.pyplot as plt
46
+ from matplotlib import patheffects
47
+ fig, ax = plt.subplots(figsize=(4,4))
48
+ ov.pl.embedding(
49
+ adata,
50
+ basis="X_umap",
51
+ color=['clusters'],
52
+ frameon='small',
53
+ title="Celltypes",
54
+ #legend_loc='on data',
55
+ legend_fontsize=14,
56
+ legend_fontoutline=2,
57
+ #size=10,
58
+ ax=ax,
59
+ #legend_loc=True,
60
+ add_outline=False,
61
+ #add_outline=True,
62
+ outline_color='black',
63
+ outline_width=1,
64
+ show=False,
65
+ )
66
+
67
+
68
+ # ## Initialize and train the model
69
+
70
+ # In[5]:
71
+
72
+
73
+ import numpy as np
74
+ ## Initialize the cnmf object that will be used to run analyses
75
+ cnmf_obj = ov.single.cNMF(adata,components=np.arange(5,11), n_iter=20, seed=14, num_highvar_genes=2000,
76
+ output_dir='example_dg/cNMF', name='dg_cNMF')
77
+
78
+
79
+ # In[6]:
80
+
81
+
82
+ ## Distribute the jobs over two workers (total_workers=2) and launch only worker 0; the runs assigned to the other worker are not computed here, which is why `combine` below is called with skip_missing_files=True
83
+ cnmf_obj.factorize(worker_i=0, total_workers=2)
84
+
85
+
86
+ # In[7]:
87
+
88
+
89
+ cnmf_obj.combine(skip_missing_files=True)
90
+
91
+
92
+ # ## Compute the stability and error at each choice of K to see if a clear choice jumps out.
93
+ #
94
+ # Please note that, depending on the application, the maximum-stability solution is not always the best choice. However, it is often a good starting point even if you have to investigate several choices of K.
95
+
96
+ # In[8]:
97
+
98
+
99
+ cnmf_obj.k_selection_plot(close_fig=False)
100
+
101
+
102
+ # In this range, K=7 gave the most stable solution so we will begin by looking at that.
103
+ #
104
+ # The next step computes the consensus solution for a given choice of K. We first run it without any outlier filtering to see what that looks like. Setting the density threshold to anything >= 2.00 (the maximum possible distance between two unit vectors) ensures that nothing will be filtered.
105
+ #
106
+ # Then we run the consensus again with an outlier filter whose threshold is chosen by inspecting the histogram of distances between components and their nearest neighbors.
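# The outlier filter can be pictured with the following sketch: each component is scaled to unit length, its mean Euclidean distance to its k nearest neighbors is computed, and components whose mean distance exceeds the density threshold are dropped. The function names, the value of `k` and the toy spectra are assumptions; cNMF's own code may differ in detail.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_knn_distance(vectors, k):
    """Mean Euclidean distance from each unit-scaled vector to its k nearest neighbors."""
    unit = [l2_normalize(v) for v in vectors]
    result = []
    for i, u in enumerate(unit):
        dists = sorted(math.dist(u, w) for j, w in enumerate(unit) if j != i)
        result.append(sum(dists[:k]) / k)
    return result


# Toy spectra (assumed): three near-identical replicates and one outlier.
spectra = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.02], [0.0, 1.0]]
local_density = mean_knn_distance(spectra, k=2)

for threshold in (2.00, 0.10):
    kept = [i for i, d in enumerate(local_density) if d < threshold]
    print(threshold, kept)
```

# Because two unit vectors are never more than 2 apart, a threshold of 2.00 keeps every component here, while 0.10 removes only the isolated one.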
107
+
108
+ # In[9]:
109
+
110
+
111
+ selected_K = 7
112
+ density_threshold = 2.00
113
+
114
+
115
+ # In[10]:
116
+
117
+
118
+ cnmf_obj.consensus(k=selected_K,
119
+ density_threshold=density_threshold,
120
+ show_clustering=True,
121
+ close_clustergram_fig=False)
122
+
123
+
124
+ # The consensus plot above shows a substantial degree of concordance between the replicates, with a few outliers. An outlier threshold of 0.1 seems appropriate.
125
+
126
+ # In[11]:
127
+
128
+
129
+ density_threshold = 0.10
130
+
131
+
132
+ # In[12]:
133
+
134
+
135
+ cnmf_obj.consensus(k=selected_K,
136
+ density_threshold=density_threshold,
137
+ show_clustering=True,
138
+ close_clustergram_fig=False)
139
+
140
+
141
+ # ## Visualize the result
142
+
143
+ # In[13]:
144
+
145
+
146
+ import seaborn as sns
147
+ import matplotlib.pyplot as plt
148
+ from matplotlib import patheffects
149
+
150
+ from matplotlib import gridspec
151
+ import matplotlib.pyplot as plt
152
+
153
+ width_ratios = [0.2, 4, 0.5, 10, 1]
154
+ height_ratios = [0.2, 4]
155
+ fig = plt.figure(figsize=(sum(width_ratios), sum(height_ratios)))
156
+ gs = gridspec.GridSpec(len(height_ratios), len(width_ratios), fig,
157
+ 0.01, 0.01, 0.98, 0.98,
158
+ height_ratios=height_ratios,
159
+ width_ratios=width_ratios,
160
+ wspace=0, hspace=0)
161
+
162
+ D = cnmf_obj.topic_dist[cnmf_obj.spectra_order, :][:, cnmf_obj.spectra_order]
163
+ dist_ax = fig.add_subplot(gs[1,1], xscale='linear', yscale='linear',
164
+ xticks=[], yticks=[],xlabel='', ylabel='',
165
+ frameon=True)
166
+ dist_im = dist_ax.imshow(D, interpolation='none', cmap='viridis',
167
+ aspect='auto', rasterized=True)
168
+
169
+ left_ax = fig.add_subplot(gs[1,0], xscale='linear', yscale='linear', xticks=[], yticks=[],
170
+ xlabel='', ylabel='', frameon=True)
171
+ left_ax.imshow(cnmf_obj.kmeans_cluster_labels.values[cnmf_obj.spectra_order].reshape(-1, 1),
172
+ interpolation='none', cmap='Spectral', aspect='auto',
173
+ rasterized=True)
174
+
175
+ top_ax = fig.add_subplot(gs[0,1], xscale='linear', yscale='linear', xticks=[], yticks=[],
176
+ xlabel='', ylabel='', frameon=True)
177
+ top_ax.imshow(cnmf_obj.kmeans_cluster_labels.values[cnmf_obj.spectra_order].reshape(1, -1),
178
+ interpolation='none', cmap='Spectral', aspect='auto',
179
+ rasterized=True)
180
+
181
+ cbar_gs = gridspec.GridSpecFromSubplotSpec(3, 3, subplot_spec=gs[1, 2],
182
+ wspace=0, hspace=0)
183
+ cbar_ax = fig.add_subplot(cbar_gs[1,2], xscale='linear', yscale='linear',
184
+ xlabel='', ylabel='', frameon=True, title='Euclidean\nDistance')
185
+ cbar_ax.set_title('Euclidean\nDistance',fontsize=12)
186
+ vmin = D.min().min()
187
+ vmax = D.max().max()
188
+ fig.colorbar(dist_im, cax=cbar_ax,
189
+ ticks=np.linspace(vmin, vmax, 3),
190
+ )
191
+ cbar_ax.set_yticklabels(cbar_ax.get_yticklabels(),fontsize=12)
192
+
193
+
194
+ # In[14]:
195
+
196
+
197
+ density_filter = cnmf_obj.local_density.iloc[:, 0] < density_threshold
198
+ fig, hist_ax = plt.subplots(figsize=(4,4))
199
+
200
+ #hist_ax = fig.add_subplot(hist_gs[0,0], xscale='linear', yscale='linear',
201
+ # xlabel='', ylabel='', frameon=True, title='Local density histogram')
202
+ hist_ax.hist(cnmf_obj.local_density.values, bins=np.linspace(0, 1, 50))
203
+ hist_ax.yaxis.tick_right()
204
+
205
+ xlim = hist_ax.get_xlim()
206
+ ylim = hist_ax.get_ylim()
207
+ if density_threshold < xlim[1]:
208
+     hist_ax.axvline(density_threshold, linestyle='--', color='k')
208
+     hist_ax.text(density_threshold + 0.02, ylim[1] * 0.95, 'filtering\nthreshold\n\n', va='top')
210
+ hist_ax.set_xlim(xlim)
211
+ hist_ax.set_xlabel('Mean distance to k nearest neighbors\n\n%d/%d (%.0f%%) spectra above threshold\nwere removed prior to clustering'%(sum(~density_filter), len(density_filter), 100*(~density_filter).mean()))
212
+ hist_ax.set_title('Local density histogram')
213
+
214
+
215
+ # ## Explore the cNMF result
216
+ #
217
+ # We can load the results for a cNMF run with a given K and density filtering threshold like below
218
+
219
+ # In[15]:
220
+
221
+
222
+ result_dict = cnmf_obj.load_results(K=selected_K, density_threshold=density_threshold)
223
+
224
+
225
+ # In[16]:
226
+
227
+
228
+ result_dict['usage_norm'].head()
229
+
230
+
231
+ # In[17]:
232
+
233
+
234
+ result_dict['gep_scores'].head()
235
+
236
+
237
+ # In[18]:
238
+
239
+
240
+ result_dict['gep_tpm'].head()
241
+
242
+
243
+ # In[19]:
244
+
245
+
246
+ result_dict['top_genes'].head()
247
+
248
+
249
+ # We can assign cell classes directly from the program with the highest cNMF usage in each cell, but this has the disadvantage of producing mixed cell classes when the heterogeneity in the data is not strong enough.
250
+
251
+ # In[20]:
252
+
253
+
254
+ cnmf_obj.get_results(adata,result_dict)
255
+
256
+
257
+ # In[21]:
258
+
259
+
260
+ ov.pl.embedding(adata, basis='X_umap',color=result_dict['usage_norm'].columns,
261
+ use_raw=False, ncols=3, vmin=0, vmax=1,frameon='small')
262
+
263
+
264
+ # In[24]:
265
+
266
+
267
+ ov.pl.embedding(
268
+ adata,
269
+ basis="X_umap",
270
+ color=['cNMF_cluster'],
271
+ frameon='small',
272
+ #title="Celltypes",
273
+ #legend_loc='on data',
274
+ legend_fontsize=14,
275
+ legend_fontoutline=2,
276
+ #size=10,
277
+ #legend_loc=True,
278
+ add_outline=False,
279
+ #add_outline=True,
280
+ outline_color='black',
281
+ outline_width=1,
282
+ show=False,
283
+ )
284
+
285
+
286
+ # Here we propose another way of classifying the cells. We take cells with a cNMF usage greater than 0.5 as a primitive class, train a random forest classifier on them, and then use that classifier to assign the cells whose usage is below 0.5, to get a more accurate classification.
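# The first half of this idea, splitting cells into confidently assigned and ambiguous ones, can be sketched as follows. The function `split_by_usage`, the cell names and the usage values are hypothetical; the random forest step and the internals of `get_results_rfc` are not reproduced here.

```python
def split_by_usage(usage_norm, threshold=0.5):
    """Assign each cell to its top program if that usage exceeds the threshold.

    usage_norm maps a cell name to its normalized program usages.
    Returns (confident, ambiguous): a dict cell -> program number (1-based)
    and a list of cells left for a downstream classifier.
    """
    confident, ambiguous = {}, []
    for cell, usages in usage_norm.items():
        best = max(range(len(usages)), key=usages.__getitem__)
        if usages[best] > threshold:
            confident[cell] = best + 1
        else:
            ambiguous.append(cell)
    return confident, ambiguous


# Toy normalized usages (assumed) for three cells and three programs.
usage_norm = {
    "cell_a": [0.80, 0.10, 0.10],
    "cell_b": [0.40, 0.35, 0.25],
    "cell_c": [0.10, 0.20, 0.70],
}
confident, ambiguous = split_by_usage(usage_norm, threshold=0.5)
print(confident)
print(ambiguous)
```

# The confident cells would serve as training labels, and the ambiguous ones are the cells the classifier is asked to predict.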
287
+
288
+ # In[25]:
289
+
290
+
291
+ cnmf_obj.get_results_rfc(adata,result_dict,
292
+ use_rep='scaled|original|X_pca',
293
+ cNMF_threshold=0.5)
294
+
295
+
296
+ # In[27]:
297
+
298
+
299
+ ov.pl.embedding(
300
+ adata,
301
+ basis="X_umap",
302
+ color=['cNMF_cluster_rfc','cNMF_cluster_clf'],
303
+ frameon='small',
304
+ #title="Celltypes",
305
+ #legend_loc='on data',
306
+ legend_fontsize=14,
307
+ legend_fontoutline=2,
308
+ #size=10,
309
+ #legend_loc=True,
310
+ add_outline=False,
311
+ #add_outline=True,
312
+ outline_color='black',
313
+ outline_width=1,
314
+ show=False,
315
+ )
316
+
317
+
318
+ # In[25]:
319
+
320
+
321
+ plot_genes=[]
322
+ for i in result_dict['top_genes'].columns:
323
+     plot_genes+=result_dict['top_genes'][i][:3].values.reshape(-1).tolist()
324
+
325
+
326
+ # In[26]:
327
+
328
+
329
+ sc.pl.dotplot(adata,plot_genes,
330
+ "cNMF_cluster", dendrogram=False,standard_scale='var',)
331
+
ovrawm/t_commot_flowsig.txt ADDED
@@ -0,0 +1,395 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Spatial Communication
5
+ #
6
+ # Spatial communication is a topic of great interest in the spatial transcriptomics community, and we would like to trace how communication signals are conducted through the tissue.
7
+ #
8
+ # Here, we introduce two methods integrated in omicverse, `COMMOT` and `flowsig`.
9
+ #
10
+ # We made three improvements when integrating the `COMMOT` and `flowsig` algorithms into OmicVerse:
11
+ #
12
+ # - We reduced the installation conflicts of `COMMOT` and `flowsig`; users only need to update OmicVerse to the latest version.
13
+ # - We optimized the visualization of `COMMOT` and `flowsig` and unified the data preprocessing process so that users don't need to struggle with different data processing flows.
14
+ # - We have fixed some bugs that could occur when the functions are run.
15
+ #
16
+ # If you found this tutorial helpful, please cite `COMMOT`, `flowsig` and OmicVerse:
17
+ #
18
+ # - Cang, Z., Zhao, Y., Almet, A.A. et al. Screening cell–cell communication in spatial transcriptomics via collective optimal transport. Nat Methods 20, 218–228 (2023). https://doi.org/10.1038/s41592-022-01728-4
19
+ # - Almet, A.A., Tsai, YC., Watanabe, M. et al. Inferring pattern-driving intercellular flows from single-cell and spatial transcriptomics. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02380-w
20
+
21
+ # In[1]:
22
+
23
+
24
+ import omicverse as ov
25
+ #print(f"omicverse version: {ov.__version__}")
26
+ import scanpy as sc
27
+ #print(f"scanpy version: {sc.__version__}")
28
+ ov.plot_set()
29
+
30
+
31
+ # ## Preprocess data
32
+ #
33
+ # Here we present our re-analysis of sample 151676 of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. manually annotated the DLPFC layers and white matter (WM) based on morphological features and gene markers.
34
+ #
35
+ # This tutorial uses 10x Visium data. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provide it at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
36
+
37
+ # In[40]:
38
+
39
+
40
+ adata = sc.read_visium(path='data', count_file='151676_filtered_feature_bc_matrix.h5')
41
+ adata.var_names_make_unique()
42
+
43
+
44
+ # <div class="admonition warning">
45
+ # <p class="admonition-title">Note</p>
46
+ # <p>
47
+ # In omicverse versions later than `1.6.0`, we introduced a dedicated spatially variable gene (SVG) module, prost, to replace scanpy's HVGs. If you want to use scanpy's HVGs, you can set `mode='scanpy'` in `ov.space.svg` or use the following code.
48
+ # </p>
49
+ # </div>
50
+ #
51
+ # ```python
52
+ # #adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
53
+ # #adata.raw = adata
54
+ # #adata = adata[:, adata.var.highly_variable_features]
55
+ # ```
56
+
57
+ # In[ ]:
58
+
59
+
60
+ sc.pp.calculate_qc_metrics(adata, inplace=True)
61
+ adata = adata[:,adata.var['total_counts']>100]
62
+ adata=ov.space.svg(adata,mode='prost',n_svgs=3000,target_sum=1e4,platform="visium",)
63
+ adata
64
+
65
+
66
+ # In[ ]:
67
+
68
+
69
+ adata.write('data/cluster_svg.h5ad',compression='gzip')
70
+
71
+
72
+ # In[3]:
73
+
74
+
75
+ #adata=ov.read('data/cluster_svg.h5ad',compression='gzip')
76
+
77
+
78
+ # ## Communication Analysis with COMMOT
79
+ #
80
+ # ### Spatial communication inference
81
+ #
82
+ # We will use the CellChatDB ligand-receptor database here. Only the secreted signaling LR pairs will be used.
83
+ #
84
+ # Jin, Suoqin, et al. “Inference and analysis of cell-cell communication using CellChat.” Nature communications 12.1 (2021): 1-20.
85
+
86
+ # In[42]:
87
+
88
+
89
+ df_cellchat = ov.externel.commot.pp.ligand_receptor_database(species='human',
90
+ signaling_type='Secreted Signaling',
91
+ database='CellChat')
92
+ print(df_cellchat.shape)
93
+
94
+
95
+ # We then filter the LR pairs to keep only the pairs with both ligand and receptor expressed in at least 5% of the spots.
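# The filtering rule can be sketched in plain Python as below. This is only an illustration of the stated criterion, not the COMMOT code: the helper names, the toy expression values and the assumption that every subunit of a heteromeric complex (joined by `_`) must pass the cutoff are ours.

```python
def expressed_fraction(counts):
    """Fraction of spots in which a gene has a non-zero count."""
    return sum(1 for c in counts if c > 0) / len(counts)


def filter_lr_pairs(lr_pairs, expression, min_cell_pct=0.05, delimiter="_"):
    """Keep pairs whose ligand and receptor genes are all expressed widely enough."""
    expressed = {
        gene for gene, counts in expression.items()
        if expressed_fraction(counts) >= min_cell_pct
    }
    kept = []
    for ligand, receptor in lr_pairs:
        # Heteromeric complexes such as 'Fzd4_Lrp6' are split into subunits.
        genes = ligand.split(delimiter) + receptor.split(delimiter)
        if all(gene in expressed for gene in genes):
            kept.append((ligand, receptor))
    return kept


# Toy per-gene counts over 20 spots (assumed values).
expression = {
    "Wnt4": [1] * 2 + [0] * 18,    # expressed in 10% of spots
    "Fzd4": [3] * 5 + [0] * 15,    # 25%
    "Lrp6": [0] * 20,              # never detected
    "Fgf1": [1] * 2 + [0] * 18,    # 10%
    "Fgfr1": [2] * 10 + [0] * 10,  # 50%
}
pairs = [("Wnt4", "Fzd4_Lrp6"), ("Wnt4", "Fzd4"), ("Fgf1", "Fgfr1")]
print(filter_lr_pairs(pairs, expression, min_cell_pct=0.05))
```

# Here the Wnt4 to Fzd4_Lrp6 pair is dropped because one receptor subunit is never detected, while the other two pairs survive.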
96
+
97
+ # In[43]:
98
+
99
+
100
+ df_cellchat_filtered = ov.externel.commot.pp.filter_lr_database(df_cellchat,
101
+ adata,
102
+ min_cell_pct=0.05)
103
+ print(df_cellchat_filtered.shape)
104
+
105
+
106
+ # Now perform spatial communication inference for these 250 ligand-receptor pairs with a spatial distance limit of 500. CellChat database considers heteromeric units. The signaling results are stored as spot-by-spot matrices in the obsp slots. For example, the score for spot i signaling to spot j through the LR pair can be retrieved from `adata.obsp['commot-cellchat-Wnt4-Fzd4_Lrp6'][i,j]`.
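# To show how such a spot-by-spot matrix is read, the sketch below builds the key in the pattern quoted above and sums a toy matrix along each axis. The helper names and the numbers are hypothetical, and a nested list stands in for the sparse matrix that would live in `adata.obsp`.

```python
def commot_key(database_name, ligand, receptor):
    """Key pattern used in the text: 'commot-<database>-<ligand>-<receptor>'."""
    return f"commot-{database_name}-{ligand}-{receptor}"


def sent_and_received(S):
    """S[i][j] is the score for spot i signaling to spot j."""
    sent = [sum(row) for row in S]            # total signal each spot sends
    received = [sum(col) for col in zip(*S)]  # total signal each spot receives
    return sent, received


key = commot_key("cellchat", "Wnt4", "Fzd4_Lrp6")

# Toy stand-in (assumed values) for adata.obsp[key] with three spots.
obsp = {key: [[0.0, 0.2, 0.0],
              [0.0, 0.0, 0.5],
              [0.1, 0.0, 0.0]]}

i, j = 1, 2
print(key, obsp[key][i][j])
sent, received = sent_and_received(obsp[key])
print(sent)
print(received)
```

# Row sums describe what each spot sends and column sums what it receives, which is the quantity summarized later as signal inflow.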
107
+
108
+ # In[44]:
109
+
110
+
111
+ ov.externel.commot.tl.spatial_communication(adata,
112
+ database_name='cellchat',
113
+ df_ligrec=df_cellchat_filtered,
114
+ dis_thr=500, heteromeric=True,
115
+ pathway_sum=True)
116
+
117
+
118
+ # (Optional) We read the ground-truth annotation of our spatial data.
119
+ #
120
+ # This step is not mandatory. In this tutorial it only serves to show how the results relate to the annotated regions; in your own tasks, a ground truth is often not available.
121
+ #
122
+ # <div class="admonition warning">
123
+ # <p class="admonition-title">Note</p>
124
+ # <p>
125
+ # You can also use cell types or any other annotation stored in adata.obs. The annotation used here was chosen arbitrarily and has no particular significance; it only makes the signals easier to visualize and study.
126
+ # </p>
127
+ # </div>
128
+
129
+ # In[45]:
130
+
131
+
132
+ # read the annotation
133
+ import pandas as pd
134
+ import os
135
+ Ann_df = pd.read_csv(os.path.join('data', '151676_truth.txt'), sep='\t', header=None, index_col=0)
136
+ Ann_df.columns = ['Ground_Truth']
137
+ adata.obs['Ground_Truth'] = Ann_df.loc[adata.obs_names, 'Ground_Truth']
138
+ Layer_color=['#283b5c', '#d8e17b', '#838e44', '#4e8991', '#d08c35', '#511a3a',
139
+ '#c2c2c2', '#dfc648']
140
+ sc.pl.spatial(adata, img_key="hires", color=["Ground_Truth"],palette=Layer_color)
141
+
142
+
143
+ # ### Visualize the communication signal in spatial space
144
+ #
145
+ # Determine the spatial direction of a signaling pathway, for example, the FGF pathway. The interpolated signaling directions for where the signals are sent by the spots and where the signals received by the spots are from are stored in `adata.obsm['commot_sender_vf-cellchat-FGF']` and `adata.obsm['commot_receiver_vf-cellchat-FGF']`, respectively.
146
+ #
147
+ # We use FGF as the example because FGF signaling has been reported to be crucial for cortical folding in gyrencephalic mammals, to be a pivotal upstream regulator of the production of OSVZ progenitors, and to be the first signaling pathway found to regulate cortical folding.
148
+
149
+ # In[46]:
150
+
151
+
152
+ ct_color_dict=dict(zip(adata.obs['Ground_Truth'].cat.categories,
153
+ adata.uns['Ground_Truth_colors']))
154
+
155
+
156
+ # In[47]:
157
+
158
+
159
+ adata.uns['commot-cellchat-info']['df_ligrec'].head()
160
+
161
+
162
+ # In[48]:
163
+
164
+
165
+ import matplotlib.pyplot as plt
166
+ scale=0.000008
167
+ k=5
168
+ goal_pathway='FGF'
169
+ ov.externel.commot.tl.communication_direction(adata, database_name='cellchat', pathway_name=goal_pathway, k=k)
170
+ ov.externel.commot.pl.plot_cell_communication(adata, database_name='cellchat',
171
+ pathway_name=goal_pathway, plot_method='grid',
172
+ background_legend=True,
173
+ scale=scale, ndsize=8, grid_density=0.4,
174
+ summary='sender', background='cluster',
175
+ clustering='Ground_Truth',
176
+ cluster_cmap=ct_color_dict,
177
+ cmap='Alphabet',
178
+ normalize_v = True, normalize_v_quantile=0.995)
179
+ plt.title(f'Pathway:{goal_pathway}',fontsize=13)
180
+ #plt.savefig('figures/TLE/TLE_cellchat_all_FGF.png',dpi=300,bbox_inches='tight')
181
+ #fig.savefig('pdf/TLE/control_cellchat_all_FGF.pdf',dpi=300,bbox_inches='tight')
182
+
183
+
184
+ # In[49]:
185
+
186
+
187
+ adata.write('data/151676_commot.h5ad',compression='gzip')
188
+
189
+
190
+ # In[2]:
191
+
192
+
193
+ adata=ov.read('data/151676_commot.h5ad')
194
+ adata
195
+
196
+
197
+ # ## Communication signal inference with flowsig
198
+ #
199
+ # ### Construct GEMs
200
+ # We now construct gene expression modules (GEMs) from the unnormalised count data. For ST data, we use `NMF`.
201
+
202
+ # In[3]:
203
+
204
+
205
+ adata.layers['normalized'] = adata.X.copy()
206
+
207
+ # We construct 10 gene expression modules using the raw cell count.
208
+ ov.externel.flowsig.pp.construct_gems_using_nmf(adata,
209
+ n_gems = 10,
210
+ layer_key = 'counts',
211
+ )
212
+
213
+
214
+ # If you want to study the genes in a GEM, we provide the `ov.externel.flowsig.ul.get_top_gem_genes` function for getting the genes in a specific GEM.
215
+
216
+ # In[4]:
217
+
218
+
219
+ goal_gem='GEM-5'
220
+ gem_gene=ov.externel.flowsig.ul.get_top_gem_genes(adata=adata,
221
+ gems=[goal_gem],
222
+ n_genes=100,
223
+ gene_type='all',
224
+ method = 'nmf',
225
+ )
226
+ gem_gene.head()
227
+
228
+
229
+ # ### Construct the flow expression matrices
230
+ #
231
+ # We construct augmented flow expression matrices for each condition that measure three types of variables:
232
+ #
233
+ # 1. Intercellular signal inflow, i.e., how much of a signal a cell received. For ST data, signal inflow is constructed by summing the received signals for each significant ligand inferred by COMMOT.
234
+ # 2. GEMs, which encapsulate intracellular information processing. We define these as cellwise membership to the GEM.
235
+ # 3. Intercellular signal outflow, i.e., how much of a signal a cell sent. These are simply ligand gene expression.
+ #
236
+ # The key assumption of flowsig is that all intercellular information flows are directed from signal inflows to GEMs, from one GEM to another GEM, and from GEMs to signal outflows.
237
+ #
238
+ # For spatial data, we use COMMOT output directly to construct signal inflow expression and do not need knowledge about TF databases.
239
+
240
+ # In[5]:
241
+
242
+
243
+ commot_output_key = 'commot-cellchat'
244
+ # We first construct the potential cellular flows from the commot output
245
+ ov.externel.flowsig.pp.construct_flows_from_commot(adata,
246
+ commot_output_key,
247
+ gem_expr_key = 'X_gem',
248
+ scale_gem_expr = True,
249
+ flowsig_network_key = 'flowsig_network',
250
+ flowsig_expr_key = 'X_flow')
251
+
252
+
253
+ # For spatial data, we retain spatially informative variables, which we determine by calculating the Moran's I value for signal inflow and signal outflow variables. In case the spatial graph has not been calculated for this data yet, FlowSig will do so, meaning that we need to specify both the coordinate type (`grid` or `generic`) and, for grid coordinates, `n_neighbours`, which is 8 in this case.
254
+ #
255
+ # Flow expression variables are defined to be spatially informative if their Moran's I value is above a specified threshold.
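# For intuition, Moran's I with binary neighbor weights can be computed as in the sketch below. The function `morans_i`, the six-spot chain and the values are assumptions for illustration; FlowSig computes the statistic on the real spatial graph.

```python
def morans_i(values, neighbors):
    """Moran's I with weight 1 for every (spot, neighbor) pair.

    values: one number per spot.
    neighbors: dict mapping a spot index to the indices of its neighbors.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # a constant variable carries no spatial pattern
    w_total = sum(len(nb) for nb in neighbors.values())
    num = sum(dev[i] * dev[j] for i, nb in neighbors.items() for j in nb)
    return (n / w_total) * num / denom


# Six spots on a line (assumed layout); each spot neighbors the adjacent ones.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

clustered = [1, 1, 1, 0, 0, 0]    # similar values sit next to each other
alternating = [1, 0, 1, 0, 1, 0]  # neighbors always differ

moran_threshold = 0.15
for name, vals in (("clustered", clustered), ("alternating", alternating)):
    score = morans_i(vals, chain)
    print(name, round(score, 3), score > moran_threshold)
```

# A spatially clustered variable scores well above the threshold and would be retained, whereas a checkerboard-like variable scores negative and would be discarded.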
256
+
257
+ # In[6]:
258
+
259
+
260
+ # Then we subset for "spatially flowing" inflows and outflows
261
+ ov.externel.flowsig.pp.determine_informative_variables(adata,
262
+ flowsig_expr_key = 'X_flow',
263
+ flowsig_network_key = 'flowsig_network',
264
+ spatial = True,
265
+ moran_threshold = 0.15,
266
+ coord_type = 'grid',
267
+ n_neighbours = 8,
268
+ library_key = None)
269
+
270
+
271
+ # ### Learn intercellular flows
272
+ #
273
+ # For spatial data, where there are far fewer "control vs. perturbed" studies, we use the GSP method, which uses conditional independence testing and a greedy algorithm to learn the CPDAG containing directed arcs and undirected edges.
274
+ #
275
+ # For spatial data, we cannot bootstrap by resampling across individual cells because we would lose the additional layer of correlation contained in the spatial data. Rather, we divide the tissue up into spatial "blocks" and resample within blocks. This is known as block bootstrapping.
276
+ #
277
+ # To calculate the blocks, we use scikit-learn's k-means clustering on the spatial coordinates to generate 10 roughly equally sized spatial blocks.
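# The resampling idea can be sketched with the standard library alone: spots are grouped by block label and drawn with replacement inside each block, so every bootstrap sample keeps the same number of spots per region. The function `block_bootstrap` and the toy labels are hypothetical, and flowsig's own resampling may differ in detail.

```python
import random
from collections import Counter, defaultdict


def block_bootstrap(block_labels, seed=0):
    """Return resampled spot indices, drawn with replacement within each block."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, block in enumerate(block_labels):
        members[block].append(idx)
    sample = []
    for block in sorted(members):
        idxs = members[block]
        sample.extend(rng.choices(idxs, k=len(idxs)))
    return sample


# Toy block labels (assumed), standing in for adata.obs['spatial_kmeans'].
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2]
sample = block_bootstrap(labels, seed=1)
print(sample)
print(Counter(labels[i] for i in sample) == Counter(labels))
```

# Resampling this way keeps each draw tied to its spatial neighborhood instead of mixing spots from across the whole tissue.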
278
+
279
+ # In[9]:
280
+
281
+
282
+ from sklearn.cluster import KMeans
283
+ import pandas as pd
284
+
285
+ kmeans = KMeans(n_clusters=10, random_state=0).fit(adata.obsm['spatial'])
286
+ adata.obs['spatial_kmeans'] = pd.Series(kmeans.labels_, dtype='category').values
287
+
288
+
289
+ # We use these blocks to learn the spatial intercellular flows.
290
+
291
+ # In[ ]:
292
+
293
+
294
+ # # Now we are ready to learn the network
295
+ ov.externel.flowsig.tl.learn_intercellular_flows(adata,
296
+ flowsig_key = 'flowsig_network',
297
+ flow_expr_key = 'X_flow',
298
+ use_spatial = True,
299
+ block_key = 'spatial_kmeans',
300
+ n_jobs = 4,
301
+ n_bootstraps = 500)
302
+
303
+
304
+ # ### Partially validate intercellular flow network
305
+ #
306
+ # Finally, we will remove any "false positive" edges. Noting that the CPDAG contains both directed arcs and undirected edges, we do two things.
307
+ #
308
+ # First, we remove directed arcs that are not oriented from signal inflow to GEM, from GEM to GEM, or from GEM to signal outflow. Undirected edges are reoriented so that they obey the same directionalities.
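# A small sketch of this rule, under our own assumptions about the data layout: nodes carry a type (`inflow`, `module` or `outflow`), directed arcs are `(source, target)` tuples, and undirected edges are unordered pairs. The function `apply_flow_rules` and all node names are hypothetical and do not mirror the internals of `apply_biological_flow`.

```python
# Directions permitted by the flowsig assumption stated in the text.
ALLOWED = {("inflow", "module"), ("module", "module"), ("module", "outflow")}


def apply_flow_rules(arcs, undirected, node_type):
    """Drop arcs with a forbidden direction and orient undirected edges."""
    kept = [(u, v) for u, v in arcs if (node_type[u], node_type[v]) in ALLOWED]
    for u, v in undirected:
        forward = (node_type[u], node_type[v]) in ALLOWED
        backward = (node_type[v], node_type[u]) in ALLOWED
        if forward and backward:
            kept.extend([(u, v), (v, u)])  # GEM to GEM stays two-way
        elif forward:
            kept.append((u, v))
        elif backward:
            kept.append((v, u))
    return kept


# Hypothetical nodes: one inflow, two GEMs and one outflow variable.
node_type = {"FGF_in": "inflow", "GEM-1": "module",
             "GEM-2": "module", "FGF_out": "outflow"}
arcs = [("FGF_in", "GEM-1"), ("GEM-1", "FGF_in"),
        ("GEM-2", "FGF_out"), ("FGF_out", "FGF_in")]
undirected = [("FGF_out", "GEM-1"), ("GEM-1", "GEM-2")]

for arc in apply_flow_rules(arcs, undirected, node_type):
    print(arc)
```

# Arcs pointing back into an inflow or out of an outflow disappear, and the undirected edge between a GEM and an outflow is oriented from the GEM to the outflow.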
309
+
310
+ # In[8]:
311
+
312
+
313
+ # This part is key for reducing false positives
314
+ ov.externel.flowsig.tl.apply_biological_flow(adata,
315
+ flowsig_network_key = 'flowsig_network',
316
+ adjacency_key = 'adjacency',
317
+ validated_key = 'validated')
318
+
319
+
320
+ # Second, we will remove directed arcs whose bootstrapped frequencies are below a specified edge threshold as well as undirected edges whose total bootstrapped frequencies are below the same threshold. Because we did not have perturbation data, we specify a more stringent edge threshold.
321
+ #
322
+
323
+ # In[26]:
324
+
325
+
326
+ edge_threshold = 0.7
327
+
328
+ ov.externel.flowsig.tl.filter_low_confidence_edges(adata,
329
+ edge_threshold = edge_threshold,
330
+ flowsig_network_key = 'flowsig_network',
331
+ adjacency_key = 'adjacency_validated',
332
+ filtered_key = 'filtered')
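+ # As a rough sketch of the thresholding rule described above, assume that bootstrapped frequencies are stored per arc and that an edge counts as undirected when both directions have a nonzero frequency. The `filter_edges` helper and the toy frequencies are hypothetical; flowsig's own bookkeeping may differ.

```python
def filter_edges(freq, threshold):
    """Drop low-confidence edges.

    freq: dict mapping (u, v) to the bootstrapped frequency of arc u -> v.
    A directed arc is kept if its own frequency reaches the threshold;
    an undirected edge is kept if the total of both directions does.
    """
    kept = {}
    for (u, v), f in freq.items():
        reverse = freq.get((v, u), 0.0)
        if reverse > 0:
            # Undirected edge: judge the summed frequency of both directions.
            if f + reverse >= threshold:
                kept[(u, v)] = f
        elif f >= threshold:
            # Directed arc: judge its own frequency.
            kept[(u, v)] = f
    return kept

freq = {("A", "B"): 0.8, ("B", "C"): 0.4, ("C", "B"): 0.35, ("C", "D"): 0.5}
confident = filter_edges(freq, threshold=0.7)
```

+ # In this toy example the arc A -> B and the undirected edge B - C survive, while C -> D is removed.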
333
+
334
+
335
+ # In[27]:
336
+
337
+
338
+ adata.write('data/cortex_commot_flowsig.h5ad',compression='gzip')
339
+
340
+
341
+ # In[2]:
342
+
343
+
344
+ #adata=ov.read('data/cortex_commot_flowsig.h5ad')
345
+
346
+
347
+ # ## Visualize the result of flowsig
348
+ #
349
+ # We can construct the directed NetworkX DiGraph object from adjacency_validated_filtered.
350
+
351
+ # In[3]:
352
+
353
+
354
+ flow_network = ov.externel.flowsig.tl.construct_intercellular_flow_network(adata,
355
+ flowsig_network_key = 'flowsig_network',
356
+ adjacency_key = 'adjacency_validated_filtered')
357
+
358
+
359
+ # ### Cell-specific GEM
360
+ #
361
+ # The first question is which GEM is relevant to the cell type we want to study. Here, we use a dotplot to visualize the expression of each GEM in the different cell types.
362
+
363
+ # In[4]:
364
+
365
+
366
+ flowsig_expr_key='X_gem'
367
+ X_flow = adata.obsm[flowsig_expr_key]
368
+ adata_subset = sc.AnnData(X=X_flow)
369
+ adata_subset.obs = adata.obs
370
+ adata_subset.var.index =[f'GEM-{i}' for i in range(1,len(adata_subset.var)+1)]
371
+
372
+
373
+ # In[5]:
374
+
375
+
376
+ import matplotlib.pyplot as plt
377
+ ax=sc.pl.dotplot(adata_subset, adata_subset.var.index, groupby='Ground_Truth',
378
+ dendrogram=True,standard_scale='var',cmap='Reds',show=False)
379
+ color_dict=dict(zip(adata.obs['Ground_Truth'].cat.categories,adata.uns['Ground_Truth_colors']))
380
+
381
+
382
+ # ### Visualize the flowsig network
383
+ #
384
+ # We fixed the network function provided by the author and provided a better visualization.
385
+
386
+ # In[7]:
387
+
388
+
389
+ ov.pl.plot_flowsig_network(flow_network=flow_network,
390
+ gem_plot=['GEM-2','GEM-7','GEM-1','GEM-3','GEM-4','GEM-5'],
391
+ figsize=(8,4),
392
+ curve_awarg={'eps':2},
393
+ node_shape={'GEM':'^','Sender':'o','Receptor':'o'},
394
+ ylim=(-0.5,0.5),xlim=(-3,3))
395
+
ovrawm/t_cytotrace.txt ADDED
@@ -0,0 +1,110 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Prediction of absolute developmental potential using CytoTrace2
5
+ #
6
+ # CytoTRACE 2 is a computational method for predicting cellular potency categories and absolute developmental potential from single-cell RNA-sequencing data.
7
+ #
8
+ # Potency categories in the context of CytoTRACE 2 classify cells based on their developmental potential, ranging from totipotent and pluripotent cells with broad differentiation potential to lineage-restricted oligopotent, multipotent and unipotent cells capable of producing varying numbers of downstream cell types, and finally, differentiated cells, ranging from mature to terminally differentiated phenotypes.
9
+ #
10
+ # We made three improvements in integrating the CytoTrace2 algorithm in OmicVerse:
11
+ #
12
+ # - No additional packages to install, including R
13
+ # - We fixed a bug in multi-threaded pools to avoid potential error reporting
14
+ # - Native support for `anndata`, you don't need to export `input_file` and `annotation_file`.
15
+ #
16
+ # If you found this tutorial helpful, please cite CytoTrace2 and OmicVerse:
17
+ #
18
+ # Kang, M., Armenteros, J. J. A., Gulati, G. S., Gleyzer, R., Avagyan, S., Brown, E. L., Zhang, W., Usmani, A., Earland, N., Wu, Z., Zou, J., Fields, R. C., Chen, D. Y., Chaudhuri, A. A., & Newman, A. M. (2024). Mapping single-cell developmental potential in health and disease with interpretable deep learning. bioRxiv : the preprint server for biology, 2024.03.19.585637. https://doi.org/10.1101/2024.03.19.585637
19
+
20
+ # In[1]:
21
+
22
+
23
+ import omicverse as ov
24
+ ov.plot_set()
25
+
26
+
27
+ # ## Preprocess data
28
+ #
29
+ # As an example, we apply CytoTRACE 2 to the dentate gyrus neurogenesis dataset, which comprises multiple heterogeneous subpopulations.
30
+
31
+ # In[2]:
32
+
33
+
34
+ import scvelo as scv
35
+ adata=scv.datasets.dentategyrus()
36
+ adata
37
+
38
+
39
+ # In[4]:
40
+
41
+
42
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,) # run under the %%time cell magic in the original notebook
43
+
44
+
45
+ # ## Predict cytotrace2
46
+ #
47
+ # We need to import the two pre-trained models from CytoTrace2, see the download links for the models:
48
+ #
49
+ # - Figshare:
50
+ # https://figshare.com/ndownloader/files/47258749
51
+ #
52
+ # - or Github:
53
+ # https://github.com/digitalcytometry/cytotrace2/tree/main/cytotrace2_python/cytotrace2_py/resources/17_models_weights
54
+ # https://github.com/digitalcytometry/cytotrace2/tree/main/cytotrace2_python/cytotrace2_py/resources/5_models_weights
55
+ #
56
+ # All parameters are explained as follows:
57
+ # - adata: AnnData object containing the scRNA-seq data.
58
+ # - use_model_dir: Path to the directory containing the pre-trained model files.
59
+ # - species: The species of the input data. Default is "mouse".
60
+ # - batch_size: The number of cells to process in each batch. Default is 10000.
61
+ # - smooth_batch_size: The number of cells to process in each batch for smoothing. Default is 1000.
62
+ # - disable_parallelization: If True, disable parallel processing. Default is False.
63
+ # - max_cores: Maximum number of CPU cores to use for parallel processing. If None, all available cores will be used. Default is None.
64
+ # - max_pcs: Maximum number of principal components to use. Default is 200.
65
+ # - seed: Random seed for reproducibility. Default is 14.
66
+ # - output_dir: Directory to save the results. Default is 'cytotrace2_results'.
67
+
68
+ # In[5]:
69
+
70
+
71
+ results = ov.single.cytotrace2(adata,
72
+ use_model_dir="cymodels/5_models_weights",
73
+ species="mouse",
74
+ batch_size = 10000,
75
+ smooth_batch_size = 1000,
76
+ disable_parallelization = False,
77
+ max_cores = None,
78
+ max_pcs = 200,
79
+ seed = 14,
80
+ output_dir = 'cytotrace2_results'
81
+ )
82
+
83
+
84
+ # ## Visualizing
85
+ #
86
+ # By visualizing the results, we can directly compare the predicted potency scores with the known developmental stages of the cells and check how well the predictions align with the known biology.
87
+
88
+ # In[8]:
89
+
90
+
91
+ ov.utils.embedding(adata,basis='X_umap',
92
+ color=['clusters','CytoTRACE2_Score'],
93
+ frameon='small',cmap='Reds',wspace=0.55)
94
+
95
+
96
+ # - Left: demonstrates the distribution of different cell types in UMAP space.
97
+ # - Right: demonstrates the CytoTRACE 2 scores of different cell types; cells with high scores are generally considered to have higher developmental potential, i.e. to be in a less differentiated state.
98
+
99
+ # In[9]:
100
+
101
+
102
+ ov.utils.embedding(adata,basis='X_umap',
103
+ color=['CytoTRACE2_Potency','CytoTRACE2_Relative'],
104
+ frameon='small',cmap='Reds',wspace=0.55)
105
+
106
+
107
+ # - Potency category:
108
+ # The UMAP embedding plot of predicted potency category reflects the discrete classification of cells into potency categories, taking possible values of Differentiated, Unipotent, Oligopotent, Multipotent, Pluripotent, and Totipotent.
109
+ # - Relative order:
110
+ # UMAP embedding of predicted relative order, which is based on absolute predicted potency scores normalized to the range 0 (more differentiated) to 1 (less differentiated). Provides the relative ordering of cells by developmental potential
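+ # To make the 0-to-1 scaling of the relative order concrete, here is a minimal sketch that assumes a plain min-max normalization of the absolute potency scores. The `relative_order` helper and the toy scores are hypothetical; CytoTRACE 2 may compute `CytoTRACE2_Relative` differently in detail.

```python
def relative_order(scores):
    """Min-max scale scores to [0, 1]; 0 = more differentiated, 1 = less differentiated."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All cells share one score, so no ordering can be derived.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Toy absolute potency scores for three cells.
scaled = relative_order([0.2, 0.5, 0.8])
```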
ovrawm/t_deg.txt ADDED
@@ -0,0 +1,323 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Differential Expression Analysis
5
+ #
6
+ # An important task in bulk RNA-seq analysis is differential expression analysis, which we can perform with omicverse. For differential expression analysis, ov first converts the `gene_id` of the matrix to `gene_name`. When the dataset contains a batch effect, we can use the size factors of DESeq2 to normalize it, and use a `t-test` or `wilcoxon` test to calculate the p-value of each gene. Here we demonstrate this pipeline with a matrix from `featureCounts`. The same pipeline can generally be used to analyze any collection of RNA-seq samples.
7
+ #
8
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1q5lDfJepbtvNtc1TKz-h4wGUifTZ3i0_?usp=sharing
9
+
10
+ # In[1]:
11
+
12
+
13
+ import omicverse as ov
14
+ import scanpy as sc
15
+ import matplotlib.pyplot as plt
16
+
17
+ ov.plot_set()
18
+
19
+
20
+ # ## Geneset Download
21
+ #
22
+ # To convert gene ids, we need a mapping pair file. We have pre-processed 6 genome GTF files and generated mapping pairs for `T2T-CHM13`, `GRCh38`, `GRCh37`, `GRCm39`, `danRer7`, and `danRer11`. If you need a different id mapping, you can generate your own mapping pair file from a GTF file and place it in the `genesets` directory.
23
+
24
+ # In[2]:
25
+
26
+
27
+ ov.utils.download_geneid_annotation_pair()
28
+
29
+
30
+ # Note that this dataset has not been processed in any way and was only exported by `featureCounts`; sequence alignment was performed against the GRCm39 genome.
31
+ #
32
+ # Sample data can be downloaded from: https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt
33
+
34
+ # In[3]:
35
+
36
+
37
+ #data=pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
38
+ data=ov.read('data/counts.txt',index_col=0,header=1)
39
+ # strip the directory path and the `.bam` suffix from the column names
40
+ data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
41
+ data.head()
42
+
43
+
44
+ # ## ID mapping
45
+ #
46
+ # We perform the gene_id mapping with the `GRCm39` mapping pair file downloaded above.
47
+
48
+ # In[4]:
49
+
50
+
51
+ data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
52
+ data.head()
53
+
54
+
55
+ # ## Differential expression analysis with ov
56
+ #
57
+ # With ov, differential expression analysis only requires an expression matrix. To run the DEG analysis, we need to:
58
+ #
59
+ # - Read the raw counts produced by featureCounts or any other quantification method.
60
+ # - Create an ov DEseq object.
61
+
62
+ # In[5]:
63
+
64
+
65
+ dds=ov.bulk.pyDEG(data)
66
+
67
+
68
+ # Note that the gene_name mapping above produces some duplicate names. We process the duplicated indexes so that only the most highly expressed entry of each gene is retained.
69
+
70
+ # In[6]:
71
+
72
+
73
+ dds.drop_duplicates_index()
74
+ print('... drop_duplicates_index success')
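+ # The idea behind dropping duplicated gene names can be sketched without any dependencies. This sketch assumes that "highest expressed" means the row with the largest total count; the `drop_duplicate_genes` helper and the toy rows are hypothetical, and `drop_duplicates_index` may use a different criterion internally.

```python
def drop_duplicate_genes(rows):
    """Keep, for each gene name, the row with the largest total count.

    rows: list of (gene_name, counts) pairs, where counts is a list of
    per-sample counts (hypothetical input format).
    """
    best = {}
    for name, counts in rows:
        if name not in best or sum(counts) > sum(best[name]):
            best[name] = counts
    return best

rows = [("Actb", [10, 12]), ("Gapdh", [5, 5]), ("Actb", [100, 90])]
unique_rows = drop_duplicate_genes(rows)
```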
75
+
76
+
77
+ # We also need to remove the batch effect from the expression matrix. The `estimateSizeFactors` method of DESeq2 is used to normalize the matrix.
78
+
79
+ # In[7]:
80
+
81
+
82
+ dds.normalize()
83
+ print('... estimateSizeFactors and normalize success')
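+ # For readers unfamiliar with size factors, the median-of-ratios idea behind DESeq2's `estimateSizeFactors` can be sketched in plain Python. This is an illustrative re-implementation under simplifying assumptions (genes with a zero count in any sample are skipped), not the code that `dds.normalize()` runs; `size_factors` and the toy `counts` are hypothetical names.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Work in log space and use only genes detected in every sample.
    log_rows = [[math.log(c) for c in row] for row in counts if all(c > 0 for c in row)]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene to its geometric mean across samples.
        log_ratios = [row[j] - sum(row) / n_samples for row in log_rows]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy matrix: the second sample was sequenced twice as deeply as the first.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

+ # After dividing by the size factors, genes that differ only by sequencing depth end up with equal normalized counts in both samples.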
84
+
85
+
86
+ # Now we can calculate the differentially expressed genes from the matrix. We need to specify the treatment and control groups.
87
+
88
+ # In[8]:
89
+
90
+
91
+ treatment_groups=['4-3','4-4']
92
+ control_groups=['1--1','1--2']
93
+ result=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
94
+ result.head()
95
+
96
+
97
+ # One important thing is that we do not filter out low expression genes when processing DEGs, and in future versions I will consider building in the corresponding processing.
98
+
99
+ # In[9]:
100
+
101
+
102
+ print(result.shape)
103
+ result=result.loc[result['log2(BaseMean)']>1]
104
+ print(result.shape)
105
+
106
+
107
+ # We also need to set the fold-change threshold, for which we provide a method named `foldchange_set`. This function automatically calculates an appropriate threshold based on the log2FC distribution, but you can also enter it manually.
108
+
109
+ # In[10]:
110
+
111
+
112
+ # -1 means the threshold is calculated automatically
113
+ dds.foldchange_set(fc_threshold=-1,
114
+ pval_threshold=0.05,
115
+ logp_max=6)
116
+
117
+
118
+ # ## Visualize the DEG result and specific genes
119
+ #
120
+ # To visualize the DEG result, we use `plot_volcano`. This function can label either the genes you are interested in or the most strongly differentially expressed genes. The main parameters are:
121
+ #
122
+ # - title: The title of volcano
123
+ # - figsize: The size of figure
124
+ # - plot_genes: The genes you are interested in
125
+ # - plot_genes_num: If you have no specific genes of interest, the number of genes to label automatically.
126
+
127
+ # In[11]:
128
+
129
+
130
+ dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
131
+ plot_genes_num=8,plot_genes_fontsize=12,)
132
+
133
+
134
+ # To visualize specific genes, we use the `dds.plot_boxplot` function.
135
+
136
+ # In[12]:
137
+
138
+
139
+ dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups,
140
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
141
+ legend_bbox=(2,0.55))
142
+
143
+
144
+ # In[13]:
145
+
146
+
147
+ dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups,
148
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
149
+ legend_bbox=(2,0.55))
150
+
151
+
152
+ # ## Pathway enrichment analysis by ov
153
+ #
154
+ # Here we use the `gseapy` package, which includes both GSEA and enrichment analysis. We have optimised the output of the package and added some better-looking plotting functions.
155
+ #
156
+ # Similarly, we need to download the pathway gene sets first. We have prepared five gene sets, which you can download automatically with `ov.utils.download_pathway_database()`. You can also download any pathway library you are interested in from Enrichr: https://maayanlab.cloud/Enrichr/#libraries
157
+
158
+ # In[37]:
159
+
160
+
161
+ ov.utils.download_pathway_database()
162
+
163
+
164
+ # In[14]:
165
+
166
+
167
+ pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse')
168
+
169
+
170
+ # Note that we set `pvalue_type` to `auto`. This is because when the gene sets being enriched are too small, using the adjusted p-value may not give a usable result. You can also set `adjust` or `raw` to choose which p-value is used to call significant gene sets.
171
+ #
172
+ # If you do not have internet access, please set `background` to all genes expressed in the RNA-seq data, like:
173
+ #
174
+ # ```python
175
+ # enr=ov.bulk.geneset_enrichment(gene_list=deg_genes,
176
+ # pathways_dict=pathway_dict,
177
+ # pvalue_type='auto',
178
+ # background=dds.result.index.tolist(),
179
+ # organism='mouse')
180
+ # ```
181
+
182
+ # In[15]:
183
+
184
+
185
+ deg_genes=dds.result.loc[dds.result['sig']!='normal'].index.tolist()
186
+ enr=ov.bulk.geneset_enrichment(gene_list=deg_genes,
187
+ pathways_dict=pathway_dict,
188
+ pvalue_type='auto',
189
+ organism='mouse')
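+ # Over-representation analysis of this kind is commonly based on a hypergeometric (one-sided Fisher) test, which also shows why the choice of `background` matters. The sketch below illustrates that test with the standard library only. Whether `ov.bulk.geneset_enrichment` computes exactly this statistic is an assumption, and `hypergeom_pvalue` and its toy inputs are hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_background, n_pathway, n_deg, n_overlap):
    """P(X >= n_overlap) when n_deg genes are drawn without replacement
    from a background that contains n_pathway pathway genes."""
    total = comb(n_background, n_deg)
    upper = min(n_pathway, n_deg)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_deg - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
p = hypergeom_pvalue(n_background=10, n_pathway=4, n_deg=3, n_overlap=2)
```

+ # A larger background with the same overlap gives a smaller p-value, so the background should reflect the genes that could actually have been detected.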
190
+
191
+
192
+ # To visualize the enrichment result, we use `geneset_plot`.
193
+
194
+ # In[21]:
195
+
196
+
197
+ ov.bulk.geneset_plot(enr,figsize=(2,5),fig_title='Wiki Pathway enrichment',
198
+ cax_loc=[2, 0.45, 0.5, 0.02],
199
+ bbox_to_anchor_used=(-0.25, -13),node_diameter=10,
200
+ custom_ticks=[5,7],text_knock=3,
201
+ cmap='Reds')
202
+
203
+
204
+ # ## Multi pathway enrichment
205
+ #
206
+ # In addition to pathway enrichment for a single database, OmicVerse supports enriching and visualizing multiple pathway databases at the same time. This is implemented using [`pyComplexHeatmap`](https://dingwb.github.io/PyComplexHeatmap/build/html/notebooks/gene_enrichment_analysis.html), and citation is welcome!
207
+
208
+ # In[22]:
209
+
210
+
211
+ pathway_dict=ov.utils.geneset_prepare('genesets/GO_Biological_Process_2023.txt',organism='Mouse')
212
+ enr_go_bp=ov.bulk.geneset_enrichment(gene_list=deg_genes,
213
+ pathways_dict=pathway_dict,
214
+ pvalue_type='auto',
215
+ organism='mouse')
216
+ pathway_dict=ov.utils.geneset_prepare('genesets/GO_Molecular_Function_2023.txt',organism='Mouse')
217
+ enr_go_mf=ov.bulk.geneset_enrichment(gene_list=deg_genes,
218
+ pathways_dict=pathway_dict,
219
+ pvalue_type='auto',
220
+ organism='mouse')
221
+ pathway_dict=ov.utils.geneset_prepare('genesets/GO_Cellular_Component_2023.txt',organism='Mouse')
222
+ enr_go_cc=ov.bulk.geneset_enrichment(gene_list=deg_genes,
223
+ pathways_dict=pathway_dict,
224
+ pvalue_type='auto',
225
+ organism='mouse')
226
+
227
+
228
+ # In[167]:
229
+
230
+
231
+ enr_dict={'BP':enr_go_bp,
232
+ 'MF':enr_go_mf,
233
+ 'CC':enr_go_cc}
234
+ colors_dict={
235
+ 'BP':ov.pl.red_color[1],
236
+ 'MF':ov.pl.green_color[1],
237
+ 'CC':ov.pl.blue_color[1],
238
+ }
239
+
240
+ ov.bulk.geneset_plot_multi(enr_dict,colors_dict,num=3,
241
+ figsize=(2,5),
242
+ text_knock=3,fontsize=8,
243
+ cmap='Reds'
244
+ )
245
+
246
+
247
+ # In[166]:
248
+
249
+
250
+ import numpy as np
+ import pandas as pd
+ def geneset_plot_multi(enr_dict,colors_dict,num:int=5,fontsize=10,
251
+ fig_title:str='',fig_xlabel:str='Fractions of genes',
252
+ figsize:tuple=(2,4),cmap:str='YlGnBu',
253
+ text_knock:int=5,text_maxsize:int=20,ax=None,
254
+ ):
255
+ from PyComplexHeatmap import HeatmapAnnotation,DotClustermapPlotter,anno_label,anno_simple,AnnotationBase
256
+ for key in enr_dict.keys():
257
+ enr_dict[key]['Type']=key
258
+ enr_all=pd.concat([enr_dict[i].iloc[:num] for i in enr_dict.keys()],axis=0)
259
+ enr_all['Term']=[ov.utils.plot_text_set(i.split('(')[0],text_knock=text_knock,text_maxsize=text_maxsize) for i in enr_all.Term.tolist()]
260
+ enr_all.index=enr_all.Term
261
+ enr_all['Term1']=[i for i in enr_all.index.tolist()]
262
+ del enr_all['Term']
263
+
264
+ colors=colors_dict
265
+
266
+ left_ha = HeatmapAnnotation(
267
+ label=anno_label(enr_all.Type, merge=True,rotation=0,colors=colors,relpos=(1,0.8)),
268
+ Category=anno_simple(enr_all.Type,cmap='Set1',
269
+ add_text=False,legend=False,colors=colors),
270
+ axis=0,verbose=0,label_kws={'rotation':45,'horizontalalignment':'left','visible':False})
271
+ right_ha = HeatmapAnnotation(
272
+ label=anno_label(enr_all.Term1, merge=True,rotation=0,relpos=(0,0.5),arrowprops=dict(visible=True),
273
+ colors=enr_all.assign(color=enr_all.Type.map(colors)).set_index('Term1').color.to_dict(),
274
+ fontsize=fontsize,luminance=0.8,height=2),
275
+ axis=0,verbose=0,#label_kws={'rotation':45,'horizontalalignment':'left'},
276
+ orientation='right')
277
+ if ax is None:
278
+ fig, ax = plt.subplots(figsize=figsize)
279
+ else:
280
+ ax=ax
281
+ #plt.figure(figsize=figsize)
282
+ cm = DotClustermapPlotter(data=enr_all, x='fraction',y='Term1',value='logp',c='logp',s='num',
283
+ cmap=cmap,
284
+ row_cluster=True,#col_cluster=True,#hue='Group',
285
+ #cmap={'Group1':'Greens','Group2':'OrRd'},
286
+ vmin=-1*np.log10(0.1),vmax=-1*np.log10(1e-10),
287
+ #colors={'Group1':'yellowgreen','Group2':'orange'},
288
+ #marker={'Group1':'*','Group2':'$\\ast$'},
289
+ show_rownames=True,show_colnames=False,row_dendrogram=False,
290
+ col_names_side='top',row_names_side='right',
291
+ xticklabels_kws={'labelrotation': 30, 'labelcolor': 'blue','labelsize':fontsize},
292
+ #yticklabels_kws={'labelsize':10},
293
+ #top_annotation=col_ha,left_annotation=left_ha,right_annotation=right_ha,
294
+ left_annotation=left_ha,right_annotation=right_ha,
295
+ spines=False,
296
+ row_split=enr_all.Type,# row_split_gap=1,
297
+ #col_split=df_col.Group,col_split_gap=0.5,
298
+ verbose=1,legend_gap=10,
299
+ #dot_legend_marker='*',
300
+
301
+ xlabel='Fractions of genes',xlabel_side="bottom",
302
+ xlabel_kws=dict(labelpad=8,fontweight='normal',fontsize=fontsize+2),
303
+ # xlabel_bbox_kws=dict(facecolor=facecolor)
304
+ )
305
+ tesr=plt.gcf().axes
306
+ for ax in plt.gcf().axes:
307
+ if hasattr(ax, 'get_xlabel'):
308
+ if ax.get_xlabel() == 'Fractions of genes': # assume this axis carries this specific label
309
+ cbar = ax
310
+ cbar.grid(False)
311
+ if ax.get_ylabel() == 'logp': # assume the colorbar carries this specific label
312
+ cbar = ax
313
+ cbar.tick_params(labelsize=fontsize+2)
314
+ cbar.set_ylabel(r'$−Log_{10}(P_{adjusted})$',fontsize=fontsize+2)
315
+ cbar.grid(False)
316
+ return ax
317
+
318
+
319
+ # In[ ]:
320
+
321
+
322
+
323
+
ovrawm/t_deseq2.txt ADDED
@@ -0,0 +1,237 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Differential Expression Analysis with DESeq2
5
+ #
6
+ # An important task in bulk RNA-seq analysis is differential expression analysis, which we can perform with omicverse. For differential expression analysis, ov first converts the `gene_id` of the matrix to `gene_name`.
7
+ #
8
+ # We can now use `PyDESeq2` to perform a DESeq2 analysis in Python, as we would in R.
9
+ #
10
+ # Paper: [PyDESeq2: a python package for bulk RNA-seq differential expression analysis](https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1)
11
+ #
12
+ # Code: https://github.com/owkin/PyDESeq2
13
+ #
14
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1fZS-v0zdIYkXrEoIAM1X5kPoZVfVvY5h?usp=sharing
15
+
16
+ # In[1]:
17
+
18
+
19
+ import omicverse as ov
20
+ ov.utils.ov_plot_set()
21
+
22
+
23
+ # Note that this dataset has not been processed in any way and was only exported by `featureCounts`; sequence alignment was performed against the GRCm39 genome.
24
+
25
+ # In[2]:
26
+
27
+
28
+ data=ov.utils.read('https://raw.githubusercontent.com/Starlitnightly/Pyomic/master/sample/counts.txt',index_col=0,header=1)
29
+ # strip the directory path and the `.bam` suffix from the column names
30
+ data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
31
+ data.head()
32
+
33
+
34
+ # ## ID mapping
35
+ #
36
+ # We perform the gene_id mapping with the `GRCm39` mapping pair file, which we download first.
37
+
38
+ # In[ ]:
39
+
40
+
41
+ ov.utils.download_geneid_annotation_pair()
42
+
43
+
44
+ # In[3]:
45
+
46
+
47
+ data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
48
+ data.head()
49
+
50
+
51
+ # ## Differential expression analysis with ov
52
+ #
53
+ # With ov, differential expression analysis only requires an expression matrix. To run the DEG analysis, we need to:
54
+ #
55
+ # - Read the raw counts produced by featureCounts or any other quantification method.
56
+ # - Create an ov DEseq object.
57
+
58
+ # In[4]:
59
+
60
+
61
+ dds=ov.bulk.pyDEG(data)
62
+
63
+
64
+ # Note that the gene_name mapping above produces some duplicate names. We process the duplicated indexes so that only the most highly expressed entry of each gene is retained.
65
+
66
+ # In[5]:
67
+
68
+
69
+ dds.drop_duplicates_index()
70
+ print('... drop_duplicates_index success')
71
+
72
+
73
+ # Now we can calculate the differentially expressed genes from the matrix. We need to specify the treatment and control groups.
74
+
75
+ # In[6]:
76
+
77
+
78
+ treatment_groups=['4-3','4-4']
79
+ control_groups=['1--1','1--2']
80
+ result=dds.deg_analysis(treatment_groups,control_groups,method='DEseq2')
81
+
82
+
83
+ # One important thing is that we do not filter out low expression genes when processing DEGs, and in future versions I will consider building in the corresponding processing.
84
+
85
+ # In[7]:
86
+
87
+
88
+ print(result.shape)
89
+ result=result.loc[result['log2(BaseMean)']>1]
90
+ print(result.shape)
91
+
92
+
93
+ # We also need to set the fold-change threshold, for which we provide a method named `foldchange_set`. This function automatically calculates an appropriate threshold based on the log2FC distribution, but you can also enter it manually.
94
+
95
+ # In[8]:
96
+
97
+
98
+ # -1 means the threshold is calculated automatically
99
+ dds.foldchange_set(fc_threshold=-1,
100
+ pval_threshold=0.05,
101
+ logp_max=10)
102
+
103
+
104
+ # ## Visualize the DEG result and specific genes
105
+ #
106
+ # To visualize the DEG result, we use `plot_volcano`. This function can label either the genes you are interested in or the most strongly differentially expressed genes. The main parameters are:
107
+ #
108
+ # - title: The title of volcano
109
+ # - figsize: The size of figure
110
+ # - plot_genes: The genes you are interested in
111
+ # - plot_genes_num: If you have no specific genes of interest, the number of genes to label automatically.
112
+
113
+ # In[9]:
114
+
115
+
116
+ dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
117
+ plot_genes_num=8,plot_genes_fontsize=12,)
118
+
119
+
120
+ # To visualize specific genes, we use the `dds.plot_boxplot` function.
121
+
122
+ # In[10]:
123
+
124
+
125
+ dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups,
126
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
127
+ legend_bbox=(2,0.55))
128
+
129
+
130
+ # In[11]:
131
+
132
+
133
+ dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups,
134
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
135
+ legend_bbox=(2,0.55))
136
+
137
+
138
+ # ## Pathway enrichment analysis by ov
139
+ #
140
+ # Here we use the `gseapy` package, which includes both GSEA and enrichment analysis. We have optimised the output of the package and added some better-looking plotting functions.
141
+ #
142
+ # Similarly, we need to download the pathway gene sets first. We have prepared five gene sets, which you can download automatically with `ov.utils.download_pathway_database()`. You can also download any pathway library you are interested in from Enrichr: https://maayanlab.cloud/Enrichr/#libraries
143
+
144
+ # In[13]:
145
+
146
+
147
+ ov.utils.download_pathway_database()
148
+
149
+
150
+ # In[14]:
151
+
152
+
153
+ pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse')
154
+
155
+
156
+ # To perform the GSEA analysis, we first need to rank the genes. `dds.ranking2gsea` returns a ranked gene matrix sorted by the signed -log10(padj) metric below.
157
+ #
158
+ # $Metric=\frac{-log_{10}(padj)}{sign(log2FC)}$
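+ # The ranking metric above can be written out as a small function. Dividing by the sign of log2FC is the same as multiplying by it, so the metric is simply -log10(padj) carrying the sign of the fold change. The sketch below is illustrative and not the body of `dds.ranking2gsea`: the `floor` guard against `padj == 0`, the treatment of `log2FC == 0`, and the toy gene table are assumptions.

```python
import math

def gsea_metric(padj, log2fc, floor=1e-300):
    """Signed significance: -log10(padj) with the sign of log2FC."""
    if log2fc == 0:
        # sign(0) is zero, so the metric is undefined; treat it as neutral here.
        return 0.0
    strength = -math.log10(max(padj, floor))  # floor avoids log10(0)
    return math.copysign(strength, log2fc)

# Toy (padj, log2FC) pairs for three hypothetical genes.
genes = {"GeneA": (1e-4, 2.5), "GeneB": (0.5, -0.3), "GeneC": (1e-6, -3.0)}
ranking = sorted(genes, key=lambda g: gsea_metric(*genes[g]), reverse=True)
```

+ # Strongly up-regulated genes end up at the top of the ranking and strongly down-regulated genes at the bottom, which is the ordering a pre-ranked GSEA expects.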
159
+
160
+ # In[15]:
161
+
162
+
163
+ rnk=dds.ranking2gsea()
164
+
165
+
166
+ # We use `ov.bulk.pyGSEA` to construct a GSEA object and perform the enrichment.
167
+
168
+ # In[16]:
169
+
170
+
171
+ gsea_obj=ov.bulk.pyGSEA(rnk,pathway_dict)
172
+
173
+
174
+ # In[17]:
175
+
176
+
177
+ enrich_res=gsea_obj.enrichment()
178
+
179
+
180
+ # The results are stored in the `enrich_res` attribute.
181
+
182
+ # In[18]:
183
+
184
+
185
+ gsea_obj.enrich_res.head()
186
+
187
+
188
+ # To visualize the enrichment result, we use `plot_enrichment`. Its parameters are:
189
+ # - num: The number of enriched terms to plot. Default is 10.
190
+ # - node_size: A list of integers defining the size of nodes in the plot. Default is [5,10,15].
191
+ # - cax_loc: The location of the colorbar on the plot. Default is 2.
192
+ # - cax_fontsize: The fontsize of the colorbar label. Default is 12.
193
+ # - fig_title: The title of the plot. Default is an empty string.
194
+ # - fig_xlabel: The label of the x-axis. Default is 'Fractions of genes'.
195
+ # - figsize: The size of the plot. Default is (2,4).
196
+ # - cmap: The colormap to use for the plot. Default is 'YlGnBu'.
197
+
198
+ # In[19]:
199
+
200
+
201
+ gsea_obj.plot_enrichment(num=10,node_size=[10,20,30],
202
+ cax_fontsize=12,
203
+ fig_title='Wiki Pathway Enrichment',fig_xlabel='Fractions of genes',
204
+ figsize=(2,4),cmap='YlGnBu',
205
+ text_knock=2,text_maxsize=30,
206
+ cax_loc=[2.5, 0.45, 0.5, 0.02],
207
+ bbox_to_anchor_used=(-0.25, -13),node_diameter=10,)
208
+
209
+
210
+ # Beyond the basic analysis, pyGSEA also helps us visualize a term together with its ranked gene list and enrichment score.
211
+ #
212
+ # We select the term to plot by its position in `gsea_obj.enrich_res.index`: here `0` is `Complement and Coagulation Cascades WP449` and `1` is `Matrix Metalloproteinases WP441`.
213
+
214
+ # In[20]:
215
+
216
+
217
+ gsea_obj.enrich_res.index[:5]
218
+
219
+
220
+ # We can set `gene_set_title` to change the title of the GSEA plot.
221
+
222
+ # In[22]:
223
+
224
+
225
+ fig=gsea_obj.plot_gsea(term_num=1,
226
+ gene_set_title='Matrix Metalloproteinases',
227
+ figsize=(3,4),
228
+ cmap='RdBu_r',
229
+ title_fontsize=14,
230
+ title_y=0.95)
231
+
232
+
233
+ # In[ ]:
234
+
235
+
236
+
237
+
ovrawm/t_gptanno.txt ADDED
@@ -0,0 +1,330 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Automatic cell type annotation with GPT/Other
5
+ #
6
+ # GPTCelltype is an open-source R software package that facilitates cell type annotation with GPT-4.
7
+ #
8
+ # We made two improvements in integrating the `GPTCelltype` algorithm in OmicVerse:
9
+ #
10
+ # - **Native support for Python**: Since GPTCelltype is an R package, we have rewritten the whole function in Python so that it fits scverse's anndata ecosystem.
11
+ # - **More model support**: We provide more large language models to choose from beyond OpenAI, e.g. Qwen (通义千问) and Kimi, and further models can be used through the `base_url` or `model_name` parameters.
12
+ #
13
+ # If you found this tutorial helpful, please cite GPTCelltype and OmicVerse:
14
+ #
15
+ # Hou, W. and Ji, Z., 2023. Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis. [Nature Methods, 2024 March 25](https://link.springer.com/article/10.1038/s41592-024-02235-4?utm_source=rct_congratemailt&utm_medium=email&utm_campaign=oa_20240325&utm_content=10.1038/s41592-024-02235-4).
16
+
17
+ # In[3]:
18
+
19
+
20
+ import omicverse as ov
21
+ print(f'omicverse version:{ov.__version__}')
22
+ import scanpy as sc
23
+ print(f'scanpy version:{sc.__version__}')
24
+ ov.ov_plot_set()
25
+
26
+
27
+ # ## Loading data
28
+ #
29
+ # The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
30
+ #
31
+
32
+ # In[21]:
33
+
34
+
35
+ # !mkdir data
36
+ # !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
37
+ # !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
38
+ # !mkdir write
39
+
40
+
41
+ # Read in the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5-based file format: `.h5ad`.
42
+
43
+ # In[3]:
44
+
45
+
46
+ adata = sc.read_10x_mtx(
47
+ 'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
48
+ var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
49
+ cache=True) # write a cache file for faster subsequent reading
50
+
51
+
52
+ # ## Data preprocessing
53
+ #
54
+ # `ov.single.scanpy_lazy` preprocesses raw scRNA-seq data in a single call: it filters doublets, normalizes counts per cell, applies log1p, extracts highly variable genes, and clusters the cells. In the cell below that call is commented out and the steps are run one by one instead.
55
+ #
56
+ # For a detailed explanation of each preprocessing step, please refer to our [preprocess chapter](https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_preprocess/).
57
+ #
58
+ # The raw counts are stored in the `count` layer, and the full gene set (before HVG filtering) can be recovered with `adata.raw.to_adata()`.
59
+
60
+ # In[ ]:
61
+
62
+
63
+ #adata=ov.single.scanpy_lazy(adata)
64
+
65
+ #quality control
66
+ adata=ov.pp.qc(adata,
67
+ tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
68
+ #normalize and calculate highly variable genes (HVGs)
69
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
70
+
71
+ #save the whole genes and filter the non-HVGs
72
+ adata.raw = adata
73
+ adata = adata[:, adata.var.highly_variable_features]
74
+
75
+ #scale the adata.X
76
+ ov.pp.scale(adata)
77
+
78
+ #Dimensionality Reduction
79
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
80
+
81
+ #Neighbourhood graph construction
82
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
83
+ use_rep='scaled|original|X_pca')
84
+
85
+ #clusters
86
+ sc.tl.leiden(adata)
87
+
88
+ #find marker
89
+ sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
90
+ sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
91
+ method='wilcoxon',use_raw=False,)
92
+
93
+ #Dimensionality Reduction for visualization(X_mde=X_umap+GPU)
94
+ adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
95
+ adata
96
+
97
+
98
+ # In[5]:
99
+
100
+
101
+ ov.pl.embedding(adata,
102
+ basis='X_mde',
103
+ color=['leiden'],
104
+ legend_loc='on data',
105
+ frameon='small',
106
+ legend_fontoutline=2,
107
+ palette=ov.utils.palette()[14:],
108
+ )
109
+
110
+
111
+ # ## GPT Celltype
112
+ #
113
+ # `gptcelltype` takes a dictionary as input. We provide `omicverse.single.get_celltype_marker` to collect the marker genes of each cell type into such a dictionary.
114
+
115
+ # #### Using genes manually
116
+ #
117
+ # We can manually define a marker dictionary to check the accuracy of the output.
118
+
119
+ # In[25]:
120
+
121
+
122
+ import os
123
+ all_markers={'cluster1':['CD3D','CD3E'],
124
+ 'cluster2':['MS4A1']}
125
+
126
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
127
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
128
+ model='qwen-plus', provider='qwen',
129
+ topgenenumber=5)
130
+ result
131
+
132
+
133
+ # #### Get Genes for Each Cluster Automatically
134
+ #
135
+
136
+ # In[14]:
137
+
138
+
139
+ all_markers=ov.single.get_celltype_marker(adata,clustertype='leiden',rank=True,
140
+ key='rank_genes_groups',
141
+ foldchange=2,topgenenumber=5)
142
+ all_markers
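+ # As a rough illustration of what `foldchange=2` and `topgenenumber=5` could mean, the sketch below filters a ranked gene table by a minimum log fold change and keeps the first few genes per cluster. The table layout, the helper name `top_markers`, and the filtering rule are assumptions made for illustration; they are not taken from the omicverse source.

```python
def top_markers(ranked, min_logfc=2.0, top_n=5):
    """Keep up to `top_n` genes per cluster with log fold change >= `min_logfc`.

    `ranked` maps a cluster name to a list of (gene, logfoldchange) tuples
    that are already sorted from most to least significant.
    """
    markers = {}
    for cluster, genes in ranked.items():
        kept = [gene for gene, logfc in genes if logfc >= min_logfc]
        markers[cluster] = kept[:top_n]
    return markers


# Toy input with made-up values (hypothetical, not real results)
toy_ranked = {
    '0': [('CD3D', 3.1), ('CD3E', 2.8), ('ACTB', 0.2)],
    '1': [('MS4A1', 4.0), ('CD79A', 3.5)],
}
toy_markers = top_markers(toy_ranked, min_logfc=2.0, top_n=2)
print(toy_markers)
```

+ # The returned dictionary has the same shape as the manually defined `all_markers` above: cluster name to list of gene symbols.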
143
+
144
+
145
+ # ### Option 1. Through OpenAI API (`provider` or `base_url`)
146
+ #
147
+ # Use `ov.single.gptcelltype` function to annotate cell types.
148
+ #
149
+ # Set the `provider` (or `base_url`) and `model` parameters to tell the function which base URL and which exact model to call.
150
+
151
+ # In[17]:
152
+
153
+
154
+ import os
155
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
156
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
157
+ model='qwen-plus', provider='qwen',
158
+ topgenenumber=5)
159
+ result
160
+
161
+
162
+ # We can keep only the cell type of the output and remove other irrelevant information.
163
+
164
+ # In[18]:
165
+
166
+
167
+ new_result={}
168
+ for key in result.keys():
169
+ new_result[key]=result[key].split(': ')[-1].split(' (')[0].split('. ')[1]
170
+ new_result
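+ # The one-line parser above raises an `IndexError` whenever a reply contains no `'. '` separator. A more tolerant variant might look like the sketch below. The helper name `clean_label` and the two reply formats in `toy_result` are assumptions for illustration; actual model replies vary, so inspect `result` before relying on any fixed rule.

```python
import re


def clean_label(text):
    """Reduce one raw model reply to a bare cell type name."""
    label = text.split(': ')[-1]                 # drop any leading "prefix: " part
    label = label.split(' (')[0]                 # drop a trailing parenthetical
    label = re.sub(r'^\s*\d+\.\s*', '', label)   # drop list numbering such as "1. "
    return label.strip()


# Hypothetical replies in two different shapes
toy_result = {
    '0': '1. T cells (CD3D, CD3E)',
    '1': 'Cell type: B cells',
}
toy_clean = {key: clean_label(value) for key, value in toy_result.items()}
print(toy_clean)
```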
171
+
172
+
173
+ # In[19]:
174
+
175
+
176
+ adata.obs['gpt_celltype'] = adata.obs['leiden'].map(new_result).astype('category')
177
+
178
+
179
+ # In[20]:
180
+
181
+
182
+ ov.pl.embedding(adata,
183
+ basis='X_mde',
184
+ color=['leiden','gpt_celltype'],
185
+ legend_loc='on data',
186
+ frameon='small',
187
+ legend_fontoutline=2,
188
+ palette=ov.utils.palette()[14:],
189
+ )
190
+
191
+
192
+ # ## More models
193
+ #
194
+ # Our implementation of `gptcelltype` in `omicverse` supports almost all large language models that expose an OpenAI-compatible API.
195
+
196
+ # In[27]:
197
+
198
+
199
+ all_markers={'cluster1':['CD3D','CD3E'],
200
+ 'cluster2':['MS4A1']}
201
+
202
+
203
+ # ### OpenAI
204
+ #
205
+ # The OpenAI API uses API keys for authentication. You can create API keys at a user or service account level. Service accounts are tied to a "bot" individual and should be used to provision access for production systems. Each API key can be scoped to one of the following:
206
+ #
207
+ # - [User keys](https://platform.openai.com/account/api-keys) - OpenAI's legacy keys. They provide access to all organizations and all projects that the user has been added to; open the API Keys page to view your available keys. OpenAI highly advises transitioning to project keys for best security practices, although access via this method is currently still supported.
208
+ #
209
+ # - Please select the model you need to use: [list of supported models](https://platform.openai.com/docs/models).
210
+ #
211
+
212
+ # In[28]:
213
+
214
+
215
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
216
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
217
+ model='gpt-4o', provider='openai',
218
+ topgenenumber=5)
219
+ result
220
+
221
+
222
+ # ### Qwen(通义千问)
223
+ #
224
+ # - Activate the DashScope service and obtain an API key: [Activate DashScope and create an API key](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key).
225
+ #
226
+ # - We recommend configuring the API key through an environment variable to reduce the risk of leaking it; see "Configuring API-KEY through Environment Variable" for the configuration method. You can also set the API key in code, but that increases the risk of leakage.
227
+ #
228
+ # - Please select the model you need to use: [list of supported models](https://help.aliyun.com/zh/dashscope/developer-reference/compatibility-of-openai-with-dashscope/?spm=a2c4g.11186623.0.i6#eadfc13038jd5).
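+ # Following the recommendation to keep the key out of the code, the sketch below reads it from the environment and fails early if it is missing. The variable name `AGI_API_KEY` is the one used in this tutorial; the helper `require_api_key` and the placeholder value are hypothetical and only there to make the example runnable.

```python
import os


def require_api_key(var_name='AGI_API_KEY'):
    """Return the API key stored in `var_name`, or raise a clear error if it is unset."""
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError(
            f'{var_name} is not set; export it in your shell before starting Python.'
        )
    return key


# Demonstration only: a placeholder stands in for a real key if none is set.
os.environ.setdefault('AGI_API_KEY', 'sk-placeholder')
api_key = require_api_key()
print('key loaded:', bool(api_key))
```

+ # In real use, drop the `setdefault` line and export the variable in the shell instead (for example `export AGI_API_KEY=...`), so the key never appears in a notebook or script.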
229
+ #
238
+
239
+ # In[26]:
240
+
241
+
242
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
243
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
244
+ model='qwen-plus', provider='qwen',
245
+ topgenenumber=5)
246
+ result
247
+
248
+
249
+ # ### Kimi(月之暗面)
250
+ #
251
+ # - You will need a Moonshot AI (月之暗面) API key to use the Kimi service. You can create an API key in the [Console](https://platform.moonshot.cn/console).
252
+ #
253
+ # - Please select the model you need to use: [List of supported models](https://platform.moonshot.cn/docs/pricing#%E4%BA%A7%E5%93%81%E5%AE%9A%E4%BB%B7)
254
+ #
260
+
261
+ # In[28]:
262
+
263
+
264
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
265
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
266
+ model='moonshot-v1-8k', provider='kimi',
267
+ topgenenumber=5)
268
+ result
269
+
270
+
271
+ # #### Other Models
272
+ #
273
+ # You can set the `base_url` parameter manually to use other models; note that the model must support OpenAI's parameters. When `base_url` is specified, the `provider` parameter is ignored. The three built-in providers map to the following base URLs:
274
+ #
275
+ # ```python
276
+ # if provider == 'openai':
277
+ # base_url = "https://api.openai.com/v1/"
278
+ # elif provider == 'kimi':
279
+ # base_url = "https://api.moonshot.cn/v1"
280
+ # elif provider == 'qwen':
281
+ # base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"
282
+ # ```
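+ # The same lookup can be written as a table, which also makes the precedence of `base_url` over `provider` explicit. This is a sketch of the behaviour described above, not the omicverse implementation; `PROVIDER_URLS` and `resolve_base_url` are hypothetical names.

```python
PROVIDER_URLS = {
    'openai': 'https://api.openai.com/v1/',
    'kimi': 'https://api.moonshot.cn/v1',
    'qwen': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
}


def resolve_base_url(provider='openai', base_url=None):
    """An explicit `base_url` wins; otherwise look the provider up in the table."""
    if base_url is not None:
        return base_url
    try:
        return PROVIDER_URLS[provider]
    except KeyError:
        raise ValueError(f'unknown provider {provider!r}; pass base_url instead') from None


print(resolve_base_url('qwen'))
print(resolve_base_url('qwen', base_url='https://api.moonshot.cn/v1'))
```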
283
+
284
+ # In[30]:
285
+
286
+
287
+ os.environ['AGI_API_KEY'] = 'sk-**' # Replace with your actual API key
288
+ result = ov.single.gptcelltype(all_markers, tissuename='PBMC', speciename='human',
289
+ model='moonshot-v1-8k', base_url="https://api.moonshot.cn/v1",
290
+ topgenenumber=5)
291
+ result
292
+
293
+
294
+ # ### Option 2. Through Local LLM (`model_name`)
295
+
296
+ # Use `ov.single.gptcelltype_local` function to annotate cell types.
297
+ #
298
+ # You can simply set the `model_name` parameter.
299
+
300
+ # In[5]:
301
+
302
+
303
+ anno_model = 'path/to/your/local/LLM' # '~/models/Qwen2-7B-Instruct'
304
+
305
+ result = ov.single.gptcelltype_local(all_markers, tissuename='PBMC', speciename='human',
306
+ model_name=anno_model, topgenenumber=5)
307
+ result
308
+
309
+
310
+ # Note that you may encounter network problems that prevent you from downloading LLMs.
311
+ #
312
+ # In this case, please refer to https://zhuanlan.zhihu.com/p/663712983
313
+
ovrawm/t_mapping.txt ADDED
@@ -0,0 +1,191 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Mapping single-cell profile onto spatial profile
5
+ #
6
+ # Tangram is a method for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. Tangram takes as input a single-cell dataset and a spatial dataset, collected from the same anatomical region/tissue type. Via integration, Tangram creates new spatial data by aligning the scRNAseq profiles in space. This makes it possible to project every annotation in the scRNAseq data (e.g. cell types, program usage) onto space.
7
+ #
8
+ # The most common application of Tangram is to resolve cell types in space. Another usage is to correct gene expression from spatial data: as scRNA-seq data are less prone to dropout than (e.g.) Visium or Slide-seq, the “new” spatial data generated by Tangram resolve many more genes. As a result, we can visualize program usage in space, which can be used for ligand-receptor pair discovery or, more generally, cell-cell communication mechanisms. If cell segmentation is available, Tangram can also be used for deconvolution of spatial data. If your single-cell data are multimodal, Tangram can be used to spatially resolve other modalities, such as chromatin accessibility.
9
+ #
10
+ # Biancalani, T., Scalia, G., Buffoni, L. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods 18, 1352–1362 (2021). https://doi.org/10.1038/s41592-021-01264-7
11
+ #
12
+ # ![img](https://tangram-sc.readthedocs.io/en/latest/_images/tangram_overview.png)
13
+
14
+ # In[1]:
15
+
16
+
17
+ import omicverse as ov
18
+ #print(f"omicverse version: {ov.__version__}")
19
+ import scanpy as sc
20
+ #print(f"scanpy version: {sc.__version__}")
21
+ ov.utils.ov_plot_set()
22
+
23
+
24
+ # ## Preparing scRNA-seq
25
+ #
26
+ # Published scRNA-seq datasets of lymph nodes have typically lacked an adequate representation of germinal centre-associated immune cell populations due to age of patient donors. We, therefore, include scRNA-seq datasets spanning lymph nodes, spleen and tonsils in our single-cell reference to ensure that we captured the full diversity of immune cell states likely to exist in the spatial transcriptomic dataset.
27
+ #
28
+ # Here we download this dataset, import into anndata and change variable names to ENSEMBL gene identifiers.
29
+ #
30
+ # Link: https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
31
+
32
+ # In[2]:
33
+
34
+
35
+ adata_sc=ov.read('data/sc.h5ad')
36
+ import matplotlib.pyplot as plt
37
+ fig, ax = plt.subplots(figsize=(3,3))
38
+ ov.utils.embedding(
39
+ adata_sc,
40
+ basis="X_umap",
41
+ color=['Subset'],
42
+ title='Subset',
43
+ frameon='small',
44
+ #ncols=1,
45
+ wspace=0.65,
46
+ #palette=ov.utils.pyomic_palette()[11:],
47
+ show=False,
48
+ ax=ax
49
+ )
50
+
51
+
52
+ # For data quality control and preprocessing, we can use omicverse's own preprocessing functions.
53
+
54
+ # In[3]:
55
+
56
+
57
+ print("RAW",adata_sc.X.max())
58
+ adata_sc=ov.pp.preprocess(adata_sc,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
59
+ adata_sc.raw = adata_sc
60
+ adata_sc = adata_sc[:, adata_sc.var.highly_variable_features]
61
+ print("Normalize",adata_sc.X.max())
62
+
63
+
64
+ # ## Preparing stRNA-seq
65
+ #
66
+ # First let’s read spatial Visium data from 10X Space Ranger output. Here we use lymph node data generated by 10X and presented in [Kleshchevnikov et al (section 4, Fig 4)](https://www.biorxiv.org/content/10.1101/2020.11.15.378125v1). This dataset can be conveniently downloaded and imported using scanpy. See [this tutorial](https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_short_demo.html) for a more extensive and practical example of data loading (multiple visium samples).
67
+
68
+ # In[5]:
69
+
70
+
71
+ adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
72
+ adata.obs['sample'] = list(adata.uns['spatial'].keys())[0]
73
+ adata.var_names_make_unique()
74
+
75
+
76
+ # We used the same pre-processing steps as for scRNA-seq
77
+ #
78
+ # <div class="admonition warning">
79
+ # <p class="admonition-title">Note</p>
80
+ # <p>
81
+ #     In omicverse versions greater than `1.6.0`, the `prost` module for computing spatially variable genes (SVGs) replaces scanpy's HVGs. If you want to use scanpy's HVGs instead, set mode=`scanpy` in `ov.space.svg` or use the following code.
82
+ # </p>
83
+ # </div>
84
+ #
85
+ # ```python
86
+ # #adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
87
+ # #adata.raw = adata
88
+ # #adata = adata[:, adata.var.highly_variable_features]
89
+ # ```
90
+
91
+ # In[6]:
92
+
93
+
94
+ sc.pp.calculate_qc_metrics(adata, inplace=True)
95
+ adata = adata[:,adata.var['total_counts']>100]
96
+ adata=ov.space.svg(adata,mode='prost',n_svgs=3000,target_sum=1e4,platform="visium",)
97
+ adata.raw = adata
98
+ adata = adata[:, adata.var.space_variable_features]
99
+ adata_sp=adata.copy()
100
+ adata_sp
101
+
102
+
103
+ # ## Tangram model
104
+ #
105
+ # Tangram is a Python package, written in PyTorch and based on scanpy, for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. The single-cell dataset and the spatial dataset should be collected from the same anatomical region/tissue type, ideally from a biological replicate, and need to share a set of genes.
106
+ #
107
+ # We can use `omicverse.space.Tangram` to apply the Tangram model.
108
+
109
+ # In[7]:
110
+
111
+
112
+ tg=ov.space.Tangram(adata_sc,adata_sp,clusters='Subset')
113
+
114
+
115
+ # The function maps iteratively as specified by num_epochs. We typically interrupt mapping after the score plateaus.
116
+ # - The score measures the similarity between the gene expression of the mapped cells vs spatial data on the training genes.
117
+ # - The default mapping mode is mode=`cells`, which is recommended to run on a GPU.
118
+ # - Alternatively, one can specify mode=`clusters`, which averages the single cells belonging to the same cluster (pass annotations via cluster_label). This is faster, and is our choice when scRNAseq and spatial data come from different specimens.
119
+ # - To run Tangram on a GPU, set device=`cuda:0`; otherwise set device=`cpu`.
120
+ # - `density_prior` specifies the cell density within each spatial voxel. Use `uniform` if the spatial voxels are at single-cell resolution (e.g. MERFISH). The default value, `rna_count_based`, assumes that cell density is proportional to the number of RNA molecules.
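+ # To make the two density priors concrete, the sketch below turns per-voxel RNA totals into prior weights. It illustrates the idea stated in the list above (weights proportional to RNA counts, or equal weights) and is not Tangram's own code; the helper name `density_prior` and the toy counts are hypothetical.

```python
def density_prior(rna_counts, kind='rna_count_based'):
    """Return one prior weight per spatial voxel; the weights sum to 1."""
    n = len(rna_counts)
    if kind == 'uniform':
        # every voxel is assumed to hold the same share of cells
        return [1.0 / n] * n
    if kind == 'rna_count_based':
        # voxels with more RNA molecules are assumed to hold more cells
        total = float(sum(rna_counts))
        return [count / total for count in rna_counts]
    raise ValueError(f'unsupported kind: {kind!r}')


toy_counts = [100, 300, 600]   # made-up total RNA counts for three voxels
count_based = density_prior(toy_counts)
uniform = density_prior(toy_counts, 'uniform')
print(count_based)
print(uniform)
```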
121
+
122
+ # In[8]:
123
+
124
+
125
+ tg.train(mode="clusters",num_epochs=500,device="cuda:0")
126
+
127
+
128
+ # We can use `tg.cell2location()` to get the cell location in spatial spots.
129
+
130
+ # In[9]:
131
+
132
+
133
+ adata_plot=tg.cell2location()
134
+ adata_plot.obs.columns
135
+
136
+
137
+ # In[10]:
138
+
139
+
140
+ annotation_list=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC',
141
+ 'B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo']
142
+
143
+ sc.pl.spatial(adata_plot, cmap='magma',
144
+ # show first 8 cell types
145
+ color=annotation_list,
146
+ ncols=4, size=1.3,
147
+ img_key='hires',
148
+ # limit color scale at 99.2% quantile of cell abundance
149
+ #vmin=0, vmax='p99.2'
150
+ )
151
+
152
+
153
+ # In[11]:
154
+
155
+
156
+ color_dict=dict(zip(adata_sc.obs['Subset'].cat.categories,
157
+ adata_sc.uns['Subset_colors']))
158
+
159
+
160
+ # In[21]:
161
+
162
+
163
+ import matplotlib as mpl
164
+ clust_labels = annotation_list[:5]
165
+ clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
166
+
167
+ with mpl.rc_context({'figure.figsize': (8, 8),'axes.grid': False}):
168
+ fig = ov.pl.plot_spatial(
169
+ adata=adata_plot,
170
+ # labels to show on a plot
171
+ color=clust_col, labels=clust_labels,
172
+ show_img=True,
173
+ # 'fast' (white background) or 'dark_background'
174
+ style='fast',
175
+ # limit color scale at 99.2% quantile of cell abundance
176
+ max_color_quantile=0.992,
177
+ # size of locations (adjust depending on figure size)
178
+ circle_diameter=3,
179
+ reorder_cmap = [#0,
180
+ 1,2,3,4,6], #['yellow', 'orange', 'blue', 'green', 'purple', 'grey', 'white'],
181
+ colorbar_position='right',
182
+ palette=color_dict
183
+ )
184
+
185
+
186
+
ovrawm/t_metacells.txt ADDED
@@ -0,0 +1,249 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Inference of MetaCell from Single-Cell RNA-seq
5
+ #
6
+ # Metacells are cell groupings derived from single-cell sequencing data that represent highly granular, distinct cell states. Here, we present single-cell aggregation of cell-states (SEACells), an algorithm for identifying metacells; overcoming the sparsity of single-cell data, while retaining heterogeneity obscured by traditional cell clustering.
7
+ #
8
+ # Paper: [SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data](https://www.nature.com/articles/s41587-023-01716-9)
9
+ #
10
+ # Code: https://github.com/dpeerlab/SEACells
11
+ #
12
+
13
+ # In[1]:
14
+
15
+
16
+ import omicverse as ov
17
+ import scanpy as sc
18
+ import scvelo as scv
19
+
20
+ ov.plot_set()
21
+
22
+
23
+ # ## Data preprocessing
24
+ #
25
+ # We first need to normalize and scale the data.
26
+
27
+ # In[3]:
28
+
29
+
30
+ adata = scv.datasets.pancreas()
31
+ adata
32
+
33
+
34
+ # In[4]:
35
+
36
+
37
+ #quality control
38
+ adata=ov.pp.qc(adata,
39
+ tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
40
+ mt_startswith='mt-')
41
+ #normalize and calculate highly variable genes (HVGs)
42
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
43
+
44
+ #save the whole genes and filter the non-HVGs
45
+ adata.raw = adata
46
+ adata = adata[:, adata.var.highly_variable_features]
47
+
48
+ #scale the adata.X
49
+ ov.pp.scale(adata)
50
+
51
+ #Dimensionality Reduction
52
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
53
+
54
+
55
+ # ## Constructing a metacellular object
56
+ #
57
+ # We can use `ov.single.MetaCell` to construct a metacell object for training the SEACells model. The arguments are listed below.
58
+ #
59
+ # - :param ad: (AnnData) annotated data matrix
60
+ # - :param build_kernel_on: (str) key corresponding to matrix in ad.obsm which is used to compute kernel for metacells
61
+ # Typically 'X_pca' for scRNA or 'X_svd' for scATAC
62
+ # - :param n_SEACells: (int) number of SEACells to compute
63
+ # - :param use_gpu: (bool) whether to use GPU for computation
64
+ # - :param verbose: (bool) whether to suppress verbose program logging
65
+ # - :param n_waypoint_eigs: (int) number of eigenvectors to use for waypoint initialization
66
+ # - :param n_neighbors: (int) number of nearest neighbors to use for graph construction
67
+ # - :param convergence_epsilon: (float) convergence threshold for the Frank-Wolfe algorithm
68
+ # - :param l2_penalty: (float) L2 penalty for the Frank-Wolfe algorithm
69
+ # - :param max_franke_wolfe_iters: (int) maximum number of iterations for the Frank-Wolfe algorithm
70
+ # - :param use_sparse: (bool) whether to use sparse matrix operations. Currently only supported for CPU implementation.
71
+
72
+ # In[5]:
73
+
74
+
75
+ meta_obj=ov.single.MetaCell(adata,use_rep='scaled|original|X_pca',
76
+ n_metacells=None,
77
+ use_gpu='cuda:0')
78
+
79
+
80
+ # In[6]:
81
+
82
+
83
+ get_ipython().run_cell_magic('time', '', 'meta_obj.initialize_archetypes()\n')
84
+
85
+
86
+ # ## Train and save the model
87
+
88
+ # In[7]:
89
+
90
+
91
+ get_ipython().run_cell_magic('time', '', 'meta_obj.train(min_iter=10, max_iter=50)\n')
92
+
93
+
94
+ # In[9]:
95
+
96
+
97
+ meta_obj.save('seacells/model.pkl')
98
+
99
+
100
+ # In[6]:
101
+
102
+
103
+ meta_obj.load('seacells/model.pkl')
104
+
105
+
106
+ # ## Predicting the metacells
107
+ #
108
+ # We can use `predicted` to predict the metacells from the raw scRNA-seq data. Two methods are available: `soft` and `hard`.
109
+ #
110
+ # The `soft` method aggregates cells within each SEACell by summing the raw data multiplied by the assignment weight over all cells belonging to a SEACell. Data is unnormalized and pseudo-raw aggregated counts are stored in .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per-SEACell attributes must be copied manually, since certain attributes may need to be summed, averaged, etc., depending on the attribute.
111
+ #
112
+ # The `hard` method aggregates cells within each SEACell by summing the raw data over all cells belonging to a SEACell. Data is unnormalized and raw aggregated counts are stored in .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per-SEACell attributes must be copied manually, since certain attributes may need to be summed, averaged, etc., depending on the attribute.
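+ # The difference between the two aggregation modes can be sketched on a toy matrix. The code below is an illustration only, under the assumption that a `hard` assignment gives each cell entirely to its highest-weight metacell; the helper name `aggregate` and all values are hypothetical, and this is not the SEACells or omicverse implementation.

```python
def aggregate(counts, weights, method='soft'):
    """Sum per-cell counts into metacells.

    counts  : per-cell gene count vectors (cells x genes)
    weights : per-cell assignment weights (cells x metacells)
    """
    n_meta = len(weights[0])
    n_genes = len(counts[0])
    out = [[0.0] * n_genes for _ in range(n_meta)]
    for cell_counts, cell_weights in zip(counts, weights):
        if method == 'hard':
            # each cell contributes fully to its highest-weight metacell
            best = max(range(n_meta), key=lambda m: cell_weights[m])
            cell_weights = [1.0 if m == best else 0.0 for m in range(n_meta)]
        for m, w in enumerate(cell_weights):
            for g, value in enumerate(cell_counts):
                out[m][g] += w * value
    return out


toy_counts = [[4, 0], [2, 2], [0, 6]]                 # 3 cells x 2 genes
toy_weights = [[1.0, 0.0], [0.75, 0.25], [0.0, 1.0]]  # 3 cells x 2 metacells
soft = aggregate(toy_counts, toy_weights, 'soft')
hard = aggregate(toy_counts, toy_weights, 'hard')
print(soft)
print(hard)
```

+ # In the toy data the second cell is shared between both metacells under `soft`, while `hard` gives all of its counts to the first metacell.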
113
+
114
+ # In[10]:
115
+
116
+
117
+ ad=meta_obj.predicted(method='soft',celltype_label='clusters',
118
+ summarize_layer='lognorm')
119
+
120
+
121
+ # ## Benchmarking
122
+ #
123
+ # Benchmarking metrics were computed for each metacell for all combinations of data modality, dataset and method. Cell type purity was used to assess the quality of PBMC metacells. Methods were compared using the Wilcoxon rank-sum test. We note that this test might possibly inflate significance due to the dependency between metacells, but it nonetheless provides an estimate of the direction of difference. Top-performing metacell approaches should have scores that are low on compactness, high on separation and high on cell type purity
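+ # Cell type purity can be illustrated as the fraction of cells in a metacell that carry its most common label. The sketch below uses that definition on toy labels; the helper name `celltype_purity` and the data are hypothetical, and the output format of `meta_obj.compute_celltype_purity` may differ.

```python
from collections import Counter


def celltype_purity(metacell_of_cell, celltype_of_cell):
    """Map each metacell to (dominant cell type, fraction of its cells with that type)."""
    members = {}
    for metacell, celltype in zip(metacell_of_cell, celltype_of_cell):
        members.setdefault(metacell, []).append(celltype)
    purity = {}
    for metacell, labels in members.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        purity[metacell] = (top_label, top_count / len(labels))
    return purity


# Toy assignments with made-up labels
toy_metacells = ['M0', 'M0', 'M0', 'M0', 'M1', 'M1']
toy_celltypes = ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Ductal', 'Ductal']
toy_purity = celltype_purity(toy_metacells, toy_celltypes)
print(toy_purity)
```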
124
+
125
+ # In[11]:
126
+
127
+
128
+ SEACell_purity = meta_obj.compute_celltype_purity('clusters')
129
+ separation = meta_obj.separation(use_rep='scaled|original|X_pca',nth_nbr=1)
130
+ compactness = meta_obj.compactness(use_rep='scaled|original|X_pca')
131
+
132
+
133
+ # In[12]:
134
+
135
+
136
+ import seaborn as sns
137
+ import matplotlib.pyplot as plt
138
+ ov.plot_set()
139
+ fig, axes = plt.subplots(1,3,figsize=(4,4))
140
+ sns.boxplot(data=SEACell_purity, y='clusters_purity',ax=axes[0],
141
+ color=ov.utils.blue_color[3])
142
+ sns.boxplot(data=compactness, y='compactness',ax=axes[1],
143
+ color=ov.utils.blue_color[4])
144
+ sns.boxplot(data=separation, y='separation',ax=axes[2],
145
+ color=ov.utils.blue_color[4])
146
+ plt.tight_layout()
147
+ plt.suptitle('Evaluation of MetaCells',fontsize=13,y=1.05)
148
+ for ax in axes:
149
+ ax.grid(False)
150
+ ax.spines['top'].set_visible(False)
151
+ ax.spines['right'].set_visible(False)
152
+ ax.spines['bottom'].set_visible(True)
153
+ ax.spines['left'].set_visible(True)
154
+
155
+
156
+ # In[13]:
157
+
158
+
159
+ import matplotlib.pyplot as plt
160
+ fig, ax = plt.subplots(figsize=(4,4))
161
+ ov.pl.embedding(
162
+ meta_obj.adata,
163
+ basis="X_umap",
164
+ color=['clusters'],
165
+ frameon='small',
166
+ title="Meta cells",
167
+ #legend_loc='on data',
168
+ legend_fontsize=14,
169
+ legend_fontoutline=2,
170
+ size=10,
171
+ ax=ax,
172
+ alpha=0.2,
173
+ #legend_loc='',
174
+ add_outline=False,
175
+ #add_outline=True,
176
+ outline_color='black',
177
+ outline_width=1,
178
+ show=False,
179
+ #palette=ov.utils.blue_color[:],
180
+ #legend_fontweight='normal'
181
+ )
182
+ ov.single.plot_metacells(ax,meta_obj.adata,color='#CB3E35',
183
+ )
184
+
185
+
186
+ # ## Get the raw obs value from adata
187
+ #
188
+ # Sometimes we compute floating-point values, such as pseudotime, on the raw single-cell data and want to carry them over to the metacells. In this case, we can use the `ov.single.get_obs_value` function.
189
+ #
190
+ # Note that the type parameter supports `str`,`max`,`min`,`mean`.
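+ # Conceptually, the numeric options reduce a per-cell column to one value per metacell. The sketch below shows `mean`, `max` and `min` on toy data; it does not cover `str`, and the helper name `summarize_obs` and the values are hypothetical rather than taken from omicverse.

```python
def summarize_obs(values, metacell_of_cell, how='mean'):
    """Collapse one numeric per-cell column into one value per metacell."""
    reducers = {
        'mean': lambda xs: sum(xs) / len(xs),
        'max': max,
        'min': min,
    }
    if how not in reducers:
        raise ValueError(f'unsupported reducer: {how!r}')
    groups = {}
    for value, metacell in zip(values, metacell_of_cell):
        groups.setdefault(metacell, []).append(value)
    return {metacell: reducers[how](xs) for metacell, xs in groups.items()}


toy_scores = [1.0, 3.0, 2.0, 6.0]        # made-up per-cell values
toy_assign = ['M0', 'M0', 'M1', 'M1']    # metacell of each cell
toy_mean = summarize_obs(toy_scores, toy_assign, 'mean')
toy_max = summarize_obs(toy_scores, toy_assign, 'max')
print(toy_mean)
print(toy_max)
```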
191
+
192
+ # In[14]:
193
+
194
+
195
+ ov.single.get_obs_value(ad,adata,groupby='S_score',
196
+ type='mean')
197
+ ad.obs.head()
198
+
199
+
200
+ # ## Visualize the MetaCells
201
+
202
+ # In[15]:
203
+
204
+
205
+ import scanpy as sc
206
+ ad.raw=ad.copy()
207
+ sc.pp.highly_variable_genes(ad, n_top_genes=2000, inplace=True)
208
+ ad=ad[:,ad.var.highly_variable]
209
+
210
+
211
+ # In[16]:
212
+
213
+
214
+ ov.pp.scale(ad)
215
+ ov.pp.pca(ad,layer='scaled',n_pcs=30)
216
+ ov.pp.neighbors(ad, n_neighbors=15, n_pcs=20,
217
+ use_rep='scaled|original|X_pca')
218
+
219
+
220
+ # In[17]:
221
+
222
+
223
+ ov.pp.umap(ad)
224
+
225
+
226
+ # We want the metacells to take on the same colours as the original data. Note that the colours of the original data are stored in `adata.uns` under keys ending in `_colors` (here `adata.uns['clusters_colors']`).
227
+
228
+ # In[18]:
229
+
230
+
231
+ ad.obs['celltype']=ad.obs['celltype'].astype('category')
232
+ ad.obs['celltype']=ad.obs['celltype'].cat.reorder_categories(adata.obs['clusters'].cat.categories)
233
+ ad.uns['celltype_colors']=adata.uns['clusters_colors']
234
+
235
+
236
+ # In[19]:
237
+
238
+
239
+ ov.pl.embedding(ad, basis='X_umap',
240
+ color=["celltype","S_score"],
241
+ frameon='small',cmap='RdBu_r',
242
+ wspace=0.5)
243
+
244
+
ovrawm/t_metatime.txt ADDED
@@ -0,0 +1,177 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Celltype auto annotation with MetaTiME
5
+ #
6
+ # MetaTiME learns data-driven, interpretable, and reproducible gene programs by integrating millions of single cells from hundreds of tumor scRNA-seq data. The idea is to learn a map of single-cell space with biologically meaningful directions from large-scale data, which helps understand functional cell states and transfers knowledge to new data analysis. MetaTiME provides pretrained meta-components (MeCs) to automatically annotate fine-grained cell states and plot signature continuum for new single-cells of tumor microenvironment.
7
+ #
8
+ # Here, we integrate MetaTiME in omicverse. This tutorial demonstrates how to use [MetaTiME (original code)](https://github.com/yi-zhang/MetaTiME/blob/main/docs/notebooks/metatime_annotator.ipynb) to annotate cell types in the tumor microenvironment (TME).
9
+ #
10
+ # Paper: [MetaTiME integrates single-cell gene expression to characterize the meta-components of the tumor immune microenvironment](https://www.nature.com/articles/s41467-023-38333-8)
11
+ #
12
+ # Code: https://github.com/yi-zhang/MetaTiME
13
+ #
14
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1isvjTfSFM2cy6GzHWAwbuvSjveEJijzP?usp=sharing
15
+ #
16
+ # ![metatime](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41467-023-38333-8/MediaObjects/41467_2023_38333_Fig1_HTML.png)
17
+
18
+ # In[1]:
19
+
20
+
21
+ import omicverse as ov
22
+ ov.utils.ov_plot_set()
23
+
24
+
25
+ # ## Data normalization and batch removal
26
+ #
27
+ # The sample data come from multiple patients, so we can apply batch correction across patients. Here, we use [scVI](https://docs.scvi-tools.org/en/stable/) to remove the batch effect.
28
+ #
29
+ # <div class="admonition warning">
30
+ # <p class="admonition-title">Note</p>
31
+ # <p>
32
+ #     If your data contains a count matrix, we provide a wrapped function for pre-processing the data. Otherwise, if the data is already depth-normalized and log-transformed and the cells are filtered, we can skip this step.
33
+ # </p>
34
+ # </div>
35
+
36
+ # In[ ]:
37
+
38
+
39
+ '''
40
+ import scvi
41
+ scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="patient")
42
+ vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30, gene_likelihood="nb")
43
+ vae.train()
44
+ adata.obsm["X_scVI"] = vae.get_latent_representation()
45
+ '''
46
+
47
+
48
+ # Example data can be obtained from figshare: https://figshare.com/ndownloader/files/41440050
49
+
50
+ # In[2]:
51
+
52
+
53
+ import scanpy as sc
54
+ adata=sc.read('TiME_adata_scvi.h5ad')
55
+ adata
56
+
57
+
58
+ # It is recommended that malignant cells are identified first and removed for best practice in cell state annotation.
59
+ #
60
+ # In the BCC data, the cluster of malignant cells are identified with `inferCNV`. We can use the pre-saved column 'isTME' to keep Tumor Microenvironment cells.
61
+ #
62
+ # The recommendation above comes from the original authors; in our tests, however, the annotation differed little even when the malignant cells were not removed.
63
+ #
64
+ # I therefore consider this step optional.
65
+
66
+ # In[3]:
67
+
68
+
69
+ #adata = adata[adata.obs['isTME']]
70
+
71
+
72
+ # ## Neighborhood graph calculation
73
+ #
74
+ # scVI was used earlier to remove the batch effect from the data, so we need to recalculate the neighbourhood graph based on what is stored in `adata.obsm['X_scVI']`. If you are not using scVI but another representation to calculate the neighbourhood graph, such as `X_pca`, change `X_scVI` to `X_pca` accordingly.
75
+ #
76
+ # ```
77
+ # #Example
78
+ # #sc.tl.pca(adata)
79
+ # #sc.pp.neighbors(adata, use_rep="X_pca")
80
+ # ```
81
+
82
+ # In[4]:
83
+
84
+
85
+ sc.pp.neighbors(adata, use_rep="X_scVI")
86
+
87
+
88
+ # To visualize the scVI embedding, we use the `pymde` package wrapper in omicverse. This is a GPU-accelerated alternative to UMAP.
89
+
90
+ # In[5]:
91
+
92
+
93
+ adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["X_scVI"])
94
+
95
+
96
+ # In[6]:
97
+
98
+
99
+ sc.pl.embedding(
100
+ adata,
101
+ basis="X_mde",
102
+ color=["patient"],
103
+ frameon=False,
104
+ ncols=1,
105
+ )
106
+
107
+
108
+ # In[7]:
109
+
110
+
111
+ #adata.write_h5ad('adata_mde.h5ad',compression='gzip')
112
+ #adata=sc.read('adata_mde.h5ad')
113
+
114
+
115
+ # ## MetaTiME model init
116
+ #
117
+ # Next, let's load the pre-computed MetaTiME MetaComponents (MeCs), and their functional annotation.
118
+
119
+ # In[8]:
120
+
121
+
122
+ TiME_object=ov.single.MetaTiME(adata,mode='table')
123
+
124
+
125
+ # We can over-cluster the cells, which is useful for fine-grained cell state annotation.
126
+ #
127
+ # The larger the resolution, the more clusters are produced.
128
+
129
+ # In[9]:
130
+
131
+
132
+ TiME_object.overcluster(resolution=8,clustercol = 'overcluster',)
133
+
134
+
135
+ # ## Predict TME cell types
136
+ #
137
+ # We use `TiME_object.predictTiME()` to predict the latent cell types in the TME.
138
+ #
139
+ # - The minor celltype will be stored in `adata.obs['MetaTiME']`
140
+ # - The major celltype will be stored in `adata.obs['Major_MetaTiME']`
141
+
142
+ # In[10]:
143
+
144
+
145
+ TiME_object.predictTiME(save_obs_name='MetaTiME')
146
+
147
+
148
+ # ## Visualize
149
+ #
150
+ # The original authors provide a plotting function that effectively avoids overlapping labels. Here we have extended its parameters so that embeddings other than `X_umap` can be used.
151
+
152
+ # In[13]:
153
+
154
+
155
+ fig,ax=TiME_object.plot(cluster_key='MetaTiME',basis='X_mde',dpi=80)
156
+ #fig.save
157
+
158
+
159
+ # We can also use `sc.pl.embedding` to visualize the celltype
160
+
161
+ # In[15]:
162
+
163
+
164
+ sc.pl.embedding(
165
+ adata,
166
+ basis="X_mde",
167
+ color=["Major_MetaTiME"],
168
+ frameon=False,
169
+ ncols=1,
170
+ )
171
+
172
+
173
+ # In[ ]:
174
+
175
+
176
+
177
+
ovrawm/t_mofa.txt ADDED
@@ -0,0 +1,184 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Multi omics analysis by MOFA
5
+ # MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion.
6
+ #
7
+ # This tutorial shows how to run MOFA on multi-omics data such as scRNA-seq and scATAC-seq.
8
+ #
9
+ # Paper: [MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02015-1)
10
+ #
11
+ # Code: https://github.com/bioFAM/mofapy2
12
+ #
13
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1UPGQA3BenrC-eLIGVtdKVftSnOKIwNrP?usp=sharing
14
+
15
+ # ## Part.1 MOFA Model
16
+ # In this part, we construct a MOFA model from scRNA-seq and scATAC-seq data.
17
+
18
+ # In[1]:
19
+
20
+
21
+ import omicverse as ov
22
+ rna=ov.utils.read('data/sample/rna_p_n_raw.h5ad')
23
+ atac=ov.utils.read('data/sample/atac_p_n_raw.h5ad')
24
+
25
+
26
+ # In[2]:
27
+
28
+
29
+ rna,atac
30
+
31
+
32
+ # We only need to pass the AnnData objects to `ov.single.pyMOFA` to construct the base model.
33
+
34
+ # In[4]:
35
+
36
+
37
+ test_mofa=ov.single.pyMOFA(omics=[rna,atac],
38
+ omics_name=['RNA','ATAC'])
39
+
40
+
41
+ # In[ ]:
42
+
43
+
44
+ test_mofa.mofa_preprocess()
45
+ test_mofa.mofa_run(outfile='models/brac_rna_atac.hdf5')
46
+
47
+
48
+ # ## Part.2 MOFA Analysis
49
+ # After obtaining the MOFA model, we need to analyze how the factors relate to the different omics. We provide several methods for this.
50
+
51
+ # ### load data
52
+
53
+ # In[1]:
54
+
55
+
56
+ import omicverse as ov
57
+ ov.utils.ov_plot_set()
58
+
59
+
60
+ # In[2]:
61
+
62
+
63
+ rna=ov.utils.read('data/sample/rna_test.h5ad')
64
+
65
+
66
+ # ### add value of factor to anndata
67
+
68
+ # In[3]:
69
+
70
+
71
+ rna=ov.single.factor_exact(rna,hdf5_path='data/sample/MOFA_POS.hdf5')
72
+ rna
73
+
74
+
75
+ # ### analysis of the correlation between factor and cell type
76
+
77
+ # In[4]:
78
+
79
+
80
+ ov.single.factor_correlation(adata=rna,cluster='cell_type',factor_list=[1,2,3,4,5])
81
+
82
+
83
+ # ### Get the gene/feature weights of different factor
84
+
85
+ # In[5]:
86
+
87
+
88
+ ov.single.get_weights(hdf5_path='data/sample/MOFA_POS.hdf5',view='RNA',factor=1)
89
+
90
+
91
+ # ## Part.3 MOFA Visualize
92
+ #
93
+ # We provide a series of functions to visualize the MOFA results.
94
+
95
+ # In[6]:
96
+
97
+
98
+ pymofa_obj=ov.single.pyMOFAART(model_path='data/sample/MOFA_POS.hdf5')
99
+
100
+
101
+ # First, we get the factors of each cell.
102
+
103
+ # In[7]:
104
+
105
+
106
+ pymofa_obj.get_factors(rna)
107
+ rna
108
+
109
+
110
+ # We can also plot the variance explained in each view.
111
+
112
+ # In[8]:
113
+
114
+
115
+ pymofa_obj.plot_r2()
116
+
117
+
118
+ # In[9]:
119
+
120
+
121
+ pymofa_obj.get_r2()
122
+
123
+
124
+ # ### Visualize the correlation between factor and celltype
125
+
126
+ # In[10]:
127
+
128
+
129
+ pymofa_obj.plot_cor(rna,'cell_type')
130
+
131
+
132
+ # We found that factor 6 is correlated with epithelial cells.
133
+
134
+ # In[11]:
135
+
136
+
137
+ pymofa_obj.plot_factor(rna,'cell_type','Epi',figsize=(3,3),
138
+ factor1=6,factor2=10,)
139
+
140
+
141
+ # In[24]:
142
+
143
+
144
+ import scanpy as sc
145
+ sc.pp.neighbors(rna)
146
+ sc.tl.umap(rna)
147
+ sc.pl.embedding(
148
+ rna,
149
+ basis="X_umap",
150
+ color=["factor6","cell_type"],
151
+ frameon=False,
152
+ ncols=2,
153
+ #palette=ov.utils.pyomic_palette(),
154
+ show=False,
155
+ cmap='Greens',
156
+ vmin=0,
157
+ )
158
+ #plt.savefig("figures/umap_factor6.png",dpi=300,bbox_inches = 'tight')
159
+
160
+
161
+ # In[12]:
162
+
163
+
164
+ pymofa_obj.plot_weight_gene_d1(view='RNA',factor1=6,factor2=10,)
165
+
166
+
167
+ # In[18]:
168
+
169
+
170
+ pymofa_obj.plot_weights(view='RNA',factor=6,color='#5de25d',
171
+ ascending=True)
172
+
173
+
174
+ # In[14]:
175
+
176
+
177
+ pymofa_obj.plot_top_feature_heatmap(view='RNA')
178
+
179
+
180
+ # In[ ]:
181
+
182
+
183
+
184
+
ovrawm/t_mofa_glue.txt ADDED
@@ -0,0 +1,255 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Multi omics analysis by MOFA and GLUE
5
+ # MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion.
6
+ #
7
+ # Most of the time, however, we do not have paired cells in a multi-omics analysis. Here, we pair cells using GLUE, a dimensionality reduction algorithm that can efficiently integrate data from different omics layers.
8
+ #
9
+ # This tutorial shows how to run MOFA on unpaired multi-omics data by first pairing cells with GLUE.
10
+ #
11
+ # Paper: [MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02015-1) and [Multi-omics single-cell data integration and regulatory inference with graph-linked embedding](https://www.nature.com/articles/s41587-022-01284-4)
12
+ #
13
+ # Code: https://github.com/bioFAM/mofapy2 and https://github.com/gao-lab/GLUE
14
+ #
15
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1zlakFf20IoBdyuOQDocwFQHu8XOVizRL?usp=sharing
16
+ #
17
+ # We use the resulting AnnData objects `rna-emb.h5ad` and `atac-emb.h5ad` from [GLUE's tutorial](https://scglue.readthedocs.io/en/latest/training.html)
18
+
19
+ # In[1]:
20
+
21
+
22
+ import omicverse as ov
23
+ ov.utils.ov_plot_set()
24
+
25
+
26
+ # ## Load the data
27
+ #
28
+ # We use `ov.utils.read` to read the `h5ad` files
29
+
30
+ # In[2]:
31
+
32
+
33
+ rna=ov.utils.read("chen_rna-emb.h5ad")
34
+ atac=ov.utils.read("chen_atac-emb.h5ad")
35
+
36
+
37
+ # ## Pair the omics
38
+ #
39
+ # Each cell in our RNA and ATAC data has a feature vector, `X_glue`, from which we can calculate Pearson correlation coefficients to match cells.
40
+
41
+ # In[3]:
42
+
43
+
44
+ pair_obj=ov.single.GLUE_pair(rna,atac)
45
+ pair_obj.correlation()
46
+
47
+
48
+ # For each cell in one omics layer, we consider the top 50 most highly correlated cells in the other omics layer, to avoid losing data when one cell is highly correlated with multiple cells. The default minimum threshold for high correlation is 0.9. We can obtain more paired cells by increasing `depth`, but note that a larger depth may lead to more errors in cell matching.
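+ #
+ # The pairing idea can be illustrated with a minimal pure-Python sketch. It is not the `GLUE_pair` implementation: the helper names `pearson` and `pair_cells`, the toy embeddings, and the greedy one-to-one rule are assumptions made for illustration only.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pair_cells(rna_emb, atac_emb, depth=2, threshold=0.9):
    """Greedily pair each RNA cell with its best unused ATAC cell.

    Hypothetical helper: for every RNA cell, only the `depth` most correlated
    ATAC cells are considered, and candidates below `threshold` are rejected.
    """
    pairs, used = {}, set()
    for rna_id, u in rna_emb.items():
        ranked = sorted(atac_emb, key=lambda a: pearson(u, atac_emb[a]), reverse=True)
        for atac_id in ranked[:depth]:
            if atac_id not in used and pearson(u, atac_emb[atac_id]) >= threshold:
                pairs[rna_id] = atac_id
                used.add(atac_id)
                break
    return pairs

# Toy 4-dimensional embeddings standing in for X_glue
rna_emb = {"r1": [1.0, 2.0, 3.0, 4.0], "r2": [1.1, 2.1, 2.9, 4.2], "r3": [4.0, 1.0, 3.0, 2.0]}
atac_emb = {"a1": [1.0, 2.1, 3.0, 3.9], "a2": [0.9, 2.0, 3.1, 4.1], "a3": [4.0, 3.0, 2.0, 1.0]}

pairs_shallow = pair_cells(rna_emb, atac_emb, depth=1)
pairs_deep = pair_cells(rna_emb, atac_emb, depth=2)
print(pairs_shallow, pairs_deep)
```

+ #
+ # In this toy case a depth of 1 pairs only one cell, because two RNA cells compete for the same best ATAC match, while a depth of 2 lets the second cell fall back to its next-best candidate. This is the trade-off described above: more pairs, but each extra pair is a slightly weaker match.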
49
+
50
+ # In[4]:
51
+
52
+
53
+ res_pair=pair_obj.find_neighbor_cell(depth=20)
54
+ res_pair.to_csv('models/chen_pair_res.csv')
55
+
56
+
57
+ # We filter to get paired cells
58
+
59
+ # In[14]:
60
+
61
+
62
+ rna1=rna[res_pair['omic_1']]
63
+ atac1=atac[res_pair['omic_2']]
64
+ rna1.obs.index=res_pair.index
65
+ atac1.obs.index=res_pair.index
66
+ rna1,atac1
67
+
68
+
69
+ # We can use MuData to store the multi-omics data.
70
+
71
+ # In[6]:
72
+
73
+
74
+ from mudata import MuData
75
+
76
+ mdata = MuData({'rna': rna1, 'atac': atac1})
77
+ mdata
78
+
79
+
80
+ # In[7]:
81
+
82
+
83
+ mdata.write("chen_mu.h5mu",compression='gzip')
84
+
85
+
86
+ # ## MOFA prepare
87
+ #
88
+ # In the MOFA analysis, we only need the highly variable features, so we filter both modalities accordingly.
89
+
90
+ # In[22]:
91
+
92
+
93
+ rna1=mdata['rna']
94
+ rna1=rna1[:,rna1.var['highly_variable']==True]
95
+ atac1=mdata['atac']
96
+ atac1=atac1[:,atac1.var['highly_variable']==True]
97
+ rna1.obs.index=res_pair.index
98
+ atac1.obs.index=res_pair.index
99
+
100
+
101
+ # In[23]:
102
+
103
+
104
+ import random
105
+ random_obs_index=random.sample(list(rna1.obs.index),5000)
106
+
107
+
108
+ # In[25]:
109
+
110
+
111
+ from sklearn.metrics import adjusted_rand_score as ari
112
+ ari_random=ari(rna1[random_obs_index].obs['cell_type'], atac1[random_obs_index].obs['cell_type'])
113
+ ari_raw=ari(rna1.obs['cell_type'], atac1.obs['cell_type'])
114
+ print('raw ari:{}, random ari:{}'.format(ari_raw,ari_random))
115
+
116
+
117
+ # In[26]:
118
+
119
+
120
+ #rna1=rna1[random_obs_index]
121
+ #atac1=atac1[random_obs_index]
122
+
123
+
124
+ # ## MOFA analysis
125
+ #
126
+ # In this part, we construct a MOFA model from scRNA-seq and scATAC-seq data.
127
+
128
+ # In[28]:
129
+
130
+
131
+ test_mofa=ov.single.pyMOFA(omics=[rna1,atac1],
132
+ omics_name=['RNA','ATAC'])
133
+
134
+
135
+ # In[29]:
136
+
137
+
138
+ test_mofa.mofa_preprocess()
139
+ test_mofa.mofa_run(outfile='models/chen_rna_atac.hdf5')
140
+
141
+
142
+ # ## MOFA Visualization
143
+ #
144
+ # In this part, we provide a series of functions to visualize the MOFA results.
145
+
146
+ # In[30]:
147
+
148
+
149
+ pymofa_obj=ov.single.pyMOFAART(model_path='models/chen_rna_atac.hdf5')
150
+
151
+
152
+ # In[31]:
153
+
154
+
155
+ pymofa_obj.get_factors(rna1)
156
+ rna1
157
+
158
+
159
+ # ### Visualize the variance of each view
160
+
161
+ # In[32]:
162
+
163
+
164
+ pymofa_obj.plot_r2()
165
+
166
+
167
+ # In[33]:
168
+
169
+
170
+ pymofa_obj.get_r2()
171
+
172
+
173
+ # ### Visualize the correlation between factor and celltype
174
+
175
+ # In[37]:
176
+
177
+
178
+ pymofa_obj.plot_cor(rna1,'cell_type',figsize=(4,6))
179
+
180
+
181
+ # In[38]:
182
+
183
+
184
+ pymofa_obj.get_cor(rna1,'cell_type')
185
+
186
+
187
+ # In[46]:
188
+
189
+
190
+ pymofa_obj.plot_factor(rna1,'cell_type','Ast',figsize=(3,3),
191
+ factor1=1,factor2=3,)
192
+
193
+
194
+ # ### Visualize the factor in UMAP
195
+ #
196
+ # To visualize GLUE’s learned embeddings, we use the pymde package wrapper in scvi-tools. This is a GPU-accelerated alternative to UMAP.
197
+ #
198
+ # You can use `sc.tl.umap` instead.
199
+
200
+ # In[41]:
201
+
202
+
203
+ from scvi.model.utils import mde
204
+ import scanpy as sc
205
+ sc.pp.neighbors(rna1, use_rep="X_glue", metric="cosine")
206
+ rna1.obsm["X_mde"] = mde(rna1.obsm["X_glue"])
207
+
208
+
209
+ # In[47]:
210
+
211
+
212
+ sc.pl.embedding(
213
+ rna1,
214
+ basis="X_mde",
215
+ color=["factor1","factor3","cell_type"],
216
+ frameon=False,
217
+ ncols=3,
218
+ #palette=ov.utils.pyomic_palette(),
219
+ show=False,
220
+ cmap='Greens',
221
+ vmin=0,
222
+ )
223
+
224
+
225
+ # ### Weights ranked
226
+ # A visualization of factor weights familiar to MOFA and MOFA+ users is implemented with some modifications in `plot_weight_gene_d1`, `plot_weight_gene_d2`, and `plot_weights`.
227
+
228
+ # In[48]:
229
+
230
+
231
+ pymofa_obj.plot_weight_gene_d1(view='RNA',factor1=1,factor2=3,)
232
+
233
+
234
+ # In[50]:
235
+
236
+
237
+ pymofa_obj.plot_weights(view='RNA',factor=1,
238
+ ascending=False)
239
+
240
+
241
+ # ### Weights heatmap
242
+ #
243
+ # While trying to annotate factors, a global overview of top features defining them could be helpful.
244
+
245
+ # In[51]:
246
+
247
+
248
+ pymofa_obj.plot_top_feature_heatmap(view='RNA')
249
+
250
+
251
+ # In[ ]:
252
+
253
+
254
+
255
+
ovrawm/t_network.txt ADDED
@@ -0,0 +1,88 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Protein-Protein interaction (PPI) analysis by String-db
5
+ # STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.
6
+ #
7
+ # This tutorial shows how to use Python to construct a protein-protein interaction network.
8
+ #
9
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1ReLCFA5cNNcem_WaMXYN9da7W0GN4gzl?usp=sharing
10
+
11
+ # In[1]:
12
+
13
+
14
+ import omicverse as ov
15
+ ov.utils.ov_plot_set()
16
+
17
+
18
+ # ## Prepare data
19
+ #
20
+ # Here we use the example data of string-db to perform the analysis
21
+ #
22
+ # FAA4 and its ten most confident interactors.
23
+ # FAA4 in yeast is a long chain fatty acyl-CoA synthetase; see it connected to other synthetases as well as regulators.
24
+ #
25
+ # Saccharomyces cerevisiae
26
+ # NCBI taxonomy Id: 4932
27
+ # Other names: ATCC 18824, Candida robusta, NRRL Y-12632, S. cerevisiae, Saccharomyces capensis, Saccharomyces italicus, Saccharomyces oviformis, Saccharomyces uvarum var. melibiosus, lager beer yeast, yeast
28
+
29
+ # In[2]:
30
+
31
+
32
+ gene_list=['FAA4','POX1','FAT1','FAS2','FAS1','FAA1','OLE1','YJU3','TGL3','INA1','TGL5']
33
+
34
+
35
+ # We also need to set each gene's type and color. Here, we arbitrarily label the first 5 genes as Type1 and the rest as Type2.
36
+
37
+ # In[3]:
38
+
39
+
40
+ gene_type_dict=dict(zip(gene_list,['Type1']*5+['Type2']*6))
41
+ gene_color_dict=dict(zip(gene_list,['#F7828A']*5+['#9CCCA4']*6))
42
+
43
+
44
+ # ## STRING interaction analysis
45
+ #
46
+ # The network API method also allows you to retrieve your STRING interaction network for one or multiple proteins in various text formats. It will tell you the combined score and all the channel specific scores for the set of proteins. You can also extend the network neighborhood by setting "add_nodes", which will add, to your network, new interaction partners in order of their confidence.
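+ #
+ # As a rough sketch of what such a request looks like, the snippet below only builds a STRING `network` URL and does not send it. The helper name `build_network_url` is hypothetical, and this is not necessarily how `ov.bulk.string_interaction` issues its request; the endpoint layout and the `identifiers`, `species`, and `add_nodes` parameters follow the public STRING API.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"

def build_network_url(genes, species, add_nodes=0, output_format="tsv"):
    """Build a STRING `network` request URL without sending it (illustrative helper)."""
    params = {
        "identifiers": "\r".join(genes),  # STRING separates identifiers with a carriage return (%0d)
        "species": species,               # NCBI taxonomy id, e.g. 4932 for S. cerevisiae
        "add_nodes": add_nodes,           # number of extra interaction partners to add
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"

url = build_network_url(["FAA4", "POX1", "FAT1"], species=4932, add_nodes=5)
print(url)
```

+ #
+ # Keeping URL construction separate from the network call makes it easy to inspect the query before contacting string-db.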
47
+
48
+ # In[7]:
49
+
50
+
51
+ G_res=ov.bulk.string_interaction(gene_list,4932)
52
+ G_res.head()
53
+
54
+
55
+ # ## STRING PPI network
56
+ #
57
+ # We can also use `ov.bulk.pyPPI` to get the PPI network of `gene_list`. We initialize it first.
58
+
59
+ # In[5]:
60
+
61
+
62
+ ppi=ov.bulk.pyPPI(gene=gene_list,
63
+ gene_type_dict=gene_type_dict,
64
+ gene_color_dict=gene_color_dict,
65
+ species=4932)
66
+
67
+
68
+ # Then we connect to string-db to calculate the protein-protein interaction
69
+
70
+ # In[8]:
71
+
72
+
73
+ ppi.interaction_analysis()
74
+
75
+
76
+ # We provide a very simple function to plot the network; refer to `ov.utils.plot_network` for the available parameters.
77
+
78
+ # In[9]:
79
+
80
+
81
+ ppi.plot_network()
82
+
83
+
84
+ # In[ ]:
85
+
86
+
87
+
88
+
ovrawm/t_nocd.txt ADDED
@@ -0,0 +1,112 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Overlapping Celltype annotation with GNN
5
+ # Droplet based single cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale for such high dimensional data without compromising accuracy are scarce.
6
+ #
7
+ # This tutorial shows how to cluster cells into overlapping communities and identify cells with multiple fates.
8
+ #
9
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1l7iHdVmTQcv9YmpIjhK_UzLuHbJ1Jv9E?usp=sharing
10
+ #
11
+ # <div class="admonition warning">
12
+ # <p class="admonition-title">Warning</p>
13
+ # <p>
14
+ # NOCD's development is still in progress. The current version may not fully reproduce the original implementation’s results.
15
+ # </p>
16
+ # </div>
17
+
18
+ # ## Part.1 Data preprocess
19
+ #
20
+ # In this part, we perform preliminary processing of the data, such as normalization and logarithmization, in order to make the data more interpretable
21
+
22
+ # In[1]:
23
+
24
+
25
+ import omicverse as ov
26
+ import anndata
27
+ import scanpy as sc
28
+ import matplotlib.pyplot as plt
29
+ import numpy as np
30
+ import pandas as pd
31
+ get_ipython().run_line_magic('matplotlib', 'inline')
32
+
33
+
34
+ # In[2]:
35
+
36
+
37
+ #param for visualization
38
+ sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
39
+ sc.settings.set_figure_params(dpi=80, facecolor='white')
40
+
41
+
42
+ # In[3]:
43
+
44
+
45
+ from matplotlib.colors import LinearSegmentedColormap
46
+ sc_color=['#7CBB5F','#368650','#A499CC','#5E4D9A','#78C2ED','#866017', '#9F987F','#E0DFED',
47
+ '#EF7B77', '#279AD7','#F0EEF0', '#1F577B', '#A56BA7', '#E0A7C8', '#E069A6', '#941456', '#FCBC10',
48
+ '#EAEFC5', '#01A0A7', '#75C8CC', '#F0D7BC', '#D5B26C', '#D5DA48', '#B6B812', '#9DC3C3', '#A89C92', '#FEE00C', '#FEF2A1']
49
+ sc_color_cmap = LinearSegmentedColormap.from_list('Custom', sc_color, len(sc_color))
50
+
51
+
52
+ # In[4]:
53
+
54
+
55
+ adata = anndata.read_h5ad('sample/rna.h5ad')
56
+ adata
57
+
58
+
59
+ # In[5]:
60
+
61
+
62
+ adata=ov.single.scanpy_lazy(adata)
63
+
64
+
65
+ # ## Part.2 Overlapping Community Detection
66
+ # In this part, we apply a graph neural network (GNN)-based model for overlapping community detection in scRNA-seq data.
67
+ #
68
+ # ![https://www.cs.cit.tum.de/fileadmin/w00cfj/daml/nocd/nocd.png](https://www.cs.cit.tum.de/fileadmin/w00cfj/daml/nocd/nocd.png)
69
+
70
+ # In[6]:
71
+
72
+
73
+ scbrca=ov.single.scnocd(adata)
74
+ scbrca.matrix_transform()
75
+ scbrca.matrix_normalize()
76
+ scbrca.GNN_configure()
77
+ scbrca.GNN_preprocess()
78
+ scbrca.GNN_model()
79
+ scbrca.GNN_result()
80
+ scbrca.GNN_plot()
81
+ #scbrca.calculate_nocd()
82
+ scbrca.cal_nocd()
83
+
84
+
85
+ # In[8]:
86
+
87
+
88
+ scbrca.calculate_nocd()
89
+
90
+
91
+ # ## Part.3 Visualization
92
+ # In this part, we visualize the overlapping and non-overlapping cells.
93
+
94
+ # In[9]:
95
+
96
+
97
+ sc.pl.umap(scbrca.adata, color=['leiden','nocd'],wspace=0.4,palette=sc_color)
98
+
99
+
100
+ # A value of zero marks cells associated with overlapping communities.
101
+
102
+ # In[10]:
103
+
104
+
105
+ sc.pl.umap(scbrca.adata, color=['leiden','nocd_n'],wspace=0.4,palette=sc_color)
106
+
107
+
108
+ # In[ ]:
109
+
110
+
111
+
112
+
ovrawm/t_preprocess.txt ADDED
@@ -0,0 +1,421 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Preprocessing the data of scRNA-seq with omicverse
5
+ #
6
+ # The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
7
+ #
8
+ # Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce several preprocessing steps that make downstream analysis easier.
9
+ #
10
+ # Users can compare this tutorial with [scanpy's tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) to learn how to use omicverse well.
11
+ #
12
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing
13
+
14
+ # In[1]:
15
+
16
+
17
+ import omicverse as ov
18
+ import scanpy as sc
19
+ ov.ov_plot_set()
20
+
21
+
22
+ # The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
23
+
24
+ # In[2]:
25
+
26
+
27
+ # !mkdir data
28
+ # !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
29
+ # !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
30
+ # !mkdir write
31
+
32
+
33
+ # In[3]:
34
+
35
+
36
+ adata = sc.read_10x_mtx(
37
+ 'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
38
+ var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
39
+ cache=True) # write a cache file for faster subsequent reading
40
+ adata
41
+
42
+
43
+ # In[4]:
44
+
45
+
46
+ adata.var_names_make_unique()
47
+ adata.obs_names_make_unique()
48
+
49
+
50
+ # ## Preprocessing
51
+ #
52
+ # ### Quality control
53
+ #
54
+ # Single-cell data require quality control prior to analysis, including the removal of doublets, low-expression cells, and low-expression genes. In addition, we need to filter on the mitochondrial gene ratio, the number of transcripts, the number of genes expressed per cell, cellular complexity, etc. For a detailed description of the different QC metrics, please see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
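+ #
+ # To make the filtering logic concrete, here is a minimal sketch of per-cell threshold filtering. It is not the `ov.pp.qc` implementation: the helper `passes_qc` is hypothetical, the key names simply mirror the `tresh` dictionary used in this tutorial, and the direction of each comparison (an upper bound on the mitochondrial fraction, lower bounds on UMIs and detected genes) is our assumption.

```python
def passes_qc(cell, tresh):
    """Return True if a cell satisfies all three thresholds (assumed directions)."""
    return (
        cell["mito_perc"] <= tresh["mito_perc"]               # not dominated by mitochondrial reads
        and cell["nUMIs"] >= tresh["nUMIs"]                    # enough transcripts captured
        and cell["detected_genes"] >= tresh["detected_genes"]  # enough genes detected
    )

tresh = {"mito_perc": 0.05, "nUMIs": 500, "detected_genes": 250}

# Toy per-cell QC metrics
cells = {
    "cell_a": {"mito_perc": 0.02, "nUMIs": 2400, "detected_genes": 900},
    "cell_b": {"mito_perc": 0.12, "nUMIs": 3100, "detected_genes": 1100},  # high mitochondrial fraction
    "cell_c": {"mito_perc": 0.01, "nUMIs": 300, "detected_genes": 180},    # too few UMIs and genes
}

kept = [name for name, stats in cells.items() if passes_qc(stats, tresh)]
print(kept)
```

+ #
+ # Only `cell_a` survives here; the other two fail on the mitochondrial fraction and on sequencing depth, respectively.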
55
+
56
+ # In[5]:
57
+
58
+
59
+ adata=ov.pp.qc(adata,
60
+ tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
61
+ adata
62
+
63
+
64
+ # ### Highly variable gene detection
65
+ #
66
+ # Here we use the Pearson residuals method to identify highly variable genes. This approach has been proposed as superior to ordinary normalisation. See this [article](https://www.nature.com/articles/s41592-023-01814-1#MOESM3) in *Nature Methods* for details.
67
+ #
68
+
69
+ # Sometimes we need to recover the original counts for some single-cell calculations, but storing them in a layer may result in missing data, so we provide two functions here, a store function and a retrieve function, to save and restore the original data.
70
+ #
71
+ # When we set `layers='counts'`, the counts are stored in `adata.uns['layers_counts']`.
72
+
73
+ # In[6]:
74
+
75
+
76
+ ov.utils.store_layers(adata,layers='counts')
77
+ adata
78
+
79
+
80
+ # normalize|HVGs: We use `|` to separate the two preprocessing steps. The part before `|` selects the normalisation method, either `shiftlog` or `pearson`; the part after `|` selects the highly variable gene method, either `pearson` or `seurat`. The default is `shiftlog|pearson`.
81
+ #
82
+ # - if you use `mode`=`shiftlog|pearson`, you need to set `target_sum=50*1e4`; many people prefer `target_sum=1e4`, but in our tests `50*1e4` gave better results
83
+ # - if you use `mode`=`pearson|pearson`, you don't need to set `target_sum`
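+ #
+ # As an illustration of the two ideas above, the sketch below splits a mode string on `|` and applies a shifted-log normalisation to toy counts. We assume here that `shiftlog` means depth normalisation to `target_sum` followed by `log(1 + x)`; the function `shiftlog` is a hypothetical stand-in and not omicverse's code.

```python
import math

def shiftlog(counts, target_sum=50 * 1e4):
    """Depth-normalize one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts]

# The mode string carries two choices separated by `|`
mode = "shiftlog|pearson"
normalize_step, hvg_step = mode.split("|")

cell_shallow = [1, 0, 3, 6]    # 10 UMIs in total
cell_deep = [10, 0, 30, 60]    # same composition, sequenced 10x deeper

norm_shallow = shiftlog(cell_shallow)
norm_deep = shiftlog(cell_deep)
print(normalize_step, hvg_step)
print(norm_shallow)
```

+ #
+ # Both toy cells end up with the same normalised profile, which is the point of depth normalisation: differences in sequencing depth are removed before the log transform. A larger `target_sum` only changes the scale at which the `+1` shift takes effect.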
84
+ #
85
+ # <div class="admonition warning">
86
+ # <p class="admonition-title">Note</p>
87
+ # <p>
88
+ # If the version of `omicverse` is lower than `1.4.13`, the mode can only be set to `scanpy` or `pearson`.
89
+ # </p>
90
+ # </div>
91
+ #
92
+
93
+ # In[7]:
94
+
95
+
96
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
97
+ adata
98
+
99
+
100
+ # Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
101
+
102
+ # In[8]:
103
+
104
+
105
+ adata.raw = adata
106
+ adata = adata[:, adata.var.highly_variable_features]
107
+ adata
108
+
109
+
110
+ # We find that the adata.X matrix is normalized at this point, including the data in raw, but we want to get the unnormalized data, so we can use the retrieve function `ov.utils.retrieve_layers`
111
+
112
+ # In[9]:
113
+
114
+
115
+ adata_counts=adata.copy()
116
+ ov.utils.retrieve_layers(adata_counts,layers='counts')
117
+ print('normalize adata:',adata.X.max())
118
+ print('raw count adata:',adata_counts.X.max())
119
+
120
+
121
+ # In[10]:
122
+
123
+
124
+ adata_counts
125
+
126
+
127
+ # If we wish to recover the original count matrix at the whole gene level, we can try the following code
128
+
129
+ # In[11]:
130
+
131
+
132
+ adata_counts=adata.raw.to_adata().copy()
133
+ ov.utils.retrieve_layers(adata_counts,layers='counts')
134
+ print('normalize adata:',adata.X.max())
135
+ print('raw count adata:',adata_counts.X.max())
136
+ adata_counts
137
+
138
+
139
+ # ## Principal component analysis
140
+ #
141
+ # In contrast to scanpy, we do not overwrite the original expression matrix with variance-scaled values; instead, we store the scaled result in a layer. Scaling can change the distribution of the data, and we have not found it to be useful in any scenario other than principal component analysis.
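+ #
+ # The design choice can be sketched with a toy stand-in for an AnnData object. Everything here is illustrative: the dictionary `toy`, the helper `scale_gene`, the clipping value, and the use of the population standard deviation are assumptions, not the behaviour of `ov.pp.scale`.

```python
import statistics

def scale_gene(values, max_value=10.0):
    """Z-score one gene across cells, clipping large positive values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant genes carry no variance to scale
    return [min((v - mean) / sd, max_value) for v in values]

# Toy stand-in for an AnnData object: X holds log-normalized values (gene -> cells)
toy = {"X": {"CST3": [0.0, 2.0, 4.0], "NKG7": [1.0, 1.0, 1.0]}, "layers": {}}

# Write the scaled values to a separate layer instead of overwriting X
toy["layers"]["scaled"] = {gene: scale_gene(vals) for gene, vals in toy["X"].items()}
print(toy["layers"]["scaled"])
```

+ #
+ # Because `X` is left untouched, later steps such as marker gene ranking still see the log-normalized values, while PCA can read the scaled layer.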
142
+
143
+ # In[12]:
144
+
145
+
146
+ ov.pp.scale(adata)
147
+ adata
148
+
149
+
150
+ # If you want to perform PCA on the normlog layer, you can set `layer`=`normlog`, but we think scaling is necessary for PCA.
151
+
152
+ # In[13]:
153
+
154
+
155
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
156
+ adata
157
+
158
+
159
+ # In[14]:
160
+
161
+
162
+ adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']
163
+ ov.utils.embedding(adata,
164
+ basis='X_pca',
165
+ color='CST3',
166
+ frameon='small')
167
+
168
+
169
+ # ## Embedding the neighborhood graph
170
+ #
171
+ # We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
172
+
173
+ # In[15]:
174
+
175
+
176
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
177
+ use_rep='scaled|original|X_pca')
178
+
179
+
180
+ # To visualize the PCA’s embeddings, we use the `pymde` package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
181
+
182
+ # In[16]:
183
+
184
+
185
+ adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
186
+ adata
187
+
188
+
189
+ # In[17]:
190
+
191
+
192
+ ov.utils.embedding(adata,
193
+ basis='X_mde',
194
+ color='CST3',
195
+ frameon='small')
196
+
197
+
198
+ # You can also use `umap` to visualize the neighborhood graph
199
+
200
+ # In[18]:
201
+
202
+
203
+ sc.tl.umap(adata)
204
+
205
+
206
+ # In[19]:
207
+
208
+
209
+ ov.utils.embedding(adata,
210
+ basis='X_umap',
211
+ color='CST3',
212
+ frameon='small')
213
+
214
+
215
+ # ## Clustering the neighborhood graph
216
+ #
217
+ # As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
218
+
219
+ # In[20]:
220
+
221
+
222
+ sc.tl.leiden(adata)
223
+
224
+
225
+ # We redesigned the embedding visualisation to distinguish it from scanpy's by adding the parameter `frameon='small'`, which causes the axes to be scaled with the colourbar
226
+
227
+ # In[21]:
228
+
229
+
230
+ ov.utils.embedding(adata,
231
+ basis='X_mde',
232
+ color=['leiden', 'CST3', 'NKG7'],
233
+ frameon='small')
234
+
235
+
236
+ # We also provide a boundary visualisation function `ov.utils.plot_ConvexHull` to visualise specific clusters.
237
+ #
238
+ # Arguments:
239
+ # - color: if None will use the color of clusters
240
+ # - alpha: default is 0.2
241
+
242
+ # In[23]:
243
+
244
+
245
+ import matplotlib.pyplot as plt
246
+ fig,ax=plt.subplots( figsize = (4,4))
247
+
248
+ ov.utils.embedding(adata,
249
+ basis='X_mde',
250
+ color=['leiden'],
251
+ frameon='small',
252
+ show=False,
253
+ ax=ax)
254
+
255
+ ov.utils.plot_ConvexHull(adata,
256
+ basis='X_mde',
257
+ cluster_key='leiden',
258
+ hull_cluster='0',
259
+ ax=ax)
260
+
261
+
262
+ # If you have too many labels, e.g. too many cell types, and you are concerned about label overlap, consider trying the `ov.utils.gen_mpl_labels` function, which reduces text overlap.
263
+ # In addition, we use matplotlib's `patheffects` module to give the text an outline.
264
+ #
265
+ # - adjust_kwargs: arguments passed to the `adjustText` package
266
+ # - text_kwargs: arguments accepted by matplotlib's `plt.text`, e.g. `fontsize` and `weight`
267
+
268
+ # In[67]:
269
+
270
+
271
+ from matplotlib import patheffects
272
+ import matplotlib.pyplot as plt
273
+ fig, ax = plt.subplots(figsize=(4,4))
274
+
275
+ ov.utils.embedding(adata,
276
+ basis='X_mde',
277
+ color=['leiden'],
278
+ show=False, legend_loc=None, add_outline=False,
279
+ frameon='small',legend_fontoutline=2,ax=ax
280
+ )
281
+
282
+ ov.utils.gen_mpl_labels(
283
+ adata,
284
+ 'leiden',
285
+ exclude=("None",),
286
+ basis='X_mde',
287
+ ax=ax,
288
+ adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
289
+ text_kwargs=dict(fontsize= 12 ,weight='bold',
290
+ path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
291
+ )
292
+
293
+
294
+ # In[47]:
295
+
296
+
297
+ marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
298
+ 'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
299
+ 'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
300
+
301
+
302
+ # In[48]:
303
+
304
+
305
+ sc.pl.dotplot(adata, marker_genes, groupby='leiden',
306
+ standard_scale='var');
307
+
308
+
309
+ # ## Finding marker genes
310
+ #
311
+ # Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
312
+
313
+ # In[49]:
314
+
315
+
316
+ sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
317
+ sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
318
+ method='t-test',use_raw=False,key_added='leiden_ttest')
319
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
320
+ cmap='Spectral_r',key='leiden_ttest',
321
+ standard_scale='var',n_genes=3)
322
+
323
+
324
+ # COSG is an alternative algorithm that its authors report to be accurate and fast at finding marker genes. omicverse provides a COSG calculation through `ov.single.cosg`.
325
+ #
326
+ # Paper: [Accurate and fast cell marker gene identification with COSG](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab579/6511197?redirectedFrom=fulltext)
327
+ #
328
+ # Code: https://github.com/genecell/COSG
329
+ #
330
+
331
+ # In[50]:
332
+
333
+
334
+ sc.tl.rank_genes_groups(adata, groupby='leiden',
335
+ method='t-test',use_rep='scaled|original|X_pca',)
336
+ ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
337
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
338
+ cmap='Spectral_r',key='leiden_cosg',
339
+ standard_scale='var',n_genes=3)
340
+
341
+
342
+ # ## Other plotting
343
+ #
344
+ # Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a `data_dict` and a `color_dict`, which must share the same keys.
345
+ #
346
+ # For `data_dict`, each key must map to a DataFrame containing the columns ['names','logfoldchanges','pvals_adj'], where `names` holds the gene names, `logfoldchanges` the log fold change of differential expression, and `pvals_adj` the adjusted p-value.
347
+ #
348
+
349
+ # In[51]:
350
+
351
+
352
+ data_dict={}
353
+ for i in adata.obs['leiden'].cat.categories:
354
+ data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
355
+ pval_cutoff=None,log2fc_min=None)
356
+
357
+
358
+ # In[65]:
359
+
360
+
361
+ data_dict.keys()
362
+
363
+
364
+ # In[64]:
365
+
366
+
367
+ data_dict[i].head()
368
+
369
+
370
+ # For `color_dict`, each key must map to the colour used to display that key.
371
+
372
+ # In[63]:
373
+
374
+
375
+ type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
376
+ adata.uns['leiden_colors']))
377
+ type_color_dict
378
+
379
+
380
+ # A number of parameters are available to customise the plot. Note that in omicverse versions earlier than 1.4.13, `stacking_vol` has a bug that fixes the vertical axis at [-15,15], so this tutorial adds some code to rescale it.
381
+ #
382
+ # - data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
383
+ # - color_dict: dict, in each key, there is a color for each omic
384
+ # - pval_threshold: float, pvalue threshold for significant genes
385
+ # - log2fc_threshold: float, log2fc threshold for significant genes
386
+ # - figsize: tuple, figure size
387
+ # - sig_color: str, color for significant genes
388
+ # - normal_color: str, color for non-significant genes
389
+ # - plot_genes_num: int, number of genes to plot
390
+ # - plot_genes_fontsize: int, fontsize for gene names
391
+ # - plot_genes_weight: str, weight for gene names
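+ # As a rough illustration of how `pval_threshold` and `log2fc_threshold` are typically used in a volcano plot, here is a minimal sketch. The helper `label_significant` is hypothetical and not part of omicverse; it assumes a gene counts as significant when its adjusted p-value is below the threshold and its absolute log fold change is above it, which may differ from what `stacking_vol` does internally.

```python
# Hypothetical helper (not an omicverse function): classify genes the way a
# volcano plot usually does, using the three columns required by data_dict.
def label_significant(rows, pval_threshold=0.01, log2fc_threshold=2):
    """Return {gene: 'up' | 'down' | 'normal'} for a list of row dicts."""
    labels = {}
    for row in rows:
        significant = row["pvals_adj"] < pval_threshold
        if significant and row["logfoldchanges"] > log2fc_threshold:
            labels[row["names"]] = "up"
        elif significant and row["logfoldchanges"] < -log2fc_threshold:
            labels[row["names"]] = "down"
        else:
            labels[row["names"]] = "normal"
    return labels


# Toy rows; the numbers are made up purely for illustration.
toy_rows = [
    {"names": "NKG7", "logfoldchanges": 4.1, "pvals_adj": 1e-30},
    {"names": "CST3", "logfoldchanges": -3.2, "pvals_adj": 1e-12},
    {"names": "IL7R", "logfoldchanges": 0.4, "pvals_adj": 0.3},
]
toy_labels = label_significant(toy_rows)
```

+ # In this sketch, genes labelled `up` or `down` would be drawn with `sig_color` and the rest with `normal_color`.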
392
+
393
+ # In[62]:
394
+
395
+
396
+ fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
397
+ pval_threshold=0.01,
398
+ log2fc_threshold=2,
399
+ figsize=(8,4),
400
+ sig_color='#a51616',
401
+ normal_color='#c7c7c7',
402
+ plot_genes_num=2,
403
+ plot_genes_fontsize=6,
404
+ plot_genes_weight='bold',
405
+ )
406
+
407
+ #The following code will be removed in future
408
+ y_min,y_max=0,0
409
+ for i in data_dict.keys():
410
+ y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
411
+ y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
412
+ for i in adata.obs['leiden'].cat.categories:
413
+ axes[i].set_ylim(y_min,y_max)
414
+ plt.suptitle('Stacking_vol',fontsize=12)
415
+
416
+
417
+ # In[ ]:
418
+
419
+
420
+
421
+
ovrawm/t_preprocess_cpu.txt ADDED
@@ -0,0 +1,404 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Preprocessing the data of scRNA-seq with omicverse
5
+ #
6
+ # The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
7
+ #
8
+ # Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce some preprocessing steps that make downstream analysis easier.
9
+ #
10
+ # Users can compare this tutorial with the [scanpy tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) to learn how to use omicverse well.
11
+ #
12
+ # Colab reproducibility: https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing
13
+
14
+ # In[1]:
15
+
16
+
17
+ import scanpy as sc
18
+ import omicverse as ov
19
+ ov.plot_set()
20
+
21
+
22
+ # The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
23
+
24
+ # In[ ]:
25
+
26
+
27
+ get_ipython().system('mkdir -p data')  # the download below writes into data/, so the directory must exist
28
+ get_ipython().system('wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz')
29
+ get_ipython().system('cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz')
30
+ # !mkdir write
31
+
32
+
33
+ # In[2]:
34
+
35
+
36
+ adata = sc.read_10x_mtx(
37
+ 'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
38
+ var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
39
+ cache=True) # write a cache file for faster subsequent reading
40
+ adata
41
+
42
+
43
+ # In[3]:
44
+
45
+
46
+ adata.var_names_make_unique()
47
+ adata.obs_names_make_unique()
48
+
49
+
50
+ # ## Preprocessing
51
+ #
52
+ # ### Quality control
53
+ #
54
+ # Single-cell data require quality control prior to analysis, including the removal of doublets, low-expressing cells, and low-expressing genes. In addition, we need to filter on the mitochondrial gene ratio, the number of transcripts, the number of genes expressed per cell, cellular complexity, etc. For a detailed description of the different QC metrics, please see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
55
+ #
56
+ # <div class="admonition warning">
57
+ # <p class="admonition-title">Note</p>
58
+ # <p>
59
+ # If the version of `omicverse` is higher than `1.6.4`, `doublets_method` can be set to either `scrublet` or `sccomposite`.
60
+ # </p>
61
+ # </div>
62
+ #
63
+ # COMPOSITE (COMpound POiSson multIplet deTEction model) is a computational tool for multiplet detection in both single-cell single-omics and multiomics settings. It has been implemented as an automated pipeline and is available as both a cloud-based application with a user-friendly interface and a Python package.
64
+ #
65
+ # Hu, H., Wang, X., Feng, S. et al. A unified model-based framework for doublet or multiplet detection in single-cell multiomics data. Nat Commun 15, 5562 (2024). https://doi.org/10.1038/s41467-024-49448-x
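+ # To make the meaning of the `tresh` dictionary concrete, here is a minimal sketch of threshold-based cell filtering. The comparison directions are an assumption (keep cells with a mitochondrial fraction at or below `mito_perc`, and with at least `nUMIs` transcripts and `detected_genes` genes); `passes_qc` and the toy cells are hypothetical, and `ov.pp.qc` additionally handles doublet detection, which is not shown.

```python
# Threshold values copied from the `tresh` argument used in this tutorial.
tresh = {"mito_perc": 0.2, "nUMIs": 500, "detected_genes": 250}


def passes_qc(cell, tresh):
    """Assumed rule: low mitochondrial fraction, enough UMIs and enough genes."""
    return (
        cell["mito_perc"] <= tresh["mito_perc"]
        and cell["nUMIs"] >= tresh["nUMIs"]
        and cell["detected_genes"] >= tresh["detected_genes"]
    )


# Toy per-cell metrics; the numbers are made up for illustration.
toy_cells = {
    "cell_a": {"mito_perc": 0.05, "nUMIs": 2400, "detected_genes": 900},
    "cell_b": {"mito_perc": 0.35, "nUMIs": 1800, "detected_genes": 700},  # too much mito
    "cell_c": {"mito_perc": 0.03, "nUMIs": 320, "detected_genes": 180},   # too few counts
}
kept = [name for name, cell in toy_cells.items() if passes_qc(cell, tresh)]
```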
66
+
67
+ # In[4]:
68
+
69
+
70
+ get_ipython().run_cell_magic('time', '', "adata=ov.pp.qc(adata,\n tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},\n doublets_method='sccomposite',\n batch_key=None)\nadata\n")
71
+
72
+
73
+ # ### Highly variable gene detection
74
+ #
75
+ # Here we use Pearson residuals to identify highly variable genes, an approach that has been proposed to be superior to ordinary normalisation. See the [article](https://www.nature.com/articles/s41592-023-01814-1#MOESM3) in *Nature Methods* for details.
76
+ #
77
+
78
+ # normalize|HVGs: We use `|` to separate the two preprocessing steps. The part before `|` selects the normalisation method, either `shiftlog` or `pearson`, and the part after `|` selects the highly variable gene method, either `pearson` or `seurat`. Our default is `shiftlog|pearson`.
79
+ #
80
+ # - if you use `mode`=`shiftlog|pearson`, you need to set `target_sum=50*1e4`; many people prefer `target_sum=1e4`, but in our tests `50*1e4` gave better results
81
+ # - if you use `mode`=`pearson|pearson`, you don't need to set `target_sum`
82
+ #
83
+ # <div class="admonition warning">
84
+ # <p class="admonition-title">Note</p>
85
+ # <p>
86
+ # If the version of `omicverse` is lower than `1.4.13`, the mode can only be set to `scanpy` or `pearson`.
87
+ # </p>
88
+ # </div>
89
+ #
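+ # The `normalize|HVGs` convention described above can be sketched as a small parser. `split_mode` is a hypothetical helper written only to illustrate the convention; it is not how omicverse necessarily parses `mode`, and the allowed values are taken from the text above.

```python
# Allowed values for each half of the mode string, as listed in the tutorial text.
NORMALIZE_METHODS = ("shiftlog", "pearson")
HVG_METHODS = ("pearson", "seurat")


def split_mode(mode):
    """Split 'normalize|HVGs' into its two steps and validate both halves."""
    parts = mode.split("|")
    if len(parts) != 2:
        raise ValueError(f"expected 'normalize|HVGs', got {mode!r}")
    normalize, hvg = parts
    if normalize not in NORMALIZE_METHODS:
        raise ValueError(f"unknown normalisation step: {normalize!r}")
    if hvg not in HVG_METHODS:
        raise ValueError(f"unknown HVG step: {hvg!r}")
    return normalize, hvg


normalize_step, hvg_step = split_mode("shiftlog|pearson")
```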
90
+
91
+ # In[5]:
92
+
93
+
94
+ get_ipython().run_cell_magic('time', '', "adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)\nadata\n")
95
+
96
+
97
+ # Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
98
+
99
+ # In[6]:
100
+
101
+
102
+ get_ipython().run_cell_magic('time', '', 'adata.raw = adata\nadata = adata[:, adata.var.highly_variable_features]\nadata\n')
103
+
104
+
105
+ # ## Principal component analysis
106
+ #
107
+ # In contrast to scanpy, we do not scale the variance of the original expression matrix in place; instead, we store the variance-scaled result in a layer. Scaling can change the data distribution, and we have not found it to be meaningful in any scenario other than principal component analysis.
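+ # The idea of keeping the original matrix untouched and storing the scaled values separately can be sketched with plain Python. This is only an illustration: the dictionary stands in for an AnnData object, `scale_columns` is a hypothetical helper, and the use of the sample standard deviation and a clipping value of 10 are assumptions rather than a description of `ov.pp.scale`.

```python
import statistics


def scale_columns(matrix, max_value=10.0):
    """Z-score each column (gene) and clip to +/- max_value; returns a new matrix."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.stdev(col) or 1.0 for col in columns]
    return [
        [max(-max_value, min(max_value, (value - mean) / std))
         for value, mean, std in zip(row, means, stds)]
        for row in matrix
    ]


# Toy stand-in for an AnnData object: rows are cells, columns are genes.
toy = {"X": [[1.0, 0.0], [2.0, 0.0], [3.0, 3.0]], "layers": {}}
toy["layers"]["scaled"] = scale_columns(toy["X"])  # X itself is left unchanged
```

+ # Downstream steps that need unscaled values can keep reading `X`, while PCA reads the `scaled` layer.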
108
+
109
+ # In[7]:
110
+
111
+
112
+ get_ipython().run_cell_magic('time', '', 'ov.pp.scale(adata)\nadata\n')
113
+
114
+
115
+ # If you want to perform PCA on the normlog layer, you can set `layer`=`normlog`, but we think scaling is necessary for PCA.
116
+
117
+ # In[8]:
118
+
119
+
120
+ get_ipython().run_cell_magic('time', '', "ov.pp.pca(adata,layer='scaled',n_pcs=50)\nadata\n")
121
+
122
+
123
+ # In[9]:
124
+
125
+
126
+ adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']
127
+ ov.pl.embedding(adata,
128
+ basis='X_pca',
129
+ color='CST3',
130
+ frameon='small')
131
+
132
+
133
+ # ## Embedding the neighborhood graph
134
+ #
135
+ # We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
136
+
137
+ # In[10]:
138
+
139
+
140
+ get_ipython().run_cell_magic('time', '', "ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,\n use_rep='scaled|original|X_pca')\n")
141
+
142
+
143
+ # You can also use `umap` to visualize the neighborhood graph.
144
+
145
+ # In[11]:
146
+
147
+
148
+ get_ipython().run_cell_magic('time', '', 'ov.pp.umap(adata)\n')
149
+
150
+
151
+ # In[12]:
152
+
153
+
154
+ ov.pl.embedding(adata,
155
+ basis='X_umap',
156
+ color='CST3',
157
+ frameon='small')
158
+
159
+
160
+ # To visualize the PCA’s embeddings, we use the `pymde` package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
161
+
162
+ # In[13]:
163
+
164
+
165
+ ov.pp.mde(adata,embedding_dim=2,n_neighbors=15, basis='X_mde',
166
+ n_pcs=50, use_rep='scaled|original|X_pca',)
167
+
168
+
169
+ # In[14]:
170
+
171
+
172
+ ov.pl.embedding(adata,
173
+ basis='X_mde',
174
+ color='CST3',
175
+ frameon='small')
176
+
177
+
178
+ # ## Score cell cycle
179
+ #
180
+ # In OmicVerse, both the G1/S and G2/M gene sets (human and mouse) are built into the function, so you can run cell cycle analysis without having to enter the cycle genes manually!
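+ # Once an S score and a G2/M score have been computed for each cell, a phase label can be derived from them. The sketch below follows the rule used by scanpy's `score_genes_cell_cycle` (G1 when both scores are negative, otherwise the phase with the higher score); that omicverse applies exactly the same rule is an assumption, and `assign_phase` is a hypothetical helper.

```python
def assign_phase(s_score, g2m_score):
    """Assign a cell cycle phase from the two gene-set scores."""
    if s_score < 0 and g2m_score < 0:
        return "G1"  # neither gene set is expressed above background
    return "G2M" if g2m_score > s_score else "S"


# Toy scores (made up) for three cells.
toy_scores = [(-0.2, -0.1), (0.4, 0.1), (0.1, 0.6)]
toy_phases = [assign_phase(s, g2m) for s, g2m in toy_scores]
```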
181
+
182
+ # In[19]:
183
+
184
+
185
+ adata_raw=adata.raw.to_adata()
186
+ ov.pp.score_genes_cell_cycle(adata_raw,species='human')
187
+
188
+
189
+ # In[21]:
190
+
191
+
192
+ ov.pl.embedding(adata_raw,
193
+ basis='X_mde',
194
+ color='phase',
195
+ frameon='small')
196
+
197
+
198
+ # ## Clustering the neighborhood graph
199
+ #
200
+ # As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
201
+
202
+ # In[22]:
203
+
204
+
205
+ ov.pp.leiden(adata,resolution=1)
206
+
207
+
208
+ # We redesigned the embedding visualisation to distinguish it from scanpy's by adding the option `frameon='small'`, which causes the axes to be scaled with the colourbar.
209
+
210
+ # In[23]:
211
+
212
+
213
+ ov.pl.embedding(adata,
214
+ basis='X_mde',
215
+ color=['leiden', 'CST3', 'NKG7'],
216
+ frameon='small')
217
+
218
+
219
+ # We also provide a boundary visualisation function, `ov.pl.ConvexHull` (called `ov.utils.plot_ConvexHull` in the earlier tutorial), to visualise specific clusters.
220
+ #
221
+ # Arguments:
222
+ # - color: if None will use the color of clusters
223
+ # - alpha: default is 0.2
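+ # For readers unfamiliar with the term, a convex hull is the smallest convex polygon that encloses a set of points, here the 2D embedding coordinates of one cluster. The sketch below uses Andrew's monotone chain algorithm in plain Python; it only illustrates the concept and is not the implementation used by omicverse. `convex_hull` is a hypothetical helper.

```python
def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Four corner cells plus one interior cell of a toy cluster.
toy_hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

+ # The interior point is dropped, and the remaining vertices outline the boundary that would be filled with `color` at transparency `alpha`.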
224
+
225
+ # In[24]:
226
+
227
+
228
+ import matplotlib.pyplot as plt
229
+ fig,ax=plt.subplots( figsize = (4,4))
230
+
231
+ ov.pl.embedding(adata,
232
+ basis='X_mde',
233
+ color=['leiden'],
234
+ frameon='small',
235
+ show=False,
236
+ ax=ax)
237
+
238
+ ov.pl.ConvexHull(adata,
239
+ basis='X_mde',
240
+ cluster_key='leiden',
241
+ hull_cluster='0',
242
+ ax=ax)
243
+
244
+
245
+ # If you have too many labels, e.g. too many cell types, and you are concerned about labels overlapping, consider the `ov.utils.gen_mpl_labels` function, which reduces text overlap.
246
+ # In addition, we use matplotlib's `patheffects` module to give the text an outline.
247
+ #
248
+ # - adjust_kwargs: the available options are documented in the `adjustText` package
249
+ # - text_kwargs: the available options are those of matplotlib text objects (`matplotlib.text.Text`)
250
+
251
+ # In[25]:
252
+
253
+
254
+ from matplotlib import patheffects
255
+ import matplotlib.pyplot as plt
256
+ fig, ax = plt.subplots(figsize=(4,4))
257
+
258
+ ov.pl.embedding(adata,
259
+ basis='X_mde',
260
+ color=['leiden'],
261
+ show=False, legend_loc=None, add_outline=False,
262
+ frameon='small',legend_fontoutline=2,ax=ax
263
+ )
264
+
265
+ ov.utils.gen_mpl_labels(
266
+ adata,
267
+ 'leiden',
268
+ exclude=("None",),
269
+ basis='X_mde',
270
+ ax=ax,
271
+ adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
272
+ text_kwargs=dict(fontsize= 12 ,weight='bold',
273
+ path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
274
+ )
275
+
276
+
277
+ # In[26]:
278
+
279
+
280
+ marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
281
+ 'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
282
+ 'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
283
+
284
+
285
+ # In[27]:
286
+
287
+
288
+ sc.pl.dotplot(adata, marker_genes, groupby='leiden',
289
+ standard_scale='var');
290
+
291
+
292
+ # ## Finding marker genes
293
+ #
294
+ # Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
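+ # To show what the t-test ranks on, here is a minimal sketch of Welch's t statistic for one gene, comparing the cells of one cluster against all other cells. It assumes the unequal-variance (Welch) form; `welch_t` is a hypothetical helper and the expression values are made up. The real function also computes p-values and log fold changes for every gene.

```python
import math
import statistics


def welch_t(group, rest):
    """Welch's t statistic: difference in means over its standard error."""
    mean_diff = statistics.mean(group) - statistics.mean(rest)
    standard_error = math.sqrt(
        statistics.variance(group) / len(group)
        + statistics.variance(rest) / len(rest)
    )
    return mean_diff / standard_error


# Toy expression of one gene in the cluster of interest and in all other cells.
in_cluster = [5.0, 6.0, 7.0]
other_cells = [1.0, 2.0, 3.0, 2.0]
t_stat = welch_t(in_cluster, other_cells)
```

+ # Genes with a large positive statistic are expressed more strongly inside the cluster and end up at the top of its ranking.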
295
+
296
+ # In[28]:
297
+
298
+
299
+ sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
300
+ sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
301
+ method='t-test',use_raw=False,key_added='leiden_ttest')
302
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
303
+ cmap='Spectral_r',key='leiden_ttest',
304
+ standard_scale='var',n_genes=3)
305
+
306
+
307
+ # COSG is an alternative algorithm that its authors report to be accurate and fast at finding marker genes. omicverse provides a COSG calculation through `ov.single.cosg`.
308
+ #
309
+ # Paper: [Accurate and fast cell marker gene identification with COSG](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab579/6511197?redirectedFrom=fulltext)
310
+ #
311
+ # Code: https://github.com/genecell/COSG
312
+ #
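+ # The core idea behind COSG, as we understand it, is to compare each gene's expression vector with an idealised marker that is switched on only in the target cluster, using cosine similarity. The sketch below shows just that comparison; it is a deliberate simplification, since the published score also penalises similarity to the other clusters. `cosine_similarity` and `cluster_indicator` are hypothetical helpers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_indicator(labels, target):
    """Idealised marker: 1 in cells of the target cluster, 0 elsewhere."""
    return [1.0 if label == target else 0.0 for label in labels]


# Toy data: four cells, two clusters, and two genes (values are made up).
toy_cluster_labels = ["0", "0", "1", "1"]
specific_gene = [3.0, 4.0, 0.0, 0.0]    # expressed only in cluster "0"
ubiquitous_gene = [2.0, 2.0, 2.0, 2.0]  # expressed everywhere

ideal = cluster_indicator(toy_cluster_labels, "0")
specific_score = cosine_similarity(specific_gene, ideal)
ubiquitous_score = cosine_similarity(ubiquitous_gene, ideal)
```

+ # The cluster-specific gene scores close to 1, while the ubiquitous gene scores lower, which is why it would rank behind it as a marker.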
313
+
314
+ # In[29]:
315
+
316
+
317
+ sc.tl.rank_genes_groups(adata, groupby='leiden',
318
+ method='t-test',use_rep='scaled|original|X_pca',)
319
+ ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
320
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
321
+ cmap='Spectral_r',key='leiden_cosg',
322
+ standard_scale='var',n_genes=3)
323
+
324
+
325
+ # ## Other plotting
326
+ #
327
+ # Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a `data_dict` and a `color_dict`, which must share the same keys.
328
+ #
329
+ # For `data_dict`, each key must map to a DataFrame containing the columns ['names','logfoldchanges','pvals_adj'], where `names` holds the gene names, `logfoldchanges` the log fold change of differential expression, and `pvals_adj` the adjusted p-value.
330
+ #
331
+
332
+ # In[51]:
333
+
334
+
335
+ data_dict={}
336
+ for i in adata.obs['leiden'].cat.categories:
337
+ data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
338
+ pval_cutoff=None,log2fc_min=None)
339
+
340
+
341
+ # In[65]:
342
+
343
+
344
+ data_dict.keys()
345
+
346
+
347
+ # In[64]:
348
+
349
+
350
+ data_dict[i].head()
351
+
352
+
353
+ # For `color_dict`, each key must map to the colour used to display that key.
354
+
355
+ # In[63]:
356
+
357
+
358
+ type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
359
+ adata.uns['leiden_colors']))
360
+ type_color_dict
361
+
362
+
363
+ # A number of parameters are available to customise the plot. Note that in omicverse versions earlier than 1.4.13, `stacking_vol` has a bug that fixes the vertical axis at [-15,15], so this tutorial adds some code to rescale it.
364
+ #
365
+ # - data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
366
+ # - color_dict: dict, in each key, there is a color for each omic
367
+ # - pval_threshold: float, pvalue threshold for significant genes
368
+ # - log2fc_threshold: float, log2fc threshold for significant genes
369
+ # - figsize: tuple, figure size
370
+ # - sig_color: str, color for significant genes
371
+ # - normal_color: str, color for non-significant genes
372
+ # - plot_genes_num: int, number of genes to plot
373
+ # - plot_genes_fontsize: int, fontsize for gene names
374
+ # - plot_genes_weight: str, weight for gene names
375
+
376
+ # In[62]:
377
+
378
+
379
+ fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
380
+ pval_threshold=0.01,
381
+ log2fc_threshold=2,
382
+ figsize=(8,4),
383
+ sig_color='#a51616',
384
+ normal_color='#c7c7c7',
385
+ plot_genes_num=2,
386
+ plot_genes_fontsize=6,
387
+ plot_genes_weight='bold',
388
+ )
389
+
390
+ #The following code will be removed in future
391
+ y_min,y_max=0,0
392
+ for i in data_dict.keys():
393
+ y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
394
+ y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
395
+ for i in adata.obs['leiden'].cat.categories:
396
+ axes[i].set_ylim(y_min,y_max)
397
+ plt.suptitle('Stacking_vol',fontsize=12)
398
+
399
+
400
+ # In[ ]:
401
+
402
+
403
+
404
+
ovrawm/t_preprocess_gpu.txt ADDED
@@ -0,0 +1,416 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Preprocessing the data of scRNA-seq with omicverse[GPU]
5
+ #
6
+ # The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
7
+ #
8
+ # Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce some preprocessing steps that make downstream analysis easier.
9
+ #
10
+ # Users can compare this tutorial with the [scanpy tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) to learn how to use omicverse well.
11
+ #
12
+ # Colab reproducibility: https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing
13
+
14
+ # ## Installation
15
+ #
16
+ # Note that the GPU module is not bundled with omicverse and needs to be installed separately. For a detailed [tutorial](https://rapids-singlecell.readthedocs.io/en/latest/index.html), see [https://rapids-singlecell.readthedocs.io/en/latest/index.html](https://rapids-singlecell.readthedocs.io/en/latest/index.html)
17
+ #
18
+ # ### pip
19
+ # ```shell
20
+ # pip install rapids-singlecell
21
+ # #rapids
22
+ # pip install \
23
+ # --extra-index-url=https://pypi.nvidia.com \
24
+ # cudf-cu12==24.4.* dask-cudf-cu12==24.4.* cuml-cu12==24.4.* \
25
+ # cugraph-cu12==24.4.* cuspatial-cu12==24.4.* cuproj-cu12==24.4.* \
26
+ # cuxfilter-cu12==24.4.* cucim-cu12==24.4.* pylibraft-cu12==24.4.* \
27
+ # raft-dask-cu12==24.4.* cuvs-cu12==24.4.*
28
+ # #cupy
29
+ # pip install cupy-cuda12x
30
+ # ```
31
+ #
32
+ # ### conda-env
33
+ # Note that, to avoid conflicts, we recommend installing rapids-singlecell before installing omicverse.
34
+ #
35
+ # The easiest way to install rapids-singlecell is to use one of the yaml files provided in the [conda](https://github.com/Starlitnightly/omicverse/tree/master/conda) folder. These yaml files install everything needed to run the example notebooks and get you started.
36
+ # ```shell
37
+ # conda env create -f conda/omicverse_gpu.yml
38
+ # # or
39
+ # mamba env create -f conda/omicverse_gpu.yml
40
+ # ```
41
+
42
+ # In[1]:
43
+
44
+
45
+ import omicverse as ov
46
+ import scanpy as sc
47
+ ov.plot_set()
48
+ ov.settings.gpu_init()
49
+
50
+
51
+ # The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
52
+
53
+ # In[2]:
54
+
55
+
56
+ # !mkdir data
57
+ #!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
58
+ #!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
59
+ # !mkdir write
60
+
61
+
62
+ # In[3]:
63
+
64
+
65
+ adata = sc.read_10x_mtx(
66
+ 'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
67
+ var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
68
+ cache=True) # write a cache file for faster subsequent reading
69
+ adata
70
+
71
+
72
+ # In[4]:
73
+
74
+
75
+ adata.var_names_make_unique()
76
+ adata.obs_names_make_unique()
77
+
78
+
79
+ # ## Preprocessing
80
+ #
81
+ # ### Quality control
82
+ #
83
+ # Single-cell data require quality control prior to analysis, including the removal of doublets, low-expressing cells, and low-expressing genes. In addition, we need to filter on the mitochondrial gene ratio, the number of transcripts, the number of genes expressed per cell, cellular complexity, etc. For a detailed description of the different QC metrics, please see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
84
+
85
+ # In[5]:
86
+
87
+
88
+ ov.pp.anndata_to_GPU(adata)
89
+
90
+
91
+ # In[6]:
92
+
93
+
94
+ get_ipython().run_cell_magic('time', '', "adata=ov.pp.qc(adata,\n tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},\n batch_key=None)\nadata\n")
95
+
96
+
97
+ # ### Highly variable gene detection
98
+ #
99
+ # Here we use Pearson residuals to identify highly variable genes, an approach that has been proposed to be superior to ordinary normalisation. See the [article](https://www.nature.com/articles/s41592-023-01814-1#MOESM3) in *Nature Methods* for details.
100
+ #
101
+
102
+ # normalize|HVGs: We use `|` to separate the two preprocessing steps. The part before `|` selects the normalisation method, either `shiftlog` or `pearson`, and the part after `|` selects the highly variable gene method, either `pearson` or `seurat`. Our default is `shiftlog|pearson`.
103
+ #
104
+ # - if you use `mode`=`shiftlog|pearson`, you need to set `target_sum=50*1e4`; many people prefer `target_sum=1e4`, but in our tests `50*1e4` gave better results
105
+ # - if you use `mode`=`pearson|pearson`, you don't need to set `target_sum`
106
+ #
107
+ # <div class="admonition warning">
108
+ # <p class="admonition-title">Note</p>
109
+ # <p>
110
+ # If the version of `omicverse` is lower than `1.4.13`, the mode can only be set to `scanpy` or `pearson`.
111
+ # </p>
112
+ # </div>
113
+ #
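+ # To illustrate what `target_sum` controls in the `shiftlog` step, here is a minimal sketch for a single cell. It assumes that `shiftlog` means scaling the cell's counts so that they sum to `target_sum` and then applying `log(1 + x)`; `shiftlog_cell` is a hypothetical helper, not the omicverse implementation.

```python
import math


def shiftlog_cell(counts, target_sum=50 * 1e4):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # an empty cell stays all-zero
    return [math.log1p(count / total * target_sum) for count in counts]


# Toy cell with three genes; a small target_sum keeps the arithmetic readable.
toy_normalised = shiftlog_cell([1, 1, 2], target_sum=4)
```

+ # A larger `target_sum` stretches the non-zero values further away from zero before the logarithm is taken, which is the effect the two recommended settings differ in.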
114
+
115
+ # In[7]:
116
+
117
+
118
+ get_ipython().run_cell_magic('time', '', "adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)\nadata\n")
119
+
120
+
121
+ # Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
122
+
123
+ # In[8]:
124
+
125
+
126
+ adata.raw = adata
127
+ adata = adata[:, adata.var.highly_variable_features]
128
+ adata
129
+
130
+
131
+ # ## Principal component analysis
132
+ #
133
+ # In contrast to scanpy, we do not scale the variance of the original expression matrix in place; instead, we store the variance-scaled result in a layer. Scaling can change the data distribution, and we have not found it to be meaningful in any scenario other than principal component analysis.
134
+
135
+ # In[9]:
136
+
137
+
138
+ get_ipython().run_cell_magic('time', '', 'ov.pp.scale(adata)\nadata\n')
139
+
140
+
141
+ # If you want to perform PCA on the normlog layer, you can set `layer`=`normlog`, but we think scaling is necessary for PCA.
142
+
143
+ # In[10]:
144
+
145
+
146
+ get_ipython().run_cell_magic('time', '', "ov.pp.pca(adata,layer='scaled',n_pcs=50)\nadata\n")
147
+
148
+
149
+ # In[11]:
150
+
151
+
152
+ adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']
153
+ ov.utils.embedding(adata,
154
+ basis='X_pca',
155
+ color='CST3',
156
+ frameon='small')
157
+
158
+
159
+ # ## Embedding the neighborhood graph
160
+ #
161
+ # We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
162
+
163
+ # In[23]:
164
+
165
+
166
+ get_ipython().run_cell_magic('time', '', "ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,\n use_rep='scaled|original|X_pca',method='cagra')\n")
167
+
168
+
169
+ # To visualize the PCA’s embeddings, we use the `pymde` package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
170
+
171
+ # In[19]:
172
+
173
+
174
+ adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
175
+ adata
176
+
177
+
178
+ # In[20]:
179
+
180
+
181
+ ov.pl.embedding(adata,
182
+ basis='X_mde',
183
+ color='CST3',
184
+ frameon='small')
185
+
186
+
187
+ # You can also use `umap` to visualize the neighborhood graph.
188
+
189
+ # In[21]:
190
+
191
+
192
+ ov.pp.umap(adata)
193
+
194
+
195
+ # In[22]:
196
+
197
+
198
+ ov.pl.embedding(adata,
199
+ basis='X_umap',
200
+ color='CST3',
201
+ frameon='small')
202
+
203
+
204
+ # ## Clustering the neighborhood graph
205
+ #
206
+ # As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
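+ # Modularity, the quantity mentioned above, measures how many edges fall inside communities compared with what would be expected by chance. The sketch below computes it for a tiny unweighted graph in plain Python; `modularity` is a hypothetical helper that only illustrates the objective and is not the Leiden algorithm itself, which searches for a partition with a high score.

```python
def modularity(edges, membership):
    """Modularity of a partition of an undirected, unweighted graph."""
    n_edges = len(edges)
    degree = {}
    internal_edges = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if membership[u] == membership[v]:
            community = membership[u]
            internal_edges[community] = internal_edges.get(community, 0) + 1
    degree_sum = {}
    for node, node_degree in degree.items():
        community = membership[node]
        degree_sum[community] = degree_sum.get(community, 0) + node_degree
    # Observed fraction of internal edges minus the fraction expected by chance.
    return sum(
        internal_edges.get(community, 0) / n_edges
        - (degree_sum[community] / (2 * n_edges)) ** 2
        for community in degree_sum
    )


# Two triangles of cells joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

q_two = modularity(toy_edges, two_clusters)
q_one = modularity(toy_edges, one_cluster)
```

+ # Splitting the graph into its two triangles scores higher than lumping all cells together, which is the kind of partition a modularity-based method prefers.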
207
+
208
+ # In[24]:
209
+
210
+
211
+ ov.pp.leiden(adata)
212
+
213
+
214
+ # In[30]:
215
+
216
+
217
+ ov.pp.anndata_to_CPU(adata)
218
+
219
+
220
+ # We redesigned the embedding visualisation to distinguish it from scanpy's by adding the option `frameon='small'`, which causes the axes to be scaled with the colourbar.
221
+
222
+ # In[25]:
223
+
224
+
225
+ ov.pl.embedding(adata,
226
+ basis='X_mde',
227
+ color=['leiden', 'CST3', 'NKG7'],
228
+ frameon='small')
229
+
230
+
231
+ # We also provide a boundary visualisation function, `ov.pl.ConvexHull` (called `ov.utils.plot_ConvexHull` in the earlier tutorial), to visualise specific clusters.
232
+ #
233
+ # Arguments:
234
+ # - color: if None will use the color of clusters
235
+ # - alpha: default is 0.2
236
+
237
+ # In[26]:
238
+
239
+
240
+ import matplotlib.pyplot as plt
241
+ fig,ax=plt.subplots( figsize = (4,4))
242
+
243
+ ov.pl.embedding(adata,
244
+ basis='X_mde',
245
+ color=['leiden'],
246
+ frameon='small',
247
+ show=False,
248
+ ax=ax)
249
+
250
+ ov.pl.ConvexHull(adata,
251
+ basis='X_mde',
252
+ cluster_key='leiden',
253
+ hull_cluster='0',
254
+ ax=ax)
255
+
256
+
257
+ # If you have too many labels, e.g. too many cell types, and you are concerned about labels overlapping, consider the `ov.utils.gen_mpl_labels` function, which reduces text overlap.
258
+ # In addition, we use matplotlib's `patheffects` module to give the text an outline.
259
+ #
260
+ # - adjust_kwargs: keyword arguments documented in the `adjustText` package
261
+ # - text_kwargs: keyword arguments documented for matplotlib's `plt.text`
262
+
263
+ # In[27]:
264
+
265
+
266
+ from matplotlib import patheffects
267
+ import matplotlib.pyplot as plt
268
+ fig, ax = plt.subplots(figsize=(4,4))
269
+
270
+ ov.pl.embedding(adata,
271
+ basis='X_mde',
272
+ color=['leiden'],
273
+ show=False, legend_loc=None, add_outline=False,
274
+ frameon='small',legend_fontoutline=2,ax=ax
275
+ )
276
+
277
+ ov.utils.gen_mpl_labels(
278
+ adata,
279
+ 'leiden',
280
+ exclude=("None",),
281
+ basis='X_mde',
282
+ ax=ax,
283
+ adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
284
+ text_kwargs=dict(fontsize= 12 ,weight='bold',
285
+ path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
286
+ )
287
+
288
+
289
+ # In[28]:
290
+
291
+
292
+ marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
293
+ 'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
294
+ 'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
295
+
296
+
297
+ # In[29]:
298
+
299
+
300
+ sc.pl.dotplot(adata, marker_genes, groupby='leiden',
301
+ standard_scale='var');
302
+
303
+
304
+ # ## Finding marker genes
305
+ #
306
+ # Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
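To make the idea of a t-test ranking concrete, here is a minimal sketch in plain Python. It computes a Welch-style t statistic (two groups, unequal variances) for each gene, comparing cells inside a cluster with all other cells, and then sorts genes by that score. The helper `welch_t`, the gene names, and the expression values are hypothetical, and scanpy's own variance handling may differ in detail.

```python
import math
import statistics

def welch_t(in_group, out_group):
    """Welch's t statistic for one gene: cluster cells vs. all other cells."""
    n1, n2 = len(in_group), len(out_group)
    m1, m2 = statistics.mean(in_group), statistics.mean(out_group)
    v1, v2 = statistics.variance(in_group), statistics.variance(out_group)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical log-normalised expression: (cells in cluster, cells outside).
expr = {
    "GeneA": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "GeneB": ([2.0, 3.0, 4.0], [2.0, 3.0, 5.0]),
}

scores = {gene: welch_t(inside, outside) for gene, (inside, outside) in expr.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # best marker first
print(ranking, scores)
```

A gene that is much higher inside the cluster gets a large positive score and rises to the top of the ranking; a gene with similar expression in both groups scores near zero.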
307
+
308
+ # In[31]:
309
+
310
+
311
+ sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
312
+ sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
313
+ method='t-test',use_raw=False,key_added='leiden_ttest')
314
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
315
+ cmap='Spectral_r',key='leiden_ttest',
316
+ standard_scale='var',n_genes=3)
317
+
318
+
319
+ # COSG is another algorithm for finding marker genes; its authors describe it as accurate and fast. omicverse provides a COSG implementation as `ov.single.cosg`.
320
+ #
321
+ # Paper: [Accurate and fast cell marker gene identification with COSG](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab579/6511197?redirectedFrom=fulltext)
322
+ #
323
+ # Code: https://github.com/genecell/COSG
324
+ #
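The core intuition behind a cosine-similarity marker score can be sketched in a few lines. The snippet below compares each gene's expression vector with an "ideal marker" vector that is 1 in the target cluster and 0 elsewhere. This is a simplified illustration under that assumption, not the COSG scoring formula itself, which also accounts for similarity to the other clusters; the `cosine` helper, labels, and expression values are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

labels = ["0", "0", "1", "1"]  # hypothetical cluster label of each cell
genes = {
    "specific": [3.0, 3.0, 0.0, 0.0],    # expressed only in cluster "0"
    "ubiquitous": [2.0, 2.0, 2.0, 2.0],  # expressed everywhere
}

# Ideal marker for cluster "0": on inside the cluster, off outside it.
indicator = [1.0 if label == "0" else 0.0 for label in labels]
sims = {gene: cosine(values, indicator) for gene, values in genes.items()}
print(sims)
```

The cluster-specific gene aligns perfectly with the indicator, while the ubiquitous gene scores lower, regardless of how strongly either is expressed overall.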
325
+
326
+ # In[32]:
327
+
328
+
329
+ sc.tl.rank_genes_groups(adata, groupby='leiden',
330
+ method='t-test',use_rep='scaled|original|X_pca',)
331
+ ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
332
+ sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
333
+ cmap='Spectral_r',key='leiden_cosg',
334
+ standard_scale='var',n_genes=3)
335
+
336
+
337
+ # ## Other plotting
338
+ #
339
+ # Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a `data_dict` and a `color_dict`, both of which have the same key requirements.
340
+ #
341
+ # For `data_dict`, each key must map to a DataFrame containing the columns ['names','logfoldchanges','pvals_adj'], where `names` holds the gene names, `logfoldchanges` the log fold change of differential expression, and `pvals_adj` the adjusted p-value.
342
+ #
343
+
344
+ # In[33]:
345
+
346
+
347
+ data_dict={}
348
+ for i in adata.obs['leiden'].cat.categories:
349
+ data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
350
+ pval_cutoff=None,log2fc_min=None)
351
+
352
+
353
+ # In[34]:
354
+
355
+
356
+ data_dict.keys()
357
+
358
+
359
+ # In[35]:
360
+
361
+
362
+ data_dict[i].head()
363
+
364
+
365
+ # For `color_dict`, each key must map to the colour to display for that key.
366
+
367
+ # In[36]:
368
+
369
+
370
+ type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
371
+ adata.uns['leiden_colors']))
372
+ type_color_dict
373
+
374
+
375
+ # There are a number of parameters available here for us to customise the settings. Note that when drawing stacking_vol with omicverse version less than 1.4.13, there is a bug that the vertical coordinate is constant at [-15,15], so we have added some code in this tutorial for visualisation.
376
+ #
377
+ # - data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
378
+ # - color_dict: dict, in each key, there is a color for each omic
379
+ # - pval_threshold: float, pvalue threshold for significant genes
380
+ # - log2fc_threshold: float, log2fc threshold for significant genes
381
+ # - figsize: tuple, figure size
382
+ # - sig_color: str, color for significant genes
383
+ # - normal_color: str, color for non-significant genes
384
+ # - plot_genes_num: int, number of genes to plot
385
+ # - plot_genes_fontsize: int, fontsize for gene names
386
+ # - plot_genes_weight: str, weight for gene names
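The two thresholds in the list above decide which genes are drawn as significant. The sketch below shows one plausible way such a split could work: a gene counts as significant when its adjusted p-value is below `pval_threshold` and its absolute log fold change exceeds `log2fc_threshold`. The `classify` helper and the exact inequalities are assumptions for illustration; `stacking_vol` may apply the cut-offs differently.

```python
def classify(rows, pval_threshold=0.01, log2fc_threshold=2):
    """Label each gene 'up', 'down' or 'normal' from its fold change and adjusted p-value."""
    labels = {}
    for row in rows:
        significant = row["pvals_adj"] < pval_threshold
        if significant and row["logfoldchanges"] > log2fc_threshold:
            labels[row["names"]] = "up"
        elif significant and row["logfoldchanges"] < -log2fc_threshold:
            labels[row["names"]] = "down"
        else:
            labels[row["names"]] = "normal"
    return labels

# Hypothetical rows using the same column names as the DataFrames in data_dict.
rows = [
    {"names": "G1", "logfoldchanges": 3.1, "pvals_adj": 1e-5},
    {"names": "G2", "logfoldchanges": -2.7, "pvals_adj": 1e-3},
    {"names": "G3", "logfoldchanges": 4.0, "pvals_adj": 0.2},   # large change, not significant
    {"names": "G4", "logfoldchanges": 0.5, "pvals_adj": 1e-8},  # significant, small change
]
labels = classify(rows)
print(labels)
```

Genes labelled `up` or `down` would take `sig_color`, and the rest `normal_color`.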
387
+
388
+ # In[37]:
389
+
390
+
391
+ fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
392
+ pval_threshold=0.01,
393
+ log2fc_threshold=2,
394
+ figsize=(8,4),
395
+ sig_color='#a51616',
396
+ normal_color='#c7c7c7',
397
+ plot_genes_num=2,
398
+ plot_genes_fontsize=6,
399
+ plot_genes_weight='bold',
400
+ )
401
+
402
+ #The following code will be removed in future
403
+ y_min,y_max=0,0
404
+ for i in data_dict.keys():
405
+ y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
406
+ y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
407
+ for i in adata.obs['leiden'].cat.categories:
408
+ axes[i].set_ylim(y_min,y_max)
409
+ plt.suptitle('Stacking_vol',fontsize=12)
410
+
411
+
412
+ # In[ ]:
413
+
414
+
415
+
416
+
ovrawm/t_scdeg.txt ADDED
@@ -0,0 +1,316 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Differential expression analysis in single cell
5
+ #
6
+ # Sometimes we need to compare differentially expressed genes or differentially expressed features between two cell types on single cell data, but existing methods focus more on cell-specific gene analysis. Researchers need to transfer bulk RNA-seq analysis to single-cell analysis, which involves interaction between different programming languages or programming tools, adding significantly to the workload of the researcher.
7
+ #
8
+ # Here, we use omicverse's bulk RNA-seq pyDEG method to complete differential expression analysis at the single cell level. We will present two different perspectives, one using all cells and one using metacells.
9
+ #
10
+ # Colab_Reproducibility:https://colab.research.google.com/drive/12faBRh0xT7v6KSy8NCSRqbegF_AEoDXr?usp=sharing
11
+
12
+ # In[1]:
13
+
14
+
15
+ import omicverse as ov
16
+ import scanpy as sc
17
+ import scvelo as scv
18
+
19
+ ov.utils.ov_plot_set()
20
+
21
+
22
+ # ## Data preprocessing
23
+ #
24
+ # We first need to normalize and scale the data.
25
+
26
+ # In[2]:
27
+
28
+
29
+ adata = scv.datasets.pancreas()
30
+ adata
31
+
32
+
33
+ # In[3]:
34
+
35
+
36
+ adata.X.max()
37
+
38
+
39
+ # The maximum value of the AnnData matrix is larger than 10 and its type is int, which suggests raw counts. We need to normalize and log1p-transform it.
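As a small illustration of why the maximum value is a useful check, the sketch below applies library-size normalisation followed by `log1p` to one cell's raw counts. The `shift_log` helper and the `target_sum` of 1e4 are assumptions chosen for the example; they are not necessarily the settings that `ov.pp.preprocess` uses.

```python
import math

def shift_log(counts, target_sum=1e4):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts]

cell = [1, 1, 2]              # hypothetical raw counts for three genes
normalised = shift_log(cell)
print(normalised)
```

After the transform, values sit well below 10 even though the scaled counts were in the thousands, which is why a maximum above 10 hints that the matrix still holds raw counts.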
40
+
41
+ # In[4]:
42
+
43
+
44
+ #quality control
45
+ adata=ov.pp.qc(adata,
46
+ tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
47
+ #normalize and calculate highly variable genes (HVGs)
48
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
49
+
50
+ #save the whole genes and filter the non-HVGs
51
+ adata.raw = adata
52
+ adata = adata[:, adata.var.highly_variable_features]
53
+
54
+ #scale the adata.X
55
+ ov.pp.scale(adata)
56
+
57
+ #Dimensionality Reduction
58
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
59
+
60
+
61
+ # In[5]:
62
+
63
+
64
+ adata.X.max()
65
+
66
+
67
+ # ## Differential expression at the total level
68
+ #
69
+ # We then select the target cells to be analysed, including `Alpha` and `Beta`, derive the expression matrix using `to_df()` and build the differential expression analysis module using `pyDEG`
70
+
71
+ # In[6]:
72
+
73
+
74
+ test_adata=adata[adata.obs['clusters'].isin(['Alpha','Beta'])]
75
+ test_adata
76
+
77
+
78
+ # In[7]:
79
+
80
+
81
+ dds=ov.bulk.pyDEG(test_adata.to_df(layer='lognorm').T)
82
+
83
+
84
+ # In[8]:
85
+
86
+
87
+ dds.drop_duplicates_index()
88
+ print('... drop_duplicates_index success')
89
+
90
+
91
+ # We also need to set up an experimental group and a control group, i.e. the two types of cells to be compared and analysed
92
+
93
+ # In[9]:
94
+
95
+
96
+ treatment_groups=test_adata.obs[test_adata.obs['clusters']=='Alpha'].index.tolist()
97
+ control_groups=test_adata.obs[test_adata.obs['clusters']=='Beta'].index.tolist()
98
+ result=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
99
+
100
+
101
+ # In[10]:
102
+
103
+
104
+ result.sort_values('qvalue').head()
105
+
106
+
107
+ # In[11]:
108
+
109
+
110
+ # -1 means automatically calculates
111
+ dds.foldchange_set(fc_threshold=-1,
112
+ pval_threshold=0.05,
113
+ logp_max=10)
114
+
115
+
116
+ # In[12]:
117
+
118
+
119
+ dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
120
+ plot_genes_num=8,plot_genes_fontsize=12,)
121
+
122
+
123
+ # In[13]:
124
+
125
+
126
+ dds.plot_boxplot(genes=['Irx1','Adra2a'],treatment_groups=treatment_groups,
127
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
128
+ legend_bbox=(2,0.55))
129
+
130
+
131
+ # In[14]:
132
+
133
+
134
+ ov.utils.embedding(adata,
135
+ basis='X_umap',
136
+ frameon='small',
137
+ color=['clusters','Irx1','Adra2a'])
138
+
139
+
140
+ # ## Differential expression at the metacell level
141
+ #
142
+ # Here, we calculate metacells from the whole scRNA-seq dataset using SEACells, and run the same analysis as at the total level.
143
+
144
+ # ### Constructing a metacell object
145
+ #
146
+ # We can use `ov.single.MetaCell` to construct a metacell object for training the SEACells model. The arguments are listed below.
147
+ #
148
+ # - :param ad: (AnnData) annotated data matrix
149
+ # - :param build_kernel_on: (str) key corresponding to matrix in ad.obsm which is used to compute kernel for metacells
150
+ # Typically 'X_pca' for scRNA or 'X_svd' for scATAC
151
+ # - :param n_SEACells: (int) number of SEACells to compute
152
+ # - :param use_gpu: (bool) whether to use GPU for computation
153
+ # - :param verbose: (bool) whether to suppress verbose program logging
154
+ # - :param n_waypoint_eigs: (int) number of eigenvectors to use for waypoint initialization
155
+ # - :param n_neighbors: (int) number of nearest neighbors to use for graph construction
156
+ # - :param convergence_epsilon: (float) convergence threshold for Frank-Wolfe algorithm
157
+ # - :param l2_penalty: (float) L2 penalty for Frank-Wolfe algorithm
158
+ # - :param max_franke_wolfe_iters: (int) maximum number of iterations for Frank-Wolfe algorithm
159
+ # - :param use_sparse: (bool) whether to use sparse matrix operations. Currently only supported for CPU implementation.
160
+
161
+ # In[15]:
162
+
163
+
164
+ meta_obj=ov.single.MetaCell(adata,use_rep='scaled|original|X_pca',n_metacells=150,
165
+ use_gpu=True)
166
+
167
+
168
+ # In[16]:
169
+
170
+
171
+ meta_obj.initialize_archetypes()
172
+
173
+
174
+ # ## Train and save the model
175
+
176
+ # In[17]:
177
+
178
+
179
+ meta_obj.train(min_iter=10, max_iter=50)
180
+
181
+
182
+ # In[34]:
183
+
184
+
185
+ meta_obj.save('seacells/model.pkl')
186
+
187
+
188
+ # In[ ]:
189
+
190
+
191
+ meta_obj.load('seacells/model.pkl')
192
+
193
+
194
+ # ## Predicting the metacells
195
+ #
196
+ # We can use `predicted` to predict the metacells of the raw scRNA-seq data. Two methods are available: `soft` and `hard`.
197
+ #
198
+ # The `soft` method aggregates cells within each SEACell, summing raw data x assignment weight over all cells belonging to a SEACell. Data is un-normalized and pseudo-raw aggregated counts are stored in .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per-SEACell attributes must be copied manually, since certain attributes may need to be summed, averaged, etc., depending on the attribute.
199
+ #
200
+ # The `hard` method aggregates cells within each SEACell, summing raw data over all cells belonging to a SEACell. Data is un-normalized and raw aggregated counts are stored in .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per-SEACell attributes must be copied manually, since certain attributes may need to be summed, averaged, etc., depending on the attribute.
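The difference between the two aggregation modes can be sketched with plain lists. In this toy version, `weights` holds each cell's assignment weight to every metacell: `soft` adds weight times count to every metacell, while `hard` gives each cell entirely to its highest-weight metacell. The `aggregate` helper, the tie-breaking rule (first maximum wins), and all numbers are assumptions for illustration, not the SEACells implementation.

```python
def aggregate(counts, weights, method="soft"):
    """Sum cell counts into metacells.

    counts:  cells x genes raw counts
    weights: cells x metacells assignment weights
    """
    n_meta, n_genes = len(weights[0]), len(counts[0])
    out = [[0.0] * n_genes for _ in range(n_meta)]
    for cell_counts, cell_weights in zip(counts, weights):
        if method == "hard":
            # The cell contributes fully to its highest-weight metacell.
            best = max(range(n_meta), key=lambda k: cell_weights[k])
            cell_weights = [1.0 if k == best else 0.0 for k in range(n_meta)]
        for k, w in enumerate(cell_weights):
            for g, c in enumerate(cell_counts):
                out[k][g] += w * c
    return out

counts = [[4, 0], [2, 2], [0, 6]]                  # three cells, two genes
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]     # middle cell is shared
hard = aggregate(counts, weights, "hard")
soft = aggregate(counts, weights, "soft")
print(hard, soft)
```

Only the shared middle cell is treated differently: `hard` assigns all of its counts to one metacell, whereas `soft` splits them in proportion to the weights.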
201
+
202
+ # In[19]:
203
+
204
+
205
+ ad=meta_obj.predicted(method='soft',celltype_label='clusters',
206
+ summarize_layer='lognorm')
207
+
208
+
209
+ # In[20]:
210
+
211
+
212
+ ad.X.min(),ad.X.max()
213
+
214
+
215
+ # In[21]:
216
+
217
+
218
+ import matplotlib.pyplot as plt
219
+ fig, ax = plt.subplots(figsize=(4,4))
220
+ ov.utils.embedding(
221
+ meta_obj.adata,
222
+ basis="X_umap",
223
+ color=['clusters'],
224
+ frameon='small',
225
+ title="Meta cells",
226
+ #legend_loc='on data',
227
+ legend_fontsize=14,
228
+ legend_fontoutline=2,
229
+ size=10,
230
+ ax=ax,
231
+ alpha=0.2,
232
+ #legend_loc='',
233
+ add_outline=False,
234
+ #add_outline=True,
235
+ outline_color='black',
236
+ outline_width=1,
237
+ show=False,
238
+ #palette=ov.utils.blue_color[:],
239
+ #legend_fontweight='normal'
240
+ )
241
+ ov.single._metacell.plot_metacells(ax,meta_obj.adata,color='#CB3E35',
242
+ )
243
+
244
+
245
+ # ### Differential expression analysis
246
+ #
247
+ # We use metacells for differential expression analysis in the same way as for total cells.
248
+
249
+ # In[23]:
250
+
251
+
252
+ test_adata=ad[ad.obs['celltype'].isin(['Alpha','Beta'])]
253
+ test_adata
254
+
255
+
256
+ # In[24]:
257
+
258
+
259
+ dds_meta=ov.bulk.pyDEG(test_adata.to_df().T)
260
+
261
+
262
+ # In[25]:
263
+
264
+
265
+ dds_meta.drop_duplicates_index()
266
+ print('... drop_duplicates_index success')
267
+
268
+
269
+ # We also need to set up an experimental group and a control group, i.e. the two types of cells to be compared and analysed
270
+
271
+ # In[27]:
272
+
273
+
274
+ treatment_groups=test_adata.obs[test_adata.obs['celltype']=='Alpha'].index.tolist()
275
+ control_groups=test_adata.obs[test_adata.obs['celltype']=='Beta'].index.tolist()
276
+ result=dds_meta.deg_analysis(treatment_groups,control_groups,method='ttest')
277
+
278
+
279
+ # In[28]:
280
+
281
+
282
+ result.sort_values('qvalue').head()
283
+
284
+
285
+ # In[29]:
286
+
287
+
288
+ # -1 means automatically calculates
289
+ dds_meta.foldchange_set(fc_threshold=-1,
290
+ pval_threshold=0.05,
291
+ logp_max=10)
292
+
293
+
294
+ # In[30]:
295
+
296
+
297
+ dds_meta.plot_volcano(title='DEG Analysis',figsize=(4,4),
298
+ plot_genes_num=8,plot_genes_fontsize=12,)
299
+
300
+
301
+ # In[31]:
302
+
303
+
304
+ dds_meta.plot_boxplot(genes=['Ctxn2','Mnx1'],treatment_groups=treatment_groups,
305
+ control_groups=control_groups,figsize=(2,3),fontsize=12,
306
+ legend_bbox=(2,0.55))
307
+
308
+
309
+ # In[32]:
310
+
311
+
312
+ ov.utils.embedding(adata,
313
+ basis='X_umap',
314
+ frameon='small',
315
+ color=['clusters','Ctxn2','Mnx1'])
316
+
ovrawm/t_scdrug.txt ADDED
@@ -0,0 +1,225 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Drug response prediction with scDrug
5
+ #
6
+ # scDrug is a workflow that predicts the drug sensitivity of single cells from existing drug-response databases. In the downstream tasks of single cell analysis, especially in tumours, we are particularly interested in potential drugs and combination therapies. To this end, we have integrated scDrug's IC50 prediction with inferCNV-based identification of tumour cells to build a drug screening pipeline.
7
+ #
8
+ # Paper: [scDrug: From single-cell RNA-seq to drug response prediction](https://www.sciencedirect.com/science/article/pii/S2001037022005505)
9
+ #
10
+ # Code: https://github.com/ailabstw/scDrug
11
+ #
12
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1mayoMO7I7qjYIRjrZEi8r5zuERcxAEcF?usp=sharing
13
+
14
+ # In[1]:
15
+
16
+
17
+ import omicverse as ov
18
+ import scanpy as sc
19
+ import infercnvpy as cnv
20
+ import matplotlib.pyplot as plt
21
+ import os
22
+
23
+ sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
24
+ sc.settings.set_figure_params(dpi=80, facecolor='white')
25
+
26
+
27
+ # ## Infer the Tumor from scRNA-seq
28
+ #
29
+ # Here we use Infercnvpy's example data to complete the tumour analysis, you can also refer to the official tutorial for this step: https://infercnvpy.readthedocs.io/en/latest/notebooks/tutorial_3k.html
30
+ #
31
+ # infercnvpy needs the genomic position of every gene, so we provide a utility function `ov.utils.get_gene_annotation` to supplement the coordinate information from GTF files. The following usage assumes that the adata.var_names correspond to the “gene_name” attribute in the GTF file. For other cases, please check the function documentation.
32
+ #
33
+ # The GTF file used here can be downloaded from [GENCODE](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/).
34
+ #
35
+ # The T2T-CHM13 GTF file can be downloaded from [figshare](https://figshare.com/ndownloader/files/40628072)
36
+
37
+ # In[3]:
38
+
39
+
40
+ adata = cnv.datasets.maynard2020_3k()
41
+
42
+ ov.utils.get_gene_annotation(
43
+ adata, gtf="gencode.v43.basic.annotation.gtf.gz",
44
+ gtf_by="gene_name"
45
+ )
46
+
47
+
48
+ # In[ ]:
49
+
50
+
51
+ adata=adata[:,~adata.var['chrom'].isnull()]
52
+ adata.var['chromosome']=adata.var['chrom']
53
+ adata.var['start']=adata.var['chromStart']
54
+ adata.var['end']=adata.var['chromEnd']
55
+ adata.var['ensg']=adata.var['gene_id']
56
+ adata.var.loc[:, ["ensg", "chromosome", "start", "end"]].head()
57
+
58
+
59
+ # Note that infercnvpy expects the matrix to be normalized and log-transformed first.
60
+
61
+ # In[4]:
62
+
63
+
64
+ adata
65
+
66
+
67
+ # We use the immune cells as the reference and infer the CNV score of each cell in the scRNA-seq data.
68
+
69
+ # In[5]:
70
+
71
+
72
+ # We provide all immune cell types as "normal cells".
73
+ cnv.tl.infercnv(
74
+ adata,
75
+ reference_key="cell_type",
76
+ reference_cat=[
77
+ "B cell",
78
+ "Macrophage",
79
+ "Mast cell",
80
+ "Monocyte",
81
+ "NK cell",
82
+ "Plasma cell",
83
+ "T cell CD4",
84
+ "T cell CD8",
85
+ "T cell regulatory",
86
+ "mDC",
87
+ "pDC",
88
+ ],
89
+ window_size=250,
90
+ )
91
+ cnv.tl.pca(adata)
92
+ cnv.pp.neighbors(adata)
93
+ cnv.tl.leiden(adata)
94
+ cnv.tl.umap(adata)
95
+ cnv.tl.cnv_score(adata)
96
+
97
+
98
+ # In[6]:
99
+
100
+
101
+ sc.pl.umap(adata, color="cnv_score", show=False)
102
+
103
+
104
+ # We set an appropriate threshold for the cnv_score, here we set it to 0.03 and identify cells greater than 0.03 as tumour cells
105
+
106
+ # In[7]:
107
+
108
+
109
+ adata.obs["cnv_status"] = "normal"
110
+ adata.obs.loc[
111
+ adata.obs["cnv_score"]>0.03, "cnv_status"
112
+ ] = "tumor"
113
+
114
+
115
+ # In[8]:
116
+
117
+
118
+ sc.pl.umap(adata, color="cnv_status", show=False)
119
+
120
+
121
+ # We extract the tumour cells separately for drug response prediction.
122
+
123
+ # In[11]:
124
+
125
+
126
+ tumor=adata[adata.obs['cnv_status']=='tumor']
127
+ tumor.X.max()
128
+
129
+
130
+ # ## Tumor preprocessing
131
+ #
132
+ # We need to extract the highly variable genes in the tumour cells for further analysis and identify the sub-clusters within the tumour.
133
+
134
+ # In[12]:
135
+
136
+
137
+ adata=tumor
138
+ print('Preprocessing...')
139
+ sc.pp.filter_cells(adata, min_genes=200)
140
+ sc.pp.filter_genes(adata, min_cells=3)
141
+ adata.var['mt'] = adata.var_names.str.startswith('MT-')
142
+ sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
143
+ if not (adata.obs.pct_counts_mt == 0).all():
144
+ adata = adata[adata.obs.pct_counts_mt < 30, :]
145
+
146
+ adata.raw = adata.copy()
147
+
148
+ sc.pp.highly_variable_genes(adata)
149
+ adata = adata[:, adata.var.highly_variable]
150
+ sc.pp.scale(adata)
151
+ sc.tl.pca(adata, svd_solver='arpack')
152
+
153
+
154
+ # In[13]:
155
+
156
+
157
+ sc.pp.neighbors(adata, n_pcs=20)
158
+ sc.tl.umap(adata)
159
+
160
+
161
+ # Here, we need to download the scDrug database and models so that the subsequent predictions can be made properly.
162
+
163
+ # In[27]:
164
+
165
+
166
+ ov.utils.download_GDSC_data()
167
+ ov.utils.download_CaDRReS_model()
168
+
169
+
170
+ # Then, we apply Single-Cell Data Analysis once again to carry out sub-clustering on the tumor clusters at automatically determined resolution.
171
+
172
+ # In[18]:
173
+
174
+
175
+ adata, res,plot_df = ov.single.autoResolution(adata,cpus=4)
176
+
177
+
178
+ # Don't forget to save your data
179
+
180
+ # In[20]:
181
+
182
+
183
+ results_file = os.path.join('./', 'scanpyobj.h5ad')
184
+ adata.write(results_file)
185
+
186
+
187
+ # In[21]:
188
+
189
+
190
+ results_file = os.path.join('./', 'scanpyobj.h5ad')
191
+ adata=sc.read(results_file)
192
+
193
+
194
+ # ## IC50 prediction
195
+ #
196
+ # Drug Response Prediction examines the scanpyobj.h5ad generated in Single-Cell Data Analysis and reports cluster-wise IC50 and cell death percentages for drugs in the GDSC database via CaDRReS-Sc (a recommender system framework for in silico drug response prediction), or drug sensitivity AUC in the PRISM database from [DepMap Portal PRISM-19Q4](https://doi.org/10.1038/s43018-019-0018-6).
197
+ #
198
+ # Note that we need to download CaDRReS-Sc from GitHub with `git clone https://github.com/CSB5/CaDRReS-Sc`
199
+
200
+ # In[24]:
201
+
202
+
203
+ get_ipython().system('git clone https://github.com/CSB5/CaDRReS-Sc')
204
+
205
+
206
+ # To run drug response prediction, we need to set:
207
+ #
208
+ # - scriptpath: the CaDRReS-Sc path we downloaded just now
209
+ # - modelpath: the model path we downloaded just now
210
+ # - output: the path where the drug response prediction results are saved
211
+
212
+ # In[25]:
213
+
214
+
215
+ import omicverse as ov
216
+ job=ov.single.Drug_Response(adata,scriptpath='CaDRReS-Sc',
217
+ modelpath='models/',
218
+ output='result')
219
+
220
+
221
+ # In[ ]:
222
+
223
+
224
+
225
+
ovrawm/t_scmulan.txt ADDED
@@ -0,0 +1,199 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # ## Using scMulan to annotate cell types in Heart, Lung, Liver, Bone marrow, Blood, Brain, and Thymus
5
+
6
+ # In this study, the authors enrich the pre-training paradigm by integrating an abundance of metadata and a multiplicity of pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. They represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words of tuples, each consisting of entities and their corresponding values. They construct a unified generative framework to model the cell language on c-sentence and design three pretraining tasks to bridge the microscopic and macroscopic information within the c-sentences. They pre-train scMulan on 10 million single-cell transcriptomic data and their corresponding metadata, with 368 million parameters. As a single model, scMulan can accomplish tasks zero-shot for cell type annotation, batch integration, and conditional cell generation, guided by different task prompts.
7
+
8
+ # #### We provide a liver dataset sampled (20% of cells) from Suo C, 2022 (doi/10.1126/science.abo0510)
9
+ # **Paper:** [scMulan: a multitask generative pre-trained language model for single-cell analysis](https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1)
10
+ # **Data download:** https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
11
+ # **Pre-train model download:** https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
12
+ #
13
+ # If you found this tutorial helpful, please cite scMulan and OmicVerse:
14
+ # Bian H, Chen Y, Dong X, et al. scMulan: a multitask generative pre-trained language model for single-cell analysis[C]//International Conference on Research in Computational Molecular Biology. Cham: Springer Nature Switzerland, 2024: 479-482.
15
+
16
+ # In[36]:
17
+
18
+
19
+ import os
20
+ #os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # if use CPU only
21
+ import scanpy as sc
22
+ import omicverse as ov
23
+ ov.plot_set()
24
+ #import scMulan
25
+ #from scMulan import GeneSymbolUniform
26
+
27
+
28
+ # ## 1. load h5ad
29
+ # You can download the liver dataset from the following link: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
30
+ #
31
+ # We recommend using an h5ad file with raw counts here (after your own QC)
32
+ #
33
+
34
+ # In[4]:
35
+
36
+
37
+ adata = sc.read('./data/liver_test.h5ad')
38
+
39
+
40
+ # In[5]:
41
+
42
+
43
+ adata
44
+
45
+
46
+ # In[6]:
47
+
48
+
49
+ from scipy.sparse import csc_matrix
50
+ adata.X = csc_matrix(adata.X)
51
+
52
+
53
+ # ## 2. transform original h5ad with uniformed genes (42117 genes)
54
+
55
+ # This step maps the genes in the input adata to 42117 gene symbols and preserves the corresponding gene expression values. The gene symbols are the same as those used by the pre-trained scMulan model.
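Conceptually, this step reorders each cell's expression onto a fixed reference gene list. The sketch below shows that idea for a single cell stored as a dictionary. The `uniform_genes` helper, the four-gene reference list, and the choice to fill missing genes with zero and drop genes outside the reference are assumptions for illustration; the real `GeneSymbolUniform` may also resolve gene aliases.

```python
def uniform_genes(cell_expr, reference_genes):
    """Reorder one cell's expression onto a fixed gene list, zero-filling absent genes."""
    return [cell_expr.get(gene, 0.0) for gene in reference_genes]

# Stand-in for the 42117 reference symbols (hypothetical short list).
reference_genes = ["ALB", "CD3E", "NKG7", "PTPRC"]

# One cell's measured genes; XIST is not part of the reference list.
cell_expr = {"NKG7": 3.0, "ALB": 1.0, "XIST": 2.0}

vector = uniform_genes(cell_expr, reference_genes)
print(vector)
```

Every cell ends up as a vector of the same length and gene order, which is what lets a pre-trained model read datasets that were profiled with different gene sets.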
56
+
57
+ # In[7]:
58
+
59
+
60
+ adata_GS_uniformed = ov.externel.scMulan.GeneSymbolUniform(input_adata=adata,
61
+ output_dir="./data",
62
+ output_prefix='liver')
63
+
64
+
65
+ # ## 3. process uniformed data (simply norm and log1p)
66
+
67
+ # In[8]:
68
+
69
+
70
+ ## you can read the saved uniformed adata
71
+
72
+ adata_GS_uniformed=sc.read_h5ad('./data/liver_uniformed.h5ad')
73
+
74
+
75
+ # In[9]:
76
+
77
+
78
+ adata_GS_uniformed
79
+
80
+
81
+ # In[10]:
82
+
83
+
84
+ # norm and log1p count matrix
85
+ # in some cases, the count matrix is not normalized and log1p has not been applied.
86
+ # So we need to normalize the count matrix
87
+ if adata_GS_uniformed.X.max() > 10:
88
+ sc.pp.normalize_total(adata_GS_uniformed, target_sum=1e4)
89
+ sc.pp.log1p(adata_GS_uniformed)
90
+
91
+
92
+ # ## 4. load scMulan
93
+
94
+ # In[11]:
95
+
96
+
97
+ # you should first download ckpt from https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
98
+ # put it under ./ckpt/ckpt_scMulan.pt
99
+ # by: wget https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1 -O ckpt/ckpt_scMulan.pt
100
+
101
+ ckp_path = './ckpt/ckpt_scMulan.pt'
102
+
103
+
104
+ # In[12]:
105
+
106
+
107
+ scml = ov.externel.scMulan.model_inference(ckp_path, adata_GS_uniformed)
108
+ base_process = scml.cuda_count()
109
+
110
+
111
+ # In[13]:
112
+
113
+
114
+ scml.get_cell_types_and_embds_for_adata(parallel=True, n_process = 1)
115
+ # scml.get_cell_types_and_embds_for_adata(parallel=False) # for only using CPU, but it is really slow.
116
+
117
+
118
+ # The predicted cell types are stored in scml.adata.obs['cell_type_from_scMulan'], besides the cell embeddings (for multibatch integration) in scml.adata.obsm['X_scMulan'] (not used in this tutorial).
119
+
120
+ # ## 5. visualization
121
+ #
122
+ # Here, we visualize the cell types predicted by scMulan. And we also visualize the original cell types in the dataset.
123
+
124
+ # In[14]:
125
+
126
+
127
+ adata_mulan = scml.adata.copy()
128
+
129
+
130
+ # In[15]:
131
+
132
+
133
+ # calculate the 2-D embedding of the adata using pyMDE
134
+ ov.pp.scale(adata_mulan)
135
+ ov.pp.pca(adata_mulan)
136
+
137
+ #sc.pl.pca_variance_ratio(adata_mulan)
138
+ ov.pp.mde(adata_mulan,embedding_dim=2,n_neighbors=15, basis='X_mde',
139
+ n_pcs=10, use_rep='scaled|original|X_pca',)
140
+
141
+
142
+ # In[26]:
143
+
144
+
145
+ # Here, we can see the cell type annotation from scMulan
146
+ ov.pl.embedding(adata_mulan,basis='X_mde',
147
+ color=["cell_type_from_scMulan",],
148
+ ncols=1,frameon='small')
149
+
150
+
151
+ # In[29]:
152
+
153
+
154
+ adata_mulan.obsm['X_umap']=adata_mulan.obsm['X_mde']
155
+
156
+
157
+ # In[30]:
158
+
159
+
160
+ # you can run the smoothing function to filter out false positives
161
+ ov.externel.scMulan.cell_type_smoothing(adata_mulan, threshold=0.1)
162
+
163
+
164
+ # In[31]:
165
+
166
+
167
+ # cell_type_from_mulan_smoothing: pred+smoothing
168
+ # cell_type: original annotations by the authors
169
+ ov.pl.embedding(adata_mulan,basis='X_mde',
170
+ color=["cell_type_from_mulan_smoothing","cell_type"],
171
+ ncols=1,frameon='small')
172
+
173
+
174
+ # In[32]:
175
+
176
+
177
+ adata_mulan
178
+
179
+
180
+ # In[33]:
181
+
182
+
183
+ top_celltypes = adata_mulan.obs.cell_type_from_scMulan.value_counts().index[:20]
184
+
185
+
186
+ # In[34]:
187
+
188
+
189
+ # you can select some cell types of interest (from scMulan's prediction) for visualization
190
+ # selected_cell_types = ["NK cell", "Kupffer cell", "Conventional dendritic cell 2"] # as example
191
+ selected_cell_types = top_celltypes
192
+ ov.externel.scMulan.visualize_selected_cell_types(adata_mulan,selected_cell_types,smoothing=True)
193
+
194
+
195
+ # In[ ]:
196
+
197
+
198
+
199
+
ovrawm/t_simba.txt ADDED
@@ -0,0 +1,146 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Data integration and batch correction with SIMBA
5
+ #
6
+ # Here we will use three scRNA-seq human pancreas datasets of different studies as an example to illustrate how SIMBA performs scRNA-seq batch correction for multiple batches
7
+ #
8
+ # We follow the corresponding tutorial at [SIMBA](https://simba-bio.readthedocs.io/en/latest/rna_human_pancreas.html). We do not provide much explanation, and instead refer to the original tutorial.
9
+ #
10
+ # Paper: [SIMBA: single-cell embedding along with features](https://www.nature.com/articles/s41592-023-01899-8)
11
+ #
12
+ # Code: https://github.com/huidongchen/simba
13
+
14
+ # In[1]:
15
+
16
+
17
+ import omicverse as ov
18
+ from omicverse.utils import mde
19
+ workdir = 'result_human_pancreas'
20
+ ov.utils.ov_plot_set()
21
+
22
+
23
+ # We need to install simba first
24
+ #
25
+ # ```
26
+ # conda install -c bioconda simba
27
+ # ```
28
+ #
29
+ # or
30
+ #
31
+ # ```
32
+ # pip install git+https://github.com/huidongchen/simba
33
+ # pip install git+https://github.com/pinellolab/simba_pbg
34
+ # ```
35
+
36
+ # ## Read data
37
+ #
38
+ # The AnnData object was concatenated from three AnnData objects in simba: `simba.datasets.rna_baron2016()`, `simba.datasets.rna_segerstolpe2016()`, and `simba.datasets.rna_muraro2016()`
39
+ #
40
+ # It can be downloaded from figshare: https://figshare.com/ndownloader/files/41418600
41
+
42
+ # In[2]:
43
+
44
+
45
+ adata=ov.utils.read('simba_adata_raw.h5ad')
46
+
47
+
48
+ # We need to set `workdir` to initialize the pySIMBA object
49
+
50
+ # In[3]:
51
+
52
+
53
+ simba_object=ov.single.pySIMBA(adata,workdir)
54
+
55
+
56
+ # ## Preprocess
57
+ #
58
+ # Following the original tutorial, we keep the parameters at their default values.
59
+
60
+ # In[4]:
61
+
62
+
63
+ simba_object.preprocess(batch_key='batch',min_n_cells=3,
64
+ method='lib_size',n_top_genes=3000,n_bins=5)
65
+
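+ # The `n_bins=5` argument suggests that expression values are discretized into five levels before graph construction. The sketch below uses equal-width bins purely as an illustration; this is an assumption, and SIMBA's own binning strategy may differ. The helper name `discretize` is hypothetical.

```python
def discretize(values, n_bins=5):
    """Map each positive value to a level 1..n_bins (equal-width bins); zeros stay 0."""
    positive = [v for v in values if v > 0]
    if not positive:
        return [0 for _ in values]
    lo, hi = min(positive), max(positive)
    # Fall back to a width of 1.0 when all positive values are identical.
    width = (hi - lo) / n_bins or 1.0
    levels = []
    for v in values:
        if v <= 0:
            levels.append(0)
        else:
            # The maximum value would fall just past the last bin, so clamp it.
            levels.append(min(int((v - lo) / width) + 1, n_bins))
    return levels

expr = [0.0, 0.5, 1.0, 2.5, 5.0]  # toy normalized expression of one gene
levels = discretize(expr, n_bins=5)
print(levels)
```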
66
+
67
+ # ## Generate a graph for training
68
+ #
69
+ # Observations and variables within each Anndata object are both represented as nodes (entities).
70
+ #
71
+ # The data are stored in `simba_object.uns['simba_batch_edge_dict']`
72
+
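+ # To make the "cells and genes are both nodes" idea concrete, here is a toy sketch that turns a discretized cell-by-gene matrix into an edge list. The relation names (`r1`..`r5`), the helper `build_edges` and the toy data are hypothetical; the real edges are produced by `gen_graph()` and may be organised differently.

```python
def build_edges(cell_names, gene_names, levels):
    """Return (cell, relation, gene) triples for every non-zero expression level."""
    edges = []
    for cell, row in zip(cell_names, levels):
        for gene, level in zip(gene_names, row):
            if level > 0:
                # One relation type per expression level.
                edges.append((cell, f"r{level}", gene))
    return edges

cells = ["cell_0", "cell_1"]
genes = ["INS", "GCG", "SST"]
levels = [[3, 0, 1],
          [0, 5, 0]]

edges = build_edges(cells, genes, levels)
# Both cells and genes appear as nodes (entities) of the graph.
nodes = sorted({e[0] for e in edges} | {e[2] for e in edges})
print(edges)
print(nodes)
```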
73
+ # In[5]:
74
+
75
+
76
+ simba_object.gen_graph()
77
+
78
+
79
+ # ## PBG training
80
+ #
81
+ # Before training, let’s take a look at the current parameters:
82
+ #
83
+ # - dict_config['workers'] = 12 #The number of CPUs.
84
+
85
+ # In[10]:
86
+
87
+
88
+ simba_object.train(num_workers=6)
89
+
90
+
91
+ # In[6]:
92
+
93
+
94
+ simba_object.load('result_human_pancreas/pbg/graph0')
95
+
96
+
97
+ # ## Batch correction
98
+ #
99
+ # Here, we use `simba_object.batch_correction()` to perform the batch correction
100
+ #
101
+ # <div class="admonition note">
102
+ # <p class="admonition-title">Note</p>
103
+ # <p>
104
+ # If there are more than 10 batches, the batch correction is less effective
105
+ # </p>
106
+ # </div>
107
+
108
+ # In[7]:
109
+
110
+
111
+ adata=simba_object.batch_correction()
112
+ adata
113
+
114
+
115
+ # ## Visualize
116
+ #
117
+ # We use `mde` instead of `umap` to visualize the result
118
+
119
+ # In[8]:
120
+
121
+
122
+ adata.obsm["X_mde"] = mde(adata.obsm["X_simba"])
123
+
124
+
125
+ # In[11]:
126
+
127
+
128
+ sc.pl.embedding(adata,basis='X_mde',color=['cell_type1','batch'])
129
+
130
+
131
+ # UMAP can also be used for visualization
132
+
133
+ # In[10]:
134
+
135
+
136
+ import scanpy as sc
137
+ sc.pp.neighbors(adata, use_rep="X_simba")
138
+ sc.tl.umap(adata)
139
+ sc.pl.umap(adata,color=['cell_type1','batch'])
140
+
141
+
142
+ # In[ ]:
143
+
144
+
145
+
146
+
ovrawm/t_single_batch.txt ADDED
@@ -0,0 +1,333 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Data integration and batch correction
5
+ #
6
+ # An important task of single-cell analysis is the integration of several samples, which we can perform with omicverse.
7
+ #
8
+ # Here we demonstrate how to merge data using omicverse and correct for batch effects. omicverse provides four methods for batch effect correction: harmony, scanorama and combat, which do not require a GPU, and SIMBA, which does. If a GPU is available, we recommend the GPU-based scVI and scANVI for the best batch effect correction results.
9
+ #
10
+ #
11
+
12
+ # In[1]:
13
+
14
+
15
+ import omicverse as ov
16
+ #print(f"omicverse version: {ov.__version__}")
17
+ import scanpy as sc
18
+ #print(f"scanpy version: {sc.__version__}")
19
+ ov.utils.ov_plot_set()
20
+
21
+
22
+ # ## Data integration
23
+ #
24
+ # First, we need to concatenate the scRNA-seq data from the different batches. We can use `sc.concat` to do this.
25
+ #
26
+ # The dataset we will use to demonstrate data integration contains several samples of bone marrow mononuclear cells. These samples were originally created for the Open Problems in Single-Cell Analysis NeurIPS Competition 2021.
27
+ #
28
+ # We selected the samples `s1d3`, `s2d1` and `s3d7` for integration. The individual data can be downloaded from figshare.
29
+ #
30
+ # - s1d3:
31
+ # - s2d1:
32
+ # - s3d7:
33
+
34
+ # In[2]:
35
+
36
+
37
+ adata1=ov.read('neurips2021_s1d3.h5ad')
38
+ adata1.obs['batch']='s1d3'
39
+ adata2=ov.read('neurips2021_s2d1.h5ad')
40
+ adata2.obs['batch']='s2d1'
41
+ adata3=ov.read('neurips2021_s3d7.h5ad')
42
+ adata3.obs['batch']='s3d7'
43
+
44
+
45
+ # In[3]:
46
+
47
+
48
+ adata=sc.concat([adata1,adata2,adata3],merge='same')
49
+ adata
50
+
51
+
52
+ # We can see that the `batch` column now contains three distinct batches
53
+
54
+ # In[4]:
55
+
56
+
57
+ adata.obs['batch'].unique()
58
+
59
+
60
+ # In[7]:
61
+
62
+
63
+ import numpy as np
64
+ adata.X=adata.X.astype(np.int64)
65
+
66
+
67
+ # ## Data preprocess and Batch visualize
68
+ #
69
+ # We first performed quality control of the data and normalisation with screening for highly variable genes. Then visualise potential batch effects in the data.
70
+ #
71
+ # Here, we can set `batch_key='batch'` so that doublet detection and highly variable gene identification take the batches into account.
72
+
73
+ # In[8]:
74
+
75
+
76
+ adata=ov.pp.qc(adata,
77
+ tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
78
+ batch_key='batch')
79
+ adata
80
+
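+ # The `tresh` dictionary passed to `ov.pp.qc` holds per-cell thresholds. As a rough illustration, the sketch below treats `mito_perc` as a maximum mitochondrial fraction and `nUMIs` / `detected_genes` as minimum values. The comparison directions, the helper `passes_qc` and the toy cells are assumptions, and the real function performs additional steps such as doublet detection.

```python
thresholds = {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}

def passes_qc(cell, thresholds):
    """Keep a cell if it is below the mito cutoff and above both count cutoffs."""
    return (cell["mito_perc"] <= thresholds["mito_perc"]
            and cell["nUMIs"] >= thresholds["nUMIs"]
            and cell["detected_genes"] >= thresholds["detected_genes"])

# Hypothetical per-cell QC metrics.
cells = [
    {"id": "c1", "mito_perc": 0.05, "nUMIs": 3200, "detected_genes": 1100},
    {"id": "c2", "mito_perc": 0.35, "nUMIs": 2800, "detected_genes": 900},   # too much mito
    {"id": "c3", "mito_perc": 0.02, "nUMIs": 310,  "detected_genes": 180},   # too few counts
]

kept = [c["id"] for c in cells if passes_qc(c, thresholds)]
print(kept)
```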
81
+
82
+ # We can store the data in `adata.raw` if we still need all genes after filtering for HVGs.
83
+
84
+ # In[10]:
85
+
86
+
87
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',
88
+ n_HVGs=3000,batch_key=None)
89
+ adata
90
+
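+ # As we read it, the `shiftlog` part of `mode='shiftlog|pearson'` refers to library-size scaling followed by log1p, and `pearson` to how HVGs are selected. The sketch below only illustrates the first part on a single toy cell; the `target_sum` default and the helper name `shiftlog` are assumptions, not the library's implementation.

```python
import math

def shiftlog(counts, target_sum=1e4):
    """Scale counts to a common library size, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / total * target_sum) for c in counts]

cell = [0, 5, 45, 50]  # toy raw counts for four genes in one cell
norm = shiftlog(cell)
print([round(x, 3) for x in norm])
```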
91
+
92
+ # In[11]:
93
+
94
+
95
+ adata.raw = adata
96
+ adata = adata[:, adata.var.highly_variable_features]
97
+ adata
98
+
99
+
100
+ # We can save the pre-processed data.
101
+
102
+ # In[12]:
103
+
104
+
105
+ adata.write_h5ad('neurips2021_batch_normlog.h5ad',compression='gzip')
106
+
107
+
108
+ # Similarly, we calculate PCA on the HVGs and visualise potential batch effects in the data using pymde. pymde computes a minimum-distortion embedding and can be GPU-accelerated; we use it here as an alternative to UMAP.
109
+
110
+ # In[13]:
111
+
112
+
113
+ ov.pp.scale(adata)
114
+ ov.pp.pca(adata,layer='scaled',n_pcs=50,mask_var='highly_variable_features')
115
+
116
+ adata.obsm["X_mde_pca"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
117
+
118
+
119
+ # There is a very clear batch effect in the data
120
+
121
+ # In[14]:
122
+
123
+
124
+ ov.utils.embedding(adata,
125
+ basis='X_mde_pca',frameon='small',
126
+ color=['batch','cell_type'],show=False)
127
+
128
+
129
+ # ## Harmony
130
+ #
131
+ # Harmony is an algorithm for integrating single-cell genomics datasets. Please check out the manuscript in [Nature Methods](https://www.nature.com/articles/s41592-019-0619-0).
132
+ #
133
+ # ![harmony](https://portals.broadinstitute.org/harmony/articles/main.jpg)
134
+
135
+ # The `methods` argument of `ov.single.batch_correction` can be set to `harmony`, `combat` or `scanorama`
136
+
137
+ # In[40]:
138
+
139
+
140
+ adata_harmony=ov.single.batch_correction(adata,batch_key='batch',
141
+ methods='harmony',n_pcs=50)
142
+ adata
143
+
144
+
145
+ # In[41]:
146
+
147
+
148
+ adata.obsm["X_mde_harmony"] = ov.utils.mde(adata.obsm["X_harmony"])
149
+
150
+
151
+ # In[42]:
152
+
153
+
154
+ ov.utils.embedding(adata,
155
+ basis='X_mde_harmony',frameon='small',
156
+ color=['batch','cell_type'],show=False)
157
+
158
+
159
+ # ## Combat
160
+ #
161
+ # combat is a batch effect correction method that is very widely used in bulk RNA-seq, and it works just as well on single-cell sequencing data.
162
+ #
163
+ #
164
+
165
+ # In[43]:
166
+
167
+
168
+ adata_combat=ov.single.batch_correction(adata,batch_key='batch',
169
+ methods='combat',n_pcs=50)
170
+ adata
171
+
172
+
173
+ # In[44]:
174
+
175
+
176
+ adata.obsm["X_mde_combat"] = ov.utils.mde(adata.obsm["X_combat"])
177
+
178
+
179
+ # In[45]:
180
+
181
+
182
+ ov.utils.embedding(adata,
183
+ basis='X_mde_combat',frameon='small',
184
+ color=['batch','cell_type'],show=False)
185
+
186
+
187
+ # ## scanorama
188
+ #
189
+ # Integration of single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data.
190
+ #
191
+ # ![scanorama](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41587-019-0113-3/MediaObjects/41587_2019_113_Fig1_HTML.png?as=webp)
192
+
193
+ # In[46]:
194
+
195
+
196
+ adata_scanorama=ov.single.batch_correction(adata,batch_key='batch',
197
+ methods='scanorama',n_pcs=50)
198
+ adata
199
+
200
+
201
+ # In[47]:
202
+
203
+
204
+ adata.obsm["X_mde_scanorama"] = ov.utils.mde(adata.obsm["X_scanorama"])
205
+
206
+
207
+ # In[48]:
208
+
209
+
210
+ ov.utils.embedding(adata,
211
+ basis='X_mde_scanorama',frameon='small',
212
+ color=['batch','cell_type'],show=False)
213
+
214
+
215
+ # ## scVI
216
+ #
217
+ # An important task of single-cell analysis is the integration of several samples, which we can perform with scVI. For integration, scVI treats the data as unlabelled. When our dataset is fully labelled (perhaps in independent studies, or independent analysis pipelines), we can obtain an integration that better preserves biology using scANVI, which incorporates cell type annotation information. Here we demonstrate scVI on the bone marrow samples concatenated above. The same pipeline would generally be used to analyze any collection of scRNA-seq datasets.
218
+
219
+ # In[3]:
220
+
221
+
222
+ adata_scvi=ov.single.batch_correction(adata,batch_key='batch',
223
+ methods='scVI',n_layers=2, n_latent=30, gene_likelihood="nb")
224
+ adata
225
+
226
+
227
+ # In[4]:
228
+
229
+
230
+ adata.obsm["X_mde_scVI"] = ov.utils.mde(adata.obsm["X_scVI"])
231
+
232
+
233
+ # In[5]:
234
+
235
+
236
+ ov.utils.embedding(adata,
237
+ basis='X_mde_scVI',frameon='small',
238
+ color=['batch','cell_type'],show=False)
239
+
240
+
241
+ # ## MIRA+CODAL
242
+ #
243
+ # Topic modeling of batched single-cell data is challenging because these models cannot typically distinguish between biological and technical effects of the assay. CODAL (COvariate Disentangling Augmented Loss) uses a novel mutual information regularization technique to explicitly disentangle these two sources of variation.
244
+
245
+ # In[15]:
246
+
247
+
248
+ LDA_obj=ov.utils.LDA_topic(adata,feature_type='expression',
249
+ highly_variable_key='highly_variable_features',
250
+ layers='counts',batch_key='batch',learning_rate=1e-3)
251
+
252
+
253
+ # In[16]:
254
+
255
+
256
+ LDA_obj.plot_topic_contributions(6)
257
+
258
+
259
+ # In[17]:
260
+
261
+
262
+ LDA_obj.predicted(15)
263
+
264
+
265
+ # In[37]:
266
+
267
+
268
+ adata.obsm["X_mde_mira_topic"] = ov.utils.mde(adata.obsm["X_topic_compositions"])
269
+ adata.obsm["X_mde_mira_feature"] = ov.utils.mde(adata.obsm["X_umap_features"])
270
+
271
+
272
+ # In[38]:
273
+
274
+
275
+ ov.utils.embedding(adata,
276
+ basis='X_mde_mira_topic',frameon='small',
277
+ color=['batch','cell_type'],show=False)
278
+
279
+
280
+ # In[39]:
281
+
282
+
283
+ ov.utils.embedding(adata,
284
+ basis='X_mde_mira_feature',frameon='small',
285
+ color=['batch','cell_type'],show=False)
286
+
287
+
288
+ # ## Benchmarking test
289
+ #
290
+ # The methods demonstrated here are selected based on results from benchmarking experiments including the single-cell integration benchmarking project [Luecken et al., 2021]. This project also produced a software package called [scib](https://www.github.com/theislab/scib) that can be used to run a range of integration methods as well as the metrics that were used for evaluation. In this section, we show how to use this package to evaluate the quality of an integration.
291
+
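+ # To give an intuition for what a batch-mixing score measures, here is a toy metric: the mean fraction of each cell's k nearest neighbours that come from a different batch. This is not one of the scib metrics; the function `neighbor_batch_mixing` and the toy embeddings are hypothetical stand-ins.

```python
import math

def neighbor_batch_mixing(embedding, batches, k=2):
    """Mean fraction of each cell's k nearest neighbours that belong to another batch."""
    scores = []
    for i, point in enumerate(embedding):
        dists = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embedding) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(batches[j] != batches[i] for j in neighbours) / k)
    return sum(scores) / len(scores)

batches = ["a", "a", "a", "b", "b", "b"]
# Batches sit in separate clusters: strong batch effect.
separated = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
# Each cluster contains cells of both batches: better mixing.
mixed = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]

score_separated = neighbor_batch_mixing(separated, batches)
score_mixed = neighbor_batch_mixing(mixed, batches)
print(score_separated, score_mixed)
```

+ # A higher score means neighbours are drawn from several batches; a real evaluation should also check that cell types stay separated, which is why the benchmark below combines batch-correction and bio-conservation metrics.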
292
+ # In[6]:
293
+
294
+
295
+ adata.write_h5ad('neurips2021_batch_all.h5ad',compression='gzip')
296
+
297
+
298
+ # In[2]:
299
+
300
+
301
+ adata=sc.read('neurips2021_batch_all.h5ad')
302
+
303
+
304
+ # In[7]:
305
+
306
+
307
+ adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca'].copy()
308
+ adata.obsm['X_mira_topic']=adata.obsm['X_topic_compositions'].copy()
309
+ adata.obsm['X_mira_feature']=adata.obsm['X_umap_features'].copy()
310
+
311
+
312
+ # In[ ]:
313
+
314
+
315
+ from scib_metrics.benchmark import Benchmarker
316
+ bm = Benchmarker(
317
+ adata,
318
+ batch_key="batch",
319
+ label_key="cell_type",
320
+ embedding_obsm_keys=["X_pca", "X_combat", "X_harmony",
321
+ 'X_scanorama','X_mira_topic','X_mira_feature','X_scVI'],
322
+ n_jobs=8,
323
+ )
324
+ bm.benchmark()
325
+
326
+
327
+ # In[9]:
328
+
329
+
330
+ bm.plot_results_table(min_max_scale=False)
331
+
332
+
333
+ # Among the three methods that do not use a GPU, harmony removes the batch effect best; scVI is the method that removes batch effects using a GPU.
ovrawm/t_slat.txt ADDED
@@ -0,0 +1,365 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Single cell spatial alignment tools
5
+ #
6
+ # SLAT (Spatially-Linked Alignment Tool) is a graph-based algorithm for efficient and effective alignment of spatial slices. Adopting a graph adversarial matching strategy, SLAT is the first algorithm capable of aligning heterogeneous spatial data across distinct technologies and modalities.
7
+ #
8
+ # We made two improvements in integrating the SLAT algorithm in OmicVerse:
9
+ #
10
+ # - **Fix the running error in alignment**: We fixed some issues with the scSLAT package on pypi.
11
+ # - **Added more downstream analysis**: We have expanded on the original tutorial by combining it with the reproducibility code provided by the authors for downstream analysis.
12
+ #
13
+ # If you found this tutorial helpful, please cite SLAT and OmicVerse:
14
+ #
15
+ # - Xia, CR., Cao, ZJ., Tu, XM. et al. Spatial-linked alignment tool (SLAT) for aligning heterogenous slices. Nat Commun 14, 7236 (2023). https://doi.org/10.1038/s41467-023-43105-5
16
+
17
+ # In[1]:
18
+
19
+
20
+ import omicverse as ov
21
+ import os
22
+
23
+ import scanpy as sc
24
+ import numpy as np
25
+ import pandas as pd
26
+ import torch
27
+ ov.plot_set()
28
+
29
+
30
+ # In[2]:
31
+
32
+
33
+ #import scSLAT
34
+ from omicverse.externel.scSLAT.model import load_anndatas, Cal_Spatial_Net, run_SLAT, scanpy_workflow, spatial_match
35
+ from omicverse.externel.scSLAT.viz import match_3D_multi, hist, Sankey, match_3D_celltype, Sankey_multi, build_3D
36
+ from omicverse.externel.scSLAT.metrics import region_statistics
37
+
38
+
39
+ # ## Preprocess Data
40
+ #
41
+ # adata1.h5ad: E11.5 mouse embryo dataset, download from [here](https://drive.google.com/uc?export=download&id=1KkuJt6aSlKS1AJzFZjE_odypY-GINRuD)
42
+ #
43
+ # adata2.h5ad: E12.5 mouse embryo dataset, download from [here](https://drive.google.com/uc?export=download&id=1YIiEmjGfHxcDbGn4nv2kzmTHUo3_q5hJ)
44
+
45
+ # In[3]:
46
+
47
+
48
+ adata1 = sc.read_h5ad('data/E115_Stereo.h5ad')
49
+ adata2 = sc.read_h5ad('data/E125_Stereo.h5ad')
50
+
51
+
52
+ # In[4]:
53
+
54
+
55
+ adata1.obs['week']='E11.5'
56
+ adata2.obs['week']='E12.5'
57
+
58
+
59
+ # In[5]:
60
+
61
+
62
+ sc.pl.spatial(adata1, color='annotation', spot_size=3)
63
+ sc.pl.spatial(adata2, color='annotation', spot_size=3)
64
+
65
+
66
+ # ## Run SLAT
67
+ #
68
+ # Then we run SLAT as usual
69
+
70
+ # In[6]:
71
+
72
+
73
+ Cal_Spatial_Net(adata1, k_cutoff=20, model='KNN')
74
+ Cal_Spatial_Net(adata2, k_cutoff=20, model='KNN')
75
+ edges, features = load_anndatas([adata1, adata2], feature='DPCA', check_order=False)
76
+
77
+
78
+ # In[7]:
79
+
80
+
81
+ embd0, embd1, time = run_SLAT(features, edges, LGCN_layer=5)
82
+
83
+
84
+ # In[8]:
85
+
86
+
87
+ best, index, distance = spatial_match([embd0, embd1], reorder=False, adatas=[adata1,adata2])
88
+
89
+
90
+ # In[9]:
91
+
92
+
93
+ matching = np.array([range(index.shape[0]), best])
94
+ best_match = distance[:,0]
95
+ region_statistics(best_match, start=0.5, number_of_interval=10)
96
+
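+ # The `matching` array pairs each cell of one slice (row 0) with its best match in the other slice (row 1). The sketch below mimics that structure with a plain cosine-similarity search on toy embeddings. The similarity measure, the helper names and the choice of which slice is the query are assumptions; `spatial_match` may work differently internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_matches(query_embd, target_embd):
    """For every query cell, return the index and similarity of its closest target cell."""
    best, score = [], []
    for q in query_embd:
        sims = [cosine(q, t) for t in target_embd]
        j = max(range(len(sims)), key=sims.__getitem__)
        best.append(j)
        score.append(sims[j])
    return best, score

target = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]             # toy embeddings, slice 1
query = [(0.1, 0.9), (2.0, 0.1), (0.9, 1.1), (3.0, 0.0)]  # toy embeddings, slice 2

best, score = best_matches(query, target)
# Same layout as `matching` above: row 0 = query indices, row 1 = matched target indices.
matching = [list(range(len(query))), best]
print(matching)
```

+ # Note that several query cells may share the same target cell, as in the last column of the toy result.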
97
+
98
+ # ## Visualization of alignment
99
+
100
+ # In[10]:
101
+
102
+
103
+ import matplotlib.pyplot as plt
104
+ matching_list=[matching]
105
+ model = build_3D([adata1,adata2], matching_list,subsample_size=300, )
106
+ ax=model.draw_3D(hide_axis=True, line_color='#c2c2c2', height=1, size=[6,6], line_width=1)
107
+
108
+
109
+ # Then we check the alignment quality of the whole slide
110
+
111
+ # In[11]:
112
+
113
+
114
+ adata2.obs['low_quality_index']= best_match
115
+ adata2.obs['low_quality_index'] = adata2.obs['low_quality_index'].astype(float)
116
+
117
+
118
+ # In[12]:
119
+
120
+
121
+ adata2.obsm['spatial']
122
+
123
+
124
+ # In[13]:
125
+
126
+
127
+ sc.pl.spatial(adata2, color='low_quality_index', spot_size=3, title='Quality')
128
+
129
+
130
+ # We use a Sankey diagram to show the correspondence between cell types at different stages of development
131
+
132
+ # In[33]:
133
+
134
+
135
+ fig=Sankey_multi(adata_li=[adata1,adata2],
136
+ prefix_li=['E11.5','E12.5'],
137
+ matching_li=[matching],
138
+ clusters='annotation',filter_num=10,
139
+ node_opacity = 0.8,
140
+ link_opacity = 0.2,
141
+ layout=[800,500],
142
+ font_size=12,
143
+ font_color='Black',
144
+ save_name=None,
145
+ format='png',
146
+ width=1200,
147
+ height=1000,
148
+ return_fig=True)
149
+ fig.show()
150
+
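+ # A Sankey diagram of this kind is essentially a table of how many matched cell pairs connect each label in one slice to each label in the other. The sketch below counts those links from a toy matching. The helper `link_counts` and its `min_cells` cutoff are hypothetical, and reading `filter_num` as a minimum link size is our assumption.

```python
from collections import Counter

def link_counts(labels_src, labels_dst, matching, min_cells=1):
    """Count (source label, destination label) pairs over all matched cells."""
    dst_idx, src_idx = matching
    counts = Counter((labels_src[s], labels_dst[d]) for d, s in zip(dst_idx, src_idx))
    # Drop weak links so the diagram stays readable.
    return {pair: n for pair, n in counts.items() if n >= min_cells}

labels_e115 = ["Urogenital ridge", "Urogenital ridge", "Heart"]   # toy earlier slice
labels_e125 = ["Kidney", "Kidney", "Ovary", "Heart"]              # toy later slice
matching = [[0, 1, 2, 3],   # indices into the later slice
            [0, 1, 1, 2]]   # matched indices into the earlier slice

all_links = link_counts(labels_e115, labels_e125, matching)
strong_links = link_counts(labels_e115, labels_e125, matching, min_cells=2)
print(all_links)
print(strong_links)
```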
151
+
152
+ # In[34]:
153
+
154
+
155
+ fig.write_html("slat_sankey.html")
156
+
157
+
158
+ # ## Focus on developing Kidney
159
+ #
160
+ # We highlighted the “Kidney” cells in E12.5 and their aligned precursor cells in E11.5 in alignment results. Consistent with our biological priors, the precursors of the kidney are the mesonephros and the metanephros
161
+ #
162
+ # We then focus on another organ, the ‘Ovary’, and find that the ovary has only a single spatial origin. Interestingly, the precursors of the ovary are spatially close to the mesonephros (see the Kidney part), because the mammalian ovary originates from the regressed mesonephros.
163
+
164
+ # In[27]:
165
+
166
+
167
+ color_dict1=dict(zip(adata1.obs['annotation'].cat.categories,
168
+ adata1.uns['annotation_colors'].tolist()))
169
+ adata1_df = pd.DataFrame({'index':range(embd0.shape[0]),
170
+ 'x': adata1.obsm['spatial'][:,0],
171
+ 'y': adata1.obsm['spatial'][:,1],
172
+ 'celltype':adata1.obs['annotation'],
173
+ 'color':adata1.obs['annotation'].map(color_dict1)
174
+ }
175
+ )
176
+ color_dict2=dict(zip(adata2.obs['annotation'].cat.categories,
177
+ adata2.uns['annotation_colors'].tolist()))
178
+ adata2_df = pd.DataFrame({'index':range(embd1.shape[0]),
179
+ 'x': adata2.obsm['spatial'][:,0],
180
+ 'y': adata2.obsm['spatial'][:,1],
181
+ 'celltype':adata2.obs['annotation'],
182
+ 'color':adata2.obs['annotation'].map(color_dict2)
183
+ }
184
+ )
185
+
186
+
187
+ # In[28]:
188
+
189
+
190
+ kidney_align = match_3D_celltype(adata1_df, adata2_df, matching, meta='celltype',
191
+ highlight_celltype = [['Urogenital ridge'],['Kidney','Ovary']],
192
+ subsample_size=10000, highlight_line = ['blue'], scale_coordinate = True )
193
+ kidney_align.draw_3D(size= [6, 6], line_width =0.8, point_size=[0.6,0.6], hide_axis=True)
194
+
195
+
196
+ # We can get the lineage of the query's cells and mappings using the following function
197
+
198
+ # In[15]:
199
+
200
+
201
+ def cal_matching_cell(target_adata,query_adata,matching,query_cell,clusters='annotation',):
202
+ adata1_df = pd.DataFrame({'index':range(target_adata.shape[0]),
203
+ 'x': target_adata.obsm['spatial'][:,0],
204
+ 'y': target_adata.obsm['spatial'][:,1],
205
+ 'celltype':target_adata.obs[clusters]})
206
+ adata2_df = pd.DataFrame({'index':range(query_adata.shape[0]),
207
+ 'x': query_adata.obsm['spatial'][:,0],
208
+ 'y': query_adata.obsm['spatial'][:,1],
209
+ 'celltype':query_adata.obs[clusters]})
210
+ query_adata = target_adata[matching[1,adata2_df.loc[adata2_df.celltype==query_cell,'index'].values],:]
211
+ #adata2_df['target_celltype'] = adata1_df.iloc[matching[1,:],:]['celltype'].to_list()
212
+ #adata2_df['target_obs_names'] = adata1_df.iloc[matching[1,:],:].index.to_list()
213
+
214
+ #query_obs=adata2_df.loc[adata2_df['celltype']==query_cell,'target_obs_names'].tolist()
215
+ return query_adata
216
+
217
+
218
+
219
+ # We find that the matches shown in 3D are also clearly visible in 2D
220
+
221
+ # In[21]:
222
+
223
+
224
+ query_adata=cal_matching_cell(target_adata=adata1,
225
+ query_adata=adata2,
226
+ matching=matching,
227
+ query_cell='Kidney',clusters='annotation')
228
+ query_adata
229
+
230
+
231
+ # In[17]:
232
+
233
+
234
+ adata1.obs['kidney_anno']=''
235
+ adata1.obs.loc[query_adata.obs.index,'kidney_anno']=query_adata.obs['annotation']
236
+
237
+
238
+ # In[18]:
239
+
240
+
241
+ sc.pl.spatial(adata1, color='kidney_anno', spot_size=3,
242
+ palette=['#F5F5F5','#ff7f0e', 'green',])
243
+
244
+
245
+ # We are interested in Kidney lineage development, so we combine the cells corresponding to the Kidney lineage from the E11.5 and E12.5 sections. We can then use differential analysis to study the dynamics of Kidney lineage development.
246
+
247
+ # In[22]:
248
+
249
+
250
+ kidney_lineage_ad=sc.concat([query_adata,adata2[adata2.obs['annotation']=='Kidney']],merge='same')
251
+ kidney_lineage_ad=ov.pp.preprocess(kidney_lineage_ad,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
252
+ kidney_lineage_ad.raw = kidney_lineage_ad
253
+ kidney_lineage_ad = kidney_lineage_ad[:, kidney_lineage_ad.var.highly_variable_features]
254
+ ov.pp.scale(kidney_lineage_ad)
255
+ ov.pp.pca(kidney_lineage_ad)
256
+ ov.pp.neighbors(kidney_lineage_ad,use_rep='scaled|original|X_pca',metric="cosine")
257
+ ov.utils.cluster(kidney_lineage_ad,method='leiden',resolution=1)
258
+ ov.pp.umap(kidney_lineage_ad)
259
+
260
+
261
+ # In[23]:
262
+
263
+
264
+ ov.pl.embedding(kidney_lineage_ad,basis='X_umap',
265
+ color=['annotation','week','leiden'],
266
+ frameon='small')
267
+
268
+
269
+ # In[25]:
270
+
271
+
272
+ # Nphs1 https://www.nature.com/articles/s41467-021-22266-1
273
+ sc.pl.dotplot(kidney_lineage_ad,{'nephron progenitors':['Wnt9b','Osr1','Nphs1','Lhx1','Pax2','Pax8'],
274
+ 'metanephric':['Eya1','Shisa3','Foxc1'],
275
+ 'kidney':['Wt1','Wnt4','Nr2f2','Dach1','Cd44']} ,
276
+ 'leiden',dendrogram=False,colorbar_title='Expression')
277
+
278
+
279
+ # In[26]:
280
+
281
+
282
+ kidney_lineage_ad.obs['re_anno'] = 'Unknown'
283
+ kidney_lineage_ad.obs.loc[kidney_lineage_ad.obs.leiden.isin(['4']),'re_anno'] = 'Nephron progenitors (E11.5)'
284
+ kidney_lineage_ad.obs.loc[kidney_lineage_ad.obs.leiden.isin(['2','3','1','5']),'re_anno'] = 'Metanephron progenitors (E11.5)'
285
+ kidney_lineage_ad.obs.loc[kidney_lineage_ad.obs.leiden=='0','re_anno'] = 'Kidney (E12.5)'
286
+
287
+
288
+ # In[28]:
289
+
290
+
291
+ # kidney_all = kidney_all[kidney_all.obs.leiden!='3',:]
292
+ kidney_lineage_ad.obs.leiden = list(kidney_lineage_ad.obs.leiden)
293
+ ov.pl.embedding(kidney_lineage_ad,basis='X_umap',
294
+ color=['annotation','re_anno'],
295
+ frameon='small')
296
+
297
+
298
+ # In[29]:
299
+
300
+
301
+ adata1.obs['kidney_anno']=''
302
+ adata1.obs.loc[kidney_lineage_ad[kidney_lineage_ad.obs['week']=='E11.5'].obs.index,'kidney_anno']=kidney_lineage_ad[kidney_lineage_ad.obs['week']=='E11.5'].obs['re_anno']
303
+
304
+
305
+ # In[41]:
306
+
307
+
308
+ import matplotlib.pyplot as plt
309
+ fig, ax = plt.subplots(1, 1, figsize=(8, 8))
310
+ sc.pl.spatial(adata1, color='kidney_anno', spot_size=1.5,
311
+ palette=['#F5F5F5','#ff7f0e', 'green',],show=False,ax=ax)
312
+
313
+
314
+ # We can also run a differential expression analysis along the Kidney developmental lineage to find marker genes, and analyse transcription factors to identify the regulatory units involved.
315
+
316
+ # In[42]:
317
+
318
+
319
+ test_adata=kidney_lineage_ad
320
+ dds=ov.bulk.pyDEG(test_adata.to_df(layer='lognorm').T)
321
+ dds.drop_duplicates_index()
322
+ print('... drop_duplicates_index success')
323
+ treatment_groups=test_adata.obs[test_adata.obs['week']=='E12.5'].index.tolist()
324
+ control_groups=test_adata.obs[test_adata.obs['week']=='E11.5'].index.tolist()
325
+ result=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
326
+ # fc_threshold=-1 means the fold-change threshold is calculated automatically
327
+ dds.foldchange_set(fc_threshold=-1,
328
+ pval_threshold=0.05,
329
+ logp_max=10)
330
+
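+ # For intuition about what a t-test based comparison between E12.5 and E11.5 cells computes for a single gene, here is a standard-library sketch of a Welch t-statistic and a log2 fold change. The toy values, the pseudocount and both helper names are assumptions; `pyDEG` may compute these quantities differently.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def log2_fold_change(a, b, pseudocount=1e-9):
    """log2 of the ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(a) + pseudocount) / (mean(b) + pseudocount))

treated = [4.0, 5.0, 6.0]   # toy expression of one gene in the treatment group
control = [1.0, 2.0, 3.0]   # toy expression of the same gene in the control group

t_stat = welch_t(treated, control)
lfc = log2_fold_change(treated, control)
print(round(t_stat, 3), round(lfc, 3))
```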
331
+
332
+ # In[43]:
333
+
334
+
335
+ dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
336
+ plot_genes_num=8,plot_genes_fontsize=12,)
337
+
338
+
339
+ # In[52]:
340
+
341
+
342
+ up_gene=dds.result.loc[dds.result['sig']=='up'].sort_values('qvalue')[:3].index.tolist()
343
+ down_gene=dds.result.loc[dds.result['sig']=='down'].sort_values('qvalue')[:3].index.tolist()
344
+ deg_gene=up_gene+down_gene
345
+
346
+
347
+ # In[53]:
348
+
349
+
350
+ sc.pl.dotplot(kidney_lineage_ad,deg_gene,
351
+ groupby='re_anno')
352
+
353
+
354
+ # In addition to analysing differential expression directly, we can also look for stage-specific marker genes by treating the re-annotated groups as categories.
355
+
356
+ # In[55]:
357
+
358
+
359
+ sc.tl.dendrogram(kidney_lineage_ad,'re_anno',use_rep='scaled|original|X_pca')
360
+ sc.tl.rank_genes_groups(kidney_lineage_ad, 're_anno', use_rep='scaled|original|X_pca',
361
+ method='t-test',use_raw=False,key_added='re_anno_ttest')
362
+ sc.pl.rank_genes_groups_dotplot(kidney_lineage_ad,groupby='re_anno',
363
+ cmap='RdBu_r',key='re_anno_ttest',
364
+ standard_scale='var',n_genes=3)
365
+
ovrawm/t_spaceflow.txt ADDED
@@ -0,0 +1,160 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Identifying Pseudo-Spatial Map
5
+ #
6
+ # SpaceFlow is a Python package for identifying spatiotemporal patterns and spatial domains from Spatial Transcriptomic (ST) Data. Based on deep graph network, SpaceFlow provides the following functions:
6
+ # 1. Encodes the ST data into **low-dimensional embeddings** that reflect both expression similarity and the spatial proximity of cells in ST data.
8
+ # 2. Incorporates **spatiotemporal** relationships of cells or spots in ST data through a **pseudo-Spatiotemporal Map (pSM)** derived from the embeddings.
9
+ # 3. Identifies **spatial domains** with spatially-coherent expression patterns.
10
+ #
11
+ # Check out [(Ren et al., Nature Communications, 2022)](https://www.nature.com/articles/s41467-022-31739-w) for the detailed methods and applications.
12
+ #
13
+ #
14
+ # ![fig](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41467-022-31739-w/MediaObjects/41467_2022_31739_Fig1_HTML.png)
15
+ #
16
+
17
+ # In[1]:
18
+
19
+
20
+ import omicverse as ov
21
+ #print(f"omicverse version: {ov.__version__}")
22
+ import scanpy as sc
23
+ #print(f"scanpy version: {sc.__version__}")
24
+ ov.utils.ov_plot_set()
25
+
26
+
27
+ # ## Preprocess data
28
+ #
29
+ # Here we present our re-analysis of sample 151676 of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. have manually annotated the DLPFC layers and white matter (WM) based on morphological features and gene markers.
30
+ #
31
+ # This tutorial demonstrates how to identify spatial domains on 10x Visium data using SpaceFlow. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provide it at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
32
+
33
+ # In[2]:
34
+
35
+
36
+ adata = sc.read_visium(path='data', count_file='151676_filtered_feature_bc_matrix.h5')
37
+ adata.var_names_make_unique()
38
+
39
+
40
+ # <div class="admonition warning">
41
+ # <p class="admonition-title">Note</p>
42
+ # <p>
43
+ # In omicverse versions later than `1.6.0`, we introduced the `prost` module for calculating spatially variable genes (SVGs) to replace scanpy's HVGs. If you want to use scanpy's HVGs, you can set `mode='scanpy'` in `ov.space.svg` or use the following code.
44
+ # </p>
45
+ # </div>
46
+ #
47
+ # ```python
48
+ # #adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
49
+ # #adata.raw = adata
50
+ # #adata = adata[:, adata.var.highly_variable_features]
51
+ # ```
52
+
53
+ # In[3]:
54
+
55
+
56
+ sc.pp.calculate_qc_metrics(adata, inplace=True)
57
+ adata = adata[:,adata.var['total_counts']>100]
58
+ adata=ov.space.svg(adata,mode='prost',n_svgs=3000,target_sum=1e4,platform="visium",)
59
+ adata.raw = adata
60
+ adata = adata[:, adata.var.space_variable_features]
61
+ adata
62
+
63
+
64
+ # We read the ground-truth annotation of our spatial data
65
+
66
+ # In[4]:
67
+
68
+
69
+ # read the annotation
70
+ import pandas as pd
71
+ import os
72
+ Ann_df = pd.read_csv(os.path.join('data', '151676_truth.txt'), sep='\t', header=None, index_col=0)
73
+ Ann_df.columns = ['Ground Truth']
74
+ adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
75
+ sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
76
+
77
+
78
+ # ## Training the SpaceFlow Model
79
+ #
80
+ # Here, we used `ov.space.pySpaceFlow` to construct a SpaceFlow Object and train the model.
81
+ #
82
+ # The spatial location information must be stored in `adata.obsm['spatial']`
83
+
84
+ # In[5]:
85
+
86
+
87
+ sf_obj=ov.space.pySpaceFlow(adata)
88
+
89
+
90
+ # We then train a spatially regularized deep graph network model to learn a low-dimensional embedding that reflects both expression similarity and the spatial proximity of cells in ST data.
91
+ #
92
+ # Parameters:
93
+ # - `spatial_regularization_strength`: the strength of spatial regularization, the larger the more of the spatial coherence in the identified spatial domains and spatiotemporal patterns. (default: 0.1)
94
+ # - `z_dim`: the target size of the learned embedding. (default: 50)
95
+ # - `lr`: learning rate for optimizing the model. (default: 1e-3)
96
+ # - `epochs`: the max number of the epochs for model training. (default: 1000)
97
+ # - `max_patience`: the maximum number of epochs to wait for the loss to decrease. If the loss does not decrease for more epochs than this threshold, training stops, and the parameters with the minimal loss are kept as the best model. (default: 50)
98
+ # - `min_stop`: the earliest epoch at which training can stop when the loss has not decreased for more than `max_patience` epochs. (default: 100)
99
+ # - `random_seed`: the random seed set to the random generators of the `random`, `numpy`, `torch` packages. (default: 42)
100
+ # - `gpu`: the index of the Nvidia GPU, if no GPU, the model will be trained via CPU, which is slower than the GPU training time. (default: 0)
101
+ # - `regularization_acceleration`: whether or not accelerate the calculation of regularization loss using edge subsetting strategy (default: True)
102
+ # - `edge_subset_sz`: the edge subset size for regularization acceleration (default: 1000000)
103
+ #
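The interplay between `epochs`, `max_patience` and `min_stop` can be sketched as a plain early-stopping loop. This is only an illustration: it assumes training stops once the loss has failed to improve for more than `max_patience` epochs and the current epoch is past `min_stop`. The function name and the list of pre-computed losses are hypothetical and not part of the omicverse API.

```python
def early_stopping_trace(losses, max_patience=50, min_stop=100):
    """Replay a list of per-epoch losses and report where training would stop.

    Returns (best_epoch, best_loss, stopped_epoch).
    Hypothetical helper for illustration only.
    """
    best_loss = float("inf")
    best_epoch = -1
    patience = 0
    stopped_epoch = len(losses) - 1  # default: ran through every epoch
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best model: remember it and reset the patience counter.
            best_loss, best_epoch, patience = loss, epoch, 0
        else:
            patience += 1
        # Stop only when patience is exhausted AND we are past min_stop.
        if patience > max_patience and epoch > min_stop:
            stopped_epoch = epoch
            break
    return best_epoch, best_loss, stopped_epoch
```

With `min_stop` larger than `max_patience`, a run that plateaus early still continues until `min_stop` has passed, but the parameters kept are those from the best epoch.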
104
+
105
+ # In[6]:
106
+
107
+
108
+ sf_obj.train(spatial_regularization_strength=0.1,
109
+ z_dim=50, lr=1e-3, epochs=1000,
110
+ max_patience=50, min_stop=100,
111
+ random_seed=42, gpu=0,
112
+ regularization_acceleration=True, edge_subset_sz=1000000)
113
+
114
+
115
+ # ## Calculating the Pseudo-Spatial Map
116
+ #
117
+ # Unlike the original SpaceFlow, calling SpaceFlow through omicverse only requires the `cal_pSM` function to compute the pSM.
118
+
119
+ # In[7]:
120
+
121
+
122
+ sf_obj.cal_pSM(n_neighbors=20,resolution=1,
123
+ max_cell_for_subsampling=5000,psm_key='pSM_spaceflow')
124
+
125
+
126
+ # In[8]:
127
+
128
+
129
+ sc.pl.spatial(adata, color=['pSM_spaceflow','Ground Truth'],cmap='RdBu_r')
130
+
131
+
132
+ # ## Clustering the space
133
+ #
134
+ # We can use `GMM`, `leiden` or `louvain` to cluster the space.
135
+ #
136
+ # ```python
137
+ # sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
138
+ # use_rep='spaceflow')
139
+ # ov.utils.cluster(adata,use_rep='spaceflow',method='louvain',resolution=1)
140
+ # ov.utils.cluster(adata,use_rep='spaceflow',method='leiden',resolution=1)
141
+ # ```
142
+
143
+ # In[9]:
144
+
145
+
146
+ ov.utils.cluster(adata,use_rep='spaceflow',method='GMM',n_components=7,covariance_type='full',
147
+ tol=1e-9, max_iter=1000, random_state=3607)
148
+
149
+
150
+ # In[10]:
151
+
152
+
153
+ sc.pl.spatial(adata, color=['gmm_cluster',"Ground Truth"])
154
+
155
+
156
+ # In[ ]:
157
+
158
+
159
+
160
+
ovrawm/t_stagate.txt ADDED
@@ -0,0 +1,296 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Spatial clustering and denoising expressions
5
+ #
6
+ # Spatial clustering, which is analogous to single-cell clustering, has expanded the scope of tissue physiology studies from cell-centric to structure-centric with spatially resolved transcriptomics (SRT) data.
7
+ #
8
+ # Here, we present two spatial clustering methods in OmicVerse.
9
+ #
10
+ # We made three improvements when integrating the `GraphST` and `STAGATE` algorithms into OmicVerse:
11
+ # - We removed the preprocessing that comes with `GraphST` and used the preprocessing consistent with all SRTs in OmicVerse
12
+ # - We optimised the dimensional display of `GraphST`, and PCA is considered a self-contained computational step.
13
+ # - We implemented `mclust` using Python, removing the R language dependency.
14
+ #
15
+ # If you found this tutorial helpful, please cite `GraphST`, `STAGATE` and OmicVerse:
16
+ #
17
+ # - Long, Y., Ang, K.S., Li, M. et al. Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST. Nat Commun 14, 1155 (2023). https://doi.org/10.1038/s41467-023-36796-3
18
+ # - Dong, K., Zhang, S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat Commun 13, 1739 (2022). https://doi.org/10.1038/s41467-022-29439-6
19
+ #
20
+ #
21
+
22
+ # In[1]:
23
+
24
+
25
+ import omicverse as ov
26
+ #print(f"omicverse version: {ov.__version__}")
27
+ import scanpy as sc
28
+ #print(f"scanpy version: {sc.__version__}")
29
+ ov.plot_set()
30
+
31
+
32
+ # ## Preprocess data
33
+ #
34
+ # Here we present our re-analysis of sample 151676 of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. have manually annotated the DLPFC layers and white matter (WM) based on morphological features and gene markers.
35
+ #
36
+ # This tutorial demonstrates how to identify spatial domains on 10x Visium data using STAGATE. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provide it at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
37
+
38
+ # In[2]:
39
+
40
+
41
+ adata = sc.read_visium(path='data', count_file='151676_filtered_feature_bc_matrix.h5')
42
+ adata.var_names_make_unique()
43
+
44
+
45
+ # <div class="admonition warning">
46
+ # <p class="admonition-title">Note</p>
47
+ # <p>
48
+ # In omicverse versions greater than `1.6.0`, we introduced the spatially variable gene (SVG) calculation module `prost` to replace scanpy's HVGs. If you want to use scanpy's HVGs, you can set `mode='scanpy'` in `ov.space.svg` or use the following code.
49
+ # </p>
50
+ # </div>
51
+ #
52
+ # ```python
53
+ # #adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
54
+ # #adata.raw = adata
55
+ # #adata = adata[:, adata.var.highly_variable_features]
56
+ # ```
57
+
58
+ # In[4]:
59
+
60
+
61
+ sc.pp.calculate_qc_metrics(adata, inplace=True)
62
+ adata = adata[:,adata.var['total_counts']>100]
63
+ adata=ov.space.svg(adata,mode='prost',n_svgs=3000,target_sum=1e4,platform="visium",)
64
+ adata
65
+
66
+
67
+ # In[5]:
68
+
69
+
70
+ adata.write('data/cluster_svg.h5ad',compression='gzip')
71
+
72
+
73
+ # In[2]:
74
+
75
+
76
+ #adata=ov.read('data/cluster_svg.h5ad',compression='gzip')
77
+
78
+
79
+ # (Optional) We read the ground truth area of our spatial data
80
+ #
81
+ # This step is optional. In this tutorial it only serves to demonstrate the accuracy of our clustering; in your own tasks, a ground truth is often not available.
82
+
83
+ # In[3]:
84
+
85
+
86
+ # read the annotation
87
+ import pandas as pd
88
+ import os
89
+ Ann_df = pd.read_csv(os.path.join('data', '151676_truth.txt'), sep='\t', header=None, index_col=0)
90
+ Ann_df.columns = ['Ground Truth']
91
+ adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
92
+ sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
93
+
94
+
95
+ # ## GraphST model
96
+ #
97
+ # GraphST was rated as one of the best spatial clustering algorithms in Nature Methods (2024.04), so we first call GraphST for spatial domain identification in OmicVerse.
98
+
99
+ # In[64]:
100
+
101
+
102
+ # define model
103
+ model = ov.externel.GraphST.GraphST(adata, device='cuda:0')
104
+
105
+ # train model
106
+ adata = model.train(n_pcs=30)
107
+
108
+
109
+ # ### Clustering the space
110
+ #
111
+ # We can use `mclust`, `leiden` or `louvain` to cluster the space.
112
+ #
113
+ # Note that we also add optimal transport to optimise the distribution of labels, using `refine_label` for that processing.
114
+ #
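The internals of `refine_label` are not described in this tutorial. As a purely illustrative stand-in, the sketch below smooths labels with a majority vote over each spot's nearest spatial neighbours, a simple form of spatial label refinement. The function name, the brute-force neighbour search and the `n_neighbours` argument are assumptions for illustration and may not match what omicverse does.

```python
import math
from collections import Counter


def refine_labels_by_neighbours(coords, labels, n_neighbours=4):
    """Replace each spot's label with the majority label of its nearest spots.

    coords: list of (x, y) tuples; labels: list of labels in the same order.
    Hypothetical helper for illustration only.
    """
    refined = []
    for i, (xi, yi) in enumerate(coords):
        # Distances from spot i to every other spot, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:n_neighbours]]
        # Majority vote among the selected neighbours.
        refined.append(Counter(neighbour_labels).most_common(1)[0][0])
    return refined
```

An isolated spot whose label disagrees with all of its neighbours is relabelled, which is the kind of speckle that label refinement is meant to remove.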
115
+
116
+ # In[84]:
117
+
118
+
119
+ ov.utils.cluster(adata,use_rep='graphst|original|X_pca',method='mclust',n_components=10,
120
+ modelNames='EEV', random_state=112,
121
+ )
122
+ adata.obs['mclust_GraphST'] = ov.utils.refine_label(adata, radius=50, key='mclust')
123
+
124
+
125
+ # In[87]:
126
+
127
+
128
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=20,
129
+ use_rep='graphst|original|X_pca')
130
+ ov.utils.cluster(adata,use_rep='graphst|original|X_pca',method='louvain',resolution=0.7)
131
+ ov.utils.cluster(adata,use_rep='graphst|original|X_pca',method='leiden',resolution=0.7)
132
+ adata.obs['louvain_GraphST'] = ov.utils.refine_label(adata, radius=50, key='louvain')
133
+ adata.obs['leiden_GraphST'] = ov.utils.refine_label(adata, radius=50, key='leiden')
134
+
135
+
136
+ # In[88]:
137
+
138
+
139
+ sc.pl.spatial(adata, color=['mclust_GraphST','leiden_GraphST',
140
+ 'louvain_GraphST',"Ground Truth"])
141
+
142
+
143
+ # <div class="admonition warning">
144
+ # <p class="admonition-title">Note</p>
145
+ # <p>
146
+ # If you find that the clustering is mediocre, you might consider re-running `model.train()`. Why the results improve
147
+ # </p>
148
+ # </div>
149
+ #
150
+ #
151
+
152
+ # ## STAGATE model
153
+ #
154
+ # STAGATE is designed for spatial clustering and expression denoising of spatially resolved transcriptomics (ST) data.
155
+ #
156
+ # STAGATE learns low-dimensional latent embeddings with both spatial information and gene expression via a graph attention auto-encoder. The method adopts an attention mechanism in the middle layer of the encoder and decoder, which adaptively learns the edge weights of spatial neighbor networks, and further uses them to update the spot representation by collectively aggregating information from its neighbors. The latent embeddings and the reconstructed expression profiles can be used for downstream tasks such as spatial domain identification, visualization, spatial trajectory inference, data denoising and 3D expression domain extraction.
157
+ #
158
+ # Dong, Kangning, and Shihua Zhang. “Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder.” Nature Communications 13.1 (2022): 1-12.
159
+ #
160
+ #
161
+ # Here, we used `ov.space.pySTAGATE` to construct a STAGATE object to train the model.
162
+ #
163
+
164
+ # In[14]:
165
+
166
+
167
+ #This step sometimes needs to be run twice
168
+ #and you need to check that adata.obs['X'] is correctly assigned instead of NA
169
+ adata.obs['X'] = adata.obsm['spatial'][:,0]
170
+ adata.obs['Y'] = adata.obsm['spatial'][:,1]
171
+ adata.obs['X'][0]
172
+
173
+
174
+ # In[15]:
175
+
176
+
177
+ STA_obj=ov.space.pySTAGATE(adata,num_batch_x=3,num_batch_y=2,
178
+ spatial_key=['X','Y'],rad_cutoff=200,num_epoch = 1000,lr=0.001,
179
+ weight_decay=1e-4,hidden_dims = [512, 30],
180
+ device='cuda:0')
181
+
182
+
183
+ # In[16]:
184
+
185
+
186
+ STA_obj.train()
187
+
188
+
189
+ # The latent embedding is stored in `adata.obsm['STAGATE']`, and the denoised expression in `adata.layers['STAGATE_ReX']`
190
+
191
+ # In[17]:
192
+
193
+
194
+ STA_obj.predicted()
195
+ adata
196
+
197
+
198
+ # ### Clustering the space
199
+ #
200
+
201
+ # In[18]:
202
+
203
+
204
+ ov.utils.cluster(adata,use_rep='STAGATE',method='mclust',n_components=8,
205
+ modelNames='EEV', random_state=112,
206
+ )
207
+ adata.obs['mclust_STAGATE'] = ov.utils.refine_label(adata, radius=50, key='mclust')
208
+
209
+
210
+ # In[21]:
211
+
212
+
213
+ sc.pp.neighbors(adata, n_neighbors=15, n_pcs=20,
214
+ use_rep='STAGATE')
215
+ ov.utils.cluster(adata,use_rep='STAGATE',method='louvain',resolution=0.5)
216
+ ov.utils.cluster(adata,use_rep='STAGATE',method='leiden',resolution=0.5)
217
+ adata.obs['louvain_STAGATE'] = ov.utils.refine_label(adata, radius=50, key='louvain')
218
+ adata.obs['leiden_STAGATE'] = ov.utils.refine_label(adata, radius=50, key='leiden')
219
+
220
+
221
+ # In[22]:
222
+
223
+
224
+ sc.pl.spatial(adata, color=['mclust_STAGATE','leiden_STAGATE',
225
+ 'louvain_STAGATE',"Ground Truth"])
226
+
227
+
228
+ # ### Denoising
229
+
230
+ # In[23]:
231
+
232
+
233
+ adata.var.sort_values('PI',ascending=False).head(10)
234
+
235
+
236
+ # In[24]:
237
+
238
+
239
+ plot_gene = 'MBP'
240
+ import matplotlib.pyplot as plt
241
+ fig, axs = plt.subplots(1, 2, figsize=(8, 4))
242
+ sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
243
+ sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
244
+
245
+
246
+ # ### Calculating the Pseudo-Spatial Map
247
+ #
248
+ # We compared the model results from `SpaceFlow` and `STAGATE`, and to our surprise, STAGATE can also be applied to predict pSM.
249
+
250
+ # In[25]:
251
+
252
+
253
+ STA_obj.cal_pSM(n_neighbors=20,resolution=1,
254
+ max_cell_for_subsampling=5000)
255
+ adata
256
+
257
+
258
+ # In[26]:
259
+
260
+
261
+ sc.pl.spatial(adata, color=['Ground Truth','pSM_STAGATE'],
262
+ cmap='RdBu_r')
263
+
264
+
265
+ # ## Evaluate cluster
266
+ #
267
+ # We use the adjusted Rand index (ARI) to score our clusters against the ground truth labels
268
+ #
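To make the metric concrete, the adjusted Rand index can be written out in a few lines of plain Python. This is a sketch for intuition only (the tutorial itself uses `sklearn.metrics.cluster.adjusted_rand_score`); the function name here is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Pairs of items that share a cluster in both labelings.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    if total_pairs == 0:
        return 1.0
    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_both - expected) / (max_index - expected)
```

The score is 1 for identical partitions (regardless of how the clusters are named), close to 0 for random agreement, and can be negative when the agreement is worse than chance.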
269
+
270
+ # In[86]:
271
+
272
+
273
+ from sklearn.metrics.cluster import adjusted_rand_score
274
+
275
+ obs_df = adata.obs.dropna()
276
+ #GraphST
277
+ ARI = adjusted_rand_score(obs_df['mclust_GraphST'], obs_df['Ground Truth'])
278
+ print('mclust_GraphST: Adjusted rand index = %.2f' %ARI)
279
+
280
+ ARI = adjusted_rand_score(obs_df['leiden_GraphST'], obs_df['Ground Truth'])
281
+ print('leiden_GraphST: Adjusted rand index = %.2f' %ARI)
282
+
283
+ ARI = adjusted_rand_score(obs_df['louvain_GraphST'], obs_df['Ground Truth'])
284
+ print('louvain_GraphST: Adjusted rand index = %.2f' %ARI)
285
+
286
+ ARI = adjusted_rand_score(obs_df['mclust_STAGATE'], obs_df['Ground Truth'])
287
+ print('mclust_STAGATE: Adjusted rand index = %.2f' %ARI)
288
+
289
+ ARI = adjusted_rand_score(obs_df['leiden_STAGATE'], obs_df['Ground Truth'])
290
+ print('leiden_STAGATE: Adjusted rand index = %.2f' %ARI)
291
+
292
+ ARI = adjusted_rand_score(obs_df['louvain_STAGATE'], obs_df['Ground Truth'])
293
+ print('louvain_STAGATE: Adjusted rand index = %.2f' %ARI)
294
+
295
+
296
+ # It seems that STAGATE outperforms GraphST on this task, but note that OmicVerse only provides a unified implementation of the algorithms and does not perform a full benchmark of them.
ovrawm/t_staligner.txt ADDED
@@ -0,0 +1,155 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Spatial integration and clustering
5
+ #
6
+ # STAligner is designed for alignment and integration of spatially resolved transcriptomics data.
7
+ #
8
+ # STAligner first normalizes the expression profiles for all spots and constructs a spatial neighbor network using the spatial coordinates. STAligner further employs a graph attention auto-encoder neural network to extract spatially aware embeddings, and constructs spot triplets based on the current embeddings to guide the alignment process by attracting similar spots and discriminating dissimilar spots across slices. STAligner introduces the triplet loss to update the spot embeddings, reducing the distance from the anchor to the positive spot and increasing the distance from the anchor to the negative spot. The triplet construction and auto-encoder training are optimized iteratively until batch-corrected embeddings are generated. STAligner can be applied to integrate ST datasets to achieve alignment and simultaneous identification of spatial domains from different biological samples in (a), technological platforms (I), developmental (embryonic) stages (II), disease conditions (III) and consecutive slices of a tissue for 3D slice alignment (IV).
9
+ #
10
+ # Zhou, X., Dong, K. & Zhang, S. Integrating spatial transcriptomics data across different conditions, technologies and developmental stages. Nat Comput Sci 3, 894–906 (2023). https://doi.org/10.1038/s43588-023-00528-w
11
+ #
12
+ # ![image.png](attachment:00790548-59f9-4fad-a1e3-a52c3ae98d44.png)
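The triplet loss described above can be sketched in plain Python. This is a generic margin-based triplet loss on Euclidean distances, given only for intuition; the `margin` value and the helper names are assumptions, and STAligner's actual implementation (batched, in PyTorch) may differ in detail.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Loss is zero once the negative is at least `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)
```

Minimising this quantity pulls the positive spot towards the anchor and pushes the negative spot away, which is the attract/discriminate behaviour described in the paragraph above.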
13
+
14
+ # In[1]:
15
+
16
+
17
+ from scipy.sparse import csr_matrix
18
+ import omicverse as ov
19
+ import scanpy as sc
20
+ import anndata as ad
21
+ import pandas as pd
22
+ import os
23
+
24
+ ov.utils.ov_plot_set()
25
+
26
+
27
+ # # Preprocess data
28
+ #
29
+ # Here, we use the mouse olfactory bulb data generated by Stereo-seq and Slide-seqV2. The processed Stereo-seq and Slide-seqV2 data can be downloaded from https://drive.google.com/drive/folders/1Omte1adVFzyRDw7VloOAQYwtv_NjdWcG?usp=share_link, and the original tutorials can be found at https://staligner.readthedocs.io/en/latest
30
+ #
31
+ # One critical point must be clarified: STAligner calculates highly variable genes for each sample before concatenating the AnnData samples, and by default `ad.concat` keeps only the genes shared by all samples. The number of highly variable genes should therefore not be set too low. Otherwise, with a large number of samples, too few features remain for STAligner training, which impacts the model's performance.
32
+ #
33
+ # When using STAligner, adjust the **rad_cutoff** parameter for each dataset so that each spot has an **average of 5-10 adjacent spots** connected to it. The reported value looks like: "11.3356 neighbors per cell on average."
34
+ #
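One way to pick `rad_cutoff` is to count neighbours for a few candidate radii before building the network. The sketch below does this by brute force on raw coordinates; both helper functions are hypothetical, they are not part of omicverse or STAligner, and the way `Cal_Spatial_Net` counts neighbours (for example whether a spot counts itself) is assumed rather than confirmed.

```python
import math


def mean_neighbours(coords, rad_cutoff):
    """Average number of other spots within `rad_cutoff` of each spot."""
    if not coords:
        return 0.0
    total = 0
    for i, (xi, yi) in enumerate(coords):
        total += sum(
            1
            for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= rad_cutoff
        )
    return total / len(coords)


def pick_rad_cutoff(coords, candidates, low=5, high=10):
    """Return the smallest candidate radius giving `low`..`high` neighbours on average."""
    for radius in sorted(candidates):
        if low <= mean_neighbours(coords, radius) <= high:
            return radius
    return None  # no candidate falls in the target range
```

For large datasets a brute-force search is slow; a subsample of spots is usually enough to choose a sensible radius.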
35
+
36
+ # In[2]:
37
+
38
+
39
+ Batch_list = []
40
+ adj_list = []
41
+ section_ids = ['Slide-seqV2_MoB', 'Stereo-seq_MoB']
42
+ print(section_ids)
43
+ pathway = '/storage/zengjianyangLab/hulei/scRNA-seq/scripts/STAligner'
44
+
45
+ for section_id in section_ids:
46
+     print(section_id)
47
+     adata = sc.read_h5ad(os.path.join(pathway,section_id+".h5ad"))
48
+
49
+     # make sure adata.X is stored as a CSR sparse matrix
50
+     if not isinstance(adata.X, csr_matrix):
51
+         adata.X = csr_matrix(adata.X)
52
+     else:
53
+         pass
54
+
55
+     adata.var_names_make_unique(join="++")
56
+
57
+     # make spot name unique
58
+     adata.obs_names = [x+'_'+section_id for x in adata.obs_names]
59
+
60
+     # Constructing the spatial network
61
+     ov.space.Cal_Spatial_Net(adata, rad_cutoff=50) # the spatial network is saved in adata.uns['adj']
62
+
63
+     # Normalization
64
+     sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=10000)
65
+     sc.pp.normalize_total(adata, target_sum=1e4)
66
+     sc.pp.log1p(adata)
67
+
68
+     adata = adata[:, adata.var['highly_variable']]
69
+     adj_list.append(adata.uns['adj'])
70
+     Batch_list.append(adata)
71
+
72
+
73
+ # In[3]:
74
+
75
+
76
+ Batch_list
77
+
78
+
79
+ # In[4]:
80
+
81
+
82
+ adata_concat = ad.concat(Batch_list, label="slice_name", keys=section_ids)
83
+ adata_concat.obs["batch_name"] = adata_concat.obs["slice_name"].astype('category')
84
+ print('adata_concat.shape: ', adata_concat.shape)
85
+
86
+
87
+ # # Training STAligner model
88
+ #
89
+ # Here, we use `ov.space.pySTAligner` to construct a STAligner object to train the model.
90
+ #
91
+ # We use the `train_STAligner_subgraph` function from STAligner to reduce GPU memory usage: each slice is treated as a subgraph for training.
92
+
93
+ # In[5]:
94
+
95
+
96
+ # (originally run under the `%%time` notebook cell magic)
+ # iter_comb is used to specify the order of integration. For example, (0, 1) means slice 0 will be aligned with slice 1 as reference.
+ iter_comb = [(i, i + 1) for i in range(len(section_ids) - 1)]
+
+ # Here, to reduce GPU memory usage, each slice is considered as a subgraph for training.
+ STAligner_obj = ov.space.pySTAligner(adata_concat, verbose=True, knn_neigh = 100, n_epochs = 600, iter_comb = iter_comb,
+                                      batch_key = 'batch_name', key_added='STAligner', Batch_list = Batch_list)
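With only two slices, `iter_comb` contains a single pair. The sketch below shows what the same list comprehension produces for three slices; the slice names are hypothetical placeholders, not files from this tutorial.

```python
# Hypothetical slice names, used only to illustrate the pairing order.
example_section_ids = ["slice_A", "slice_B", "slice_C"]

# Chain each slice to the next one: (0, 1), (1, 2), ...
example_iter_comb = [(i, i + 1) for i in range(len(example_section_ids) - 1)]

# Translate the index pairs back into slice names for readability.
example_pairs = [
    (example_section_ids[i], example_section_ids[j]) for i, j in example_iter_comb
]
```

Each slice is paired with its successor, so consecutive slices are aligned in a chain rather than all against one fixed reference.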
97
+
98
+
99
+ # In[6]:
100
+
101
+
102
+ STAligner_obj.train()
103
+
104
+
105
+ # We stored the latent embedding in `adata.obsm['STAligner']`.
106
+
107
+ # In[7]:
108
+
109
+
110
+ adata = STAligner_obj.predicted()
111
+
112
+
113
+ # # Clustering the space
114
+ #
115
+ # We can use `GMM`, `leiden` or `louvain` to cluster the space.
116
+ #
117
+ # `ov.utils.cluster(adata,use_rep='STAligner',method='GMM',n_components=7,covariance_type='full', tol=1e-9, max_iter=1000, random_state=3607)`
118
+ #
119
+ # or `sc.pp.neighbors(adata, use_rep='STAligner', random_state=666)`
120
+ # `ov.utils.cluster(adata,use_rep='STAligner',method='leiden',resolution=1)`
121
+
122
+ # In[8]:
123
+
124
+
125
+ sc.pp.neighbors(adata, use_rep='STAligner', random_state=666)
126
+ ov.utils.cluster(adata,use_rep='STAligner',method='leiden',resolution=0.4)
127
+ sc.tl.umap(adata, random_state=666)
128
+ sc.pl.umap(adata, color=['batch_name',"leiden"],wspace=0.5)
129
+
130
+
131
+ # We can also map the clustering results back to the original spatial coordinates to obtain spatially specific clustering results.
132
+
133
+ # In[9]:
134
+
135
+
136
+ import matplotlib.pyplot as plt
137
+ spot_size = 50
138
+ title_size = 15
139
+ fig, ax = plt.subplots(1, 2, figsize=(6, 3), gridspec_kw={'wspace': 0.05, 'hspace': 0.2})
140
+ _sc_0 = sc.pl.spatial(adata[adata.obs['batch_name'] == 'Slide-seqV2_MoB'], img_key=None, color=['leiden'], title=['Slide-seqV2'],
141
+ legend_fontsize=10, show=False, ax=ax[0], frameon=False, spot_size=spot_size, legend_loc=None)
142
+ _sc_0[0].set_title('Slide-seqV2', size=title_size)
143
+
144
+ _sc_1 = sc.pl.spatial(adata[adata.obs['batch_name'] == 'Stereo-seq_MoB'], img_key=None, color=['leiden'], title=['Stereo-seq'],
145
+ legend_fontsize=10, show=False, ax=ax[1], frameon=False, spot_size=spot_size)
146
+ _sc_1[0].set_title('Stereo-seq',size=title_size)
147
+ _sc_1[0].invert_yaxis()
148
+ plt.show()
149
+
150
+
151
+ # In[ ]:
152
+
153
+
154
+
155
+
ovrawm/t_starfysh.txt ADDED
@@ -0,0 +1,519 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Deconvolution of spatial transcriptomics without scRNA-seq
5
+ #
6
+ # This is a tutorial on a real spatial transcriptomics (ST) dataset (CID44971_TNBC) from Wu et al., 2021. The original tutorial can be found at https://starfysh.readthedocs.io/en/latest/notebooks/Starfysh_tutorial_real.html
7
+ #
8
+ #
9
+ # Starfysh performs cell-type deconvolution followed by various downstream analyses to discover spatial interactions in tumor microenvironment. Specifically, Starfysh looks for anchor spots (presumably with the highest compositions of one given cell type) informed by user-provided gene signatures ([see example](https://drive.google.com/file/d/1AXWQy_mwzFEKNjAdrJjXuegB3onxJoOM/view?usp=share_link)) as priors to guide the deconvolution inference, which further enables downstream analyses such as sample integration, spatial hub characterization, cell-cell interactions, etc. This tutorial focuses on the deconvolution task. Overall, Starfysh provides the following options:
10
+ #
11
+ # At omicverse, we have made the following improvements:
12
+ # - Easier visualization: you can use omicverse's unified visualization for scientific plotting
13
+ # - Fewer installation dependency errors: we optimized the automatic selection of packages, so you don't need to install many extra packages that may cause conflicts.
14
+ #
15
+ # **Base feature**:
16
+ #
17
+ # - Spot-level deconvolution with expected cell types and corresponding annotated signature gene sets (default)
18
+ #
19
+ # **Optional**:
20
+ #
21
+ # - Archetypal Analysis (AA):
22
+ #
23
+ # *If gene signatures are not provided*
24
+ #
25
+ # - Unsupervised cell type annotation
26
+ #
27
+ # *If gene signatures are provided but require refinement*:
28
+ #
29
+ # - Novel cell type / cell state discovery (complementary to known cell types from the *signatures*)
30
+ # - Refine known marker genes by appending archetype-specific differentially expressed genes, and update anchor spots accordingly
31
+ #
32
+ # - Product-of-Experts (PoE) integration
33
+ #
34
+ # Multi-modal integrative predictions from expression & histology images by leveraging additional side information (e.g. cell density) from the H&E image.
35
+ #
36
+ # He, S., Jin, Y., Nazaret, A. et al.
37
+ # Starfysh integrates spatial transcriptomic and histologic data to reveal heterogeneous tumor–immune hubs.
38
+ # Nat Biotechnol (2024).
39
+ # https://doi.org/10.1038/s41587-024-02173-8
40
+
41
+ # In[1]:
42
+
43
+
44
+ import scanpy as sc
45
+ import omicverse as ov
46
+ ov.plot_set()
47
+
48
+
49
+ # In[2]:
50
+
51
+
52
+ from omicverse.externel.starfysh import (AA, utils, plot_utils, post_analysis)
53
+ from omicverse.externel.starfysh import _starfysh as sf_model
54
+
55
+
56
+ # ### (1). load data and marker genes
57
+ #
58
+ # File Input:
59
+ # - Spatial transcriptomics
60
+ # - Count matrix: `adata`
61
+ # - (Optional): Paired histology & spot coordinates: `img`, `map_info`
62
+ #
63
+ # - Annotated signatures (marker genes) for potential cell types: `gene_sig`
64
+ #
65
+ # Starfysh is built upon scanpy and AnnData. A common ST/Visium sample folder consists of an expression count file (usually `filtered_feature_bc_matrix.h5`) and a subdirectory with the corresponding H&E image and spatial information, as provided by the Visium platform.
66
+ #
67
+ # For example, our example real ST data has the following structure:
68
+ # ```
69
+ # ├── data_folder
70
+ #     signature.csv
71
+ #
72
+ #     ├── CID44971:
73
+ #         \__ filtered_feature_bc_matrix.h5
74
+ #
75
+ #         ├── spatial:
76
+ #             \__ aligned_fiducials.jpg
77
+ #                 detected_tissue_image.jpg
78
+ #                 scalefactors_json.json
79
+ #                 tissue_hires_image.png
80
+ #                 tissue_lowres_image.png
81
+ #                 tissue_positions_list.csv
82
+ # ```
83
+ #
84
+ # For data that doesn't follow the common Visium data structure (e.g. `filtered_feature_bc_matrix.h5` is missing, or the given `.h5ad` count matrix file lacks spatial metadata), please construct the data as an AnnData object that combines this information, as the example simulated data shows:
85
+
86
+ # [Note]: If you’re running this tutorial locally, please download the sample [data](https://drive.google.com/drive/folders/1RIp0Z2eF1m8Ortx0sgB4z5g5ISsRFzJ4?usp=share_link) and [signature gene sets](https://drive.google.com/file/d/1AXWQy_mwzFEKNjAdrJjXuegB3onxJoOM/view?usp=share_link), and save them in the relative path `data/star_data` (otherwise please modify the `data_path` defined in the cell below):
87
+
88
+ # In[3]:
89
+
90
+
91
+ # Specify data paths
92
+ data_path = 'data/star_data'
93
+ sample_id = 'CID44971_TNBC'
94
+ sig_name = 'bc_signatures_version_1013.csv'
95
+
96
+
97
+ # In[4]:
98
+
99
+
100
+ # Load expression counts and signature gene sets
101
+ adata, adata_normed = utils.load_adata(data_folder=data_path,
102
+ sample_id=sample_id, # sample id
103
+ n_genes=2000 # number of highly variable genes to keep
104
+ )
105
+
106
+
107
+ # In[5]:
108
+
109
+
110
+ import pandas as pd
111
+ import os
112
+ gene_sig = pd.read_csv(os.path.join(data_path, sig_name))
113
+ gene_sig = utils.filter_gene_sig(gene_sig, adata.to_df())
114
+ gene_sig.head()
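The filtering step above presumably keeps only the signature genes that are present in the expression matrix. A minimal sketch of that idea is shown here, with the signatures held in a plain dict rather than the DataFrame used by Starfysh; the function name and data layout are hypothetical, and `utils.filter_gene_sig` may apply further rules.

```python
def filter_signatures(gene_sig, available_genes):
    """Keep, for each cell type, only the signature genes found in the data.

    gene_sig: {cell_type: [gene, ...]}; available_genes: iterable of gene names.
    Hypothetical helper for illustration only.
    """
    available = set(available_genes)
    return {
        cell_type: [gene for gene in genes if gene in available]
        for cell_type, genes in gene_sig.items()
    }
```

Genes absent from the count matrix cannot contribute to anchor detection, so dropping them up front avoids missing-column errors later.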
115
+
116
+
117
+ # **If there's no input signature gene sets, Starfysh defines "archetypal marker genes" as *signatures*. Please refer to the following code snippet and see details in section (3).**
118
+ #
119
+ # ```Python
120
+ # aa_model = AA.ArchetypalAnalysis(adata_orig=adata_normed)
121
+ # archetype, arche_dict, major_idx, evs = aa_model.compute_archetypes(cn=40)
122
+ # gene_sig = aa_model.find_markers(n_markers=30, display=False)
123
+ # gene_sig = utils.filter_gene_sig(gene_sig, adata.to_df())
124
+ # gene_sig.head()
125
+ # ```
126
+
127
+ # In[6]:
128
+
129
+
130
+ # Load spatial information
131
+ img_metadata = utils.preprocess_img(data_path,
132
+ sample_id,
133
+ adata_index=adata.obs.index,
134
+ #hchannel=False
135
+ )
136
+ img, map_info, scalefactor = img_metadata['img'], img_metadata['map_info'], img_metadata['scalefactor']
137
+ umap_df = utils.get_umap(adata, display=True)
138
+
139
+
140
+ # In[7]:
141
+
142
+
143
+ import matplotlib.pyplot as plt
144
+ plt.figure(figsize=(6, 6), dpi=80)
145
+ plt.imshow(img)
146
+
147
+
148
+ # In[8]:
149
+
150
+
151
+ map_info.head()
152
+
153
+
154
+ # ### (2). Preprocessing (finding anchor spots)
155
+ # - Identify anchor spot locations.
156
+ #
157
+ # Instantiate parameters for Starfysh model training:
158
+ # - Raw & normalized counts after taking highly variable genes
159
+ # - filtered signature genes
160
+ # - library size & spatial smoothed library size (log-transformed)
161
+ # - Anchor spot indices (`anchors_df`) for each cell type & their signature means (`sig_mean`)
162
+ #
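The idea of anchor spots can be sketched as follows: score every spot by the mean expression of a cell type's signature genes and keep the top-scoring spots. This is a simplified illustration under that assumption; the function name and the dict-based data layout are hypothetical, and Starfysh's own procedure (which also uses normalization and spatial smoothing) may differ.

```python
def find_anchor_spots(expr, gene_sig, n_anchors=60):
    """Pick the spots with the highest mean signature expression per cell type.

    expr: {spot: {gene: value}}; gene_sig: {cell_type: [gene, ...]}.
    Hypothetical helper for illustration only.
    """
    anchors = {}
    for cell_type, genes in gene_sig.items():
        if not genes:
            continue  # nothing to score for an empty signature
        # Mean signature expression of this cell type in every spot.
        scores = {
            spot: sum(profile.get(gene, 0.0) for gene in genes) / len(genes)
            for spot, profile in expr.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        anchors[cell_type] = ranked[:n_anchors]
    return anchors
```

The `n_anchors` argument plays the same role as `n_anchors=60` in `utils.VisiumArguments` below: a larger value gives more, but on average less pure, anchor spots per cell type.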
163
+
164
+ # In[ ]:
165
+
166
+
167
+ # Parameters for training
168
+ visium_args = utils.VisiumArguments(adata,
169
+ adata_normed,
170
+ gene_sig,
171
+ img_metadata,
172
+ n_anchors=60,
173
+ window_size=3,
174
+ sample_id=sample_id
175
+ )
176
+
177
+ adata, adata_normed = visium_args.get_adata()
178
+ anchors_df = visium_args.get_anchors()
179
+
180
+
181
+ # In[10]:
182
+
183
+
184
+ adata.obs['log library size']=visium_args.log_lib
185
+ adata.obs['windowed log library size']=visium_args.win_loglib
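The two covariates stored above can be pictured with a small sketch: the log library size is the log of a spot's total counts, and the windowed version averages it over neighbouring spots. The code assumes spots sit on a regular (row, column) array grid and that smoothing is a plain mean over a square window; both helpers are hypothetical, and Starfysh's own computation may differ.

```python
import math


def log_library_size(counts):
    """Log of the total counts per spot (each row is one spot, sums must be > 0)."""
    return [math.log(sum(row)) for row in counts]


def windowed_mean(values, positions, window_size=3):
    """Average each spot's value over the spots inside a square window around it.

    positions: list of (row, col) array coordinates, same order as `values`.
    """
    half = window_size // 2
    smoothed = []
    for row, col in positions:
        inside = [
            value
            for value, (r2, c2) in zip(values, positions)
            if abs(r2 - row) <= half and abs(c2 - col) <= half
        ]
        smoothed.append(sum(inside) / len(inside))
    return smoothed
```

Smoothing damps spot-to-spot noise in sequencing depth, which is why the windowed map plotted below should look less speckled than the raw one.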
186
+
187
+
188
+ # In[11]:
189
+
190
+
191
+ sc.pl.spatial(adata, cmap='magma',
192
+ # color spots by the log library size
193
+ color='log library size',
194
+ ncols=4, size=1.3,
195
+ img_key='hires',
196
+ #palette=Layer_color
197
+ # limit color scale at 99.2% quantile of cell abundance
198
+ #vmin=0, vmax='p99.2'
199
+ )
200
+
201
+
202
+ # In[12]:
203
+
204
+
205
+ sc.pl.spatial(adata, cmap='magma',
206
+ # color spots by the spatially smoothed (windowed) log library size
207
+ color='windowed log library size',
208
+ ncols=4, size=1.3,
209
+ img_key='hires',
210
+ #palette=Layer_color
211
+ # limit color scale at 99.2% quantile of cell abundance
212
+ #vmin=0, vmax='p99.2'
213
+ )
214
+
215
+
216
+ # plot raw gene expression:
217
+
218
+ # In[13]:
219
+
220
+
221
+ sc.pl.spatial(adata, cmap='magma',
222
+ # color spots by the raw expression of a single gene
223
+ color='IL7R',
224
+ ncols=4, size=1.3,
225
+ img_key='hires',
226
+ #palette=Layer_color
227
+ # limit color scale at 99.2% quantile of cell abundance
228
+ #vmin=0, vmax='p99.2'
229
+ )
230
+
231
+
232
+ # In[14]:
233
+
234
+
235
+ plot_utils.plot_anchor_spots(umap_df,
236
+ visium_args.pure_spots,
237
+ visium_args.sig_mean,
238
+ bbox_x=2
239
+ )
240
+
241
+
242
+ # ### (3). Optional: Archetypal Analysis
243
+ # Overview:
244
+ # If users don't provide annotated gene signature sets with cell types, Starfysh identifies candidate cell types via archetypal analysis (AA). The underlying assumption is that the geometric "extremes" are the purest cell types, whereas all other spots are mixtures of the "archetypes". If users provide gene signature sets, they can still optionally apply AA to refine marker genes and update anchor spots for known cell types. In addition, AA can identify & assign potential novel cell types / states. Here are the features provided by the optional archetypal analysis:
245
+ # - Finding archetypal spots & assigning a 1-1 mapping to their closest anchor spot neighbors
246
+ # - Finding archetypal marker genes & append them to marker genes of annotated cell types
247
+ # - Assigning novel cell type / cell states as the most distant archetypes
248
+ #
249
+ # Overall, Starfysh provides the archetypal analysis as a complementary toolkit for automatic cell-type annotation & signature gene completion:<br><br>
250
+ #
251
+ # 1. *If signature genes aren't provided:* <br><br>Archetypal analysis defines the geometric extrema of the data as major cell types with corresponding marker genes.<br><br>
252
+ #
253
+ # 2. *If complete signature genes are known*: <br><br>Users can skip this section and use only the signature priors<br><br>
254
+ #
255
+ # 3. *If signature genes are incomplete or need refinement*: <br><br>Archetypal analysis can be applied to
256
+ # a. Refine signatures of certain cell types
257
+ # b. Find novel cell types / states that haven't been provided from the input signature
258
+
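+ # The "geometric extremes" idea above can be illustrated with a toy sketch. This is not Starfysh's archetypal analysis: it is a greedy farthest-point heuristic on synthetic 2-D data, and the names `pure`, `spots` and `farthest_point_extremes` are made up for this example. It only shows that when spots are convex mixtures of a few pure profiles, the pure profiles sit at the corners of the point cloud.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical "pure" profiles (the archetypes) in a 2-D feature space.
pure = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Mixture spots: convex combinations of the pure profiles.
weights = rng.dirichlet(np.ones(3), size=200)
spots = np.vstack([pure, weights @ pure])  # rows 0..2 are the pure profiles


def farthest_point_extremes(x, k):
    """Greedy farthest-point heuristic: pick k mutually distant rows of x."""
    center = x.mean(axis=0)
    # Start from the row farthest from the centre of the cloud.
    chosen = [int(np.argmax(np.linalg.norm(x - center, axis=1)))]
    while len(chosen) < k:
        # Distance of every row to its nearest already-chosen row.
        dists = np.linalg.norm(x[:, None, :] - x[chosen][None, :, :], axis=2)
        chosen.append(int(np.argmax(dists.min(axis=1))))
    return sorted(chosen)


extreme_idx = farthest_point_extremes(spots, k=3)
print("rows picked as extremes:", extreme_idx)
```

+ # In this toy setting the heuristic recovers the rows holding the pure profiles; real archetypal analysis fits the archetypes by optimisation and also has to cope with noise.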
259
+ # #### If signature genes aren't provided
260
+ #
261
+ # Note: <br>
262
+ # - An intrinsic dimension (ID) estimator provides a lower bound for the number of archetypes $k$, followed by the elbow method with iterations to identify the optimal $k$. By default, the [condition number](https://scikit-dimension.readthedocs.io/en/latest/skdim.id.FisherS.html) `cn` is set to 30; if you believe there are potentially more / fewer cell types, please increase / decrease `cn` accordingly.
263
+
264
+ # Major cell types & corresponding markers are represented by the inferred archetypes:<br><br>
265
+ #
266
+ #
267
+ #
268
+ # ```Python
269
+ # aa_model = AA.ArchetypalAnalysis(adata_orig=adata_normed)
270
+ # archetype, arche_dict, major_idx, evs = aa_model.compute_archetypes(cn=40)
271
+ #
272
+ # # (1). Find archetypal spots & archetypal clusters
273
+ # arche_df = aa_model.find_archetypal_spots(major=True)
274
+ #
275
+ # # (2). Define "signature genes" as marker genes associated with each archetypal cluster
276
+ # gene_sig = aa_model.find_markers(n_markers=30, display=False)
277
+ # gene_sig.head()
278
+ # ```
279
+
280
+ # #### If complete signature genes are known
281
+ #
282
+ # Users can skip this section & run Starfysh
283
+ #
284
+ # #### If signature genes are incomplete or require refinement
285
+ #
286
+ # **In this tutorial, we'll show an example of applying best-aligned archetypes to existing `anchors` of given cell type(s) to append signature genes.**
287
+
288
+ # In[ ]:
289
+
290
+
291
+ aa_model = AA.ArchetypalAnalysis(adata_orig=adata_normed)
292
+ archetype, arche_dict, major_idx, evs = aa_model.compute_archetypes(cn=40)
293
+ # (1). Find archetypal spots & archetypal clusters
294
+ arche_df = aa_model.find_archetypal_spots(major=True)
295
+
296
+ # (2). Find marker genes associated with each archetypal cluster
297
+ markers_df = aa_model.find_markers(n_markers=30, display=False)
298
+
299
+ # (3). Map archetypes to closest anchors (1-1 per cell type)
300
+ map_df, map_dict = aa_model.assign_archetypes(anchors_df)
301
+
302
+ # (4). Optional: Find the most distant archetypes that are not assigned to any annotated cell types
303
+ distant_arches = aa_model.find_distant_archetypes(anchors_df, n=3)
304
+
305
+
306
+ # In[16]:
307
+
308
+
309
+ plot_utils.plot_evs(evs, kmin=aa_model.kmin)
310
+
311
+
312
+ # - Visualize archetypes
313
+
314
+ # In[17]:
315
+
316
+
317
+ aa_model.plot_archetypes(do_3d=False, major=True, disp_cluster=False)
318
+
319
+
320
+ # - Visualize archetypal - cell type mapping:
321
+
322
+ # In[18]:
323
+
324
+
325
+ aa_model.plot_mapping(map_df)
326
+
327
+
328
+ # - Application: appending marker genes. Append archetypal marker genes to the best-aligned anchors:
329
+
330
+ # In[ ]:
331
+
332
+
333
+ visium_args = utils.refine_anchors(
334
+ visium_args,
335
+ aa_model,
336
+ #thld=0.7, # alignment threshold
337
+ n_genes=5,
338
+ #n_iters=1
339
+ )
340
+
341
+ # Get updated adata & signatures
342
+ adata, adata_normed = visium_args.get_adata()
343
+ gene_sig = visium_args.gene_sig
344
+ cell_types = gene_sig.columns
345
+
346
+
347
+ # ## Run starfysh without histology integration
348
+ #
349
+ #
350
+ #
351
+ # We perform `n_repeats` random restarts and select the best model, i.e. the one with the lowest loss for parameter `c` (inferred cell-type proportions):
352
+ #
353
+ # ### (1). Model parameters
354
+
355
+ # In[20]:
356
+
357
+
358
+ import torch
359
+ n_repeats = 3
360
+ epochs = 200
361
+ patience = 50
362
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
363
+
364
+
365
+ # ### (2). Model training
366
+ #
367
+ # Users can choose to run one of the following `Starfysh` models, without or with histology integration:
368
+ #
369
+ # Without histology integration: setting `utils.run_starfysh(poe=False)` (default)
370
+ #
371
+ # With histology integration: setting `utils.run_starfysh(poe=True)`
372
+
373
+ # In[21]:
374
+
375
+
376
+ # Run models
377
+ model, loss = utils.run_starfysh(visium_args,
378
+ n_repeats=n_repeats,
379
+ epochs=epochs,
380
+ #patience=patience,
381
+ device=device
382
+ )
383
+
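+ # The random-restart strategy described above (train `n_repeats` times, keep the run with the lowest loss) can be sketched in plain Python. `train_once` is a hypothetical stand-in for a real training run, not part of Starfysh, and the loss values are random numbers used only to make the selection step visible.

```python
import random


def train_once(seed):
    """Hypothetical stand-in for one training run; returns (model, final_loss)."""
    rng = random.Random(seed)
    model = {"seed": seed}             # placeholder for trained parameters
    final_loss = rng.uniform(0.5, 1.5)  # placeholder for the final training loss
    return model, final_loss


def best_of_restarts(n_repeats):
    """Train n_repeats times from different seeds and keep the lowest-loss run."""
    runs = [train_once(seed) for seed in range(n_repeats)]
    return min(runs, key=lambda run: run[1])


best_model, best_loss = best_of_restarts(n_repeats=3)
print(f"kept restart {best_model['seed']} with loss {best_loss:.3f}")
```

+ # Restarting from several initialisations reduces the chance of keeping a model that got stuck in a poor local optimum, at the cost of `n_repeats` times the training time.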
384
+
385
+ # ### Downstream analysis
386
+ #
387
+ # ### Parse Starfysh inference output
388
+
389
+ # In[22]:
390
+
391
+
392
+ adata, adata_normed = visium_args.get_adata()
393
+ inference_outputs, generative_outputs,adata_ = sf_model.model_eval(model,
394
+ adata,
395
+ visium_args,
396
+ poe=False,
397
+ device=device)
398
+
399
+
400
+ # ### Visualize starfysh deconvolution results
401
+ #
402
+ # **Gene sig mean vs. inferred prop**
403
+
404
+ # In[31]:
405
+
406
+
407
+ import numpy as np
408
+ n_cell_types = gene_sig.shape[1]
409
+ idx = np.random.randint(0, n_cell_types)
410
+ post_analysis.gene_mean_vs_inferred_prop(inference_outputs,
411
+ visium_args,
412
+ idx=idx,
413
+ figsize=(4,4)
414
+ )
415
+
416
+
417
+ # ### Spatial visualizations:
418
+ #
419
+ # Inferred density on Spatial map:
420
+
421
+ # In[24]:
422
+
423
+
424
+ plot_utils.pl_spatial_inf_feature(adata_, feature='ql_m', cmap='Blues')
425
+
426
+
427
+ # **Inferred cell-type proportions (spatial map):**
428
+ #
429
+
430
+ # In[25]:
431
+
432
+
433
+ def cell2proportion(adata):
434
+     adata_plot=sc.AnnData(adata.X)
435
+     adata_plot.obs=utils.extract_feature(adata, 'qc_m').obs.copy()
436
+     adata_plot.var=adata.var.copy()
437
+     adata_plot.obsm=adata.obsm.copy()
438
+     adata_plot.obsp=adata.obsp.copy()
439
+     adata_plot.uns=adata.uns.copy()
440
+     return adata_plot
441
+ adata_plot=cell2proportion(adata_)
442
+
443
+
444
+ # In[26]:
445
+
446
+
447
+ adata_plot
448
+
449
+
450
+ # In[27]:
451
+
452
+
453
+ sc.pl.spatial(adata_plot, cmap='Spectral_r',
454
+ # cell types to display
455
+ color=['Basal','LumA','LumB'],
456
+ ncols=4, size=1.3,
457
+ img_key='hires',
458
+ vmin=0, vmax='p90'
459
+ )
460
+
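+ # The `vmin=0, vmax='p90'` arguments above cap the colour scale at the 90th percentile of the plotted values, so that a few very high spots do not wash out the rest of the map. A minimal NumPy sketch of that clipping follows; the helper `clip_to_percentile` and the toy proportions are made up for illustration, and scanpy's own handling of percentile strings may differ in detail.

```python
import numpy as np


def clip_to_percentile(values, lower=0.0, upper_pct=90.0):
    """Clip values to [lower, percentile(values, upper_pct)]."""
    values = np.asarray(values, dtype=float)
    vmax = np.percentile(values, upper_pct)
    return np.clip(values, lower, vmax), vmax


# Toy inferred proportions for ten spots, with one outlier spot.
proportions = np.array([0.0, 0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.95])
clipped, vmax = clip_to_percentile(proportions)
print("colour scale upper limit:", vmax)
print("values used for colouring:", clipped)
```

+ # Only the colour mapping is affected; the values stored in the AnnData object stay unchanged.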
461
+
462
+ # In[28]:
463
+
464
+
465
+ ov.pl.embedding(adata_plot,
466
+ basis='z_umap',
467
+ color=['Basal', 'LumA', 'MBC', 'Normal epithelial'],
468
+ frameon='small',
469
+ vmin=0, vmax='p90',
470
+ cmap='Spectral_r',
471
+ )
472
+
473
+
474
+ # In[29]:
475
+
476
+
477
+ pred_exprs = sf_model.model_ct_exp(model,
478
+ adata,
479
+ visium_args,
480
+ device=device)
481
+
482
+
483
+ # Plot spot-level expression (e.g. `IL7R` from *Effector Memory T cells (Tem)*):
484
+ #
485
+
486
+ # In[30]:
487
+
488
+
489
+ gene='IL7R'
490
+ gene_celltype='Tem'
491
+ adata_.layers[f'infer_{gene_celltype}']=pred_exprs[gene_celltype]
492
+
493
+ sc.pl.spatial(adata_, cmap='Spectral_r',
494
+ # gene to display (predicted expression layer)
495
+ color=gene,
496
+ title=f'{gene} (Predicted expression)\n{gene_celltype}',
497
+ layer=f'infer_{gene_celltype}',
498
+ ncols=4, size=1.3,
499
+ img_key='hires',
500
+ #vmin=0, vmax='p90'
501
+ )
502
+
503
+
504
+ # ### Save model & inferred parameters
505
+
506
+ # In[ ]:
507
+
508
+
509
+ # Specify output directory (created if it does not exist)
510
+ import os
511
+ outdir = './results/'
512
+ os.makedirs(outdir, exist_ok=True)
513
+
514
+ # save the model
515
+ torch.save(model.state_dict(), os.path.join(outdir, 'starfysh_model.pt'))
516
+
517
+ # save `adata` object with inferred parameters
518
+ adata.write(os.path.join(outdir, 'st.h5ad'))
519
+
ovrawm/t_stt.txt ADDED
@@ -0,0 +1,274 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Spatial transition tensor of single cells
5
+ #
6
+ # Spatial transition tensor (STT) is a method that uses messenger RNA splicing and spatial transcriptomes through a multiscale dynamical model to characterize multistability in space. By learning a four-dimensional transition tensor and spatial-constrained random walk, STT reconstructs cell-state-specific dynamics and spatial state transitions via both short-time local tensor streamlines between cells and long-time transition paths among attractors.
7
+ #
8
+ # We made three improvements in integrating the STT algorithm in OmicVerse:
9
+ #
10
+ # - **More user-friendly function implementation**: we removed the unnecessary extra assignments in the original documentation and encapsulated them automatically in the `omicverse.space.STT` class.
11
+ # - **Removed version dependencies**: We removed all the strict version pins such as ``CellRank==1.3.1`` from the original `requirements.txt`, so that users only need to install the OmicVerse package and the latest version of CellRank.
12
+ # - **Added clearer function notes**: In this document we have rewritten the parts of the original tutorial that were unclear and required going back to the paper.
13
+ #
14
+ # If you found this tutorial helpful, please cite STT and OmicVerse:
15
+ #
16
+ # - Zhou, P., Bocci, F., Li, T. et al. Spatial transition tensor of single cells. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02266-x
17
+
18
+ # In[1]:
19
+
20
+
21
+ import omicverse as ov
22
+ #import omicverse.STT as st
23
+ import scvelo as scv
24
+ import scanpy as sc
25
+ ov.plot_set()
26
+
27
+
28
+ # ## Preprocess data
29
+ #
30
+ # In this tutorial, we focus on reproducing the analysis of the original authors' data, for which layers such as `adata.layers['Ms']` and `adata.layers['Mu']` have already been computed. When analysing our own data, we need the following functions to preprocess the raw data:
31
+ #
32
+ # ```
33
+ # scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=3000)
34
+ # scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
35
+ # ```
36
+ #
37
+ # The `mouse_brain.h5ad` file can be found in the [Github:STT](https://github.com/cliffzhou92/STT/tree/release/data) repository.
38
+
39
+ # In[2]:
40
+
41
+
42
+ adata = sc.read_h5ad('mouse_brain.h5ad')
43
+ adata
44
+
45
+
46
+ # ## Training STT model
47
+ #
48
+ # Here, we use `ov.space.STT` to construct an STT object to train the model. We need to set the following parameters during initialisation:
49
+ #
50
+ # - `spatial_loc`: The key in `adata.obsm` holding the spatial coordinates of each spot. In 10x Genomics data these are typically stored in `adata.obsm['spatial']`, so this parameter is typically set to `spatial`, but here they are stored in `xy_loc`.
51
+ # - `region`: This parameter is considered to be the region of the attractor, which we would normally define using spatial annotations or cellular annotation information.
52
+
53
+ # In[3]:
54
+
55
+
56
+ STT_obj=ov.space.STT(adata,spatial_loc='xy_loc',region='Region')
57
+
58
+
59
+ # Note that we need to specify the number of potential attractors before predicting them. Neither the authors' original tutorial nor the original paper clearly defines how to choose this parameter. Following the authors' tutorial, we use the number of Leiden clusters computed on `adata_aggr` as an estimate of the number of potential attractors.
60
+
61
+ # In[4]:
62
+
63
+
64
+ STT_obj.stage_estimate()
65
+
66
+
67
+ # The authors noted in the original tutorial that a key parameter called `spa_weight` controls the relative weight of the spatial location similarity kernel.
68
+ #
69
+ # Other parameters are further described in the API documentation. Typically `n_states` is the parameter we are interested in modifying.
70
+
71
+ # In[ ]:
72
+
73
+
74
+ STT_obj.train(n_states = 9, n_iter = 15, weight_connectivities = 0.5,
75
+ n_neighbors = 50,thresh_ms_gene = 0.2, spa_weight =0.3)
76
+
77
+
78
+ # After the prediction is complete, the attractor is stored in `adata.obs['attractor']`. We can use `ov.pl.embedding` to visualize it.
79
+
80
+ # In[12]:
81
+
82
+
83
+ ov.pl.embedding(adata, basis="xy_loc",
84
+ color=["attractor"],frameon='small',
85
+ palette=ov.pl.sc_color[11:])
86
+
87
+
88
+ # In[7]:
89
+
90
+
91
+ ov.pl.embedding(adata, basis="xy_loc",
92
+ color=["Region"],frameon='small',
93
+ )
94
+
95
+
96
+ # ## Pathway analysis
97
+ #
98
+ # In the original tutorial, the authors wrapped `gseapy==1.0.4` for pathway enrichment. Note that this function requires network access; we have modified it so that enrichment can use a local pathway dataset.
99
+ #
100
+ # We can download ready-made pathway gene sets directly from [enrichr](https://maayanlab.cloud/Enrichr/#libraries), such as the `KEGG_2019_Mouse` library used in this study.
101
+ #
102
+ # https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=KEGG_2019_Mouse
103
+
104
+ # In[8]:
105
+
106
+
107
+ pathway_dict=ov.utils.geneset_prepare('genesets/KEGG_2019_Mouse.txt',organism='Mouse')
108
+
109
+
110
+ # In[ ]:
111
+
112
+
113
+ STT_obj.compute_pathway(pathway_dict)
114
+
115
+
116
+ # After running the function, we can use the `plot_pathway` function to visualize the similarity between pathway dynamics in the low-dimensional embeddings.
117
+
118
+ # In[11]:
119
+
120
+
121
+ fig = STT_obj.plot_pathway(figsize = (10,8),size = 100,fontsize = 12)
122
+ for ax in fig.axes:
123
+     ax.set_xlabel('Embedding 1', fontsize=20) # Adjust font size as needed
124
+     ax.set_ylabel('Embedding 2', fontsize=20) # Adjust font size as needed
125
+ fig.show()
126
+
127
+
128
+ # If we are interested in the specific pathways, we can use the `plot_tensor_pathway` function to visualize the streamlines.
129
+
130
+ # In[13]:
131
+
132
+
133
+ import matplotlib.pyplot as plt
134
+ fig, ax = plt.subplots(1, 1, figsize=(4, 4))
135
+ STT_obj.plot_tensor_pathway(pathway_name = 'Wnt signaling pathway',basis = 'xy_loc',
136
+ ax=ax)
137
+
138
+
139
+ # In[14]:
140
+
141
+
142
+ fig, ax = plt.subplots(1, 1, figsize=(4, 4))
143
+ STT_obj.plot_tensor_pathway( 'TGF-beta signaling pathway',basis = 'xy_loc',
144
+ ax=ax)
145
+
146
+
147
+ # ## Tensor analysis
148
+ #
149
+ # In the author's original paper, a very interesting concept is mentioned, attractor-averaged and attractor-specific tensors.
150
+ #
151
+ # We can analyse the Joint Tensor and thus study the steady-state processes of different attractors. If the streamlines pass through an attractor, the attractor is in a steady state; if the streamlines emanate from or converge on the attractor, it is in a dynamic state.
152
+ #
153
+ # In addition to this, the Unspliced Tensor also reflects the strength as well as the size of the attraction.
154
+
155
+ # In[4]:
156
+
157
+
158
+ STT_obj.plot_tensor(list_attractor = [1,3,5,6],
159
+ filter_cells = True, member_thresh = 0.1, density = 1)
160
+
161
+
162
+ # ## Landscape analysis
163
+ #
164
+ # Each attractor corresponds to a spatial steady state, so we can use contour plots to visualise these steady states and use CellRank's correlation function to infer state transitions between different attractors.
165
+
166
+ # In[17]:
167
+
168
+
169
+ STT_obj.construct_landscape(coord_key = 'X_xy_loc')
170
+
171
+
172
+ # In[14]:
173
+
174
+
175
+ sc.pl.embedding(adata, color = ['attractor', 'Region'],basis= 'trans_coord')
176
+
177
+
178
+ # The method used to infer the lineage is either 'MPFT' (maximum probability flow tree, global) or 'MPPT' (most probable path tree, local).
179
+
180
+ # In[15]:
181
+
182
+
183
+ STT_obj.infer_lineage(si=3,sf=4, method = 'MPPT',flux_fraction=0.8,color_palette_name = 'tab10',size_point = 8,
184
+ size_text=12)
185
+
186
+
187
+ # The Sankey plot displays the relation between STT attractors (left) and spatial region annotations (right). The width of each link indicates the number of cells that carry both the connected attractor label and the connected region annotation label.
188
+
189
+ # In[16]:
190
+
191
+
192
+ fig = STT_obj.plot_sankey(adata.obs['attractor'].tolist(),adata.obs['Region'].tolist())
193
+
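+ # The link widths of such a Sankey plot are simply co-occurrence counts of the two label lists. The sketch below computes them with the standard library; the attractor and region labels are invented toy values, not results from the mouse brain data.

```python
from collections import Counter

# Toy labels for six cells (hypothetical values for illustration only).
attractors = ["0", "0", "1", "1", "1", "2"]
regions = ["cortex", "cortex", "cortex", "thalamus", "thalamus", "thalamus"]

# Each (attractor, region) pair is one link; its count is the link width.
link_widths = Counter(zip(attractors, regions))

for (attractor, region), n_cells in sorted(link_widths.items()):
    print(f"attractor {attractor} -> {region}: {n_cells} cells")
```

+ # A link that carries most of the cells of an attractor indicates that the attractor is largely confined to one annotated region.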
194
+
195
+ # ## Saving and Loading Data
196
+ #
197
+ # We need to save the data after the calculation is complete, and we provide the load function to load it directly in the next analysis without having to re-analyse it.
198
+
199
+ # In[24]:
200
+
201
+
202
+ #del adata.uns['r2_keep_train']
203
+ #del adata.uns['r2_keep_test']
204
+ #del adata.uns['kernel']
205
+ #del adata.uns['kernel_connectivities']
206
+
207
+ STT_obj.adata.write('data/mouse_brain_adata.h5ad')
208
+ STT_obj.adata_aggr.write('data/mouse_brain_adata_aggr.h5ad')
209
+
210
+
211
+ # In[2]:
212
+
213
+
214
+ adata=ov.read('data/mouse_brain_adata.h5ad')
215
+ adata_aggr=ov.read('data/mouse_brain_adata_aggr.h5ad')
216
+
217
+
218
+ # In[3]:
219
+
220
+
221
+ STT_obj=ov.space.STT(adata,spatial_loc='xy_loc',region='Region')
222
+ STT_obj.load(adata,adata_aggr)
223
+
224
+
225
+ # ## Gene Dynamic
226
+ #
227
+ # Genes with high multistability scores have varying expression levels in both unspliced and spliced counts across attractors, and show a gradual change during stage transitions. The scores of these genes are stored in `adata.var['r2_test']`.
228
+ #
229
+ #
230
+
231
+ # In[18]:
232
+
233
+
234
+ adata.var['r2_test'].sort_values(ascending=False)
235
+
236
+
237
+ # In[11]:
238
+
239
+
240
+ STT_obj.plot_top_genes(top_genes = 6, ncols = 2, figsize = (8,8),)
241
+
242
+
243
+ # We analysed the attractor 1-related gene Sim1 and found that in the smoothed unspliced matrix `Mu`, Sim1 expression is low at attractor 1, whereas in the spliced matrix `Ms`, Sim1 expression is high at attractor 1. This indicates a dynamic tendency of Sim1 at attractor 1, i.e., Sim1 expression is flowing towards attractor 1.
244
+ #
245
+ # Velo analyses can also illustrate this, although Sim1 expression is not highest at attractor 1.
246
+
247
+ # In[26]:
248
+
249
+
250
+ import matplotlib.pyplot as plt
251
+ fig, axes = plt.subplots(1, 4, figsize=(12, 3))
252
+ ov.pl.embedding(adata, basis="xy_loc",
253
+ color=["Sim1"],frameon='small',
254
+ title='Sim1:Ms',show=False,
255
+ layer='Ms',cmap='RdBu_r',ax=axes[0]
256
+ )
257
+ ov.pl.embedding(adata, basis="xy_loc",
258
+ color=["Sim1"],frameon='small',
259
+ title='Sim1:Mu',show=False,
260
+ layer='Mu',cmap='RdBu_r',ax=axes[1]
261
+ )
262
+ ov.pl.embedding(adata, basis="xy_loc",
263
+ color=["Sim1"],frameon='small',
264
+ title='Sim1:Velo',show=False,
265
+ layer='velo',cmap='RdBu_r',ax=axes[2]
266
+ )
267
+ ov.pl.embedding(adata, basis="xy_loc",
268
+ color=["Sim1"],frameon='small',
269
+ title='Sim1:exp',show=False,
270
+ #layer='Mu',
271
+ cmap='RdBu_r',ax=axes[3]
272
+ )
273
+ plt.tight_layout()
274
+
ovrawm/t_tcga.txt ADDED
@@ -0,0 +1,96 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # TCGA database preprocessing
5
+ #
6
+ # We often download patient survival data from the TCGA database to verify the importance of genes in cancer. However, preprocessing TCGA downloads is often a headache. Here, we introduce the TCGA module in ov, a way to quickly process the file formats downloaded from the TCGA database. We need to prepare 3 files as input:
7
+ #
8
+ # - gdc_sample_sheet (`.tsv`): obtained from the `Sample Sheet` button of TCGA, which provides a `tsv` file
9
+ # - gdc_download_files (`folder`): obtained from the `Download/Cart` button of TCGA, which provides a `tar.gz` containing all the files you selected
10
+ # - clinical_cart (`folder`): obtained from the `Clinical` button of TCGA, which provides a `tar.gz` containing the clinical data for your files
11
+
12
+ # In[1]:
13
+
14
+
15
+ import omicverse as ov
16
+ import scanpy as sc
17
+ ov.plot_set()
18
+
19
+
20
+ # ## TCGA counts read
21
+ #
22
+ # Here, we use `ov.bulk.pyTCGA` to process the `gdc_sample_sheet`, `gdc_download_files` and `clinical_cart` downloaded before. The raw count, FPKM and TPM matrices will be stored in an anndata object.
23
+
24
+ # In[2]:
25
+
26
+
27
+ # Read the TCGA download (the original notebook cell was timed with `%%time`)
+ gdc_sample_sheet='data/TCGA_OV/gdc_sample_sheet.2024-07-05.tsv'
+ gdc_download_files='data/TCGA_OV/gdc_download_20240705_180129.081531'
+ clinical_cart='data/TCGA_OV/clinical.cart.2024-07-05'
+ aml_tcga=ov.bulk.pyTCGA(gdc_sample_sheet,gdc_download_files,clinical_cart)
+ aml_tcga.adata_init()
28
+
29
+
30
+ # We can save the anndata object for the next use
31
+
32
+ # In[3]:
33
+
34
+
35
+ aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad',compression='gzip')
36
+
37
+
38
+ # Note: Each time we read the anndata file, we need to initialize the TCGA object using three paths so that the subsequent TCGA functions such as survival analysis can be used properly
39
+ #
40
+ # If you wish to create your own TCGA data, we have provided [sample data](https://figshare.com/ndownloader/files/47461946) here for download:
41
+ #
42
+ # TCGA OV: https://figshare.com/ndownloader/files/47461946
43
+
44
+ # In[2]:
45
+
46
+
47
+ gdc_sample_sheet='data/TCGA_OV/gdc_sample_sheet.2024-07-05.tsv'
48
+ gdc_download_files='data/TCGA_OV/gdc_download_20240705_180129.081531'
49
+ clinical_cart='data/TCGA_OV/clinical.cart.2024-07-05'
50
+ aml_tcga=ov.bulk.pyTCGA(gdc_sample_sheet,gdc_download_files,clinical_cart)
51
+ aml_tcga.adata_read('data/TCGA_OV/ov_tcga_raw.h5ad')
52
+
53
+
54
+ # ## Meta init
55
+ # As TCGA provides the gene_id, we need to convert it to gene_name and add basic information about the patients. Therefore we need to initialise the patients' meta information.
56
+
57
+ # In[3]:
58
+
59
+
60
+ aml_tcga.adata_meta_init()
61
+
62
+
63
+ # ## Survival init
64
+ # We set up the path for the clinical data earlier, but the patient information was not actually imported in the previous steps; only the patients' TCGA ids were determined. We therefore still need to initialize the clinical information.
65
+
66
+ # In[4]:
67
+
68
+
69
+ aml_tcga.survial_init()
70
+ aml_tcga.adata
71
+
72
+
73
+ # To visualize the survival for a gene you are interested in, we can use `survival_analysis`.
74
+
75
+ # In[5]:
76
+
77
+
78
+ aml_tcga.survival_analysis('MYC',layer='deseq_normalize',plot=True)
79
+
80
+
81
+ # If you want to calculate the survival for all genes, you can use `survial_analysis_all`. It may take a long time.
82
+
83
+ # In[ ]:
84
+
85
+
86
+ aml_tcga.survial_analysis_all()
87
+ aml_tcga.adata
88
+
89
+
90
+ # Don't forget to save your result.
91
+
92
+ # In[ ]:
93
+
94
+
95
+ aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_survial_all.h5ad',compression='gzip')
96
+
ovrawm/t_tosica.txt ADDED
@@ -0,0 +1,317 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Celltype annotation migration(mapping) with TOSICA
5
+ #
6
+ # When all samples cannot be obtained at the same time, it is desirable to classify the cell types on the first batch of data and use them to annotate data obtained later, or in the future, with the same standard, without the need to process and map them together again.
7
+ #
8
+ # So migrating (mapping) the reference cell annotation is necessary. This tutorial focuses on how to migrate (map) the cell annotation from a reference scRNA-seq atlas to new scRNA-seq data.
9
+ #
10
+ # Paper: [Transformer for one stop interpretable cell type annotation](https://www.nature.com/articles/s41467-023-35923-4)
11
+ #
12
+ # Code: https://github.com/JackieHanLab/TOSICA
13
+ #
14
+ # Colab_Reproducibility:https://colab.research.google.com/drive/1BjPEG-kLAgicP8iQvtVtpzzbIOmk1X23?usp=sharing
15
+ #
16
+ # ![tosica](https://raw.githubusercontent.com/JackieHanLab/TOSICA/main/figure.png)
17
+ #
18
+
19
+ # In[1]:
20
+
21
+
22
+ import omicverse as ov
23
+ import scanpy as sc
24
+ ov.utils.ov_plot_set()
25
+
26
+
27
+ # ## Loading data
28
+ #
29
+ # + `demo_train.h5ad` : Baron(GSE84133) and Muraro(GSE85241)
30
+ #
31
+ # + `demo_test.h5ad` : Xin(GSE81608), Segerstolpe(E-MTAB-5061) and Lawlor(GSE86473)
32
+ #
33
+ # They can be downloaded at https://figshare.com/projects/TOSICA_demo/158489.
34
+
35
+ # In[2]:
36
+
37
+
38
+ ref_adata = sc.read('demo_train.h5ad')
39
+ ref_adata = ref_adata[:,ref_adata.var_names]
40
+ print(ref_adata)
41
+ print(ref_adata.obs.Celltype.value_counts())
42
+
43
+
44
+ # In[4]:
45
+
46
+
47
+ query_adata = sc.read('demo_test.h5ad')
48
+ query_adata = query_adata[:,ref_adata.var_names]
49
+ print(query_adata)
50
+ print(query_adata.obs.Celltype.value_counts())
51
+
52
+
53
+ # We need to select the same genes for training and for predicting the cell type
54
+
55
+ # In[ ]:
56
+
57
+
58
+ ref_adata.var_names_make_unique()
59
+ query_adata.var_names_make_unique()
60
+ ret_gene=list(set(query_adata.var_names) & set(ref_adata.var_names))
61
+ len(ret_gene)
62
+
63
+
64
+ # In[ ]:
65
+
66
+
67
+ query_adata=query_adata[:,ret_gene]
68
+ ref_adata=ref_adata[:,ret_gene]
69
+
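+ # One detail of the gene selection above: `list(set(a) & set(b))` returns the shared genes in an arbitrary order that can change between Python sessions. Where a reproducible gene order matters, an order-preserving intersection is a small alternative. The helper `shared_genes` and the gene lists below are illustrative only and not part of omicverse.

```python
def shared_genes(ref_genes, query_genes):
    """Genes present in both lists, in the order they appear in ref_genes."""
    query_set = set(query_genes)
    seen = set()
    shared = []
    for gene in ref_genes:
        # Keep each shared gene once, following the reference order.
        if gene in query_set and gene not in seen:
            seen.add(gene)
            shared.append(gene)
    return shared


ref_genes = ["INS", "GCG", "SST", "PPY", "KRT19"]
query_genes = ["KRT19", "GCG", "INS", "GHRL"]
common = shared_genes(ref_genes, query_genes)
print(common)
```

+ # The resulting list can be used for subsetting in the same way as `ret_gene`, e.g. `ref_adata[:, common]`.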
70
+
71
+ # In[5]:
72
+
73
+
74
+ print(f"The max of ref_adata is {ref_adata.X.max()}, query_adata is {query_adata.X.max()}")
75
+
76
+
77
+ # By comparing the maximum values of the two datasets, we can see that the data have been normalised with `sc.pp.normalize_total` and log-transformed with `sc.pp.log1p`. The same treatment should be applied when we analyse our own data.
78
+ #
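+ # Why the maximum value is informative: after per-cell normalisation to a fixed total followed by `log1p`, no entry can exceed `log1p(total)`, whereas raw counts are usually much larger. The NumPy sketch below illustrates this bound. The target sum of 1e4 is an assumption chosen for illustration (the demo files may have used a different one), and `normalize_log1p` is a made-up helper, not the scanpy function.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # assumes no all-zero cells
    return np.log1p(counts / totals * target_sum)


# Toy raw count matrix: 2 cells x 4 genes.
raw = np.array([[120, 30, 0, 850],
                [5, 0, 45, 950]])
logged = normalize_log1p(raw)

print("max of raw counts:", raw.max())
print("max after normalisation + log1p:", logged.max())
print("upper bound log1p(target_sum):", np.log1p(1e4))
```

+ # A maximum in the single digits is therefore consistent with log-normalised data, while a maximum in the hundreds or thousands suggests raw counts.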
79
+ # ## Download Genesets
80
+ #
81
+ # Here, we first need to download the genesets used as pathways. You can use `ov.utils.download_tosica_gmt()` to download them automatically or download them manually from:
82
+ #
83
+ # - 'GO_bp':'https://figshare.com/ndownloader/files/41460072',
84
+ # - 'TF':'https://figshare.com/ndownloader/files/41460066',
85
+ # - 'reactome':'https://figshare.com/ndownloader/files/41460051',
86
+ # - 'm_GO_bp':'https://figshare.com/ndownloader/files/41460060',
87
+ # - 'm_TF':'https://figshare.com/ndownloader/files/41460057',
88
+ # - 'm_reactome':'https://figshare.com/ndownloader/files/41460054',
89
+ # - 'immune':'https://figshare.com/ndownloader/files/41460063',
90
+ #
91
+
92
+ # In[ ]:
93
+
94
+
95
+ ov.utils.download_tosica_gmt()
96
+
97
+
98
+ # ## Initialisation the TOSICA model
99
+ #
100
+ # We first need to train the TOSICA model on the REFERENCE dataset, where omicverse provides a simple class `pyTOSICA`, and all subsequent operations can be done with `pyTOSICA`. We need to set the parameters for model initialisation.
101
+ #
102
+ # - `adata`: the reference adata object
103
+ # - `gmt_path`: default pre-prepared mask or path to .gmt files. you can use `ov.utils.download_tosica_gmt()` to obtain the genesets
104
+ # - `depth`: the depth of the transformer model. When it is set to 2, a memory leak may occur
105
+ # - `label_name`: the reference key of celltype in `adata.obs`
106
+ # - `project_path`: the save path of TOSICA model
107
+ # - `batch_size`: indicates the number of cells passed to the programme for training in a single pass
108
+
109
+ # In[6]:
110
+
111
+
112
+ tosica_obj=ov.single.pyTOSICA(adata=ref_adata,
113
+ gmt_path='genesets/GO_bp.gmt', depth=1,
114
+ label_name='Celltype',
115
+ project_path='hGOBP_demo',
116
+ batch_size=8)
117
+
118
+
119
+ # ## Training the TOSICA model
120
+ #
121
+ # There are 4 arguments to set when training the TOSICA model.
122
+ #
123
+ # - pre_weights: The path of the pre-trained weights.
124
+ # - lr: The learning rate.
125
+ # - epochs: The number of epochs.
126
+ # - lrf: The learning rate of the last layer.
127
+
128
+ # In[5]:
129
+
130
+
131
+ tosica_obj.train(epochs=5)
132
+
133
+
134
+ # We can use `.save` to store the `TOSICA` model in `project_path`
135
+
136
+ # In[6]:
137
+
138
+
139
+ tosica_obj.save()
140
+
141
+
142
+ # The model can be loaded from `project_path` automatically.
143
+
144
+ # In[7]:
145
+
146
+
147
+ tosica_obj.load()
148
+
149
+
150
+ # ## Update with query
151
+ #
152
+
153
+ # In[8]:
154
+
155
+
156
+ new_adata=tosica_obj.predicted(pre_adata=query_adata)
157
+
158
+
159
+ # ## Visualize the reference and mapping
160
+ #
161
+ # We first compute the lower-dimensional space of query_adata, using omicverse's preprocessing methods as well as the mde method for dimensionality reduction.
162
+ #
163
+ # To visualize the PCA’s embeddings, we use the `pymde` package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
164
+
165
+ # In[15]:
166
+
167
+
168
+ ov.pp.scale(query_adata)
169
+ ov.pp.pca(query_adata,layer='scaled',n_pcs=50)
170
+ sc.pp.neighbors(query_adata, n_neighbors=15, n_pcs=50,
171
+ use_rep='scaled|original|X_pca')
172
+ query_adata.obsm["X_mde"] = ov.utils.mde(query_adata.obsm["scaled|original|X_pca"])
173
+ query_adata
174
+
175
+
176
+ # Since new_adata and query_adata have the same cells, their low-dimensional spaces are also the same. So we proceed directly to the assignment operation.
177
+
178
+ # In[16]:
179
+
180
+
181
+ new_adata.obsm=query_adata[new_adata.obs.index].obsm.copy()
182
+ new_adata.obsp=query_adata[new_adata.obs.index].obsp.copy()
183
+ new_adata
184
+
185
+
186
+ # Since the predicted cell types are not exactly the same as the original cell types, the default colours do not match either. For a consistent visualisation, we manually set the colours of the predicted cell types to match those of the original cell types.
187
+
188
+ # In[18]:
189
+
190
+
191
+ import numpy as np
192
+ col = np.array([
193
+ "#98DF8A","#E41A1C" ,"#377EB8", "#4DAF4A" ,"#984EA3" ,"#FF7F00" ,"#FFFF33" ,"#A65628" ,"#F781BF" ,"#999999","#1F77B4","#FF7F0E","#279E68","#FF9896"
194
+ ]).astype('<U7')
195
+
196
+ celltype = ("alpha","beta","ductal","acinar","delta","PP","PSC","endothelial","epsilon","mast","macrophage","schwann",'t_cell')
197
+ new_adata.obs['Prediction'] = new_adata.obs['Prediction'].astype('category')
198
+ new_adata.obs['Prediction'] = new_adata.obs['Prediction'].cat.reorder_categories(list(celltype))
199
+ new_adata.uns['Prediction_colors'] = col[1:]
200
+
201
+ celltype = ("MHC class II","alpha","beta","ductal","acinar","delta","PP","PSC","endothelial","epsilon","mast")
202
+ new_adata.obs['Celltype'] = new_adata.obs['Celltype'].astype('category')
203
+ new_adata.obs['Celltype'] = new_adata.obs['Celltype'].cat.reorder_categories(list(celltype))
204
+ new_adata.uns['Celltype_colors'] = col[:11]
205
+
206
+
207
+ # In[24]:
208
+
209
+
210
+ sc.pl.embedding(
211
+ new_adata,
212
+ basis="X_mde",
213
+ color=['Celltype', 'Prediction'],
214
+ frameon=False,
215
+ #ncols=1,
216
+ wspace=0.5,
217
+ #palette=ov.utils.pyomic_palette()[11:],
218
+ show=False,
219
+ )
220
+
221
+
222
+ # ## Pathway attention
223
+ #
224
+ # TOSICA has another special feature, which is the ability to computationally use self-attention mechanisms to find pathways associated with cell types. Here we demonstrate the approach of this downstream analysis.
225
+ #
226
+ # We first need to filter out the predicted cell types with fewer than 5 cells.
227
+
228
+ # In[30]:
229
+
230
+
231
+ cell_idx=new_adata.obs['Prediction'].value_counts()[new_adata.obs['Prediction'].value_counts()<5].index
232
+ new_adata=new_adata[~new_adata.obs['Prediction'].isin(cell_idx)]
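+ # The same filtering logic, sketched on a plain list of labels so that it can be run without an AnnData object. The labels and the name `min_cells` are made up for illustration.

```python
from collections import Counter

# Hypothetical predicted labels, one per cell.
predictions = ["alpha"] * 6 + ["beta"] * 5 + ["mast"] * 2 + ["schwann"]

min_cells = 5
counts = Counter(predictions)
# Labels with fewer than `min_cells` cells are considered too rare.
rare_labels = {label for label, n in counts.items() if n < min_cells}
# Keep only the cells whose label is not rare.
kept = [label for label in predictions if label not in rare_labels]
print(sorted(rare_labels), len(kept))
```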
233
+
234
+
235
+ # We then use `sc.tl.rank_genes_groups` to find the pathways whose attention differs most between cell types. These pathways come from the gmt gene sets used in the previous calculation.
236
+
237
+ # In[31]:
238
+
239
+
240
+ sc.tl.rank_genes_groups(new_adata, 'Prediction', method='wilcoxon')
241
+
242
+
243
+ # In[34]:
244
+
245
+
246
+ sc.pl.rank_genes_groups_dotplot(new_adata,
247
+ n_genes=3,standard_scale='var',)
248
+
249
+
250
+ # To get the pathways specific to one cell type, use `sc.get.rank_genes_groups_df`.
251
+ #
252
+ # For example, to obtain the pathways with the highest attention for the cell type `PP`:
253
+
254
+ # In[35]:
255
+
256
+
257
+ degs = sc.get.rank_genes_groups_df(new_adata, group='PP', key='rank_genes_groups',
258
+ pval_cutoff=0.05)
259
+ degs.head()
260
+
261
+
262
+ # In[36]:
263
+
264
+
265
+ sc.pl.embedding(
266
+ new_adata,
267
+ basis="X_mde",
268
+ color=['Prediction','GOBP_REGULATION_OF_MUSCLE_SYSTEM_PROCESS'],
269
+ frameon=False,
270
+ #ncols=1,
271
+ wspace=0.5,
272
+ #palette=ov.utils.pyomic_palette()[11:],
273
+ show=False,
274
+ )
275
+
276
+
277
+ # If you use omicverse to complete a TOSICA analysis, don't forget to cite the following literature:
278
+ #
279
+ # ```
280
+ # @article{pmid:36641532,
281
+ # journal = {Nature communications},
282
+ # doi = {10.1038/s41467-023-35923-4},
283
+ # issn = {2041-1723},
284
+ # number = {1},
285
+ # pmid = {36641532},
286
+ # pmcid = {PMC9840170},
287
+ # address = {England},
288
+ # title = {Transformer for one stop interpretable cell type annotation},
289
+ # volume = {14},
290
+ # author = {Chen, Jiawei and Xu, Hao and Tao, Wanyu and Chen, Zhaoxiong and Zhao, Yuxuan and Han, Jing-Dong J},
291
+ # note = {[Online; accessed 2023-07-18]},
292
+ # pages = {223},
293
+ # date = {2023-01-14},
294
+ # year = {2023},
295
+ # month = {1},
296
+ # day = {14},
297
+ # }
298
+ #
299
+ #
300
+ # @misc{doi:10.1101/2023.06.06.543913,
301
+ # doi = {10.1101/2023.06.06.543913},
302
+ # publisher = {Cold Spring Harbor Laboratory},
303
+ # title = {OmicVerse: A single pipeline for exploring the entire transcriptome universe},
304
+ # author = {Zeng, Zehua and Ma, Yuqing and Hu, Lei and Xiong, Yuanyan and Du, Hongwu},
305
+ # note = {[Online; accessed 2023-07-18]},
306
+ # date = {2023-06-08},
307
+ # year = {2023},
308
+ # month = {6},
309
+ # day = {8},
310
+ # }
311
+ # ```
312
+
313
+ # In[ ]:
314
+
315
+
316
+
317
+
ovrawm/t_traj.txt ADDED
@@ -0,0 +1,238 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Trajectory Inference with PAGA or Palantir
5
+ #
6
+ # Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process.
7
+ #
8
+ # Palantir is an algorithm to align cells along differentiation trajectories. Palantir models differentiation as a stochastic process where stem cells differentiate to terminally differentiated cells by a series of steps through a low dimensional phenotypic manifold. Palantir effectively captures the continuity in cell states and the stochasticity in cell fate determination.
9
+ #
10
+ # Note that both methods require the user to specify the cells in the initial state. Methods that do not require this manual input, such as pyVIA, are introduced in subsequent analyses.
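+ # The "diffusion process" idea behind diffusion maps can be sketched in a few lines: similarities between cells are turned into the transition probabilities of a random walk. The toy 1-D points, the kernel width `sigma` and the helper `transition_matrix` below are assumptions made for illustration; a real diffusion map additionally takes the leading eigenvectors of this matrix, which is not shown here.

```python
import math

# Hypothetical 1-D "cells": two tight groups far apart.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
sigma = 0.5  # kernel width, an arbitrary choice for this sketch


def transition_matrix(xs, width):
    """Row-normalised Gaussian kernel: entry [i][j] is P(step from i to j)."""
    kernel = [[math.exp(-((a - b) ** 2) / (2 * width ** 2)) for b in xs] for a in xs]
    return [[k / sum(row) for k in row] for row in kernel]


P = transition_matrix(points, sigma)
# A random walker starting in the first group almost never jumps to the second.
prob_stay_in_group = sum(P[0][:3])
print(round(prob_stay_in_group, 6))
```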
11
+ #
12
+ #
13
+ # ## Preprocess data
14
+ #
15
+ # As an example, we apply trajectory inference to dentate gyrus neurogenesis, which comprises multiple heterogeneous subpopulations.
16
+
17
+ # In[1]:
18
+
19
+
20
+ import scanpy as sc
21
+ import scvelo as scv
22
+ import matplotlib.pyplot as plt
23
+ import omicverse as ov
24
+ ov.plot_set()
25
+
26
+
27
+ # In[2]:
28
+
29
+
30
+ import scvelo as scv
31
+ adata=scv.datasets.dentategyrus()
32
+ adata
33
+
34
+
35
+ # In[3]:
36
+
37
+
38
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,)
39
+ adata.raw = adata
40
+ adata = adata[:, adata.var.highly_variable_features]
41
+ ov.pp.scale(adata)
42
+ ov.pp.pca(adata,layer='scaled',n_pcs=50)
43
+
44
+
45
+ # Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells. In our experience, often a rough estimate of the number of PCs does fine.
46
+
47
+ # In[4]:
48
+
49
+
50
+ ov.utils.plot_pca_variance_ratio(adata)
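+ # What the variance-ratio plot shows can be reproduced by hand. The sketch below uses made-up per-PC variances and a hypothetical helper `n_pcs_for`; it is not the code behind `ov.utils.plot_pca_variance_ratio`. It computes each PC's share of the total variance and the smallest number of PCs needed to reach a chosen cumulative share.

```python
from itertools import accumulate

# Hypothetical per-PC variances, sorted from largest to smallest.
variances = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]

total = sum(variances)
ratios = [v / total for v in variances]
cumulative = list(accumulate(ratios))


def n_pcs_for(threshold, cumulative_ratios):
    """Smallest number of leading PCs whose cumulative ratio reaches `threshold`."""
    for n, value in enumerate(cumulative_ratios, start=1):
        if value >= threshold:
            return n
    return len(cumulative_ratios)


print(n_pcs_for(0.85, cumulative))
```

+ # As the text notes, a rough estimate is usually enough, so the threshold here is only a guide.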
51
+
52
+
53
+ # ## Trajectory inference with diffusion map
54
+ #
55
+ # Here, we used `ov.single.TrajInfer` to construct a Trajectory Inference object.
56
+
57
+ # In[5]:
58
+
59
+
60
+ Traj=ov.single.TrajInfer(adata,basis='X_umap',groupby='clusters',
61
+ use_rep='scaled|original|X_pca',n_comps=50,)
62
+ Traj.set_origin_cells('nIPC')
63
+
64
+
65
+ # In[6]:
66
+
67
+
68
+ Traj.inference(method='diffusion_map')
69
+
70
+
71
+ # In[7]:
72
+
73
+
74
+ ov.utils.embedding(adata,basis='X_umap',
75
+ color=['clusters','dpt_pseudotime'],
76
+ frameon='small',cmap='Reds')
77
+
78
+
79
+ # PAGA graph abstraction has been benchmarked as a top-performing method for trajectory inference. It provides a graph-like map of the data topology with weighted edges corresponding to the connectivity between two clusters.
80
+ #
81
+ # Here, PAGA is extended by neighbor directionality.
82
+
83
+ # In[8]:
84
+
85
+
86
+ ov.utils.cal_paga(adata,use_time_prior='dpt_pseudotime',vkey='paga',
87
+ groups='clusters')
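+ # The idea of adding directionality from a time prior can be illustrated as follows. This is only a conceptual sketch with made-up cells and edges; it is not how `ov.utils.cal_paga` is implemented. Each connection between two clusters is oriented from the cluster with the lower mean pseudotime to the one with the higher mean pseudotime.

```python
from statistics import mean

# Hypothetical per-cell cluster labels and pseudotime values.
clusters = ["nIPC", "nIPC", "Neuroblast", "Neuroblast",
            "Granule immature", "Granule immature"]
pseudotime = [0.05, 0.10, 0.35, 0.45, 0.70, 0.80]
# Hypothetical undirected connections between clusters.
undirected_edges = [("Neuroblast", "nIPC"), ("Granule immature", "Neuroblast")]

# Mean pseudotime per cluster.
cluster_time = {
    c: mean(t for label, t in zip(clusters, pseudotime) if label == c)
    for c in set(clusters)
}

# Orient each edge from the earlier cluster to the later one.
directed_edges = [
    (a, b) if cluster_time[a] <= cluster_time[b] else (b, a)
    for a, b in undirected_edges
]
print(directed_edges)
```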
88
+
89
+
90
+ # In[9]:
91
+
92
+
93
+ ov.utils.plot_paga(adata,basis='umap', size=50, alpha=.1,title='PAGA LTNN-graph',
94
+ min_edge_width=2, node_size_scale=1.5,show=False,legend_loc=False)
95
+
96
+
97
+ # ## Trajectory inference with Slingshot
98
+ #
99
+ # Slingshot provides functions for inferring continuous, branching lineage structures in low-dimensional data. It was designed to model developmental trajectories in single-cell RNA sequencing data and to serve as a component in an analysis pipeline after dimensionality reduction and clustering. It is flexible enough to handle arbitrarily many branching events and allows for the incorporation of prior knowledge through supervised graph construction.
100
+
101
+ # In[10]:
102
+
103
+
104
+ Traj=ov.single.TrajInfer(adata,basis='X_umap',groupby='clusters',
105
+ use_rep='scaled|original|X_pca',n_comps=50)
106
+ Traj.set_origin_cells('nIPC')
107
+ #Traj.set_terminal_cells(["Granule mature","OL","Astrocytes"])
108
+
109
+
110
+ # If you only need the pseudotime and not a visualisation of the lineages, you can leave the `debug_axes` parameter unset.
111
+
112
+ # In[ ]:
113
+
114
+
115
+ Traj.inference(method='slingshot',num_epochs=1)
116
+
117
+
118
+ # Otherwise, you can set `debug_axes` to visualize the lineage.
119
+
120
+ # In[13]:
121
+
122
+
123
+ fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))
124
+ Traj.inference(method='slingshot',num_epochs=1,debug_axes=axes)
125
+
126
+
127
+ # In[14]:
128
+
129
+
130
+ ov.utils.embedding(adata,basis='X_umap',
131
+ color=['clusters','slingshot_pseudotime'],
132
+ frameon='small',cmap='Reds')
133
+
134
+
135
+ # In[15]:
136
+
137
+
138
+ sc.pp.neighbors(adata,use_rep='scaled|original|X_pca')
139
+ ov.utils.cal_paga(adata,use_time_prior='slingshot_pseudotime',vkey='paga',
140
+ groups='clusters')
141
+
142
+
143
+ # In[16]:
144
+
145
+
146
+ ov.utils.plot_paga(adata,basis='umap', size=50, alpha=.1,title='PAGA Slingshot-graph',
147
+ min_edge_width=2, node_size_scale=1.5,show=False,legend_loc=False)
148
+
149
+
150
+ # ## Trajectory inference with Palantir
151
+ #
152
+ # Palantir can be run by specifying an approximate early cell.
153
+ #
154
+ # Palantir can also determine the terminal states automatically. In this dataset, we know the terminal states and set them with `Traj.set_terminal_cells`.
155
+ #
156
+ # Here, we used `ov.single.TrajInfer` to construct a Trajectory Inference object.
157
+
158
+ # In[17]:
159
+
160
+
161
+ Traj=ov.single.TrajInfer(adata,basis='X_umap',groupby='clusters',
162
+ use_rep='scaled|original|X_pca',n_comps=50)
163
+ Traj.set_origin_cells('nIPC')
164
+ Traj.set_terminal_cells(["Granule mature","OL","Astrocytes"])
165
+
166
+
167
+ # In[18]:
168
+
169
+
170
+ Traj.inference(method='palantir',num_waypoints=500)
171
+
172
+
173
+ # Palantir results can be visualized on the tSNE or UMAP embedding; here we use `Traj.palantir_plot_pseudotime`.
174
+
175
+ # In[19]:
176
+
177
+
178
+ Traj.palantir_plot_pseudotime(embedding_basis='X_umap',cmap='RdBu_r',s=3)
179
+
180
+
181
+ # Once the cells are selected, it's often helpful to visualize the selection on the pseudotime trajectory to ensure we've isolated the correct cells for our specific trend. We can do this using the plot_branch_selection function:
182
+
183
+ # In[20]:
184
+
185
+
186
+ Traj.palantir_cal_branch(eps=0)
187
+
188
+
189
+ # In[22]:
190
+
191
+
192
+ ov.externel.palantir.plot.plot_trajectory(adata, "Granule mature",
193
+ cell_color="palantir_entropy",
194
+ n_arrows=10,
195
+ color="red",
196
+ scanpy_kwargs=dict(cmap="RdBu_r"),
197
+ )
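+ # The `palantir_entropy` used for colouring above reflects how undecided a cell is between the terminal states: Palantir describes differentiation potential as the entropy of the fate probabilities. A minimal sketch of that quantity, with made-up probabilities and a hypothetical helper `fate_entropy` (the log base and other details of Palantir's own implementation may differ):

```python
import math


def fate_entropy(probabilities):
    """Shannon entropy (natural log) of one cell's fate probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Hypothetical fate probabilities towards three terminal states.
early_cell = [1 / 3, 1 / 3, 1 / 3]   # uncommitted
late_cell = [0.98, 0.01, 0.01]       # almost committed to the first fate

print(fate_entropy(early_cell) > fate_entropy(late_cell))
```

+ # Entropy is highest for an uncommitted cell and drops towards zero as a cell commits to one fate.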
198
+
199
+
200
+ # Palantir uses Mellon Function Estimator to determine the gene expression trends along different lineages. The marker trends can be determined using the following snippet. This computes the trends for all lineages. A subset of lineages can be used using the lineages parameter.
201
+
202
+ # In[23]:
203
+
204
+
205
+ gene_trends = Traj.palantir_cal_gene_trends(
206
+ layers="MAGIC_imputed_data",
207
+ )
208
+
209
+
210
+ # In[24]:
211
+
212
+
213
+ genes = ['Cdca3','Rasl10a','Mog','Aqp4']
214
+ Traj.palantir_plot_gene_trends(genes)
215
+ plt.show()
216
+
217
+
218
+ # We can also use PAGA to visualize the cell stages.
219
+
220
+ # In[25]:
221
+
222
+
223
+ ov.utils.cal_paga(adata,use_time_prior='palantir_pseudotime',vkey='paga',
224
+ groups='clusters')
225
+
226
+
227
+ # In[26]:
228
+
229
+
230
+ ov.utils.plot_paga(adata,basis='umap', size=50, alpha=.1,title='PAGA LTNN-graph',
231
+ min_edge_width=2, node_size_scale=1.5,show=False,legend_loc=False)
232
+
233
+
234
+ # In[ ]:
235
+
236
+
237
+
238
+
ovrawm/t_via.txt ADDED
@@ -0,0 +1,193 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Trajectory Inference with VIA
5
+ #
6
+ # VIA is a single-cell Trajectory Inference method that offers topology construction, pseudotimes, automated terminal state prediction and automated plotting of temporal gene dynamics along lineages. Here, we have adapted the original authors' colouring logic and interface so that users can run the analysis directly on an AnnData object.
7
+ #
8
+ # We have completed this tutorial using the analysis provided by the original VIA authors.
9
+ #
10
+ # Paper: [Generalized and scalable trajectory inference in single-cell omics data with VIA](https://www.nature.com/articles/s41467-021-25773-3)
11
+ #
12
+ # Code: https://github.com/ShobiStassen/VIA
13
+ #
14
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1A2X23z_RLJaYLbXaiCbZa-fjNbuGACrD?usp=sharing
15
+
16
+ # In[1]:
17
+
18
+
19
+ import omicverse as ov
20
+ import scanpy as sc
21
+ import matplotlib.pyplot as plt
22
+ ov.utils.ov_plot_set()
23
+
24
+
25
+ # ## Data loading and preprocessing
26
+ #
27
+ # We use the dataset scRNA_hematopoiesis provided by the authors for this analysis. Note that the data have been normalized and log-transformed but not scaled.
28
+
29
+ # In[2]:
30
+
31
+
32
+ adata = ov.single.scRNA_hematopoiesis()
33
+ sc.tl.pca(adata, svd_solver='arpack', n_comps=200)
34
+ adata
35
+
36
+
37
+ # ## Model construction and run
38
+ #
39
+ # We need to specify the cell feature vector `adata_key` used for VIA inference, which can be X_pca, X_scVI or X_glue, depending on the purpose of your analysis; here we use X_pca directly. We also need to specify how many components to use (`adata_ncomps`); this number should not be larger than the number of dimensions of the feature vector.
40
+ #
41
+ # We also need to specify the `clusters` used for colouring and for the VIA calculation. If `root_user` is None, the root cell is determined automatically.
42
+ #
43
+ # We need to set the `basis` argument to a key stored in `adata.obsm`. In this example we use `tsne`, because it is stored in `obsm: 'tsne', 'MAGIC_imputed_data', 'palantir_branch_probs', 'X_pca'`.
44
+ #
45
+ # We also need to set the `clusters` argument to a key stored in `adata.obs`, i.e. the cell type key.
46
+ #
47
+ # Further explanation of the arguments and attributes can be found at https://pyvia.readthedocs.io/en/latest/Parameters%20and%20Attributes.html
48
+
49
+ # In[3]:
50
+
51
+
52
+ v0 = ov.single.pyVIA(adata=adata,adata_key='X_pca',adata_ncomps=80, basis='tsne',
53
+ clusters='label',knn=30,random_seed=4,root_user=[4823],)
54
+
55
+ v0.run()
56
+
57
+
58
+ # ## Visualization and analysis
59
+ #
60
+ # Before the subsequent analysis, we need to specify the colour of each cluster. Here we use `sc.pl.embedding` to colour each cluster automatically; to use your own colours, set the `palette` parameter.
61
+
62
+ # In[4]:
63
+
64
+
65
+ fig, ax = plt.subplots(1,1,figsize=(4,4))
66
+ sc.pl.embedding(
67
+ adata,
68
+ basis="tsne",
69
+ color=['label'],
70
+ frameon=False,
71
+ ncols=1,
72
+ wspace=0.5,
73
+ show=False,
74
+ ax=ax
75
+ )
76
+ fig.savefig('figures/via_fig1.png',dpi=300,bbox_inches = 'tight')
77
+
78
+
79
+ # ## VIA graph
80
+ #
81
+ # VIA offers various plotting functions to visualize the results of the trajectory inference. We first show the cluster-graph level trajectory abstraction, consisting of two subplots coloured by annotated (true_label) composition and by pseudotime.
82
+
83
+ # In[5]:
84
+
85
+
86
+ fig, ax, ax1 = v0.plot_piechart_graph(clusters='label',cmap='Reds',dpi=80,
87
+ show_legend=False,ax_text=False,fontsize=4)
88
+ fig.savefig('figures/via_fig2.png',dpi=300,bbox_inches = 'tight')
89
+
90
+
91
+ # In[ ]:
92
+
93
+
94
+ #you can use `v0.model.single_cell_pt_markov` to extract the pseudotime
95
+ v0.get_pseudotime(v0.adata)
96
+ v0.adata
97
+
98
+
99
+ # ## Visualise gene/feature graph
100
+ #
101
+ # View the gene expression along the VIA graph. We use the HNSW small world graph computed in VIA to accelerate the gene imputation calculations (using an approach similar to MAGIC) as follows:
102
+ #
103
+
104
+ # In[6]:
105
+
106
+
107
+ gene_list_magic = ['IL3RA', 'IRF8', 'GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B', 'SPI1', 'CD34', 'CSF1R', 'ITGAX']
108
+ fig,axs=v0.plot_clustergraph(gene_list=gene_list_magic[:4],figsize=(12,3),)
109
+ fig.savefig('figures/via_fig2_1.png',dpi=300,bbox_inches = 'tight')
110
+
111
+
112
+ # ## Trajectory projection
113
+ #
114
+ # Visualize the overall VIA trajectory projected onto a 2D embedding (UMAP, Phate, TSNE etc) in different ways.
115
+ #
116
+ # - Draw the high-level clustergraph abstraction onto the embedding;
117
+ # - Draws a vector field plot of the more fine-grained directionality of cells along the trajectory projected onto an embedding.
118
+ # - Draw high-edge resolution directed graph
119
+
120
+ # In[7]:
121
+
122
+
123
+ fig,ax1,ax2=v0.plot_trajectory_gams(basis='tsne',clusters='label',draw_all_curves=False)
124
+ fig.savefig('figures/via_fig3.png',dpi=300,bbox_inches = 'tight')
125
+
126
+
127
+ # In[8]:
128
+
129
+
130
+ fig,ax=v0.plot_stream(basis='tsne',clusters='label',
131
+ density_grid=0.8, scatter_size=30, scatter_alpha=0.3, linewidth=0.5)
132
+ fig.savefig('figures/via_fig4.png',dpi=300,bbox_inches = 'tight')
133
+
134
+
135
+ # In[9]:
136
+
137
+
138
+ fig,ax=v0.plot_stream(basis='tsne',density_grid=0.8, scatter_size=30, color_scheme='time', linewidth=0.5,
139
+ min_mass = 1, cutoff_perc = 5, scatter_alpha=0.3, marker_edgewidth=0.1,
140
+ density_stream = 2, smooth_transition=1, smooth_grid=0.5)
141
+ fig.savefig('figures/via_fig5.png',dpi=300,bbox_inches = 'tight')
142
+
143
+
144
+ # ## Probabilistic pathways
145
+ #
146
+ # Visualize the probabilistic pathways from root to terminal state as indicated by the lineage likelihood. The higher the lineage likelihood, the greater the potential of that particular cell to differentiate towards the terminal state of interest.
147
+
148
+ # In[10]:
149
+
150
+
151
+ fig,axs=v0.plot_lineage_probability(figsize=(8,4),)
152
+ fig.savefig('figures/via_fig6.png',dpi=300,bbox_inches = 'tight')
153
+
154
+
155
+ # We can also select specific lineages for visualisation.
156
+
157
+ # In[11]:
158
+
159
+
160
+ fig,axs=v0.plot_lineage_probability(figsize=(6,3),marker_lineages = [2,3])
161
+ fig.savefig('figures/via_fig7.png',dpi=300,bbox_inches = 'tight')
162
+
163
+
164
+ # ## Gene Dynamics
165
+ #
166
+ # The gene dynamics along pseudotime for all detected lineages are automatically inferred by VIA. These can be interpreted as the change in gene expression along any given lineage.
167
+
168
+ # In[12]:
169
+
170
+
171
+ fig,axs=v0.plot_gene_trend(gene_list=gene_list_magic,figsize=(8,6),)
172
+ fig.savefig('figures/via_fig8.png',dpi=300,bbox_inches = 'tight')
173
+
174
+
175
+ # In[14]:
176
+
177
+
178
+ fig,ax=v0.plot_gene_trend_heatmap(gene_list=gene_list_magic,figsize=(4,4),
179
+ marker_lineages=[2])
180
+ fig.savefig('figures/via_fig9.png',dpi=300,bbox_inches = 'tight')
181
+
182
+
183
+ # In[ ]:
184
+
185
+
186
+
187
+
188
+
189
+ # In[ ]:
190
+
191
+
192
+
193
+
ovrawm/t_via_velo.txt ADDED
@@ -0,0 +1,102 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Trajectory Inference with VIA and scVelo
5
+ #
6
+ # When scRNA-velocity is available, it can be used to guide the trajectory inference and automate initial state prediction. However, because RNA velocity can be misguided by boosts in expression, variable transcription rates and data capture scope limited to steady-state populations only (Bergen 2021), users might find it useful to adjust the level of influence the RNA-velocity data should exercise on the inferred TI.
7
+ #
8
+ # Paper: [Generalized and scalable trajectory inference in single-cell omics data with VIA](https://www.nature.com/articles/s41467-021-25773-3)
9
+ #
10
+ # Code: https://github.com/ShobiStassen/VIA
11
+ #
12
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1MtGr3e9uUb_BWOzKlcbOTiCYsZpljEyF?usp=sharing
13
+
14
+ # In[7]:
15
+
16
+
17
+ import omicverse as ov
18
+ import scanpy as sc
19
+ import scvelo as scv
20
+ import cellrank as cr
21
+ ov.utils.ov_plot_set()
22
+
23
+
24
+ # ## Data loading and preprocessing
25
+ #
26
+ # We use a familiar endocrine-genesis dataset (Bastidas-Ponce et al., 2019) to demonstrate initial state prediction at the EP Ngn3 low cells and automatic capture of the 4 differentiated islets (alpha, beta, delta and epsilon). As mentioned, it is useful to control the level of influence of RNA-velocity relative to gene-gene distance, and this is done using the `velo_weight` parameter.
27
+
28
+ # In[2]:
29
+
30
+
31
+ adata = cr.datasets.pancreas()
32
+ adata
33
+
34
+
35
+ # In[3]:
36
+
37
+
38
+ n_pcs = 30
39
+ scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=5000)
40
+ sc.tl.pca(adata, n_comps = n_pcs)
41
+ scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
42
+ scv.tl.velocity(adata, mode='stochastic') # good results achieved with mode = 'stochastic' too
43
+
44
+
45
+ # ## Initialize and run VIA
46
+ #
47
+ #
48
+
49
+ # In[4]:
50
+
51
+
52
+ v0 = ov.single.pyVIA(adata=adata,adata_key='X_pca',adata_ncomps=n_pcs, basis='X_umap',
53
+ clusters='clusters',knn=20, root_user=None,
54
+ dataset='', random_seed=42,is_coarse=True, preserve_disconnected=True, pseudotime_threshold_TS=50,
55
+ piegraph_arrow_head_width=0.15,piegraph_edgeweight_scalingfactor=2.5,
56
+ velocity_matrix=adata.layers['velocity'],gene_matrix=adata.X.todense(),velo_weight=0.5,
57
+ edgebundle_pruning_twice=False, edgebundle_pruning=0.15, pca_loadings = adata.varm['PCs']
58
+ )
59
+
60
+ v0.run()
61
+
62
+
63
+ # In[ ]:
64
+
65
+
66
+
67
+
68
+
69
+ # In[5]:
70
+
71
+
72
+ fig, ax, ax1 = v0.plot_piechart_graph(clusters='clusters',cmap='Reds',dpi=80,
73
+ show_legend=False,ax_text=False,fontsize=4)
74
+ fig.set_size_inches(8,4)
75
+
76
+
77
+ # ## Visualize trajectory and cell progression
78
+ #
79
+ # Fine grained vector fields
80
+
81
+ # In[8]:
82
+
83
+
84
+ v0.plot_trajectory_gams(basis='X_umap',clusters='clusters',draw_all_curves=False)
85
+
86
+
87
+ # In[9]:
88
+
89
+
90
+ v0.plot_stream(basis='X_umap',clusters='clusters',
91
+ density_grid=0.8, scatter_size=30, scatter_alpha=0.3, linewidth=0.5)
92
+
93
+
94
+ # ## Draw lineage likelihoods
95
+ #
96
+ # These indicate potential pathways corresponding to the 4 islets (two types of Beta islets: Lineages 5 and 12).
97
+
98
+ # In[10]:
99
+
100
+
101
+ v0.plot_lineage_probability()
102
+
ovrawm/t_visualize_bulk.txt ADDED
@@ -0,0 +1,170 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Visualization of Bulk RNA-seq
5
+ #
6
+ # In this tutorial, we introduce the special plotting functions of `omicverse`.
7
+
8
+ # In[1]:
9
+
10
+
11
+ import omicverse as ov
12
+ import scanpy as sc
13
+ import matplotlib.pyplot as plt
14
+ ov.plot_set()
15
+
16
+
17
+ # ## Venn plot
18
+ #
19
+ # In transcriptome analyses, we often have to study differential genes that are common to different groups. Here, we provide `ov.pl.venn` to draw venn plots to visualise differential genes.
20
+ #
21
+ # **Function**: `ov.pl.venn`:
22
+ # - sets: the groups to draw in the venn plot, as a dictionary of sets with no more than 4 keys
23
+ # - palette: the colours to draw with, e.g. `palette=['#FFFFFF','#000000']`; `ov.pl.red_color`, `ov.pl.blue_color`, `ov.pl.green_color` and `ov.pl.orange_color` are prepared for this purpose.
24
+ # - fontsize: the fontsize and linewidth to visualize, fontsize will be multiplied by 2
25
+
26
+ # In[20]:
27
+
28
+
29
+ fig,ax=plt.subplots(figsize = (4,4))
30
+ #dict of sets
31
+ sets = {
32
+ 'Set1:name': {1,2,3},
33
+ 'Set2': {1,2,3,4},
34
+ 'Set3': {3,4},
35
+ 'Set4': {5,6}
36
+ }
37
+ #plot venn
38
+ ov.pl.venn(sets=sets,palette=ov.pl.sc_color,
39
+ fontsize=5.5,ax=ax,
40
+ )
41
+
42
+
43
+ #If we need to annotate genes, we can use plt.annotate for this purpose,
44
+ #we need to modify the text content, xy and xytext parameters.
45
+ plt.annotate('gene1,gene2', xy=(50,30), xytext=(0,-100),
46
+ ha='center', textcoords='offset points',
47
+ bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
48
+ arrowprops=dict(arrowstyle='->', color='gray'),size=12)
49
+
50
+ #Set the title
51
+ plt.title('Venn4',fontsize=13)
52
+
53
+ #save figure
54
+ fig.savefig("figures/bulk_venn4.png",dpi=300,bbox_inches = 'tight')
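+ # The regions of the venn plot correspond to ordinary set operations, so the numbers in the figure can be checked directly. The sketch below reuses the `sets` dictionary from the cell above; the names `pairwise` and `common_123` are ours.

```python
from itertools import combinations

sets = {
    'Set1:name': {1, 2, 3},
    'Set2': {1, 2, 3, 4},
    'Set3': {3, 4},
    'Set4': {5, 6},
}

# Elements shared by every pair of groups.
pairwise = {
    (a, b): sets[a] & sets[b]
    for a, b in combinations(sets, 2)
}
# Elements shared by the first three groups.
common_123 = sets['Set1:name'] & sets['Set2'] & sets['Set3']

for pair, shared in pairwise.items():
    print(pair, sorted(shared))
print(sorted(common_123))
```

+ # With real data, the sets would hold gene names, and an intersection such as `common_123` gives the genes to annotate with `plt.annotate`.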
55
+
56
+
57
+ # In[9]:
58
+
59
+
60
+ fig,ax=plt.subplots(figsize = (4,4))
61
+ #dict of sets
62
+ sets = {
63
+ 'Set1:name': {1,2,3},
64
+ 'Set2': {1,2,3,4},
65
+ 'Set3': {3,4},
66
+ }
67
+
68
+ ov.pl.venn(sets=sets,ax=ax,fontsize=5.5,
69
+ palette=ov.pl.red_color)
70
+
71
+ plt.title('Venn3',fontsize=13)
72
+
73
+
74
+ # ## Volcano plot
75
+ #
76
+ # Differentially expressed genes are usually visualised with volcano plots. Here, we present a way to draw volcano plots in Python with `ov.pl.volcano`.
77
+ #
78
+ # **Function**: `ov.pl.volcano`:
79
+ #
80
+ # main argument
81
+ # - result: the DEGs result
82
+ # - pval_name: the names of the columns whose vertical coordinates need to be plotted, stored in result.columns. In Bulk RNA-seq experiments, we usually set this to qvalue.
83
+ # - fc_name: The names of the columns for which you need to plot the horizontal coordinates, stored in result.columns. In Bulk RNA-seq experiments, we typically set this to log2FC.
84
+ # - fc_max: the upper threshold for the fold change (e.g. 1.5)
85
+ # - fc_min: the lower threshold for the fold change (e.g. -1.5)
86
+ # - pval_threshold: We need to set the threshold for the qvalue
87
+ # - pval_max: We also need to set boundary values so that the data is not too large to affect the visualisation
88
+ # - FC_max: We also need to set boundary values so that the data is not too large to affect the visualisation
89
+ #
90
+ # plot argument
91
+ # - figsize: The size of the generated figure, by default (4,4).
92
+ # - title: The title of the plot, by default ''.
93
+ # - titlefont: A dictionary of font properties for the plot title, by default {'weight':'normal','size':14,}.
94
+ # - up_color: The color of the up-regulated genes in the plot, by default '#e25d5d'.
95
+ # - down_color: The color of the down-regulated genes in the plot, by default '#7388c1'.
96
+ # - normal_color: The color of the non-significant genes in the plot, by default '#d7d7d7'.
97
+ # - legend_bbox: A tuple containing the coordinates of the legend's bounding box, by default (0.8, -0.2).
98
+ # - legend_ncol: The number of columns in the legend, by default 2.
99
+ # - legend_fontsize: The font size of the legend, by default 12.
100
+ # - plot_genes: A list of genes to be plotted on the volcano plot, by default None.
101
+ # - plot_genes_num: The number of genes to be plotted on the volcano plot, by default 10.
102
+ # - plot_genes_fontsize: The font size of the genes to be plotted on the volcano plot, by default 10.
103
+ # - ticks_fontsize: The font size of the ticks, by default 12.
104
+
105
+ # In[3]:
106
+
107
+
108
+ result=ov.read('data/dds_result.csv',index_col=0)
109
+ result.head()
110
+
111
+
112
+ # In[4]:
113
+
114
+
115
+ ov.pl.volcano(result,pval_name='qvalue',fc_name='log2FoldChange',
116
+ pval_threshold=0.05,fc_max=1.5,fc_min=-1.5,
117
+ pval_max=10,FC_max=10,
118
+ figsize=(4,4),title='DEGs in Bulk',titlefont={'weight':'normal','size':14,},
119
+ up_color='#e25d5d',down_color='#7388c1',normal_color='#d7d7d7',
120
+ up_fontcolor='#e25d5d',down_fontcolor='#7388c1',normal_fontcolor='#d7d7d7',
121
+ legend_bbox=(0.8, -0.2),legend_ncol=2,legend_fontsize=12,
122
+ plot_genes=None,plot_genes_num=10,plot_genes_fontsize=11,
123
+ ticks_fontsize=12,)
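+ # How the thresholds split genes into up-regulated, down-regulated and non-significant groups can be sketched as follows. The genes and the helper `classify` are made up, and the strict inequalities are an assumption; `ov.pl.volcano` may treat boundary values differently.

```python
import math


def classify(log2fc, qvalue, fc_max=1.5, fc_min=-1.5, pval_threshold=0.05):
    """Label a gene 'up', 'down' or 'normal' from its fold change and q-value."""
    if qvalue < pval_threshold and log2fc > fc_max:
        return "up"
    if qvalue < pval_threshold and log2fc < fc_min:
        return "down"
    return "normal"


# Hypothetical DEG results: gene -> (log2FoldChange, qvalue).
genes = {
    "geneA": (2.3, 1e-6),
    "geneB": (-2.0, 1e-4),
    "geneC": (3.0, 0.20),   # large change but not significant
    "geneD": (0.4, 1e-8),   # significant but small change
}

labels = {g: classify(fc, q) for g, (fc, q) in genes.items()}
# The vertical axis of a volcano plot is usually -log10 of the q-value.
heights = {g: -math.log10(q) for g, (_, q) in genes.items()}
print(labels)
```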
124
+
125
+
126
+ # ## Box plot
127
+ #
128
+ # For differentially expressed genes, we sometimes need to compare expression between groups; box plots are well suited to this comparison.
129
+ #
130
+ # **Function**: `ov.pl.boxplot`:
131
+ #
132
+ # - data: the data to visualize; an example can be found in `seaborn.load_dataset("tips")`
133
+ # - x_value, y_value, hue: Inputs for plotting long-form data. See examples for interpretation.
134
+ # - figsize: The size of the generated figure, by default (4,4).
135
+ # - fontsize: The font size of the tick and labels, by default 12.
136
+ # - title: The title of the plot, by default ''.
137
+ #
138
+ #
139
+ # **Function**: `ov.pl.add_palue`:
140
+ # - ax: the axes of bardotplot
141
+ # - line_x1: The left side of the p-value line to be plotted
142
+ # - line_x2: The right side of the p-value line to be plotted
143
+ # - line_y: The height of the p-value line to be plotted
144
+ # - text_y: How much above the p-value line is plotted text
145
+ # - text: the text of the p-value; you can use `***` instead of `p<0.001`
146
+ # - fontsize: the fontsize of text
147
+ # - fontcolor: the color of text
148
+ # - horizontalalignment: the location of text
149
+
150
+ # In[3]:
151
+
152
+
153
+ import seaborn as sns
154
+ data = sns.load_dataset("tips")
155
+ data.head()
156
+
157
+
158
+ # In[7]:
159
+
160
+
161
+ fig,ax=ov.pl.boxplot(data,hue='sex',x_value='day',y_value='total_bill',
162
+ palette=ov.pl.red_color,
163
+ figsize=(4,2),fontsize=12,title='Tips',)
164
+
165
+ ov.pl.add_palue(ax,line_x1=-0.5,line_x2=0.5,line_y=40,
166
+ text_y=0.2,
167
+ text='$p={}$'.format(round(0.001,3)),
168
+ fontsize=11,fontcolor='#000000',
169
+ horizontalalignment='center',)
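+ # If you prefer significance stars to a numeric label, the `text` argument can be built with a small helper. The function `p_to_stars` below is hypothetical (not part of omicverse) and uses the conventional 0.05 / 0.01 / 0.001 cut-offs.

```python
def p_to_stars(p):
    """Map a p-value to the conventional significance stars."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


for p in (0.0004, 0.003, 0.03, 0.2):
    print(p, p_to_stars(p))
```

+ # The result can then be passed as `text=p_to_stars(p)` in the call above.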
170
+
ovrawm/t_visualize_colorsystem.txt ADDED
@@ -0,0 +1,223 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Color system
5
+ #
6
+ # In OmicVerse, we offer a color system based on Eastern aesthetics, featuring 384 representative colors derived from the Forbidden City. We will utilize these colors in combination for future visualizations.
7
+ #
8
+ # All colors come from the book: "中国传统色:故宫里的色彩美学" (ISBN: 9787521716054)
9
+
10
+ # In[ ]:
11
+
12
+
13
+ import omicverse as ov
14
+ import scanpy as sc
15
+ #import scvelo as scv
16
+ ov.plot_set()
17
+
18
+
19
+ # We utilized single-cell RNA-seq data (GEO accession: GSE95753) obtained from the dentate gyrus of the hippocampus in mouse.
20
+
21
+ # In[2]:
22
+
23
+
24
+ adata = ov.read('data/DentateGyrus/10X43_1.h5ad')
25
+ adata
26
+
27
+
28
+ # ## Understanding the Color System
29
+ #
30
+ # The 384 colors are accessed through an `ov.pl.ForbiddenCity` object; its `visual_color` method renders them as HTML for display in a notebook.
31
+
32
+ # In[3]:
33
+
34
+
35
+ fb=ov.pl.ForbiddenCity()
36
+
37
+
38
+ # In[7]:
39
+
40
+
41
+ from IPython.display import HTML
42
+ HTML(fb.visual_color(loc_range=(0,384),
43
+ num_per_row=24))
44
+
45
+
46
+ # We can get a color by its name using `get_color`.
47
+
48
+ # In[14]:
49
+
50
+
51
+ fb.get_color(name='凝夜紫')
52
+
53
+
54
+ # ## Default Colormap
55
+ #
56
+ # We have provided a range of default colors including `green`, `red`, `pink`, `purple`, `yellow`, `brown`, `blue`, and `grey`. Each of these colors comes with its own set of sub-colormaps, providing a more granular level of color differentiation.
57
+ #
58
+ # Here's a breakdown of the sub-colormaps available for each default color:
59
+ #
60
+ # - **Green**:
61
+ # - `green1`: `Forbidden_Cmap(range(1, 19))`
62
+ # - `green2`: `Forbidden_Cmap(range(19, 41))`
63
+ # - `green3`: `Forbidden_Cmap(range(41, 62))`
64
+ #
65
+ # - **Red**:
66
+ # - `red1`: `Forbidden_Cmap(range(62, 77))`
67
+ # - `red2`: `Forbidden_Cmap(range(77, 104))`
68
+ #
69
+ # - **Pink**:
70
+ # - `pink1`: `Forbidden_Cmap(range(104, 134))`
71
+ # - `pink2`: `Forbidden_Cmap(range(134, 148))`
72
+ #
73
+ # - **Purple**:
74
+ # - `purple1`: `Forbidden_Cmap(range(148, 162))`
75
+ # - `purple2`: `Forbidden_Cmap(range(162, 176))`
76
+ #
77
+ # - **Yellow**:
78
+ # - `yellow1`: `Forbidden_Cmap(range(176, 196))`
79
+ # - `yellow2`: `Forbidden_Cmap(range(196, 207))`
80
+ # - `yellow3`: `Forbidden_Cmap(range(255, 276))`
81
+ #
82
+ # - **Brown**:
83
+ # - `brown1`: `Forbidden_Cmap(range(207, 228))`
84
+ # - `brown2`: `Forbidden_Cmap(range(228, 246))`
85
+ # - `brown3`: `Forbidden_Cmap(range(246, 255))`
86
+ # - `brown4`: `Forbidden_Cmap(range(276, 293))`
87
+ #
88
+ # - **Blue**:
89
+ # - `blue1`: `Forbidden_Cmap(range(293, 312))`
90
+ # - `blue2`: `Forbidden_Cmap(range(312, 321))`
91
+ # - `blue3`: `Forbidden_Cmap(range(321, 333))`
92
+ # - `blue4`: `Forbidden_Cmap(range(333, 339))`
93
+ #
94
+ # - **Grey**:
95
+ # - `grey1`: `Forbidden_Cmap(range(339, 356))`
96
+ # - `grey2`: `Forbidden_Cmap(range(356, 385))`
97
+ #
98
+ # Each main color can be represented as a combination of its sub-colormaps:
99
+ #
100
+ # - `green = green1 + green2 + green3`
101
+ # - `red = red1 + red2`
102
+ # - `pink = pink1 + pink2`
103
+ # - `purple = purple1 + purple2`
104
+ # - `yellow = yellow1 + yellow2 + yellow3`
105
+ # - `brown = brown1 + brown2 + brown3 + brown4`
106
+ # - `blue = blue1 + blue2 + blue3 + blue4`
107
+ # - `grey = grey1 + grey2`
108
+ #
109
+ # These colormaps can be utilized in various applications where color differentiation is necessary, providing flexibility in visual representation.
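+ #
+ # As a rough sanity check of the index ranges listed above, the sketch below rebuilds them as plain Python `range` objects (no omicverse required) and confirms that together they cover the indices 1 to 384 exactly once. The names `SUB_RANGES` and `main_color_indices` are illustrative assumptions and are not part of the omicverse API.

```python
# Index ranges copied from the list above; the dictionary name is illustrative.
SUB_RANGES = {
    "green1": range(1, 19), "green2": range(19, 41), "green3": range(41, 62),
    "red1": range(62, 77), "red2": range(77, 104),
    "pink1": range(104, 134), "pink2": range(134, 148),
    "purple1": range(148, 162), "purple2": range(162, 176),
    "yellow1": range(176, 196), "yellow2": range(196, 207), "yellow3": range(255, 276),
    "brown1": range(207, 228), "brown2": range(228, 246),
    "brown3": range(246, 255), "brown4": range(276, 293),
    "blue1": range(293, 312), "blue2": range(312, 321),
    "blue3": range(321, 333), "blue4": range(333, 339),
    "grey1": range(339, 356), "grey2": range(356, 385),
}


def main_color_indices(prefix):
    """Concatenate the indices of every sub-colormap named `prefix` + digit."""
    names = sorted(n for n in SUB_RANGES if n.rstrip("0123456789") == prefix)
    return [i for n in names for i in SUB_RANGES[n]]


# Every index from 1 to 384 should appear exactly once across all sub-colormaps.
all_indices = [i for r in SUB_RANGES.values() for i in r]
covers_all = sorted(all_indices) == list(range(1, 385))
print(len(all_indices), covers_all)
```

+ # Note that `yellow3` and `brown4` sit out of numerical order, so a main color such as `yellow` is not one contiguous block of indices.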
110
+ #
111
+ # To apply these colors in a plot, pass them through the `palette` argument.
112
+
113
+ # In[13]:
114
+
115
+
116
+ import matplotlib.pyplot as plt
117
+ fig, axes = plt.subplots(1,3,figsize=(9,3))
118
+ ov.pl.embedding(adata,
119
+ basis='X_umap',
120
+ frameon='small',
121
+ color=["clusters"],
122
+ palette=fb.red[:],
123
+ ncols=3,
124
+ show=False,
125
+ legend_loc=None,
126
+ ax=axes[0])
127
+
128
+ ov.pl.embedding(adata,
129
+ basis='X_umap',
130
+ frameon='small',
131
+ color=["clusters"],
132
+ palette=fb.pink1[:],
133
+ ncols=3,show=False,
134
+ legend_loc=None,
135
+ ax=axes[1])
136
+
137
+ ov.pl.embedding(adata,
138
+ basis='X_umap',
139
+ frameon='small',
140
+ color=["clusters"],
141
+ palette=fb.red1[:4]+fb.blue1,
142
+ ncols=3,show=False,
143
+ ax=axes[2])
144
+
145
+
146
+
147
+
148
+ # In[31]:
149
+
150
+
151
+ color_dict={'Astrocytes': '#e40414',
152
+ 'Cajal Retzius': '#ec5414',
153
+ 'Cck-Tox': '#ec4c2c',
154
+ 'Endothelial': '#d42c24',
155
+ 'GABA': '#2c5ca4',
156
+ 'Granule immature': '#acd4ec',
157
+ 'Granule mature': '#a4bcdc',
158
+ 'Microglia': '#8caccc',
159
+ 'Mossy': '#8cacdc',
160
+ 'Neuroblast': '#6c9cc4',
161
+ 'OL': '#6c94cc',
162
+ 'OPC': '#5c74bc',
163
+ 'Radial Glia-like': '#4c94c4',
164
+ 'nIPC': '#3474ac'}
165
+
166
+ ov.pl.embedding(adata,
167
+ basis='X_umap',
168
+ frameon='small',
169
+ color=["clusters"],
170
+ palette=color_dict,
171
+ ncols=3,show=False,
172
+ )
173
+
174
+
175
+
176
+ # ## Segmented Colormap
177
+ #
178
+ # When we need to create a continuous color gradient, we will use another function: `get_cmap_seg`, and we can combine the colors we need for visualization.
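+ #
+ # To illustrate what a segmented gradient means, here is a minimal sketch that linearly interpolates between a list of RGB anchor colors. It is a generic illustration in plain Python, not necessarily how `get_cmap_seg` is implemented; the function name `interpolate_colors` and the RGB values are placeholders, not the actual Forbidden City colors.

```python
def interpolate_colors(anchors, n):
    """Return n RGB tuples linearly interpolated through the anchor colors."""
    if len(anchors) < 2 or n < 2:
        raise ValueError("need at least two anchor colors and two samples")
    segments = len(anchors) - 1
    out = []
    for k in range(n):
        t = k / (n - 1) * segments      # position along the whole gradient
        i = min(int(t), segments - 1)   # index of the segment containing t
        frac = t - i                    # 0..1 position inside that segment
        lo, hi = anchors[i], anchors[i + 1]
        out.append(tuple(a + (b - a) * frac for a, b in zip(lo, hi)))
    return out


# Placeholder RGB values in [0, 1]; NOT the colors returned by fb.get_color_rgb.
blue, white, red = (0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0)
gradient = interpolate_colors([blue, white, red], 5)
print(gradient)
```

+ # With three anchors the middle color lands exactly at the midpoint of the gradient, which is why a pale middle anchor gives a diverging colormap.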
179
+
180
+ # In[22]:
181
+
182
+
183
+ colors=[
184
+ fb.get_color_rgb('群青'),
185
+ fb.get_color_rgb('半见'),
186
+ fb.get_color_rgb('丹罽'),
187
+ ]
188
+ fb.get_cmap_seg(colors)
189
+
190
+
191
+ # In[24]:
192
+
193
+
194
+ colors=[
195
+ fb.get_color_rgb('群青'),
196
+ fb.get_color_rgb('山矾'),
197
+ fb.get_color_rgb('丹罽'),
198
+ ]
199
+ fb.get_cmap_seg(colors)
200
+
201
+
202
+ # In[25]:
203
+
204
+
205
+ colors=[
206
+ fb.get_color_rgb('山矾'),
207
+ fb.get_color_rgb('丹罽'),
208
+ ]
209
+ fb.get_cmap_seg(colors)
210
+
211
+
212
+ # In[27]:
213
+
214
+
215
+ ov.pl.embedding(adata,
216
+ basis='X_umap',
217
+ frameon='small',
218
+ color=["Sox7"],
219
+ cmap=fb.get_cmap_seg(colors),
220
+ ncols=3,show=False,
221
+ #vmin=-1,vmax=1
222
+ )
223
+
ovrawm/t_visualize_single.txt ADDED
@@ -0,0 +1,534 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Visualization of single cell RNA-seq
5
+ #
6
+ # In this part, we introduce the specialized plotting functions of `omicverse`.
7
+
8
+ # In[1]:
9
+
10
+
11
+ import omicverse as ov
12
+ import scanpy as sc
13
+ #import scvelo as scv
14
+ ov.plot_set()
15
+
16
+
17
+ # We utilized single-cell RNA-seq data (GEO accession: GSE95753) obtained from the dentate gyrus of the hippocampus in mouse.
18
+
19
+ # In[2]:
20
+
21
+
22
+ adata = ov.read('data/DentateGyrus/10X43_1.h5ad')
23
+ adata
24
+
25
+
26
+ # ## Optimizing color mapping
27
+ #
28
+ # Visualizing spatially resolved biological data with appropriate color mapping can significantly facilitate the exploration of underlying patterns and heterogeneity. Spaco (spatial colorization) provides a spatially constrained approach that generates discriminate color assignments for visualizing single-cell spatial data in various scenarios.
29
+ #
30
+ # Jing Z, Zhu Q, Li L, Xie Y, Wu X, Fang Q, et al. [Spaco: A comprehensive tool for coloring spatial data at single-cell resolution.](https://doi.org/10.1016/j.patter.2023.100915) Patterns. 2024;100915
31
+ #
32
+ #
33
+ # **Function**: `ov.pl.optim_palette`:
34
+ # - adata: the datasets of scRNA-seq
35
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
36
+ # - colors: the column in `adata.obs` whose colours are to be optimised. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
37
+ # - palette: the colours to draw from, for example `palette=['#FFFFFF','#000000']`. Ready-made lists are available as `ov.pl.red_color`, `ov.pl.blue_color`, `ov.pl.green_color`, and `ov.pl.orange_color`.
38
+
39
+ # In[ ]:
40
+
41
+
42
+ optim_palette=ov.pl.optim_palette(adata,basis='X_umap',colors='clusters')
43
+
44
+
45
+ # In[4]:
46
+
47
+
48
+ import matplotlib.pyplot as plt
49
+ fig,ax=plt.subplots(figsize = (4,4))
50
+ ov.pl.embedding(adata,
51
+ basis='X_umap',
52
+ color='clusters',
53
+ frameon='small',
54
+ show=False,
55
+ palette=optim_palette,
56
+ ax=ax,)
57
+ plt.title('Cell Type of DentateGyrus',fontsize=15)
58
+
59
+
60
+ # In[5]:
61
+
62
+
63
+ ov.pl.embedding(adata,
64
+ basis='X_umap',
65
+ color='age(days)',
66
+ frameon='small',
67
+ show=False,)
68
+
69
+
70
+ # ## Stacked histogram of cell proportions
71
+ #
72
+ # This kind of plot appears widely in CNS-level (Cell, Nature, Science) journals, but `scanpy` has no convenient function for drawing it, so we provide `ov.pl.cellproportion`.
73
+ #
74
+ # **Function**: `ov.pl.cellproportion`:
75
+ # - adata: the datasets of scRNA-seq
76
+ # - celltype_clusters: the column in `adata.obs` to draw as stacked segments. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
77
+ # - groupby: The group variable for the different groups of cell types we need to display, in this case we are displaying different ages, so we set it to `age(days)`
78
+ # - groupby_li: If there are too many groups, we can also select the ones we are interested in plotting, here we use groupby_li to plot the groups
79
+ # - figsize: If we specify axes, then this variable can be left empty
80
+ # - legend: Whether to show a legend
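+ #
+ # The quantity behind the stacked histogram is simply the fraction of each cell type within each group. A minimal sketch in plain Python follows; the helper name `cell_proportions` and the toy labels are assumptions for illustration, and this is not the omicverse implementation.

```python
from collections import Counter, defaultdict


def cell_proportions(celltypes, groups):
    """Return {group: {celltype: fraction}}; fractions sum to 1 within a group."""
    counts = defaultdict(Counter)
    for celltype, group in zip(celltypes, groups):
        counts[group][celltype] += 1
    return {
        group: {ct: n / sum(c.values()) for ct, n in c.items()}
        for group, c in counts.items()
    }


# Toy labels standing in for adata.obs['clusters'] and adata.obs['age(days)'].
toy_clusters = ["nIPC", "nIPC", "Granule mature",
                "Granule mature", "Granule mature", "nIPC"]
toy_ages = ["12", "12", "12", "35", "35", "35"]
proportions = cell_proportions(toy_clusters, toy_ages)
print(proportions)
```

+ # Swapping the two arguments gives the composition of each cell type by group instead, which mirrors swapping `celltype_clusters` and `groupby` in the cells below.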
81
+
82
+ # In[6]:
83
+
84
+
85
+ import matplotlib.pyplot as plt
86
+ fig,ax=plt.subplots(figsize = (1,4))
87
+ ov.pl.cellproportion(adata=adata,celltype_clusters='clusters',
88
+ groupby='age(days)',legend=True,ax=ax)
89
+
90
+
91
+ # In[7]:
92
+
93
+
94
+ fig,ax=plt.subplots(figsize = (2,2))
95
+ ov.pl.cellproportion(adata=adata,celltype_clusters='age(days)',
96
+ groupby='clusters',groupby_li=['nIPC','Granule immature','Granule mature'],
97
+ legend=True,ax=ax)
98
+
99
+
100
+ # If you are interested in the changes in cell types in different groups, we recommend using a stacked area graph.
101
+
102
+ # In[8]:
103
+
104
+
105
+ fig,ax=plt.subplots(figsize = (2,2))
106
+ ov.pl.cellstackarea(adata=adata,celltype_clusters='age(days)',
107
+ groupby='clusters',groupby_li=['nIPC','Granule immature','Granule mature'],
108
+ legend=True,ax=ax)
109
+
110
+
111
+ # ## A collection of interesting embedding plots
112
+ #
113
+ # Our first example is an embedding plot annotated with the number and proportion of each cell type, so the low-dimensional representation and the cell-type composition are shown together. Note that the proportion panel on the left may be distorted when there are too many cell types; contributions to fix this bug are welcome.
114
+ #
115
+ # **Function**: `ov.pl.embedding_celltype`:
116
+ # - adata: the datasets of scRNA-seq
117
+ # - figsize: Note that we don't usually provide the ax parameter for combinatorial graphs, this is due to the fact that combinatorial graphs are made up of multiple axes, so the figsize parameter is more important, here we set it to `figsize=(7,4)`.
118
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
119
+ # - celltype_key: the column in `adata.obs` that holds the cell-type labels. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
120
+ # - title: the plot title. Leading spaces in the string are used to adjust its position.
121
+ # - celltype_range: controls where the cell-type proportion panel is drawn. Because the number of cell types differs between datasets, this usually needs tuning; here we set it to `(1,10)`.
122
+ # - embedding_range: controls where the embedding is drawn. Like `celltype_range`, it usually needs a few rounds of adjustment for the best result.
123
+
124
+ # In[9]:
125
+
126
+
127
+ ov.pl.embedding_celltype(adata,figsize=(7,4),basis='X_umap',
128
+ celltype_key='clusters',
129
+ title=' Cell type',
130
+ celltype_range=(1,10),
131
+ embedding_range=(4,10),)
132
+
133
+
134
+ # Sometimes we want to be able to circle a certain type of cell that we are interested in, and here we use convex polygons to achieve this, while the shape of the convex polygons may be optimised in future versions.
135
+ #
136
+ # **Function**: `ov.pl.ConvexHull`:
137
+ # - adata: the datasets of scRNA-seq
138
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
139
+ # - cluster_key: the column in `adata.obs` that holds the cell-type labels. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
140
+ # - hull_cluster: the target celltype to be circled.
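+ #
+ # For readers unfamiliar with convex hulls, the sketch below computes one for a handful of 2D points with Andrew's monotone chain algorithm. It only illustrates the geometric idea; whether `ov.pl.ConvexHull` uses this particular algorithm is not stated in this tutorial, and the function name `convex_hull` and the toy coordinates are assumptions.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                       # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # drop the duplicated end points


# Toy 2D coordinates standing in for the UMAP positions of one cluster.
toy_coords = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(toy_coords)
print(hull)
```

+ # Interior points such as `(1, 1)` are discarded, so the outline always encloses every cell of the chosen cluster, including outliers.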
141
+
142
+ # In[10]:
143
+
144
+
145
+ import matplotlib.pyplot as plt
146
+ fig,ax=plt.subplots(figsize = (4,4))
147
+
148
+ ov.pl.embedding(adata,
149
+ basis='X_umap',
150
+ color=['clusters'],
151
+ frameon='small',
152
+ show=False,
153
+ ax=ax)
154
+
155
+ ov.pl.ConvexHull(adata,
156
+ basis='X_umap',
157
+ cluster_key='clusters',
158
+ hull_cluster='Granule mature',
159
+ ax=ax)
160
+
161
+
162
+ # Alternatively, if you don't want to draw a convex hull, you can draw a contour instead.
163
+ #
164
+ # **Function**: `ov.pl.contour`:
165
+ # - adata: the datasets of scRNA-seq
166
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
167
+ # - groupby: the column in `adata.obs` that holds the cell-type labels. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
168
+ # - clusters: the target celltype to be circled.
169
+ # - colors: the color of the contour
170
+ # - linestyles: the linestyles of the contour
171
+ # - **kwargs: more kwargs could be found from `plt.contour`
172
+
173
+ # In[11]:
174
+
175
+
176
+ import matplotlib.pyplot as plt
177
+ fig,ax=plt.subplots(figsize = (4,4))
178
+
179
+ ov.pl.embedding(adata,
180
+ basis='X_umap',
181
+ color=['clusters'],
182
+ frameon='small',
183
+ show=False,
184
+ ax=ax)
185
+
186
+ ov.pl.contour(ax=ax,adata=adata,groupby='clusters',clusters=['Granule immature','Granule mature'],
187
+ basis='X_umap',contour_threshold=0.1,colors='#000000',
188
+ linestyles='dashed',)
189
+
190
+
191
+ # In scanpy's default `embedding` plotting function, the legend labels may overlap or be masked when the legend is shown. To solve this problem, omicverse provides `ov.pl.embedding_adjust`, which automatically adjusts the position of the legend labels.
192
+ #
193
+ # **Function**: `ov.pl.embedding_adjust`:
194
+ # - adata: the datasets of scRNA-seq
195
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
196
+ # - groupby: the column in `adata.obs` that holds the cell-type labels. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
197
+ # - exclude: We can specify which cell types are not to be plotted, in this case we set it to `OL`
198
+ # - adjust_kwargs: We can manually specify the parameters of [adjustText](https://adjusttext.readthedocs.io/en/latest/), the specific parameters see the documentation of adjustText, it should be noted that we have to use dict to specify the parameters here.
199
+ # - text_kwargs: We can also specify the font colour manually by specifying the [text_kwargs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html) parameter
200
+
201
+ # In[12]:
202
+
203
+
204
+ from matplotlib import patheffects
205
+ import matplotlib.pyplot as plt
206
+ fig, ax = plt.subplots(figsize=(4,4))
207
+
208
+ ov.pl.embedding(adata,
209
+ basis='X_umap',
210
+ color=['clusters'],
211
+ show=False, legend_loc=None, add_outline=False,
212
+ frameon='small',legend_fontoutline=2,ax=ax
213
+ )
214
+
215
+ ov.pl.embedding_adjust(
216
+ adata,
217
+ groupby='clusters',
218
+ exclude=("OL",),
219
+ basis='X_umap',
220
+ ax=ax,
221
+ adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
222
+ text_kwargs=dict(fontsize=12 ,weight='bold',
223
+ path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
224
+ )
225
+
226
+
227
+ # Sometimes we are interested in the density of one cell type within a categorical variable. This is cumbersome to plot with `scanpy`, so omicverse provides a simplified implementation that produces the same kind of plot.
228
+ #
229
+ # **Function**: `ov.pl.embedding_density`:
230
+ # - adata: the datasets of scRNA-seq
231
+ # - basis: the 2D coordinates to plot. Set it to `X_spatial` for spatial RNA-seq, or to `X_umap`, `X_tsne`, or `X_mde` for scRNA-seq; do not use `X_pca`.
232
+ # - groupby: the column in `adata.obs` that holds the cell-type labels. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
233
+ # - target_clusters: We can specify which cell types are to be plotted, in this case we set it to `Granule mature`
234
+ # - kwargs: other parameters can be found in `scanpy.pl.embedding`
235
+
236
+ # In[13]:
237
+
238
+
239
+ ov.pl.embedding_density(adata,
240
+ basis='X_umap',
241
+ groupby='clusters',
242
+ target_clusters='Granule mature',
243
+ frameon='small',
244
+ show=False,cmap='RdBu_r',alpha=0.8)
245
+
246
+
247
+ # ## Bar graph with overlapping dots (Bar-dot) plot
248
+ #
249
+ # In biological research, bar plots with overlaid data points (bar-dot plots) are among the most commonly used figures, but matplotlib, seaborn, and scanpy offer no function that draws them directly. To fill this gap, omicverse implements `ov.pl.bardotplot` and supports adding p-values manually. Here, "manually" means annotating p-values obtained from model fitting, not making up p-values yourself.
250
+
251
+ # In[14]:
252
+
253
+
254
+ ov.single.geneset_aucell(adata,
255
+ geneset_name='Sox',
256
+ geneset=['Sox17', 'Sox4', 'Sox7', 'Sox18', 'Sox5'])
257
+
258
+
259
+ # In[15]:
260
+
261
+
262
+ ov.pl.embedding(adata,
263
+ basis='X_umap',
264
+ color=['Sox4'],
265
+ frameon='small',
266
+ show=False,)
267
+
268
+
269
+ # In[18]:
270
+
271
+
272
+ ov.pl.violin(adata,keys='Sox4',groupby='clusters',figsize=(6,3))
273
+
274
+
275
+ # **Function**: `ov.pl.bardotplot`:
276
+ # - adata: the datasets of scRNA-seq
277
+ # - groupby: the column in `adata.obs` that defines the groups on the x-axis. The column should already have colours assigned, which can be done by plotting it once with `ov.pl.embedding`. If it has none, a colour-blind-friendly palette is used.
278
+ # - color: the variable whose values are plotted, which can be a gene, stored in adata.var, or a per-cell value, stored in adata.obs.
279
+ # - bar_kwargs: We provide the parameters of the barplot for input, see the matplotlib documentation for more [details](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)
280
+ # - scatter_kwargs: We also provide the parameters of the scatter for input, see the matplotlib documentation for more [details](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)
281
+ #
282
+ # **Function**: `ov.pl.add_palue`:
283
+ # - ax: the axes of bardotplot
284
+ # - line_x1: The left side of the p-value line to be plotted
285
+ # - line_x2: The right side of the p-value line to be plotted
286
+ # - line_y: The height of the p-value line to be plotted
287
+ # - text_y: How much above the p-value line is plotted text
288
+ # - text: the text of the p-value; for example, you can use `***` instead of `p<0.001`
289
+ # - fontsize: the fontsize of text
290
+ # - fontcolor: the color of text
291
+ # - horizontalalignment: the location of text
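+ #
+ # If you prefer star notation for the `text` argument, a small helper like the one below can convert a p-value into a label. The thresholds follow a common convention (`***` below 0.001, `**` below 0.01, `*` below 0.05) and are an assumption, as is the function name `p_to_stars`; neither comes from omicverse.

```python
def p_to_stars(p):
    """Map a p-value to a star label using common (assumed) thresholds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"  # not significant


labels = [p_to_stars(p) for p in (0.0005, 0.001, 0.03, 0.2)]
print(labels)
```

+ # The returned string can then be passed as `text=...` in place of the formatted p-value used in the cells below.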
292
+
293
+ # In[19]:
294
+
295
+
296
+ fig, ax = plt.subplots(figsize=(6,2))
297
+ ov.pl.bardotplot(adata,groupby='clusters',color='Sox_aucell',figsize=(6,2),
298
+ ax=ax,
299
+ ylabel='Expression',
300
+ bar_kwargs={'alpha':0.5,'linewidth':2,'width':0.6,'capsize':4},
301
+ scatter_kwargs={'alpha':0.8,'s':10,'marker':'o'})
302
+
303
+ ov.pl.add_palue(ax,line_x1=3,line_x2=4,line_y=0.1,
304
+ text_y=0.02,
305
+ text='$p={}$'.format(round(0.001,3)),
306
+ fontsize=11,fontcolor='#000000',
307
+ horizontalalignment='center',)
308
+
309
+
310
+ # In[20]:
311
+
312
+
313
+ fig, ax = plt.subplots(figsize=(6,2))
314
+ ov.pl.bardotplot(adata,groupby='clusters',color='Sox17',figsize=(6,2),
315
+ ax=ax,
316
+ ylabel='Expression',xlabel='Cell Type',
317
+ bar_kwargs={'alpha':0.5,'linewidth':2,'width':0.6,'capsize':4},
318
+ scatter_kwargs={'alpha':0.8,'s':10,'marker':'o'})
319
+
320
+ ov.pl.add_palue(ax,line_x1=3,line_x2=4,line_y=2,
321
+ text_y=0.2,
322
+ text='$p={}$'.format(round(0.001,3)),
323
+ fontsize=11,fontcolor='#000000',
324
+ horizontalalignment='center',)
325
+
326
+
327
+ # ## Boxplot with jitter points
328
+ # A box plot, also known as a box-and-whisker plot, is a graphical representation used to display the distribution and summary statistics of a dataset. It provides a concise and visual way to understand the central tendency, spread, and potential outliers in the data.
329
+
330
+ # **Function**: `ov.pl.single_group_boxplot`:
331
+ #
332
+ # - adata (AnnData object): The data object containing the information for plotting.
333
+ # - groupby (str): The variable used for grouping the data
334
+ # - color (str): The variable used for coloring the data points.
335
+ # - type_color_dict (dict): A dictionary mapping group categories to specific colors.
336
+ # - scatter_kwargs (dict): Additional keyword arguments for customizing the scatter plot.
337
+ # - ax (matplotlib.axes.Axes): A pre-existing axes object for plotting (optional).
338
+ #
339
+
340
+ # In[21]:
341
+
342
+
343
+ import pandas as pd
344
+ import seaborn as sns
345
+ #sns.set_style('white')
346
+
347
+ ov.pl.single_group_boxplot(adata,groupby='clusters',
348
+ color='Sox_aucell',
349
+ type_color_dict=dict(zip(pd.Categorical(adata.obs['clusters']).categories, adata.uns['clusters_colors'])),
350
+ x_ticks_plot=True,
351
+ figsize=(5,2),
352
+ kruskal_test=True,
353
+ ylabel='Sox_aucell',
354
+ legend_plot=False,
355
+ bbox_to_anchor=(1,1),
356
+ title='Expression',
357
+ scatter_kwargs={'alpha':0.8,'s':10,'marker':'o'},
358
+ point_number=15,
359
+ sort=False,
360
+ save=False,
361
+ )
362
+ plt.grid(False)
363
+ plt.xticks(rotation=90,fontsize=12)
364
+
365
+
366
+ # ## Complexheatmap
367
+ #
368
+ # A complex heatmap, also known as a clustered heatmap, is a data visualization technique used to represent complex relationships and patterns in multivariate data. It combines several elements, including clustering, color mapping, and hierarchical organization, to provide a comprehensive view of data across multiple dimensions.
369
+
370
+ # **Function**: `ov.pl.complexheatmap`:
371
+ #
372
+ # - adata (AnnData): Annotated data object containing single-cell RNA-seq data.
373
+ # - groupby (str, optional): Grouping variable for the heatmap. Default is ''.
374
+ # - figsize (tuple, optional): Figure size. Default is (6, 10).
375
+ # - layer (str, optional): Data layer to use. Default is None.
376
+ # - use_raw (bool, optional): Whether to use the raw data. Default is False.
377
+ # - var_names (list or None, optional): List of genes to include in the heatmap. Default is None.
378
+ # - gene_symbols (None, optional): Not used in the function.
379
+ # - standard_scale (str, optional): Method for standardizing values. Options: 'obs', 'var', None. Default is None.
380
+ # - col_color_bars (dict, optional): Dictionary mapping columns types to colors.
381
+ # - col_color_labels (dict, optional): Dictionary mapping column labels to colors.
382
+ # - left_color_bars (dict, optional): Dictionary mapping left types to colors.
383
+ # - left_color_labels (dict, optional): Dictionary mapping left labels to colors.
384
+ # - right_color_bars (dict, optional): Dictionary mapping right types to colors.
385
+ # - right_color_labels (dict, optional): Dictionary mapping right labels to colors.
386
+ # - marker_genes_dict (dict, optional): Dictionary mapping cell types to marker genes.
387
+ # - index_name (str, optional): Name for the index column in the melted DataFrame. Default is ''.
388
+ # - value_name (str, optional): Name for the value column in the melted DataFrame. Default is ''.
389
+ # - cmap (str, optional): Colormap for the heatmap. Default is 'parula'.
390
+ # - xlabel (str, optional): X-axis label. Default is ''.
391
+ # - ylabel (str, optional): Y-axis label. Default is ''.
392
+ # - label (str, optional): Label for the plot. Default is ''.
393
+ # - save (bool, optional): Whether to save the plot. Default is False.
394
+ # - save_pathway (str, optional): File path for saving the plot. Default is ''.
395
+ # - legend_gap (int, optional): Gap between legend items. Default is 7.
396
+ # - legend_hpad (int, optional): Horizontal space between the heatmap and legend, default is 2 [mm].
397
+ # - show (bool, optional): Whether to display the plot. Default is False.
398
+ #
399
+ #
400
+
401
+ # In[22]:
402
+
403
+
404
+ import pandas as pd
405
+ marker_genes_dict = {
406
+ 'Sox':['Sox4', 'Sox7', 'Sox18', 'Sox5'],
407
+ }
408
+
409
+ color_dict = {'Sox':'#EFF3D8',}
410
+
411
+ gene_color_dict = {}
412
+ gene_color_dict_black = {}
413
+ for cell_type, genes in marker_genes_dict.items():
414
+ cell_type_color = color_dict.get(cell_type)
415
+ for gene in genes:
416
+ gene_color_dict[gene] = cell_type_color
417
+ gene_color_dict_black[gene] = '#000000'
418
+
419
+ cm = ov.pl.complexheatmap(adata,
420
+ groupby ='clusters',
421
+ figsize =(5,2),
422
+ layer = None,
423
+ use_raw = False,
424
+ standard_scale = 'var',
425
+ col_color_bars = dict(zip(pd.Categorical(adata.obs['clusters']).categories, adata.uns['clusters_colors'])),
426
+ col_color_labels = dict(zip(pd.Categorical(adata.obs['clusters']).categories, adata.uns['clusters_colors'])),
427
+ left_color_bars = color_dict,
428
+ left_color_labels = None,
429
+ right_color_bars = color_dict,
430
+ right_color_labels = gene_color_dict_black,
431
+ marker_genes_dict = marker_genes_dict,
432
+ cmap = 'coolwarm', #parula,jet
433
+ legend_gap = 15,
434
+ legend_hpad = 0,
435
+ left_add_text = True,
436
+ col_split_gap = 2,
437
+ row_split_gap = 1,
438
+ col_height = 6,
439
+ left_height = 4,
440
+ right_height = 6,
441
+ col_split = None,
442
+ row_cluster = False,
443
+ col_cluster = False,
444
+ value_name='Gene',
445
+ xlabel = "Expression of selected genes",
446
+ label = 'Gene Expression',
447
+ save = True,
448
+ show = False,
449
+ legend = False,
450
+ plot_legend = False,
451
+ #save_pathway = "complexheatmap.png",
452
+ )
453
+
454
+
455
+ # ## Marker gene plot
456
+ #
457
+ # In single-cell analysis, a marker gene heatmap is a powerful visualization tool that helps researchers to understand the expression patterns of specific marker genes across different cell populations. Here we provide `ov.pl.marker_heatmap` for visualizing the patterns of marker genes.
458
+
459
+ # We first preprocess the data and define the dictionary of cell type and marker gene.
460
+ # **Please ensure that each gene in the dictionary appears only once** (i.e. different cells cannot have the same marker gene, otherwise an error will be reported).
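+ #
+ # Before plotting, it can help to check the dictionary for genes that are listed more than once. The helper below is a minimal sketch in plain Python; `find_shared_markers` and the toy dictionary are illustrative names, not part of omicverse.

```python
from collections import defaultdict


def find_shared_markers(marker_genes_dict):
    """Return {gene: [cell types]} for every gene listed more than once."""
    owners = defaultdict(list)
    for cell_type, genes in marker_genes_dict.items():
        for gene in genes:
            owners[gene].append(cell_type)
    return {gene: types for gene, types in owners.items() if len(types) > 1}


# Toy dictionary in which 'C1qa' is deliberately assigned to two cell types.
toy_markers = {
    "Microglia": ["C1qa", "C1qb"],
    "OPC": ["Olig1", "C1qa"],
    "OL": ["Plp1", "Mog"],
}
shared = find_shared_markers(toy_markers)
print(shared)
```

+ # An empty result means every gene appears only once. A gene repeated within the same cell type is reported as well.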
461
+
462
+ # In[23]:
463
+
464
+
465
+ adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
466
+
467
+ marker_genes_dict = {'Granule immature': ['Sepw1', 'Camk2b', 'Cnih2'],
468
+ 'Radial Glia-like': ['Dbi', 'Fabp7', 'Aldoc'],
469
+ 'Granule mature': ['Malat1', 'Rasl10a', 'Ppp3ca'],
470
+ 'Neuroblast': ['Igfbpl1', 'Tubb2b', 'Tubb5'],
471
+ 'Microglia': ['Lgmn', 'C1qa', 'C1qb'],
472
+ 'Cajal Retzius': ['Diablo', 'Ramp1', 'Stmn1'],
473
+ 'OPC': ['Olig1', 'C1ql1', 'Pllp'],
474
+ 'Cck-Tox': ['Tshz2', 'Cck', 'Nap1l5'],
475
+ 'GABA': ['Gad2', 'Gad1', 'Snhg11'],
476
+ 'Endothelial': ['Sparc', 'Myl12a', 'Itm2a'],
477
+ 'Astrocytes': ['Apoe', 'Atp1a2'],
478
+ 'OL': ['Plp1', 'Mog', 'Mag'],
479
+ 'Mossy': ['Arhgdig', 'Camk4'],
480
+ 'nIPC': ['Hmgn2', 'Ptma', 'H2afz']}
481
+
482
+
483
+ # **Function**: `ov.pl.marker_heatmap`:
484
+ #
485
+ # - adata: AnnData object
486
+ # Annotated data matrix.
487
+ # - marker_genes_dict: dict
488
+ # A dictionary containing the marker genes for each cell type.
489
+ # - groupby: str
490
+ # The key in adata.obs that will be used for grouping the cells.
491
+ # - color_map: str
492
+ # The color map to use for the value of heatmap.
493
+ # - use_raw: bool
494
+ # Whether to use the raw data of AnnData object for plotting.
495
+ # - standard_scale: str
496
+ # The standard scale for the heatmap.
497
+ # - expression_cutoff: float
498
+ # The cutoff value for the expression of genes.
499
+ # - bbox_to_anchor: tuple
500
+ # The position of the legend bbox (x, y) in axes coordinates.
501
+ # - figsize: tuple
502
+ # The size of the plot figure in inches (width, height).
503
+ # - spines: bool
504
+ # Whether to show the spines of the plot.
505
+ # - fontsize: int
506
+ # The font size of the text in the plot.
507
+ # - show_rownames: bool
508
+ # Whether to show the row names in the heatmap.
509
+ # - show_colnames: bool
510
+ # Whether to show the column names in the heatmap.
511
+ # - save_pathway: str
512
+ # The file path for saving the plot.
513
+ # - ax: matplotlib.axes.Axes
514
+ # A pre-existing axes object for plotting (optional).
515
+
516
+ # In[24]:
517
+
518
+
519
+ ov.pl.marker_heatmap(
520
+ adata,
521
+ marker_genes_dict,
522
+ groupby='clusters',
523
+ color_map="RdBu_r",
524
+ use_raw=False,
525
+ standard_scale="var",
526
+ expression_cutoff=0.0,
527
+ fontsize=12,
528
+ bbox_to_anchor=(7, -2),
529
+ figsize=(8.5,4),
530
+ spines=False,
531
+ show_rownames=False,
532
+ show_colnames=True,
533
+ )
534
+
ovrawm/t_wgcna.txt ADDED
@@ -0,0 +1,252 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # WGCNA (Weighted gene co-expression network analysis) analysis
5
+ # Weighted gene co-expression network analysis (WGCNA) is a systems biology approach for characterizing gene association patterns across samples. It can be used to identify highly co-expressed gene sets (modules) and to nominate candidate biomarker genes or therapeutic targets, based on the connectivity within each gene set and on the association between gene sets and phenotypes.
6
+ #
7
+ # Paper: [WGCNA: an R package for weighted correlation network analysis](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559#Sec21)
8
+ #
9
+ # Narges Rezaie, Farilie Reese, Ali Mortazavi, PyWGCNA: a Python package for weighted gene co-expression network analysis, Bioinformatics, Volume 39, Issue 7, July 2023, btad415, https://doi.org/10.1093/bioinformatics/btad415
10
+ #
11
+ # Code: reimplemented in Python. The original R package is at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA
12
+ #
13
+ # Colab_Reproducibility: https://colab.research.google.com/drive/1EbP-Tq1IwYO9y1_-zzw23XlPbzrxP0og?usp=sharing
14
+ #
15
+ # Here, you will be briefly guided through the basics of how to use omicverse to perform WGCNA analysis.
16
+
17
+ # In[1]:
18
+
19
+
20
+ import scanpy as sc
21
+ import omicverse as ov
22
+ import matplotlib.pyplot as plt
23
+ ov.plot_set()
24
+
25
+
26
+ # ## Load the data
27
+ # The analysis is based on the built-in WGCNA tutorial data. All the data can be downloaded from https://github.com/mortazavilab/PyWGCNA/tree/main/tutorials/5xFAD_paper
28
+
29
+ # In[2]:
30
+
31
+
32
+ import pandas as pd
33
+ data=ov.utils.read('data/5xFAD_paper/expressionList.csv',
34
+ index_col=0)
35
+ data.head()
36
+
37
+
38
+ # In[3]:
39
+
40
+
41
+ from statsmodels import robust #import package
42
+ gene_mad=data.apply(robust.mad) #use function to calculate MAD
43
+ data=data.T
44
+ data=data.loc[gene_mad.sort_values(ascending=False).index[:2000]]
45
+ data.head()
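+ # The MAD filter above can be sketched without statsmodels. This is a minimal illustration, assuming a samples x genes matrix; `top_mad_genes` and the toy values are hypothetical and not part of omicverse. It uses the raw median absolute deviation, so the values differ from `robust.mad` by a scaling constant, but when the median is used as the centre the ranking of genes is the same.

```python
import numpy as np

def top_mad_genes(expr, gene_names, n_top):
    """Return the names of the n_top genes with the largest MAD.

    expr is a samples x genes array; gene_names labels its columns.
    """
    expr = np.asarray(expr, dtype=float)
    # Median of each gene (column) across samples
    med = np.median(expr, axis=0)
    # Median absolute deviation from that median, per gene
    mad = np.median(np.abs(expr - med), axis=0)
    # Sort genes from most to least variable and keep the first n_top
    order = np.argsort(-mad, kind="stable")[:n_top]
    return [gene_names[i] for i in order]

# Toy data: 4 samples x 3 genes (hypothetical values)
expr = np.array([[1.0, 10.0, 5.0],
                 [1.0, 20.0, 6.0],
                 [1.0, 30.0, 7.0],
                 [1.0, 40.0, 8.0]])
selected = top_mad_genes(expr, ["flat", "variable", "mild"], n_top=2)
print(selected)
```

+ # In the tutorial the same idea is applied with `n_top=2000`, which keeps only the most variable genes and shortens the network construction steps.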
46
+
47
+
48
+ # In[5]:
49
+
50
+
51
+ #import PyWGCNA
52
+ pyWGCNA_5xFAD = ov.bulk.pyWGCNA(name='5xFAD_2k',
53
+ species='mus musculus',
54
+ geneExp=data.T,
55
+ outputPath='',
56
+ save=True)
57
+ pyWGCNA_5xFAD.geneExpr.to_df().head(5)
58
+
59
+
60
+ # ## Pre-processing workflow
61
+ #
62
+ # PyWGCNA makes it easy to preprocess the data, including removing genes with too many missing values or genes that are lowly expressed across samples (by default, we suggest removing genes expressed below 1 TPM) and removing samples with too many missing values. Keep in mind that these options can be adjusted by changing `TPMcutoff` and `cut`.
63
+
64
+ # In[6]:
65
+
66
+
67
+ pyWGCNA_5xFAD.preprocess()
68
+
69
+
70
+ # ## Construction of the gene network and identification of modules
71
+ #
72
+ # PyWGCNA compresses all the steps of network construction and module detection in one function called `findModules` which performs the following steps:
73
+ # 1. Choosing the soft-thresholding power: analysis of network topology
74
+ # 2. Co-expression similarity and adjacency
75
+ # 3. Topological Overlap Matrix (TOM)
76
+ # 4. Clustering using TOM
77
+ # 5. Merging of modules whose expression profiles are very similar
78
+ #
79
+ # In this tutorial, we will perform the analysis step by step.
80
+
81
+ # In[7]:
82
+
83
+
84
+ pyWGCNA_5xFAD.calculate_soft_threshold()
85
+
86
+
87
+ # In[8]:
88
+
89
+
90
+ pyWGCNA_5xFAD.calculating_adjacency_matrix()
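+ # As a rough sketch of what soft thresholding does: the adjacency between two genes is their absolute correlation raised to the chosen power, which pushes weak correlations towards zero. This example assumes an unsigned network; the network type PyWGCNA uses by default may differ, and `soft_threshold_adjacency` is a hypothetical helper, not an omicverse function.

```python
import numpy as np

def soft_threshold_adjacency(expr, power):
    """Unsigned soft-thresholded adjacency from a samples x genes matrix."""
    # Pearson correlation between genes (columns)
    cor = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    # Raising |cor| to a power > 1 shrinks weak correlations towards zero
    return np.abs(cor) ** power

# Toy data: 4 samples x 3 genes (hypothetical values)
expr = np.array([[1.0, 2.0, 1.0],
                 [2.0, 4.0, 3.0],
                 [3.0, 6.0, 2.0],
                 [4.0, 8.0, 4.0]])
adj = soft_threshold_adjacency(expr, power=6)
print(adj.round(4))
```

+ # Genes 1 and 2 are perfectly correlated and keep an adjacency of 1, while the correlation of 0.8 between genes 1 and 3 is reduced to 0.8**6, roughly 0.26.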
91
+
92
+
93
+ # In[9]:
94
+
95
+
96
+ pyWGCNA_5xFAD.calculating_TOM_similarity_matrix()
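+ # The topological overlap between two genes combines their direct adjacency with the neighbours they share. A minimal sketch of the standard unsigned TOM formula, TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), is given here; `tom_similarity` is a hypothetical helper and the PyWGCNA implementation may differ in details such as the TOM type.

```python
import numpy as np

def tom_similarity(adj):
    """Unsigned topological overlap matrix from a symmetric adjacency."""
    a = np.array(adj, dtype=float)
    # Self-connections are ignored when counting neighbours
    np.fill_diagonal(a, 0.0)
    # shared[i, j] = sum over u of a[i, u] * a[u, j]
    shared = a @ a
    # Connectivity of each gene
    k = a.sum(axis=1)
    k_min = np.minimum.outer(k, k)
    tom = (shared + a) / (k_min + 1.0 - a)
    # By convention each gene has overlap 1 with itself
    np.fill_diagonal(tom, 1.0)
    return tom

# Toy adjacency for 3 genes (hypothetical values)
adj = np.array([[1.0, 1.0, 0.5],
                [1.0, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
tom = tom_similarity(adj)
print(tom.round(4))
```

+ # Clustering is then run on the dissimilarity `1 - tom`, so genes with many shared neighbours end up in the same branch of the gene tree.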
97
+
98
+
99
+ # ## Building a network of co-expressions
100
+ #
101
+ # We use dynamic tree cutting to identify co-expression modules based on the TOM matrix.
102
+
103
+ # In[10]:
104
+
105
+
106
+ pyWGCNA_5xFAD.calculate_geneTree()
107
+ pyWGCNA_5xFAD.calculate_dynamicMods(kwargs_function={'cutreeHybrid': {'deepSplit': 2, 'pamRespectsDendro': False}})
108
+ pyWGCNA_5xFAD.calculate_gene_module(kwargs_function={'moduleEigengenes': {'softPower': 8}})
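+ # A module eigengene summarises a module as the first principal component of its standardised expression matrix. The sketch below illustrates that idea with a plain SVD; `module_eigengene` is a hypothetical helper, and the sign alignment with the average expression is an assumption borrowed from the usual WGCNA convention rather than something confirmed for this implementation.

```python
import numpy as np

def module_eigengene(expr):
    """First principal component of a samples x genes module matrix."""
    x = np.asarray(expr, dtype=float)
    # Standardise each gene so that no single gene dominates
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Left singular vectors give one value per sample
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    eigengene = u[:, 0]
    # The sign of a singular vector is arbitrary; align it with the mean profile
    if np.corrcoef(eigengene, z.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

# Toy module: 5 samples x 3 perfectly co-expressed genes (hypothetical values)
base = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
expr = np.column_stack([base, 2 * base + 1, 3 * base])
me = module_eigengene(expr)
print(me.round(3))
```

+ # Because the three toy genes are perfectly co-expressed, the eigengene follows their shared profile exactly, up to scaling.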
109
+
110
+
111
+ # In[11]:
112
+
113
+
114
+ pyWGCNA_5xFAD.plot_matrix(save=False)
115
+
116
+
117
+ # ## Saving and loading your PyWGCNA
118
+ # You can save or load your PyWGCNA object with the `saveWGCNA()` or `readWGCNA()` functions respectively.
119
+
120
+ # In[12]:
121
+
122
+
123
+ pyWGCNA_5xFAD.saveWGCNA()
124
+
125
+
126
+ # In[2]:
127
+
128
+
129
+ pyWGCNA_5xFAD=ov.bulk.readWGCNA('5xFAD_2k.p')
130
+
131
+
132
+ # In[14]:
133
+
134
+
135
+ pyWGCNA_5xFAD.mol.head()
136
+
137
+
138
+ # In[15]:
139
+
140
+
141
+ pyWGCNA_5xFAD.datExpr.var.head()
142
+
143
+
144
+ # ## Sub co-expression module
145
+ #
146
+ # Sometimes we are interested in a particular gene, or in the module containing a pathway, and we need to extract the corresponding sub-modules for analysis and plotting. For example, here we select two modules, 'gold' and 'lightgreen', as sub-modules for analysis.
147
+
148
+ # In[13]:
149
+
150
+
151
+ sub_mol=pyWGCNA_5xFAD.get_sub_module(['gold','lightgreen'],
152
+ mod_type='module_color')
153
+ sub_mol.head(),sub_mol.shape
154
+
155
+
156
+ # We found a total of 151 genes for 'gold' and 'lightgreen'. Next, we use the scale-free network constructed earlier, with `correlation_threshold` set to 0.2, to construct a gene correlation network graph for the 'lightgreen' module.
157
+
158
+ # In[17]:
159
+
160
+
161
+ G_sub=pyWGCNA_5xFAD.get_sub_network(mod_list=['lightgreen'],
162
+ mod_type='module_color',correlation_threshold=0.2)
163
+ G_sub
164
+
165
+
166
+ # In[18]:
167
+
168
+
169
+ len(G_sub.edges())
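+ # The effect of a threshold on the number of edges can be illustrated with a small sketch. It assumes that an edge is kept when the pairwise weight between two genes exceeds the threshold; how `get_sub_network` computes and filters the weights internally is not shown in this tutorial, and `threshold_edges` and the gene names are hypothetical.

```python
import numpy as np

def threshold_edges(weights, names, threshold):
    """List (gene_a, gene_b, weight) for pairs whose weight exceeds threshold."""
    w = np.asarray(weights, dtype=float)
    edges = []
    # Visit each unordered pair once (upper triangle, no self-loops)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if w[i, j] > threshold:
                edges.append((names[i], names[j], float(w[i, j])))
    return edges

# Toy pairwise weights for 3 genes (hypothetical values)
weights = np.array([[1.0, 0.6, 0.1],
                    [0.6, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])
edges = threshold_edges(weights, ["GeneA", "GeneB", "GeneC"], threshold=0.2)
print(len(edges))
```

+ # Raising the threshold removes the weaker pairs first, so a stricter value gives a sparser graph that is easier to read when plotted.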
170
+
171
+
172
+ # pyWGCNA provides a simple visualisation function, `plot_sub_network`, to visualise the gene network of interest.
173
+
174
+ # In[19]:
175
+
176
+
177
+ pyWGCNA_5xFAD.plot_sub_network(['gold','lightgreen'],pos_type='kamada_kawai',pos_scale=10,pos_dim=2,
178
+ figsize=(8,8),node_size=10,label_fontsize=8,correlation_threshold=0.2,
179
+ label_bbox={"ec": "white", "fc": "white", "alpha": 0.6})
180
+
181
+
182
+ # We can also merge the two previous steps by calling the `runWGCNA()` function.
183
+ #
184
+ # ## Updating sample information and assigning colors to them for downstream analysis
185
+
186
+ # In[3]:
187
+
188
+
189
+ pyWGCNA_5xFAD.updateSampleInfo(path='data/5xFAD_paper/sampleInfo.csv', sep=',')
190
+
191
+ # add color for metadata
192
+ pyWGCNA_5xFAD.setMetadataColor('Sex', {'Female': 'green',
193
+ 'Male': 'yellow'})
194
+ pyWGCNA_5xFAD.setMetadataColor('Genotype', {'5xFADWT': 'darkviolet',
195
+ '5xFADHEMI': 'deeppink'})
196
+ pyWGCNA_5xFAD.setMetadataColor('Age', {'4mon': 'thistle',
197
+ '8mon': 'plum',
198
+ '12mon': 'violet',
199
+ '18mon': 'purple'})
200
+ pyWGCNA_5xFAD.setMetadataColor('Tissue', {'Hippocampus': 'red',
201
+ 'Cortex': 'blue'})
202
+
203
+
204
+ # **note**: For downstream analysis, we set aside the gray module, which is the collection of genes that could not be assigned to any other module.
205
+ #
206
+ # ## Relating modules to external information and identifying important genes
207
+ # PyWGCNA gathers several important analyses that follow module identification in the `analyseWGCNA()` function, including:
208
+ #
209
+ # 1. Quantifying module–trait relationship
210
+ # 2. Gene relationship to trait and modules
211
+ #
212
+ # Keep in mind that any sample or gene information should be added before you start the analysis.
213
+ #
214
+ # To show the module relationship heatmap, PyWGCNA needs the user to choose and set colors from [Matplotlib colors](https://matplotlib.org/stable/gallery/color/named_colors.html) for the metadata by using the `setMetadataColor()` function.
215
+ #
216
+ # You can also select which data traits to show, and in which order, in the module eigengene heatmap.
217
+
218
+ # In[4]:
219
+
220
+
221
+ pyWGCNA_5xFAD.analyseWGCNA()
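+ # Quantifying a module-trait relationship usually comes down to correlating each module eigengene with a numerically encoded trait. The sketch below shows that calculation with plain NumPy, assuming a binary trait coded as 0/1; `module_trait_correlation` and the toy values are hypothetical, and p-values, which are normally reported alongside, are left out.

```python
import numpy as np

def module_trait_correlation(eigengenes, trait):
    """Pearson correlation of each eigengene (column) with one trait."""
    eig = np.asarray(eigengenes, dtype=float)
    t = np.asarray(trait, dtype=float)
    # Centre both the eigengenes and the trait
    eig_c = eig - eig.mean(axis=0)
    t_c = t - t.mean()
    # Covariance divided by the product of the standard deviations
    num = eig_c.T @ t_c
    den = np.sqrt((eig_c ** 2).sum(axis=0) * (t_c ** 2).sum())
    return num / den

# Toy data: 4 samples, trait coded as 0 and 1 (hypothetical values)
genotype = np.array([0, 0, 1, 1])
eigengenes = np.array([[-0.5, 0.5],
                       [-0.5, -0.5],
                       [0.5, 0.5],
                       [0.5, -0.5]])
corr = module_trait_correlation(eigengenes, genotype)
print(corr)
```

+ # The first toy module tracks the trait perfectly and the second is unrelated to it, which is the kind of contrast the module-trait heatmap is meant to reveal.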
222
+
223
+
224
+ # In[5]:
225
+
226
+
227
+ metadata = pyWGCNA_5xFAD.datExpr.obs.columns.tolist()
228
+
229
+
230
+ # In[10]:
231
+
232
+
233
+ pyWGCNA_5xFAD.plotModuleEigenGene('lightgreen', metadata, show=True)
234
+
235
+
236
+ # In[11]:
237
+
238
+
239
+ pyWGCNA_5xFAD.barplotModuleEigenGene('lightgreen', metadata, show=True)
240
+
241
+
242
+ # ## Finding hub genes for each module
243
+ #
244
+ # You can also retrieve the hub genes of each module, ranked by their connectivity, by using the `top_n_hub_genes()` function.
245
+ #
246
+ # It returns a dataframe sorted by connectivity, together with any additional gene information you have in your expression data.
247
+
248
+ # In[12]:
249
+
250
+
251
+ pyWGCNA_5xFAD.top_n_hub_genes(moduleName="lightgreen", n=10)
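+ # Ranking genes by connectivity can be sketched as summing each gene's adjacency to the other genes in the module and sorting the result. This is only an illustration of the idea; `top_hub_genes` and the toy adjacency are hypothetical, and the exact connectivity measure used by `top_n_hub_genes()` is not described in this tutorial.

```python
import numpy as np

def top_hub_genes(adj, names, n):
    """Return the n genes with the highest within-module connectivity."""
    a = np.array(adj, dtype=float)
    # A gene's connection to itself does not count
    np.fill_diagonal(a, 0.0)
    # Connectivity = sum of adjacencies to all other genes in the module
    connectivity = a.sum(axis=1)
    order = np.argsort(-connectivity, kind="stable")[:n]
    return [(names[i], float(connectivity[i])) for i in order]

# Toy adjacency for a 4-gene module (hypothetical values)
adj = np.array([[1.0, 0.9, 0.8, 0.1],
                [0.9, 1.0, 0.2, 0.1],
                [0.8, 0.2, 1.0, 0.1],
                [0.1, 0.1, 0.1, 1.0]])
hubs = top_hub_genes(adj, ["G1", "G2", "G3", "G4"], n=2)
print([name for name, _ in hubs])
```

+ # Genes at the top of such a ranking are the most strongly connected members of the module and are natural candidates for follow-up.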
252
+