#!/usr/bin/env python
# coding: utf-8

# ## Using scMulan to annotate cell types in Heart, Lung, Liver, Bone marrow, Blood, Brain, and Thymus

# In the scMulan study, the authors enrich the pre-training paradigm by integrating abundant metadata and multiple pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. They represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words of tuples, each consisting of an entity and its corresponding value. They construct a unified generative framework to model the cell language on c-sentences and design three pre-training tasks to bridge the microscopic and macroscopic information within the c-sentences. scMulan has 368 million parameters and is pre-trained on 10 million single-cell transcriptomes and their corresponding metadata. As a single model, scMulan can perform cell type annotation, batch integration, and conditional cell generation zero-shot, guided by different task prompts.

# #### We provide a liver dataset subsampled to 20% of the cells from Suo C, 2022 (doi/10.1126/science.abo0510)
# **Paper:** [scMulan: a multitask generative pre-trained language model for single-cell analysis](https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1)    
# **Data download:** https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1   
# **Pre-train model download:** https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
# 
# If you found this tutorial helpful, please cite scMulan and OmicVerse:         
# Bian H, Chen Y, Dong X, et al. scMulan: a multitask generative pre-trained language model for single-cell analysis[C]//International Conference on Research in Computational Molecular Biology. Cham: Springer Nature Switzerland, 2024: 479-482.

# In[36]:


import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # uncomment to run on CPU only
import scanpy as sc
import omicverse as ov
ov.plot_set()
#import scMulan
#from scMulan import GeneSymbolUniform


# ## 1. load h5ad
# You can download the liver dataset from the following link: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1 
# 
# We recommend using an h5ad file that contains raw counts and has already passed your own quality control (QC).
# 

# In[4]:


adata = sc.read('./data/liver_test.h5ad')


# In[5]:


adata


# In[6]:


from scipy.sparse import csc_matrix
adata.X = csc_matrix(adata.X)
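

# Because the input is expected to hold raw counts, a quick sanity check can be useful before moving on.
# The helper below is a minimal sketch and not part of scMulan or OmicVerse: `looks_like_raw_counts`
# is a hypothetical name, and it only checks that every stored value is a non-negative integer.

# In[ ]:


import numpy as np
from scipy import sparse


def looks_like_raw_counts(X):
    """Return True if every stored value in X is a non-negative integer (hypothetical helper)."""
    # Sparse matrices keep their explicitly stored values in `.data`; dense input is flattened.
    values = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Example usage (assumes `adata` from the cells above):
# print(looks_like_raw_counts(adata.X))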


# ## 2. transform the original h5ad to the uniform gene set (42117 genes)

# This step maps the genes in the input adata onto the 42117 gene symbols used by the pre-trained scMulan model and preserves the corresponding gene expression values.

# In[7]:


adata_GS_uniformed = ov.externel.scMulan.GeneSymbolUniform(input_adata=adata,
                                 output_dir="./data",
                                 output_prefix='liver')
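

# To illustrate the idea of mapping a matrix onto a fixed gene list, the sketch below reorders columns
# to follow a reference list and fills genes that are absent from the input with zeros.
# It is an assumption-based illustration on a dense NumPy array, not the implementation of GeneSymbolUniform;
# `align_to_reference_genes` is a hypothetical name, and any symbol harmonisation the library performs
# is not reproduced here.

# In[ ]:


import numpy as np


def align_to_reference_genes(X, genes, reference_genes):
    """Reorder the columns of a dense matrix X so that they follow reference_genes.

    Genes missing from the input become all-zero columns, and input genes that are
    not in the reference are dropped. If a symbol occurs twice in `genes`, the last
    column with that symbol is used.
    """
    X = np.asarray(X)
    column_of = {gene: j for j, gene in enumerate(genes)}
    aligned = np.zeros((X.shape[0], len(reference_genes)), dtype=X.dtype)
    for new_j, gene in enumerate(reference_genes):
        old_j = column_of.get(gene)
        if old_j is not None:
            aligned[:, new_j] = X[:, old_j]
    return aligned


# Toy example with made-up gene names: "GENE_X" is absent from the input, "GENE_B" is dropped.
# align_to_reference_genes([[1, 2, 3]], ["GENE_A", "GENE_B", "GENE_C"], ["GENE_C", "GENE_X", "GENE_A"])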


# ## 3. process uniformed data (simply norm and log1p)

# In[8]:


## you can read the saved uniformed adata

adata_GS_uniformed=sc.read_h5ad('./data/liver_uniformed.h5ad')


# In[9]:


adata_GS_uniformed


# In[10]:


# Normalize and log1p-transform the count matrix.
# If the matrix still holds raw counts (largest value above 10), it has not been
# normalized or log1p-transformed yet, so we do both here.
if adata_GS_uniformed.X.max() > 10:
    sc.pp.normalize_total(adata_GS_uniformed, target_sum=1e4) 
    sc.pp.log1p(adata_GS_uniformed)
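

# The two calls above scale each cell to a total of 10,000 counts and then apply log(1 + x).
# After that transformation no value can exceed log1p(1e4), which is about 9.21, so a maximum above 10
# indicates that the matrix has not been transformed yet. Note that the same check would skip a raw
# matrix whose largest count happens to be 10 or less.
# The sketch below reproduces the arithmetic on a dense NumPy array for illustration only;
# `normalize_log1p_dense` is a hypothetical name and is not used by the rest of this tutorial.

# In[ ]:


import numpy as np


def normalize_log1p_dense(counts, target_sum=1e4):
    """Scale each row (cell) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero; all-zero cells stay all-zero
    return np.log1p(counts / totals * target_sum)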


# ## 4. load scMulan

# In[11]:


# First download the checkpoint from https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
# and save it as ./ckpt/ckpt_scMulan.pt, for example with:
# wget "https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1" -O ckpt/ckpt_scMulan.pt

ckp_path = './ckpt/ckpt_scMulan.pt'
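

# Since the checkpoint has to be downloaded by hand, it can help to fail early with a clear message
# when the file is missing. The helper below is a small standard-library sketch; `require_file`
# is a hypothetical name and not part of scMulan or OmicVerse.

# In[ ]:


import os


def require_file(path):
    """Return path if it points to an existing file, otherwise raise FileNotFoundError."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected a file at {path!r}; please download it first.")
    return path


# Example usage (assumes `ckp_path` from the cell above):
# require_file(ckp_path)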


# In[12]:


scml = ov.externel.scMulan.model_inference(ckp_path, adata_GS_uniformed)
base_process = scml.cuda_count()


# In[13]:


scml.get_cell_types_and_embds_for_adata(parallel=True, n_process = 1)
# scml.get_cell_types_and_embds_for_adata(parallel=False) # use this when running on CPU only; note that it is very slow.


# The predicted cell types are stored in scml.adata.obs['cell_type_from_scMulan']. The cell embeddings (for multibatch integration) are stored in scml.adata.obsm['X_scMulan'] and are not used in this tutorial.

# ## 5. visualization
# 
# Here, we visualize the cell types predicted by scMulan alongside the original cell types annotated in the dataset.

# In[14]:


adata_mulan = scml.adata.copy()


# In[15]:


# compute a 2-D embedding of the adata using pyMDE
ov.pp.scale(adata_mulan)
ov.pp.pca(adata_mulan)

#sc.pl.pca_variance_ratio(adata_mulan)
ov.pp.mde(adata_mulan,embedding_dim=2,n_neighbors=15, basis='X_mde',
          n_pcs=10, use_rep='scaled|original|X_pca',)


# In[26]:


# Here, we can see the cell type annotation from scMulan
ov.pl.embedding(adata_mulan,basis='X_mde',
                color=["cell_type_from_scMulan",],
                ncols=1,frameon='small')


# In[29]:


# expose the MDE coordinates under the 'X_umap' key as well
adata_mulan.obsm['X_umap']=adata_mulan.obsm['X_mde']


# In[30]:


# you can run the smoothing function to filter out false positives
ov.externel.scMulan.cell_type_smoothing(adata_mulan, threshold=0.1)
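

# Label smoothing is often done by a majority vote over each cell's nearest neighbours in an embedding.
# The sketch below shows that generic idea on plain NumPy arrays. It is an illustration under that
# assumption only: it is not the implementation behind cell_type_smoothing, it does not model the
# `threshold` argument, and `knn_majority_vote` is a hypothetical name.

# In[ ]:


import numpy as np


def knn_majority_vote(coords, labels, k=5):
    """Replace each label by the most common label among the k nearest points.

    Each point counts as one of its own neighbours because its distance to itself is 0.
    Uses brute-force pairwise distances, so it is only suitable for small demonstrations.
    """
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)  # squared Euclidean distances, shape (n, n)
    nearest = np.argsort(dist2, axis=1, kind="stable")[:, :k]
    smoothed = []
    for idx in nearest:
        values, counts = np.unique(labels[idx], return_counts=True)
        smoothed.append(values[np.argmax(counts)])  # ties resolve to the first label in sorted order
    return np.array(smoothed)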


# In[31]:


# cell_type_from_mulan_smoothing: pred+smoothing
# cell_type: original annotations by the authors
ov.pl.embedding(adata_mulan,basis='X_mde',
                color=["cell_type_from_mulan_smoothing","cell_type"],
                ncols=1,frameon='small')
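

# Besides comparing the two plots by eye, the predicted and original labels can be cross-tabulated.
# The two annotations may use different vocabularies, so a table is usually more informative than a
# single accuracy number. The helper below is a minimal pandas sketch; `label_agreement_table` is a
# hypothetical name and not part of scMulan or OmicVerse.

# In[ ]:


import pandas as pd


def label_agreement_table(predicted, reference):
    """Cross-tabulate reference labels (rows) against predicted labels (columns).

    Each row is normalised to sum to 1, so an entry is the fraction of cells with that
    reference label that received that predicted label.
    """
    predicted = pd.Series(list(predicted), name="predicted")
    reference = pd.Series(list(reference), name="reference")
    return pd.crosstab(reference, predicted, normalize="index")


# Example usage (assumes `adata_mulan` with the two columns plotted above):
# label_agreement_table(adata_mulan.obs["cell_type_from_mulan_smoothing"], adata_mulan.obs["cell_type"])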


# In[32]:


adata_mulan


# In[33]:


# the 20 most frequent cell types predicted by scMulan
top_celltypes = adata_mulan.obs.cell_type_from_scMulan.value_counts().index[:20]


# In[34]:


# you can select some cell types of interest (from scMulan's predictions) for visualization
# selected_cell_types = ["NK cell", "Kupffer cell", "Conventional dendritic cell 2"] # as example
selected_cell_types = top_celltypes
ov.externel.scMulan.visualize_selected_cell_types(adata_mulan,selected_cell_types,smoothing=True)


# In[ ]: