#!/usr/bin/env python
# coding: utf-8

# ## Using scMulan to annotate cell types in Heart, Lung, Liver, Bone marrow, Blood, Brain, and Thymus

# In this study, the authors enrich the pre-training paradigm by integrating abundant metadata and multiple pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. They represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words of tuples, each consisting of an entity and its corresponding value. They construct a unified generative framework to model the cell language on c-sentences and design three pre-training tasks to bridge the microscopic and macroscopic information within the c-sentences. scMulan has 368 million parameters and is pre-trained on 10 million single-cell transcriptomes and their corresponding metadata. As a single model, scMulan can perform zero-shot cell type annotation, batch integration, and conditional cell generation, guided by different task prompts.
#
# #### We provide a liver dataset subsampled to 20% of the cells from Suo C, 2022 (doi: 10.1126/science.abo0510)
#
# **Paper:** [scMulan: a multitask generative pre-trained language model for single-cell analysis](https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1)
#
# **Data download:** https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
#
# **Pre-train model download:** https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
#
# If you found this tutorial helpful, please cite scMulan and OmicVerse:
#
# Bian H, Chen Y, Dong X, et al. scMulan: a multitask generative pre-trained language model for single-cell analysis[C]//International Conference on Research in Computational Molecular Biology. Cham: Springer Nature Switzerland, 2024: 479-482.

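# In[ ]:


# The c-sentence idea described above can be illustrated with a small toy example.
# This is only a hypothetical sketch: the function name, the word order, and the
# ("task", ...) entity are assumptions made for illustration, and this is NOT
# scMulan's actual tokenizer or vocabulary. It only shows a cell written as a
# list of (entity, value) tuples covering expressed genes, metadata, and a task.
from typing import Dict, List, Tuple, Union

Word = Tuple[str, Union[float, str]]


def build_c_sentence_sketch(expression: Dict[str, float],
                            metadata: Dict[str, str],
                            task: str) -> List[Word]:
    """Return a toy c-sentence: one (entity, value) word per item."""
    # Keep only expressed genes (value > 0) as (gene symbol, expression) words.
    words: List[Word] = [(gene, value) for gene, value in expression.items() if value > 0]
    # Metadata terms become (field, term) words.
    words += [(field, term) for field, term in metadata.items()]
    # The task prompt is appended as a final word (placement is an assumption).
    words.append(("task", task))
    return words


example_sentence = build_c_sentence_sketch(
    expression={"ALB": 3.2, "CD3E": 0.0, "APOA1": 1.7},
    metadata={"organ": "Liver"},
    task="cell type annotation",
)
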
# In[36]:


import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # uncomment to run on CPU only
import scanpy as sc
import omicverse as ov
ov.plot_set()
#import scMulan
#from scMulan import GeneSymbolUniform


# ## 1. load h5ad
# You can download the liver dataset from the following link: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
#
# We recommend using an h5ad file that contains raw counts (after your own QC).
#

# In[4]:


adata = sc.read('./data/liver_test.h5ad')


# In[5]:


adata


# In[6]:


# store the expression matrix as a sparse CSC matrix
from scipy.sparse import csc_matrix
adata.X = csc_matrix(adata.X)


# ## 2. transform original h5ad with uniformed genes (42117 genes)

# This step maps the genes in the input adata to 42117 gene symbols and preserves the corresponding gene expression values. The gene symbols are the same as those used by the pre-trained scMulan model.

# In[7]:


adata_GS_uniformed = ov.externel.scMulan.GeneSymbolUniform(input_adata=adata,
                                 output_dir="./data",
                                 output_prefix='liver')


# ## 3. process uniformed data (simply norm and log1p)

# In[8]:


## you can read the saved uniformed adata
adata_GS_uniformed=sc.read_h5ad('./data/liver_uniformed.h5ad')


# In[9]:


adata_GS_uniformed


# In[10]:


# norm and log1p count matrix
# in some cases the count matrix is still raw: it is not normalized and log1p has not been applied.
# a maximum value above 10 is used here as a sign of raw counts, so we normalize and log-transform
if adata_GS_uniformed.X.max() > 10:
    sc.pp.normalize_total(adata_GS_uniformed, target_sum=1e4)
    sc.pp.log1p(adata_GS_uniformed)


# ## 4. load scMulan

# In[11]:


# you should first download ckpt from https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
# put it under ./ckpt/ckpt_scMulan.pt
# by: wget https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1 -O ckpt/ckpt_scMulan.pt

ckp_path = './ckpt/ckpt_scMulan.pt'


# In[12]:


scml = ov.externel.scMulan.model_inference(ckp_path, adata_GS_uniformed)
base_process = scml.cuda_count()


# In[13]:


scml.get_cell_types_and_embds_for_adata(parallel=True, n_process = 1)
# scml.get_cell_types_and_embds_for_adata(parallel=False) # CPU only, but it is really slow.


# The predicted cell types are stored in scml.adata.obs['cell_type_from_scMulan']. The cell embeddings (for multibatch integration) are stored in scml.adata.obsm['X_scMulan'] (not used in this tutorial).

# ## 5. visualization
#
# Here, we visualize the cell types predicted by scMulan, together with the original cell types in the dataset.

# In[14]:


adata_mulan = scml.adata.copy()


# In[15]:


# calculate the 2-D embedding of the adata using pyMDE
ov.pp.scale(adata_mulan)
ov.pp.pca(adata_mulan)

#sc.pl.pca_variance_ratio(adata_mulan)
ov.pp.mde(adata_mulan,embedding_dim=2,n_neighbors=15, basis='X_mde',
          n_pcs=10, use_rep='scaled|original|X_pca',)


# In[26]:


# Here, we can see the cell type annotation from scMulan
ov.pl.embedding(adata_mulan,basis='X_mde',
                color=["cell_type_from_scMulan",],
                ncols=1,frameon='small')


# In[29]:


adata_mulan.obsm['X_umap']=adata_mulan.obsm['X_mde']


# In[30]:


# you can run the smoothing function to filter out false positives
ov.externel.scMulan.cell_type_smoothing(adata_mulan, threshold=0.1)


# In[31]:


# cell_type_from_mulan_smoothing: pred+smoothing
# cell_type: original annotations by the authors
ov.pl.embedding(adata_mulan,basis='X_mde',
                color=["cell_type_from_mulan_smoothing","cell_type"],
                ncols=1,frameon='small')


# In[32]:


adata_mulan


# In[33]:


# the 20 most frequent cell types predicted by scMulan
top_celltypes = adata_mulan.obs.cell_type_from_scMulan.value_counts().index[:20]


# In[34]:


# you can select some cell types of interest (from scMulan's prediction) for visualization
# selected_cell_types = ["NK cell", "Kupffer cell", "Conventional dendritic cell 2"] # as example
selected_cell_types = top_celltypes
ov.externel.scMulan.visualize_selected_cell_types(adata_mulan,selected_cell_types,smoothing=True)


# In[ ]:

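
# The plots above compare the smoothed scMulan predictions with the authors'
# annotations visually. As a complement, the sketch below tabulates, for each
# original label, the fraction of cells assigned to each predicted label.
# The helper name `label_agreement` and the toy labels are hypothetical and are
# not part of scMulan or OmicVerse. Because the two annotations may use different
# naming schemes, a per-label table is used here instead of an exact-match accuracy.
from collections import Counter, defaultdict


def label_agreement(reference, predicted):
    """For each reference label, return the fraction of cells per predicted label."""
    counts = defaultdict(Counter)
    for ref_label, pred_label in zip(reference, predicted):
        counts[ref_label][pred_label] += 1
    return {
        ref_label: {pred: n / sum(pred_counts.values()) for pred, n in pred_counts.items()}
        for ref_label, pred_counts in counts.items()
    }


# Toy labels standing in for the two obs columns.
toy_reference = ["Hepatocyte", "Hepatocyte", "NK", "NK"]
toy_predicted = ["Hepatocyte", "Hepatocyte", "NK cell", "Hepatocyte"]
toy_agreement = label_agreement(toy_reference, toy_predicted)

# On the tutorial data, the same helper could be called as:
# label_agreement(adata_mulan.obs["cell_type"],
#                 adata_mulan.obs["cell_type_from_mulan_smoothing"])
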