	#!/usr/bin/env python
	# coding: utf-8

	# ## Using scMulan to annotate cell types in Heart, Lung, Liver, Bone marrow, Blood, Brain, and Thymus

	# In this study, the authors enrich the pre-training paradigm by integrating an abundance of metadata and a multiplicity of pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. They represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words of tuples, each consisting of entities and their corresponding values. They construct a unified generative framework to model the cell language on c-sentence and design three pretraining tasks to bridge the microscopic and macroscopic information within the c-sentences. They pre-train scMulan on 10 million single-cell transcriptomic data and their corresponding metadata, with 368 million parameters. As a single model, scMulan can accomplish tasks zero-shot for cell type annotation, batch integration, and conditional cell generation, guided by different task prompts.

	# #### We provide a liver dataset, a 20% subsample of the data from Suo C et al., 2022 (doi: 10.1126/science.abo0510)
	# Paper: [scMulan: a multitask generative pre-trained language model for single-cell analysis](https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1)
	# Data download: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
	# Pre-train model download: https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
	#
	# If you found this tutorial helpful, please cite scMulan and OmicVerse:
	# Bian H, Chen Y, Dong X, et al. scMulan: a multitask generative pre-trained language model for single-cell analysis[C]//International Conference on Research in Computational Molecular Biology. Cham: Springer Nature Switzerland, 2024: 479-482.

	# In[36]:


	import os
	#os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # if use CPU only
	import scanpy as sc
	import omicverse as ov
	ov.plot_set()
	#import scMulan
	#from scMulan import GeneSymbolUniform


	# ## 1. load h5ad
	# You can download the liver dataset from the following link: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1
	#
	# We recommend using an h5ad file that holds raw counts and has already passed your quality control (QC).
	#

	# In[4]:


	adata = sc.read('./data/liver_test.h5ad')


	# In[5]:


	adata


	# In[6]:


	from scipy.sparse import csc_matrix
	adata.X = csc_matrix(adata.X)


	# ## 2. transform original h5ad with uniformed genes (42117 genes)

	# This step maps the genes in the input adata onto the 42117 gene symbols used by the pre-trained scMulan model and keeps the corresponding gene expression values.

	# In[7]:


	adata_GS_uniformed = ov.externel.scMulan.GeneSymbolUniform(
	    input_adata=adata,
	    output_dir="./data",
	    output_prefix='liver',
	)
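
	# As a rough illustration of what aligning data to a fixed gene list involves, the sketch below reorders the columns of a small count table to follow a reference gene order and zero-fills genes that the input lacks. This is not scMulan's implementation: `align_to_reference` and the toy gene lists are hypothetical, and the real GeneSymbolUniform may handle details that this sketch ignores.

```python
def align_to_reference(rows, genes, reference_genes):
    """Reorder each row of counts to follow `reference_genes`.

    Hypothetical helper for illustration only. Genes missing from the input
    become zeros; input genes absent from the reference are dropped.
    """
    col_of = {gene: j for j, gene in enumerate(genes)}
    aligned = []
    for row in rows:
        aligned.append(
            [row[col_of[gene]] if gene in col_of else 0 for gene in reference_genes]
        )
    return aligned


# Toy data (assumed for illustration): 2 cells x 3 input genes.
toy_counts = [[5, 0, 2],
              [1, 3, 0]]
toy_genes = ["ALB", "CD3E", "NOT_IN_REF"]
toy_reference = ["CD3E", "ALB", "PTPRC"]

aligned_counts = align_to_reference(toy_counts, toy_genes, toy_reference)
print(aligned_counts)
```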


	# ## 3. process uniformed data (simply norm and log1p)

	# In[8]:


	## you can read the saved uniformed adata

	adata_GS_uniformed=sc.read_h5ad('./data/liver_uniformed.h5ad')


	# In[9]:


	adata_GS_uniformed


	# In[10]:


	# Normalize and log1p-transform the count matrix.
	# A maximum above 10 suggests the matrix still holds raw counts
	# (not yet normalized or log1p-transformed), so both steps are applied.
	if adata_GS_uniformed.X.max() > 10:
	    sc.pp.normalize_total(adata_GS_uniformed, target_sum=1e4)
	    sc.pp.log1p(adata_GS_uniformed)
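
	# One way to read the `max() > 10` check: after scaling every cell to 1e4 total counts and applying log1p, no value can exceed log1p(1e4), which is about 9.21, so a larger maximum suggests untransformed counts. The sketch below reproduces the two steps on plain Python lists. `normalize_and_log1p` and `looks_like_raw_counts` are hypothetical names, and the sketch does not replace the scanpy calls above.

```python
import math


def normalize_and_log1p(rows, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    Hypothetical helper; cells with zero total counts are left as zeros.
    """
    out = []
    for row in rows:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(value * scale) for value in row])
    return out


def looks_like_raw_counts(rows, threshold=10):
    """Heuristic from the tutorial: a maximum above `threshold` suggests raw counts."""
    return max(max(row) for row in rows) > threshold


# Toy data (assumed for illustration): 2 cells x 3 genes of raw counts.
toy = [[30, 10, 0],
       [0, 5, 5]]
toy_norm = normalize_and_log1p(toy) if looks_like_raw_counts(toy) else toy
```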


	# ## 4. load scMulan

	# In[11]:


	# First download the checkpoint from https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
	# and save it as ./ckpt/ckpt_scMulan.pt, for example with:
	# wget "https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1" -O ckpt/ckpt_scMulan.pt

	ckp_path = './ckpt/ckpt_scMulan.pt'
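
	# Because the checkpoint must be downloaded by hand, it can help to confirm that the file exists before building the model. The sketch below is a minimal check using only the standard library. `require_checkpoint` is a hypothetical helper and not part of scMulan or OmicVerse, and the demo runs against a temporary file rather than the real checkpoint.

```python
import tempfile
from pathlib import Path


def require_checkpoint(path):
    """Return `path` as a Path, or raise FileNotFoundError with a hint.

    Hypothetical helper for illustration only.
    """
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {ckpt}; download it first and place it there."
        )
    return ckpt


# Demo on a temporary, empty stand-in file (not the real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "ckpt_scMulan.pt"
    demo_path.write_bytes(b"")
    found_name = require_checkpoint(demo_path).name

    try:
        require_checkpoint(Path(tmp) / "missing.pt")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

	# In this script the check could be applied as `ckp_path = str(require_checkpoint(ckp_path))` before the model is loaded.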


	# In[12]:


	scml = ov.externel.scMulan.model_inference(ckp_path, adata_GS_uniformed)
	base_process = scml.cuda_count()


	# In[13]:


	scml.get_cell_types_and_embds_for_adata(parallel=True, n_process = 1)
	# scml.get_cell_types_and_embds_for_adata(parallel=False) # CPU-only alternative; much slower


	# The predicted cell types are stored in scml.adata.obs['cell_type_from_scMulan']. The cell embeddings, intended for multi-batch integration and not used in this tutorial, are stored in scml.adata.obsm['X_scMulan'].

	# ## 5. visualization
	#
	# Here, we visualize the cell types predicted by scMulan alongside the original cell types in the dataset.

	# In[14]:


	adata_mulan = scml.adata.copy()


	# In[15]:


	# compute a 2-D embedding of the adata using pyMDE
	ov.pp.scale(adata_mulan)
	ov.pp.pca(adata_mulan)

	#sc.pl.pca_variance_ratio(adata_mulan)
	ov.pp.mde(adata_mulan,embedding_dim=2,n_neighbors=15, basis='X_mde',
	          n_pcs=10, use_rep='scaled|original|X_pca',)


	# In[26]:


	# Here, we can see the cell type annotation from scMulan
	ov.pl.embedding(adata_mulan,basis='X_mde',
	                color=["cell_type_from_scMulan",],
	                ncols=1,frameon='small')


	# In[29]:


	adata_mulan.obsm['X_umap']=adata_mulan.obsm['X_mde']


	# In[30]:


	# optionally run the smoothing function to filter out false positives
	ov.externel.scMulan.cell_type_smoothing(adata_mulan, threshold=0.1)


	# In[31]:


	# cell_type_from_mulan_smoothing: pred+smoothing
	# cell_type: original annotations by the authors
	ov.pl.embedding(adata_mulan,basis='X_mde',
	                color=["cell_type_from_mulan_smoothing","cell_type"],
	                ncols=1,frameon='small')
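
	# The two plots can be complemented by counting how often each original annotation co-occurs with each predicted label. The sketch below does this on plain lists. With AnnData, the inputs would be the `cell_type` and `cell_type_from_mulan_smoothing` columns of `adata_mulan.obs`. `label_cooccurrence` and the toy labels are hypothetical. A co-occurrence table is used instead of an accuracy score because the two label vocabularies may not match exactly.

```python
from collections import Counter


def label_cooccurrence(predicted, original):
    """Count (original, predicted) label pairs across cells.

    Hypothetical helper for illustration only.
    """
    if len(predicted) != len(original):
        raise ValueError("label lists must have the same length")
    return Counter(zip(original, predicted))


# Toy labels (assumed for illustration), one entry per cell.
toy_original = ["Hepatocyte", "Hepatocyte", "NK", "NK", "NK"]
toy_predicted = ["Hepatocyte", "Hepatocyte", "NK cell", "NK cell", "T cell"]

pairs = label_cooccurrence(toy_predicted, toy_original)
for (orig_label, pred_label), n in pairs.most_common():
    print(f"{orig_label:>12} -> {pred_label:<12} {n}")
```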


	# In[32]:


	adata_mulan


	# In[33]:


	top_celltypes = adata_mulan.obs.cell_type_from_scMulan.value_counts().index[:20]


	# In[34]:


	# you can select some cell types of interest (from scMulan's prediction) for visualization
	# selected_cell_types = ["NK cell", "Kupffer cell", "Conventional dendritic cell 2"] # as example
	selected_cell_types = top_celltypes
	ov.externel.scMulan.visualize_selected_cell_types(adata_mulan,selected_cell_types,smoothing=True)


	# In[ ]: