Spaces:

KeTuTu
/

OV_Agentic_EXP_SambaNova

Sleeping

App Files Files Community

OV_Agentic_EXP_SambaNova / ovrawm /t_bulk_combat.txt

KeTuTu

Upload 46 files

2999286 verified 29 days ago

raw

history blame

4.3 kB

	#!/usr/bin/env python
	# coding: utf-8

	# # Batch correction in Bulk RNA-seq or microarray data
	#
	# Variability in datasets are not only the product of biological processes: they are also the product of technical biases (Lander et al, 1999). ComBat is one of the most widely used tool for correcting those technical biases called batch effects.
	#
	# pyComBat (Behdenna et al, 2020) is a new Python implementation of ComBat (Johnson et al, 2007), a software widely used for the adjustment of batch effects in microarray data. While the mathematical framework is strictly the same, pyComBat:
	#
	# - has similar results in terms of batch effects correction;
	# - is as fast or faster than the R implementation of ComBat and;
	# - offers new tools for the community to participate in its development.
	#
	# Paper: [pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods](https://doi.org/10.1101/2020.03.17.995431)
	#
	# Code: https://github.com/epigenelabs/pyComBat
	#
	# Colab_Reproducibility：https://colab.research.google.com/drive/121bbIiI3j4pTZ3yA_5p8BRkRyGMMmNAq?usp=sharing

	# In[7]:


	import anndata
	import pandas as pd
	import omicverse as ov
	ov.ov_plot_set()


	# ## Loading dataset
	#
	# This minimal usage example illustrates how to use pyComBat in a default setting, and shows some results on ovarian cancer data, freely available on NCBI’s [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/), namely:
	#
	# - GSE18520
	# - GSE66957
	# - GSE69428
	#
	# The corresponding expression files are available on [GitHub](https://github.com/epigenelabs/pyComBat/tree/master/data).

	# In[15]:


	dataset_1 = pd.read_pickle("data/combat/GSE18520.pickle")
	adata1=anndata.AnnData(dataset_1.T)
	adata1.obs['batch']='1'
	adata1


	# In[16]:


	dataset_2 = pd.read_pickle("data/combat/GSE66957.pickle")
	adata2=anndata.AnnData(dataset_2.T)
	adata2.obs['batch']='2'
	adata2


	# In[17]:


	dataset_3 = pd.read_pickle("data/combat/GSE69428.pickle")
	adata3=anndata.AnnData(dataset_3.T)
	adata3.obs['batch']='3'
	adata3


	# We use the concat function to join the three datasets together and take the intersection for the same genes

	# In[18]:


	adata=anndata.concat([adata1,adata2,adata3],merge='same')
	adata


	# ## Removing batch effect

	# In[31]:


	ov.bulk.batch_correction(adata,batch_key='batch')


	# ## Saving results
	#
	# Raw datasets

	# In[70]:


	raw_data=adata.to_df().T
	raw_data.head()


	# Removing Batch datasets

	# In[71]:


	removing_data=adata.to_df(layer='batch_correction').T
	removing_data.head()


	# save

	# In[ ]:


	raw_data.to_csv('raw_data.csv')
	removing_data.to_csv('removing_data.csv')


	# You can also save adata object

	# In[ ]:


	adata.write_h5ad('adata_batch.h5ad',compression='gzip')
	#adata=ov.read('adata_batch.h5ad')


	# ## Compare the dataset before and after correction
	#
	# We specify three different colours for three different datasets

	# In[51]:


	color_dict={
	'1':ov.utils.red_color[1],
	'2':ov.utils.blue_color[1],
	'3':ov.utils.green_color[1],
	}


	# In[57]:


	fig,ax=plt.subplots( figsize = (20,4))
	bp=plt.boxplot(adata.to_df().T,patch_artist=True)
	for i,batch in zip(range(adata.shape[0]),adata.obs['batch']):
	bp['boxes'][i].set_facecolor(color_dict[batch])
	ax.axis(False)
	plt.show()


	# In[58]:


	fig,ax=plt.subplots( figsize = (20,4))
	bp=plt.boxplot(adata.to_df(layer='batch_correction').T,patch_artist=True)
	for i,batch in zip(range(adata.shape[0]),adata.obs['batch']):
	bp['boxes'][i].set_facecolor(color_dict[batch])
	ax.axis(False)
	plt.show()


	# In addition to using boxplots to observe the effect of batch removal, we can also use PCA to observe the effect of batch removal

	# In[59]:


	adata.layers['raw']=adata.X.copy()


	# We first calculate the PCA on the raw dataset

	# In[60]:


	ov.pp.pca(adata,layer='raw',n_pcs=50)
	adata


	# We then calculate the PCA on the batch_correction dataset

	# In[61]:


	ov.pp.pca(adata,layer='batch_correction',n_pcs=50)
	adata


	# In[62]:


	ov.utils.embedding(adata,
	basis='raw\|original\|X_pca',
	color='batch',
	frameon='small')


	# In[63]:


	ov.utils.embedding(adata,
	basis='batch_correction\|original\|X_pca',
	color='batch',
	frameon='small')


	# In[ ]: