Preprocessing | H3ABioNet ML Glossary

Edit me

Preprocessing example 2: Batch effects

Biological processes are not the only source of variability in datasets. Technical biases are also a source of variability in a sample. One of the challenges of data integration of large-scale transcriptomic datasets generated using different technologies is batch-specific systematic variations that present a challenge to batch-effect removal. ComBat is one of the most widely used tool for correcting those technical biases, called batch effects, in microarray expression data.

1. Microarray

Microarray experiments are very sensitive to experimental conditions: Equipment, agents,technicians,etc. Data generated from different “batches” (lab,time,etc.) can be quite different, but data from the same batch tend to be more similar. So batch effects are structured noise/bias common to all replicates in the same batch, but markedly different from batch to batch. ComBat. pyComBat (Behdenna et al, 2020) is a new Python implementation of ComBat (Johnson et al, 2007), a software which is one of the most widely used tool for correcting those technical biases called batch effects.

1.1 Methods to remove batch effects

-Based on linear model: batches cause location/scale changes. -Based on dimension reduction technique: SVD, PCA, factor analysis, etc.

2. RNAseq

Batch effects in RNAseq are also due to technical differences between your samples, such as the type of sequencing machine or even the technician that ran the sample. Removing this variability means changing the data for individual samples. The package BatchQC shows us if we have a batch effect. This R module provides interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts our data. ComBat-seq is a batch effect adjustment tool for bulk RNA-seq count data. removeBatchEffect() in Limma and batch in Deseq2 is essentially treated as a covariate in the regression model.

3. scRNAseq

Effective batch-effect removal is also essential in scRNAseq. Batch effects can be highly nonlinear, making it difficult to correctly align different datasets while preserving key biological variations. The tools already developed for microarray data batch correction such as ComBat and limma could be employed on single-cell RNA-seq (scRNA-seq) data to address these challenges. SCANPY is one of the scalable toolkit used for analyzing single-cell gene expression data. It’s a Python-based implementation with advanced machine-learning packages, such as TENSORFLOW

3.1 Example

The following data has been used in the scGen paper Lotfollahi19, has been used here. It contains data for human pancreas from 4 different studies (Segerstolpe16, Baron16, Wang16, Muraro16), which have been used in the seminal papers on single-cell dataset integration (Butler18, Haghverdi18) and many times ever since.

import scgen
import scanpy as sc

Reading the train data

# note that this collection of batches is already intersected on the genes
adata_all = sc.read('data/pancreas.h5ad', backup_url='https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1')
adata_all.shape

(14693, 2448)

counts = adata_all.obs.celltype.value_counts()
counts

alpha                     4214
beta                      3354
ductal                    1804
acinar                    1368
not applicable            1154
delta                      917
gamma                      571
endothelial                289
activated_stellate         284
dropped                    178
quiescent_stellate         173
mesenchymal                 80
macrophage                  55
PSC                         54
unclassified endocrine      41
co-expression               39
mast                        32
epsilon                     28
mesenchyme                  27
schwann                     13
t_cell                       7
MHC class II                 5
unclear                      4
unclassified                 2
Name: celltype, dtype: int64

To simplify visualization, it’s better to remove the 5 minority classes.

minority_classes = counts.index[-5:].tolist()        # get the minority classes
adata_all = adata_all[                               # actually subset
    ~adata_all.obs.celltype.isin(minority_classes)]
adata_all.obs.celltype.cat.reorder_categories(       # reorder according to abundance
    counts.index[:-5].tolist(), inplace=True)

Seeing the batch effect

sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)

sc.pl.umap(adata_all, color=['batch', 'celltype'], palette=sc.pl.palettes.vega_20_scanpy)

There’s a batch effect. It can be well-resolved using BBKNN [Polanski19].

%%time
sc.external.pp.bbknn(adata_all, batch_key='batch')

CPU times: user 2.31 s, sys: 79.4 ms, total: 2.39 s
Wall time: 2.23 s

sc.tl.umap(adata_all)

sc.pl.umap(adata_all, color=['batch', 'celltype'])

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Tags: