

Split, Filter, Normalize and Integrate Sequencing Data

A Snakemake 8 workflow to split, filter, normalize, and integrate count matrices from experiments with sequencing readout (e.g., RNA-seq, ATAC-seq, ChIP-seq, Methyl-seq, miRNA-seq, …) and to select highly variable features, including confounding factor analyses and diagnostic visualizations that document each data transformation. This often represents the first analysis after signal processing and critically influences all downstream analyses.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project’s repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our focus for future work!

[!IMPORTANT]
If you use this workflow in a publication, please don’t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.8144219.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn’t be possible without the following software and their dependencies.

| Software | Reference (DOI) |
| :--- | :--- |
| ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
| CQN | https://doi.org/10.1093/biostatistics/kxr054 |
| edgeR | https://doi.org/10.1093/bioinformatics/btp616 |
| fastcluster | https://doi.org/10.18637/jss.v053.i09 |
| ggplot2 | https://ggplot2.tidyverse.org/ |
| limma | https://doi.org/10.1093/nar/gkv007 |
| matplotlib | https://doi.org/10.1109/MCSE.2007.55 |
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| patchwork | https://CRAN.R-project.org/package=patchwork |
| reComBat | https://doi.org/10.1093/bioadv/vbac071 |
| reshape2 | https://doi.org/10.18637/jss.v021.i12 |
| scikit-learn | http://jmlr.org/papers/v12/pedregosa11a.html |
| seaborn | https://doi.org/10.21105/joss.03021 |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
| statsmodels | https://www.statsmodels.org/stable/index.html#citation |
| TMM | https://doi.org/10.1186/gb-2010-11-3-r25 |

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) must be read from the respective conda environment specifications (workflow/envs/*.yaml) or, after execution, from the result directory (spilterlize_integrate/envs/*.yaml). Parameters that have to be adapted to the data or workflow configuration are denoted in square brackets, e.g., [X].

Split. The input data was split by [split_by], with each split denoted by [split_by]_{annotation_level}. The complete data was retained in the “all” split. Sample filtering was achieved by removing sample rows from the annotation file or using NA in the respective annotation column. Annotations were also split and provided separately. The data was loaded, split, and saved using the Python library pandas (ver)[ref].
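The split itself is a plain group-by on one annotation column; the workflow performs it with pandas. The base-R sketch below only illustrates that logic, using hypothetical file and column names:

```r
# Illustrative sketch only; the workflow performs this step with pandas in Python.
# Hypothetical inputs: counts (features x samples) and a sample annotation table.
counts <- read.csv("counts.csv", row.names = 1, check.names = FALSE)
annot  <- read.csv("annotation.csv", row.names = 1)

split_by <- "condition"                                    # corresponds to [split_by]
annot <- annot[!is.na(annot[[split_by]]), , drop = FALSE]  # NA in the split column drops the sample

for (level in unique(annot[[split_by]])) {
  samples <- rownames(annot)[annot[[split_by]] == level]
  write.csv(counts[, samples, drop = FALSE], paste0(split_by, "_", level, "_counts.csv"))
  write.csv(annot[samples, , drop = FALSE], paste0(split_by, "_", level, "_annotation.csv"))
}
# The complete data set is additionally retained as the "all" split.
```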

All downstream analyses were performed for each split separately.

Filter. The features were filtered using the filterByExpr function from the R package edgeR(ver)[ref]. The function was configured with the following parameters: group set to [group], min.count to [min.count], min.total.count to [min.total.count], large.n to [large.n], and min.prop to [min.prop]. The number of features was reduced from [X] to [X] by filtering.
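As a minimal sketch, the filtering call with the configurable parameters listed above might look as follows (the values shown are edgeR's defaults, used here only as placeholders):

```r
library(edgeR)

# counts: count matrix of the current split (features x samples); group corresponds to [group]
dge <- DGEList(counts = counts, group = group)

keep <- filterByExpr(
  dge,
  group           = group,  # [group]
  min.count       = 10,     # [min.count]
  min.total.count = 15,     # [min.total.count]
  large.n         = 10,     # [large.n]
  min.prop        = 0.7     # [min.prop]
)
dge <- dge[keep, , keep.lib.sizes = FALSE]
```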

Normalize. Normalization of the data was performed to correct for technical biases.

The calcNormFactors function from the R package edgeR(ver)[ref] was used to normalize the data using the [edgeR_parameters.method] method with subsequent [edgeR_parameters.quantification] quantification. The parameters used for this normalization included [edgeR_parameters].
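A minimal sketch of this step, assuming TMM as the method and log-CPM as the quantification (both are configurable via [edgeR_parameters]):

```r
library(edgeR)

dge <- calcNormFactors(dge, method = "TMM")        # [edgeR_parameters.method]
norm_cpm <- cpm(dge, log = TRUE, prior.count = 2)  # [edgeR_parameters.quantification], log2 scale
```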

Conditional Quantile Normalization (CQN) was performed using the R package cqn(ver)[ref]. The parameters used for this normalization included [cqn_parameters].
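CQN corrects for feature-level covariates such as GC content and length; a minimal sketch, assuming such covariates are available (object names are placeholders):

```r
library(cqn)

# counts: features x samples; gc and feature_length: per-feature covariates (placeholders)
cqn_fit <- cqn(counts, x = gc, lengths = feature_length, sizeFactors = colSums(counts))

# CQN-normalized log2 expression values
norm_cqn <- cqn_fit$y + cqn_fit$offset
```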

The voom method from the R package limma(ver)[ref] was used to estimate the mean-variance relationship of the log-counts and generate a precision weight for each observation. The parameters used for this normalization included [voom_parameters].
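A minimal sketch of the voom call, assuming a design matrix derived from the sample annotation (placeholder):

```r
library(limma)

design <- model.matrix(~ group)                # placeholder design
v <- voom(dge, design = design, plot = FALSE)  # [voom_parameters]

# v$E holds log2-CPM values, v$weights the per-observation precision weights
```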

The normalization results were log2-transformed for downstream analyses.

Integrate. The data integration was performed using the reComBat method(ver)[ref] and applied to the log-normalized data. This method adjusts for batch effects and unwanted sources of variation while retaining desired (e.g., biological) variability. The following effects were modeled within the integration: batch [batch_column], desired variation [desired_categorical] and [desired_numerical], unwanted variation [unwanted_categorical] and [unwanted_numerical]. The parameters used for the integration included [reComBat_parameters].
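reComBat itself is a Python package (see the software table above). Purely as an R-side illustration of the same idea, adjusting for batch and unwanted covariates while protecting desired variation via the design matrix, one could use limma::removeBatchEffect; this is a stand-in for illustration, not the workflow's implementation:

```r
library(limma)

# log_norm: log-normalized matrix (features x samples); annotation column names are placeholders
design_keep <- model.matrix(~ desired_categorical + desired_numerical, data = annot)  # variation to retain
unwanted    <- model.matrix(~ 0 + unwanted_numerical, data = annot)                   # variation to remove

integrated <- removeBatchEffect(
  log_norm,
  batch      = annot$batch_column,  # [batch_column]
  covariates = unwanted,
  design     = design_keep
)
```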

Highly Variable Feature (HVF) selection. Highly variable features (HVF) were selected based on the binned normalized dispersion of features, adapted from Zheng et al. (2017), Nature Communications. The top [hvf_parameters.top_percentage] percent of features were selected, resulting in [X] features. The dispersion for each feature across all samples was calculated as the standard deviation. Features were binned based on their means, and the dispersion of each feature was normalized by subtracting the median dispersion of its bin and dividing by the median absolute deviation (MAD) of its bin using the Python package statsmodels (ver)[ref]. The number of bins used for dispersion normalization was [hvf_parameters.bins_n]. The selected HVFs were visualized by histograms before and after normalization, mean to normalized dispersion scatterplots, and a scatterplot of the ranked normalized dispersion, always highlighting the selected features.
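A minimal sketch of the binned normalized-dispersion selection (the workflow implements this in Python with the MAD from statsmodels; base R's mad() is used here as the equivalent, and bins with a MAD of zero would need additional handling):

```r
# norm_data: normalized/integrated matrix (features x samples)
feat_mean <- rowMeans(norm_data)
feat_disp <- apply(norm_data, 1, sd)                  # dispersion = standard deviation across samples

bins <- cut(feat_mean, breaks = 20)                   # [hvf_parameters.bins_n]
norm_disp <- ave(feat_disp, bins,                     # normalize dispersion within each mean bin
                 FUN = function(d) (d - median(d)) / mad(d))
names(norm_disp) <- rownames(norm_data)

top_pct <- 10                                         # [hvf_parameters.top_percentage]
n_top   <- ceiling(nrow(norm_data) * top_pct / 100)
hvf     <- names(sort(norm_disp, decreasing = TRUE))[seq_len(n_top)]
```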

Confounding Factor Analysis (CFA). We assessed the potential confounding effects of metadata on principal components (PCs) by quantifying their statistical associations with the first ten PCs from principal component analysis (PCA). Categorical metadata were tested using the Kruskal-Wallis test, while numeric metadata were analyzed using Kendall’s Tau correlation. Metadata without variation were excluded, and numeric metadata with fewer than 25 unique values were converted to factors. P-values for associations between PCs and metadata were calculated and adjusted for multiple testing using the Benjamini-Hochberg method. The results were visualized as a heatmap with hierarchically clustered rows (metadata) displaying -log10 adjusted p-values, distinguishing between numeric and categorical metadata.
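A minimal sketch of the association testing between the first ten principal components and the metadata (pca_scores and metadata are hypothetical objects with samples as rows):

```r
# pca_scores: samples x PCs score matrix (>= 10 PCs); metadata: sample annotation data.frame
pvals <- sapply(1:10, function(pc) {
  sapply(names(metadata), function(col) {
    x <- metadata[[col]]
    if (length(unique(x)) < 2) return(NA)                        # no variation -> excluded
    if (is.numeric(x) && length(unique(x)) < 25) x <- factor(x)  # few unique values -> categorical
    if (is.numeric(x)) {
      cor.test(pca_scores[, pc], x, method = "kendall")$p.value  # Kendall's tau for numeric metadata
    } else {
      kruskal.test(pca_scores[, pc] ~ factor(x))$p.value         # Kruskal-Wallis for categorical metadata
    }
  })
})
# Benjamini-Hochberg adjustment across all tests; -log10 of these values is shown in the heatmap
adj_pvals <- matrix(p.adjust(pvals, method = "BH"),
                    nrow = nrow(pvals), dimnames = dimnames(pvals))
```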

Correlation Heatmaps. To assess sample similarities, we generated heatmaps of the sample-wise Pearson correlation matrix, computed using the cor function in R with the “pearson” method. Two versions of the heatmap were created: one hierarchically clustered and one sorted alphabetically by sample name. Hierarchical clustering was performed with the hclust function from the fastcluster package [ver] using “euclidean” distance and “complete” linkage. Heatmaps were visualized using the ComplexHeatmap package [ver] and annotated with relevant metadata.
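A minimal sketch combining cor, fastcluster, and ComplexHeatmap (metadata annotations omitted for brevity):

```r
library(fastcluster)
library(ComplexHeatmap)

# data_mat: processed matrix (features x samples)
cor_mat <- cor(data_mat, method = "pearson")  # sample-wise Pearson correlation

hc <- fastcluster::hclust(dist(cor_mat, method = "euclidean"), method = "complete")

# clustered version; for the second version, order rows/columns alphabetically by sample name instead
Heatmap(cor_mat, cluster_rows = hc, cluster_columns = hc, name = "Pearson")
```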

Visualization. The quality of the data and the effectiveness of the processing steps were assessed through the following visualizations (raw/filtered counts were log2(x+1)-normalized): the mean-variance relationship of all features, densities of log2-values per sample, boxplots of log2-values per sample, and Principal Component Analysis (PCA) plots. For the PCA plots, features with zero variance were removed beforehand and samples were colored by [visualization_parameters.annotate]. The plots were generated using the R libraries ggplot2, reshape2, and patchwork(ver)[ref].
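A minimal sketch of the PCA plot (zero-variance features removed beforehand; the coloring column is a placeholder for [visualization_parameters.annotate]):

```r
library(ggplot2)

# data_mat: processed matrix (features x samples); annot: sample annotation with matching row names
keep <- apply(data_mat, 1, var) > 0                      # drop zero-variance features
pca  <- prcomp(t(data_mat[keep, ]), center = TRUE, scale. = TRUE)

plot_df <- data.frame(pca$x[, 1:2], annot[rownames(pca$x), , drop = FALSE])
ggplot(plot_df, aes(x = PC1, y = PC2, color = group)) +  # 'group' stands in for the annotation column
  geom_point()
```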

The analyses and visualizations described here were performed using a publicly available Snakemake(ver)[ref] workflow (ver): https://doi.org/10.5281/zenodo.8144219.

🚀 Features

The workflow performs the following steps to produce the outlined results:

🛠️ Usage

Here are some tips for using this workflow:

⚙️ Configuration

Detailed specifications can be found here: ./config/README.md

📖 Examples

— COMING SOON —

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.

⭐ Star History

Star History Chart