Single-cell RNA sequencing (scRNA-seq) Data Processing & Visualization Workflow

A Snakemake 8 workflow, powered by the R package Seurat, for processing and visualizing (multimodal) sc/snRNA-seq data generated with 10x Genomics kits or provided in the MTX file format.

[!NOTE]
This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!

[!IMPORTANT]
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.10679327.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn't be possible without the following software and their dependencies.

| Software | Reference (DOI) |
| --- | --- |
| inspectdf | https://github.com/alastairrushworth/inspectdf/ |
| data.table | https://r-datatable.com |
| SCTransform | https://doi.org/10.1186/s13059-019-1874-1 |
| Seurat | https://doi.org/10.1016/j.cell.2021.04.048 |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read from the respective conda environment specifications (workflow/envs/*.yaml file) or, post-execution, from the result directory (scrnaseq_processing_seurat/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g., [X].

The outlined analyses were performed using the R package Seurat (ver) [ref] unless stated otherwise.

Merge. The preprocessed/quantified samples were merged using the function Seurat::merge, which concatenates individual samples and their metadata into one Seurat object.

Metadata. Metadata was extended with Seurat::PercentageFeatureSet using [X] and by recombination of existing metadata rules [X].
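To make the percentage metadata concrete, here is a minimal Python sketch of the idea behind a per-cell feature percentage: the share of a cell's counts that comes from features matching a pattern. It is not the workflow's code (the workflow calls Seurat in R). The function name, the toy matrix, and the `^MT-` pattern are illustrative assumptions, and returning 0.0 for cells without counts is a choice made for this sketch only.

```python
import re

def percentage_feature_set(counts, feature_names, pattern):
    """Per cell, the percentage of counts from features matching `pattern`.

    counts: list of rows, one row per feature and one column per cell.
    """
    selected = [i for i, name in enumerate(feature_names) if re.search(pattern, name)]
    n_cells = len(counts[0])
    percentages = []
    for cell in range(n_cells):
        total = sum(row[cell] for row in counts)
        matched = sum(counts[i][cell] for i in selected)
        # Sketch-only choice: report 0.0 for cells with no counts at all.
        percentages.append(100.0 * matched / total if total > 0 else 0.0)
    return percentages

# Toy data (hypothetical): 4 features x 3 cells.
feature_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
counts = [
    [5, 0, 10],
    [5, 0, 0],
    [40, 30, 50],
    [50, 70, 40],
]
mito_percent = percentage_feature_set(counts, feature_names, r"^MT-")
print(mito_percent)
```
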

Guide RNA assignment. The guide RNA (gRNA) assignment was performed using protospacer call information provided by the CRISPR functionality of 10x Genomics Cell Ranger (ver) [ref], with additional filtering by UMI thresholds [X] to select high-confidence signals. Each cell was assigned all detected gRNAs and inferred KO targets. (optional) To ensure specificity of the phenotype and avoid cell multiplets, only cells with a single gRNA assignment were kept.
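The threshold-based assignment described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow's implementation: the input layout (a dictionary of per-cell gRNA UMI counts), the inclusive `>=` comparison, the function names, and the `<target>-<index>` gRNA naming convention used to infer KO targets are all hypothetical.

```python
def assign_grnas(umi_counts, umi_threshold, single_only=False):
    """Assign to each cell all gRNAs whose UMI count reaches the threshold.

    umi_counts: dict mapping cell barcode -> dict mapping gRNA -> UMI count.
    single_only: if True, keep only cells with exactly one assigned gRNA.
    """
    assignments = {}
    for cell, guides in umi_counts.items():
        # Assumption: a gRNA is called when its UMI count is >= the threshold.
        kept = sorted(g for g, n in guides.items() if n >= umi_threshold)
        if single_only and len(kept) != 1:
            continue
        assignments[cell] = kept
    return assignments

def infer_ko_target(grna):
    """Hypothetical naming convention: 'STAT1-2' -> 'STAT1'."""
    return grna.rsplit("-", 1)[0]

# Toy data (hypothetical barcodes and gRNA names).
umi_counts = {
    "cellA": {"STAT1-1": 12, "IRF1-2": 1},
    "cellB": {"STAT1-1": 8, "IRF1-2": 9},
    "cellC": {"IRF1-2": 2},
}
all_calls = assign_grnas(umi_counts, umi_threshold=5)
single_calls = assign_grnas(umi_counts, umi_threshold=5, single_only=True)
targets = {cell: [infer_ko_target(g) for g in gs] for cell, gs in single_calls.items()}
print(all_calls, single_calls, targets)
```

With `single_only=True`, cells carrying several gRNAs (possible multiplets) and cells without any confident gRNA are both dropped, which corresponds to the optional filtering step.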

Split. The merged data set was split into subsets by the metadata column(s) [X].

Filtering. The cells were filtered by [X], which resulted in [X] high-quality cells with confident condition and gRNA assignment.

Pseudobulking. We performed pseudobulking of single-cell data to aggregate cells based on [metadata_columns], using the [aggregation_method] method and with a cell count threshold set at [cell_count_threshold] to ensure robust representations. Data from different modalities, including Antibody Capture, CRISPR Guide Capture, and Custom assays, were pseudobulked in the same way. Additionally, the distribution of cell counts across pseudobulked samples was visualized using a histogram and density plot. The resulting pseudobulked data and aggregated metadata were saved for downstream bulk analyses.
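As a rough illustration of pseudobulking, the following Python sketch groups cells by metadata columns, sums their counts per feature, and drops groups with too few cells. It is not the workflow's code. Summation stands in for [aggregation_method], and the rule that groups with fewer cells than the threshold are discarded is an assumption about how [cell_count_threshold] is applied; all names and the toy data are hypothetical.

```python
from collections import defaultdict

def pseudobulk(counts, cell_metadata, group_columns, cell_count_threshold):
    """Aggregate single-cell counts into pseudobulk samples.

    counts: dict mapping cell -> list of per-feature counts.
    cell_metadata: dict mapping cell -> dict of metadata values.
    Returns (aggregated counts per group, number of cells per group).
    """
    groups = defaultdict(list)
    for cell, meta in cell_metadata.items():
        key = tuple(meta[column] for column in group_columns)
        groups[key].append(cell)

    aggregated = {}
    cells_per_group = {}
    for key, cells in groups.items():
        # Assumption: groups with fewer cells than the threshold are dropped.
        if len(cells) < cell_count_threshold:
            continue
        # Assumption: aggregation by summing counts feature-wise.
        aggregated[key] = [sum(values) for values in zip(*(counts[c] for c in cells))]
        cells_per_group[key] = len(cells)
    return aggregated, cells_per_group

# Toy data (hypothetical): 4 cells, 3 features.
counts = {"c1": [1, 0, 3], "c2": [2, 1, 0], "c3": [0, 5, 1], "c4": [4, 4, 4]}
cell_metadata = {
    "c1": {"sample": "A", "cell_type": "T"},
    "c2": {"sample": "A", "cell_type": "T"},
    "c3": {"sample": "A", "cell_type": "T"},
    "c4": {"sample": "B", "cell_type": "T"},
}
aggregated, cells_per_group = pseudobulk(
    counts, cell_metadata, ["sample", "cell_type"], cell_count_threshold=2
)
print(aggregated, cells_per_group)
```

The per-group cell counts returned alongside the aggregated matrix are the values one would plot to inspect the distribution of cells across pseudobulked samples.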

Normalization. Filtered count data was normalized using Seurat::SCTransform v2 [ref] with the method parameter glmGamPoi to increase computational efficiency. Other modalities [X] were normalized with Seurat::NormalizeData using method CLR (Centered Log-Ratio) and margin 2.
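For readers unfamiliar with CLR, a small Python sketch of a per-cell centered log-ratio transform follows. It is an illustration only: the formula mirrors a commonly cited form of Seurat's CLR (log1p of each value divided by the exponentiated mean of log1p over the positive values) and should be checked against the installed Seurat version. Treating "margin 2" as "one transform per cell, across features" is likewise an assumption of this sketch.

```python
import math

def clr(values):
    """Centered log-ratio transform of one vector of non-negative counts."""
    # Sum log1p over positive entries only, but divide by the full length.
    log_sum = sum(math.log1p(v) for v in values if v > 0)
    scale = math.exp(log_sum / len(values))
    return [math.log1p(v / scale) for v in values]

def clr_per_cell(counts):
    """Apply CLR to each cell (column) of a features x cells matrix."""
    columns = [list(column) for column in zip(*counts)]
    normalized_columns = [clr(column) for column in columns]
    # Transpose back to features x cells.
    return [list(row) for row in zip(*normalized_columns)]

# Toy antibody-capture-like matrix (hypothetical): 2 features x 2 cells.
adt_counts = [
    [0, 10],
    [0, 10],
]
adt_clr = clr_per_cell(adt_counts)
print(adt_clr)
```
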

Cell Cycle Scoring. Cell-cycle scores were determined using the function Seurat::CellCycleScoring with gene lists for S and G2M phase provided by Seurat::cc.genes (Tirosh et al 2015) or [gene lists].
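Once S and G2M scores are available, each cell is typically given a phase label. The Python sketch below shows one such decision rule: cells with both scores below zero are labeled G1, ties are left undecided, and otherwise the phase with the higher score is chosen. It reflects the rule as commonly described for Seurat's cell-cycle scoring and is an assumption to verify against the Seurat version in use. It does not compute the scores themselves.

```python
def assign_phase(s_score, g2m_score):
    """Label a cell's cell-cycle phase from its S and G2M scores."""
    if s_score < 0 and g2m_score < 0:
        return "G1"  # neither program is active
    if s_score == g2m_score:
        return "Undecided"  # exact tie between the two scores
    return "S" if s_score > g2m_score else "G2M"

# Toy scores (hypothetical) for four cells.
scores = [(-0.2, -0.1), (0.3, 0.1), (-0.1, 0.4), (0.2, 0.2)]
phases = [assign_phase(s, g2m) for s, g2m in scores]
print(phases)
```
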

Cell Scoring. Cell-module scores were determined using the function Seurat::AddModuleScore using [gene lists].

Correction. Filtered count data was normalized and corrected using Seurat::SCTransform [ref] with the method parameter glmGamPoi to increase computational efficiency. Identified confounders [X] were used as covariates to be regressed out.

Visualization. To visualize the metadata after each processing step, inspectdf (ver) [ref] was used. For the visualization of expression, multimodal [X] data, and calculated metadata such as module scores, the Seurat functions RidgePlot (ridge plots), VlnPlot (violin plots), DotPlot (dot plots), and DoHeatmap (heatmaps) were used.

The processing, analysis, and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.10679327].

🚀 Features

The workflow performs the following steps. Outputs are always in the respective folder {split}/{step}.

The following steps are performed on each data split separately (including the "merged" split).

🛠️ Usage

Here are some tips for the usage of this workflow:

⚙️ Configuration

Detailed specifications can be found in ./config/README.md.

📖 Examples

We selected a scRNA-seq data set consisting of 15 CRC samples from Lee et al (2020) Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nature Genetics. Downloaded from the Weizmann Institute - Curated Cancer Cell Atlas (3CA) - Colorectal Cancer section.

A comparison of the cell type marker expression split by cell types visualized as a dot plot.

| Data source/authors | Workflow output |
| --- | --- |
| Cell Type Marker Dot plot | Cell Type Marker Dot plot |

We provide metadata, annotation and configuration files for this data set in ./test. The UMI count matrix has to be downloaded by following the instructions below.

### Download example CRC scRNA-seq data from Lee 2020 Nature Genetics

# Start from repo root
cd scrnaseq_processing_seurat

# Download the .zip file
wget -O data.zip "https://www.dropbox.com/sh/pvauenviguopkue/AADVbccY9ueRVAFTeJEEPxRwa?dl=1" || curl -L "https://www.dropbox.com/sh/pvauenviguopkue/AADVbccY9ueRVAFTeJEEPxRwa?dl=1" -o data.zip

# Unzip and delete the .zip archive
unzip data.zip -d Data_Lee2020_Colorectal
rm data.zip

# Move and rename the UMI count matrix
mv Data_Lee2020_Colorectal/Exp_data_UMIcounts.mtx test/data/Lee2020NatGenet/matrix.mtx

# Remove the unzipped folder
rm -r Data_Lee2020_Colorectal
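After the download, it can be useful to sanity-check that the moved file really is a Matrix Market file and to read its dimensions. The Python sketch below is an optional helper, not part of the workflow. It relies only on the Matrix Market text layout (a `%%MatrixMarket` banner, `%` comment lines, then a size line with rows, columns, and number of entries) and demonstrates itself on a small temporary file with made-up contents.

```python
import os
import tempfile

def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(path, "r") as handle:
        banner = handle.readline()
        if not banner.startswith("%%MatrixMarket"):
            raise ValueError("missing %%MatrixMarket banner")
        for line in handle:
            # Skip comment lines and blank lines before the size line.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(field) for field in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError("no size line found")

# Demonstration on a tiny, made-up matrix written to a temporary file.
demo_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% toy example\n"
    "3 2 4\n"
    "1 1 5\n"
    "2 1 1\n"
    "3 2 7\n"
    "1 2 2\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".mtx", delete=False) as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name
shape = read_mtx_shape(demo_path)
os.remove(demo_path)
print(shape)
```

For the example data, the same function could be pointed at test/data/Lee2020NatGenet/matrix.mtx once the steps above have completed.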

Beyond this, the workflow was tested on multimodal scCRISPR-seq data sets with >350,000 cells and 340 different KO groups (<4h; 99 jobs; 256GB RAM).

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.

⭐ Star History

Star History Chart