Skip to the content.

MrBiomics DOI GitHub license GitHub Release Snakemake

RNA-seq Data Processing, Quantification & Annotation Pipeline

A Snakemake 8 workflow for end-to-end processing, quantification, and annotation of gene expression for RNA-seq experiments, starting from single- or paired-end reads within raw/unaligned/unmapped uBAM files, including a comprehensive MultiQC report.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the projectโ€™s repository.

โญ๏ธ Star and share modules you find valuable ๐Ÿ“ค - help others discover them, and guide our future work!

[!IMPORTANT]
If you use this workflow in a publication, please donโ€™t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.15119355 and acknowledging the rna-seq-star-deseq2 workflow DOI 10.5281/zenodo.4737358 from which some structure and code were adapted.

Workflow Rulegraph

๐Ÿ–‹๏ธ Authors

๐Ÿ’ฟ Software

This project wouldnโ€™t be possible without the following software and their dependencies.

Software Reference (DOI / URL)
Snakemake https://doi.org/10.12688/f1000research.29032.2
STAR https://doi.org/10.1093/bioinformatics/bts635
Samtools https://doi.org/10.1093/bioinformatics/btp352
MultiQC https://doi.org/10.1093/bioinformatics/btw354
RSeQC https://doi.org/10.1093/bioinformatics/bts356
biomaRt https://doi.org/10.1038/nprot.2009.97
EDASeq https://doi.org/10.1186/1471-2105-12-480
gffutils https://github.com/daler/gffutils
pandas https://doi.org/10.5281/zenodo.3509134
fastp https://doi.org/10.1093/bioinformatics/bty560
pigz https://zlib.net/pigz/

๐Ÿ”ฌ Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post-execution in the result directory (rnaseq_pipeline/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].

Processing. All unmapped BAM files for each sample were merged, converted to FASTQ format using Samtools [ref] (ver), processed for adapter trimming and quality filtering with fastp [ref] (ver) using parameters [config["fastp_args"]] (and adapters [config["adapter_fasta"]]), and finally de-interleaved (if paired-end) and compressed into separate R1/R2 files using shell commands and pigz [ref] (ver).

Quantification. Gene expression quantification was performed on the filtered and trimmed reads. The STAR aligner (ver) [ref] was utilized in --quantMode GeneCounts mode to count reads overlapping annotated genes based on the Ensembl [ver config: ref: release] gene annotation for the [config: ref: species] genome (build [config: ref: build]) and using parameters [config["star_args"]]. Library strandedness ([none/yes/reverse], specified per sample in config/annotation.csv) was accounted for during counting. Counts for individual samples were aggregated into a single gene-by-sample count matrix using a custom Python script utilizing the pandas library (ver) [ref]. Quality control metrics from various tools, including fastp (ver) [ref], RSeQC (ver) [ref] and STAR (ver) [ref], were aggregated using MultiQC (ver) [ref].

Annotation. Gene annotations from Ensembl were retrieved using the R package biomaRt (ver) [ref]. Annotations included Ensembl gene ID, gene symbol (external_gene_name), gene biotype, and description. Additionally, exon-based GC content (exon_gc) and cumulative exon length (exon_length) were calculated for each gene using a custom R function adapted from EDASeq (ver) [ref], leveraging biomaRt (ver) [ref] to fetch exon coordinates and sequences. This exon-based approach was chosen as sequencing reads in poly(A)-selected libraries primarily derive from exonic regions, making these metrics more appropriate for downstream bias correction (e.g., Conditional Quantile Normalization - CQN) than whole-gene metrics. A sample annotation file was generated, integrating input annotations with QC metrics.

The processing and quantification described here was performed using a publicly available Snakemake [ver] (ref) workflow [10.5281/zenodo.15119355], which adopted code from anotehr workflow (ref) 10.5281/zenodo.4737358.

๐Ÿš€ Features

This workflow offers several key advantages for RNA-seq analysis over existing pipelines:


The workflow performs the following steps that produce the outlined results:


The workflow produces the following directory structure:

{result_path}/
โ””โ”€โ”€ rnaseq_pipeline/
    โ”œโ”€โ”€ report/
    โ”‚   โ””โ”€โ”€ multiqc_report.html         # Aggregated QC report for all samples
    โ”œโ”€โ”€ fastp/                          # fastp QC/filtering and adapter trimming outputs per sample
    โ”œโ”€โ”€ rseqc/                          # RSeQC output per sample
    โ”œโ”€โ”€ star/                           # STAR output per sample
    โ”œโ”€โ”€ counts/                         # Final quantification and annotation outputs
    โ”‚   โ”œโ”€โ”€ counts.csv                  # Aggregated gene count matrix (genes x samples)
    โ”‚   โ”œโ”€โ”€ sample_annotation.csv       # Annotation for samples (columns of counts.csv)
    โ”‚   โ””โ”€โ”€ gene_annotation.csv         # Annotation for genes (rows of counts.csv)
    โ”œโ”€โ”€ envs/                           # Exported Conda environment specifications
    โ””โ”€โ”€ configs/                        # Exported configuration and annotation files used for the run

[!IMPORTANT]
Resources are downloaded automatically to resources/{config::project_name}/rnaseq_pipeline/), are large (>3GB) and have to be manually removed if not needed anymore.

๐Ÿ› ๏ธ Usage

Here are some tips for the usage of this workflow:

This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.

โš™๏ธ Configuration

Detailed specifications can be found here ./config/README.md

๐Ÿ“– Examples

Explore detailed examples showcasing module usage in comprehensive end-to-end analyses (including data, configuration, annotation and results) in our MrBiomics Recipes:

๐Ÿ” Quality Control

Below are some guidelines for the manual quality control of each sample using the generated MultiQC report, but keep in mind that every experiment/dataset is different. Thresholds are general suggestions and may vary based on experiment type, organism, and library prep.

๐Ÿ”— Links

๐Ÿ“š Resources

๐Ÿ“‘ Publications

The following publications successfully used this module for their analyses.

โญ Star History

Star History Chart