Skip to the content.

MrBiomics DOI GitHub license GitHub Release Snakemake

Differential Analysis & Visualization Snakemake Workflow Using limma

A Snakemake 8 workflow for performing and visualizing differential expression (or accessibility) analyses (DEA) of NGS data (eg RNA-seq, ATAC-seq, scRNA-seq,…) powered by the R package limma.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project’s repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our focus for future work!

[!IMPORTANT]
If you use this workflow in a publication, please don’t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.7808516.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn’t be possible without the following software and their dependencies:

Software Reference (DOI)
edgeR https://doi.org/10.1093/bioinformatics/btp616
EnhancedVolcano https://doi.org/10.18129/B9.bioc.EnhancedVolcano
ggplot2 https://ggplot2.tidyverse.org/
patchwork https://CRAN.R-project.org/package=patchwork
pheatmap https://cran.r-project.org/package=pheatmap
limma https://doi.org/10.1093/nar/gkv007
Snakemake https://doi.org/10.12688/f1000research.29032.2

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post-execution in the result directory (dea_limma/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].

Differential Expression Analysis (DEA). DEA was performed on the quality-controlled filtered [raw/normalized] counts using the limma (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), effect size (absolute log2 fold change > [X]), and expression (average expression > [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.

Visualization. The filtered result statistics, i.e., number of relevant features split by positive (up) and negative (down) effect sizes, were visualized with stacked bar plots using ggplot (ver) [ref]. To visually summarize results of all performed comparisons, the effect size (log2 fold change) values of all relevant features in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref]. Volcano plots were generated for each comparison using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [pCutoff] and log2 fold change threshold of [FCcutoff] as visual cut-offs for the y- and x-axis, respectively. Finally, quality control plots of the fitted mean-variance relationship and raw p-values of the features were generated.

The analysis and visualizations described were performed using a publicly available Snakemake (ver) [ref] workflow (ver) [10.5281/zenodo.7808516].

🚀 Features

The workflow performs the following steps that produce the outlined results:

[!NOTE]

🛠️ Usage

Here are some tips for the usage of this workflow:

⚖️ Contrasts

Currently, we do not support contrasts. If you have any idea how to implement contrasts, feel invited to reach out. If you require contrasts to ask questions your model does not answer, just load the fitted model and perform the contrast manually (see examle in limma’s User’s Guide chapter 9.5):

# load model and design
fit <- readRDS(file.path('{result_path}/dea_limma/{analysis}/lmfit_object.rds'))
design <- data.frame(fread(file.path("{result_path}/dea_limma/{analysis}/model_matrix.csv"), header=TRUE), row.names=1)

# define and create contrasts of interest
# e.g., if you modeled "~ 0 + group", where group = {celltype}_{genotype}
# you can find the genotype effect for cell types A, B, C using the following contrasts
contrasts_all <- list(
                ctA = "groupCelltypeA_KO - groupCelltypeA_WT",
                ctB = "groupCelltypeB_KO - groupCelltypeB_WT",
                ctC = "groupCelltypeC_KO - groupCelltypeC_WT"
                )
contrast_matrix <- makeContrasts(contrasts=contrasts_all, levels = design)

# perform contrasts
fit2 <- contrasts.fit(fit, contrast_matrix)

# estimate/correct variance with eBayes
fit2 <- eBayes(fit2)

# extract results
contrast_result <- data.frame()

for(coefx in colnames(coef(fit2))){
    tmp_res <- topTable(fit2, coef=coefx, number=nrow(fit2), sort.by="P")
    tmp_res$feature <- rownames(tmp_res)
    tmp_res <- tmp_res[,c(ncol(tmp_res),1:(ncol(tmp_res)-1))]
    rownames(tmp_res) <- NULL
    
    # beautify group names by replacing them (the contrast formula) with the names of the comparison
    tmp_res$group <- names(contrasts_all)[contrasts_all == coefx]
    
    if(dim(contrast_result)[1]==0){
        contrast_result <- tmp_res
    }else{
        contrast_result <- rbind(contrast_result, tmp_res)
    }
}

# remove rows with adj.P.Val=NA
contrast_result <- contrast_result[!is.na(contrast_result$adj.P.Val),]

# save results
fwrite(as.data.frame(contrast_result), file=file.path("path/to/contrast_results.csv"), row.names=FALSE)

# process and visualize results as you wish, feel free to reuse code from this module

⚙️ Configuration

Detailed specifications can be found here ./config/README.md

📖 Examples

— COMING SOON —

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.

⭐ Star History

Star History Chart