Skip to the content.

MrBiomics DOI GitHub license GitHub Release Snakemake

Genomic Region & Gene Set Enrichment Analysis & Visualization Workflow for Human and Mouse Genomes.

A Snakemake 8 workflow for enrichment analysis and visualization of human (hg19 or hg38) or mouse (mm9 or mm10) based genomic region sets and (ranked) gene sets. Together with the respective background region/gene sets, the enrichment within the configured databases is determined using LOLA, GREAT, GSEApy (over-representation analysis (ORA) & preranked GSEA), pycisTarget, RcisTarget and results saved as CSV files. Additionally, the most significant results are plotted for each region/gene set, database queried, and analysis performed. Finally, the results within the same “group” (e.g., stemming from the same analysis) are aggregated per database and analysis in summary CSV files and visualized using hierarchically clustered heatmaps and bubble plots. For collaboration, communication and documentation of results, methods and workflow information a detailed self-contained HTML report can be generated.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project’s repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our focus for future work!

[!IMPORTANT]
If you use this workflow in a publication, please don’t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.7810621.

Workflow Rulegraph

🖋️ Authors

Software Reference (DOI)
Enrichr https://doi.org/10.1002/cpz1.90
ggplot2 https://ggplot2.tidyverse.org/
GREAT https://doi.org/10.1371/journal.pcbi.1010378
GSEA https://doi.org/10.1073/pnas.0506580102
GSEApy https://doi.org/10.1093/bioinformatics/btac757
LOLA https://doi.org/10.1093/bioinformatics/btv612
pandas https://doi.org/10.5281/zenodo.3509134
pheatmap https://cran.r-project.org/package=pheatmap
pycisTarget https://doi.org/10.1038/s41592-023-01938-4
RcisTarget https://doi.org/10.1038/nmeth.4463
rGREAT https://doi.org/10.1093/bioinformatics/btac745
Snakemake https://doi.org/10.12688/f1000research.29032.2

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post-execution in the result directory (enrichment_analysis/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].

The outlined analyses were performed using the programming languages R (ver) [ref] and Python (ver) [ref] unless stated otherwise. All approaches statistically correct their results using expressed/accessible background genomic region/gene sets from the respective analyses that yielded the query region/gene sets.

Genomic region set enrichment analyses

LOLA. Genomic region set enrichment analysis was performed using LOLA (ver) [ref], which uses Fisher’s exact test. The following databases were queried [lola_databases].

GREAT. Genomic region set enrichment analysis was performed using GREAT [ref] implemented with rGREAT (ver) [ref]. The following databases were queried [local_databases].

pycisTarget. Genomic region set TFBS (Transcription Factor Binding Site) motif enrichment analysis was performed using pycisTarget (ver) [ref]. The following databases were queried [pycisTarget_databases].

Furthermore, genomic regions (query- and background-sets) were mapped to genes using GREAT (without background) and then analyzed as gene sets as described below for a complementary and extended perspective.

Gene set enrichment analyses (GSEA)

Over-representation analysis (ORA). Gene set ORA was performed using Enrichr [ref], which uses Fisher’s exact test (i.e., hypergeometric test), implemented with GSEApy’s (ver) [ref] function enrich. The following databases were queried [local_databases].

Preranked GSEA. Preranked GSEA was performed using GSEA [ref], implemented with GSEApy’s (ver) [ref] function prerank. The following databases were queried [local_databases].

RcisTarget. Gene set TFBS (Transcription Factor Binding Site) motif enrichment analysis was performed using RcisTarget (ver) [ref]. The following databases were queried [RcisTarget_databases].

Aggregation The results of all queries belonging to the same analysis [group] were aggregated by method and database. Additionally, we filtered the results by retaining only the union of terms that were statistically significant (i.e. [adj_pvalue]<=[adjp_th]) in at least one query.

Visualization All analysis results were visualized in the same way.

For each query, method and database combination an enrichment dot plot was used to visualize the most important results. The top [top_n] terms were ranked (along the y-axis) by the mean rank of statistical significance ([p_value]), effect-size ([effect_size]), and overlap ([overlap]) with the goal to make the results more balanced and interpretable. The significance (adjusted p-value) is denoted by the dot color, effect-size by the x-axis position, and overlap by the dot size. Note that in any output file in this module, the regions are indexed with the BED indexing convention, i.e the start is 0-based and inclusive while the end is exclusive ([start, end), a region with start=0 and end=100 spans exactly 100 bases: positions 0 to 99).

The aggregated results per analysis [group], method and database combination were visualized using hierarchically clustered heatmaps and bubble plots. The union of the top [top_terms_n] most significant terms per query were determined and their effect-size and significance were visualized as hierarchically clustered heatmaps, and statistical significance ([adj_pvalue] < [adjp_th]) was denoted by *. Furthermore, a hierarchically clustered bubble plot encoding both effect-size (color) and statistical significance (size) is provided, with statistical significance denoted by *. All summary visualizations’ values were capped by [adjp_cap]/[or_cap]/[nes_cap] to avoid shifts in the coloring scheme caused by outliers.

The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.7810621].

🚀 Features

The five tools LOLA, GREAT, pycisTarget, RcisTarget and GSEApy (over-representation analysis (ORA) & preranked GSEA) are used for various enrichment analyses. Databases to be queried can be configured (see ./config/config.yaml). All approaches statistically correct their results using the provided background region/gene sets.

Note:

🛠️ Usage

Here are some tips for the usage of this workflow:

  1. Download all relevant databases (see Resources).
  2. Configure the analysis using the configuration YAML and an annotation file (see Configuration)
  3. Run the analysis on every query gene/region set of interest (e.g., results of differential analyses) with the respective background genes/regions (e.g., all expressed genes or consensus regions).
  4. generate the Snakemake Report
  5. look through the overview plots of your dedicated groups and queried databases in the report
  6. dig deeper by looking at the
    • aggregated result table underlying the summary/overview plot
    • enrichment plots for the individual query sets
  7. investigate interesting hits further by looking into the individual query result tables.

⚙️ Configuration

Detailed specifications can be found here ./config/README.md

🧬 How to convert feature lists to BED files

This enrichment analysis workflow requires genomic regions to be in standard BED format. At minimum, provide the first 3 BED columns: chromosome, start, end. A 4th column such as name is optional but often useful. BED uses 0-based, start-inclusive, end-exclusive coordinates ([start, end)). However, upstream differential analysis workflows (e.g., using limma) often produce a simple text file containing only a list of significant feature IDs (e.g., peak_1024, peak_5531). To use these results, you must first convert this list of IDs into a valid BED file by mapping each ID to its genomic coordinates.

We provide a Python helper script, features_to_bed.py, and a Snakemake rule to automate this crucial step. Simply adapt them to your setup. Both take two files as input:

  1. Your list of significant feature IDs (e.g., features.txt).
  2. An annotation file that contains the genomic coordinates for all features in your experiment (e.g., consensus_annotation.csv).

Both implementations merge these files, finds the coordinates for your significant features, and saves them in the required BED format. In the Snakemake rule input and output paths can be replaced with wildcards to fit your analysis.

rule convert_features2bed:
    input:
        consensus_annotation = "path/to/consensus_annotation.csv",
        features_txt = "path/to/dea_limma/features.txt",
    output:
        features_bed = "path/to/features.bed",
    params:
        region_col = "peak_id",
        chr_col = "gencode_chr",
        start_col = "gencode_start",
        end_col = "gencode_end",
    run:
        # load files as pandas df
        consensus_df = pd.read_csv(input.consensus_annotation)
        features_df = pd.read_csv(input.features_txt, header=None, names=[params.region_col])
        # map using params
        merged_df = pd.merge(features_df, consensus_df, on=params.region_col, how="inner")
        # Select and order the columns required for the BED file format.
        bed_df = merged_df[[params.chr_col, params.start_col, params.end_col, params.region_col]]
        # save in BED format
        bed_df.to_csv(output.features_bed, sep="\t", header=False, index=False)

📖 Examples

Explore detailed examples showcasing module usage in our comprehensive end-to-end MrBiomics Recipes, including data, configuration, annotation and results:

Additionally, we provide a minimal example that tests all features of this workflow using intermediate results from the MrBiomics ATAC-seq and RNA-seq recipes. Briefly, the input data consist of 4 feature sets (2 genomic region sets, 2 ranked gene lists) of B cell and erythroid cell-specific transcriptional/epigenetic differences, compared to all other hematopoietic cell types, within the Corces et al. (2016) dataset, determined using our dea_limma module.

The example is configured using config/config.yaml and config/annotation.csv, and restores test data and resources from test/compressed_resources and the web. Run this test to confirm that your setup is working, and to get familiar with the workflow’s features.

Follow these steps to run the complete analysis:

  1. Restore the minimal test data and resources
     # enter enrichment_analysis directory
     cd enrichment_analysis
     # extract the input data and libraries
     bash test/setup_test_resources.sh
    
  2. Activate your conda Snakemake environment, run a dry-run, run the workflow and generate the report
     conda activate snakemake
     # dry-run
     snakemake -p --use-conda -n
     # real run
     snakemake -p --use-conda --cores=1
     # report
     snakemake --report test/report.html
    

    With one core this should finish within ~20 minutes.

[!IMPORTANT] The 3 cisTarget resource files used in the test are synthetic. They were created to test all features, but they are not biologically meaningful and must not be used for real analyses.

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.

⭐ Star History

Star History Chart