
Ultimate ATAC-seq Data Processing & Analysis Pipeline


From rAw (unaligned) BAM files to normaliZed counts.

A Snakemake implementation of the BSF's ATAC-seq Data Processing Pipeline, extended with downstream processing and unsupervised analysis steps using bash, python and R. Reproducibility and scalability are ensured by using Snakemake, conda and Singularity.

If you use this workflow in a publication, please give credit to the authors by citing it via this DOI: 10.5281/zenodo.6323634.

Workflow Rulegraph

Table of contents

Authors

Software

This project wouldn't be possible without the following software:

Software Reference (DOI)
bedtools https://doi.org/10.1093/bioinformatics/btq033
Bowtie2 https://doi.org/10.1038/nmeth.1923
CQN https://doi.org/10.1093/biostatistics/kxr054
deeptools https://doi.org/10.1093/nar/gkw257
ENCODE https://doi.org/10.1038/s41598-019-45839-z
fastp https://doi.org/10.1093/bioinformatics/bty560
HOMER https://doi.org/10.1016/j.molcel.2010.05.004
MACS2 https://doi.org/10.1186/gb-2008-9-9-r137
matplotlib https://doi.org/10.1109/MCSE.2007.55
MultiQC https://doi.org/10.1093/bioinformatics/btw354
pybedtools https://doi.org/10.1093/bioinformatics/btr539
pandas https://doi.org/10.5281/zenodo.3509134
samblaster https://doi.org/10.1093/bioinformatics/btu314
samtools https://doi.org/10.1093/bioinformatics/btp352
SCANPY https://doi.org/10.1186/s13059-017-1382-0
scikit-learn http://jmlr.org/papers/v12/pedregosa11a.html
seaborn https://doi.org/10.21105/joss.03021
Snakemake https://doi.org/10.12688/f1000research.29032.2
TMM https://doi.org/10.1186/gb-2010-11-3-r25
UMAP https://doi.org/10.21105/joss.00861
UROPA https://doi.org/10.1038/s41598-017-02464-y

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml files) or after execution. Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g. [X].

Processing. Sequencing adapters were removed using fastp (ver) [ref]. Bowtie2 (ver) [ref] was used to align the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered against ENCODE blacklisted regions [ref], and indexed using samtools (ver) [ref]. To detect open chromatin regions, peak calling was performed on each sample using MACS2 (ver) [ref] with the “--nomodel”, “--keep-dup auto” and “--extsize 147” options. The HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis over the detected open chromatin regions.
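For illustration, a minimal sketch of this processing chain driven from Python; the sample name, Bowtie2 index, and MACS2 genome flag are hypothetical placeholders, and the blacklist-filtering details of the actual workflow (which wraps equivalent commands in Snakemake rules) are omitted:

```python
import subprocess

sample = "sample1"      # hypothetical sample name
index = "GRCh38_index"  # hypothetical Bowtie2 index prefix

# trim adapters (fastp), align (Bowtie2), mark duplicates (samblaster),
# and sort (samtools), streaming between the tools via pipes
subprocess.run(
    f"fastp -i {sample}_R1.fastq.gz -I {sample}_R2.fastq.gz --stdout"
    f" | bowtie2 --very-sensitive -x {index} --interleaved -"
    f" | samblaster"
    f" | samtools sort -o {sample}.bam -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", f"{sample}.bam"], check=True)

# peak calling with MACS2 using the options stated above
subprocess.run(
    ["macs2", "callpeak", "-t", f"{sample}.bam", "-f", "BAM", "-g", "hs",
     "--nomodel", "--keep-dup", "auto", "--extsize", "147", "-n", sample],
    check=True,
)
```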

Quality control metrics were aggregated and reported using MultiQC (ver) [ref]; [X] sample(s) needed to be removed.

Quantification. A consensus region set, comprising [X] genomic regions, was generated by merging the identified peak summits, extended by 250 bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples, while again discarding peaks overlapping blacklisted regions as defined by the ENCODE project [ref]. Consensus regions were annotated using the annotatePeaks function from HOMER (ver) [ref].
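A minimal sketch of this consensus-building step with pybedtools; the summit files, chromosome sizes file, and blacklist BED are hypothetical placeholders:

```python
import pybedtools

# hypothetical inputs: per-sample MACS2 summit BEDs, chromosome sizes, blacklist
summit_files = ["sample1_summits.bed", "sample2_summits.bed"]
genome = "hg38.chrom.sizes"
blacklist = pybedtools.BedTool("hg38_blacklist.bed")

# extend each summit by 250 bp on both sides
extended = [pybedtools.BedTool(f).slop(b=250, g=genome) for f in summit_files]

# concatenate across samples, merge overlapping regions, drop blacklisted ones
consensus = (
    extended[0].cat(*extended[1:], postmerge=True)
    .intersect(blacklist, v=True)
)
consensus.saveas("consensus_regions.bed")
```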

Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic.
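In the workflow this classification is performed by UROPA via its query configuration; the following hypothetical helper merely illustrates the precedence logic, with example cutoffs standing in for the [X] placeholders:

```python
def classify_region(tss_distance, overlaps_tss, overlaps_gene,
                    proximal_up=1000, proximal_down=100, distal=10000):
    """Annotate a region following the precedence described above.

    tss_distance: signed distance from the region's midpoint to the closest
    TSS (positive = upstream); all cutoffs are illustrative stand-ins for [X].
    """
    if overlaps_tss or -proximal_down <= tss_distance <= proximal_up:
        return "TSS proximal"
    if overlaps_gene:
        return "gene body"
    if abs(tss_distance) <= distal:
        return "distal"
    return "intergenic"

# example: a region 5 kb upstream of the closest TSS but inside a gene body
print(classify_region(5000, overlaps_tss=False, overlaps_gene=True))  # gene body
```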

The consensus region set was used to quantify chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. Both the generation of the consensus region set and the sample-wise quantification of accessibility were performed using bedtools (ver) [ref] and pybedtools (ver) [ref].
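A sketch of the per-sample counting step; pybedtools' multi_bam_coverage wraps bedtools multiBamCov, and all file names are hypothetical:

```python
import pybedtools

# hypothetical inputs: the consensus regions and one indexed BAM per sample
consensus = pybedtools.BedTool("consensus_regions.bed")
bams = ["sample1.bam", "sample2.bam"]

# appends one read-count column per BAM to each consensus region
counts_bed = consensus.multi_bam_coverage(bams=bams)
counts = counts_bed.to_dataframe(names=["chrom", "start", "end"] + bams)
counts.to_csv("counts.csv", index=False)
```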

Optional. We split the data into subsets according to the annotation [X] and performed all downstream analyses for each subset separately.

Downstream Analysis. For all downstream analyses, we filtered the [X] consensus regions to [X] regions that had reads in at least [X] samples, were covered by at least [X] reads (normalized by median library size) in at least [X] proportion of the samples of the smallest subsample group with potential signal, determined using the annotation [X], and were supported by at least [X] total reads across all samples.
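A hypothetical sketch of these filtering criteria with pandas; counts is the regions-by-samples matrix from above, groups maps samples to the annotation [X], and all cutoffs are illustrative stand-ins:

```python
import pandas as pd

counts = pd.read_csv("counts.csv").set_index(["chrom", "start", "end"])
groups = pd.Series({"sample1.bam": "A", "sample2.bam": "B"})  # annotation [X]
min_samples, min_reads, min_prop, min_total = 2, 5, 0.5, 50   # stand-ins for [X]

# normalize counts by the median library size
norm = counts / counts.sum(axis=0) * counts.sum(axis=0).median()

smallest_group = groups.value_counts().min()
keep = (
    ((counts > 0).sum(axis=1) >= min_samples)
    & ((norm >= min_reads).sum(axis=1) >= min_prop * smallest_group)
    & (counts.sum(axis=1) >= min_total)
)
filtered = counts.loc[keep]
```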

Next, we determined highly variable regions (HVRs) using an adaptation of the SCANPY (ver) [ref] function highly_variable_genes with flavor 'cell_ranger': instead of dispersion = variance/mean we use dispersion = standard deviation, because in ATAC-seq data the number of regions can be very large. This can lead to negative log(CPM) values and log(CPM) means, and thus to negative, meaningless dispersion values (using the standard deviation additionally avoids division-by-zero problems). We therefore only employ the binning-by-mean strategy for stabilization, but not a "coefficient of variation" strategy.
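A minimal sketch of this adapted dispersion computation, assuming the filtered counts matrix from above; the number of bins and of selected regions are illustrative:

```python
import numpy as np
import pandas as pd

# log(CPM) per region and sample (log1p adds a pseudocount to avoid log(0))
cpm = filtered / filtered.sum(axis=0) * 1e6
log_cpm = np.log1p(cpm)

mean = log_cpm.mean(axis=1)
dispersion = log_cpm.std(axis=1)  # adapted: standard deviation, not variance/mean

# stabilize by z-scoring dispersions within bins of regions with similar mean
bins = pd.qcut(mean, q=20, duplicates="drop")
z = dispersion.groupby(bins).transform(lambda d: (d - d.mean()) / d.std())

hvr = z.nlargest(2000).index  # the number of HVRs kept is a stand-in
```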

Conditional quantile normalization with cqn (ver) [ref] was performed on the filtered accessibility matrix, using the region length and GC content of the consensus regions as conditions, both quantified using bedtools (ver) [ref] and pybedtools (ver) [ref].

Trimmed mean of M-values (TMM) normalization (ver) [ref] was performed on the filtered accessibility matrix.

Unsupervised Analysis & Visualization. We applied both linear and non-linear unsupervised dimensionality reduction methods to the normalized data to visualize emerging sample-wise patterns in two dimensions. We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach and Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] with the correlation metric as the non-linear approach. matplotlib (ver) [ref] was used for visualization.
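A sketch of the two projections, assuming X is the normalized samples-by-regions matrix (e.g. the normalized values restricted to the highly variable regions from above):

```python
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA

X = filtered.loc[hvr].T.values  # samples x regions, from the sketches above

pca_coords = PCA(n_components=2).fit_transform(X)
umap_coords = umap.UMAP(n_components=2, metric="correlation").fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, coords, title in zip(axes, [pca_coords, umap_coords], ["PCA", "UMAP"]):
    ax.scatter(coords[:, 0], coords[:, 1])
    ax.set_title(title)
fig.savefig("unsupervised_analysis.png", dpi=150)
```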

The processing and analysis described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.6323634].

Features

Usage

These steps are the recommended usage for the first run of the workflow:

  1. perform only the processing by setting the downstram_analysis flag to 0 in the project configuration
  2. use the generated MultiQC report (project_path/atacseq_report/multiqc_report.html) to judge the quality of your samples
  3. fill out the mandatory quality control column (pass_qc) in the sample metadata configuration file (you can even use some of the quality metrics for plotting, e.g. 'FRiP' as in the example files)
  4. finally, execute the remaining downstream analysis steps by setting the respective flag to 1; only the samples that passed quality control will be included

This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.

Installation

This should take less than 10 minutes.

  1. install Snakemake, which requires conda & mamba, according to the documentation
  2. clone/download this repository (e.g. git clone https://github.com/epigen/atacseq_pipeline.git)

All software/package dependencies are managed automatically via Snakemake and conda (and optionally Singularity) and are installed upon the first run of the workflow.

Configuration

Detailed specifications can be found in ./config/README.md.

Execution

1. Change working directory & activate conda environment

Always execute from within the top level of the pipeline directory (i.e. atacseq_pipeline/). Snakemake commands only work from within the snakemake conda environment.

cd atacseq_pipeline
conda activate snakemake

2. Execute a dry-run

command for a dry-run with option -n (-p makes Snakemake print the resulting shell command for illustration)

snakemake -p -n

3. Execute workflow locally or on a cluster

3a. Local execution

command for execution with one core

snakemake -p -j1 --use-conda

3b. Cluster execution

command for vanilla cluster execution on cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine); more info in the Snakemake Cluster Execution documentation

snakemake -p --use-conda --cluster qsub -j 32

command for cluster execution using --profile, which submits every task as a separate job with dependencies

snakemake -p --use-conda --profile config/slurm.cemm

the profile for CeMM's Slurm environment is provided in the config/ directory; of note:

If you are using another setup, get your cluster execution profile from the Snakemake-Profiles project

X. Singularity execution (not tested)

Singularity has to be installed (system-wide by root) and available/loaded (e.g. module load singularity). The pipeline automatically loads the correct Singularity image from Dockerhub (https://hub.docker.com/r/ /atacseq_pipeline)

command for execution with Singularity: just add the flag --use-singularity and use --singularity-args to provide all the necessary directories the pipeline needs access to (in the example it is configured for the three relevant partitions at CeMM)

snakemake -p --use-conda --use-singularity --singularity-args "--bind /nobackup:/nobackup --bind /research:/research --bind /home:/home"

Module

Use as module in another Snakemake workflow (soon)

Report

command for report generation (this can take a few minutes, depending on the size of the dataset)

snakemake --report /absolute/path/to/report.zip

The command creates a self-contained HTML-based report in a zip archive containing the following sections:

Results

Project directory structure:

Examples

We provide configuration files for two example datasets (mm10 & hg38). Additionally, the report zip archive of the hg38 test example is provided to showcase the pipeline results.

Resources

To ensure reproducibility of results and to make the pipeline easy to use, we provide all required reference data for the analysis of ATAC-seq samples for both supported genomes on Zenodo:

Tips

Here are some tips for troubleshooting & FAQs:

command to generate the workflow rulegraph (provided in workflow/dags)

snakemake --rulegraph --forceall | dot -Tsvg > workflow/dags/atacseq_pipeline_rulegraph.svg

command to generate the complete DAG of all jobs (provided for both test examples in workflow/dags)

snakemake --dag --forceall | dot -Tsvg > workflow/dags/all_DAG.svg

Links

Publications

The following publications successfully used this module for their analyses.