Fetch Public Sequencing Data and Metadata Using iSeq

A Snakemake 8 workflow to fetch (download) and process public sequencing data and metadata from GSA, SRA, ENA, GEO and DDBJ databases using iSeq.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project’s repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!

[!IMPORTANT]
If you use this workflow in a publication, please don’t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.15005419.

Workflow Rulegraph

πŸ–‹οΈ Authors

πŸ’Ώ Software

This project wouldn’t be possible without the following software and their dependencies.

| Software  | Reference (DOI)                                |
|-----------|------------------------------------------------|
| iSeq      | https://github.com/BioOmics/iSeq               |
| pandas    | https://doi.org/10.5281/zenodo.3509134         |
| Picard    | https://broadinstitute.github.io/picard/       |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Retain only the paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) can be looked up in the respective conda environment specifications (workflow/envs/*.yaml) or, after execution, in the result directory ({module}/envs/*.yaml). Parameters that have to be adapted to the data or workflow configuration are denoted in square brackets, e.g., [X].
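To fill in the (ver) placeholders, the pinned versions can be collected from the environment files. The following is a minimal sketch, assuming each environment file lists its dependencies one per line in the usual conda form `- package=version`; the function name `pinned_versions` is hypothetical, and unpinned or range-constrained dependencies are simply skipped.

```python
import re
from pathlib import Path

# Matches pinned conda dependency lines such as "  - name=1.2.3",
# "  - channel::name==1.2.3" or "  - name=1.2.3=build".
# Assumption: one dependency per line, as in typical conda environment files.
_DEP = re.compile(r"^\s*-\s*(?:[\w.-]+::)?([A-Za-z0-9_.-]+?)\s*={1,2}\s*([^\s=#]+)")


def pinned_versions(env_dir):
    """Return {package: version} for every pinned dependency in env_dir/*.yaml."""
    versions = {}
    for env_file in sorted(Path(env_dir).glob("*.yaml")):
        for line in env_file.read_text().splitlines():
            match = _DEP.match(line)
            if match:
                versions[match.group(1).lower()] = match.group(2)
    return versions


if __name__ == "__main__":
    # "workflow/envs" is the location named in the text above.
    for package, version in pinned_versions("workflow/envs").items():
        print(f"{package} (ver {version})")
```

A line-based regular expression is used instead of a YAML parser so the sketch has no third-party dependencies; for anything beyond a quick lookup, a proper YAML parser would be more robust.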

Data Acquisition & Processing. Public sequencing data were retrieved from [GSA/SRA/ENA/DDBJ] under the accession(s) [accession_ids] using iSeq (ver) [ref]. The data were downloaded as FASTQ files (and converted to unmapped BAM (uBAM) files using Picard FastqToSam (ver) [ref], preserving sample information and read groups while supporting both single-end and paired-end sequencing data). Metadata for each dataset was collected and merged into a single comprehensive reference file.

The data acquisition and processing described here were performed using a publicly available Snakemake (ver) [ref] workflow (DOI: 10.5281/zenodo.15005419).
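To make the optional FASTQ-to-uBAM step more concrete, here is a minimal sketch of how a Picard FastqToSam call could be assembled for single-end and paired-end samples. It only builds the command line and does not run it. The helper name `fastq_to_sam_command`, the `picard` wrapper executable on the PATH, the use of the sample name as read group name, and the default platform value are all assumptions for illustration; the workflow's actual rule may set these differently.

```python
import shlex


def fastq_to_sam_command(sample, output_bam, fastq1, fastq2=None, platform="ILLUMINA"):
    """Build a Picard FastqToSam command as an argument list (not executed)."""
    cmd = [
        "picard", "FastqToSam",          # assumes a `picard` wrapper is installed
        "--FASTQ", str(fastq1),          # first (or only) read file
        "--OUTPUT", str(output_bam),     # unmapped BAM to write
        "--SAMPLE_NAME", sample,         # stored in the read group header
        "--READ_GROUP_NAME", sample,     # assumption: one read group per sample
        "--PLATFORM", platform,          # assumption: illustrative default
    ]
    if fastq2 is not None:
        # Paired-end data: the second mate file is passed separately.
        cmd += ["--FASTQ2", str(fastq2)]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    single = fastq_to_sam_command("sampleA", "sampleA.bam", "sampleA.fastq.gz")
    paired = fastq_to_sam_command(
        "sampleB", "sampleB.bam", "sampleB_1.fastq.gz", "sampleB_2.fastq.gz"
    )
    print(shlex.join(single))
    print(shlex.join(paired))
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting issues.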

🚀 Features

The workflow performs the following steps that produce the outlined results:

The workflow produces the following directory structure:

{result_path}/
└── fetch_ngs/
    ├── metadata.csv                # merged metadata for all accessions
    ├── .fastq_to_bam/              # processing marker files
    │   └── [accession].done
    └── [accession]/                # one directory per accession
        ├── [accession].metadata.csv  # metadata for this accession
        └── [sample].[bam/fastq.gz]   # sequence files
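The layout above can be navigated programmatically. As a rough illustration, the sketch below gathers every per-accession `[accession].metadata.csv` file under `fetch_ngs/` and writes their union to `fetch_ngs/metadata.csv`. It uses the standard-library `csv` module so it runs without extra packages, and it is not the workflow's own implementation; the function name `merge_metadata` and the choice to keep columns in first-seen order are assumptions.

```python
import csv
from pathlib import Path


def merge_metadata(result_path):
    """Merge all fetch_ngs/[accession]/[accession].metadata.csv files into one CSV."""
    module_dir = Path(result_path) / "fetch_ngs"
    rows, columns = [], []
    for meta_file in sorted(module_dir.glob("*/*.metadata.csv")):
        with meta_file.open(newline="") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all columns, in the order first encountered.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    merged = module_dir / "metadata.csv"
    with merged.open("w", newline="") as handle:
        # Columns missing from a given accession are left empty.
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return merged
```

Calling `merge_metadata("results")`, where `"results"` stands in for the configured `{result_path}`, returns the path of the merged file.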

πŸ› οΈ Usage

Here are some tips for the usage of this workflow:

βš™οΈ Configuration

Detailed specifications can be found here ./config/README.md

πŸ“– Examples

Explore detailed examples showcasing module usage in comprehensive end-to-end analyses (including data, configuration, annotation and results) in our MrBiomics Recipes:

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.