Fetch Public Sequencing Data and Metadata Using iSeq

A Snakemake 8 workflow to fetch (download) and process public sequencing data and metadata from GSA, SRA, ENA, GEO and DDBJ databases using iSeq.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project’s repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!

[!IMPORTANT]
If you use this workflow in a publication, please don’t forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.15005419.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn’t be possible without the following software and their dependencies.

| Software | Reference (DOI) |
| --- | --- |
| iSeq | https://github.com/BioOmics/iSeq |
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| Picard | https://broadinstitute.github.io/picard/ |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) must be read from the respective conda environment specifications (workflow/envs/*.yaml file) or, post-execution, from the result directory ({module}/envs/*.yaml). Parameters that must be adapted to the data or workflow configuration are denoted in square brackets, e.g., [X].

Data Acquisition & Processing. Public sequencing data were retrieved from [GSA/SRA/ENA/DDBJ] under the accession(s) [accession_ids] using iSeq (ver) [ref]. The data were downloaded as FASTQ files (and converted to unmapped BAM (uBAM) files using Picard FastqToSam (ver) [ref], preserving sample information and read groups while supporting both single-end and paired-end sequencing data). Metadata for each dataset were collected and merged into a single comprehensive reference file.

The data acquisition and processing described here were performed using a publicly available Snakemake (ver) [ref] workflow (DOI: 10.5281/zenodo.15005419).
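To make the FASTQ-to-uBAM step in the template above more concrete, here is a minimal Python sketch that assembles a Picard FastqToSam command line for single-end or paired-end input. It is an illustration, not the workflow's own rule: the helper name `fastq_to_sam_cmd`, the file names, and the choice to reuse the sample name as the read group name are assumptions, and the `picard` launcher is assumed to be on the PATH (as provided by the bioconda package).

```python
def fastq_to_sam_cmd(fastq1, output_bam, sample, fastq2=None, read_group=None):
    """Build a Picard FastqToSam command as an argument list.

    Hypothetical helper for illustration; the arguments the workflow
    actually passes are not documented in this README.
    """
    cmd = ["picard", "FastqToSam", f"FASTQ={fastq1}"]
    if fastq2 is not None:
        # A second FASTQ file marks the input as paired-end.
        cmd.append(f"FASTQ2={fastq2}")
    cmd += [
        f"OUTPUT={output_bam}",
        f"SAMPLE_NAME={sample}",
        # Assumption: fall back to the sample name as the read group name.
        f"READ_GROUP_NAME={read_group or sample}",
    ]
    return cmd


# Hypothetical paired-end example; print the command instead of running it.
print(" ".join(fastq_to_sam_cmd("S1_1.fastq.gz", "S1.bam", "S1", fastq2="S1_2.fastq.gz")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Picard is installed. Building the command as a list rather than a single string avoids shell quoting issues with unusual file names.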

🚀 Features

The workflow performs the following steps that produce the outlined results:

The workflow produces the following directory structure:

{result_path}/
└── fetch_ngs/
    ├── metadata.csv                # merged metadata for all accessions
    ├── .fastq_to_bam/              # processing marker files
    │   └── [accession].done
    └── [accession]/                # one directory per accession
        ├── [accession].metadata.csv  # metadata for this accession
        └── [sample].[bam/fastq.gz]   # sequence files
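
As a rough illustration of how the layout above might be consumed downstream, the following Python sketch lists the sequence files found for each accession. It assumes only the directory structure shown here; the function name `list_outputs` is hypothetical and not part of the workflow.

```python
from pathlib import Path

# Sequence file endings taken from the directory layout above.
SEQ_SUFFIXES = (".bam", ".fastq.gz")


def list_outputs(result_path):
    """Map each accession directory to its sorted sequence file names."""
    root = Path(result_path) / "fetch_ngs"
    outputs = {}
    for acc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Skip hidden bookkeeping directories such as .fastq_to_bam/.
        if acc_dir.name.startswith("."):
            continue
        outputs[acc_dir.name] = sorted(
            f.name for f in acc_dir.iterdir() if f.name.endswith(SEQ_SUFFIXES)
        )
    return outputs
```

Calling `list_outputs("{result_path}")` with your configured result path returns a dictionary keyed by accession. Per-accession metadata files are ignored because they do not end in one of the sequence suffixes.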

🛠️ Usage

Here are some tips for the usage of this workflow:

⚙️ Configuration

Detailed specifications can be found in ./config/README.md.

📖 Examples

Explore detailed examples showcasing module usage in comprehensive end-to-end analyses (including data, configuration, annotation and results) in our MrBiomics Recipes:

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.
