QC and mapping for RNA-seq
This tutorial walks you through the complete RNA-seq preprocessing workflow using the preprocessing_RNAseq.smk pipeline. You will download public paired-end RNA-seq data, prepare reference files, run the pipeline, and inspect the outputs.
By the end of this tutorial, you will have:
- Quality-controlled FASTQ files
- STAR-aligned BAM files
- BigWig coverage tracks
- RNA-seq QC metrics
- A MultiQC summary report
Prerequisites
Before starting, make sure you have the following installed and configured:
- Singularity (≥ 3.7)
- Snakemake (≥ 7.0)
- SnakeNgs repository cloned locally
- ngsfetch for downloading FASTQ files
- ~20 GB of free disk space (for reference genome, STAR index, and output files)
1. Download example data
In this tutorial, we use three paired-end RNA-seq samples from a study on nonsense-mediated mRNA decay (NMD) in mouse cortical development (GSE295221, BioProject PRJNA1253720; Lin et al., Cell Rep 2026). These are total RNA-seq libraries (TruSeq Stranded Total RNA) from E18.5 mouse cortex.
| Accession | Sample | Genotype | Instrument | Layout | Read length |
|---|---|---|---|---|---|
| SRR33238275 | GSM8943899 (Control Rep1) | Upf2 fl/fl | NovaSeq X Plus | Paired-end | 2 × 150 bp |
| SRR33238274 | GSM8943900 (Control Rep2) | Upf2 fl/+ | NovaSeq X Plus | Paired-end | 2 × 150 bp |
| SRR33238273 | GSM8943901 (Control Rep3) | Upf2 fl/fl | NovaSeq X Plus | Paired-end | 2 × 150 bp |
Create a working directory and download the FASTQ files using ngsfetch:
After downloading, your directory should look like this:
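Before moving on, it can help to confirm that both mates of every run were downloaded. The sketch below is a minimal check, assuming the files are named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` and sit in a `fastq/` directory. Both the naming pattern and the directory name are assumptions, so adjust them to match what ngsfetch actually wrote on your system.

```python
from pathlib import Path

# Run accessions from the sample table above.
ACCESSIONS = ["SRR33238275", "SRR33238274", "SRR33238273"]


def missing_mates(fastq_dir, accessions=ACCESSIONS):
    """Return the expected paired-end FASTQ files that are absent or empty.

    Assumes the naming pattern <accession>_<mate>.fastq.gz, which is a
    guess at the download layout, not something this tutorial guarantees.
    """
    missing = []
    for acc in accessions:
        for mate in (1, 2):
            path = Path(fastq_dir) / f"{acc}_{mate}.fastq.gz"
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path.name)
    return missing


if __name__ == "__main__":
    problems = missing_mates("fastq")  # hypothetical directory name
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All paired FASTQ files are present.")
```

An empty result means all six files exist and are non-empty; it does not verify that the downloads are complete or uncorrupted.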
2. Prepare reference files
The pipeline requires a STAR genome index and a GTF annotation file. Here we use the mouse reference genome (GRCm38/mm10) from Ensembl.
Download the GTF file
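Ensembl release 102 is the last release built on GRCm38, so it is a natural choice here, although the release used by this tutorial is not stated and 102 is an assumption. The sketch below builds the Ensembl FTP URLs for the GTF and for the primary-assembly genome FASTA (which STAR also needs to build the index) and can download them with the standard library. The `reference` output directory name is hypothetical.

```python
import urllib.request
from pathlib import Path

ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def reference_urls(release=102, assembly="GRCm38"):
    """Build Ensembl URLs for the mouse GTF and primary-assembly FASTA.

    Release 102 is an assumption: it is the last Ensembl release on GRCm38.
    """
    base = f"{ENSEMBL_FTP}/release-{release}"
    return {
        "gtf": f"{base}/gtf/mus_musculus/Mus_musculus.{assembly}.{release}.gtf.gz",
        "fasta": f"{base}/fasta/mus_musculus/dna/"
                 f"Mus_musculus.{assembly}.dna.primary_assembly.fa.gz",
    }


def download_references(outdir="reference", release=102):
    """Download both files into outdir, skipping any that already exist."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for url in reference_urls(release).values():
        target = out / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, target)
    return out


if __name__ == "__main__":
    for kind, url in reference_urls().items():
        print(kind, url)
    # Uncomment to fetch the files (the genome FASTA is a large download):
    # download_references()
```

STAR expects uncompressed FASTA and GTF files, so decompress both (for example with `gunzip`) before building the index.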
Build the STAR index
Note
Building a STAR index for the mouse genome requires ~32 GB of RAM and takes approximately 30–60 minutes. If you already have a STAR index, you can skip this step.
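One parameter worth setting deliberately when building the index is `--sjdbOverhang`, which the STAR manual recommends setting to the read length minus one; for the 2 × 150 bp reads used here that is 149. The sketch below assembles a `STAR --runMode genomeGenerate` command under that rule. The file and directory names are placeholders and the thread count is an arbitrary choice, not values taken from this tutorial.

```python
import shlex


def star_index_command(fasta, gtf, index_dir, read_length=150, threads=8):
    """Return the argv list for building a STAR genome index.

    --sjdbOverhang follows the STAR manual's recommendation of
    (read length - 1).
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(index_dir),
        "--genomeFastaFiles", str(fasta),
        "--sjdbGTFfile", str(gtf),
        "--sjdbOverhang", str(read_length - 1),
    ]


if __name__ == "__main__":
    # Placeholder paths; point these at your uncompressed FASTA and GTF.
    cmd = star_index_command(
        fasta="reference/Mus_musculus.GRCm38.dna.primary_assembly.fa",
        gtf="reference/Mus_musculus.GRCm38.102.gtf",
        index_dir="reference/star_index",
    )
    print(shlex.join(cmd))
    # To run it for real: subprocess.run(cmd, check=True)
```

Printing the command first, rather than running it directly, makes it easy to review the paths before committing to a long index build.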
3. Create the configuration file
Create a config.yaml file in the working directory:
Replace /path/to/ with the actual absolute paths on your system.
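A leftover `/path/to/` placeholder or a relative path in the config is an easy mistake to make. The sketch below is a minimal check that reads `config.yaml` as plain `key: value` lines and flags values that still contain the placeholder or look like relative paths. It deliberately assumes no particular key names, and the simple line parser is itself an assumption that only holds for a flat config without nested values.

```python
from pathlib import PurePosixPath

PLACEHOLDER = "/path/to/"


def config_problems(text):
    """Return (key, reason) pairs for suspicious values in a flat YAML config."""
    problems = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        value = value.strip("'\"")
        if PLACEHOLDER in value:
            problems.append((key, "placeholder not replaced"))
        elif "://" in value:
            continue  # URIs such as container images are not file paths
        elif "/" in value and not PurePosixPath(value).is_absolute():
            # Any other value containing a slash is treated as a path here.
            problems.append((key, "relative path"))
    return problems


if __name__ == "__main__":
    try:
        with open("config.yaml") as handle:
            for key, reason in config_problems(handle.read()):
                print(f"{key}: {reason}")
    except FileNotFoundError:
        print("config.yaml not found in the current directory")
```

No output means nothing suspicious was found; the check does not confirm that the paths exist or that the keys are the ones the pipeline expects.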
4. Run the pipeline
Execute the pipeline with Snakemake:
Note
Adjust --cores based on the number of CPU cores available on your system. The first run will pull the required Singularity containers automatically.
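As a rough illustration of how the pieces fit together, the sketch below assembles a Snakemake invocation and caps `--cores` at the number of CPUs the machine reports. The snakefile location inside the SnakeNgs checkout is a hypothetical path, and whether the pipeline needs further options (for example Singularity bind mounts) is not covered here. `-s`, `--configfile`, `--cores`, and `--use-singularity` are standard Snakemake 7 options.

```python
import os
import shlex


def snakemake_command(snakefile, configfile="config.yaml", cores=8):
    """Return the argv list for a containerized Snakemake run."""
    available = os.cpu_count() or 1
    cores = max(1, min(cores, available))  # never request more than exists
    return [
        "snakemake",
        "-s", str(snakefile),
        "--configfile", str(configfile),
        "--cores", str(cores),
        "--use-singularity",
    ]


if __name__ == "__main__":
    # Hypothetical location of the snakefile in your SnakeNgs clone.
    cmd = snakemake_command("/path/to/SnakeNgs/preprocessing_RNAseq.smk")
    print(shlex.join(cmd))
```

Adding Snakemake's `-n` (dry-run) flag to the printed command lists the jobs that would run without executing them, which is a cheap way to catch configuration mistakes first.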
5. Inspect the outputs
Once the pipeline completes, the output directory will have the following structure:
Key output files
- fastp/log/*.html: Per-sample QC reports showing read quality, adapter content, and filtering statistics.
- star/*_Aligned.out.bam: Sorted BAM files with uniquely mapped reads (--outFilterMultimapNmax 1).
- star/*_Log.final.out: STAR alignment summary with mapping rates.
- metrics/*.CollectRnaSeqMetrics: Picard RNA-seq metrics including the percentage of reads mapping to coding, UTR, intronic, and intergenic regions.
- metrics/*.CollectInsertSizeMetrics: Insert size distribution for paired-end data.
- bigwig/*.bw: Normalized coverage tracks that can be loaded in genome browsers such as IGV or the UCSC Genome Browser.
- multiqc/multiqc_report.html: Aggregated QC report combining fastp, STAR, and Picard metrics across all samples.
Open multiqc/multiqc_report.html in a web browser to review the overall quality of your experiment at a glance.
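For a quick scripted check alongside the MultiQC report, the mapping rates can also be read straight from the STAR logs. The sketch below parses the `key | value` lines of a `Log.final.out` file and prints the uniquely mapped percentage per sample. It assumes the logs sit under `star/` with the `_Log.final.out` suffix listed above and that the script runs from the output directory.

```python
from pathlib import Path


def parse_star_log(text):
    """Parse STAR Log.final.out text into a {metric: value} dict of strings."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def unique_rate(text):
    """Return the uniquely mapped read percentage as a float."""
    return float(parse_star_log(text)["Uniquely mapped reads %"].rstrip("%"))


if __name__ == "__main__":
    for log in sorted(Path("star").glob("*_Log.final.out")):
        sample = log.name[: -len("_Log.final.out")]
        print(f"{sample}\t{unique_rate(log.read_text()):.2f}%")
```

The same dictionary exposes the other metrics in the log, such as `Number of input reads`, if you want to tabulate more than one value.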
6. Summary and next steps
In this tutorial, you ran the preprocessing_RNAseq.smk pipeline to perform quality control, alignment, and metric collection for paired-end RNA-seq data.
For detailed parameter descriptions and single-end mode, see the usage documentation.
The aligned BAM files produced by this pipeline can be used as input for the downstream analyses available in SnakeNgs.
The BAM files can also be used as input for Shiba, a tool for differential RNA splicing analysis.