QC, mapping, peak calling for ChIP-seq/ATAC-seq¶
This tutorial walks you through the complete ChIP-seq analysis workflow using two SnakeNgs pipelines: preprocessing_ChIPseq.smk for read preprocessing and callpeak_ChIPseq.smk for peak calling. You will download public ChIP-seq data, prepare reference files, run both pipelines sequentially, and inspect the outputs.
By the end of this tutorial, you will have:
- Quality-controlled and deduplicated BAM files
- Fingerprint plots for enrichment assessment
- BigWig coverage tracks
- MACS2 peak calls
- MultiQC summary reports
Prerequisites
Before starting, make sure you have the following installed and configured:
- Singularity (≥ 3.7)
- Snakemake (≥ 7.0)
- SnakeNgs repository cloned locally
- ngsfetch for downloading FASTQ files
- ~20 GB of free disk space (for reference genome, Bowtie2 index, and output files)
1. Download example data¶
In this tutorial, we use three human ChIP-seq samples from a K562 cell line study (PRJEB72776): two target ChIP samples (H3K4me3 and H3K27ac) sharing one input control. We download paired-end FASTQ files using ngsfetch.
| Accession | Sample | Type | Instrument | Layout | Read length | Download size |
|---|---|---|---|---|---|---|
| ERR12721584 | H3K4me3 | ChIP target | HiSeq 1500 | Paired-end | 2 × 51 bp | ~1.3 Gb |
| ERR12721583 | H3K27ac | ChIP target | HiSeq 1500 | Paired-end | 2 × 51 bp | ~1.3 Gb |
| ERR12721582 | Input (shared) | Input control | HiSeq 1500 | Paired-end | 2 × 51 bp | ~1.4 Gb |
1 2 3 4 5 6 7 8 | |
After downloading, your directory should look like this:
1 2 3 4 5 6 7 8 | |
In this example, ERR12721584 (H3K4me3) and ERR12721583 (H3K27ac) are target (IP) samples, and ERR12721582 is the input control shared between both marks.
2. Prepare reference files¶
The preprocessing pipeline requires a Bowtie2 genome index for alignment.
Download and build the Bowtie2 index¶
Note
Building a Bowtie2 index for the human genome requires ~8 GB of RAM and takes approximately 30–60 minutes. If you already have a Bowtie2 index, you can skip this step.
1 2 3 4 5 6 7 8 9 | |
Step A: Preprocessing¶
The first step uses preprocessing_ChIPseq.smk to perform quality control, alignment, duplicate removal, and coverage track generation.
3. Create the preprocessing configuration file¶
Create a config_preprocessing.yaml file:
1 2 3 4 5 | |
Replace /path/to/ with the actual absolute paths on your system.
Note
Include all samples (both target and input control) in the samples list for preprocessing, as all of them need to be aligned and deduplicated.
4. Run the preprocessing pipeline¶
1 2 3 4 5 | |
5. Inspect the preprocessing outputs¶
Once the pipeline completes, the output directory will have the following structure:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | |
Key output files¶
bowtie2/*.sort.rmdup.bam— Sorted and deduplicated BAM files ready for peak calling.plotFingerprint/fingerprint.png— Fingerprint plot generated by deepTools. This plot helps assess the enrichment quality of your ChIP-seq experiment—target samples should show a clear deviation from the diagonal compared to input.bigwig/*.bw— CPM-normalized coverage tracks for visualization in genome browsers.multiqc_preprocessing/multiqc_report.html— Aggregated QC report combining fastp, Bowtie2, and Picard metrics.
Step B: Peak calling¶
The second step uses callpeak_ChIPseq.smk to call peaks with MACS2, comparing target samples against input control.
6. Create the experiment table¶
Create an experiment_table.tsv file that maps each target sample to its input control. Use tab-separated columns:
1 2 3 | |
Replace /path/to/ with the actual absolute paths on your system. The target column points to the IP BAM file and the control column points to the input BAM file.
7. Create the peak calling configuration file¶
Create a config_callpeak.yaml file:
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
Key parameters:
genomesize— Effective genome size. Usehsfor human,mmfor mouse,cefor C. elegans,dmfor Drosophila.fileformat— UseBAMPEfor paired-end BAM files, orBAMfor single-end.broad— Set toTruefor broad histone marks (e.g., H3K27me3, H3K36me3). UseFalsefor narrow peaks (e.g., transcription factors, H3K4me3).
8. Run the peak calling pipeline¶
1 2 3 4 5 | |
9. Inspect the peak calling outputs¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
Key peak calling output files¶
macs2/*_peaks.narrowPeak— Called peaks in ENCODE narrowPeak format with coordinates, significance, and fold enrichment.macs2/*_peaks.xls— Detailed peak information including p-values and q-values.macs2/*_summits.bed— Peak summit positions (single base-pair resolution).macs2/*_treat_pileup.bw— Treatment pileup converted to BigWig format for genome browser visualization.multiqc_callpeak/multiqc_report.html— Aggregated report for peak calling statistics.
Note on ATAC-seq¶
The same preprocessing pipeline (preprocessing_ChIPseq.smk) can be used for ATAC-seq data. For peak calling on ATAC-seq, use the callpeak_ATACseq.smk pipeline instead of callpeak_ChIPseq.smk. The key differences are:
- ATAC-seq peak calling does not require a control (input) sample.
- The config uses a sample list instead of an experiment table.
- Use
fileformat: BAMPEfor paired-end ATAC-seq data.
For details, see the callpeak_ATACseq.smk usage documentation.
10. Summary and next steps¶
In this tutorial, you ran two pipelines sequentially:
preprocessing_ChIPseq.smk— QC, Bowtie2 alignment, deduplication, fingerprint plotting, and BigWig generation.callpeak_ChIPseq.smk— MACS2 peak calling with input control normalization.
For detailed parameter descriptions, see the usage documentation for preprocessing_ChIPseq.smk and callpeak_ChIPseq.smk.
The peak files produced by these pipelines can be used as input for downstream analyses available in SnakeNgs:
- differential_ATACseq.smk — Differential accessibility analysis with DESeq2
- footprinting_ATACseq.smk — Transcription factor footprinting with TOBIAS
- footprinting_timeseries_ATACseq.smk — Time-series footprinting analysis