Denoising Amplicon Sequence Variants Using DADA2, DeBlur, and UNOISE3 with QIIME2
Below I provide scripts to implement several workflows for denoising 16s rRNA gene sequences used by the Microbial Metagenomics Analysis Center (MMAC) at CCHMC for paired-end data. These scripts are written to run on the CCHMC high-performance computing (HPC) cluster. This post is intended to provide collaborators working with the MMAC detailed information on how their sequence data was processed. If we did not discuss any modifications to these workflows for your project, this is how we processed the raw sequence files. We continually work to improve our workflows, so please check with us before citing these packages and versions to ensure the website is up-to-date.
For those investigators looking to run the DADA2 or DeBlur workflows on your own, you would simply need to convert the scripts to .bat files to be run on the cluster. See here for some instructions or reach out to us if you have questions. UNOISE3 requires a license. So reach out to us if you are interested in applying this approach to your data.
The scripts below denoise paired-end fastq files obtained from Illumina sequencing using well-established tools and return amplicon sequence variants (ASVs)…a.k.a. exact sequence variants (ESVs), sub operational taxonomic units (sOTUs), zero radius OTUs (zOTUs), etc. The goal being to resolve errors in amplicon sequence data at the level of single nucliotides versus clustering sequences based on some percent similarity (i.e., 97%). Here is a link to a paper from Ben Callahan, Paul McMurdie, and Susan Holmes arguing that Exact sequence variants should replace operational taxonomic units in marker-gene data analysis if you want to learn more about the distinction in approaches and the advantages denoising may provide.
Denoising at the level of single nucliotide resolution is of course not without limitations. Some highly insightful thoughts are provided by Pat Schloss here in response to Robert Edgar’s paper Updating the 97% identity threshold for 16S ribosomal RNA OTUs. That a given bacterial genome can contain multiple copies of the 16S rRNA gene that differ by one or more nucliotide, and denoising algorithms may split these into separate variants, is perhaps of most concern; however, these approaches have shown excellent performance on simulated and mock data sets.
One approach, that I do not highlight here, is that one could generate ASVs as the first step in an analysis, and then assess the sensitivity of the study results after aggregating the ASVs at differing percent similarities as described here or by just lumping ASVs of interest that differ by only a single (or small number of) base(s). I think the main point is to be cognizant of the differences in the approaches and to appreciate the limits in the amount of information that can be provided by small amplicon spanning regions.
The workflows provided below denoise raw fastq file using:
- DADA2 - DADA2: High-resolution sample inference from Illumina amplicon data
- DeBlur - Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns
- UNOISE3 - UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing
The primary steps of each include:
- denoising paired-end reads
- assigning taxonomy
- fragment insertion via SEPP to generate a phylogenetic tree
Taxonomic classification and fragment insertion is performed using QIIME2 against the greengenes database. We use this approach since constructing de-novo phylogenies using ~250bp fragments is likely fraught with difficultly (and not an area for which I possess expertise). So likely better to insert fragments into an existing tree developed by those who know a lot about this. Here is a paper describing the approach and highlighting improved performance over de-novo construction. The NB classifier is commonly used, so I will not discuss that here, other than to say once you have the representative sequences it is straight forward to apply labels using other reference databases and algorithms. Reach out to us if you have nay questions.
The scripts assume you are working with 2 x 250bp files generated using the 515F-806R primer set (V4) since this is the most common set up used by the Microbial Genomics and Metagenomics Laboratory and the DNA Core. If you have single-end files (e.g., forward reads only) send us an email and we can provide modified scripts. If you want to use a different primer set up, reach out to us before conducting the sequencing as this will require modifications to the workflows below.
Denoising with DADA2
The script requests the process be run on 10 cores with 64G RAM and 10 hr of wall time. You will need to modify this as needed. You will also need to change the path below to point to your fastq files and follow the instructions provided at the top of the script properly set up the file structure etc. You will also need to change the path for the error (-e) and output (-o) files from /data/MMAC-Shared-Drive/scratch/%J.err to wherever you want your log files to be saved (unless you have access to the MMAC-Shared-Drive). Otherwise you will not be able to access them. You will need to do this for all three scripts.
Note that DADA2 learns the error model for single sequencing run. So if you want to process samples from more than one run you will need to modify this code to denoise each run separately and then combine the “ASV/OTU” tables before calling the subsequent commands.
#BSUB -W 10:00
#BSUB -M 64000
#BSUB -n 10
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out
#Script to run DADA2 workflow for 16S V4 reads via QIIME2
# - Denoising: DADA2
# - Taxonomic classification: GG13_8 Naive Bayesian classifier
# - Phylogenetic tree: SEPP
#Program Assumes:
# - demultiplexed F and R fastq files are placed in "seqs" folder within project folder
# - files are compressed (.gz)
# - primers and all non-biologic sequences have been removed
# - sequence files are in Casava 1.8 format (if not need to generate QIIME2 sample manifest and place in folder; see example code at bottom)
# - example format for manifest_file.tsv provided below
# - do not generate manifest file in excel due to incompatible line endings
#Sample Manifest Format:
#sample-id forward-absolute-filepath reverse-absolute-filepath
#F3D0.S188 /scratch/olljt2/project/seqs/F3D0_S188_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D0_S188_L001_R2_001.fastq
#F3D141.S207 /scratch/olljt2/project/seqs/F3D141_S207_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D141_S207_L001_R2_001.fastq
#F3D142.S208 /scratch/olljt2/project/seqs/F3D142_S208_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D142_S208_L001_R2_001.fastq
#To run the script
# - create a new folder for the project
# - change first line of code below to navigate to the project folder
# - enter bsub at command line to run: bsub < dada2_v4_qiime2.bat
# - may need to modify wall time etc. given size of project (above should be overkill for single MiSeq run)
#Navigate to project folder (change me)
cd /data/MMAC-Shared-Drive/example_data/example_runs/amplicon/dada2_v4_q2
#Load qiime2
module load qiime/2.2020.2
#Convert fastq files to q2 artifact
mkdir q2
mkdir q2/reports
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path seqs/ \
--output-path q2/paired-end-demux.qza \
--input-format CasavaOneEightSingleLanePerSampleDirFmt
qiime demux summarize \
--i-data q2/paired-end-demux.qza \
--o-visualization q2/reports/paired-end-demux_summary.qzv
#Denoise reads with DADA2
cd q2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 240 \
--p-trunc-len-r 160 \
--o-representative-sequences rep-seqs.qza \
--o-table table.qza \
--o-denoising-stats stats-dada2.qza\
--p-n-threads 10
#notes: - trunc-len and -p-timr-lef are tuning parameters and the length should be based on visual examination of the q-score distribution over the read length; maxee and other parameters can also be modifed
qiime metadata tabulate --m-input-file stats-dada2.qza --o-visualization reports/stats-dada2.qzv
qiime feature-table summarize --i-table table.qza --o-visualization reports/table.qzv
qiime feature-table tabulate-seqs --i-data rep-seqs.qza --o-visualization reports/rep-seqs.qzv
#Assign taxonomy using pre-trained (515F-806R) GG ref db
qiime feature-classifier classify-sklearn \
--i-classifier /data/MMAC-Shared-Drive/ref_databases/qiime2/gg-13-8-99-515-806-nb-classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza \
--p-confidence .8 \
--p-n-jobs -10
qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization reports/taxonomy.qzv
#Fragment insertion via SEPP for phylogenetic tree
qiime fragment-insertion sepp \
--i-representative-sequences rep-seqs.qza \
--i-reference-database /data/MMAC-Shared-Drive/ref_databases/qiime2/sepp-refs-gg-13-8.qza \
--o-tree tree.qza \
--o-placements insertion-placements.qza \
--p-threads 10
#Cleaning up files
mkdir other
mv paired-end-demux.qza other/.
mv stats-dada2.qza other/.
mv insertion-placements.qza other/.
#Can drop in code if fastq files are not in Casava 1.8 format (will need to provide manifest_file.tsv)
#qiime tools import \
# --type 'SampleData[PairedEndSequencesWithQuality]' \
# --input-path manifest_file.tsv \
# --output-path q2/paired-end-demux.qza \
# --input-format PairedEndFastqManifestPhred33V2
Denoising with DeBlur
DeBlur may be an attractive option if you have samples form multiple sequencing runs since it uses a “pre-computed” error model and scales well to many samples. It may also be attractive if you have memory or computational constraints or shorter bp fragments (say ~150bp), but appears to perform well for longer reads.
#BSUB -W 10:00
#BSUB -M 64000
#BSUB -n 10
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out
#Script to run DeBlur workflow for 16S V4 reads via QIIME2
# - Denoising: DeBlur (merge F and R reads using vsearch)
# - Taxonomic classification: GG13_8 Naive Bayesian classifier
# - Phylogenetic tree: SEPP
#Program Assumes:
# - demultiplexed F and R fastq files are placed in "seqs" folder within project folder
# - files are compressed (.gz)
# - primers and all non-biologic sequences have been removed
# - sequence files are in Casava 1.8 format (if not need to generate QIIME2 sample manifest and place in folder; see example code at bottom)
# - example format for manifest_file.tsv provided below
# - do not generate manifest file in excel due to incompatible line endings
# - DeBlur will not accept underscores in the sample-id field
#Sample Manifest Format:
#sample-id forward-absolute-filepath reverse-absolute-filepath
#F3D0.S188 /scratch/olljt2/project/seqs/F3D0_S188_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D0_S188_L001_R2_001.fastq
#F3D141.S207 /scratch/olljt2/project/seqs/F3D141_S207_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D141_S207_L001_R2_001.fastq
#F3D142.S208 /scratch/olljt2/project/seqs/F3D142_S208_L001_R1_001.fastq /scratch/olljt2/project/seqs/F3D142_S208_L001_R2_001.fastq
#To run the script
# - create a new folder for the project
# - change first line of code below to navigate to the project folder
# - navigate to the project folder
# - enter bsub at command line to run: bsub < deblur_v4_qiime2.bat
# - may need to modify wall time etc. given size of project (above should be overkill for single MiSeq run)
#Navigate to project folder (change me)
cd /data/MMAC-Shared-Drive/example_data/example_runs/amplicon/deblur_v4_q2
#Load qiime2
module load qiime/2.2020.2
#Convert fastq files to q2 artifact
mkdir q2
mkdir q2/reports
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path seqs/ \
--output-path q2/paired-end-demux.qza \
--input-format CasavaOneEightSingleLanePerSampleDirFmt
qiime demux summarize \
--i-data q2/paired-end-demux.qza \
--o-visualization q2/reports/paired-end-demux_summary.qzv
#Denoise reads with DeBlur
cd q2
qiime vsearch join-pairs \
--i-demultiplexed-seqs paired-end-demux.qza \
--o-joined-sequences demux-joined.qza \
--p-minmergelen 240 \
--p-maxmergelen 270 \
--p-maxdiffs 10
qiime demux summarize --i-data demux-joined.qza --o-visualization reports/demux-joined.qzv
qiime quality-filter q-score-joined \
--i-demux demux-joined.qza \
--o-filtered-sequences demux-filtered.qza \
--o-filter-stats demux-filter-stats.qza
qiime metadata tabulate --m-input-file demux-filter-stats.qza --o-visualization reports/demux-filter-stats.qzv
qiime deblur denoise-16S \
--i-demultiplexed-seqs demux-filtered.qza \
--p-trim-length 240 \
--o-representative-sequences rep-seqs.qza \
--o-table table.qza \
--p-sample-stats \
--p-jobs-to-start 10 \
--o-stats deblur-stats.qza
#notes: - trim-length is a tuning parameter and the length should be based on visual examination of the q-score distribution over the read length
qiime deblur visualize-stats --i-deblur-stats deblur-stats.qza --o-visualization reports/deblur-stats.qzv
qiime feature-table summarize --i-table table.qza --o-visualization reports/table.qzv
qiime feature-table tabulate-seqs --i-data rep-seqs.qza --o-visualization reports/rep-seqs.qzv
#Assign taxonomy using pre-trained (515F-806R) GG ref db
qiime feature-classifier classify-sklearn \
--i-classifier /data/MMAC-Shared-Drive/ref_databases/qiime2/gg-13-8-99-515-806-nb-classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza \
--p-confidence .8 \
--p-n-jobs -10
qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization reports/taxonomy.qzv
#Fragment insertion via SEPP for phylogenetic tree
qiime fragment-insertion sepp \
--i-representative-sequences rep-seqs.qza \
--i-reference-database /data/MMAC-Shared-Drive/ref_databases/qiime2/sepp-refs-gg-13-8.qza \
--o-tree tree.qza \
--o-placements insertion-placements.qza \
--p-threads 10
#Cleaning up files
mkdir other
mv deblur.log other/.; mv demux-filter-stats.qza other/.; mv paired-end-demux.qza other/.
mv deblur-stats.qza other/.; mv demux-joined.qza other/.; mv demux-filtered.qza other/.
mv insertion-placements.qza other/.
#Can drop in code if fastq files are not in Casava 1.8 format (will need to provide manifest_file.tsv)
#qiime tools import \
# --type 'SampleData[PairedEndSequencesWithQuality]' \
# --input-path manifest_file.tsv \
# --output-path q2/paired-end-demux.qza \
# --input-format PairedEndFastqManifestPhred33V2
Denoising with UNOISE3
UNOISE3 can be applied across sequencing runs and is very fast (mapping reads back to zOTUs often longest step). However, 64-bit version requires a license.
#BSUB -W 10:00
#BSUB -M 64000
#BSUB -n 10
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out
#Script to run UNOISE3 workflow for 16S V4 reads
# - Denoising: USEARCH UNOISE3
# - Taxonomic classification: GG13_8 Naive Bayesian classifier via QIIME2
# - Phylogenetic tree: SEPP via QIIME2
#Program Assumes:
# - demultiplexed F and R fastq files (decompressed) are placed in "seqs" folder within project folder
# - primers and all non-biologic sequences have been removed
# - files are unziped
# - file names end in _R1.fastq and _R2.fastq (allows us to use -relabel option of the fastq_mergepairs command)
# - if files do not meet formating requirments rename with modification to command below (or similar)
#for file in *; do mv "$file" "${file//_1/_R1}"; done
#for file in *; do mv "$file" "${file//_2/_R2}"; done
#for file in *; do mv "$file" "${file//_001/}"; done
#for file in *; do mv "$file" "`echo $file | sed "s/_[^_]*_/_/"`"; done
#To run the script
# - create a new folder for the project
# - change first line of code below to navigate to the project folder
# - navigate to the project folder
# - enter bsub at command line to run: bsub < unoise3_v4_qiime2.bat
# - may need to modify wall time etc. given size of project (above should be overkill for single MiSeq run)
#Navigate to project folder (change me)
cd /data/MMAC-Shared-Drive/example_data/example_runs/amplicon/unoise3_v4_q2
#UNOISE3 pipeline
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -fastq_mergepairs seqs/*_R1.fastq -relabel @ -fastaout merged.fa -fastqout merged.fq -fastq_maxdiffs 10 -fastq_minmergelen 240 -fastq_maxmergelen 270 -log merge_log.txt -threads 10
#notes: - examine totals output; consider modification of the filtering option if too few reads are being retained (and assess why)
# - command options: https://drive5.com/usearch/manual/merge_options.html
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -fastq_filter merged.fq -fastq_maxee 1.0 -fastaout filtered.fa -relabel Filt -threads 10 -log filter_log.txt
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -fastx_uniques filtered.fa -sizeout -relabel Uniq -fastaout uniques.fa -threads 10 -log uniques_log.txt
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -unoise3 uniques.fa -zotus zotus.fa -log unoise_log.txt
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -fastx_learn uniques.fa -output denoising_report.txt
/data/MMAC-Shared-Drive/software/usearch/v11/usearch11.0.667_i86linux64 -otutab merged.fq -zotus zotus.fa -strand plus -otutabout unoise3_zotu_table.txt -biomout unoise3_zotu_table.biom -mapout zmap.txt -threads 10 -log otutab_log.txt
#Load QIIME2
module load qiime/2.2020.2
#Convert zOTU table and rep-seqs to q2 artifacts
mkdir q2
mkdir q2/reports
qiime tools import \
--input-path zotus.fa \
--output-path q2/rep-seqs.qza \
--type "FeatureData[Sequence]"
qiime tools import \
--input-path unoise3_zotu_table.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV100Format \
--output-path q2/table.qza
#Clean up USEARCH files
mkdir unoise3
mv *fq unoise3/.
mv *fa unoise3/.
mv *txt unoise3/.
mv *biom unoise3/.
#Assign taxonomy using pre-trained (515F-806R) GG ref db
cd q2
qiime feature-classifier classify-sklearn \
--i-classifier /data/MMAC-Shared-Drive/ref_databases/qiime2/gg-13-8-99-515-806-nb-classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza \
--p-confidence .8 \
--p-n-jobs -10
qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization reports/taxonomy.qzv
#Fragment insertion via SEPP for phylogenetic tree
qiime fragment-insertion sepp \
--i-representative-sequences rep-seqs.qza \
--i-reference-database /data/MMAC-Shared-Drive/ref_databases/qiime2/sepp-refs-gg-13-8.qza \
--o-tree tree.qza \
--o-placements insertion-placements.qza \
--p-threads 10
mv insertion-placements.qza reports/.
Once these scripts are run the processing of the raw fastq files is complete. The scripts will return an “OTU” table, taxonomy, tree file, and representative sequences file. The next step for us is typically to merge the microbiome data with the relevant metadata files using the phyloseq package in R to facilitate downstream statistical analyses (or to analyze in QIIME2, etc.).
Please reach out to us if you have any questions and we will have a MMAC member respond.