Zika-USVI Pipeline
Pipeline for processing raw nanopore-sequenced Zika samples to create consensus genomes. Developed for use on Rhino.
pipeline.py
This script constitutes the bulk of the pipeline, generating consensus genomes from demultiplexed FASTAs.
Arguments:
--data_dir
: Directory containing all libraries and data. Default is/fh/fast/bedford_t/zika-seq/data/
. Contained in this directory should be demultiplexed FASTAs in the subfolder<run>/alba121/workspace/demux/
--samples_dir
: Directory containing aruns.tsv
file and asamples.tsv
file. These files are parsed bypipeline.py
to generate metadata for each sample. Default is/fh/fast/bedford_t/zika-seq/samples/
.--build_dir
: Directory into which all intermediary files and output frompipeline.py
will be written. Default is/fh/fast/bedford_t/zika-seq/build/
.--prefix
: String that will be prepended onto all output consensus genome files. Default isZIKA_USVI
.--samples
: Names of samples that should be processed. Acceptable samples can be found in<samples_dir>/samples.tsv
. Samples should be listed separated by spaces. If excluded from command, all samples listed insamples.tsv
will be processed.--dimension
: Dimension of the library being processed. Options are1d
or2d
; default is2d
.--run_steps
: Numbered steps to run (explained below in more detail):- Construct sample FASTAs
- Process sample FASTAs
- Gather consensus FASTAs
- Generate coverage overlap plots
- Calculate per-base error rates
Pipeline overview
Construct sample FASTAs
FASTAs for each sample are constructed by concatenating the two demultiplexed FASTAs that correspond to a given sample. The complete FASTA is written to <build_dir>
.
Process sample FASTAs
Take a complete FASTA and construct a consensus genome. This is done by calling the script fasta_to_consensus_<dimension>.sh
, which does the following:
- Aligns reads with
bwa mem
. - Trims alignment to primer start sites with
align_trim.py
. - Normalizes reads to sequencing depth of 500 in order to save time with
align_trim.py
. - Creates a sorted BAM file with
samtools sort
andsamtools index
. - Variant calling using
nanopolish variants
. - Consensus genome construction with
margin_cons.py
. Output consensus genomes are written to<build_dir>
by sample name.Gather consensus FASTAs
Iterates over all samples in
<build_dir>
to determine the percent of the genome that was called as āNā, and joins the consensus FASTAs into one of three files depending on percent N:ZIKA_USVI_good.fasta
: Less than 20% NZIKA_USVI_partial.fasta
: Between 20% and 50% NZIKA_USVI_poor.fasta
: Less than 50% NGenerate coverage overlap plots
Looks at the BAM files to determine the depth of sequencing at each site along the reference genome. Using this, plots are generated using
depth_coverage.R
. PNG files for each plot are written to<build_dir>
.- Note: These plots can be generated without consensus genomes by running
depth_process.py
before callingdepth_coverage.R
.Calculate per-base error rates
Walks through VCF files made by
nanopolish variants
to determine per-base error rates. Currently broken