Zika-USVI Pipeline
Pipeline for processing raw nanopore-sequenced Zika samples to create consensus genomes. Developed for use on Rhino.
pipeline.py
This script constitutes the bulk of the pipeline, generating consensus genomes from demultiplexed FASTAs.
Arguments:
--data_dir: Directory containing all libraries and data. Default is/fh/fast/bedford_t/zika-seq/data/. Contained in this directory should be demultiplexed FASTAs in the subfolder<run>/alba121/workspace/demux/--samples_dir: Directory containing aruns.tsvfile and asamples.tsvfile. These files are parsed bypipeline.pyto generate metadata for each sample. Default is/fh/fast/bedford_t/zika-seq/samples/.--build_dir: Directory into which all intermediary files and output frompipeline.pywill be written. Default is/fh/fast/bedford_t/zika-seq/build/.--prefix: String that will be prepended onto all output consensus genome files. Default isZIKA_USVI.--samples: Names of samples that should be processed. Acceptable samples can be found in<samples_dir>/samples.tsv. Samples should be listed separated by spaces. If excluded from command, all samples listed insamples.tsvwill be processed.--dimension: Dimension of the library being processed. Options are1dor2d; default is2d.--run_steps: Numbered steps to run (explained below in more detail):- Construct sample FASTAs
- Process sample FASTAs
- Gather consensus FASTAs
- Generate coverage overlap plots
- Calculate per-base error rates
Pipeline overview
Construct sample FASTAs
FASTAs for each sample are constructed by concatenating the two demultiplexed FASTAs that correspond to a given sample. The complete FASTA is written to <build_dir>.
Process sample FASTAs
Take a complete FASTA and construct a consensus genome. This is done by calling the script fasta_to_consensus_<dimension>.sh, which does the following:
- Aligns reads with
bwa mem. - Trims alignment to primer start sites with
align_trim.py. - Normalizes reads to sequencing depth of 500 in order to save time with
align_trim.py. - Creates a sorted BAM file with
samtools sortandsamtools index. - Variant calling using
nanopolish variants. - Consensus genome construction with
margin_cons.py. Output consensus genomes are written to<build_dir>by sample name.Gather consensus FASTAs
Iterates over all samples in
<build_dir>to determine the percent of the genome that was called as āNā, and joins the consensus FASTAs into one of three files depending on percent N:ZIKA_USVI_good.fasta: Less than 20% NZIKA_USVI_partial.fasta: Between 20% and 50% NZIKA_USVI_poor.fasta: Less than 50% NGenerate coverage overlap plots
Looks at the BAM files to determine the depth of sequencing at each site along the reference genome. Using this, plots are generated using
depth_coverage.R. PNG files for each plot are written to<build_dir>.- Note: These plots can be generated without consensus genomes by running
depth_process.pybefore callingdepth_coverage.R.Calculate per-base error rates
Walks through VCF files made by
nanopolish variantsto determine per-base error rates. Currently broken