Pipelines for MinION sequencing of Zika virus

Zika bioinformatics pipeline

This directory contains scripts and files to run the bioinformatic analysis of Zika genomes from the USVI. Broadly, there are 3 distinct pipelines.

1) Pipeline for genomes sequenced on the Illumina MiSeq.
2) Pipeline for genomes sequenced on the MinION, run on the Rhino cluster at Fred Hutch.
3) Pipeline for genomes sequenced on the MinION, run via a Docker image (currently deprecated).

The vast majority of this README describes the bioinformatic pipeline to analyze MinION data on the Rhino cluster. Information on the MiSeq pipeline can be found in the miseq_pipeline directory.

For explanation of the Docker pipeline, see docker_README.md.


Running the MinION bioinformatic pipeline on Rhino

Required modules

Different modules are required for different parts of the pipeline; they can be loaded on Rhino using module load <module name>. Descriptions of when each module should be loaded are in the step-by-step instructions below.

  • Basecalling:
    • Python/3.5.2-foss-2016b-fh2: Albacore 1.2.1
  • Barcode demultiplexing:
    • Python/2.7.13-foss-2016b-fh2: Poretools
    • Python/3.5.2-foss-2016b-fh1: Porechop
  • Consensus genome generation: (all necessary for pipeline.py step)
    • Python/3.6.1-foss-2016b-fh1
    • R/3.4.0-foss-2016b-fh1
    • BWA/0.7.15-foss-2016b
    • SAMtools/1.3.1-foss-2016b
    • nanopolish/0.7.1-foss-2016b: currently breaks one step of the pipeline; waiting for an update from SciComp.

Step-by-step running instructions

Notes
  • <RAW_READS_DIRECTORY> needs to be a global path, for instance /fh/fast/bedford_t/zika-seq/data/<RUN>/raw_reads/
  • <BASECALLED_READS_DIRECTORY> is assumed to be /fh/fast/bedford_t/zika-seq/data/<RUN>/alba121/ by pipeline.py. If this is not the case, errors will arise on step 4.2.
  • <CONFIG_FILE> is r94_250bps_nsk007_2d.cfg for 2D reads and r94_450bps_linear.cfg for 1D reads.
  • <SAMPLES_TO_RUN> are, for example: VI1 VI2 VI3.
Basecall raw reads using Albacore 1.2.1 installed on the cluster:
  1. Load Albacore with module load Python/3.5.2-foss-2016b-fh2.
  2. Submit the Albacore job as follows: sbatch --time=48:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="read_fast5_basecaller.py -i <RAW_READS_DIRECTORY> -t 8 --config <CONFIG_FILE> -r -s <BASECALLED_READS_DIRECTORY> -o fast5"
Reorganizing the workspace directory:

Albacore writes basecalled fast5 files into subdirectories within the workspace directory. Each subdirectory is given a number, starting at 0/. Albacore writes 4000 files into a subdirectory, then creates the next one in numerical order, so by the end you'll have many subdirectories within workspace. Importantly, Porechop, which we use for demultiplexing the barcoded reads, requires that all the files be in a single directory. Therefore, the first step once you have basecalled reads is to use prep_lib.py to move all the files out of the subdirectories into a single directory. This script also makes a directory called demux/, which is where Porechop will write the demultiplexed fasta files.

  1. Submit the library prep job as sbatch --time=48:00:00 --mem=10000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/basecall/prep_lib.py --library <LIBRARY_NUMBER>".
Note: Your library might be huge, in which case there will be too many files in workspace for Porechop to handle.

If this is the case, then you will need to split the files back up into smaller subdirectories and submit a Porechop job for each subdirectory. If your library has more than 4 million reads, you can use the split_1d_library.py script to sort reads into directories containing 500,000 reads each.

Extract fasta files from basecalled reads and demultiplex reads based on barcoding:
  1. Load nanopolish with ml nanopolish/0.7.1-foss-2016b.
  2. Change working directory to <BASECALLED_READS_DIRECTORY>/workspace/demux/.
  3. Submit job as sbatch --time=96:00:00 --mem=32000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="$EBROOTNANOPOLISH/nanopolish extract -b albacore -t template -o nanopolish_full.fasta ../".
  4. Load Porechop with module load Python/3.5.2-foss-2016b-fh1 (uses Python 3).
  5. Submit job as sbatch --time=24:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="porechop -i <FILENAME.fasta> -b . --barcode_threshold 75 --threads 16 --check_reads 100000".
  6. Once job completes, run gunzip NB* from within the directory to unzip the files in preparation for consensus genome generation.

Porechop will write out a fasta file for each barcode in the demux/ directory (for example, demux/NB01.fasta). If you had a large library, and you split basecalled reads into subdirectories of 500,000 files each, and submitted a Porechop job for each subdirectory, then you'll have multiple fastas with the demultiplexed reads. In such cases, use cat_demux_fastas.py to consolidate all of the reads into a single fasta for each barcode.

Generating consensus sequences:
  1. Change working directory to /fh/fast/bedford_t/zika-seq/.
  2. Load all of the following modules (ml is a default Rhino alias for module load):
     • ml R/3.4.0-foss-2016b-fh1
     • ml Python/3.6.1-foss-2016b-fh1
     • ml BWA/0.7.15-foss-2016b
     • ml SAMtools/1.3.1-foss-2016b
     • ml nanopolish/0.7.1-foss-2016b
  3. Submit the job as sbatch --time=48:00:00 --mem=30000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/scripts/pipeline.py --samples <SAMPLES_TO_RUN> --dimension <DIMENSION>".

The pipeline.py script does all the heavy lifting in terms of alignment to a reference, calling variants, and writing consensus genomes. Details are in a separate README. Note that output is written to the build/ directory.

Description of directories within pipeline:

demux/

Contains scripts and files relating to the demultiplexing of basecalled reads.

basecall/

Contains scripts used to prepare libraries for basecalling. This directory is a work in progress and will be made more user-friendly soon.

scripts/

Contains the majority of the scripts necessary to make consensus genomes from basecalled, demultiplexed reads, as well as for the generation of summary statistics and figures.