Zika bioinformatics pipeline
This directory contains scripts and files to run the bioinformatic analysis of Zika genomes from the USVI. Broadly, there are 3 distinct pipelines.
1) Pipeline for genomes sequenced on the Illumina MiSeq. 2) Pipeline for genomes sequenced on the MinION, that runs on the Rhino cluster at Fred Hutch. 3) Pipeline for genomes sequenced on the MinION, run via a Docker image (currently deprecated).
For explanation of the Docker pipeline, see
Running the MinION bioinformatic pipeline on Rhino
Different modules are required for different parts of the pipeline; they can be loaded using
module load <module name> from Rhino. Descriptions of when each module should be loaded are in the step by step instructions below.
Python/3.5.2-foss-2016b-fh2: Albacore 1.2.1
- Barcode demultiplexing:
- Consensus genome generation: (all necessary for
module load Python/3.6.1-foss-2016b-fh1
nanopolish/0.7.1-foss-2016b: Breaks one step of pipeline, waiting for update from SciComp.
Step-by-step running instructions
needs to be a global path, for instance
is assumed to be
pipeline.py. If this is not the case, errors will arise on step 4.2.
r94_250bps_nsk007_2d.cfgfor 2D reads and
r94_450bps_linear.cfgfor 1D reads.
are, for example:
VI1 VI2 VI3.
Basecall raw reads using Albacore 1.2.1 installed on the cluster:
- Load Albacore with
module load Python/3.5.2-foss-2016b-fh2.
- Sbatch out the Albacore job as follows
sbatch --time=48:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="read_fast5_basecaller.py -i <RAW_READS_DIRECTORY> -t 8 --config <CONFIG_FILE> -r -s <BASECALLED_READS_DIRECTORY> -o fast5
Albacore writes basecalled
fast5 files into subdirectories within the
workspace directory. Each directory is given a number (starts at
0/). Albacore writes 4000 files into a directory, and then makes a new directory, in numerical order. In the end, you'll have a lot of subdirectories within workspace. Importantly, Porechop, which we use for demultiplexing the barcoded reads, requires that all the files be in a single directory. Therefore, the first step once you have basecalled reads is to use
prep_lib.py to move all the files out of the subdirectories into a single directory. This script also makes a directory called
demux/, which is where Porechop will write the demultiplexed fasta files.
- Submit the library prep job as
sbatch --time=48:00:00 --mem=10000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/basecall/prep_lib.py --library <LIBRARY_NUMBER>.
Note: Your library might be huge, in which case there will be too many files in
workspace for Porechop to handle.
If this is the case, then you will need to split the files back up into smaller subdirectories, and you'll need to submit a Porechop job for each subdirectory. If your library has more than 4 million reads, you can use the
split_1d_library.py to sort reads into directories that contain 500,000 reads per directory.
Extract fasta files from basecalled reads and demultiplex reads based on barcoding:
- Load Poretools with
module load Python/2.7.13-foss-2016b-fh2(Uses Python 2).
- Change working directory to
- Submit job as
sbatch --time=48:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="poretools fasta ../ > <FILENAME.fasta>.
- Load Porechop with
module load Python/3.5.2-foss-2016b-fh1(uses Python 3).
- Submit job as
sbatch --time=24:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="porechop -i <FILENAME.fasta> -b . --barcode_threshold 75 --threads 16 --check_reads 100000".
- Once job completes, run
gunzip NB*from within the directory to unzip the files in preparation for consensus genome generation.
Porechop will write out a fasta file for each barcode in the
demux/ directory (for example,
demux/NB01.fasta). If you had a large library, and you split basecalled reads into subdirectories of 500,000 files each, and submitted a Porechop job for each subdirectory, then you'll have multiple fastas with the demultiplexed reads. In such cases, use
cat_demux_fastas.py to consolidate all of the reads into a single fasta for each barcode.
Generating consensus sequences:
- Change working directory to
- Load all of the following modules (
mlis a default Rhino alias for
- Submit the job as
sbatch --time=48:00:00 --mem=30000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/scripts/pipeline.py --samples <SAMPLES_TO_RUN> --dimension <DIMENSION>".
pipeline.py script does all the heavy lifting in terms of alignment to a reference, calling variants, and writing consensus genomes. Details are in a separate
README. Note that output is written to the
Description of directories within
Contains scripts and files relating to the demultiplexing of basecalled reads.
Contains scripts used to prepare libraries for basecalling. This directory is a work in progress, it will be made more user friendly soon.
Contains the majority of the scripts necessary to make consensus genomes from basecalled, demultiplexed reads, as well as for the generation of summary statistics and figures.