RethinkDB database to support real-time virus analysis

H7N9 Pipeline Notes


Upload documents to VDB

  1. Download sequences and meta information from GISAID
    • In EPIFLU, select host as human, select HA as required segment, select Submission Date >= last upload date to vdb
    • Ideally download about 5000 isolates at a time, may have to split downloads by submission date
    • Download Isolates as XLS with YYYY-MM-DD date format
    • Download Isolates as "Sequences (DNA) as FASTA"
    • Select all DNA
    • Fasta Header as 0: DNA Accession no., 1: Isolate name, 2: Isolate ID, 3: Segment, 4: Passage details/history, 5: Submitting lab
    • DNA Accession no. | Isolate name | Isolate ID | Segment | Passage details/history | Submitting lab
  2. Move files to fauna/data as gisaid_epiflu.xls and gisaid_epiflu.fasta.
  3. Upload to vdb database
    • python vdb/ -db vdb -v h7n9 --source gisaid --fname gisaid_epiflu
    • Recommend running with --preview to confirm strain names and locations are correctly parsed before uploading
    • Can add to geo_synonyms file and flu_fix_location_label file to fix some of the formatting.

Download documents from VDB

python vdb/ -db vdb -v h7n9 --select locus:PB2 --fstem h7n9_pb2
python vdb/ -db vdb -v h7n9 --select locus:PB1 --fstem h7n9_pb1
python vdb/ -db vdb -v h7n9 --select locus:PA --fstem h7n9_pa
python vdb/ -db vdb -v h7n9 --select locus:HA --fstem h7n9_ha
python vdb/ -db vdb -v h7n9 --select locus:NP --fstem h7n9_np
python vdb/ -db vdb -v h7n9 --select locus:NA --fstem h7n9_na
python vdb/ -db vdb -v h7n9 --select locus:MP --fstem h7n9_mp
python vdb/ -db vdb -v h7n9 --select locus:NS --fstem h7n9_ns