Metadata describing the strain name, host species, year and month of sample collection, type of sample, sample collection method, vRNA copies/ul as assessed by RT-qPCR, days post symptom onset for human samples, and viral clade.
All consensus sequences are available here. The fasta header contains the following information: strain name | sample collection date | country of sampling | host species.
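As a minimal sketch of how the header layout above could be read programmatically, the following splits a header into its four fields. The `ConsensusHeader` type, the `parse_header` function, and the example header (including its date format) are hypothetical illustrations, not part of this repository.

```python
from typing import NamedTuple


class ConsensusHeader(NamedTuple):
    strain: str
    collection_date: str
    country: str
    host: str


def parse_header(line: str) -> ConsensusHeader:
    """Split a '>strain | date | country | host' header into named fields."""
    fields = [field.strip() for field in line.lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}: {line!r}")
    return ConsensusHeader(*fields)


# Hypothetical header, for illustration only (not a real strain in this dataset)
example = parse_header(">A/duck/Cambodia/EXAMPLE/2013 | 2013-01-15 | Cambodia | duck")
```

Splitting on `|` rather than `/` keeps the strain name intact, since influenza strain names themselves contain slashes.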
These files include annotations for the coding regions for each sample genome, in gtf format.
These files contain coverage and quality information for each base covered by sequence data for each sample in this dataset. These files were used to calculate and plot coverage information. Pileup format is described here.
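One way coverage could be summarized from these files is sketched below. It assumes the standard tab-separated pileup layout (reference name, position, reference base, depth, read bases, base qualities); the function name and the toy records are hypothetical, and this is not necessarily how the coverage plots for this dataset were produced.

```python
import io


def mean_depth_per_segment(pileup_handle):
    """Return {reference name: mean depth} over the positions present in a pileup."""
    totals = {}
    for line in pileup_handle:
        if not line.strip():
            continue
        # The first four pileup columns are: reference, position, ref base, depth
        chrom, _pos, _ref, depth = line.rstrip("\n").split("\t")[:4]
        n_sites, depth_sum = totals.get(chrom, (0, 0))
        totals[chrom] = (n_sites + 1, depth_sum + int(depth))
    return {chrom: depth_sum / n_sites for chrom, (n_sites, depth_sum) in totals.items()}


# Toy pileup records, for illustration only
toy_pileup = (
    "HA\t1\tA\t3\t...\tIII\n"
    "HA\t2\tG\t1\t.\tI\n"
    "NA\t1\tC\t4\t.,.,\tIIII\n"
)
depths = mean_depth_per_segment(io.StringIO(toy_pileup))
```

Positions with no reads are typically absent from a pileup, so the mean here is taken over covered positions only.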
Nucleotide diversity data
Nonsynonymous and synonymous diversity were calculated for each coding region (PB2, PB1, PA, HA, NP, NA, M1, M2, NS1, and NEP) for each sample in this dataset using SNPGenie. Combined results from all genes and samples are available in
Human reads were removed from all raw fastq files by mapping to the human reference genome GRCh38 with bowtie2. Only unmapped reads were processed further and used for data analysis. The raw fastq files with human reads filtered out are publicly available in the Sequence Read Archive under BioProject accession number PRJNA547644 (experiment accession numbers SRX5984186-SRX5984198). All within-host variants reported and analyzed in the manuscript are available here. This data file includes all variants present at a frequency of at least 1% in all human and duck samples. FASTQ files were processed and variants were called using this pipeline, briefly outlined below:
- Adapter and quality trimming with Trimmomatic
- Mapping with bowtie2 version 3.2.2.
- Manual inspection of mapping and consensus genome calling with Geneious
- Re-mapping fastq files to the called consensus with bowtie2 version 3.2.2.
Trimming was performed with Trimmomatic to remove Illumina adapter sequences and low-quality read ends. Reads were trimmed in 5 bp sliding windows to a quality score of Q30, and trimmed reads shorter than 100 bp were discarded, using the following command:
java -jar Trimmomatic-0.36/trimmomatic-0.36.jar SE input.fastq output.fastq ILLUMINACLIP:Nextera_XT_adapter.fa:1:30:10 SLIDINGWINDOW:5:30 MINLEN:100
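To illustrate what the `SLIDINGWINDOW:5:30` and `MINLEN:100` settings do conceptually, here is a simplified sketch. It is not Trimmomatic's actual implementation, which may cut at a slightly different position; the function name and the toy quality lists are hypothetical.

```python
def sliding_window_trim(quals, window=5, min_mean_q=30, min_len=100):
    """Return the number of leading bases to keep, or 0 if the read is discarded.

    quals is a list of per-base Phred scores. The read is cut at the start of
    the first window whose mean quality falls below min_mean_q, then dropped
    entirely if fewer than min_len bases remain.
    """
    keep = len(quals)
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) < min_mean_q * window:
            keep = start
            break
    return keep if keep >= min_len else 0


# Toy reads: 120 good bases then a low-quality tail, and a read that ends up too short
long_read = sliding_window_trim([35] * 120 + [10] * 30)
short_read = sliding_window_trim([35] * 50 + [10] * 30)
```

In this sketch the first read is cut inside its high-quality region, a few bases before the tail begins, because the window mean drops below Q30 before the window sits entirely on low-quality bases.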
We performed a local mapping of our trimmed reads to reference sequences previously released by Rith et al. using bowtie2, with the following command:
bowtie2 -x reference_sequence.fasta -U read1.trimmed.fastq,read2.trimmed.fastq -S output.sam --local
The mapping (bam) file was manually inspected in Geneious.
Consensus sequence calling
Consensus sequences were called in Geneious, with nucleotide sites with <100x coverage called as Ns. Consensus genomes were exported in fasta format and are available here.
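The coverage masking rule can be sketched as follows. Geneious's own consensus algorithm is not described here, so this assumes a simple majority-base call per site; `call_consensus` and its input format are hypothetical.

```python
from collections import Counter


def call_consensus(site_bases, min_depth=100):
    """Call a consensus from per-site base observations.

    site_bases is a list of strings, each holding the bases observed at one
    site. Sites with fewer than min_depth observations are called as 'N';
    otherwise the most common base is used (an assumed majority rule).
    """
    consensus = []
    for bases in site_bases:
        if len(bases) < min_depth:
            consensus.append("N")
        else:
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Toy sites: 150x all A, 150x mostly G, and 99x coverage (masked)
toy_consensus = call_consensus(["A" * 150, "A" * 60 + "G" * 90, "C" * 99])
```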
Remapping
To avoid issues with mapping to improper reference sequences, we then remapped each sample's fastq files to its own consensus sequence. These bam files were again manually inspected in Geneious, and a final consensus sequence was called. Consensus genomes are available here as fasta files.
Variants were called using VarScan, requiring a minimum coverage of 100x at the polymorphic site, a minimum quality of Q30, and a minimum SNP frequency of 1%, with the following command:
java -jar VarScan.v2.3.9.jar mpileup2snp input.pileup --min-coverage 100 --min-avg-qual 30 --min-var-freq 0.01 --strand-filter 1 --output-vcf 1 > output.vcf
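The three thresholds can be illustrated with a small sketch for a single site. It assumes `--min-avg-qual` acts as a per-base quality cutoff for counting a read, as the VarScan documentation describes, and it ignores VarScan's other filters (strand filter, p-value, minimum supporting reads). The function name and toy data are hypothetical.

```python
from collections import Counter


def call_minor_variants(observations, ref_base, min_cov=100, min_qual=30, min_freq=0.01):
    """Return {alt base: frequency} for one site.

    observations is an iterable of (base, phred_quality) pairs. Bases below
    min_qual are not counted; sites with fewer than min_cov counted bases
    yield no calls; alternate alleles below min_freq are dropped.
    """
    counted = Counter(base for base, qual in observations if qual >= min_qual)
    depth = sum(counted.values())
    if depth < min_cov:
        return {}
    return {
        base: n / depth
        for base, n in counted.items()
        if base != ref_base and n / depth >= min_freq
    }


# Toy site: G is about 2.5% of counted reads, T is below 1%, C is low quality
toy_site = [("A", 35)] * 195 + [("G", 35)] * 5 + [("T", 35)] * 1 + [("C", 10)] * 50
variants = call_minor_variants(toy_site, ref_base="A")
```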
Amino acid annotation
Coding region changes were annotated using this Jupyter notebook.
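The notebook's contents are not reproduced here. As a rough sketch of what annotating a coding change involves, the following classifies a single-nucleotide change within an in-frame coding sequence using the standard genetic code. `annotate_snp` and the toy sequence are hypothetical, and spliced genes such as M2 and NEP would need their exons joined first.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def annotate_snp(cds, cds_pos, alt_base):
    """Annotate a SNP at 1-based position cds_pos of an in-frame coding sequence.

    Returns (amino acid change such as 'A2T', 'synonymous' or 'nonsynonymous').
    """
    idx = cds_pos - 1
    codon_start = idx - idx % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    offset = idx - codon_start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return f"{ref_aa}{codon_start // 3 + 1}{alt_aa}", kind


# Toy coding sequence ATG GCT AAA (M A K), for illustration only
silent = annotate_snp("ATGGCTAAA", 6, "C")
missense = annotate_snp("ATGGCTAAA", 4, "A")
```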