We just posted a paper to bioRxiv looking at the dynamics of cross-species transmission of SIVs (HIV's close relatives that infect other species of primates). This was my Epidemiology MS thesis project here in the Bedford lab, and my first computational project.

SIVs infect over 45 different species of primates, and HIV emerged as a human pathogen through at least 12 independent transmissions of SIVs from chimpanzees, gorillas, and sooty mangabeys to humans. Individual occurrences of SIVs switching hosts have been sporadically documented, but we still had no idea how regularly SIVs switch hosts -- i.e., whether the transmissions that sparked the HIV pandemic were unusual occurrences.

Many of these viruses have been sequenced in recent years. While we weren't able to study them all, we were able to get enough sequence data (shout out to the fantastic Los Alamos National Labs database) to study the history of SIV cross-species transmission (CST) among 24 different primates. We used these data to assess how frequently viruses from different lineages recombine (part of one genome and part of another genome getting "pasted together"), and to look at how often they've switched hosts over evolutionary time. Our phylogenetic analysis found that SIV evolution has been shaped by at least 13 instances of interlineage recombination, and identified 14 novel, ancient CST events. We found that on average, each lineage of SIV switches hosts about once every 6.25 substitutions per site (these are funny units because SIVs are millions of years old, but they essentially mean the amount of evolutionary time required for an average of 6.25 substitutions to accumulate at each site of the genome). We also observed more CST events between closely related primates, and found that viruses and hosts have extensively coevolved (and likely cospeciated). Taken together, our results show that SIV biology has been extensively shaped by CST, but that CST is still a rare phenomenon over evolutionary time.
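To make those units a little more concrete, here's a back-of-the-envelope conversion from substitutions per site to calendar time. The substitution rate below is purely an illustrative placeholder, not an estimate from the paper (long-term SIV rates are much slower than within-host rates and remain uncertain):

```python
# Back-of-the-envelope: what "one host switch per 6.25 substitutions/site" could
# mean in years, under an assumed (illustrative, not estimated) substitution rate.
cst_interval = 6.25    # substitutions per site between host switches (from the paper)
assumed_rate = 1e-3    # substitutions per site per year -- placeholder value only

years_per_switch = cst_interval / assumed_rate
print(f"~{years_per_switch:,.0f} years between host switches at {assumed_rate} subs/site/year")
# A slower long-term rate (as expected for ancient SIV lineages) would imply far longer intervals.
```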

A couple of months ago I tweeted that we had our first Bedford lab wet lab (full disclosure: it’s a bench, but start small, right?). Well, I’m excited to say that we have just released our first bit of Bedford lab-generated sequence data and pushed results to Nextstrain Zika!

These data are 5 (draft) Zika genomes from clinical samples collected in the U.S. Virgin Islands. After getting some experience sequencing on the MinION down in Brazil, I spent the first two weeks of December getting amplicons and sequencing on island in St. Croix. The Caribbean in December, it’s rough, I know. This work is in collaboration with the VI Department of Health, who have generously given me access to their samples and let me take over their lab when I've been down. As a doctoral student in epidemiology, it’s an incredible opportunity to run a study from start to finish, not to mention investigate an outbreak in close to real-time. I’m really excited about it.

Importantly, this has been a group effort. The fact that we have these data is a huge testament to the benefits of open science. I’m not a wet-lab scientist by training, and Zika is not the easiest virus to sequence. This project could have been really painful, and the fact that it hasn’t been owes a lot to the openness of other groups in sharing their knowledge, experience, and protocols. I’m so thankful to Josh Quick and Nate Grubaugh, who were incredibly responsive when I had questions or needed help with the protocol, and to Nick Loman for freely sharing his entire bioinformatic pipeline. Additionally, my lab has been amazing both as a sounding board for ideas and for helping with the data processing and analysis. To have so many people come together to help a project succeed is wonderful, especially so when you’re a student trying to figure things out for the first time. Keep an eye out for more data coming out soon! We'll keep the zika-seq project updated with new sequences as we generate them.

We've just published a paper in Virus Evolution investigating the evolutionary dynamics of infectious hematopoietic necrosis virus (IHNV) in Pacific salmon. This is work that I did during my Master's degree in the Kurath lab and that I continued to develop during the first year of my PhD here in the Bedford lab.

IHNV is endemic along the Pacific coast of Canada and the United States, from California up to Alaska, and also in the Columbia River Basin. The Columbia River Basin is a large and complex watershed that drains most of Washington, Oregon and Idaho and supports one of the largest salmon runs in the continental US. There's a fair amount of interest in IHNV because it can cause severe epidemics, with mortality rates of up to 90%, that can greatly affect conservation hatcheries and commercial aquaculture. Because there are no treatments for IHNV, a lot of effort goes into understanding viral transmission dynamics in the hopes of preventing big outbreaks.

For this paper we sequenced over 1200 viral isolates collected over a 40-year time period. We combined sequence data with epidemiologic data to explore possible relationships between evolutionary dynamics and epidemiological characteristics of the virus. Our work revealed two previously unrecognized subgroups of U genogroup IHNV, which were associated with distinct epidemiologic patterns. One subgroup was detected more frequently in Chinook salmon and steelhead trout in the Columbia River Basin, while the other was detected more frequently in sockeye salmon in coastal watersheds. These associations were supported by FST and by phylogeographic analysis. Notably, the geographic structure we observed supports the hypothesis that fish-to-fish transmission of IHNV occurs mainly in fresh water, when migratory fish populations are divided across watersheds.

Back in May, we (Richard Neher and I) learned that nextstrain.org had been selected as a finalist for the Open Science Prize, a new initiative jointly funded by the NIH, the Wellcome Trust and HHMI. Each of the six finalists was asked to build a prototype of their project and present it at the BD2K Open Data Science Symposium at the beginning of December. It was interesting to see the other entries to the competition. As it turned out, everyone made a website, and each group was offering a layer of added value on top of publicly available data: one entry provides a platform for sharing health and genetic information for people suffering from rare diseases, while another implements a database of worldwide air quality data. A few years ago, I wrote about the possibility of a GitHub of Science. At the time, I wasn't sure exactly what this meant. I had a vague idea that someone could take a paper, fork it, and add additional analyses on top of the original. Now, the future seems much clearer:

Just as software APIs allow open source software to be built layer-upon-layer, all six of the Open Science Prize finalists supply something like an API in which inputs of publicly available data are processed to yield derived outputs that encourage sharing, synthesis and understanding. I can totally imagine a scientific ecosystem in which open science projects (websites) rely on a stack of data and outputs from other groups, but produce their own data and outputs for downstream analysis. With nextstrain, we'd like to do something like this for pathogen phylogenetics and provide a basis for downstream epidemiological and evolutionary analyses. It seems like such a model could grow to live alongside the dominant (and incredibly worthwhile) scientific discourse occurring via peer-reviewed publication.

Public voting is now open to determine which three entries will move forward to the final round. Although I think all six OSP entries were pretty great, we'd very much appreciate your vote. Please go to www.openscienceprize.org and vote by Jan 6.

Watercolor courtesy of Matt Cotten.

You know how when you're traveling you pick up post cards with every intention of writing them and sending them from your exotic location only to mail them out once you're home? This post is kind of like that.

I'm writing this from my desk in Seattle, but I was in fact away... in Brazil! I had the incredible opportunity to spend two weeks sequencing Zika genomes from clinical samples with Josh Quick (of Ebola sequencing fame) and Sarah Hill from Oliver Pybus' group. You might be familiar with the ZiBRA sequencing road trip that Trevor was a part of back in June. This was a follow-up trip with the goal of generating a lot more genomes with a newly optimized protocol.

Plans for this trip were hatched about three weeks before we all flew down, in a pub in Cornwall where I was for PoreCamp 2016. The vast majority of the logistical details were hashed out over WhatsApp. This type of planning for a pretty major trip seemed kind of crazy and unlikely to work, and yet it totally did. And really, kind of crazy but managing to work is a pretty good description of the trip in its entirety.

The good news first! Josh’s freshly minted protocol worked very well and we were able to sequence 20 genomes on the MinION with over 75% genome coverage (and some with as high as 98% genome coverage). 20 genomes might not sound like many, but there were only around 90 genomes publicly available when we flew down, so proportionally this was a lot of information to get out of the trip.
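For anyone wondering how "genome coverage" gets summarized for a draft consensus, here is a minimal sketch that simply counts the fraction of unambiguous bases. The filename is a hypothetical placeholder, and this isn't the actual script from our pipeline:

```python
# Minimal sketch: percent of each draft consensus genome that was actually called,
# i.e. positions that aren't N (sites typically masked when read depth is too low).
# "consensus_genomes.fasta" is a placeholder filename, not from our real pipeline.
from Bio import SeqIO

for record in SeqIO.parse("consensus_genomes.fasta", "fasta"):
    seq = str(record.seq).upper()
    called = sum(base in "ACGT" for base in seq)
    print(f"{record.id}: {100.0 * called / len(seq):.1f}% of {len(seq)} sites called")
```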

Behind the scenes, though, we were riding a roller coaster of success and issues, and given that our experiences are probably pretty typical of outbreak sequencing in the field, they warrant some description. We started off in Salvador, where all the RNA extracted during the road trip had been stored. The first major issue we had to deal with was managing contamination. With 45 cycles of PCR needed to amplify sufficient Zika cDNA for sequencing, we were pretty concerned about keeping amplicons away from any RNA yet to be reverse transcribed. This pre- and post-PCR separation was a little more challenging in our Salvador lab, which had a single thermal cycler (we have a two-step PCR protocol) and no separate lab spaces. We ended up turning a biosafety cabinet into our pre-PCR area, complete with a tiny 8-tube thermal cycler that we ran off of a battery pack and programmed using Josh's computer. Given that we could UV the whole setup after every run, this actually worked extraordinarily well, and despite so many rounds of amplification we had clean negative controls.

As we processed more and more samples, we realized that there didn't seem to be as sharp a relationship between Ct and sequence-ability as we were expecting. Josh figured that RNA degradation was likely to blame, as the samples had probably experienced some freeze-thaw during their transportation along the coast of Brazil. There wasn't much that could be done about the degradation at this point, but it did mean that we tried to work as much as possible from fresh extractions in São Paulo.

Another hurdle was ensuring that we had complete epidemiological data for the samples we sequenced. Trevor described some of the challenges with wrangling metadata on the road trip, but in the end all that work meant that the metadata was pretty complete for the samples in Salvador. This was more of a problem in São Paulo, where samples had been received by multiple people, in multiple labs. There really isn't much to say here except you'll probably need to be pretty dogged in your pursuit and you'll probably feel like a bit of a nag. However, the inferential worth of the sequences drops a lot when the associated epidemiological information isn't available, so it really is worth the sweat and tears to hunt it down.

So tl;dr. Thinking of doing some outbreak field sequencing? Awesome, if you can get a good team together it will be a lot of fun (dare I say a sequencing vacation?). Be prepared to be resourceful and creative, to constantly troubleshoot, and to cross your fingers a lot. Try and have fresh RNA, battle to make the lab as clean as you can, and work hard to verify and manage your metadata.

Richard Neher and I have compiled another report on recent patterns of seasonal influenza virus evolution with an eye toward projecting forward to the 2016 and 2017 flu seasons. All analyses are based on the nextflu platform. Doing weekly updates on nextflu has forced us to keep pipelines current and has made putting together these reports not such a chore.

This time around, the biggest news is within H3N2, where we're seeing the rapid spread of a subclade within 3c2.a viruses. This subclade is primarily distinguished by the HA1:171K mutation (along with the changes HA2:77V/155E). We predict these viruses will predominate in the future H3N2 population. However, we lack antigenic data to really say whether the vaccine needs updating. It's possible to have this sort of genetic evolution without strong antigenic evolution necessitating a vaccine update.
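As a rough illustration of how a clade-defining mutation like HA1:171K can be flagged in sequence data (this is not the nextflu code itself; the alignment file and the assumption that it is already in HA1 numbering are both hypothetical):

```python
# Sketch: count strains carrying lysine at HA1 position 171 in an amino-acid alignment.
# Assumes the alignment uses HA1 numbering, so position 171 is column index 170;
# real HA sequences need the signal peptide trimmed before this numbering applies.
from Bio import AlignIO

alignment = AlignIO.read("h3n2_ha1_aa_aligned.fasta", "fasta")   # hypothetical file
site = 171
carriers = [rec.id for rec in alignment if str(rec.seq)[site - 1] == "K"]
print(f"{len(carriers)} of {len(alignment)} strains carry HA1:{site}K")
```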

In putting together the report this time, it was helpful to refer back to our past reports from last September and this February. Gratifyingly, in February we stated:

Barring substantial changes in other clades, we predict the (HA1:171K, HA2:77V/155E) variant to dominate.

This is exactly what's come to pass in the last 6 months. As we keep doing this, we'll be able to compile hits-and-misses and see where the intuition and models are succeeding and where they are failing.


This has been a busy, but fun and productive, summer with lots of things going on. I had a couple of conferences, traveled to Brazil in June to help with Zika sequencing and traveled to South Korea for a collaborative visit. In addition to lab things, I've been working with Charlton Callender, Richard Neher and Colin Megill on the nextstrain project, trying to get all the pieces of the pipeline together. We're basically doing a full refactor of the existing nextflu codebase to include a database to manage sequence and serological data, improved build pipelines and more flexible visualization tools. I'll try to write more on the nextstrain project at a later date. We're trying to have a prototype ready by Dec 1, when Open Science Prize judging will be held.

There has been lots of activity in the lab as well.

I'm looking forward to the coming year. We have lots of momentum at this point and it will be fun to see the science that's produced.

I'm at the Rio airport now, heading home after 9 days in Brazil as part of the ground team of the ZiBRA project. As part of the team, I traveled from Natal to Recife along the northeastern coast collecting clinical samples for mobile Zika genome sequencing and analysis. This has been an illuminating experience and I'm grateful to Nick, Nuno, Luiz and the rest of the team for inviting me to be part of this.

I truly believe that pathogen genome analysis can contribute significantly to epidemiological understanding and outbreak response. However, for this to work, genomes need to be produced and shared quickly enough that epidemiological insights are actionable. This was a major issue for much of the West African Ebola outbreak, limiting the utility of genomic approaches. The situation is somewhat better for the ongoing Zika epidemic in the Americas, in that multiple groups are releasing a genome here and a genome there, but overall depth is still lacking, with just 64 outbreak genomes available at this time. The ZiBRA project is an attempt to do real-time genomic surveillance of Zika in Brazil. If all goes according to plan, this project will rapidly provide a dataset for downstream analysis of Zika evolution and epidemiology, aiding understanding of virus spread and epidemic dynamics.

The trip was incredibly eye-opening for me in terms of the messy reality of viral surveillance and the even messier details of Zika surveillance in Brazil. The basic pipeline for Zika surveillance by the Brazilian Ministry of Health (much like other viral surveillance systems) goes something like this:

  1. Patient presents at a clinic with symptoms consistent with infection (fever, rash, etc...).
  2. The clinician sends a blood sample to the regional diagnostic laboratory (these are referred to as LACENs).
  3. The LACEN extracts viral RNA and runs RT-PCR to confirm viral presence in the sample.

The RT-PCR diagnostic is particularly important because Zika, dengue and chikungunya are difficult to distinguish on clinical symptoms alone. With the road trip, we were able to bring in reagents and expertise that the LACENs lacked and burn through a large number of banked clinical specimens to search for additional RT-positives. In some cases, we were able to confirm Zika diagnoses of pregnant women who had presented the week before. We reported positive and negative RT diagnostics back to the LACENs. RT-positive samples were then brought forward for PCR amplification and MinION sequencing.

I did help a bit with the lab work, but I ended up mostly running point on metadata. As might be expected given the circumstances, the lab work was incredibly chaotic and I spent most of my time trying to keep sample data from unraveling. Keeping epi metadata attached to a sample required maintaining a linkage between the numbers copied from tube to tube to tube and the original LACEN ID. It also required digging through the LACEN diagnostic reports to pull in important epi metadata like date of collection and municipality of residence. I've never quite appreciated before the degree to which data wants to come apart if continual attention is not paid (proper data is an ordered state that is constantly under attack by entropic forces). I hope I've left the team with systems in place to promote further metadata collection.
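To give a flavor of what that linkage work looks like in practice, here's a minimal sketch of the kind of join I mean. The file names and column names are hypothetical stand-ins, not the actual ZiBRA spreadsheets:

```python
# Sketch: tie tube IDs back to the original LACEN ID and its epi metadata,
# and flag samples whose metadata is missing before it "comes apart" for good.
# File and column names are hypothetical placeholders.
import pandas as pd

tubes = pd.read_csv("tube_log.csv")        # columns: tube_id, lacen_id
lacen = pd.read_csv("lacen_reports.csv")   # columns: lacen_id, collection_date, municipality

linked = tubes.merge(lacen, on="lacen_id", how="left", validate="many_to_one")
missing = linked[linked["collection_date"].isna()]
print(f"{len(missing)} of {len(linked)} tubes still need LACEN metadata chased down")
```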

We finished base calling and assembly on the first MinION runs on June 8, but realized they needed resequencing to achieve good coverage. That said, we should be releasing genomes soon and hope to keep a flow of genomes going through the next few months. I'm super excited to be able to rapidly incorporate these genomes into nextstrain.org and help with tracking Zika evolution and epidemic spread.