Surveillance sequencing of SARS-CoV-2 at UNC

tl;dr: We conducted a tiny, biased survey of the SARS-CoV-2 genotypes circulating in central NC in early Jan 2021. No specific variants of concern were observed, including 501Y (B.1.1.7, B.1.351, or P.1 with apparent increased transmissibility). We’re going to repeat this for the next few weeks to look for existing or novel SARS-CoV-2 variants of concern.

With the simultaneous emergence of several SARS-CoV-2 variants of concern around the world, we want to get as much genomic surveillance as possible from all over the world. Unfortunately, viral sequence surveillance is an extremely unevenly distributed activity, with the preponderance of the world’s sequences coming from the COVID-19 Genomics UK (COG-UK) Consortium . The geographical non-uniformity is repeated as you zoom into each country and find that only a few states or cities are actively engaged in viral sequencing surveillance. For example, in the US, Texas has contributed almost 15k viral sequences, whereas most states have contributed a few dozen to a few hundred sequences. For example, Tennessee (population ~7M) has contributed fewer SARS-CoV-2 genomic sequences (n=24) than the Diamond Princess cruise ship (n=25).

Closer to home, the state of North Carolina currently has 111 viral genomes in GISAID, the most recent of which is from November 2020. Are N501Y.V1/V2/V3 or any other scary viral lineages circulating locally? We won’t know unless we sequence more.

To shine a light on the dark matter of regional viral evolution, I’m working with Jeremy Wang to sequence recent positive samples from Melissa Miller’s clinical testing lab. Really, Jeremy is doing the sequencing and I’m just helping out with the data analysis. This blog post will explain some of the process and show preliminary results from the first week’s batch of sequencing (n=15).

Methods Summary

The clinical testing lab has been helpfully saving low Ct samples for us and handing them off to Rob Hagan‘s lab, who inactivates and anonymizes them before handing them off to Jeremy. Jeremy then does the molecular heavy lifting: RNA extraction, reverse transcription, library preparation and whole-genome nanopore sequencing based on tiled PCR amplification targeting SARS-CoV-2 using a subset of ARTIC primers. Jeremy likes to use big amplicons and get at least 100x coverge per tile (often much higher):
These tiled sequences are then assembled using Medaka, annotated with Nextclade, and after a little data cleaning, compared with global variant counts in CoV-GLUE.

Mutations Overview

As expected, every sample had the now ubiquitous D614G spike mutation, along with nsp12 P323L. 11/15 samples also had the common ORF3a Q57H mutation. Three samples had mutations at S:Q677 and four samples had idiosyncratic mutations in N which have the potential to partially interfere with either the N1 or N2 primers of the CDC RT-qPCR test but we didn’t see clear evidence this is actually happening. We also, unfortunately, discovered that Nextclade doesn’t annotate frameshift mutations. Hopefully this will be fixed in the near future.

Spike Mutations

We didn’t observe any mutations from recent constellations of concern, such as E484K, N501Y, or anything else in the RBD of the spike protein. We did see 2 samples with S:Q677H, which has received some attention in the news due to its possible interaction with the S1/S2 cleavage site, and 1 sample with Q677P in the spike protein. Here are the counts for all spike protein mutations observed:

Protein      Mutation    Cohort Count    Global Count (CoV-GLUE) 
S            D614G        15               221700
S            D138Y         2                  244
S            Q677H         2                  602
S            T95I          1                  417
S            M1237I        1                  213
S            Q677P         1                    1

In addition to SNVs, we also observed a frame-shifting insertion (21639insC) in two of the samples. The induced frameshift would be after residue 26 of the spike protein, where a +1 reading frame seems to extend for ~75aa before hitting a stop codon. Since the virus clearly couldn’t infect cells without its specificity granting receptor-binding protein, this mutation might correspond with a recurrent defective sub-genomic RNA.

Mutations in the CDC EUA test primers

We also observed sporadic SNVs in regions of the N protein overlapping either the N1 or N2 primers of the CDC RT-qPCR assay. The three samples with the higher N1/N2 Ct values also happened to contain these mutations, but since Ct values for both N1 and N2 targets are high, regardless of which has a mutation, the relationship or significance is unclear.

Sample Mutation Primer RNase P Ct N1 Ct N2 Ct
JW0001 C28311T N1_P 27.118 17.684 17.123
JW0012 C29167T N2_F 30.246 22.961 23.252
JW0013 C28344T N1_R 28.294 24.858 25.078
JW0015 T29194C N2_P 33.451 24.031 25.357

It’s important to note that these definitely aren’t unambiguous N Gene Target Failure (NGTF), like have been found elsewhere with deletions in primer regions of the Thermo Fisher TaqPath assay.

Recurrent Mutations Across Samples

For completeness, here are all mutations observed in at least 3/15 samples:

Protein     Mutation     Cohort Count      Global Count (CoV-GLUE)
nsp12       P323L         15                 221122
S           D614G         15                 221700
ORF3a       Q57H          11                  51624
nsp2        T85I          9                   35006
N           P67S          8                    2285
N           P199L         8                    3935
nsp5        L89F          8                    4112
nsp14       N129D         8                    2496
nsp16       R216C         8                    2440
ORF3a       G172V         8                    2550
ORF8        S24L          8                    8492
nsp3        A1006V        5                      15
N           S194L         3                   12810
N           D377Y         3                    2663
nsp3        M1788I        3                    1256
ORF8        I121fs        3                       0

The last mutation in this list is an interesting single nucleotide deletion in the final residue of ORF8, which frameshifts an isoleucine codon into a multi-residue extension of the protein. It’s unclear how common this mutation is globally since frameshifts seem to less well annotated or analyzed than in-frame mutations.

Clades

The sequences in this batch predominantly belong to clades 20G (n=8) and 20A (n=5), with one sample each for 20B and 20C.
This distribution is unsurprising for a sample from the US:

Future Plans

This batch of sequences helped us learn more about the underlying tools, how to distinguish real indel variants from artifactual ones, and develop some basic analysis scripts. We’re planning on sequencing a few dozen samples every week for the next few weeks and deposit each week’s data into GISAID in real time.

Methods Details

PCR-positive universal transport media (UTM) samples were inactivated by 1:1 dilution in 2X Zymo DNA/RNA Shield. We extracted RNA from 400ul inactivated media using the Zymo Quick Viral RNA Kit per the manufacturer’s protocol, eluting into 20ul nuclease-free water.

We generated cDNA covering the genome consisting of 11 amplicon tiles of ~2.8Kbp each. Briefly, 5ul RNA was reverse-transcribed with SuperScript IV using random hexamer primers for 50 minutes at 42C followed by 10 minutes at 70C to inactivate the transcriptase. Two 25ul PCR reactions were run with 2.5ul of first-strand product, 12.5ul NEB Q5 Hot Start 2X Master Mix, and each of two non-overlapping (odd/even-numbered) primer sets (a subset of the ARTIC v3 primers). Two-step PCR included initial denaturation at 98C for 30 seconds followed by 30 cycles of denaturation at 98C for 15 seconds and annealing/extension at 65C for 5 minutes. PCR products from each sample were pooled and cleaned up using 0.6X Ampure XP beads, washing twice with 80% EtOH, and eluting into 15ul NFW. Amplified cDNA was quantified from 1ul elute using a Qubit fluorometer.

We used the ONT Rapid Barcoding Kit (RBK004) with slight modifications. Briefly, we combined 3ul cDNA (with conc. ~120ng/ul) and 1ul RB01-12 barcoding transposase and incubated at 37C for 1 min followed by 80C for 1 min. Samples were pooled equally by volume to a total of 10ul. Sequencing adapters were attached by addition of 1ul RAP and incubation at RT for 5 min.

Libraries were loaded on MIN-FLO106D (R9.4.1) flow cells and sequenced for 3 hours on GridION running MinKNOW (20.10.6). Basecalling was performed using Guppy 4.4.1 with model ‘dna_r9.4.1_450bps_hac’. Reads were demultiplexed and barcodes trimmed using guppy_barcoder. Medaka was used to generate consensus sequences from trimmed amplicons. Consensus genomes were analyzed, variants and clades called using Nextstrain’s Nextclade web interface. Mutations in ORF14 were excluded since this reading frame has been shown to be untranslated. A frame-shifting deletion at the end of ORF8 was manually annotated. Global counts were determined by intersection with data download from CoV-GLUE. Analysis code available online at https://github.com/til-unc/unc-sars-cov-2-sequencing.

We performed RT-qPCR using the Luna Universal Probe One-Step RT-qPCR Kit (New England Biolabs) according to the manufacturer’s suggested protocol on a QuantStudio 6 Flex with CDC recommended N1, N2, and RNase P primers/probes.

Data starter kit for CD8+ T-cell epitope prediction

I put together a small repository of data files for ML researchers who want to get started on cytotoxic (CD8+) T-cell epitope prediction. These are part of the “core” data used to train MHCflurry 2.0: MHC sequences, peptide-MHC affinity data, and eluted MHC ligands detected by mass spectrometry.

My hope is that the following data files will be enough to get you started on developing a model which includes peptide-MHC affinity and antigen processing.

class1_mhc_sequences.csv: CSV file containing protein sequences of ~15k Class I MHC alleles. These are primarily human (prefixed by “HLA”) but also contain several other species (e.g. mice, cows, &c). Most of the diversity of Class I MHCs occurs in exons 2 and 3, so some sequences are limited to those regions. The most important columns are name (MHC allele) and seq(amino acid sequence).

peptide-mhc-binding-affinity.csv: CSV containing ~200k measurements of affinity between short peptides and different MHC proteins. Most of these are for human alleles (prefixed by “HLA-“) but ~35k come from other species (primarily mouse MHCs, prefixed by “H-2”). The most important columns are:

      • allele: name of MHC allele, e.g. “HLA-A*02:01”
      • peptide: amino acid sequence of peptide
      • measurement_value: nM affinity (smaller is better), most often a IC50 (inhibitory concentration). Many predictors convert these to a value between 0 and 1 through the transformation 1 − log(min(IC50, 50000))/log(50000).
      • measurement_inequality: One of {“=", “>", “<“}. Most often the measurement is exact (“=“) but “<” indicates that the measurement is an upper bound (and a lower bound “>“).

eluted-mhc-ligands-mass-spec.csv: Peptides identified bound to MHCs on the surface of cells by immuno-precipitation -> elution -> mass spectrometry. The most important columns are:

      • peptide: Amino acid sequence of the identified peptide
      • format: “MONOALLELIC” when the profiled cells have a unique MHC and otherwise “MULTIALLELIC”
      • mhc_class: One of “I” or “II”. For CD8+ T-cell epitope prediction only use “I”.
      • hla: MHC allele(s) of the profiled cells

Limits: You will probably also need to generate additional data for different categories of negative samples. For example, you may want to map all mass spec identified peptides to their source proteins in order to sample non-eluted peptides of equivalent abundance. You may want to include RNA/abundance data in a final model of immunogenicity and will want to evaluate predictors on unbiased screens of T-cell responses. Lastly, you may want to use smaller datasets of peptide-MHC stability or antigen processing components.