Data starter kit for CD8+ T-cell epitope prediction

I put together a small repository of data files for ML researchers who want to get started on cytotoxic (CD8+) T-cell epitope prediction. These are part of the “core” data used to train MHCflurry 2.0: MHC sequences, peptide-MHC affinity data, and eluted MHC ligands detected by mass spectrometry.

My hope is that the following data files will be enough to get you started on developing a model which includes peptide-MHC affinity and antigen processing.

class1_mhc_sequences.csv: CSV file containing protein sequences of ~15k Class I MHC alleles. These are primarily human (prefixed by “HLA”) but also include alleles from several other species (e.g. mice, cows, etc.). Most of the diversity of Class I MHCs occurs in exons 2 and 3, so some sequences are limited to those regions. The most important columns are name (MHC allele) and seq (amino acid sequence).

peptide-mhc-binding-affinity.csv: CSV containing ~200k measurements of affinity between short peptides and different MHC proteins. Most of these are for human alleles (prefixed by “HLA-”) but ~35k come from other species (primarily mouse MHCs, prefixed by “H-2”). The most important columns are:

      • allele: name of MHC allele, e.g. “HLA-A*02:01”
      • peptide: amino acid sequence of peptide
      • measurement_value: affinity in nM (smaller values mean stronger binding), most often an IC50 (half-maximal inhibitory concentration). Many predictors convert these to a value between 0 and 1 through the transformation 1 − log(min(IC50, 50000))/log(50000).
      • measurement_inequality: One of {“=”, “>”, “<”}. Most often the measurement is exact (“=”), but “<” indicates that the reported value is an upper bound on the true affinity value and “>” indicates that it is a lower bound.
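
As a rough illustration of the transformation and the inequality column described above, here is a minimal Python sketch. The function names (ic50_to_target, censored_squared_error) are hypothetical, and the clamp at 1 nM is my own assumption so that the output really stays within [0, 1]; it is not part of the formula as stated. The censored loss is just one possible way to use the inequalities, not a description of how any particular predictor is trained.

```python
import math

MAX_IC50 = 50000.0  # nM cap from the transformation described above


def ic50_to_target(ic50_nm, max_ic50=MAX_IC50):
    """Map an IC50 in nM to a score where stronger binders are closer to 1."""
    # Assumption: clamp below at 1 nM so the score never exceeds 1.
    capped = min(max(ic50_nm, 1.0), max_ic50)
    return 1.0 - math.log(capped) / math.log(max_ic50)


def censored_squared_error(pred, target, inequality="="):
    """Squared error that ignores predictions consistent with a bound.

    The transformation reverses the direction of the inequality:
    "<" (IC50 below the reported value) means the true score is ABOVE target,
    ">" (IC50 above the reported value) means the true score is BELOW target.
    """
    diff = pred - target
    if inequality == "<" and diff >= 0:
        return 0.0
    if inequality == ">" and diff <= 0:
        return 0.0
    return diff * diff


if __name__ == "__main__":
    for ic50 in (1.0, 500.0, 50000.0):
        print(ic50, round(ic50_to_target(ic50), 3))
```

Note that the direction of the bound flips after the transformation, which is easy to get wrong when writing a loss function.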

eluted-mhc-ligands-mass-spec.csv: Peptides identified as bound to MHCs on the surface of cells by immunoprecipitation, followed by elution and mass spectrometry. The most important columns are:

      • peptide: Amino acid sequence of the identified peptide
      • format: “MONOALLELIC” when the profiled cells have a unique MHC and otherwise “MULTIALLELIC”
      • mhc_class: One of “I” or “II”. For CD8+ T-cell epitope prediction only use “I”.
      • hla: MHC allele(s) of the profiled cells
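
A minimal sketch of how you might select the rows relevant to CD8+ epitope prediction, using only the columns listed above and Python's standard csv module. The function names are hypothetical, and the rows in the usage example are toy values for illustration, not taken from the file. I have deliberately not parsed the hla column for multiallelic samples, since its exact format is not described here.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def select_class1_monoallelic(rows):
    """Keep Class I ligands from cells expressing a single MHC allele."""
    return [
        row
        for row in rows
        if row["mhc_class"] == "I" and row["format"] == "MONOALLELIC"
    ]


if __name__ == "__main__":
    # Toy rows for illustration only; real rows come from
    # read_rows("eluted-mhc-ligands-mass-spec.csv").
    toy_rows = [
        {"peptide": "GILGFVFTL", "format": "MONOALLELIC",
         "mhc_class": "I", "hla": "HLA-A*02:01"},
        {"peptide": "PKYVKQNTLKLAT", "format": "MONOALLELIC",
         "mhc_class": "II", "hla": "HLA-DRB1*01:01"},
    ]
    for row in select_class1_monoallelic(toy_rows):
        print(row["hla"], row["peptide"])
```

Monoallelic rows give you an unambiguous peptide-allele pairing; multiallelic rows need some form of deconvolution before they can be used the same way.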

Limits: You will probably also need to generate additional data for different categories of negative samples. For example, you may want to map all peptides identified by mass spec to their source proteins in order to sample non-eluted peptides of equivalent abundance. You may want to include RNA/abundance data in a final model of immunogenicity, and you will want to evaluate predictors on unbiased screens of T-cell responses. Lastly, you may want to use smaller datasets of peptide-MHC stability or antigen processing components.
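
To make the negative-sampling idea concrete, here is one possible sketch: for each eluted peptide (a "hit"), draw same-length peptides from the same source protein that were not themselves eluted. Sampling from the same protein is a crude stand-in for matching abundance. The function name sample_decoys and the toy protein sequence are hypothetical, and a real pipeline would first need to map each peptide to its source protein(s) in a proteome.

```python
import random


def sample_decoys(protein_seq, hits, n_per_hit=1, seed=0):
    """Sample non-eluted peptides from the same protein, matched on length."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    hit_set = set(hits)
    decoys = []
    for hit in hits:
        if hit not in protein_seq:
            continue  # peptide does not map to this protein
        k = len(hit)
        # All distinct length-k windows of the protein, minus observed hits.
        windows = {protein_seq[i:i + k] for i in range(len(protein_seq) - k + 1)}
        candidates = sorted(windows - hit_set)
        decoys.extend(rng.sample(candidates, min(n_per_hit, len(candidates))))
    return decoys


if __name__ == "__main__":
    # Toy sequence and hit, for illustration only.
    protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(sample_decoys(protein, ["AYIAKQRQI"], n_per_hit=3))
```

Decoys drawn this way can overlap heavily with the hit, so you may want to exclude windows that share most of their residues with an eluted peptide.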