Human B cells play a fundamental role in the adaptive immune response to infection and vaccination, as well as the pathology of allergies and many autoimmune diseases. Central to all of these processes is the fact that B cells are an evolutionary system, and undergo rapid somatic hypermutation and antigen-driven selection as part of the adaptive immune response. The similarities between this B cell response and evolution by natural selection have made phylogenetic methods a powerful means of characterizing important processes, such as immunological memory formation. Recent methodological work has led to the development of phylogenetic methods that adjust for the unique features of B cell evolution. Further, advances in single cell sequencing can now provide an unprecedented resolution of information, including linked heavy and light chain data, as well as the associated transcriptional states of individual B cells. In this tutorial, we show how single cell information can be integrated into B cell phylogenetic analysis using the Immcantation suite (Immcantation.org).
This tutorial covers:
Beginning with processed single cell RNA-seq (scRNA-seq) + BCR data from 10X Genomics, we will show:
Watch on YouTube a recorded version of this tutorial, that was presented at the Adaptive Immune Receptor Repertoires Webinar Series organized by the AIRR Community and The Antibody Society (November 9, 2021).
Jupyter Notebook documentation: https://jupyter-notebook.readthedocs.io/en/stable/
Ctrl+Enter will run the code in the selected cell and Shift+Enter will run the code and move to the following cell.
This container comes with software and example data that is ready to use. The commands versions report
and builds report
show the versions and dates respectively of the tools and data.
Use this command to list the software versions
Use this command to list the date and changesets used during the image build.
../data/bcr_phylo_tutorial/BCR.data.tsv
: B-Cell Receptor Data. Adaptive Immune Receptor Repertoire (AIRR) tsv BCRs already aligned to IMGT V, D, and J genes. This process is not covered in this tutorial. To learn more visit https://immcantation.readthedocs.io/en/stable/tutorials/tutorials.html../data/bcr_phylo_tutorial/GEX.data.rds
: Gene Expression Data. This file contains a Seurat object with RNA-seq data already processed and annotated. Processing and annotation are not covered in this tutorial. You can learn more on these topics in Seurat's documentation and tutorials: https://satijalab.org/seurat/articles/pbmc3k_tutorial.htmlThese two files are subsamples of the original 10x scRNA-seq and BCR sequencing data from Turner et al. (2020) Human germinal centres engage memory and naive B cells after influenza vaccination Nature. 586, 127–132 link The study consists of blood and limph node samples taken from a single patient at mulitple time points following influenza vaccination.
Note: The example files are available for download here.
Combining gene expression and BCR sequences.
Identifying clonal clusters, reconstruct germlines.
Building and visualizing trees.
Tree analysis, detecting ongoing evolution.
The next two lines of code are required to be able to use R magic and run R code in this Jupyter notebook.
The is the working directory:
The example files are expected to be in ../data/bcr_phylo_tutorial
.
The example files are expected to be in ../data/bcr_phylo_tutorial
. If you are using a different path, update the code in the cell bellow accordingly.
print
can be used to obtain a general overview of the Seurat object (number of features, number of samples...).
Idents
reports the cell ID and identities. The first annotation in this blood sample is a TCR.
The default file format for all functions in Immcantation is the AIRR-C format as of release 4.0.0. The rearrangement data is stored in a table where each row is a sequence, and each column an annotation fields. To learn more about this format (including the valid field names and their expected values), visit the AIRR-C Rearrangement Schema documentation.
It is possible to subset columns using regular R
functions. The cell below shows how to subset some fields of interest for the first sequence in the table.
Both of the example datasets have been processed separately, and use slighty different cell identifiers. To consolidate the data into one object, we need to standardize the cell identifiers. This step could be different, or not necessary at all, with other datasets.
In addition, the cells in both datasets are not presented in the same order.
Having common cell identifiers, we will be able to bring BCR data into the Seurat object, or the gene expression and annotation data from the Seurat object into the BCR table, by matching cell_id_unique
.
Find the GEX cells in the BCR data
Label GEX data with BCR data availability
Plot UMAP
The vector match.index
contains the position of the GEX cells in the BCR data. If there is not match, the value will be NA
.
With the matching indices, it is possible to label the GEX cells with TRUE
or FALSE
to indicate whether there is BCR information available for the cell, and visualize this information in the UMAP plot.
We expect that for a large proportion of cells labelled as BCR, there will be BCR sequencing data available, and these cells will be highlighted in the UMAP plot.
Find the BCR cells in the GEX data
Transfer GEX annotations into the BCR data
Add UMAP coordinates to the BCR data
Remove cells without GEX data
Ensure information transferred from Seurat object
We repeat the match
step, reversing the order. The vector match.index
will now contain the positions of the BCR sequences in the GEX data.
Some BCRs don't have GEX information. This can happen, for example, if the cell for which BCR's are covered didn't pass the GEX processing and quality controls thresholds.
The GEX cell annotations can be added as additional columns in the BCR table.
The UMAP coordinates can be added as additional columns in the BCR table as well.
The BCR data table now has the UMAP coordinates. We can reproduce part of the UMAP plot with standard ggplot
commands. This plot will have a similar shape to the GEX UMAP, but will only show points for which both GEX and BCR data is available.
The BCR data frame contains additional annotation fields, such us the isotype, that can also be visualized on the UMAP. We expect that naive cell express IgM. The germinal center and plasmablast cells, primarily express IgG and IgA.
Goal: Partition (cluster) sequences into clonally related lineages. Each lineage is a group of sequences that came from the same original naive cell.
Summary of the key steps:
clone_id
) that can be used to identify a group of sequences that came from the same original naive cell.Gupta et al. (2015)
We first split sequences into groups that share the same V and J gene assignments and that have the same junction (or equivalently CDR3) length. This is based on the assumption that members of a clone will share all of these properties. distToNearest
performs this grouping step, then counts the number of mismatches in the junction region between all pairs of sequences in eaach group and returns the smallest non-zero value for each sequence. At the end of this step, a new column (dist_nearest
) which contains the distances to the closest non-identical sequence in each group will be added to the BCR table. findThreshold
uses the distribution of distances calculated in the previous step to determine an appropriate threshold for the dataset. This can be done using either a density
or mixture
based method.
The figure shows the distance-to-nearest distribution for the repertoire. Typically, the distribution is bimodal. The first mode (on the left) represents sequences that have at least one clonal relative in the dataset, while the second mode (on the right) is representative of the sequences that do not have any clonal relatives in the data (sometimes called "singletons"). A reasonable threshold will separate these two modes of the distribution.
Once a threshold is decided, we perform the clonal assignment. At the end of this step, the BCR table will have an additional column (clone_id
) that provides an identifier for each sequence to indicate which clone it belongs to (i.e., sequences that have the same identifier are clonally-related). Note that these identifiers are only unique to the dataset used to carry out the clonal assignments.
Note: will print out “running in bulk mode” because the subsampled example data has only heavy chains. Other options available if light chains are included.
Most real datasets, will have most clones of size 1 (one sequence). In this tutorial, we processed data to remove most of singleton clone and we don't see the much higher peak at 1 that we would normally expect.
The goal is to reconstruct the sequence of the unmutated ancestor of each clone. We use a reference database of known alleles (IMGT). Because it is very difficult to ccurately infer the D region and the junction region for BCR sequences, we mask this region with N
.
Note: If you opted for a native installation to run this tutorial, you can obtain reference germlines from IMGT with:
git clone https://bitbucket.org/kleinstein/immcantation
immcantation/scripts/fetch_imgtdb.sh
Formatting clones
Tree building
Visualize trees
Reconstruct intermediate sequences
In the rearrrangemnt table, each row corresponds to a sequence, and each column is information about that sequence. We will create a new data structure, where each row is a clonal cluster, and each column is information about that clonal cluster. The function formatClones
performs this processing and has options that are relevant to determine how the trees can be built and visualized. For example, traits
determines the columns from the rearrangement data that will be included in the clones
object, and will also be used to determine the uniqueness of the sequences, so they are not collapsed.
Dowser offers multiple ways to build B cell phylogenetic trees. These differ by the method used to estimate tree topology and branch lengths (e.g. maximum parsimony and maximum likelihood) and implementation (IgPhyML, PHYLIP, or R packages ape and phangorn). Each method has pros and cons.
This is the oldest method and very popular. It tries to minimize the number of mutations from the germline to each of the tips. It can produce misleading results when parallel mutations are present.
These methods model each sequence separately. Use a markov model of the mutation process and try to find the tree, not the branch lengths, that maximizes the likehood of seen data.
Similar to the standard maximum likelihood, but incorporating SHM specific mutation biases into the tree building.
All tree building methods are plotted using the same method in dowser, plotTrees
.
Plot trees so that tips are colored by cell type, scaled by sample day, and labelled by isotype.
Get the predicted intermediate sequence at an internal node in the second largest tree. Dots represent IMGT gaps.
Perform root-to-tip regression on each tree to detect if later-sampled timepoints are more diverged from the germline. See this reference for more detail:
Hoehn, K. B. et al. (2021) Human B cell lineages engaged by germinal centers following influenza vaccination are measurably evolving. bioRxiv. https://doi.org/10.1101/2021.01.06.425648
Hoehn, K. B. et al. (2016) The diversity and molecular evolution of B-cell receptors during infection. MBE. https://doi.org/10.1093/molbev/msw015
Hoehn, K. B. et al. (2019) Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination. PNAS 201906020.
Hoehn, K. B. et al. (2020) Phylogenetic analysis of migration, differentiation, and class switching in B cells. bioRxiv. https://doi.org/10.1101/2020.05.30.124446
Hoehn, K. B. et al. (2021) Human B cell lineages engaged by germinal centers following influenza vaccination are measurably evolving. bioRxiv. https://doi.org/10.1101/2021.01.06.425648
Gupta,N.T. et al. (2017) Hierarchical clustering can identify b cell clones with high confidence in ig repertoire sequencing data. The Journal of Immunology, 1601850.
Gupta,N.T. et al. (2015) Change-o: A toolkit for analyzing large-scale b cell immunoglobulin repertoire sequencing data. Bioinformatics, 31, 3356–3358.
Nouri,N. and Kleinstein,S.H. (2018a) A spectral clustering-based method for identifying clones from high-throughput b cell repertoire sequencing data. Bioinformatics, 34, i341–i349.
Nouri,N. and Kleinstein,S.H. (2018b) Optimized threshold inference for partitioning of clones from high-throughput b cell repertoire sequencing data. Frontiers in immunology, 9.
Stern,J.N. et al. (2014) B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Science translational medicine, 6, 248ra107–248ra107.
Vander Heiden,J.A. et al. (2017) Dysregulation of b cell repertoire formation in myasthenia gravis patients revealed through deep sequencing. The Journal of Immunology, 1601415.
Yaari,G. et al. (2012) Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic acids research, 40, e134–e134.
Yaari,G. et al. (2013) Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in immunology, 4, 358.