This tutorial is a basic walkthrough for defining B cell clonal families and building B cell lineage trees using 10x Genomics BCR sequencing data.
Knowledge of basic command line usage is assumed. Please check out the individual documentation sites for the functions detailed in this tutorial before using them on your own data.
docker pull immcantation/lab:devel
The primary steps of processing 10x single cell BCR data include:
CreateGermlines
You may also reference this page for further examples of how to process 10x data with Immcantation. If you have any questions about this tutorial or our software, you can email us at immcantation@googlegroups.com.
The example data is already in the container (/home/magus/data/10x_data_2subj/
). If you want to, you can download it from example data and unzip it.
%%bash
ls ../data/10x_data_2subj
Use the command versions report
to list the versions of the of the Immcantation tools installed in the container:
%%bash
versions report
To process 10x V(D)J data, a combination of AssignGenes.py
and MakeDb.py
can be used to generate a TSV file compliant with the AIRR Community Rearrangement schema that incorporates annotation information provided by the Cell Ranger pipeline. The files of filtered_contig.fasta and filtered_contig_annotations.csv, generated by cellranger vdj
, can be found in the outs directory.
Generate AIRR Rearrangement data from the 10x V(D)J FASTA files using the steps below (the \
just indicates a new line for visual clarity):
%%bash
# assign V, D, and J genes using IgBLAST
AssignGenes.py igblast -s ../data/10x_data_2subj/filtered_contig.fasta -b /usr/local/share/igblast \
--organism human --loci ig --format blast --outdir results
%%bash
# convert IgBLAST output to AIRR format
MakeDb.py igblast -i results/filtered_contig_igblast.fmt7 -s ../data/10x_data_2subj/filtered_contig.fasta \
-r /usr/local/share/germlines/imgt/human/vdj/imgt_human_*.fasta \
--10x ../data/10x_data_2subj/filtered_contig_annotations.csv --extended
%%bash
ls results
After running these commands, you should now have filtered_contig_igblast_db-pass.tsv and filtered_contig_igblast.fmt7 in your results
directory.
--10x filtered_contig_annotations.csv
specifies the path of the contig annotations file generated by cellranger vdj
, which can be found in the outs directory.Please note that:
--organism
and --loci
arguments to AssignGenes.py
accordingly (e.g., --organism mouse
, --loci tcr
) and use the appropriate V(D)J IMGT reference database (e.g., IMGT_Mouse_TR*.fasta)The remaining commands in this tutorial R. The next two lines of code are required to be able to use R magic
and run R
code in this Jupyter notebook.
# R for Python
import rpy2.rinterface
%load_ext rpy2.ipython
%%R
# load libraries
library(alakazam)
library(data.table)
library(dowser)
library(dplyr)
library(ggplot2)
library(scoper)
library(shazam)
You may wish to subset your data to only productive sequences:
%%R
# read in the data
data <- readChangeoDb('results/filtered_contig_igblast_db-pass.tsv')
data <- data %>% filter(productive)
cat(paste("There are", nrow(data), "rows in the data.\n"))
data %>% slice_sample(n = 5) # random examples
When calling clones (B cells that descend from a common naive B cell ancestor) from single cell data, SCOPer will throw an error message such as “xxx cell(s) with multiple heavy chains found. One heavy chain per cell is expected” if any cells in the data contain multiple heavy chains and then stop running. Therefore, if your data contains cells with multiple heavy chains, you need to handle it before calling clones.
A simple solution is just to remove cells with multiple heavy chains from the single cell data:
%%R
# remove cells with multiple heavy chain
multi_heavy <- table(filter(data, locus == "IGH")$cell_id)
multi_heavy_cells <- names(multi_heavy)[multi_heavy > 1]
data <- filter(data, !cell_id %in% multi_heavy_cells)
cat(paste("There are", nrow(data), "rows in the data after filtering out cells with multiple heavy chains.\n"))
Since most of the following analyses are based on heavy chains, we remove cells with only light chains:
%%R
# split cells by heavy and light chains
heavy_cells <- filter(data, locus == "IGH")$cell_id
light_cells <- filter(data, locus == "IGK" | locus == "IGL")$cell_id
no_heavy_cells <- light_cells[which(!light_cells %in% heavy_cells)]
data <- filter(data, !cell_id %in% no_heavy_cells)
cat(paste("There are", nrow(data), "rows in the data after filtering out cells without heavy chains."))
Hierarchical clustering is a widely used distance-based method for identify clonally related sequences. An implementation of the hierarchical clustering approach is provided via the hierachicalClones
function in the SCOPer R package.
It is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones before using this method. The ideal threshold for separating clonal groups is the value that separates the two modes of the nearest-neighbor distance distribution. The nearest-neighbor distance distribution can be generated by using the distToNearest
function in the shazam R package.
%%R
dist_nearest <- distToNearest(filter(data, locus == "IGH"), nproc = 1)
# generate Hamming distance histogram
p1 <- ggplot(subset(dist_nearest, !is.na(dist_nearest)),
aes(x = dist_nearest)) +
theme_bw() +
xlab("Hamming distance") + ylab("Count") +
scale_x_continuous(breaks = seq(0, 1, 0.1)) +
geom_histogram(color = "white", binwidth = 0.02) +
theme(axis.title = element_text(size = 18))
plot(p1)
The resulting distribution is often bimodal, with the first mode representing sequences with clonal relatives in the dataset and the second mode representing singletons. We can inspect the plot of nearest-neighbor distance distribution generated above to manually select a threshold to separates the two modes of the nearest-neighbor distance distribution.
For further details regarding inferring an appropriate threshold for the hierarchical clustering method, see the Distance to Nearest Neighbor vignette in the shazam package.
The threshold itself can be also found using the automatic findThreshold
function. There are different ways to find the threshold and details can also be found in the Distance to Nearest Neighbor vignette in the shazam package.
A robust way that we recommend is to use the nearest-neighbor distance of inter (between) clones as the background and select the threshold based on the specificity of this background distribution.
%%R
# find threshold for cloning automatically
threshold_output <- findThreshold(dist_nearest$dist_nearest,
method = "gmm", model = "gamma-norm",
cutoff = "user", spc = 0.995)
threshold <- threshold_output@threshold
threshold
%%R
plot(threshold_output, binwidth = 0.02, silent = T) +
theme(axis.title = element_text(size = 18))
The nearest-neighbor distance distribution is not always bimodal. In this case, if the data have multiple subjects, we can calculate the nearest neighbor distances across subjects to initialize the Gaussian fit parameters of the nearest-neighbor distance of inter (between) clones distribution.
The nearest neighbor distances across subjects can be calculated by specifying the parameter cross
in the function distToNearest
. And then when we call function findThreshold
, Gaussian fit parameters can be initialized by setting parameter cross = dist_crossSubj$cross_dist_nearest
.
In the example below, BCR data from donor 2 together with the BCR data from donor 1, was used to calculate the nearest neighbor distances across subjects:
%%R
# read in BCR data from donor 2
data_10x_example <- readChangeoDb('../data/10x_data_2subj/sc5p_v2_hs_B_1k_multi_5gex_b_Multiplex_vdj_b_all_contig_igblast_db-pass.tsv')
data_10x_example$subject <- "Subj2"
cat(paste("There are", nrow(data_10x_example), "rows in the donor 2 data.\n"))
# label the donor 1 data
data$subject <- "Subj1"
# combine donors 1 and 2
data_both_subjects <- rbind(data, data_10x_example)
data_both_subjects %>% slice_sample(n = 5) # random examples
%%R
# calculate cross subjects distribution of distance to nearest
dist_crossSubj <- distToNearest(filter(data_both_subjects, locus == "IGH"),
nproc = 1, cross = "subject")
# find threshold for cloning automatically and initialize the Gaussian fit parameters of the nearest-neighbor
# distance of inter (between) clones using cross subjects distribution of distance to nearest
threshold_output <- findThreshold(dist_nearest$dist_nearest,
method = "gmm", model = "gamma-norm",
cross = dist_crossSubj$cross_dist_nearest,
cutoff = "user", spc = 0.995)
threshold <- threshold_output@threshold
threshold
%%R
plot(threshold_output, binwidth = 0.02,
cross = dist_crossSubj$cross_dist_nearest, silent = T) +
theme(axis.title = element_text(size = 18))
In the plot above, the top plot is the nearest-neighbor distance distribution within Subj1, and the bottom plot is the nearest neighbor distances across Subj1 and Subj2.
After we decide the threshold for calling clones, the hierachicalClones
function in SCOPer package can be used to call clones using single cell mode:
%%R
# call clones using hierarchicalClones
results <- hierarchicalClones(data, cell_id = 'cell_id',
threshold = threshold, only_heavy = FALSE,
split_light = TRUE, summarize_clones = FALSE)
HierarchicalClones
clusters B receptor sequences based on junction region sequence similarity within partitions that share the same V gene, J gene, and junction length, thus allowing for ambiguous V or J gene annotations. By setting it up the cell_id
parameter, HierarchicalClones
will run in single-cell mode with paired-chain sequences. With only_heavy = FALSE
and split_light = TRUE
, grouping should be done by using IGH plus IGK/IGL sequences and inferred clones should be split by the light/short chain (IGK and IGL) following heavy/long chain clustering.
After calling clones, a clonal abundance distribution can be displayed:
%%R
# calculate and plot the rank-abundance curve
plot(estimateAbundance(results), colors = "steelblue", silent = T) +
theme(axis.title = element_text(size = 18))
Before B cell lineage trees can be built, it is necessary to construct the unmutated germline sequence for each B cell clone. Typically the IGH D segment is masked because the junction region of heavy chains often cannot be reliably reconstructed. Note that occasionally errors are thrown for some clones - this is typical and usually results in those clones being excluded.
In the example below, we read in the IMGT germline references from our Docker container. If you're using a local installation, you can download the most up-to-date reference genome by cloning the Immcantation repository and running the script:
git clone https://bitbucket.org/kleinstein/immcantation.git
./immcantation/scripts/fetch_imgtdb.sh # will create directories where it is run
And passing "human/vdj/"
to the readIMGT
function.
%%R
# run createGermlines using IMGT files in Docker container.
references <- readIMGT(dir = "/usr/local/share/germlines/imgt/human/vdj")
h <- createGermlines(filter(results, locus == "IGH"), references)
k <- createGermlines(filter(results, locus == "IGK"), references, locus = "IGK")
l <- createGermlines(filter(results, locus == "IGL"), references, locus = "IGL")
%%R
# calculate SHM frequency in the V gene
data_h <- observedMutations(h,
sequenceColumn = "sequence_alignment",
germlineColumn = "germline_alignment_d_mask",
regionDefinition = IMGT_V,
frequency = TRUE,
combine = TRUE,
nproc = 1)
The plot below shows the distribution of median mutation frequency of clones:
%%R
# calculate the median mutation frequency of a clone
mut_freq_clone <- data_h %>%
group_by(clone_id) %>%
summarise(median_mut_freq = mean(mu_freq))
ggplot(mut_freq_clone, aes(median_mut_freq)) +
geom_histogram(, binwidth = 0.005) + theme_bw() +
theme(axis.title = element_text(size = 18))
Before we can build lineage trees, our data must be formatted into a tibble of airrClone objects.
traits
column.subject
column in the columns
option. traits
column.minseq
will remove all clones with fewer than the specified number of sequences.%%R
clones <- formatClones(data_h,
traits = "c_call", columns = "subject",
minseq = 3)
print(clones)
Dowser offers multiple ways to build lineage trees, including maximum parsimony, maximum likelihood, and B cell specific models such as IgPhyML.
Here, we build trees using the likelihood maximization functions from the phangorn
package in R. The tree objects themselves are saved as R phylo
objects in the trees
column.
%%R
trees <- getTrees(clones, build = "pml", nproc = 1)
print(trees)
To build lineage trees using the B cell specific models in IgPhyML, you must specify the location of the compiled IgPhyML executable in your system.
parameters
is also returned, which gives the parameter values of the HLP19 model.%%R
trees <- getTrees(clones,
build = "igphyml", nproc = 1, exec = "/usr/local/share/igphyml/src/igphyml")
print(trees)
Regardless of how you build trees, they are visualized in the same manner with the plotTrees
function. This will return a list of ggplot
objects in the same order as the input object. Here, we color the tips by the c_call
value because we specified that column in the formatClones
step.
%%R
p <- plotTrees(trees, tips = "c_call", tipsize = 4)
# plot first tree
p[[1]]
# save all trees to a pdf
treesToPDF(p, file = "results/alltrees.pdf")
Sequences of intermediate nodes are automatically reconstructed during the tree build process. To retrieve them, first plot the node numbers for each node. The function collapseNodes
can help clean up the tree plots.
Then, use the getNodeSeq
function to retrieve the sequence at the desired node.
%%R
trees <- collapseNodes(trees)
# plot trees with node ID numbers
p <- plotTrees(trees, tips = "c_call", tipsize = 4,
node_nums = TRUE, labelsize = 7)
p[[1]]
%%R
# get the sequence at node 8
first_clone_id <- trees[["clone_id"]][1]
sequence <- getNodeSeq(trees, clone = first_clone_id, node = 8)
print(sequence)
In addition to the functions for building and visualizing trees, Dowser also implements new techniques for analyzing B cell migration and differentiation, as well as for detecting new B cell evolution over time. These are more advanced topics detailed on the Dowser site.
If you have data from different tissues, B cell subtypes, and/or isotypes and want to use lineage trees to study the pattern of those traits along lineage trees, check out the discrete trait vignette.
If you have data from multiple timepoints from the same subject and want to determine if B cell lineages are evolving over the sampled interval, check out the measurable evolution vignette.
For more advanced tree visualization, check out the plotting trees vignette.