Benchmarking approaches for the analysis of single-cell expression data

#root.dir <- here::here()
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  #root.dir = root.dir
  fig.height = 6,
  fig.width = 8 #178mm
)  
knitr::opts_knit$set(#root.dir = root.dir, 
                     dpi = 350)  

library(data.table)
library(cowplot)
library(wesanderson)
library(ggplot2)
library(patchwork)
library(pROC)

Background

Recently, Zimmerman et al. proposed the use of mixed models over pseudobulk aggregation approaches, reporting improved performance on a novel simulation approach of hierarchical single-cell expression data. However, their reported results could not prove the superiority of mixed models as they only calculate performance based on type 1 error (performance of the models on non-differentially expressed genes). To correctly benchmark the models, a reanalysis using a balanced measure of performance, considering both the type 1 and type 2 errors (both the differentially and non-differentially expressed genes), is necessary.

This benchmark was conducted using the R/benchmark_script.R in this repository and an edited version of hierarchicell which returns the Matthews correlation coefficient performance metric as well as the type 1 error rates, uses the same simulated data across approaches and has checkpointing capabilities (so runs can continue from where they left off if aborted or crashed), available at: https://github.com/neurogenomics/hierarchicell.

We tested the models using the Matthews Correlation Coefficient (MCC). MCC is a well-known and frequently adopted metric in the machine learning field which offers a more informative and reliable score on binary classification problems. MCC produces scores in [-1,1] and will only assign a high score if a model performs well on both non-differentially and differentially expressed genes. Moreover, MCC scores are proportional to both the size of the differentially and non-differentially expressed genes, so it is robust to imbalanced datasets.

The benchmark was conducted based on the average MCC from 20,000 iterations; 50 runs for each of the 5 to 40 individuals and 50 to 500 cells at a p-value cut-off of 0.05 on 10,000 genes (5,000 differentially expressed genes at varying fold changes and 5,000 non-differentially expressed genes).

Results of benchmark analysis

We can visualise the results from the analysis, first load the data which is stored in this repo:

res <- data.table::fread("./results/method_performances.csv")
print(head(res))
#>    seed n_ind n_cells       method type_1_err         MCC
#> 1: 1202    10     100         MAST 0.27012072  0.36186907
#> 2: 1202    10     100      MAST_RE 0.06136821  0.59086331
#> 3: 1202    10     100  MAST_Combat 0.70579268 -0.02930744
#> 4: 1202    10     100  GLM_tweedie 0.06243806  0.80338803
#> 5: 1202    10     100 GLMM_tweedie 0.02923687  0.82716019
#> 6: 1202    10     100         GEE1 0.07482656  0.80977434

Now let’s plot the results. We will split this plot to also show just the top four performing results (left) so the differences between these are clearer:

#get colour palette set up - choose distinctive colours
#with similar tones for similar methods
q10 <- c(
  #orange - GEE
  "orange2",
  #yellow - modified t
  "yellow2",
  #purple - PB
  "plum2",
  "purple3",
  #gold - Tobit
  "goldenrod4",
  #green - GLMs
  "seagreen1",
  "green4",
  #blue - MAST
  "dodgerblue4",
  "cadetblue4",
  "cadetblue2"
)
size1 <- 7.4
size2 <- 10.5
all_n_ind <- c(5,10,20,30,40)
all_n_cells <- c(50,100,250,500)
tmp <- expand.grid(all_n_ind, all_n_cells)
tmp <- tmp[order(tmp$Var1),]
#create plotting column n_ind,n_cells - make factor so order maintained
res[,n_ind_n_cells:=factor(paste0(n_ind,"-\n",n_cells),
                            levels=apply(tmp, 1, paste, collapse="-\n"))]
#add descriptive method name
res[,method_des:=fcase(
      method == "MAST", "Two-part hurdle: Default",
      method == "MAST_Combat", "Two-part hurdle: Corrected",
      method == "MAST_RE","Two-part hurdle RE",
      method == "Monocle","Tobit",
      method == "ROTS","Modified t",
      method == "GLMM_tweedie","Tweedie: GLMM",
      method == "GLM_tweedie","Tweedie: GLM",
      method == "Pseudobulk_mean","Pseudobulk: Mean",
      method == "Pseudobulk_sum","Pseudobulk: Sum",
      method == "GEE1","GEE1"
  )]

mcc_res <- res[,mean(MCC),
               c("n_ind_n_cells","n_ind","n_cells","method_des")]
setorder(mcc_res,method_des)

#plot all res
plt_mcc1 <-
    ggplot(mcc_res)+
    geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des))+
    geom_vline(xintercept=4.5,color="grey",linetype = 3)+
    geom_vline(xintercept=8.5,color="grey",linetype = 3)+
    geom_vline(xintercept=12.5,color="grey",linetype = 3)+
    geom_vline(xintercept=16.5,color="grey",linetype = 3)+
    theme_cowplot()+
    scale_colour_manual(values=q10)+
    labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
         x="Number of Individuals - Number of cells")+
  theme(axis.text.x = element_text(size=size1),
        axis.text.y = element_text(size=size1),
        axis.title.x = element_text(size=size2),
        axis.title.y = element_text(size=size2),
        legend.text = element_text(size=size2),
        legend.title = element_blank())
#plot top 4
plt_mcc2 <-
  ggplot(mcc_res[method_des %in% c("Pseudobulk: Mean","Pseudobulk: Sum",
                                    "GEE1","Tweedie: GLMM"),])+
  geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des),
            show.legend = FALSE)+
  geom_vline(xintercept=4.5,color="grey",linetype = 3)+
  geom_vline(xintercept=8.5,color="grey",linetype = 3)+
  geom_vline(xintercept=12.5,color="grey",linetype = 3)+
  geom_vline(xintercept=16.5,color="grey",linetype = 3)+
  theme_cowplot()+
  labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
       x="Number of Individuals - Number of cells")+
  theme(axis.text.x = element_text(size=size1),
        axis.text.y = element_text(size=size1),
        axis.title.x = element_text(size=size2),
        axis.title.y = element_blank(),
        legend.text = element_text(size=size2),
        legend.title = element_blank())+
  #get matching colours for DEG approaches from full graph
  scale_colour_manual(values=q10[c(1,3,4,7)])

plt_mcc1 + plt_mcc2 + 
  plot_layout(guides = "collect") & theme(legend.position = "bottom")

So it is clear that our results demonstrate that pseudobulk approaches are far from being “too conservative” and are, in fact, the best performing models based on this simulated dataset for the analysis of single-cell expression data.

Now let’s check on an imbalanced number of cells, plotted with the balanced number of cells for comparison for 20 individuals. This time we will also plot the standard deviation around the mean to show how much the performance varies:

res_imbal <- data.table::fread("./results/method_performances_imbal.csv")

#add descriptive method name
res_imbal[,method_des:=fcase(
  method == "MAST", "Two-part hurdle: Default",
  method == "MAST_Combat", "Two-part hurdle: Corrected",
  method == "MAST_RE","Two-part hurdle RE",
  method == "Monocle","Tobit",
  method == "ROTS","Modified t",
  method == "GLMM_tweedie","Tweedie: GLMM",
  method == "GLM_tweedie","Tweedie: GLM",
  method == "Pseudobulk_mean","Pseudobulk: Mean",
  method == "Pseudobulk_sum","Pseudobulk: Sum",
  method == "GEE1","GEE1"
)]

res_imbal[,n_cells:="imbal"]
res_imbal[,n_ind_n_cells:=factor(paste0(n_ind,"-\n",n_cells))]

mcc_res_imbal <- res_imbal[,mean(MCC),
                           c("n_ind_n_cells","n_ind","n_cells","method_des")]

#add in runs for 20 ind not imbal
mcc_res_imbal <- 
  rbindlist(list(mcc_res[n_ind==20,],mcc_res_imbal))

#plot with error bands
#get sd
sd_imbal <- 
  res_imbal[,sd(MCC),
          c("n_ind_n_cells","n_ind","n_cells","method_des")]
sd_bal <-
  res[,sd(MCC),
      c("n_ind_n_cells","n_ind","n_cells","method_des")]

sd_imbal <- rbindlist(list(sd_bal[n_ind==20,],sd_imbal))
setnames(sd_imbal,"V1","sd")

mcc_res_imbal[sd_imbal,sd:=i.sd, on=c("n_ind_n_cells","method_des")]

ggplot(mcc_res_imbal)+
  geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des))+
  geom_vline(xintercept=4.5,color="grey",linetype = 3)+
  #error bars with sd
  geom_errorbar(aes(x=n_ind_n_cells,ymin=V1-sd, ymax=V1+sd,colour=method_des), width=.2,
                position=position_dodge(0.05))+
  theme_cowplot()+
  scale_colour_manual(values=q10)+
  labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
       x="Number of Individuals - Number of cells")+
  theme(axis.text.x = element_text(size=size1),
        axis.text.y = element_text(size=size1),
        axis.title.x = element_text(size=size2),
        axis.title.y = element_text(size=size2),
        legend.text = element_text(size=size2),
        legend.title = element_blank())

Pseudobulk mean outperformed all other approaches on this analysis. The pseudobulk approach which aggregated by averaging rather than taking the sum appears to be the top performing overall, however, it is worth noting that hierarchicell does not normalise the simulated datasets before passing to the pseudobulk approaches. This is a standard step in such analysis to account for differences in sequencing depth and library sizes. This approach was taken by Zimmerman et al. as their data is simulated one independent gene at a time without considering differences in library size. The effect of this step is more apparent on the imbalanced number of cells where pseudobulk sum’s performance degraded dramatically. Pseudobulk mean appears invariant to this missing normalisation step because of averaging’s own normalisation effect. It should be noted that this is a flaw in the simulation software strategy and does not show an improved performance of pseudobulk mean over sum.

Finally, it may be of interest to consider the performance of the different methods at the same type 1 error rates. We can inspect their sensitivity (1-type 2 error) at different type 1 error with Receiver Operating Curves (ROCs).

#load results
dat_rocs <- 
  data.table::fread("unzip -p ./results/method_performances_imbal_DEGs.csv.zip")
#add descriptive method name
dat_rocs[,method_des:=fcase(
  method == "MAST", "Two-part hurdle: Default",
  method == "MAST_Combat", "Two-part hurdle: Corrected",
  method == "MAST_RE","Two-part hurdle RE",
  method == "Monocle","Tobit",
  method == "ROTS","Modified t",
  method == "GLMM_tweedie","Tweedie: GLMM",
  method == "GLM_tweedie","Tweedie: GLM",
  method == "Pseudobulk_mean","Pseudobulk: Mean",
  method == "Pseudobulk_sum","Pseudobulk: Sum",
  method == "GEE1","GEE1"
)]
setorder(dat_rocs,method_des)
method_desc <- c("Two-part hurdle: Default","Two-part hurdle RE",
                 "Two-part hurdle: Corrected","Tweedie: GLM","Tweedie: GLMM",
                 "GEE1","Pseudobulk: Mean","Pseudobulk: Sum","Tobit",
                 "Modified t")

#loop through all proportions of DEGs
prop_DEGs <- c(0.05,0.1,0.2,0.3)
prop_rocs <- vector(mode="list",length=length(prop_DEGs))
names(prop_rocs) <- prop_DEGs
for(prop_i in prop_DEGs){
  dat_rocs_prop <- dat_rocs[prop_DEGs==prop_i,]
  #need to create a separate ROC obj per method
  rocs <- vector(mode="list",length=length(method_desc))
  names(rocs) <- unique(dat_rocs$method_des)
  for(method_i in unique(dat_rocs$method_des)){
    #define object to plot
    rocs[[method_i]] <- 
      pROC::roc(dat_rocs_prop[method_des==method_i,]$actual, 
                dat_rocs_prop[method_des==method_i,]$pred)
  }
  #create ROC plot
  prop_rocs[[as.character(prop_i)]]<-
    ggroc(rocs,legacy.axes=TRUE)+
    annotate("text", size=2.8, family="Helvetica",fontface="bold",
             x = rep(0.9,10), 
             y=rev(c(0,0.03,0.06,0.09,0.12,0.15,0.18,0.21,0.24,0.27)),
             label = lapply(rocs,function(x) round(x$auc[1],3)),color=q10)+
    #geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), 
    #              color="darkgrey", linetype="dashed")+
    #geom_text(lapply(rocs,function(x) x$auc[1]))+
    theme_cowplot()+
    scale_colour_manual(values=q10)+
    labs(colour='Method',y="1 - Type 2 error",
         x="Type 1 error")+
    ggtitle(as.character(prop_i))+
    theme(plot.title = element_text(size = size2, face = NULL),
          axis.text.x = element_text(size=size1),
          axis.text.y = element_text(size=size1),
          axis.title.x = element_text(size=size2),
          axis.title.y = element_text(size=size2),
          legend.text = element_text(size=size2),
          legend.title = element_blank())
}
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

#plot all
prop_rocs[["0.05"]] + prop_rocs[["0.1"]] + 
  prop_rocs[["0.2"]]+prop_rocs[["0.3"]] + 
  plot_layout(guides = "collect",ncol = 4) & theme(legend.position = "bottom")

Future work

Pseudobulk approaches were also found to be the top performing approaches in a recent review by Squair et al.. Notably, the pseudobulk method used here; DESeq2, performed worse than other pseudobulk models in Squair et al.,’s analysis and so their adoption may further increase the performance of pseudobulk approaches on our dataset. Conversely, Squair et al., did not consider all models included in our analysis or the different forms of pseudobulk aggregation. Therefore, our results on sum and mean pseudobulk extend their findings and indicate that mean aggregation may be the best performing. However, you should be cognizant that the lack of a normalisation step based on the flaw in the simulation software strategy likely causes the increased performance of mean over sum aggregation. Further, the use of simulated datasets in our analysis may not accurately reflect the differences between individuals that are present in biological datasets. Thus, despite both our results and those reported by Squair et al., there is still room for further analysis, benchmarking more models, including different combinations of pseudobulk aggregation methods and models, on more representative simulated datasets and biological datasets to identify the optimal approach. Specifically, we would expect pseudobulk sum with a normalisation step to outperform pseudobulk mean since it can account for the intra-individual variance which is otherwise lost with pseudobulk mean but this should be tested. In conclusion, these results demonstrate that pseudobulk approaches are far from being too conservative and are, in fact, the best performing models based on this simulated dataset for the analysis of single-cell expression data.

Benchmarking approaches for the analysis of single-cell expression data

Alan Murphy

Most recent update:
2022-07-08

Background

Results of benchmark analysis

Future work

Benchmarking approaches for the analysis of single-cell expression data

Alan Murphy

Most recent update: 2022-07-08

Background

Results of benchmark analysis

Future work

Most recent update:
2022-07-08