Polychoric Correlation with Likert Data

Bayes
R
Stan
Polychoric correlation using a Bayesian framework in Stan.
Published January 1, 2023

Motivation: It’s likely Likert

Look at you, you successful businessperson you! You own a company that sells two products: A and B. You run a short two-question survey to determine whether your customers would recommend either product. For this example we’ll assume that all customers use both products. Each question uses the familiar Likert-scale format (in the spirit of a “Net Promoter Score” question):

Provide your level of agreement with the following statement: “I would recommend this product to a friend”

Where the available choices are:

  1. Strongly Disagree
  2. Somewhat Disagree
  3. Neither Agree nor Disagree
  4. Somewhat Agree
  5. Strongly Agree

There are, of course, limitations to this kind of survey design. For one, most people have a hard time discretizing their feelings into a single bucket. A more appropriate response format might be a slider that lets respondents express their agreement on a continuous scale. Regardless, since this is the design chosen by thousands of companies and organizations, we’ll choose it as well, while recognizing that agreement, and sentiment in general, is better thought of as a spectrum.

Enough philosophy; on to the actual data. I’ll show how the data is generated further down, but for now let’s say that we ran the survey and collected 500 responses. First, let’s load all the packages we’ll need for this analysis.

library(tidyverse)
library(cmdstanr)
library(knitr)
library(kableExtra)
library(ggridges)

options(mc.cores=4)

set.seed(1234)

sigma <- c(1,0.5,0.5,1) %>% matrix(.,nrow=2,byrow=T)
dat <- mvtnorm::rmvnorm(500, sigma=sigma)

c_points <- list(
  c(-2.69, -0.68, 0.07, 1.13),
  c(-1.29, 0.62, 1.38, 1.65)
)

discrete_data <- dat %>% apply(., 1, 
                               \(x) c(
                                 sum(x[1] > c_points[[1]]) + 1,
                                 sum(x[2] > c_points[[2]]) + 1)
                               ) %>% t()

Next, we’ll take our discrete_data matrix, which holds our survey responses, and visualize it as a table of response counts. For example, the entry in the third row and second column is the number of customers that responded 3 to question 1 and 2 to question 2.

table(discrete_data[,1], discrete_data[,2]) %>%
  kbl(row.names=1) %>%
  kable_styling() %>%
  column_spec(column = 1, bold=T)
Question 1 \ Question 2      1     2     3     4     5
1                            1     2     0     0     0
2                           25   102     5     0     0
3                            9   108    16     3     3
4                            4    96    46     9    10
5                            0    29    17     5    10

From the table above, we can already see a high degree of positive correlation between the two questions. If we wanted to quantify this correlation, we might naively use the cor function, but this produces biased results because our data is not continuous, which the default Pearson correlation assumes. There are other measures of correlation, such as Spearman or Kendall, which are non-parametric, but neither takes into account the data generating process that aligns with our philosophy. For that, we will need to employ the polychoric correlation, which we define below.
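
For reference, here is a quick sketch of what those naive estimates would look like on the discretized responses (output not shown); all three treat the 1–5 codes directly rather than modeling the latent process:

# Naive correlation estimates on the discrete response codes
cor(discrete_data[,1], discrete_data[,2], method = 'pearson')
cor(discrete_data[,1], discrete_data[,2], method = 'spearman')
cor(discrete_data[,1], discrete_data[,2], method = 'kendall')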

Data Generating Process

We assumed that our data was generated by customers that were forced to discretize their agreement with a given question. If we wanted to properly model this, we might want to assume that our data is first generated from some latent multivariate continuous distribution, and then discretized using a set of p - 1 “cut-points” for each dimension of the latent space, where p is the number of possible choices in the questionnaire. The code to generate data from this process is shown below:

# Latent correlation matrix and 500 draws from the bivariate normal
sigma <- c(1,0.5,0.5,1) %>% matrix(.,nrow=2,byrow=T)
dat <- mvtnorm::rmvnorm(500, sigma=sigma)

# True cut-points for each question (p - 1 = 4 cut-points per dimension)
c_points <- list(
  c(-2.69, -0.68, 0.07, 1.13),
  c(-1.29, 0.62, 1.38, 1.65)
)

# Discretize each latent draw: the response is 1 + the number of cut-points exceeded
discrete_data <- dat %>% apply(., 1, 
                               \(x) c(
                                 sum(x[1] > c_points[[1]]) + 1,
                                 sum(x[2] > c_points[[2]]) + 1)
                               ) %>% t()

We use a bivariate Gaussian as our latent distribution with a mean of 0 and covariance matrix, or equivalently in this case correlation matrix, of

\[ \begin{bmatrix} 1 & 0.5\\ 0.5 & 1 \end{bmatrix} \]

If we plot the latent variables with the cut-points we obtain:

plot(dat,pch=16, xlab = 'Question 1', ylab='Question 2', main='Latent Space Representation')
for (i in 1:length(c_points[[1]])){
  abline(v=c_points[[1]][i], lwd=2, col='darkred')
  abline(h=c_points[[2]][i], lwd=2, col='darkblue')
}

So a customer that answered 3 for the first question and 2 for the second question would have had their latent representation in the shaded region below:

plot(dat, pch=16, xlab = 'Question 1', ylab='Question 2', main='Latent Space Representation')
polygon(x=c(c_points[[1]][2], c_points[[1]][3], c_points[[1]][3], c_points[[1]][2]),
        y=c(c_points[[2]][1], c_points[[2]][1], c_points[[2]][2], c_points[[2]][2]),
        col=adjustcolor('purple', 0.5))
for (i in 1:length(c_points[[1]])){
  abline(v=c_points[[1]][i], lwd=2, col='darkred')
  abline(h=c_points[[2]][i], lwd=2, col='darkblue')
}

Modeling

Priors and Likelihood

Our model is almost fully specified by the data generating process outlined above, but we still need to incorporate our priors. For the correlation matrix, we will reparametrize using the Cholesky decomposition and use an LKJ prior. The cut-points are a little trickier, but notice that the marginals of our latent distribution are standard normals. We can use this to our advantage by reparametrizing the cut-points as a vector of probabilities, where each entry is the probability allocated to the interval of the standard normal distribution between two adjacent cut-points \(c_i\) and \(c_{i+1}\). Note that for \(p-1\) cut-points, there will be \(p\) entries in our probability vector. Thus, we can write:

\[ \begin{align} \begin{bmatrix} c_1\\ c_2 \\ \vdots\\ c_{p-1} \end{bmatrix} & \rightarrow \begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_{p} \end{bmatrix} \\ \\ &= \begin{bmatrix} \Phi(c_1) \\ \Phi(c_2) - \Phi(c_1) \\ \vdots \\ 1 - \Phi(c_{p-1}) \end{bmatrix} \end{align} \]

Since our probability vector must sum to one, we can use a Dirichlet distribution as its prior. The Stan code to specify this prior and perform the reparametrization is below:

real induced_dirichlet_lpdf(vector c, vector alpha, real gamma){
    int K = num_elements(c) + 1;
    // Cumulative standard-normal probabilities at each cut-point
    vector[K - 1] cuml = Phi(c - gamma);
    vector[K] p;
    matrix[K,K] J = rep_matrix(0,K,K);

    // Induced interval probabilities
    p[1] = cuml[1];
    for (k in 2:(K-1)){
        p[k] = cuml[k] - cuml[k-1];
    }
    p[K] = 1 - cuml[K-1];

    // Baseline column of the Jacobian
    for (k in 1:K) J[k,1] = 1;

    // Diagonal and super-diagonal entries of the Jacobian
    for (k in 2:K){
        real rho = exp(std_normal_lpdf(c[k-1] - gamma));
        J[k,k] = -rho;
        J[k - 1, k] = rho;
    }
    return dirichlet_lpdf(p | alpha) + log_determinant(J);
}

The Stan code is a bit more involved and includes the Jacobian adjustment, since we are specifying a prior on transformed parameters. For more detail about the reparametrization and Jacobian calculation, I suggest reading Michael Betancourt’s ordinal regression tutorial. Be aware that his model uses a latent logistic distribution (different from our standard normal).
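
To build intuition for the reparametrization (this R snippet is just a sanity check, not part of the Stan model), we can map the first question’s true cut-points to their interval probabilities with pnorm and invert the mapping with qnorm:

# Forward map: p - 1 cut-points -> p interval probabilities under a standard normal
theta <- diff(c(0, pnorm(c_points[[1]]), 1))
sum(theta)                                # the probabilities sum to 1
# Inverse map: cumulative probabilities back to the original cut-points
qnorm(cumsum(theta)[-length(theta)])      # matches c_points[[1]]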

Finally, we need to consider the likelihood of the data conditioned on our parameters. To model this, we will extend the multivariate probit model to the case where an arbitrary number of ordinal categories is observed. Without going into too much detail, the multivariate probit is used to model a Bernoulli-distributed random vector, where the data is assumed to have been generated from a latent multivariate normal distribution. For example, if you take our data generating process above but use only one cut-point per dimension, the generated data would be a Bernoulli random vector, as the sketch below illustrates. The Stan code used to define this extension of the multivariate probit likelihood is here, along with the full Stan code for the model.
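
To make that connection concrete, here is a minimal sketch (the single cut-point at zero for both questions is an arbitrary illustrative choice):

# With one cut-point per dimension, the discretized data is a Bernoulli random vector
binary_data <- (dat > 0) * 1L
head(binary_data)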

The full derivation of the likelihood, and therefore of the Stan code, is beyond the scope of this blog post, but I refer you to these two other resources to learn more if you are interested:

The parameters, model, and generated quantities blocks are shown below:

parameters {
  cholesky_factor_corr[D] L_Omega;       // Cholesky factor of the latent correlation matrix
  array[N,D] real<lower=0, upper=1> u;   // nuisance parameters used to construct the likelihood
  array[D] ordered[n_cut] c_points;      // cut-points for each question
}
model {
  L_Omega ~ lkj_corr_cholesky(4);
  // Induced Dirichlet prior on each dimension's cut-points
  for (d in 1:D) target += induced_dirichlet_lpdf(c_points[d] | rep_vector(1, n_cut + 1), 0);

  // Extended multivariate probit likelihood of the ordinal responses
  for (n in 1:N) target += trunc_norm(y[n], c_points, L_Omega, u[n], D, y_min, y_max);
}
generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);   // Omega = L_Omega * L_Omega'
}

The parameters of interest here are L_Omega, which gives us the correlation matrix for our latent Gaussian, and the c_points array, which holds the cut-points that generated our data. Ignore the u parameter; it is a nuisance parameter used only to help define the likelihood of our data.
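
For intuition on the generated quantities step, multiply_lower_tri_self_transpose simply multiplies L_Omega by its transpose; a small R sketch using the true correlation matrix from the simulation:

# A lower-triangular Cholesky factor times its transpose recovers the correlation matrix
L <- t(chol(sigma))   # chol() returns the upper factor, so transpose it
L %*% t(L)            # equal to sigma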

Sampling and Posterior Exploration

Now that our model is fully defined, we can use the cmdstanr package to sample from the posterior. The full Stan code can be found here. Note that during model fitting we expect some rejected proposals due to the highly constrained values of the correlation matrix, especially when fitting this model with a higher-dimensional latent space.

fp <- file.path('PATH TO YOUR STAN MODEL CODE')  # point this at your Stan model file
mod <- cmdstan_model(fp)
data <- list(
  D = ncol(discrete_data),
  N = nrow(discrete_data),
  y = discrete_data,
  y_min = min(discrete_data),
  y_max = max(discrete_data)
)

poly_cor <- mod$sample(data = data, seed = 1234, chains = 4, parallel_chains = 2,
                       iter_warmup = 2000,iter_sampling = 2000)

Let’s take a look at some diagnostics to make sure we had adequate posterior sampling:

poly_cor$summary(c(paste0('c_points[1,', 1:4,']'), paste0('c_points[2,', 1:4,']'), 'Omega[2,1]')) %>%
  select(-c(median, q5, q95, mad)) %>%
  map(., \(x) if(is.character(x)) x else round(x,2)) %>% as.data.frame() %>%
  kbl() %>% kable_styling()
variable        mean    sd  rhat  ess_bulk  ess_tail
c_points[1,1]  -2.44  0.19     1   9841.39   5775.21
c_points[1,2]  -0.70  0.06     1   9717.82   7102.99
c_points[1,3]  -0.02  0.06     1   9485.35   7512.77
c_points[1,4]   1.04  0.07     1  11822.11   6832.92
c_points[2,1]  -1.36  0.08     1  12008.87   6858.70
c_points[2,2]   0.66  0.06     1   9789.17   6537.07
c_points[2,3]   1.44  0.08     1  10790.98   6061.38
c_points[2,4]   1.71  0.10     1  12190.28   6611.51
Omega[2,1]      0.51  0.04     1  10775.66   6575.90

Our Rhat, ess_bulk, and ess_tail look good! Let’s take a look at our posteriors for the c_points parameter. The ggplot code that adds line segments to the ridge plot is adapted from this stackoverflow post.

# Function to extract Cut Points from stan model
get_cut_point <- function(dim, n){
  c_name <- paste0('c_points[', dim, ',', n, ']')
  poly_cor$draws(c_name) %>% 
    .[,,1,drop=T] %>%
    as.vector() %>% as.data.frame() %>%
    cbind(., c_name, dim)
}

cut_point_draws <- lapply(4:1, \(x) get_cut_point(1,x)) %>% 
  append(., lapply(4:1, \(x) get_cut_point(2,x))) %>%
  do.call('rbind', .) %>% set_names(c('value', 'name', 'dim')) %>% 
  mutate(dim=as.factor(dim))

# Plot the Cut Point Ridge plot
comp <- ggplot(cut_point_draws, aes(x=value, y=name)) + geom_density_ridges()
ingredients <- ggplot_build(comp) %>% purrr::pluck("data", 1)

# For each ridge (group), look up the true cut-point (groups 1-4 are dimension 1,
# groups 5-8 are dimension 2) and keep the density point nearest to it
density_lines <- ingredients %>% group_by(group) %>% 
  mutate(a = c_points[[floor(max(group)/5) + 1]][[((max(group) - 1) %% 4) + 1]]) %>% 
  slice(which.min(abs(x-a)))

ggplot(cut_point_draws, aes(x=value, y=name)) + 
  geom_density_ridges(aes(fill=dim), rel_min_height = 0.01, lwd=1.2, alpha=0.3) +
  scale_x_continuous(breaks = seq(-3.5,2.5,0.5),limits = c(-3.5,2), expand=c(0,0)) +
  geom_segment(data = density_lines, col='darkred',lwd=1.2,
               aes(x = x, y = ymin, xend = x, 
                   yend = ymin+density*scale*iscale, linetype='Actual Cut-Point Value')) +
  scale_linetype_manual("",values=c('Actual Cut-Point Value'=1)) +
  scale_fill_manual("Cut-point Dimension", 
                    values=c('darkblue', 'darkred', 'yellow'), 
                    labels=c('Dim. 1', 'Dim. 2')) +
  labs(title='Cut Point Parameter Estimates') +
  theme_ridges() + 
  theme(axis.title.y = element_blank(), axis.title.x = element_blank(), 
        axis.text = element_text(size=20), plot.title = element_text(size=30),
        legend.text = element_text(size=20), legend.title = element_text(size=30))

It looks like the actual value of each cut-point is captured within the posterior range of its estimate. Now we can look at our originally requested quantity: the correlation.

poly_cor$draws('Omega[2,1]') %>% .[,,1,drop=T] %>% as.vector() %>% 
  {as.data.frame(list(x=.))} %>%
  ggplot(aes(x=x)) + geom_histogram(aes(y=after_stat(density)), fill='darkblue', col='black', alpha=0.3) + 
  geom_density(lwd=1.3) +
  geom_vline(aes(xintercept=0.5, color='Actual Value'), linewidth=1.5, linetype='twodash') +
  scale_colour_manual("", values="darkred") +
  labs(title='Correlation Parameter Estimate') +
  theme_minimal() +
  theme(axis.title.y = element_blank(), axis.title.x = element_blank(), legend.text = element_text(size=13),
        axis.text = element_text(size=13), plot.title = element_text(size=17))

Conclusion

It looks like our model fits the data well and is able to adequately identify the parameters. While the polychoric correlation is a little more involved than a simple Pearson correlation, it aligns more closely with our original data generating process philosophy. Since this model was fit in a Bayesian framework, we have samples from the posterior, which we can use to perform decision analysis, generate pseudo-customers, or run probabilistic PCA on the posterior of the correlation matrix.
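
As one example, here is a rough sketch of generating pseudo-customers by plugging posterior means back into our data generating process (a single point estimate is used for brevity; a fuller treatment would loop over individual posterior draws):

# Posterior means for the correlation and cut-points
omega_hat <- poly_cor$summary('Omega[2,1]')$mean
c_hat <- list(
  poly_cor$summary(paste0('c_points[1,', 1:4, ']'))$mean,
  poly_cor$summary(paste0('c_points[2,', 1:4, ']'))$mean
)

# Reuse the original data generating process with the estimated parameters
sigma_hat <- matrix(c(1, omega_hat, omega_hat, 1), nrow = 2)
new_latent <- mvtnorm::rmvnorm(100, sigma = sigma_hat)
pseudo_customers <- t(apply(new_latent, 1, \(x) c(
  sum(x[1] > c_hat[[1]]) + 1,
  sum(x[2] > c_hat[[2]]) + 1)))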

sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.6   kableExtra_1.4.0 knitr_1.49       cmdstanr_0.8.1  
 [5] lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4     
 [9] purrr_1.0.2      readr_2.1.5      tidyr_1.3.1      tibble_3.2.1    
[13] ggplot2_3.5.1    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] tensorA_0.36.2.1     utf8_1.2.4           generics_0.1.3      
 [4] xml2_1.3.6           stringi_1.8.4        hms_1.1.3           
 [7] digest_0.6.37        magrittr_2.0.3       evaluate_1.0.1      
[10] grid_4.4.2           timechange_0.3.0     mvtnorm_1.3-2       
[13] fastmap_1.2.0        jsonlite_1.8.9       processx_3.8.4      
[16] backports_1.5.0      ps_1.8.1             fansi_1.0.6         
[19] viridisLite_0.4.2    scales_1.3.0         abind_1.4-8         
[22] cli_3.6.3            rlang_1.1.4          munsell_0.5.1       
[25] withr_3.0.2          yaml_2.3.10          tools_4.4.2         
[28] tzdb_0.4.0           checkmate_2.3.2      colorspace_2.1-1    
[31] vctrs_0.6.5          posterior_1.6.0      R6_2.5.1            
[34] matrixStats_1.4.1    lifecycle_1.0.4      htmlwidgets_1.6.4   
[37] pkgconfig_2.0.3      pillar_1.9.0         gtable_0.3.6        
[40] glue_1.8.0           systemfonts_1.1.0    xfun_0.49           
[43] tidyselect_1.2.1     rstudioapi_0.17.1    farver_2.1.2        
[46] htmltools_0.5.8.1    labeling_0.4.3       svglite_2.1.3       
[49] rmarkdown_2.29       compiler_4.4.2       distributional_0.5.0