Pareto Smoothed Importance Sampling - Leave One Out Cross Validation from Scratch
Bayes
R
Stan
I didn’t believe the magic until I derived it for myself.
Published
November 27, 2024
Loo Who?
If you’re like me, when you first saw the loo package and all the accompanying research, you probably thought it was something close to magic. Coming from a traditional ML world, I think of leave-one-out (loo) cross-validation (cv) as the gold standard for model checking and validation, but it is usually quite time-consuming to do in practice.
To perform classical loo, you loop over your data set \(n\) times, where \(n\) is the number of points in your data. For each iteration you remove, or “hold out”, \((X_i, y_i)\), where \(X\) and \(y\) are the covariates and response, respectively. The remaining data \((X_{-i}, y_{-i})\) is used to train the model. After training, you predict using the held-out covariate \(X_i\) to produce \(\tilde{y}_i\). A comparison can then be made between \(\tilde{y}_i\) and \(y_i\) using whichever error function fits your domain: MSE, accuracy, ELPD, etc.
Here is a short demo for linear regression with one covariate. We’ll perform loo and plot the absolute difference between \(\tilde{y}_i\) and \(y_i\):
Code
set.seed(1235)
n <- 100
x <- rnorm(n)
y <- x * 2 - 2 + ifelse(1:100 %% 40 == 0, 15, rnorm(n, 0, 1))
data <- cbind(co = x, resp = y) |> as.data.frame()

loo_pred <- rep(NA, n)
slope_pred <- rep(NA, n)
for (i in 1:n) {
  mod <- lm(resp ~ co, data = data[-i, ])
  loo_pred[i] <- predict(mod, newdata = data[i, ])
  slope_pred[i] <- mod$coefficients[2]
}

diff <- abs(loo_pred - data$resp)
plot(diff, ylab = 'Abs. Difference between pred and actual y', xlab = 'LOO Iteration',
     pch = 16, col = ifelse(diff > 4, 'blue', 'black'))
legend('topleft', col = c('blue'), pch = c(16), legend = c('Influential Points'))
An immediate result of this testing framework is that each point can be evaluated independently of the others. Points are not batched into training/testing sets; each prediction comes from a model that never saw that point during training. We can already start to see which points might be causing issues for the model or do not fit our data generating process. Another benefit, which I’ll briefly show, is that we can see how the coefficients of the \(n\) trained models differ, which reveals how influential some points in the data might be.
Code
plot(slope_pred, pch = 16, ylab = 'Slope of Model trained without point i',
     xlab = 'LOO Iteration', col = ifelse(diff > 4, 'blue', 'black'))
Some Important Samples
Alright, so we’ve tackled the latter half of the post’s name. What about the first half? Well for that, we’ll have to split it in half as well and focus on importance sampling first. At its heart, importance sampling is just a way to evaluate pesky integrals. Consider the following expectation:
\[
E_\pi[f(\theta)] = \int f(\theta) \, \pi(\theta) \, d\theta
\]
For vanilla Monte Carlo estimation (you can think of it as importance sampling where the proposal equals the target), we could draw \(S\) samples of \(\theta\) from \(\pi(\theta)\) and then take the average of \(f(\theta_s)\) over all samples to get an approximation of the above expectation. That is,
\[
E_\pi[f(\theta)] \approx \frac{1}{S} \sum_{s=1}^{S} f(\theta_s), \qquad \theta_s \sim \pi(\theta).
\]
Now, let’s assume that we cannot draw samples directly from \(\pi(\theta)\), as is usually the case when attempting to draw samples from a complex posterior distribution. Instead, we can draw samples from \(\gamma(\theta)\), which nicely approximates \(\pi(\theta)\). We can then transform our integral to the following:
\[
E_\pi[f(\theta)] = \int f(\theta) \, \frac{\pi(\theta)}{\gamma(\theta)} \, \gamma(\theta) \, d\theta
= E_\gamma\!\left[\frac{\pi(\theta)}{\gamma(\theta)} f(\theta)\right]
\approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi(\theta_s)}{\gamma(\theta_s)} f(\theta_s), \qquad \theta_s \sim \gamma(\theta).
\]
We call the quantity \(\frac{\pi(\theta)}{\gamma(\theta)}\) the importance ratio. The slight catch here is that the above will only hold if both \(\pi\) and \(\gamma\) are normalized. If they are not, which is usually the case if we are approximating posteriors, then we need to employ the self-normalized estimator. That is, if \(r(\theta) = \pi(\theta)/\gamma(\theta)\) is a non-normalized importance ratio, then
\[
\frac{\sum_{s=1}^{S} r(\theta_s) \, f(\theta_s)}{\sum_{s=1}^{S} r(\theta_s)}, \qquad \theta_s \sim \gamma(\theta),
\]
is the self-normalized importance sampling estimator of \(E_\pi[f(\theta)]\).
Let’s show a quick example where \(\pi \sim N(0,1)\) and \(f(\theta) = \theta^4\). Clearly, \(\pi\) would be a dream to sample from, but we’ll assume that we left our normal sampler at home. We’ll instead sample from \(\gamma \sim t_{\nu=1{,}000}\) as an approximation to \(\pi\).
From the properties of the normal distribution, we know that we are just estimating \(E[\theta^4]\) with respect to the normal distribution. The answer of course is given by the fourth central moment, \(3\sigma^4\), which is \(3\) for our standard normal.
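Here is a minimal sketch of how this estimate can be computed with the self-normalized estimator above. The number of draws and the seed are arbitrary choices of mine, so take this as an illustration rather than the exact code behind the figures:

# Self-normalized importance sampling sketch:
# target pi = N(0, 1), proposal gamma = t with 1,000 df, f(theta) = theta^4
set.seed(1235)
S <- 1e5
theta <- rt(S, df = 1000)                 # draws from the proposal gamma
r <- dnorm(theta) / dt(theta, df = 1000)  # importance ratios pi / gamma
est <- sum(r * theta^4) / sum(r)          # self-normalized estimate of E[theta^4]
est                                       # should land near the true value of 3
hist(r, breaks = 50, main = 'Importance ratios', xlab = 'r(theta)')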
So we have our estimator, but there are still questions that we’ve swept under the rug. For one, how many iterations should we have run our sampler for, and does \(\gamma\) really approximate \(\pi\) well enough? For a little motivation, let’s try the same experiment again, but instead use \(\gamma \sim t_{\nu=10}\), which has wider tails (a little foreshadowing).
Wow! Looks like we are far from our actual value. But there are no warnings; nothing yelled at us for the poorly estimated value. How can we critique the sampler as well as our chosen \(\gamma\) distribution? The key is in the importance ratios. Notice how in the first experiment, where \(\gamma \sim t_{\nu=1{,}000}\), the tail of the importance-ratio histogram is much lighter than in the second experiment. This is a property we would like to be able to enforce, as well as detect. An early method for controlling the importance ratios \(r(\theta)\) was to truncate them at some maximum value. In particular, Ionides (2008) defined the truncated importance weights \(w(\theta_s)\) as
\[
w(\theta_s) = \min\!\left(r(\theta_s), \; \sqrt{S} \, \bar{r}\right),
\]
where \(\bar{r}\) is the average importance ratio over all \(S\) samples. While this method does guarantee finite variance of the truncated importance ratios, it unfortunately introduces a large bias. Another method, given by Vehtari et al. (2024), to control the importance ratios is to fit a Pareto distribution to the tail of the importance ratio distribution. This not only gives the importance ratios finite variance while introducing little bias, but also allows us to use the parameters of the fitted Pareto distribution as diagnostics. The following definition of the Pareto distribution, as well as its motivation, is given in much greater detail in Vehtari et al. (2024). I will attempt to give a slimmed-down version of their background with a little added history.
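Before moving on, here is how the Ionides-style truncation looks applied to the ratios from my sketch above (again an illustration of mine, not the loo package’s implementation):

# Truncate the raw importance ratios at sqrt(S) times their mean (Ionides 2008)
w <- pmin(r, sqrt(S) * mean(r))
est_trunc <- sum(w * theta^4) / sum(w)   # self-normalized estimate with truncated weights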
Pareto Mentality
You might have heard of the 80/20 “rule” before, usually in the context of sales: 20% of customers make up 80% of sales. It’s a powerful rule of thumb that more often than not comes out roughly correct. This rule, also known as the Pareto principle, is named after the Italian sociologist and economist Vilfredo Pareto, after whom the Pareto distribution is also named.
The Pareto distribution, as one might expect after learning the 80/20 rule, is a simple power law. It is usually characterized by one parameter and is useful for fitting the tails of distributions. For our use case, we will instead be using the generalized Pareto distribution, which has been shown by Pickands (1975) and Balkema and de Haan (1974) to approximate well the tail of the importance ratios \(r(\theta) \mid r(\theta) > u\) for some set tail lower bound \(u\). The PDF for the generalized Pareto is given by:
\[
p(y|u,\sigma,k)=
\begin{cases}
\frac{1}{\sigma} \left(1 + k \left(\frac{y-u}{\sigma}\right)\right)^{-\frac{1}{k}-1} & k\neq 0 \\
\frac{1}{\sigma}\exp\!\left(-\frac{y-u}{\sigma}\right) & k=0 \\
\end{cases}
\] where \(u\) is the aforementioned lower bound on the tail, \(\sigma\) is the positive scale parameter, and \(k\) is the very important shape parameter. The distribution’s support is further restricted to \(y\in[u,\infty)\). When \(k>0\), the inverse of \(k\) tells us how many moments of the distribution exist. Recall that if a distribution’s first moment exists then it has a mean, and if its second moment exists then its variance is defined. You might see where I am going with this: we can use \(k\), after fitting our generalized Pareto to the importance ratios’ tail, as a sort of diagnostic. If \(k\) is in an acceptable range, we can feel comfortable using the Pareto fit for the tails of our importance samples.
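As a quick sanity check on the formula above, here is a direct transcription into R; the function name dgpd and its defaults are my own, and negative \(k\) (which bounds the support from above) is not handled:

# Generalized Pareto density, transcribed from the piecewise definition above
dgpd <- function(y, u = 0, sigma = 1, k = 0) {
  stopifnot(sigma > 0, all(y >= u))
  z <- (y - u) / sigma
  if (k == 0) {
    (1 / sigma) * exp(-z)
  } else {
    (1 / sigma) * (1 + k * z)^(-1 / k - 1)
  }
}

dgpd(10, u = 0, sigma = 1, k = 0)    # exponential-style tail: essentially zero out here
dgpd(10, u = 0, sigma = 1, k = 0.7)  # power-law tail: much heavier at the same point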
The question of fit is not an easy one though. There have been numerous papers that build upon each other, but we will climb to the top of the giant stack for this one. Our method today comes from Zhang and Stephens (2009). It has a base of maximum likelihood, with a side of empirical Bayes, and a sprinkle of quadrature estimation for garnish. Overall a hearty meal!
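To make the recipe concrete, here is a rough sketch of that estimator as I understand it: profile \(\sigma\) out of the likelihood, lay a data-driven grid over the remaining parameter \(\theta = -k/\sigma\), and average \(\theta\) over the grid with weights proportional to the profile likelihood. The function name, the grid constants, and the simulated check below reflect my own reading of Zhang and Stephens (2009), so treat this as illustrative rather than a reference implementation:

# Sketch of a Zhang & Stephens (2009)-style empirical Bayes fit for the
# generalized Pareto (location u assumed already subtracted, so x > 0)
fit_gpd <- function(x) {
  x <- sort(x)
  n <- length(x)
  m <- 30 + floor(sqrt(n))             # size of the quadrature grid
  xstar <- x[floor(n / 4 + 0.5)]       # roughly the first quartile of the sample
  # Grid over theta = -k / sigma, kept below 1 / max(x) so the likelihood is defined
  theta <- 1 / x[n] + (1 - sqrt(m / (seq_len(m) - 0.5))) / (3 * xstar)
  # Profile log-likelihood of theta (sigma is profiled out analytically)
  prof <- sapply(theta, function(t) {
    k <- mean(log1p(-t * x))
    n * (log(-t / k) - k - 1)
  })
  # Normalized likelihood weights over the grid act as the quadrature
  w <- exp(prof - max(prof))
  w <- w / sum(w)
  theta_hat <- sum(theta * w)          # posterior-mean style estimate of theta
  k_hat <- mean(log1p(-theta_hat * x))
  sigma_hat <- -k_hat / theta_hat
  list(k = k_hat, sigma = sigma_hat)
}

# Quick check: simulate from a GPD with k = 0.5, sigma = 1 and try to recover them
set.seed(1235)
u_sim <- runif(2000)
x_sim <- (u_sim^(-0.5) - 1) / 0.5     # inverse-CDF draw from GPD(u = 0, sigma = 1, k = 0.5)
fit_gpd(x_sim)                        # k and sigma should land near 0.5 and 1

In the PSIS setting this kind of fit is applied only to the largest importance ratios, and the resulting \(\hat{k}\) is the diagnostic discussed above.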
I’ve struggled with whether to put this in an appendix, and maybe that would be more digestible, but where is the fun in that!
It is shown by Vehtari et al. (2024) that for \(k < 1/2\) the importance ratios have finite variance and estimates converge quickly; for \(1/2 < k < 1\) the variance is infinite but the mean exists, and convergence is slower; and for \(k \geq 1\) neither the mean nor the variance of the ratios exists. In practice, they recommend treating \(\hat{k} \lesssim 0.7\) as the threshold for a reliable fit.
Outline
Why Loo
Provide some overview of why LOO CV is important
Importance Sampling
We are going to be using it a lot later so good to get it out of the way
Pareto Distribution
What is it
How is it defined - pdf
Why do we care
What do the parameters mean?
How do we go about estimating the parameters
Talk about the empirical Bayes estimation procedure
Bring back the context of MC
Tying it back together
Introduce the Vehtari 2017 paper
Talk about ELPD
Reference the paper to show that we can use p(y) as an approximate weight distribution (with the required math, work through the algebra)
Go ahead and do the actual
References
Balkema, A. A., and L. de Haan. 1974. “Residual Life Time at Great Age.” The Annals of Probability 2 (5): 792–804. http://www.jstor.org/stable/2959306.
Ionides, Edward L. 2008. “Truncated Importance Sampling.” Journal of Computational and Graphical Statistics 17 (2): 295–311. http://www.jstor.org/stable/27594308.
Pickands, James. 1975. “Statistical Inference Using Extreme Order Statistics.” The Annals of Statistics 3 (1): 119–31. http://www.jstor.org/stable/2958083.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.” https://arxiv.org/abs/1507.02646.
Zhang, Jin, and Michael A. Stephens. 2009. “A New and Efficient Estimation Method for the Generalized Pareto Distribution.” Technometrics 51 (3): 316–25. http://www.jstor.org/stable/40586625.