In Tutorial I, we have discussed a full-rank kernel for MOGP model. However, estimating the cross-covariance kernel requires \(L(L-1)/2\) parameters \(\theta_{l_1,l_2}, 1 \leq l_1 \leq l_2 \leq L\) which imposes challenges in both statistical and computational aspects when modelling MOGP with many outputs (e.g. \(L \geq 4\)). In this section, we will discuss an attractive dimension reduction approach, the intrinsic coregionalization model (ICM), to keep the number of correlation parameters low.

1 Packages and user-defined functions

Loading all packages into the current working directory:

  • DiceKriging: kriging methods for single-output GP (SOGP).
  • rgenoud: genetic optimization solver used for Maximum Likelihood Estimation of GP hyperparameters.
  • kergp: kriging methods for user-defined kernel, needed for both Full-rank and ICM Multi-output GP.
  • nloptr: solve nonlinear optimization using R interface to NLopt.
  • data.table: extension of data.frame to enhance data manipulation.

Loading all user-defined functions:

  • \(\color{blue}{\texttt{createMortData.R}}\): to import datasets and select populations prior to model fitting. We can specify the age groups, calendar years, and whether both Males and Females are modeled together. All the datasets are downloaded from the Human Mortality Database. Currently, we have 16 European countries (seperated by gender) in Death Count and Exposures folders. These countries are:
    • Austria
    • Belarus
    • Czech
    • Denmark
    • Estonia
    • France
    • Germany
    • Hungary
    • Latvia
    • Lithuania
    • Netherlands
    • Poland
    • Spain
    • Sweden
    • Switzerland
    • UK
  • \(\color{blue}{\texttt{covICM.R}}\): to define the ICM Kernel with pre-specified number of populations and choice of rank. Users can directly modify this script according to their own analyses (E.g: provide different covariance kernel over Age-Year inputs, adjust the lower and upper bounds of the hyperparameters, re-name the hyperparameters, etc.). In ICM, each output \(f_l\) is assumed be a linear combination of independent latent GPs. Let \(u_1(\mathbf{x}),\ldots,u_Q(\mathbf{x})\) be independent latent functions each from a GP prior with covariance kernel \(C^{(u)}(\mathbf{x,x'})\). The modeled outputs \(f_l\) are linear combinations of these \(Q\) latent factors: \[\begin{equation} f_l(\mathbf{x}) = a_{l,1}u_1(\mathbf{x}) + \ldots + a_{l,Q}u_Q(\mathbf{x}) = \sum_{q=1}^Q a_{l,q}u_q(\mathbf{x}), \end{equation}\] where \(a_{l,q}\)’s are the factor loadings. Let \(\mathbf{a}_q=(a_{1,q},\ldots,a_{L,q})^{T}\) be the vector representing the collection of linear coefficients associated with the q\(th\) latent function across the \(L\) outputs, so that \(\mathbf{f(x)} = \sum_{q=1}^Q \mathbf{a}_qu_q(\mathbf{x})\). It follows that the covariance for \(\mathbf{f(x)}\) is: \[\begin{align} \mathbf{C}(x,x') = \text{Cov} \left(\mathbf{f(x)},\mathbf{f(x')} \right) & = \text{Cov}\bigg(\sum_{q=1}^Q \mathbf{a}_qu_q(\mathbf{x}),\sum_{q=1}^Q \mathbf{a}_qu_q(\mathbf{x'})\bigg) \\ & = \bigg(\sum_{q=1}^Q \mathbf{a}_q\mathbf{a}_q^T\bigg) \otimes \text{Cov} \left(u_q(\mathbf{x}),u_q(\mathbf{x'}) \right) \\ % & = AA^T \otimes Cov(u_q(\mathbf{x}),u_q(\mathbf{x'})) \\ & = AA^T \otimes C^{(u)}(\mathbf{x,x'}) \end{align}\] where matrix \(A = (\mathbf{a}_1,\ldots,\mathbf{a}_Q)\) and \(\otimes\) is the Kronecker matrix product. Re-parametrizing by \(B:=AA^T\), the above equation can be expressed as the Kronecker product between the cross-population covariance \(B \in \mathbb{R}^{L \times L}\) and the covariance over the Age-Year inputs \(\mathbf{C}^{(u)} \in \mathbb{R}^{N \times N}\): \[\begin{equation} \mathbf{C}=\text{Cov}(\mathbf{f(x),f(x')}) = B \otimes \mathbf{C}^{(u)}. \end{equation}\]

  Under ICM, the number of hyperparameters in the cross-population covariance is reduced from \(L(L-1)/2\) to \(Q \times L\). Therefore, taking \(Q<L/2\) allows to reduce the hyper-parameter space and alleviate the computational budget compared to the full- rank setup. Notably the assumption in ICM that all \(L\) populations share the common spatial covariance suits our interest in the inference of joint lengthscales in Age and Year dimensions. The computational complexity required in ICM is greatly simplified due to the properties of the Kronecker product. Finally, ICM allows the process variance \(\eta^2_l\) to vary by populations, i.e., the l\(th\) element in the diagonal in \(B\) is the process variance for population \(l\). This further bolsters the flexibility in MOGP to excel in out-of-sample forecasts.

  • \(\color{blue}{\texttt{gp_predictionICM.R}}\): to perform prediction. Outputs contain the predicted mean and the posterior variance at new inputs \(\mathbf{x_*}\) for the latent function \(\mathbf{f_*}\) and observed outputs \(\mathbf{y_*}\). The cross-covariance matrix between two sets of locations are also provided.

Lastly, we write an R function to create integer coding for populations in MOGP model. This is equivalent to create dummy variables for a categorical variable.

For the remaining of this ICM tutorials, we will illustrate several examples how to fit ICM-MOGP models with different populations. We will show that borrowing information from other populations boosts the predictive power for out-of-sample forecasts. Finally, we explain how MOGP models achieve the coherence in long-term forecats and compare the results with SOGP models.

2 Two-population ICM

2.1 Example 1: Denmark and Sweden, Males

2.1.1 Fitting Denmark-Sweden ICM

We first prepare an aggregated data that combines Danish and Swedish mortality for Males, on Ages 70–84 and Years 1990–2012. Denmark is coded as 1 and Sweden is coded as 2.

The output vector (\(y\)) and input vector (\(X\)) must be provided. We have to change names of the columns in \(X\) to match input names defined in \(\color{blue}{\texttt{covICM.R}}\) script.

Prior to fitting the model, we also need to specify the number of populations and choice of rank. In this example, we have 2 populations (Male Danish and Male Swedish) and the rank value is 2. The function \(\color{blue}{\texttt{gp()}}\) within kergp package can be called to fit the model.

There are several important arguments in \(\color{blue}{\texttt{gp()}}\) function:

  • \(\color{blue}{\texttt{formula}}\): the left-hand side of is the reponse name and right-hand side is the trend covariates.
  • \(\color{blue}{\texttt{data}}\): contains the reponse \(y\) and the inputs \(X\).
  • \(\color{blue}{\texttt{cov}}\): user-defined covariance kernel. In this case, we define the ICM kernel through the \(\color{blue}{\texttt{covICM()}}\) function with \(\color{blue}{\texttt{rank = 2}}\) and \(\color{blue}{\texttt{num_pop = 4}}\).
  • \(\color{blue}{\texttt{estim}}\): logical argument, if \(\color{blue}{\texttt{TRUE}}\): the model parameters are estimated by Maximum Likelihood.
  • \(\color{blue}{\texttt{noise}}\): logical argument, if \(\color{blue}{\texttt{TRUE}}\): estimate noise in the error term.
  • \(\color{blue}{\texttt{multistart}}\): different starting points for optimization process. Users are encouraged to load parallel backend (doParallel).
  • \(\color{blue}{\texttt{opts}}\): controls the optimization properties. Look up \(\color{blue}{\texttt{nloptr::nloptr.get.default.options()}}\) for more information.

The summary output of the fitted model contains parameters in both the mean and the covariance kernel:


Call:
gp(formula = y ~ X$x1 + as.factor(X$x3), data = data.frame(y, 
    X), inputs = names(X), cov = covICM(rank, num_pop), estim = TRUE, 
    compGrad = FALSE, noise = TRUE, varNoiseIni = 0.001, varNoiseLower = 1e-05, 
    varNoiseUpper = 0.01, parCovIni = c(10, 10, rep(0.15, rank * 
        num_pop)), optimMethod = "NLOPT_LN_COBYLA", opts = list(xtol_rel = 1e-05, 
        check_derivatives_tol = 1e-05, population = 120, maxeval = 150))


Number of observations: 690 

Trend coef.:
                     Value
(Intercept)      -11.22364
X$x1               0.10730
as.factor(X$x3)2   0.07649

Covariance whith class "covMan"
'User' covariance kernel
o Description: myGauss 
o Dimension 'd' (nb of inputs): 3
o Parameters: "theta.ag", "theta.yr", "a11", "a21", "a12", "a22"
o Number of parameters: 6
o Accept matrix inputs.
o Param. values: 
            Value Lower Upper
theta.ag 22.40037 1e+00    30
theta.yr 23.46850 1e+00    30
a11       0.33739 1e-06     1
a21       0.17090 1e-06     1
a12       0.05628 1e-06     1
a22       0.24297 1e-06     1

Noise variance: 0.001

Within the covariance ICM, we can extract the factor loadings and compute the cross-covariance matrix \(B\):

        Denmark  Sweden
Denmark 0.11700 0.07133
Sweden  0.07133 0.08824

We can infer cross-correlation matrix \(R\) from \(B\):

        Denmark Sweden
Denmark  1.0000 0.7021
Sweden   0.7021 1.0000

Thus, the correlation between Denmark and Sweden is: \(r_{DEN,~SWE} \approx 0.70\).

Out-of-sample prediction via Denmark-Sweden ICM

We first create a test set to predict log-mortality for Age group 70–84 in year 2013 (one-year out), 2015 (3-year out) and 2016 (4-year out):

We then extract \(\sigma^2_l\) from single-population models. In this case, we want \(\sigma^2_{DEN}\) and \(\sigma^2_{SWE}\) resulted from two single-population models, fitted on Male populations in Denmark and Sweden, Ages 70–84 and Years 1990–2012. For covenience purpose, the nugget_mortData.Rda is the R dataset that contains the estimated \(\sigma^2_l\) in SOGP models fitted on Ages 70–84 and different periods for 16 Male populations in this study.

[1] 0.0015163 0.0008022

Finally, we can apply the user-defined function \(\color{blue}{\texttt{gp.predict()}}\) in \(\color{blue}{\texttt{gp_predictionICM.Rda}}\) script for prediction:

The table below displays a few examples of the prediction on the test set for Denmark (\(popN=1\)) and Sweden (\(popN=2\)). The prediction results are the predicted mean, standard deviation and the posterior variance (we don’t show the cross-covariance here). Users have the option to estimate either the variance of the underlying function \(f_*()\) or the observed values \(y_*\). Here, we are showing the variance (and standard deviation) of \(y_*\):

age year popN mean vr std
1 70 2013 1 -3.822 0.0017 0.0413
2 71 2013 1 -3.713 0.0017 0.0408
46 70 2013 2 -4.039 0.0009 0.0303
47 71 2013 2 -3.928 0.0009 0.0299

To assess model performance, we can compute the symmetric mean absolute percentage error (SMAPE):

country year smape
Denmark 2013 1.4350
Sweden 2013 0.8248
Denmark 2015 1.2842
Sweden 2015 1.0939
Denmark 2016 1.1924
Sweden 2016 0.8971

2.1.2 Fitting single-population for Denmark and Sweden

Next, we employ \(\color{blue}{\texttt{km()}}\) function in the package DiceKriging to fit single-population models for Male observations, Ages 70–84 and Years 1990–2012 in Denmark and Sweden.


optimisation start
------------------
* estimation method   : MLE 
* optimisation method : gen 
* analytical gradient : used
* trend model : ~x.age
* covariance model : 
  - type :  gauss 
  - nugget : unknown homogenous nugget effect 
  - parameters lower bounds :  1e-10 1e-10 
  - parameters upper bounds :  45 45 
  - upper bound for alpha   :  1 
  - best initial criterion value(s) :  609.3 

Call:
km(formula = ~x.age, design = data.frame(x = xMort), response = yMort, 
    covtype = "gauss", nugget.estim = TRUE, optim.method = "gen", 
    upper = c(45, 45), control = list(max.generations = 100, 
        pop.size = 100, wait.generations = 10, solution.tolerance = 1e-05, 
        print.level = 0))

Trend  coeff.:
               Estimate
 (Intercept)   -10.5624
       x.age     0.0984

Covar. type  : gauss 
Covar. coeff.:
                Estimate
 theta(x.age)    30.9560
theta(x.year)    19.5353

Variance estimate: 0.122

Nugget effect estimate: 0.001516


optimisation start
------------------
* estimation method   : MLE 
* optimisation method : gen 
* analytical gradient : used
* trend model : ~x.age
* covariance model : 
  - type :  gauss 
  - nugget : unknown homogenous nugget effect 
  - parameters lower bounds :  1e-10 1e-10 
  - parameters upper bounds :  45 45 
  - upper bound for alpha   :  1 
  - best initial criterion value(s) :  713.4 

Call:
km(formula = ~x.age, design = data.frame(x = xMort), response = yMort, 
    covtype = "gauss", nugget.estim = TRUE, optim.method = "gen", 
    upper = c(45, 45), control = list(max.generations = 100, 
        pop.size = 100, wait.generations = 10, solution.tolerance = 1e-05, 
        print.level = 0))

Trend  coeff.:
               Estimate
 (Intercept)   -11.2168
       x.age     0.1086

Covar. type  : gauss 
Covar. coeff.:
                Estimate
 theta(x.age)    19.4663
theta(x.year)    10.7145

Variance estimate: 0.04152

Nugget effect estimate: 0.0008022

We apply the two single GP models fitted above to predict out-of-sample prediction for Ages 70–84 in Year 2013, 2015, and 2016. Then, we compute SMAPE values and compare the results with MOGP models.

[1] "Denmark"
   year smape
1: 2013 1.580
2: 2015 1.344
3: 2016 1.258
[1] "Sweden"
   year smape
1: 2013 1.046
2: 2015 1.977
3: 2016 2.533

  MOGP forecast is more accurate (smaller SMAPE) than SOGP for both populations.

Visualization of the prediction performance

To compare the prediction performance between single-population models and 2-population ICM, one way is to visualize the predicted values and the 95% credible bands of the prediction. Here, we illustrate how to compare the results in Sweden.

Prediction and 95% credible bands via Denmark-Sweden ICM:

Prediction and 95% credible bands via single-population GP for Sweden:

The interactive plots to visualize the predicted means and their 95% credible bands for Age 70 & 84 are below (R code for non-interactive plots are also provided). Note that the forecast period includes both in-sample (1990-2012) and out-of-sample (2013-2020).

 

  The model prediction starts to diverge for \(x_{yr} \geq 2013\). We see that the predicted curves produced by a joint model are closer to the observed values in the test period from 2013–2016. We also observe that the posterior variance in MOGP is smaller than one from SOGP model (the 95% credible bands are smaller in MOGP).

2.2 Example 2: Males & Females in Denmark

2.2.1 Fitting Danish Male-Female ICM

Similar to Example 1, we create an aggregated mortality data to combine Danish Males and Females, on Ages 70–84 and Years 1990–2012. Females are coded as 0 and Males are coded as 1.

Fitting ICM model for 2 populations and rank 2:


Call:
gp(formula = y ~ X$x1 + as.factor(X$x3), data = data.frame(y, 
    X), inputs = names(X), cov = covICM(rank, num_pop), estim = TRUE, 
    compGrad = FALSE, noise = TRUE, varNoiseIni = 0.001, varNoiseLower = 1e-05, 
    varNoiseUpper = 0.01, parCovIni = c(10, 10, rep(0.15, rank * 
        num_pop)), optimMethod = "NLOPT_LN_COBYLA", opts = list(xtol_rel = 1e-05, 
        check_derivatives_tol = 1e-05, population = 120, maxeval = 150))


Number of observations: 690 

Trend coef.:
                    Value
(Intercept)      -11.0492
X$x1               0.1008
as.factor(X$x3)1   0.4304

Covariance whith class "covMan"
'User' covariance kernel
o Description: myGauss 
o Dimension 'd' (nb of inputs): 3
o Parameters: "theta.ag", "theta.yr", "a11", "a21", "a12", "a22"
o Number of parameters: 6
o Accept matrix inputs.
o Param. values: 
            Value Lower Upper
theta.ag 15.01363 1e+00    30
theta.yr 11.69492 1e+00    30
a11       0.33616 1e-06     1
a21       0.09877 1e-06     1
a12       0.01610 1e-06     1
a22       0.15687 1e-06     1

Noise variance: 0.001

We can extract the factor loadings and compute the cross-covariance \(B\):

        Females   Males
Females 0.11326 0.03573
Males   0.03573 0.03436

The cross-correlation \(R\) is:

        Females  Males
Females  1.0000 0.5727
Males    0.5727 1.0000

The correlation between Danish males and females is: \(r_{F,~M} \approx 0.57\).

Out-of-sample prediction via Danish Males-Females ICM

We create a test set to predict log-mortality for Ages 70–84 in Year 2013, 2015, and 2016:

For prediction, we need to provide the nugget from single-population models, fitted on Female and Male population in Denmark. We then apply the \(\color{blue}{\texttt{gp.predict()}}\) function for prediction.

Below, the table displays examples of the prediction on the test set for Danish Females (\(popN=1\)) and Danish Males (\(popN=2\)):

age year popN mean vr std
1 70 2013 0 -4.288 0.0019 0.0437
2 71 2013 0 -4.163 0.0018 0.0425
46 70 2013 1 -3.808 0.0018 0.0425
47 71 2013 1 -3.701 0.0017 0.0418

Compute SMAPE values:

popN year smape
0 2013 0.8488
1 2013 1.5697
0 2015 1.7786
1 2015 1.2976
0 2016 1.2391
1 2016 1.2046

2.2.2 Fitting single-population for Males and Females in Denmark

We are using \(\color{blue}{\texttt{km()}}\) function in package DiceKriging to fit single-population models for Males and Females in Denmark, Ages from 70–84 and Years 1990–2012:


optimisation start
------------------
* estimation method   : MLE 
* optimisation method : gen 
* analytical gradient : used
* trend model : ~x.age
* covariance model : 
  - type :  gauss 
  - nugget : unknown homogenous nugget effect 
  - parameters lower bounds :  1e-10 1e-10 
  - parameters upper bounds :  45 45 
  - upper bound for alpha   :  1 
  - best initial criterion value(s) :  601.6 

Call:
km(formula = ~x.age, design = data.frame(x = xMort), response = yMort, 
    covtype = "gauss", nugget.estim = TRUE, optim.method = "gen", 
    upper = c(45, 45), control = list(max.generations = 100, 
        pop.size = 100, wait.generations = 10, solution.tolerance = 1e-05, 
        print.level = 0))

Trend  coeff.:
               Estimate
 (Intercept)   -11.6755
       x.age     0.1095

Covar. type  : gauss 
Covar. coeff.:
                Estimate
 theta(x.age)    11.5563
theta(x.year)     9.5257

Variance estimate: 0.04287

Nugget effect estimate: 0.001487


optimisation start
------------------
* estimation method   : MLE 
* optimisation method : gen 
* analytical gradient : used
* trend model : ~x.age
* covariance model : 
  - type :  gauss 
  - nugget : unknown homogenous nugget effect 
  - parameters lower bounds :  1e-10 1e-10 
  - parameters upper bounds :  45 45 
  - upper bound for alpha   :  1 
  - best initial criterion value(s) :  608.6 

Call:
km(formula = ~x.age, design = data.frame(x = xMort), response = yMort, 
    covtype = "gauss", nugget.estim = TRUE, optim.method = "gen", 
    upper = c(45, 45), control = list(max.generations = 100, 
        pop.size = 100, wait.generations = 10, solution.tolerance = 1e-05, 
        print.level = 0))

Trend  coeff.:
               Estimate
 (Intercept)   -10.5622
       x.age     0.0984

Covar. type  : gauss 
Covar. coeff.:
                Estimate
 theta(x.age)    30.9379
theta(x.year)    19.5254

Variance estimate: 0.1218

Nugget effect estimate: 0.001516

Compute SMAPE values for out-of-sample prediction:

[1] "sex =  0"
   year  smape
1: 2013 0.9422
2: 2015 1.8972
3: 2016 1.4004
[1] "sex =  1"
   year smape
1: 2013 1.580
2: 2015 1.344
3: 2016 1.258

  Again, MOGP model produces more accurate prediction than SOGP for both Males and Females in Denmark.

Coherence mortality forecasts

Fitting GP models for individual populations tends to generate divergence in long-range forecast that are inconsistent with historical patterns. To illustrate, we apply SOGP and MOGP models for out-of-sample forecasts, Age 70. Then, we compute the log-mortality difference between Danish Males and Females.

Difference in log-mortality between Males and Females for Age 70, projected from 1990–2050, based on two SOGP models:

Difference in log-mortality between Males and Females for Age 70, projected from 1990–2050, based on MOGP model:

We further create interactive heatmaps to display the differences from two cases: (i) differences computed from two SOGP models and (ii) differences computed from a MOGP model. Again, all models are fitted for Ages 70–84 in Years 1990–2012. The differences are projected onwards to 2050.

 

 

The first heatmap displays the differences via SOGP models. It implies that as early as 2030, Males will have lower mortality than females.   In contrast, the second heatmap is the MOGP forecasted differences: it is coherent as Females are projected to maintain higher mortality than Males.

3 Many-population ICM

In this section, we expand the ICM framework to fit many more populations. Specifically, we build a 7-population ICM with Rank 3, on Ages 70–84 and Years 1990–2016 for Male observations. Three scenarios with different assumptions on the mortality improvement factors are fitted.

Mortality Improvement Factors: measures longevity changes year-over-year. In terms of the observations, the raw annual percentage mortality improvement is: \(1-\dfrac{\exp{\big(y(x_{ag};x_{yr})\big)}}{\exp{\big(y(x_{ag};x_{yr}-1)\big)}}\). The smoothed improvement factor is obtained by replacing \(y\)’s by the GP model posterior \(m_*\)’s: \[1-\dfrac{\exp{\big(m_*(x_{ag};x_{yr})\big)}}{\exp{\big(m_*(x_{ag};x_{yr}-1)\big)}}\]

Prior to fitting, we prepare the aggregated data from 7 countries: Austria, Denmark, Germany, Netherlands, Sweden, Switzerland, and UK.

Case I: Zero long-term mortality improvement, captured by the linear mean function: \[m(x^n)=\beta_0+\beta^{ag}_1x_{ag}^n+\sum_{l=2}^L\beta_{pop,l}\big(x_{pop,l}^n\big)\]

Case II: Long-term mortality improvement based on a historical pattern, captured by: \[m(x^n)=\beta_0 + \beta^{ag}_1x_{ag}^n+\beta^{yr}_1x_{yr}^n + \sum_{l=2}^L \beta_{pop,l}x_{pop,l}^n\]

Case III: Long-term mortality improvement based on expert judgement. We again use \(m(x^n)=\beta_0 + \beta^{ag}_1x_{ag}^n+\beta^{yr}_1x_{yr}^n + \sum_{l=2}^L \beta_{pop,l}x_{pop,l}^n\) but this time the \(\beta_1^{yr}\) coefficients is chosen by the modeler.

After fitting these models, we create a test set for all cases to predict both log-mortality and improvement rates for Age 70.

Prediction via Case I:

Prediction via Case II:

Prediction via Case III:

Putting the forecasted log-mortality rates and improvement rates in all 7 populations together:

 

 

 

 

The choice of mean function \(m(.)\) has minimal impact on in-sample forecasts that are largely driven by the training data covering 1990–2016. In contrast, the long-term levels of mortality improvement are completely driven by \(m(.)\). When visualizing the forecasts for all cases, we see a strong coherence so that mortality rates across populations all move roughly in unison over time, matching our intuition about the persistent commonality of their future mortality experiences.