\(\color{#c64329}{\text{WELCOME}}\)

This repository provides a sequence of tutorials on fitting Multi-output Gaussian process (MOGP) models for multi-population longevity modeling. The underlying datasets and R scripts are available to users via the accompanying GitHub repository. For a more detailed explanation of the methods, please see our arXiv preprint.

The aim of our models is to analyze historical mortality data (primarily from the Human Mortality Database, HMD). Rather than direct investigation of the historical experience, our focus is on statistical modeling for the purposes of (i) insights into past and present longevity trends; (ii) data fusion across multiple datasets; (iii) probabilistic forecasting of future longevity trends, especially in the near term (<3 years) and medium term (3–10 years).

We work with a statistical framework for age-specific mortality; this means that we make no mechanistic assumptions about the mortality rate as a function of the given covariates and do not rely on any further demographic, socio-economic or structural patterns. This paradigm of stochastic mortality modeling is appropriate for actuarial analysis that focuses on projecting mortality experience for the intermediate-old ages.

  Disclaimer: Our methodology is much less appropriate for modeling infant and young-adult mortality or extreme ages (90+), or for making long-term projections (10+ years). While the model can certainly be run on such datasets, we strongly caution potential users that adjustments are likely to be necessary to obtain reasonable results.

\(\color{#c64329}{\text{Highlights of the Approach}}\)

The primary target audience is actuaries and demographers with statistical and/or quantitative backgrounds who are interested in new methodology for looking at longevity trends across multiple populations. The key features of our approach are:

  • A scalable machine learning methodology that simultaneously models multiple longevity surfaces with a joint spatial covariance framework. Specifically, we tap into the GP ecosystem, which is a centerpiece of probabilistic data science and, in our opinion, holds enormous promise for actuarial purposes (a minimal illustrative sketch follows this list);

  • Explicit information fusion across populations to maximize predictive accuracy;

  • A unified, data-driven framework for smoothing historical mortality observations (aka in-sample prediction) and forecasting future longevity scenarios (out-of-sample prediction);

  • A fully probabilistic framework that provides stochastic scenarios, probabilistic forecasts, and uncertainty bands, rather than just point projections;

  • Open-source implementation using publicly available packages, specifically within the R environment.
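
As a concrete taste of this ecosystem, below is a minimal single-population GP sketch on synthetic log-mortality data, using the DiceKriging package. It is purely illustrative: the package, kernel, and settings here are our assumptions for this sketch, not necessarily the choices made in the tutorials.

```r
# Minimal single-output GP sketch (illustrative only).
library(DiceKriging)

# Toy design: (Age, Year) inputs with synthetic log-mortality y = log(D/E).
mort <- data.frame(
  age     = rep(50:84, times = 3),
  year    = rep(2014:2016, each = 35),
  logmort = -9 + 0.09 * rep(50:84, times = 3) + rnorm(105, sd = 0.05)
)

# Fit a GP with a squared-exponential kernel, constant mean, and an
# estimated nugget (observation noise).
fit <- km(
  formula      = ~1,
  design       = mort[, c("age", "year")],
  response     = mort$logmort,
  covtype      = "gauss",
  nugget.estim = TRUE
)

# Both in-sample smoothing and out-of-sample forecasting are predict() calls:
newpts <- expand.grid(age = 50:84, year = 2017:2019)
pred   <- predict(fit, newdata = newpts, type = "UK")  # mean + predictive sd
head(cbind(newpts, mean = pred$mean, sd = pred$sd))
```

Note how the predictive standard deviation comes for free, which is precisely the probabilistic-forecasting feature highlighted above.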

The tutorials are organized into several parts:

  • Methodology: contains an overview of the methodology and the notation we use throughout this repository.

  • Tutorial I: walks through setting up a full-rank MOGP model with several case studies. We explain how to derive the distribution of the improvement-rate factors and compare the results with single-output models.

  • Tutorial II: defines the Intrinsic Coregionalization Model (ICM) for MOGP, which yields more efficient and scalable models and achieves dimension reduction (the ICM covariance structure is previewed after this list). We also illustrate how MOGP models achieve better out-of-sample predictive performance and generate coherent long-term forecasts.

  • Tutorial III: shows how to set up the ICM when the data have missing values, i.e. when not all outputs are observed at a given set of input locations.
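
For orientation before Tutorial II, here is a standard textbook formulation of the ICM covariance (the tutorials fix the exact notation): the \(L\) population surfaces \(f_1,\ldots,f_L\) share a single kernel \(k\) over the (Age, Year) inputs, coupled through a cross-population matrix \(B\), \[\begin{equation} \text{Cov}\big(f_l(x),\, f_{l'}(x')\big) = B_{l,l'} \cdot k(x,x'), \qquad B = AA^\top + \text{diag}(\boldsymbol{\kappa}) \in \mathbb{R}^{L\times L}, \end{equation}\] where \(B\) is positive semi-definite. Taking \(A\) with \(Q < L\) columns makes \(AA^\top\) rank-\(Q\), which is the source of the dimension reduction mentioned above.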

\(\color{#c64329}{\text{Why Multiple Populations?}}\)

Before diving into methodological details, we emphasize that there is an extensive and fast-growing literature on stochastic mortality modeling for a single population, and many methods for two populations (such as joint modeling of Male/Female longevity). However, few models exist for predictive multi-population longevity analysis beyond two populations. Thus, our motivation is to provide a predictive front-end to the wealth of data in the HMD, which is an amazing resource that is not fully leveraged by most existing tools.

It is generally accepted that there is strong commonality in the mortality experiences of different populations. Consequently, aggregating mortality datasets affords better capture of trends and denoising of raw longevity data, improving prediction accuracy. This applies both to the static structure of longevity (in terms of Age and Year) and to its dynamic evolution over time. Data fusion is also important for mitigating model risk, i.e. for fitting the best model within the proposed class. In our approach, this translates into better (hyper)parameter estimates and a lower likelihood that the statistical model “goes rogue”. In other words, a multi-population model will have higher actuarial credibility. This is especially true for smaller nations with just a few million inhabitants.

Moreover, joint models enable information fusion, which is very valuable since mortality data are released asynchronously. With a joint model, one can rely on newly released data from a related foreign population to update and improve the domestic forecast (see Tutorial III). Last but not least, joint models are critical for generating forecasts and future scenarios simultaneously across multiple populations. Individually fitted models tend to be non-coherent, i.e. they include scenarios where the joint mortality trends cross over or diverge in unrealistic ways.

\(\color{#c64329}{\text{Data}}\)

For all the case studies, we work with mortality data from the Human Mortality Database, which provides aggregated mortality statistics at the national level for more than 40 developed countries across the globe.

The HMD applies the same consistent set of procedures to each population and presently focuses on developed economies where death registrations and census data are available and reliable. For our analysis we rely on one-year age groups, concentrating on Ages 50–84 (the retirement ages most relevant for predictive actuarial analysis) for both genders and calendar Years 1990–2016. We note that HMD data are typically delayed by 2–3 years from the present to allow thorough cleaning and double-checking of all data.
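
One convenient way to pull these series into R is the HMDHFDplus package; the sketch below is one possible route (it assumes a free HMD account and is not necessarily how the repository's scripts access the data):

```r
# Sketch: download deaths and exposures for one country from the HMD and
# subset to Ages 50-84, Years 1990-2016. Requires (free) HMD credentials.
library(HMDHFDplus)

deaths <- readHMDweb(CNTRY = "USA", item = "Deaths_1x1",
                     username = "you@example.com", password = "***")
expos  <- readHMDweb(CNTRY = "USA", item = "Exposures_1x1",
                     username = "you@example.com", password = "***")

keep   <- function(df) subset(df, Age >= 50 & Age <= 84 &
                                  Year >= 1990 & Year <= 2016)
deaths <- keep(deaths)
expos  <- keep(expos)
```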

The dataset is organized as a large table. The \(n\)th observation for the \(l\)th country contains (i) Age and Year as a pair of independent variables, \((x_{ag}^n,x_{yr}^n)\), and (ii) the natural logarithm of the observed mortality rate, \[\begin{equation} y^n = \log\bigg[\frac{\text{Death counts at $(x_{ag}^n,x_{yr}^n)$}}{\text{Exposed-to-risk counts at $(x_{ag}^n,x_{yr}^n)$}}\bigg]=\log\bigg[\frac{D^n}{E^n}\bigg]. \end{equation}\] For example, \((x_{ag},x_{yr},y) = (64,2010,-4.2)\) means that for that population the Age-64 yearly mortality rate in 2010 was \(e^{-4.2} = 0.0149956\), i.e. about a 1.5% chance that a 64-year-old would not make it to their 65th birthday.
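
Continuing the hypothetical download sketch above, the responses \(y^n = \log(D^n/E^n)\) can be assembled directly from the Deaths and Exposures tables:

```r
# Align Deaths and Exposures by (Year, Age) and form y = log(D/E), for Males.
tab  <- merge(deaths, expos, by = c("Year", "Age"),
              suffixes = c(".D", ".E"))
mort <- data.frame(age  = tab$Age,
                   year = tab$Year,
                   y    = log(tab$Male.D / tab$Male.E))

# Sanity check of the worked example in the text:
exp(-4.2)   # 0.01499558, i.e. ~1.5%
```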

We denote by \(\mathcal{D}_l =\{(x^n,y^n)\}_{n=1}^N\) the dataset for the \(l\)th country. In the analysis below we will work with one or more populations simultaneously.
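
For the multi-population models, one natural encoding of \(\mathcal{D}_1,\ldots,\mathcal{D}_L\) is a single stacked table with the population label as an extra categorical input; a hypothetical sketch (the actual input encoding is described in the Methodology part):

```r
# Hypothetical stacking of per-country tables (each shaped like 'mort' above)
# into a single MOGP design: each row is (age, year, pop, y).
mort_usa$pop <- "USA"
mort_can$pop <- "CAN"
D_all <- rbind(mort_usa, mort_can)
D_all$pop <- factor(D_all$pop)   # population label as a categorical input
str(D_all)
```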