Funding for the SYLLS project has ended, and the work has continued under the Administrative Data Research Centre for Scotland (ADRC-S).

The England and Wales Longitudinal Study (ONS LS), Scottish Longitudinal Study (SLS) and Northern Ireland Longitudinal Study (NILS) are incredibly rich micro-datasets linking census and other health and administrative data (births, deaths, marriages, cancer registrations) for individuals and their immediate families across several decades. Whilst they are unique and valuable resources, the sensitive nature of the information they contain means that access to the microdata is restricted to approved researchers and LS support staff, who can only view and work with the data in safe settings controlled by the national statistical agencies. Consequently, compared to other census data products such as the aggregate statistics or interaction data, the three longitudinal studies are used by only a small number of researchers – a situation which limits their potential impact.

Given that confidentiality constraints mean that open access is not possible with the real microdata, alternative options were needed to allow academics and other users to carry out their research more freely. To address this, the SYLLS project (Synthetic Data Estimation for UK Longitudinal Studies) was set up. SYLLS developed techniques to produce synthetic data which mimic the real data and preserve the relationships between variables, but are more freely accessible.

This project, a collaboration between the three UK Longitudinal Study Research Support Units (CeLSIUS, LSCS and the NILS-RSU), made use of two complementary methods for generating synthetic data products:

  1. Statistical modelling with conditional specification was used to generate bespoke synthetic datasets for individual research projects. Users clean their data and develop their analyses on the synthetic data, then repeat the analyses on the actual LS datasets and, we hope, confirm the results. Routines to generate a synthetic version of a real dataset are implemented in the R package ‘synthpop’, which is freely available from the R website. More information about the package can be found on the ‘Synthpop’ website.
  2. Microsimulation was used to generate synthetic longitudinal data ‘spines’ for each of the national longitudinal studies. These ‘spines’ synthesise the full sample but include only the most frequently used variables and longitudinal transitions.
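
To illustrate the idea behind the first method, the following is a minimal Python sketch of sequential synthesis by conditional specification. It is not the ‘synthpop’ implementation (which is an R package and fits a model for each variable); it is a simplified stand-in in which each variable is drawn from the values observed among records that share the previous variable's synthetic value. The function name `synthesise`, the variable names and the toy records are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def synthesise(rows, order, n, seed=0):
    """Generate n synthetic records, one variable at a time.

    The first variable in `order` is drawn from its observed marginal
    distribution. Each later variable is drawn from the values observed
    among real records that match the synthetic value of the variable
    immediately before it (a simplifying assumption: real tools usually
    condition on all previously synthesised variables).
    """
    rng = random.Random(seed)
    synthetic = [{} for _ in range(n)]
    for i, var in enumerate(order):
        if i == 0:
            # Marginal draw: resample the observed values with replacement.
            pool = [r[var] for r in rows]
            for s in synthetic:
                s[var] = rng.choice(pool)
        else:
            prev = order[i - 1]
            # Group observed values of `var` by the value of `prev`.
            conditional = defaultdict(list)
            for r in rows:
                conditional[r[prev]].append(r[var])
            # Conditional draw, given the already-synthesised value of `prev`.
            for s in synthetic:
                s[var] = rng.choice(conditional[s[prev]])
    return synthetic


# Hypothetical toy records (not LS data).
observed = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "yes"},
    {"sex": "M", "employed": "no"},
    {"sex": "M", "employed": "no"},
]
synthetic = synthesise(observed, ["sex", "employed"], n=100, seed=1)
```

Because each variable is drawn conditionally on an already-synthesised one, the relationship between `sex` and `employed` in the toy records carries over to the synthetic records, even though no synthetic record is a copy of a particular real individual.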

Development of the ‘synthpop’ package now continues under the auspices of the ADRC-S.

Contact: Dr Beata Nowok or Professor Gillian Raab

The SYLLS team was: