Synthpop

Go to the Synthpop website

Overview

The synthpop package for R allows users to create synthetic versions of confidential individual-level data for use by researchers interested in making inferences about the population that the data represent. The synthesised data can be released with fewer restrictions on how they must be held than apply to the original data. They can be used to carry out statistical analyses, though we usually recommend repeating the analysis on the original data to confirm the results. Synthetic data are also useful as data sets for teaching.

The package allows the synthesis process to be customised in many different ways according to the characteristics of the data being synthesised. There are default values for most of the parameters, but if you want your synthetic data to be useful you must set parameters appropriately.

To cite the synthpop package in publications use:

Nowok, B., G.M. Raab & C. Dibben (2016), synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11. Available at: https://www.jstatsoft.org/article/view/v074i11

Methodology

The key objective of producing synthetic versions of original data sets is to replace sensitive values with synthetic ones while causing minimal distortion of the statistical information contained in the data set. Usually in synthpop all values of the synthesised variables are replaced. Most commonly, variables are synthesised one by one using sequential regression modelling. This means that the conditional distributions from which synthetic values are drawn are defined separately for each variable, and each is conditional on the variables that precede it in the synthesis sequence. A recent addition to our methods is the option to synthesise a group of categorical variables together, from their joint distribution, at the start of the synthesis. See the package NEWS file for more details. Additional variables that are not to be synthesised can be used as predictors. If the user chooses to retain them in the synthetic data, then the output data set will be partially or incompletely synthesised.

Consider, as an example, a default synthesis, i.e. one in which all values of all variables (Y1, Y2, …, Yp) are replaced. The first variable to be synthesised, Y1, cannot have any predictors, so its synthetic values are generated by random sampling with replacement from its observed values. Then the distribution of Y2 conditional on Y1 is estimated, and the synthetic values of Y2 are generated using the fitted model and the synthesised values of Y1. Next, the distribution of Y3 conditional on Y1 and Y2 is estimated and used, along with the synthetic values of Y1 and Y2, to generate synthetic values of Y3, and so on. The distribution of the last variable, Yp, is conditional on all other variables. Similar conditional specification approaches are used in most implementations of synthetic data generation. They are preferred to joint modelling not only because they are easier to implement but also because they make it possible to apply methods that take into account structural features of the data, such as logical constraints or missing data patterns.
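The first two steps of this sequence can be sketched in a few lines. The following is a minimal illustration in Python, not synthpop's own code: it assumes two numeric variables, uses a simple least-squares line as the conditional model, and adds noise by resampling the fitted residuals. The function names fit_line and synthesise_pair are hypothetical and exist only for this example.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b, residuals)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals


def synthesise_pair(y1, y2, rng):
    """Synthesise (Y1, Y2) sequentially from the observed columns y1 and y2."""
    n = len(y1)
    # Step 1: Y1 has no predictors, so sample with replacement
    # from its observed values.
    syn_y1 = [rng.choice(y1) for _ in range(n)]
    # Step 2: estimate the distribution of Y2 given Y1 on the ORIGINAL data.
    a, b, residuals = fit_line(y1, y2)
    # Step 3: generate synthetic Y2 from the fitted model, evaluated at the
    # SYNTHESISED Y1 values. Resampling residuals is an assumption made
    # here to keep the sketch short.
    syn_y2 = [a + b * x + rng.choice(residuals) for x in syn_y1]
    return syn_y1, syn_y2


if __name__ == "__main__":
    rng = random.Random(2016)
    observed_y1 = [23.0, 35.0, 41.0, 52.0, 64.0, 70.0]
    observed_y2 = [18.5, 24.0, 27.5, 31.0, 38.5, 40.0]
    syn_y1, syn_y2 = synthesise_pair(observed_y1, observed_y2, rng)
    for s1, s2 in zip(syn_y1, syn_y2):
        print(s1, round(s2, 2))
```

A third variable Y3 would follow the same pattern: fit a model for Y3 given Y1 and Y2 on the original data, then evaluate it at the synthetic Y1 and Y2.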

With practicality and flexibility in mind, classification and regression trees (CART) are used as the default conditional models for synthesis but various parametric alternatives are also available.
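To give a feel for how a tree can act as a conditional model, here is a deliberately tiny sketch in Python. It assumes a single numeric predictor and a tree with only one split (a "stump"), whereas a real CART model grows many splits over many predictors. Synthetic values are drawn from the observed values that fall in the same leaf. The names best_split and synthesise_with_stump are hypothetical, and this is not the code synthpop runs.

```python
import random


def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Threshold on x that minimises the squared error of the two leaves.

    Returns None when x takes a single value and no split is possible.
    """
    candidates = sorted(set(xs))
    best_threshold, best_error = None, float("inf")
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold


def synthesise_with_stump(xs, ys, syn_xs, rng):
    """Draw a synthetic y for each synthetic x from the matching leaf."""
    threshold = best_split(xs, ys)
    if threshold is None:
        # Only one leaf: sample from all observed y values.
        return [rng.choice(ys) for _ in syn_xs]
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return [rng.choice(left if x <= threshold else right) for x in syn_xs]


if __name__ == "__main__":
    rng = random.Random(74)
    observed_x = [1, 2, 3, 10, 11, 12]
    observed_y = [5, 6, 5, 50, 51, 49]
    synthetic_x = [rng.choice(observed_x) for _ in observed_x]
    print(synthesise_with_stump(observed_x, observed_y, synthetic_x, rng))
```

Because the synthetic values are resampled from observed ones within a leaf, no distributional form has to be assumed, which is part of what makes tree-based models a practical default.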

synthpop story

The R package synthpop was written as part of the SYLLS project (SYnthetic Data Estimation for UK LongitudinaL Studies), funded by the UK Economic and Social Research Council, to allow support staff of the UK Longitudinal Studies (LSs) to produce synthetic data tailored to the needs of individual research projects.
