Synthpop :: Longitudinal Studies Centre Scotland

Synthpop

Overview

The synthpop package for R allows users to create synthetic versions of confidential individual-level data for use by researchers interested in making inferences about the population that the data represent. The synthesised data can be released with fewer restrictions on how they must be held than apply to the original data. They can be used to carry out statistical analyses, though we would usually recommend repeating the final analysis on the original data to confirm the results. Synthetic data are also useful as data sets for teaching.

The package allows the synthesis process to be customised in many different ways according to the characteristics of the data being synthesised. There are default values for most of the parameters, but if you want your synthetic data to be useful you must set parameters appropriately.

To cite the synthpop package in publications use:

Nowok, B., G.M. Raab & C. Dibben (2016), synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. DOI: 10.18637/jss.v074.i11. Available at: https://www.jstatsoft.org/article/view/v074i11

Methodology

The key objective of producing synthetic versions of original data sets is to replace sensitive values with synthetic ones while causing minimal distortion of the statistical information contained in the data set. In synthpop, all values of the synthesised variables are usually replaced. Most commonly, variables are synthesised one by one using sequential regression modelling. This means that the conditional distributions from which synthetic values are drawn are defined separately for each variable, and each is conditioned on the variables that come earlier in the synthesis sequence. A recent addition to our methods is the option to synthesise a group of categorical variables together, from their joint distribution, at the start of the synthesis; see the package NEWS file for more details. Additional variables that are not to be synthesised can be used as predictors. If the user chooses to retain them in the synthetic data, the output data set will be partially or incompletely synthesised.

Consider as an example a default synthesis, i.e. synthesis with all values of all variables (Y1, Y2,…, Yp) to be replaced. The first variable to be synthesised Y1 cannot have any predictors and therefore its synthetic values are generated by random sampling with replacement from its observed values. Then the distribution of Y2 conditional on Y1 is estimated and the synthetic values of Y2 are generated using the fitted model and the synthesised values of Y1. Next the distribution of Y3 conditional on Y1 and Y2 is estimated and used along with synthetic values of Y1 and Y2 to generate synthetic values of Y3 and so on. The distribution of the last variable Yp will be conditional on all other variables. Similar conditional specification approaches are used in most implementations of synthetic data generation. They are preferred to joint modelling not only because of the ease of implementation but also because of their flexibility to apply methods that take into account structural features of the data such as logical constraints or missing data patterns.
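The order of operations in this default synthesis can be illustrated with a minimal Python sketch. It is not the synthpop implementation: the function name `synthesise_sequentially`, the toy records and the variable names are all hypothetical, and the conditional "model" is simply the set of observed values among original records that match exactly on the earlier variables, swapped in here for CART or a parametric model to keep the example short.

```python
import random
from collections import defaultdict


def synthesise_sequentially(records, order, seed=0):
    """Return a fully synthetic copy of `records` (a list of dicts).

    Variables are synthesised in `order`. The first has no predictors, so it
    is drawn by sampling with replacement from its observed values. Each later
    variable is drawn from the observed values of the original records that
    match the synthetic record on all earlier variables.
    """
    rng = random.Random(seed)
    synthetic = [{} for _ in records]

    for position, var in enumerate(order):
        earlier = order[:position]

        # Estimate the conditional distribution from the ORIGINAL data:
        # group the observed values of `var` by the earlier variables.
        donors = defaultdict(list)
        for rec in records:
            key = tuple(rec[v] for v in earlier)
            donors[key].append(rec[var])

        # Generate using the SYNTHETIC values of the earlier variables.
        # Every synthetic combination of earlier values was itself drawn
        # from the original data, so a donor pool always exists.
        for syn in synthetic:
            key = tuple(syn[v] for v in earlier)
            syn[var] = rng.choice(donors[key])

    return synthetic


# Hypothetical toy data, for illustration only.
original = [
    {"sex": "F", "smoker": "no", "region": "north"},
    {"sex": "F", "smoker": "yes", "region": "south"},
    {"sex": "M", "smoker": "no", "region": "south"},
    {"sex": "M", "smoker": "no", "region": "north"},
    {"sex": "F", "smoker": "no", "region": "south"},
]
synthetic = synthesise_sequentially(original, ["sex", "smoker", "region"], seed=1)
```

Because this sketch conditions on exact matches, every synthetic record reproduces a combination of values that occurs in the original data, so it shows the sequence of steps only and offers no disclosure protection. Models such as CART group similar records rather than requiring exact matches.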

With practicality and flexibility in mind, classification and regression trees (CART) are used as the default conditional models for synthesis but various parametric alternatives are also available.
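One common way to use a tree as a conditional model is to fit it to the original data and then, for each synthetic record, draw a value at random from the observed responses in the leaf that the record falls into. The sketch below illustrates that idea with the smallest possible tree, a single split on one numeric predictor. The function names `best_split` and `synthesise_from_stump` and the toy numbers are hypothetical, and a real CART fit would grow many splits over several predictors.

```python
import random


def best_split(xs, ys):
    """Return the threshold on x that minimises the within-leaf sum of
    squared deviations of y, or None if x takes only one distinct value."""

    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_threshold, best_cost = None, float("inf")
    distinct = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


def synthesise_from_stump(xs, ys, synthetic_xs, seed=0):
    """Fit a one-split tree to the original (xs, ys) and draw a synthetic y
    for each synthetic x from the observed ys in the matching leaf."""
    rng = random.Random(seed)
    threshold = best_split(xs, ys)
    if threshold is None:
        # No split is possible, so the whole sample is a single leaf.
        return [rng.choice(ys) for _ in synthetic_xs]
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return [rng.choice(left if x <= threshold else right) for x in synthetic_xs]


# Hypothetical toy data: age as predictor, income (in thousands) as response.
ages = [21, 25, 30, 52, 58, 63]
incomes = [18, 20, 22, 41, 39, 44]
synthetic_ages = [23, 60, 28]
synthetic_incomes = synthesise_from_stump(ages, incomes, synthetic_ages, seed=1)
```

Drawing from leaf donors means the synthetic values are always values that were actually observed, which keeps them plausible without assuming a distributional form.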

synthpop story

The R package synthpop was written as part of the SYLLS project (SYnthetic Data Estimation for UK LongitudinaL Studies), funded by the UK Economic and Social Research Council, to allow support staff of the UK Longitudinal Studies (LSs) to produce synthetic data tailored to the needs of individual research projects.

