Synthetic data

Synthetic data are "any production data applicable to a given situation that are not obtained by direct measurement" according to the McGraw-Hill Dictionary of Scientific and Technical Terms;^[1] where Craig S. Mullins, an expert in data management, defines production data as "information that is persistently stored and used by professionals to conduct business processes.".^[2]

The creation of synthetic data is an involved process of data anonymization; that is to say that synthetic data is a subset of anonymized data.^[3] Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. Many times the particular aspects come about in the form of human information (i.e. name, home address, IP address, telephone number, social security number, credit card number, etc.).

Usefulness

Synthetic data are generated to meet specific needs or certain conditions that may not be found in the original, real data. This can be useful when designing any type of system because the synthetic data are used as a simulation or as a theoretical value, situation, etc. This allows us to take into account unexpected results and have a basic solution or remedy, if the results prove to be unsatisfactory. Synthetic data are often generated to represent the authentic data and allows a baseline to be set.^[4] Another use of synthetic data is to protect privacy and confidentiality of authentic data. As stated previously, synthetic data is used in testing and creating many different types of systems; below is a quote from the abstract of an article that describes a software that generates synthetic data for testing fraud detection systems that further explains its use and importance. "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment."^[4]

History

The history of the generation of synthetic data dates back to 1993. In 1993, the idea of original fully synthetic data was created by Rubin.^[5] Rubin originally designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records - in this he preserved anonymity of the household.^[6] Later that year, the idea of original partially synthetic data was created by Little. Little used this idea to synthesize the sensitive values on the public use file.^[7]

In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling.^[6] Later, other important contributors to the development of synthetic data generation are Raghunathan, Reiter, Rubin, Abowd, Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. Similarly they came up with the technique of Sequential Regression Multivariate Imputation.^[6]

Applications

Synthetic data are used in the process of data mining. Testing and training fraud detection systems, confidentiality systems and any type of system is devised using synthetic data. As described previously, synthetic data may seem as just a compilation of “made up” data, but there are specific algorithms and generators that are designed to create realistic data.^[8] This synthetic data assists in teaching a system how to react to certain situations or criteria. Researcher doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data was not used, the software would only be trained to react to the situations provided by the authentic data and it may not recognize another type of intrusion.^[4]

Synthetic data is also used to protect the privacy and confidentiality of a set of data. Real data contains personal/private/confidential information that a programmer, software creator or research project may not want to be disclosed.^[9] Synthetic data holds no personal information and cannot be traced back to any individual; therefore, the use of synthetic data reduces confidentiality and privacy issues.

Calculations

Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms".¹⁰

"Synthetic data can be generated with random orientations and positions."⁸ Datasets can be get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data the best. This model or equation will be called a synthesizer build. This build can be used to generate more data.⁹

Constructing a synthesizer build involves constructing a statistical model. In a linear regression line example, the original data can be plotted, and a best fit linear line can be created from the data. This line is a synthesizer created from the original data. The next step will be generating more synthetic data from the synthesizer build or from this linear line equation. In this way, the new data can be used for studies and research, and it protects the confidentiality of the original data.⁹

David Jensen from the Knowledge Discovery Laboratory mentioned how to generate synthetic data in his "Proximity 4.3 Tutorial" chapter 6: "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, proximity can generate synthetic data having one of several types of graph structure¹⁰:random graphs that is generated by some random process;lattice graphs having a ring structure;lattice graphs having a grid structure, etc. In all cases, the data generation process follows the same process: 1. Generate the empty graph structure. 2. Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.¹⁰

References

↑ Synthetic data. (n.d.). McGraw-Hill Dictionary of Scientific and Technical Terms. Retrieved November 29, 2009, from Answers.com Web site:
↑ Mullins, Craig S. (2009, February 5). What is Production Data? Message posted to http://www.neon.com/blog/blogs/cmullins/archive/2009/02/05/What-is-Production-Data_3F00_.aspx
↑ MacHanavajjhala, Ashwin; Kifer, Daniel; Abowd, John; Gehrke, Johannes; Vilhuber, Lars (2008). "Privacy: Theory meets Practice on the Map". 2008 IEEE 24th International Conference on Data Engineering: 277–286. doi:10.1109/ICDE.2008.4497436.
1 2 3 Barse, E.L., Kvarnström, H., & Jonsson, E. (2003). Synthesizing test data for fraud detection systems. Manuscript submitted for publication, Department of Computer Engineering, Chalmbers University of Technology, Göteborg, Sweden. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1254343&isnumber=28060
↑ "Discussion: Statistical Disclosure Limitation". Journal of Official Statistics. 9: 461–468. 1993.
1 2 3 Abowd, John M. "Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods. [Powerpoint slides]". Retrieved 17 February 2011.
↑ "Statistical Analysis of Masked Data". Journal of Official Statistics. 9: 407–426. 1993.
↑ Deng, R. (2002). Information and Communications Security. Proceedings of the 4th International Conference, ICICS 2002 Singapore, December 2002. Retrieved from https://books.google.com/books?id=6mod7enQa8cC&pg=PA265&dq=%22synthetic+data%22#v=onepage&q=%22synthetic%20data%22&f=false
↑ Abowd, J.M., & Lane, J. (2004). New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers. Manuscript submitted for publication, Cornell Institute for Social and Economic Research (CISER), Cornell University, Ithica, New York. Retrieved from http://www.springerlink.com/content/27nud7qx09qurg3p/fulltext.pdf

Wang, A, Qiu, T, & Shao, L. (2009). A Simple Method of Radial Distortion Correction with Centre of Distortion Estimation. 35. Retrieved from http://www.springerlink.com/content/8180144q56t30314/fulltext.pdf
Duncan, G. (2006). Statistical confidentiality: Is Synthetic Data the Answer? Retrieved from http://www.idre.ucla.edu/events/PPT/2006_02_13_duncan_Synthetic_Data.ppt
Jensen, D. (2004). Proximity 4.3 Tutorial Chapter 6. Retrieved from http://kdl.cs.umass.edu/proximity/documentation/tutorial/ch06s09.html
Jackson, C, Murphy, R, & Kovaˇcevic´, J. (2009). Intelligent Acquisition and Learning of Fluorescence Microscope Data Models. 18(9), Retrieved from http://www.andrew.cmu.edu/user/jelenak/Repository/08_JacksonMK.pdf
Adam Coates and Blake Carpenter and Carl Case and Sanjeev Satheesh and Bipin Suresh and Tao Wang and David J. Wu and Andrew Y. Ng (2011). "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning". ICDAR. pp. 440–445. |access-date= requires |url= (help)

External links

The "DataGenerator" a model-based synthetic data generator: http://finraos.github.io/DataGenerator/
The datgen synthetic data generator: http://www.datasetgenerator.com
Fienberg, S. E. (1994). "Conflicts between the needs for access to statistical information and demands for confidentiality," Journal of Official Statistics 10, 115–132.
Little, R (1993). "Statistical Analysis of Masked Data," Journal of Official Statistics, 9, 407-426.
Raghunathan, T.E., Reiter, J.P., and Rubin, D.B. (2003). "Multiple Imputation for Statistical Disclosure Limitation," Journal of Official Statistics, 19, 1-16.
Reiter, J.P. (2004). "Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation," Survey Methodology, 30, 235-242.

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

Statistics

Descriptive statistics

Continuous data

Center	Mean arithmetic geometric harmonic Median Mode

Dispersion	Variance Standard deviation Coefficient of variation Percentile Range Interquartile range

Shape	Moments Skewness Kurtosis L-moments

Count data

Index of dispersion

Summary tables

Dependence

Graphics

Data collection

Study design	Population Statistic Effect size Statistical power Sample size determination Missing data

Survey methodology	Sampling Standard error stratified cluster Opinion poll Questionnaire

Controlled experiments	Design control optimal Controlled trial Randomized Random assignment Replication Blocking Interaction Factorial experiment

Uncontrolled studies	Observational study Natural experiment Quasi-experiment

Statistical inference

Statistical theory

Frequentist inference

Point estimation	Estimating equations Maximum likelihood Method of moments M-estimator Minimum distance Unbiased estimators Mean-unbiased minimum-variance Rao–Blackwellization Lehmann–Scheffé theorem Median unbiased Plug-in

Interval estimation	Confidence interval Pivot Likelihood interval Prediction interval Tolerance interval Resampling Bootstrap Jackknife

Testing hypotheses	1- & 2-tails Power Uniformly most powerful test Permutation test Randomization test Multiple comparisons

Parametric tests	Likelihood-ratio Wald Score

Specific tests

Z (normal) Student's t-test F

Goodness of fit	Chi-squared Kolmogorov–Smirnov Anderson–Darling Normality (Shapiro–Wilk) Likelihood-ratio test Model selection Cross validation AIC BIC

Rank statistics	Sign Sample median Signed rank (Wilcoxon) Hodges–Lehmann estimator Rank sum (Mann–Whitney) Nonparametric anova 1-way (Kruskal–Wallis) 2-way (Friedman) Ordered alternative (Jonckheere–Terpstra)

Bayesian inference

Correlation	Pearson product–moment Partial correlation Confounding variable Coefficient of determination

Regression analysis	Errors and residuals Regression model validation Mixed effects models Simultaneous equations models Multivariate adaptive regression splines (MARS)

Linear regression	Simple linear regression Ordinary least squares General linear model Bayesian regression

Non-standard predictors	Nonlinear regression Nonparametric Semiparametric Isotonic Robust Heteroscedasticity Homoscedasticity

Generalized linear model	Exponential families Logistic (Bernoulli) / Binomial / Poisson regressions

Partition of variance	Analysis of variance (ANOVA, anova) Analysis of covariance Multivariate ANOVA Degrees of freedom

Categorical / Multivariate / Time-series / Survival analysis

Categorical

Multivariate

Time-series

General	Decomposition Trend Stationarity Seasonal adjustment Exponential smoothing Cointegration Structural break Granger causality

Specific tests	Dickey–Fuller Johansen Q-statistic (Ljung–Box) Durbin–Watson Breusch–Godfrey

Time domain	Autocorrelation (ACF) partial (PACF) Cross-correlation (XCF) ARMA model ARIMA model (Box–Jenkins) Autoregressive conditional heteroskedasticity (ARCH) Vector autoregression (VAR)

Frequency domain	Spectral density estimation Fourier analysis Wavelet

Survival

Survival function	Kaplan–Meier estimator (product limit) Proportional hazards models Accelerated failure time (AFT) model First hitting time

Hazard function	Nelson–Aalen estimator

Test	Log-rank test

Applications

Biostatistics	Bioinformatics Clinical trials / studies Epidemiology Medical statistics

Engineering statistics	Chemometrics Methods engineering Probabilistic design Process / quality control Reliability System identification

Social statistics	Actuarial science Census Crime statistics Demography Econometrics National accounts Official statistics Population statistics Psychometrics

Spatial statistics	Cartography Environmental statistics Geographic information system Geostatistics Kriging

Category
Portal
Commons
WikiProject

This article is issued from Wikipedia - version of the 10/13/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.