Media Query Source: Part 35

CMSWire (US digital magazine)
Synthetic data & how it's used in workplace
Test data that reflects prod data properties
Entire

The responses I provided to a media outlet on February 9, 2022:

Media: What is Synthetic data and does it have a digital workplace use?

Gfesser: Synthetic data is a type of test data intended to reflect the statistical properties of real production data. This makes it different from more traditional types of test data, such as generated data. Generated data is typically output from a data creation process which, like the process for synthetic data, creates data from scratch, but generates each record or event in isolation rather than taking the entire created census of data into account.

For example, a generated integer field might be populated by a rule that randomly chooses a value between 1 and 100, without regard to the values of the same field in other records or events. As such, generated data fulfills the purpose of testing software processes to help ensure that they handle data values (or the absence of data values) correctly, but it does not fulfill the purpose of emulating realistic production scenarios.
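A minimal sketch of this kind of generated data might look like the following. The function name `generate_records` and the field names `id` and `value` are hypothetical, chosen only for illustration; the point is that each value is drawn on its own, with no knowledge of any other record.

```python
import random


def generate_records(n, low=1, high=100, seed=None):
    """Create n records whose integer field is drawn independently per record.

    Assumption for illustration: a uniform rule between low and high (inclusive).
    No record looks at any other record, so the overall distribution is whatever
    the rule happens to produce, not what production data looks like.
    """
    rng = random.Random(seed)
    return [{"id": i, "value": rng.randint(low, high)} for i in range(n)]


# Useful for checking that a process accepts every legal value in the range.
sample = generate_records(5, seed=42)
```

Data like this is enough to exercise validation and edge-case handling, which is the testing purpose described above.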

Synthetic data, however, takes the entire created census of data into account so that it reflects the statistical properties needed to emulate real scenarios. As in the generated data example above, a particular integer field might hold values from 1 to 100, but creation of synthetic data would also consider the distribution of these values across all of the created records or events. It would consider this distribution not just for a given field in isolation, but also in the context of other field values.
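To make the contrast concrete, here is a rough sketch of one simple way to create data that follows a distribution in the context of another field. It is an assumption-laden illustration rather than a description of any particular synthetic data tool: the helper names `fit_profile` and `synthesize`, the fields `tier` and `score`, and the choice of a clamped normal distribution per group are all hypothetical.

```python
import random
import statistics
from collections import defaultdict


def fit_profile(records, group_field, value_field):
    """Summarize how value_field is distributed within each group_field value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_field]].append(rec[value_field])
    total = len(records)
    return {
        group: {
            "weight": len(values) / total,        # share of records in this group
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values),   # population standard deviation
        }
        for group, values in groups.items()
    }


def synthesize(profile, n, group_field, value_field, low=1, high=100, seed=None):
    """Create n records that follow the fitted group shares and per-group spread."""
    rng = random.Random(seed)
    groups = list(profile)
    weights = [profile[g]["weight"] for g in groups]
    out = []
    for _ in range(n):
        group = rng.choices(groups, weights=weights)[0]
        raw = rng.gauss(profile[group]["mean"], profile[group]["stdev"])
        value = min(high, max(low, round(raw)))   # keep values inside 1..100
        out.append({group_field: group, value_field: value})
    return out


# Hypothetical stand-in for a small sample of real production records.
production = (
    [{"tier": "basic", "score": s} for s in (15, 20, 25, 18, 22, 30)]
    + [{"tier": "premium", "score": s} for s in (75, 85)]
)
profile = fit_profile(production, "tier", "score")
synthetic = synthesize(profile, 2000, "tier", "score", seed=7)
```

Every value still falls between 1 and 100, but the mix of tiers and the typical score within each tier now track the sample the profile was fitted on, which independent per-record generation would not do.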

Test data can be scarce, especially for the purposes of building machine learning (ML) models. Some data patterns might also prove to be especially rare in real production data (as is the case with anomalies), so synthetic data can be used to emphasize such scenarios. All of this said, keep in mind that while large data sets have typically been needed for useful results, researchers and vendors in recent years have been working to develop approaches which reduce this dependency on large data sets.
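As a rough illustration of emphasizing rare scenarios, the sketch below builds a test set in which anomalies make up a chosen share of the records. It simply resamples a few known anomaly examples with replacement, which is a much cruder technique than true synthesis, and the function name `emphasize_anomalies`, the `label` field, and the sample records are hypothetical.

```python
import random


def emphasize_anomalies(normal, anomalies, n, anomaly_share, seed=None):
    """Build n records in which anomaly_share of them come from the rare examples."""
    rng = random.Random(seed)
    n_anomalies = round(n * anomaly_share)
    # Copy each picked record and tag it so downstream tests can tell them apart.
    picked = [dict(rng.choice(anomalies), label="anomaly") for _ in range(n_anomalies)]
    picked += [dict(rng.choice(normal), label="normal") for _ in range(n - n_anomalies)]
    rng.shuffle(picked)
    return picked


# Hypothetical examples: anomalies are rare in production, but 20% of the test set here.
normal_examples = [{"amount": a} for a in (12, 18, 25, 31, 40)]
anomaly_examples = [{"amount": 9500}]
test_set = emphasize_anomalies(normal_examples, anomaly_examples, 100, 0.2, seed=3)
```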


See all of my responses to media queries here.
