Not All Synthetic Data is the Same: A Framework for Generating Realistic Data

4 min read · Sep 27, 2024

A common misconception about synthetic data is that it’s all created equal. In reality, generating synthetic data for complex, nuanced use cases, such as healthcare prescription data, can be far more challenging than building a dataset for weather simulations. The goal of synthetic data isn’t just to simulate but to closely approximate real-world scenarios. Perfect replication is impossible, but high-quality synthetic data should mirror the complexity and diversity of real-world data as closely as possible.

This blog explores the challenges of generating synthetic data, different approaches, and a framework for selecting the right method for your specific use case.

All Data is Not Created Equal: Why Healthcare Prescription Data is Harder Than Weather Data

Consider generating synthetic data for healthcare prescription records. A small error — such as an incorrect dosage for a patient’s age or condition — could have catastrophic consequences when models trained on this data are used in real-world applications. On the other hand, generating weather data, while complex, doesn’t involve the same level of interdependencies and risk.

Healthcare data is particularly intricate due to its multiple interdependencies, such as between dosage and age or diagnosis and treatment. Failing to capture these relationships in synthetic data can undermine the reliability of predictive models and analytics.
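One way to catch broken relationships of this kind is to validate synthetic records against explicit rules. The sketch below is illustrative only: the age bands, the dose limits, and the names `Prescription` and `is_consistent` are assumptions made for this example, not clinical guidance.

```python
from dataclasses import dataclass

# Hypothetical daily dose limits (mg) per age band. Illustrative numbers only.
MAX_DAILY_DOSE_MG = {
    "child": 500,    # age < 12
    "adult": 4000,   # 12 <= age < 65
    "senior": 3000,  # age >= 65
}


@dataclass
class Prescription:
    patient_age: int
    daily_dose_mg: float


def age_band(age: int) -> str:
    """Map an age in years to one of the hypothetical bands above."""
    if age < 12:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def is_consistent(rx: Prescription) -> bool:
    """Return True if the dose is positive and within the limit for the age band."""
    return 0 < rx.daily_dose_mg <= MAX_DAILY_DOSE_MG[age_band(rx.patient_age)]


# Three synthetic records; the second gives an adult-sized dose to a child.
records = [Prescription(8, 250), Prescription(8, 2000), Prescription(40, 2000)]
violations = [rx for rx in records if not is_consistent(rx)]
```

A generator that ignores the age and dosage dependency would produce records like the second one, and a check of this shape is one way to flag them before they reach a model.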

Understanding the Complexity and Richness of Data

The main challenge in synthetic data generation lies in accounting for the complexity and richness of the original data. The more nuanced the data, the harder it is to create high-quality synthetic versions. Complex data often has dependencies, relationships, and outliers that must be preserved.

High Complexity Data: Healthcare prescription records, financial transaction logs, and industrial IoT sensor readings.
Low Complexity Data: Simulated weather patterns, simple time-series data, or retail stock levels.

Approaches to Generating Synthetic Data

Real Synthetic Data: An Approach to Masked Data

Using actual production data for testing is ideal but carries privacy, compliance, and data-leak risks. An alternative, Real Synthetic Data, replaces Personally Identifiable Information (PII), Protected Health Information (PHI), and other sensitive data with synthetic equivalents while preserving the original structure and complexity of the dataset.

What is Real Synthetic Data?

Real Synthetic Data uses production data as a base but replaces sensitive fields with synthetic values that mimic the original, retaining important characteristics like format, range, and structure. For instance, phone numbers are replaced with valid alternatives that keep their area codes, which maintains geographic consistency while anonymizing the rest of the number.

Best for: Use cases with a rich history of production data and complex interdependencies, like healthcare, financial transactions, and legal records. This approach is often the best option where consistency and accuracy are critical.
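The phone number example can be sketched in a few lines. This is a minimal illustration, not a description of how any particular masking product works. It assumes 10-digit US-style numbers, and `mask_phone` and `SECRET_KEY` are hypothetical names. A keyed hash makes the mapping repeatable, so the same input always yields the same masked value.

```python
import hashlib
import hmac

# Hypothetical key for the demo; a real system would manage keys securely.
SECRET_KEY = b"demo-key-change-me"


def mask_phone(phone: str, key: bytes = SECRET_KEY) -> str:
    """Replace the last seven digits of a 10-digit number, keeping its format.

    The area code and any punctuation are preserved. The replacement digits
    are derived from a keyed hash of the original digits, so masking is
    deterministic for a given key.
    """
    digits = [c for c in phone if c.isdigit()]
    if len(digits) != 10:
        raise ValueError("expected a 10-digit phone number")

    digest = hmac.new(key, "".join(digits).encode(), hashlib.sha256).hexdigest()
    n = int(digest, 16)
    # First replacement digit is 2-9 so the exchange code stays plausible.
    first = 2 + n % 8
    rest = (n // 8) % 10**6
    new_digits = iter(digits[:3] + list(f"{first}{rest:06d}"))

    # Walk the original string, swapping digits and keeping separators.
    return "".join(next(new_digits) if c.isdigit() else c for c in phone)


masked = mask_phone("415-555-0134")
```

Because the output keeps the original length, separators, and area code, downstream format checks and region-based logic should continue to behave as they did on the production value.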
Simulated Synthetic Data

Best for: Situations where there is little or no historical production data or the data is relatively straightforward.

This method uses code or machine learning models to generate synthetic data resembling real-world data. Although scalable, it often struggles to capture edge cases or nuances.

Example: Generating synthetic customer transaction data for an online retailer, where transactions generally follow simpler rules (price, quantity, discount), making fully synthetic data sufficient.
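A code-based generator for the retail example might look like the sketch below. The product catalog, the discount tiers, and the function name `simulate_transactions` are all assumptions made for illustration. A seeded random generator keeps runs reproducible.

```python
import random


def simulate_transactions(n: int, seed: int = 0) -> list[dict]:
    """Generate n simple retail transactions from price, quantity, and discount."""
    rng = random.Random(seed)
    # Hypothetical catalog: product name -> unit price.
    catalog = {"notebook": 4.50, "backpack": 39.00, "pen": 1.25}
    rows = []
    for i in range(n):
        product, price = rng.choice(list(catalog.items()))
        quantity = rng.randint(1, 5)
        discount = rng.choice([0.0, 0.05, 0.10])
        rows.append({
            "transaction_id": i + 1,
            "product": product,
            "unit_price": price,
            "quantity": quantity,
            "discount": discount,
            # The total follows directly from the three simple rules.
            "total": round(price * quantity * (1 - discount), 2),
        })
    return rows


transactions = simulate_transactions(100, seed=1)
```

Every value here comes from a handful of independent rules, which is why this style scales easily and also why it tends to miss the edge cases found in real production data.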
Hybrid Approach

Best for: Cases where some fields require strong consistency while others can afford more flexibility.

This approach blends masked production data with synthetic data. For example, in a telecom dataset, masked production data might be used for customer identity fields (e.g., name, phone number), while call logs could be simulated to generate variations in call durations and outcomes.
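The telecom example could be sketched as follows. Everything here is hypothetical: the record layout, the `plan` field, and the helper names are invented for illustration. The plain hash in `mask_name` is only a stand-in for a real masking step and is not strong anonymization on its own.

```python
import hashlib
import random


def mask_name(name: str) -> str:
    """Stand-in masking step: the same name always maps to the same pseudonym."""
    token = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    return f"Customer-{token}"


def simulate_calls(rng: random.Random, n: int) -> list[dict]:
    """Simulate n call log entries with varied outcomes and durations."""
    calls = []
    for _ in range(n):
        outcome = rng.choice(["completed", "missed", "voicemail"])
        # Keep one simple dependency: a missed call has no duration.
        duration = 0 if outcome == "missed" else rng.randint(5, 1800)
        calls.append({"outcome": outcome, "duration_sec": duration})
    return calls


def build_hybrid(customers: list[dict], calls_per_customer: int = 3, seed: int = 0) -> list[dict]:
    """Combine masked identity fields with simulated call logs."""
    rng = random.Random(seed)
    return [
        {
            "name": mask_name(c["name"]),   # masked production field
            "plan": c["plan"],              # non-sensitive field kept as-is
            "calls": simulate_calls(rng, calls_per_customer),  # simulated
        }
        for c in customers
    ]


production = [{"name": "Alice Example", "plan": "prepaid"}]
hybrid = build_hybrid(production, calls_per_customer=3, seed=7)
```

The identity side stays tied to real records, so joins and lookups remain consistent, while the call logs can be generated in whatever volume a test needs.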

Choosing the Right Approach

Several factors influence the choice of synthetic data generation method:

History of Production Data: If you have a long history of reliable production data, Real Synthetic Data is typically the best choice, as it retains consistency and relationships.
Data Complexity: For industries with intricate data relationships, like healthcare or finance, masked production data is usually the only viable option. For simpler data, fully synthetic approaches work well.
Consistency Requirements: If your dataset has complex consistency needs (like valid dosage levels in healthcare), production-based data is preferable. Simulated data may be sufficient where consistency is less critical (like in weather simulations).
Data Volume: For large-scale systems requiring massive datasets, Simulated Synthetic Data or a Hybrid Approach offers scalability while balancing complexity.
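These factors can be read as a rough decision rule. The sketch below is one possible interpretation, not a definitive procedure: the function name, the boolean inputs, and the order in which the factors are checked are all assumptions.

```python
from enum import Enum


class Approach(Enum):
    REAL_SYNTHETIC = "real synthetic (masked production data)"
    SIMULATED = "simulated synthetic data"
    HYBRID = "hybrid"


def suggest_approach(
    has_production_history: bool,
    complex_relationships: bool,
    strict_consistency: bool,
    needs_massive_volume: bool,
) -> Approach:
    """Suggest a starting point based on the four factors listed above."""
    # Without production data there is nothing to mask.
    if not has_production_history:
        return Approach.SIMULATED
    if complex_relationships or strict_consistency:
        # Masked data preserves relationships; add simulation only for scale.
        return Approach.HYBRID if needs_massive_volume else Approach.REAL_SYNTHETIC
    # Simple data: simulation scales well, otherwise masked data is a safe default.
    return Approach.SIMULATED if needs_massive_volume else Approach.REAL_SYNTHETIC


healthcare = suggest_approach(True, True, True, False)
new_product = suggest_approach(False, False, False, True)
```

Real projects rarely reduce to four booleans, so a rule like this is best treated as a first pass for discussion rather than a final answer.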

When generating synthetic data, a one-size-fits-all approach rarely works. The best method depends on your data’s complexity, the availability of historical data, and the need for consistency and scalability.

Protecto (www.protecto.ai) is not a synthetic data company. We specialize in masking your production data while retaining its type, format, length, and consistency. Protecto can be used to generate accurate test data. For your use case, accurate real synthetic data may be the better option.


Written by Amar Kanagaraj
