# Synthetic Data Optimization

### What is Synthetic Data Optimization and why it matters

Synthetic data is essential for scaling datasets and capturing rare or hard-to-observe scenarios, but in practice its impact is limited by a critical challenge: the domain gap between synthetic and real-world data. This gap is hard to measure directly, which makes it difficult to understand how synthetic samples actually influence model behavior or where they fall short.

As a result, improving synthetic data becomes a slow, trial-and-error process. Teams must manually tweak simulation parameters, regenerate data, and retrain models, all without clear, model-driven feedback to guide them. This lack of visibility turns synthetic data optimization into a labor-intensive and inefficient loop, where progress is hard to quantify and even harder to systematize.

### Generating data that is mathematically and semantically closer to reality

Tensorleap enables a structured, model-driven approach to optimizing synthetic data:

1. **Quantify the domain gap -** Tensorleap uses its latent space representation to measure the distance between synthetic and real data distributions.

<figure><img src="/files/7bf38DeGHIDaGRj660cH" alt=""><figcaption></figcaption></figure>

2. **Compare synthetic and real data -** By analyzing how both datasets are represented in latent space, Tensorleap provides a clear signal of where synthetic data diverges from the target population.
3. **Guide synthetic data generation -** Tensorleap suggests improved simulation configurations based on the model and real data samples, which eliminates manual trial and error. These recommendations help steer the next iteration of synthetic data generation toward better alignment.

<figure><img src="/files/edqXckwyKmNwrsoKIP1B" alt=""><figcaption></figcaption></figure>
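The first two steps above can be illustrated with a minimal sketch. It assumes latent embeddings for real and synthetic samples are already available as plain lists of floats. The function names (`centroid_gap`, `most_divergent`), the toy data, and the choice of Euclidean centroid distance and nearest-neighbor distance are illustrative assumptions, not Tensorleap's API or the metric it actually uses.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_gap(real, synthetic):
    """Coarse domain-gap score: distance between the two embedding centroids."""
    return math.dist(centroid(real), centroid(synthetic))


def nearest_real_distance(sample, real):
    """Distance from one synthetic embedding to its closest real embedding."""
    return min(math.dist(sample, r) for r in real)


def most_divergent(synthetic, real, k=3):
    """Indices of the k synthetic samples farthest from any real sample."""
    scored = [(nearest_real_distance(s, real), i) for i, s in enumerate(synthetic)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]


# Toy 2-D embeddings (hypothetical); real latent vectors are much higher-dimensional.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [[0.5, 0.5], [1.0, 1.0], [4.0, 4.0]]

print("centroid gap:", round(centroid_gap(real, synthetic), 4))
print("most divergent synthetic samples:", most_divergent(synthetic, real, k=2))
```

The centroid distance gives a single number to track across iterations, while the per-sample nearest-neighbor distances point at the specific synthetic samples that sit far from the real population.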

### Key Differences: Manual Approach vs. Tensorleap

| Process / Approach     | Manual Approach                                                | Tensorleap Approach                                                       |
| ---------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------- |
| Optimization Objective | Implicit and unclear, based on heuristics or visual inspection | Explicit and measurable via domain gap in latent space                    |
| Evaluation Process     | Requires repeated model training to assess data quality        | Direct evaluation by measuring alignment between synthetic and real data  |
| Iteration Process      | Trial-and-error tuning of simulation parameters                | Guided iterations using model-driven recommendations                      |
| Efficiency             | Slow, resource-intensive cycles                                | Faster convergence with fewer iterations                                  |
| Data Alignment         | Hard to verify similarity to real-world distribution           | Systematic alignment to target population through latent space comparison |
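
The contrast in the Evaluation Process row can be sketched as follows: rather than retraining a model for each simulation setting, candidate settings are ranked by a measured gap to the real data. The configuration names, the toy embeddings, and the `rank_configs` helper are hypothetical, and centroid distance stands in for whatever latent-space measure is actually used.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_gap(real, synthetic):
    """Distance between the real and synthetic embedding centroids."""
    return math.dist(centroid(real), centroid(synthetic))


def rank_configs(real, candidates):
    """Return (config_name, gap) pairs sorted from smallest to largest gap."""
    gaps = {name: centroid_gap(real, emb) for name, emb in candidates.items()}
    return sorted(gaps.items(), key=lambda item: item[1])


# Hypothetical embeddings of data generated under two simulation settings.
real = [[0.0, 0.0], [2.0, 2.0]]
candidates = {
    "fog_low": [[3.0, 3.0], [5.0, 5.0]],
    "fog_high": [[1.0, 1.0], [2.0, 2.0]],
}

for name, gap in rank_configs(real, candidates):
    print(f"{name}: gap={gap:.3f}")
```

This only ranks settings that have already been generated. Suggesting new configurations, as described above, requires more than this comparison.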


