
# Pruning

### What is Pruning and why it matters

Tensorleap's Dataset Pruning removes redundant training samples while preserving the diversity and balance of your dataset. Instead of training on every sample you've collected, you train on a smaller, carefully selected subset that covers the same ground — so models train faster and cheaper without losing accuracy.

Most real-world datasets are heavily redundant. Many samples look almost identical to many others, and adding more of them doesn't teach your model anything new. It just makes training longer, more expensive, and biased toward whatever happens to be over-represented.

Dataset Pruning helps you:

* Cut training cost and time - Train on 70–80% of your data and reach the same (or better) accuracy.
* Remove redundant data - Near-duplicates and look-alike samples are dropped first; rare and informative samples are kept.
* Balance class representation - Stratify by metadata (class, source, scenario, etc.) so under-represented groups are excluded from the pruning process.
* Surface the "marginal" samples - Each kept sample gets a priority score, so you can see which samples are core to coverage and which are on the edge of being dropped.

### Running dataset pruning from Tensorleap

Tensorleap analyzes your dataset distribution and rebalances it using your selected metadata tags. You can apply filters to focus pruning on a subset of the dataset, and optionally prioritize specific metadata dimensions to guide the pruning process.

1. Click on the DS Curation button.

<figure><img src="/files/LdZnLFNE286IXrJVpX82" alt=""><figcaption></figcaption></figure>

2. Choose whether Tensorleap determines the percentage of samples to prune automatically or you enter it manually. Automatic is the default; to specify the percentage yourself, uncheck the checkbox.

<figure><img src="/files/NoHBtqYlAvPzplopC8hy" alt=""><figcaption></figcaption></figure>

3. You can add dataset filters to exclude specific parts of the dataset from the pruning process.
4. You can add metadata tags to prioritize specific metadata dimensions to guide the pruning process.

### The Output

Once the pruning process finishes you get:

1. A CSV with one row per training sample, including:
   1. A pruning flag: 1 if the sample was pruned, 0 if it was kept
   2. A priority score for kept samples
2. A cluster filter for visualizing the kept vs. pruned split inside the population exploration view.
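As a minimal sketch of how the CSV might be consumed downstream, the snippet below splits sample IDs into kept and pruned using the 1/0 flag described above. The column names (`sample_id`, `pruned`, `priority`) are hypothetical and are not taken from the Tensorleap export; check the header of your own file and adjust the constants.

```python
import csv
import io

# Hypothetical column names; check the header of your exported CSV.
ID_COL = "sample_id"
PRUNED_COL = "pruned"    # 1 = pruned, 0 = kept
SCORE_COL = "priority"   # assumed to be empty for pruned samples


def split_kept_pruned(csv_file):
    """Return (kept_ids, pruned_ids) read from an open pruning CSV file."""
    kept, pruned = [], []
    for row in csv.DictReader(csv_file):
        if int(row[PRUNED_COL]) == 1:
            pruned.append(row[ID_COL])
        else:
            kept.append(row[ID_COL])
    return kept, pruned


# Tiny in-memory stand-in for the exported file (made-up rows).
example = io.StringIO(
    "sample_id,pruned,priority\n"
    "a,0,0.02\n"
    "b,1,\n"
    "c,0,0.71\n"
)
kept_ids, pruned_ids = split_kept_pruned(example)
print(f"kept {len(kept_ids)} of {len(kept_ids) + len(pruned_ids)} samples")
```

With a real export, replace the `io.StringIO` stand-in with `open(path, newline="")` and pass the resulting file object to `split_kept_pruned`.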

You can use the priority score to:

* Tune the percentage by inspecting how marginal the borderline samples look.
* Spot redundancy regions: clusters of low-priority kept samples indicate dense areas that could likely tolerate more pruning.
* Find unique samples: samples scoring near 0 are your dataset's "anchors". Removing them would leave coverage holes.
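One way to act on these points is to rank the kept samples by score and inspect both ends of the ranking. The sketch below is illustrative only: it assumes scores are non-negative and that scores farther from 0 mark the more marginal samples, which is an interpretation of the list above and not a documented guarantee. The function name and the tuple layout are hypothetical.

```python
def rank_kept_by_score(rows, descending=True):
    """Sort kept samples by priority score.

    rows: iterable of (sample_id, pruned_flag, score) tuples.
    descending=True lists the scores farthest from 0 first, which this
    sketch ASSUMES are the most marginal samples.
    """
    kept = [(sid, score) for sid, flag, score in rows if flag == 0]
    return sorted(kept, key=lambda item: item[1], reverse=descending)


# Made-up rows for illustration; pruned samples carry no score.
rows = [
    ("a", 0, 0.02),
    ("b", 1, None),
    ("c", 0, 0.71),
    ("d", 0, 0.40),
]
ranked = rank_kept_by_score(rows)
borderline = ranked[:2]   # inspect these before raising the pruning percentage
anchors = ranked[-1:]     # score nearest 0: treat as core to coverage
```

If your export turns out to use the opposite score direction, pass `descending=False` and swap which end of the list you treat as borderline.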

### Best Practices

* Re-prune after major data additions. Pruning is a snapshot of redundancy at one point in time; adding new data shifts what's redundant.
* Compare before/after metrics. Train one model on the full set and one on the pruned set. If accuracy holds, the pruning was safe and you've cut cost permanently.
* Combine with active learning. Prune to remove redundancy and use [active learning](/getting-value-from-tensorleap/active-learning.md) to decide what new data to label next. The two are complementary.
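The before/after comparison can be reduced to a small check. The sketch below is one possible way to frame it, not a Tensorleap feature: the function name, the `tolerance` threshold, and the numbers in the example call are all placeholders, not measured results.

```python
def compare_runs(full_acc, pruned_acc, full_size, pruned_size, tolerance=0.005):
    """Summarize a full-set vs. pruned-set training comparison.

    tolerance is the largest accuracy drop you are willing to accept;
    the default here is an arbitrary placeholder.
    """
    if full_size <= 0 or not 0 < pruned_size <= full_size:
        raise ValueError("expected 0 < pruned_size <= full_size")
    accuracy_delta = pruned_acc - full_acc
    return {
        "accuracy_delta": accuracy_delta,
        "data_reduction": 1 - pruned_size / full_size,
        "accuracy_holds": accuracy_delta >= -tolerance,
    }


# Placeholder numbers for illustration only.
result = compare_runs(
    full_acc=0.912, pruned_acc=0.910, full_size=100_000, pruned_size=75_000
)
print(result)
```

Evaluate both models on the same held-out set so the accuracy delta reflects the pruning and not a change in evaluation data.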

### Pruning Video Tutorial

{% embed url="<https://app.guidde.com/share/playbooks/3ZJuaitkgcnRYyEX1zS4Ym?mode=videoOnly&origin=k2buG3CvzZWUzfsWk7HPoOLDKpg2>" %}

