Pruning
What is Pruning and why it matters
Tensorleap's Dataset Pruning removes redundant training samples while preserving the diversity and balance of your dataset. Instead of training on every sample you've collected, you train on a smaller, carefully selected subset that covers the same ground — so models train faster and cheaper without losing accuracy.
Most real-world datasets are heavily redundant. Many samples look almost identical to many others, and adding more of them doesn't teach your model anything new; it just makes training longer, more expensive, and biased toward whatever happens to be over-represented.
Dataset Pruning helps you:
Cut training cost and time - Train on 70–80% of your data and reach the same (or better) accuracy.
Remove redundant data - Near-duplicates and look-alike samples are dropped first; rare and informative samples are kept.
Balance class representation - Stratify by metadata (class, source, scenario, etc.) so under-represented groups are protected from pruning.
Surface the "marginal" samples - Each kept sample gets a priority score, so you can see which samples are core to coverage and which are on the edge of being dropped.
Running dataset pruning from Tensorleap
Tensorleap analyzes your dataset distribution and rebalances it using your selected metadata tags. You can apply filters to focus on a subset of the data, and optionally prioritize specific metadata dimensions to guide the pruning process.
Click on the DS Curation button

You can choose between allowing Tensorleap to automatically determine the percentage of samples to prune or manually entering the percentage yourself. By default, Tensorleap automatically determines the pruning percentage. If you prefer to specify it manually, simply uncheck the checkbox.

You can add dataset filters to exclude specific parts of the dataset from the pruning process.
You can add metadata tags to prioritize specific metadata dimensions and guide the pruning process.
The Output
Once the pruning process finishes you get:
A CSV with one row per training sample, including:
A flag set to 1 if the sample was pruned and 0 if it was kept
A priority score for kept samples
A cluster filter for visualizing the kept vs. pruned split inside the population exploration view.
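As a minimal sketch of working with this output, the CSV can be loaded and split into kept and pruned samples. The column names `sample_id`, `pruned`, and `priority` are assumptions made for illustration, as is the sample data; check the header of your exported file for the actual names.

```python
import csv
import io

# Hypothetical column names; the real export may use different headers.
PRUNED_COL = "pruned"
PRIORITY_COL = "priority"


def split_pruning_csv(handle):
    """Return (kept_rows, pruned_rows) read from an open pruning CSV."""
    kept, pruned = [], []
    for row in csv.DictReader(handle):
        # 1 marks a pruned sample, 0 a kept one.
        if row[PRUNED_COL].strip() == "1":
            pruned.append(row)
        else:
            kept.append(row)
    return kept, pruned


# Tiny in-memory stand-in for the exported file (made-up values).
example_csv = io.StringIO(
    "sample_id,pruned,priority\n"
    "a,0,0.05\n"
    "b,1,\n"
    "c,0,0.80\n"
    "d,1,\n"
)
kept, pruned = split_pruning_csv(example_csv)
pruned_fraction = len(pruned) / (len(kept) + len(pruned))
print(f"kept={len(kept)} pruned={len(pruned)} fraction pruned={pruned_fraction:.2f}")
```

For a real export, replace the `io.StringIO` stand-in with `open(path, newline="")`, where `path` points at the downloaded file.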
You can use the priority score to:
Tune the percentage by inspecting how marginal the borderline samples look.
Spot redundancy regions: clusters of low-priority kept samples indicate dense areas that could likely tolerate more pruning.
Find unique samples: samples scoring near 0 are your dataset's "anchors"; removing them would leave coverage holes.
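These inspections can be sketched as a simple sort over the kept samples. The sketch assumes that scores near 0 mark anchors, as stated above, and additionally assumes that scores further from 0 mark more marginal samples; verify that direction against your own export. The field names and score values are hypothetical.

```python
def rank_kept_samples(rows, score_key="priority"):
    """Sort kept samples from scores nearest 0 to scores furthest from 0."""
    return sorted(rows, key=lambda row: abs(float(row[score_key])))


# Hypothetical kept samples with made-up scores.
kept_rows = [
    {"sample_id": "a", "priority": "0.05"},
    {"sample_id": "c", "priority": "0.80"},
    {"sample_id": "e", "priority": "0.40"},
    {"sample_id": "f", "priority": "0.01"},
]

ranked = rank_kept_samples(kept_rows)
anchors = ranked[:2]       # scores nearest 0: samples to review as "anchors"
borderline = ranked[-2:]   # scores furthest from 0: inspect before pruning more
print([row["sample_id"] for row in anchors])
print([row["sample_id"] for row in borderline])
```

Looking at the samples at both ends of the ranking is a quick way to judge whether the chosen pruning percentage is too aggressive or too cautious.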
Best Practices
Re-prune after major data additions. Pruning is a snapshot of redundancy at one point in time; adding new data shifts what's redundant.
Compare before/after metrics. Train one model on the full set and one on the pruned set. If accuracy holds, the pruning was safe and you've cut cost permanently.
Combine with active learning. Prune to remove redundancy and use active learning to decide what new data to label next. The two are complementary.
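The before/after comparison can be reduced to a small check once both models have been evaluated on the same held-out set. The sketch below is illustrative only: the function name, the tolerance, and the accuracy values are assumptions, not Tensorleap outputs.

```python
def pruning_was_safe(full_accuracy, pruned_accuracy, tolerance=0.005):
    """True if the pruned-set model is within `tolerance` of the full-set model."""
    return pruned_accuracy >= full_accuracy - tolerance


# Made-up accuracies measured on the same held-out evaluation set.
small_drop = pruning_was_safe(full_accuracy=0.912, pruned_accuracy=0.910)
large_drop = pruning_was_safe(full_accuracy=0.912, pruned_accuracy=0.880)
print(small_drop, large_drop)
```

Choose the tolerance to match the run-to-run variation you normally see between two training runs on the same data, so that ordinary noise is not mistaken for a regression.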
Pruning Video Tutorial

