Unlabeled Data

As a data scientist, one of the most important things you can do is label your data samples. This allows you to build models that are more accurate and can be applied to real-world data. However, with the vast amount of data out there, it can be tough to prioritize which samples to label.

Tensorleap constructs the model's most informative latent space and uses the features the model has learned to help you prioritize which samples to label efficiently.

Integration Script

The unlabeled_preprocessing_func (custom name) is a preprocess function that is called just once, before the data is read, similar to the Preprocess Function. It prepares the data for later use in input encoders.

from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse

# Preprocessing function: called once, before any sample is read
def unlabeled_preprocessing_func() -> PreprocessResponse:
    ...  # load the unlabeled samples into unlabeled_df
    return PreprocessResponse(length=len(unlabeled_df), data=unlabeled_df)

leap_binder.set_unlabeled_data_preprocess(function=unlabeled_preprocessing_func)

This function returns a single PreprocessResponse object.
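As an illustration of what the elided body of such a function might do, here is a minimal sketch that gathers unlabeled files from a directory into a list of records. The helper name `collect_unlabeled_samples`, the `sample_id` and `path` keys, the file suffixes, and the `data/unlabeled` directory are all hypothetical and not part of Tensorleap's API. It also assumes that the `data` field of `PreprocessResponse` accepts an arbitrary Python object, which this page does not confirm.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def collect_unlabeled_samples(
    root: str, suffixes: Tuple[str, ...] = (".png", ".jpg")
) -> List[Dict[str, str]]:
    """Return one record per matching file under `root`.

    Hypothetical helper: the record layout is an assumption made for
    illustration only. Paths are sorted so sample indices stay stable
    between runs.
    """
    records: List[Dict[str, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            records.append({"sample_id": path.stem, "path": str(path)})
    return records


# Possible use inside the preprocessing function (not executed here,
# because it needs the code_loader package and a real data directory):
#
#   samples = collect_unlabeled_samples("data/unlabeled")
#   return PreprocessResponse(length=len(samples), data=samples)
```

Keeping only identifiers and paths in the response, rather than loaded arrays, leaves the actual reading of each sample to the input encoders.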

Fetch Similar

To prioritize unlabeled data, choose a sample in the Population Exploration analysis that represents the cluster you are interested in, then request to fetch similar samples from the unlabeled data.

Once the Fetch Similar process has finished, a similarity map of the found samples is presented. You can set the color and size of the dots to similarity, which indicates the samples found to be most similar to the target sample.
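To build intuition for what fetching similar samples means, the sketch below ranks candidate latent vectors by their cosine similarity to a target vector. This is a conceptual illustration only: this page does not state which similarity measure Tensorleap uses, so cosine similarity is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    target: Sequence[float],
    candidates: List[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(target, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy latent vectors standing in for a labeled target and unlabeled samples.
target = [1.0, 0.0]
unlabeled = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranked = rank_by_similarity(target, unlabeled, top_k=2)
```

Here `ranked` lists the candidate at index 1 first, since it points in the same direction as the target. The similarity scores play the role of the values you would map to dot color and size.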
