Dataset Parse Process
This page describes the dataset parse process.

Overview
Once an upload of a script integration to Tensorleap has been initiated, a dataset parse process starts.
This process:
Uploads the dataset code to the Tensorleap server.
(Optional) If the "Build Dynamic dependencies" flag is on in the settings page, creates a virtual environment for your data loading using the provided requirements.txt file.
Extracts the registered functions.
Parses and tests the validity of the data loading (a local sketch of these checks follows this list):
Initializes the preprocess function.
Loads the first sample from the preprocess response and retrieves its input, ground truth, and metadata values.
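The sketch below is a rough local approximation of these validity checks, not Tensorleap's actual implementation. The function names (preprocess_func, input_encoder, gt_encoder, metadata_func) are hypothetical placeholders for the functions registered in your integration script.

```python
# A minimal local sketch of the validity checks the dataset parse performs.
# preprocess_func, input_encoder, gt_encoder and metadata_func are hypothetical
# placeholders for the functions registered in your integration script.

def run_parse_checks(preprocess_func, input_encoder, gt_encoder, metadata_func):
    # Initialize the preprocess function and take its first response.
    preprocess_responses = preprocess_func()
    first_response = preprocess_responses[0]

    # Load the first sample and retrieve its input, ground truth and metadata.
    sample_input = input_encoder(0, first_response)
    sample_gt = gt_encoder(0, first_response)
    sample_metadata = metadata_func(0, first_response)

    # If any of these calls raises an exception, the dataset parse on the
    # server would fail at the equivalent step.
    return sample_input, sample_gt, sample_metadata
```

Running a check like this locally surfaces the same failures the server would hit, before any upload.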
Common Run Issues:
A dataset parse can fail due to:
When the Build Dynamic dependencies flag is on:
A missing requirements.txt file.
An invalid requirements.txt file. Make sure you can install this environment from the file locally with pip, and remove any redundant or conflicting dependencies (see the first sketch after this list).
A bug in the integration script's data flow. It is highly recommended to run an integration test before pushing code to the platform, for example by running the parse checks locally as in the sketch above.
Missing files/folders, or specific files that cannot be read:
Make sure the leap.yaml includes all of the files and folders required to parse your dataset. Uploading with the CLI prints all of the files included in the integration so they can be reviewed, and the code viewer within the platform allows browsing through all included files.
Make sure the server has access to every file the integration script tries to load, read, or access. This includes the dataset, configuration files, and any other assets (see the second sketch after this list). If you are using an on-prem installation, the leap server info command outputs the folders Tensorleap can access under the datasetvolumes attribute.
Out-of-memory (OOM) errors for data loaders that try to load memory-heavy objects. Increase the memory limit for the dataset parse job in the settings.
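The first sketch below is one way to reproduce the dynamic-dependencies build locally before uploading: it creates a clean virtual environment and pip-installs requirements.txt into it. The requirements.txt path and the temporary environment directory are assumptions; adjust them to your integration.

```python
# A hedged sketch: reproduce the "Build Dynamic dependencies" step locally by
# installing requirements.txt into a fresh virtual environment.
# The requirements.txt path and the .leap_req_check_venv directory are assumptions.
import subprocess
import sys
import venv
from pathlib import Path

def check_requirements(requirements_path: str = "requirements.txt") -> bool:
    req_file = Path(requirements_path)
    if not req_file.is_file():
        print(f"Missing {req_file} - the parse will fail when the flag is on.")
        return False

    # Create a clean, isolated environment so locally installed packages
    # do not mask missing or conflicting dependencies.
    env_dir = Path(".leap_req_check_venv")
    venv.create(env_dir, with_pip=True, clear=True)
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python_exe = env_dir / bin_dir / ("python.exe" if sys.platform == "win32" else "python")

    # If this install fails locally, it will also fail on the server.
    result = subprocess.run([str(python_exe), "-m", "pip", "install", "-r", str(req_file)])
    return result.returncode == 0

if __name__ == "__main__":
    ok = check_requirements()
    print("requirements.txt installs cleanly" if ok else "fix requirements.txt before uploading")
```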
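The second sketch checks, before uploading, that every file or folder your integration script reads actually exists and is readable. The example paths are hypothetical; replace them with the assets your script loads.

```python
# A hedged sketch: verify that every asset the integration script needs exists
# and is readable before uploading. The paths below are hypothetical examples.
from pathlib import Path

ASSET_PATHS = [
    "data/train_images",            # hypothetical dataset folder
    "config/dataset_config.json",   # hypothetical configuration file
]

def check_assets(paths=ASSET_PATHS) -> bool:
    all_ok = True
    for path in map(Path, paths):
        if not path.exists():
            print(f"MISSING: {path}")
            all_ok = False
        elif path.is_file():
            try:
                with path.open("rb") as handle:
                    handle.read(1)  # confirm the file can actually be read
            except OSError as err:
                print(f"UNREADABLE: {path} ({err})")
                all_ok = False
    return all_ok

if __name__ == "__main__":
    print("all assets reachable" if check_assets() else "fix missing/unreadable assets before uploading")
```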