# Dataset Integration

This section covers the integration of the `mnist` dataset into Tensorleap. We'll later use this dataset with a classification model.

## Dataset Script

Below is the full dataset script to be used in the integration. More information about the structure of this script can be found under [**Dataset Script**](/tensorleap-integration/writing-integration-code.md).

```python
from typing import List

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType

# Preprocess Function
def preprocess_func() -> List[PreprocessResponse]:
    (data_X, data_Y), (test_X, test_Y) = mnist.load_data()

    data_X = np.expand_dims(data_X, axis=-1)  # Reshape (N, 28, 28) -> (N, 28, 28, 1)
    data_X = data_X / 255                     # Normalize to [0, 1]
    data_Y = to_categorical(data_Y)           # One-hot encode the labels

    test_X = np.expand_dims(test_X, axis=-1)  # Reshape (N, 28, 28) -> (N, 28, 28, 1)
    test_X = test_X / 255                     # Normalize to [0, 1]
    test_Y = to_categorical(test_Y)           # One-hot encode the labels

    train_X, val_X, train_Y, val_Y = train_test_split(data_X, data_Y, test_size=0.2, random_state=42)

    # Generate a PreprocessResponse for each data slice, to be read later by the encoders.
    # Each response carries the length of its slice along with the data dictionary.
    # In this example we pass `images` and `labels`, which are later encoded into the inputs and outputs.
    train = PreprocessResponse(length=len(train_X), data={'images': train_X, 'labels': train_Y})
    val = PreprocessResponse(length=len(val_X), data={'images': val_X, 'labels': val_Y})
    test = PreprocessResponse(length=len(test_X), data={'images': test_X, 'labels': test_Y})

    response = [train, val, test]
    return response

# The input encoder fetches the image at index `idx` from the `images` array set in
# the PreprocessResponse's data. Returns a numpy array containing the sample's image.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    return preprocess.data['images'][idx].astype('float32')

# The ground truth encoder fetches the label at index `idx` from the `labels` array set in
# the PreprocessResponse's data. Returns a numpy array containing the sample's one-hot label.
def gt_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    return preprocess.data['labels'][idx].astype('float32')

# Metadata functions add extra per-sample data for later use in analysis.
# This one returns the digit of each sample as an int (not a one-hot vector).
def metadata_label(idx: int, preprocess: PreprocessResponse) -> int:
    one_hot_digit = gt_encoder(idx, preprocess)
    digit = one_hot_digit.argmax()
    digit_int = int(digit)
    return digit_int

LABELS = ['0','1','2','3','4','5','6','7','8','9']
# Dataset binding functions to bind the functions above to the `Dataset Instance`.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='image')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.int, name='label')
leap_binder.add_prediction(name='classes', labels=LABELS)

```

{% hint style="info" %}
For more information, see [**Binding Functions**](/tensorleap-integration/writing-integration-code.md#binding-functions).
{% endhint %}
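To make the shapes and value ranges concrete, here is a minimal sketch of the same transformations applied to synthetic data. It is an illustration under stated assumptions, not part of the integration: the random array stands in for the MNIST images, the label values are made up, and `np.eye` is used as a stand-in for `to_categorical` so that the block depends on NumPy only.

```python
import numpy as np

# Synthetic stand-in for MNIST (assumption): 5 grayscale 28x28 images with uint8 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1, 3])  # made-up digits for illustration

# The same transformations as in preprocess_func, written with NumPy only.
images = np.expand_dims(images, axis=-1)        # (5, 28, 28) -> (5, 28, 28, 1)
images = images / 255                           # scale pixel values to [0, 1]
one_hot = np.eye(10, dtype="float32")[labels]   # stand-in for to_categorical

# What the input and ground truth encoders would return for sample 2.
sample_image = images[2].astype("float32")
sample_label = one_hot[2]

# What metadata_label does: recover the integer digit from the one-hot vector.
digit = int(sample_label.argmax())

print(sample_image.shape, sample_image.dtype)
print(sample_label.shape, digit)
```

Each sample reaches the model as a `float32` array of shape `(28, 28, 1)`, paired with a one-hot vector of length 10, while the metadata function reports the plain integer digit.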

## Add a Dataset Instance

{% tabs %}
{% tab title="UI" %}

#### Add a Dataset Instance Using the UI

To add a new **Dataset Instance**:

1. Navigate to [**Resources Management**](/user-interface/resources-management.md) and click the <img src="/files/ULjVAUDOEdoPgtwVT1FX" alt="" data-size="line"> button.
2. In the **Dataset Editor**, enter these properties:
   * Dataset Name: `mnist`
   * Script: copy and paste the **script** from the [**Dataset Script**](#dataset-script) above
3. Click **Save**.

![Add a New Dataset Instance](/files/gsaimhtukuYe0k7BWvf2)

After saving the `mnist` dataset, the platform will automatically parse the dataset script. This process evaluates the **script** and verifies that its functions work as expected, including that the data can be read successfully.

{% hint style="info" %}
Upon successful parsing, the details of the MNIST dataset will be displayed on the right. If parsing fails, the errors will be shown instead.
{% endhint %}
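If you would like to catch obvious problems before pasting the script, a rough local check along the following lines may help. This is only a sketch and does not reproduce what the platform does when parsing: `check_slice` is a hypothetical helper, and a `SimpleNamespace` object stands in for `PreprocessResponse` so that the block runs without `code_loader` installed. The two encoders are repeated from the dataset script to keep the block self-contained.

```python
from types import SimpleNamespace

import numpy as np


# Encoders repeated from the dataset script.
def input_encoder(idx, preprocess):
    return preprocess.data['images'][idx].astype('float32')


def gt_encoder(idx, preprocess):
    return preprocess.data['labels'][idx].astype('float32')


def check_slice(preprocess, n_samples=3):
    """Hypothetical helper: run both encoders on the first few samples
    and collect a message for every unexpected shape or label."""
    problems = []
    for idx in range(min(n_samples, preprocess.length)):
        image = input_encoder(idx, preprocess)
        label = gt_encoder(idx, preprocess)
        if image.shape != (28, 28, 1):
            problems.append(f"sample {idx}: unexpected image shape {image.shape}")
        if label.shape != (10,) or not np.isclose(label.sum(), 1.0):
            problems.append(f"sample {idx}: label is not a 10-class one-hot vector")
    return problems


# Stand-in for a PreprocessResponse (assumption): only `length` and `data` are used.
fake_slice = SimpleNamespace(
    length=4,
    data={
        'images': np.zeros((4, 28, 28, 1)),
        'labels': np.eye(10)[[1, 2, 3, 4]],
    },
)

problems = check_slice(fake_slice)
print(problems or "all checks passed")
```

To check the real data, you could pass each object returned by `preprocess_func()` to `check_slice` instead of `fake_slice`.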
{% endtab %}

{% tab title="CLI" %}

#### Initial CLI Setup

Verify that `leapcli` is installed. For more information, see [**Installing Leap CLI**](/getting-started/quickstart/quickstart-using-cli.md#installing-leap-cli).

#### Project Folder Setup

1. Create a folder for our `mnist` project.

   ```shell
   mkdir mnist
   cd mnist
   ```
2. Initialize the created folder and synchronize it with the Tensorleap platform. The command `leap init (PROJECT) (DATASET) (--h5/--onnx)` sets up the `.tensorleap` folder within the project folder and takes the following parameters:

   * PROJECT = MNIST (project name)
   * DATASET = mnist (dataset name)
   * (--h5/--onnx) = model format, `--h5` for **TensorFlow** (H5) and `--onnx` for **PyTorch** (ONNX)

   ```shell
   leap init MNIST mnist --h5
   ```
3. Next, provide your credentials to the `leap` CLI by running the following command:

   ```shell
   leap login [API_ID] [API_KEY] [ORIGIN]
   ```

The `API_ID`, `API_KEY` and `ORIGIN`, along with the full command, can be found by clicking the <img src="/files/XZK9nzEcOCPmY9kGPOJY" alt="" data-size="line"> button within the [**Resources Management**](/user-interface/resources-management.md) view.

#### Push Dataset

When using the CLI, the **Dataset Script** is defined within the `.tensorleap/dataset.py` file, and the **Dataset Instance** is created/updated upon performing `leap push`.

1. By default, the `.tensorleap/dataset.py` file contains a sample template. Let's replace it with our [**Dataset Script**](#dataset-script) above. One way to do it is with `cat`:

   ```
   rm .tensorleap/dataset.py
   cat > .tensorleap/dataset.py
     << paste the dataset script above + CTRL-D  >>
   ```
2. Let's test our dataset script using `leap check`:

   ```shell
   leap check --dataset
   ```
3. Next, we'll push our dataset to the Tensorleap platform using the following command:

   ```shell
   leap push --dataset
   ```

   It should print out:

   > `New dataset detected. Dataset name: mnist` \
   > `Push command successfully complete`

Congrats! You have successfully created the `mnist` **Dataset Instance** and integrated the [**Dataset Script**](#dataset-script). You can view it in the UI in the **Resources Management** view.
{% endtab %}
{% endtabs %}

## Up Next - Model Integration

The purpose of this section was to help you define a dataset script and create a dataset instance in Tensorleap.

Now that the `mnist` dataset has been integrated into Tensorleap, we can use it with a classification model, which we'll build in the next section.

When ready, move on to [**Model Integration**](/guides/full-guides/mnist-guide/model-integration.md).

