# Advanced Metrics

As the model generalizes various characteristics in the data, samples with similar metadata tend to cluster together in the similarity map. We can also use additional sample metadata to identify correlations between data characteristics and the model's performance.

In this section, we'll add custom metadata to our dataset and inspect such correlations using the Metrics Dashboard.

## Add Custom Metadata

As an example, we'll add the **Euclidean Distance from Class Centroid** metadata.

First, the preprocessing function `preprocess_func` must calculate the average image for each class and store it in the `leap_binder` cache container. Then, the metadata function calculates the sample's Euclidean distance from its class average. This metric helps us analyze the model's performance on samples that differ noticeably from the class average.
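To make the metric concrete, here is a minimal NumPy sketch on a toy dataset. The tiny 2x2 "images", the integer labels, and the helper name `distance_from_centroid` are made up for illustration; they are not part of the Tensorleap script.

```python
import numpy as np

# Toy stand-in for MNIST: four 2x2 "images" with integer class labels.
# All values are invented for illustration only.
images = np.array([
    [[0.0, 0.0], [0.0, 0.0]],   # class 0
    [[0.0, 0.0], [0.0, 0.0]],   # class 0
    [[0.0, 3.0], [0.0, 0.0]],   # class 0, differs from the other two
    [[1.0, 1.0], [1.0, 1.0]],   # class 1 (single sample)
])
labels = np.array([0, 0, 0, 1])

# Class centroid = pixel-wise mean over all images of that class.
centroids = {int(c): images[labels == c].mean(axis=0) for c in np.unique(labels)}

def distance_from_centroid(idx: int) -> float:
    """Euclidean (L2) distance between sample `idx` and its class centroid."""
    diff = images[idx] - centroids[int(labels[idx])]
    return float(np.linalg.norm(diff))

distances = [distance_from_centroid(i) for i in range(len(images))]
print(distances)
```

The third sample is the farthest from the class 0 centroid, which is the kind of sample this metadata is meant to surface.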

### Dataset Script

In the [**Resources Management**](/user-interface/resources-management.md) view, click the `mnist` dataset and add the code below to its script. Note that the centroid computation is added to the end of our preprocessing function `preprocess_func()`.

**Code snippet**

```python
def calc_classes_centroid(preprocess: PreprocessResponse) -> dict:
    # Compute the pixel-wise average image of each class.
    # Returns a dictionary: key = class label (str), value = average image (28x28x1).
    avg_images_dict = {}
    data_X = preprocess.data['images']
    data_Y = preprocess.data['labels']
    for label in LABELS:
        inputs_label = data_X[np.equal(np.argmax(data_Y, axis=1), int(label))]
        avg_images_dict[label] = np.mean(inputs_label, axis=0)
    return avg_images_dict


def preprocess_func() -> List[PreprocessResponse]:
    ...
    leap_binder.cache_container["classes_avg_images"] = calc_classes_centroid(train)
    response = [train, val, test]
    return response


def metadata_euclidean_distance_from_class_centroid(idx: int, preprocess: PreprocessResponse) -> float:
    # Calculate the Euclidean distance from the average image of the sample's class.
    sample_input = preprocess.data['images'][idx]
    label = preprocess.data['labels'][idx]
    label = str(np.argmax(label))
    class_average_image = leap_binder.cache_container["classes_avg_images"][label]
    return float(np.linalg.norm(class_average_image - sample_input))


leap_binder.set_metadata(function=metadata_euclidean_distance_from_class_centroid, metadata_type=DatasetMetadataType.float, name='euclidean_diff_from_class_centroid')
```

For convenience, you can find the full script below:

<details>

<summary>Full Script (expandable)</summary>

```python
from typing import List

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType, Metric

LABELS = ['0','1','2','3','4','5','6','7','8','9']

# preprocessing func
def preprocess_func() -> List[PreprocessResponse]:
    (data_X, data_Y), (test_X, test_Y) = mnist.load_data()

    data_X = np.expand_dims(data_X, axis=-1)  # Reshape :,28,28 -> :,28,28,1
    data_X = data_X / 255                     # Normalize to [0,1]
    data_Y = to_categorical(data_Y)           # Hot Vector

    test_X = np.expand_dims(test_X, axis=-1)  # Reshape :,28,28 -> :,28,28,1
    test_X = test_X / 255                     # Normalize to [0,1]
    test_Y = to_categorical(test_Y)           # Hot Vector

    train_X, val_X, train_Y, val_Y = train_test_split(data_X, data_Y, test_size=0.2, random_state=42)

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data dictionary.
    # In this example we pass `images` and `labels` that later are encoded into the inputs and outputs 
    train = PreprocessResponse(length=len(train_X), data={'images': train_X, 'labels': train_Y})
    val = PreprocessResponse(length=len(val_X), data={'images': val_X, 'labels': val_Y})
    test = PreprocessResponse(length=len(test_X), data={'images': test_X, 'labels': test_Y})

    leap_binder.cache_container["classes_avg_images"] = calc_classes_centroid(train)

    response = [train, val, test]
    return response

# Input encoder fetches the image with the index `idx` from the `images` array set in
# the PreprocessResponse's data. Returns a numpy array containing the sample's image. 
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    return preprocess.data['images'][idx].astype('float32')


# Ground truth encoder fetches the label with the index `idx` from the `labels` array set in
# the PreprocessResponse's data. Returns a numpy array containing a hot vector label correlated with the sample.
def gt_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    return preprocess.data['labels'][idx].astype('float32')


# Metadata functions add extra data for later use in analysis.
# This metadata adds the int digit of each sample (not a hot vector).
def metadata_label(idx: int, preprocess: PreprocessResponse) -> int:
    one_hot_digit = gt_encoder(idx, preprocess)
    digit = one_hot_digit.argmax()
    digit_int = int(digit)
    return digit_int


def metadata_euclidean_distance_from_class_centroid(idx: int, preprocess: PreprocessResponse) -> float:
    # Calculate the Euclidean distance from the average image of the sample's class.
    sample_input = preprocess.data['images'][idx]
    label = preprocess.data['labels'][idx]
    label = str(np.argmax(label))
    class_average_image = leap_binder.cache_container["classes_avg_images"][label]
    return float(np.linalg.norm(class_average_image - sample_input))

def calc_classes_centroid(preprocess: PreprocessResponse) -> dict:
    # Compute the pixel-wise average image of each class.
    # Returns a dictionary: key = class label (str), value = average image (28x28x1).
    avg_images_dict = {}
    data_X = preprocess.data['images']
    data_Y = preprocess.data['labels']
    for label in LABELS:
        inputs_label = data_X[np.equal(np.argmax(data_Y, axis=1), int(label))]
        avg_images_dict[label] = np.mean(inputs_label, axis=0)
    return avg_images_dict

# Dataset binding functions to bind the functions above to the Dataset Instance.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='image')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.int, name='label')
leap_binder.set_metadata(function=metadata_euclidean_distance_from_class_centroid, metadata_type=DatasetMetadataType.float, name='euclidean_diff_from_class_centroid')
leap_binder.add_prediction(name='prediction', labels=LABELS)
```

</details>
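The two new functions can be sanity-checked locally without the platform. The sketch below repeats their logic on random toy data and makes two assumptions: a plain dict stands in for `leap_binder.cache_container`, and a `SimpleNamespace` stands in for `PreprocessResponse` (only its `.data` attribute is used). Neither stand-in is part of the `code_loader` API.

```python
from types import SimpleNamespace

import numpy as np

LABELS = ['0', '1', '2']  # shortened label list for the toy check

# Hypothetical stand-in for leap_binder.cache_container (a plain dict here).
cache_container = {}

def calc_classes_centroid(preprocess) -> dict:
    # Pixel-wise average image per class, keyed by the label string.
    data_X = preprocess.data['images']
    data_Y = preprocess.data['labels']
    class_ids = np.argmax(data_Y, axis=1)
    return {label: np.mean(data_X[class_ids == int(label)], axis=0) for label in LABELS}

def metadata_euclidean_distance_from_class_centroid(idx: int, preprocess) -> float:
    sample_input = preprocess.data['images'][idx]
    label = str(np.argmax(preprocess.data['labels'][idx]))
    class_average_image = cache_container["classes_avg_images"][label]
    return float(np.linalg.norm(class_average_image - sample_input))

# Random toy data shaped like the tutorial's arrays: N x 28 x 28 x 1 images, one-hot labels.
rng = np.random.default_rng(0)
class_ids = np.repeat(np.arange(len(LABELS)), 4)   # 4 samples per class
images = rng.random((len(class_ids), 28, 28, 1))
one_hot = np.eye(len(LABELS))[class_ids]

# Hypothetical stand-in for PreprocessResponse: only `.data` and `.length` are used here.
train = SimpleNamespace(length=len(images), data={'images': images, 'labels': one_hot})

cache_container["classes_avg_images"] = calc_classes_centroid(train)
distances = [
    metadata_euclidean_distance_from_class_centroid(i, train)
    for i in range(train.length)
]
print(len(distances), min(distances), max(distances))
```

Note that a class with no samples in the training slice would produce a `nan` centroid from `np.mean`, so it is worth confirming that every label in `LABELS` appears at least once.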

Once you add the code to the script, click <img src="/files/mfsLBdUu3rxdhDhnfHZ1" alt="" data-size="line"> to save the **Dataset Instance**.

### Dataset Block

After updating and saving the script, our dataset block needs to be updated. To do so, follow these steps:

1. Open the `MNIST` project.
2. From the **Versions** view, position your cursor over the model revision and click <img src="/files/uG4lQceLsLKBgiqjLMTA" alt="" data-size="line"> to **Open Commit**.
3. On the **Dataset Block** in the **Network** view, click the **Update** button. More info at [**Script Version**](/user-interface/project/network/network-mapping/create-a-mapping-deprecated/input-node.md#script-version).
4. To save the version with the updated dataset block, click the <img src="/files/KufD8iWbuVi1Smd314P4" alt="" data-size="line"> button and set the `Revision Name` to `cnn-extra`. More info at [**Versions**](/user-interface/project/versions.md).
5. To train the updated model, click <img src="/files/EjRb7Kqxt5ZYUOF3Xy3L" alt="" data-size="line"> from the top bar. Let's set the `Number of Epochs` to `10` and click <img src="/files/Is9sXxG5yWqlFXZb0k6P" alt="" data-size="line">. More info at [**Evaluate/Train Model**](/user-interface/project/menu-bar/evaluate-a-model.md).
6. Under the `cnn-extra` revision on the [**Versions**](/user-interface/project/versions.md) view, click <img src="/files/Y6gaIy1hktwIqCf4NHEy" alt="" data-size="line"> to display the new version's metrics on the dashboard.

### Add Custom Dashlets

In this section, you will add custom **Dashlets** with the added metadata.

Open the `mnist` [**Dashboard**](/user-interface/dashboards/dashlets/metrics-dashboard.md) that was created in the [**Model Integration**](/guides/full-guides/mnist-guide/model-integration.md#metrics) step and follow the steps below.

#### Loss by Sample

1. To add a dashlet, click <img src="/files/SnUBPK2c5ltW0v4DMJAT" alt="" data-size="line"> at the top right.
2. Choose the **Table** type **Dashlet** by clicking <img src="/files/dnFP0fjm2QtwstXKvHOC" alt="" data-size="line"> on the left side of the **Dashlet**.
3. Set the **Dashlet Name** to `Sample Loss`.
4. Under **Metrics** add a field and set `metrics.loss` with `average` aggregation.
5. Under **Metadata** add these fields:
   * `sample_identity.index`
   * `dataset_slice.keyword`
6. Close the dashlet options panel to fully view the table.

#### Centroid Distance vs Loss

1. To add a dashlet, click <img src="/files/SnUBPK2c5ltW0v4DMJAT" alt="" data-size="line"> at the top right. The **Bar** dashlet option should be the first to open up.
2. Set the **X-Axis** to `metadata.euclidean_diff_from_class_centroid`.
3. Set the **Interval** to `1`.
4. Turn on the **Split series by subset** and the **Show only last epoch** options.
5. Close the dashlet options panel to fully view the chart.

#### Dashboard

You can reposition and resize each dashlet within the dashboard. Here is the final layout:

![Dashboard with Custom Dashlets](/files/HUhpjLdqWSb7rwMyDYr9)

### Metrics Analysis

In this section, we will investigate the metrics within our custom dashboard.

First, let's focus on the `Centroid Dist vs Loss` visualization we created:

![Centroid Dist vs Loss](/files/60UBMSsYym33AdJYeTJz)

The visualization above displays a histogram of the average loss versus the Euclidean distance from the class centroid. It reveals a strong correlation between distance and loss: samples with high distance values tend to have higher losses.
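As a rough sketch of what such a chart aggregates, the snippet below groups samples into distance buckets of width `1` and averages the loss in each bucket. The distance and loss values are invented for illustration and are not results from the tutorial model. How the dashlet buckets values internally is an assumption here.

```python
import numpy as np

# Invented (distance, loss) pairs for illustration; not results from the tutorial model.
distances = np.array([4.2, 4.8, 5.1, 5.9, 6.3, 7.7])
losses = np.array([0.01, 0.03, 0.02, 0.06, 0.40, 1.20])

interval = 1.0  # mirrors the Interval setting of the bar dashlet

# Assign each sample to the bucket [k * interval, (k + 1) * interval).
bucket_ids = np.floor(distances / interval).astype(int)

# Average loss per bucket, keyed by the bucket's lower edge.
avg_loss_per_bucket = {
    float(b * interval): float(losses[bucket_ids == b].mean())
    for b in np.unique(bucket_ids)
}

# Pearson correlation between distance and loss over the individual samples.
correlation = float(np.corrcoef(distances, losses)[0, 1])

for lower_edge, avg_loss in avg_loss_per_bucket.items():
    print(f"[{lower_edge:.0f}, {lower_edge + interval:.0f}): average loss {avg_loss:.2f}")
```

A rising average loss across buckets, together with a positive correlation coefficient, is the pattern the chart is described as showing.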

### Sample Analysis

In the table in our dashboard, we see two samples that fall into the high-distance bucket, one of them with a very high loss. Let's run a Sample Analysis on that sample:

1. Select **Analyzer** from the drop-down at the top of the **Dashboard** view.
2. Click <img src="/files/dZ86b2AejnAl0JYyK8vM" alt="" data-size="line"> and choose **Analyze Sample**.
3. Set **Dataset Slice** to `Validation` and set the **Sample Index** to the sample index found in the `Sample Loss` table, `6754`.
4. Click <img src="/files/yWDJczNmxVpnuM78oXUR" alt="" data-size="line">.

From the **Sample Analysis** above, we get the following results:

![Prediction and Ground Truth (click-to-zoom)](/files/nGAQYcZ1M331X8Y777h8) ![Original Sample](/files/bT8MZAHLbp3UdnbmxEkJ)

From the results, we see that the model confuses this sample (the digit `8`) with the digit `0`. We can also see that this sample was written with a thick marker, which results in a high Euclidean distance from the class average.
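The effect of stroke thickness on this metric can be illustrated with a toy example. In the sketch below, the 5x5 images and the `stroke` helper are hypothetical: a centroid is built from thin vertical strokes, and a thick stroke then lands farther from it than a thin one.

```python
import numpy as np

def stroke(columns, size=5):
    """Return a size x size image with a vertical stroke of 1.0 in the given columns."""
    img = np.zeros((size, size))
    img[:, columns] = 1.0
    return img

# "Training" samples: thin strokes at slightly different horizontal positions.
train_images = np.stack([stroke([1]), stroke([2]), stroke([3])])
centroid = train_images.mean(axis=0)

# Samples to compare: a typical thin stroke and a thick stroke covering three columns.
thin_sample = stroke([2])
thick_sample = stroke([1, 2, 3])

thin_distance = float(np.linalg.norm(thin_sample - centroid))
thick_distance = float(np.linalg.norm(thick_sample - centroid))

print(f"thin: {thin_distance:.3f}, thick: {thick_distance:.3f}")
```

The thick stroke lights up many pixels that are mostly dark in the average image, so its distance is larger even though it depicts the same shape.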

## Conclusion

This section concludes our tutorial on the MNIST dataset.

For another tutorial on performing model analysis using Tensorleap, please check the next section dealing with building a classifier model to predict positive and negative reviews using the IMDB movie database.

Ready for more? Go to the [**IMDB Guide**](/guides/full-guides/imdb-guide.md).

