# Advanced Metrics

As the model learns to generalize characteristics of the data, samples with similar metadata tend to cluster together in the similarity map. We can also use additional per-sample metadata to identify correlations between those characteristics and the model's performance.

In this section, we'll add custom metadata to our dataset and inspect such correlations using the Metrics dashboard.

## Add Custom Metadata

As an example, we will add the following metadata:

* Length - the number of words in a sample.
* Score - the IMDB score the reviewer gave the movie.

These metadata functions calculate and return the length and score, respectively, of each sample in the IMDB dataset. For more information, see [**Metadata Function**](/tensorleap-integration/writing-integration-code/metadata-function.md). We will add them to our **Integration Script**.
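
The score is not stored in its own column; the snippet in the next section reads it from each sample's file path. The trace below is a minimal sketch of that parsing, assuming the paths follow the `<id>_<rating>.txt` naming of the original IMDB review files. The helper name `parse_score` and the example path are hypothetical and used only for illustration.

```python
def parse_score(path: str) -> int:
    # Mirrors the expression used in score_metadata:
    # take the part after the first underscore, then drop the file extension.
    return int(path.split("_")[1].split(".")[0])


# Hypothetical path; the real values live in the `paths` column of the dataframe.
example_path = "train/pos/127_9.txt"
print(parse_score(example_path))  # 9
```

Note that this expression relies on the first underscore in the path being the one inside the file name. If a directory name contained an underscore, the parsing would need to start from the file's base name instead.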

### Integration Script

In the [**Resources Management**](/user-interface/resources-management.md) view, click the `imdb` dataset and add the code below to its script.

**Code snippet**

```python
# Reads the reviewer's score from the sample's file path.
def score_metadata(idx: int, preprocess: PreprocessResponse) -> int:
    return int(preprocess.data['df']['paths'][idx].split("_")[1].split(".")[0])

leap_binder.set_metadata(function=score_metadata, metadata_type=DatasetMetadataType.int, name='score')
```
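
The snippet above registers only the score. For the Length metadata listed earlier, a minimal sketch could look like the following. It assumes that length means the number of whitespace-separated words after the same cleanup that the full script's `standardize` function applies. The names `count_words` and `length_metadata` are hypothetical, and the commented wiring relies on the script's own `_download` helper, so only the word-counting part runs on its own.

```python
import re
import string


def standardize(comment: str) -> str:
    # Same cleanup as in the full script: lowercase, replace <br /> tags, strip punctuation.
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    return re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)


def count_words(comment: str) -> int:
    # Number of whitespace-separated words after standardization.
    return len(standardize(comment).split())


# Hypothetical wiring inside the integration script (not runnable on its own,
# since it depends on `_download`, `PreprocessResponse` and `leap_binder`):
#
# def length_metadata(idx: int, preprocess: PreprocessResponse) -> int:
#     local_path = _download(preprocess.data['df']['paths'][idx])
#     with open(local_path, 'r') as f:
#         return count_words(f.read())
#
# leap_binder.set_metadata(function=length_metadata,
#                          metadata_type=DatasetMetadataType.int, name='length')
```

Counting words on the standardized text keeps the metadata consistent with what the tokenizer receives; counting raw words instead would be an equally valid choice for analysis.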

For convenience, you can find the full script with additional metadata below:

<details>

<summary>Full Script (expandable)</summary>

```python
from typing import List, Optional, Callable, Tuple, Dict

import json, os, re, string
from os.path import basename, dirname, join

import pandas as pd
import numpy as np
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage
from google.cloud.storage import Bucket
from keras_preprocessing.text import Tokenizer as TokenizerType
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from pandas.core.frame import DataFrame as DataFrameType

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType, LeapDataType, Metric
from code_loader.contract.visualizer_classes import LeapText

NUMBER_OF_SAMPLES = 20000
BUCKET_NAME = 'example-datasets-47ml982d'
PROJECT_ID = 'example-dev-project-nmrksf0o'

### Helper Functions: ###
def _connect_to_gcs() -> Bucket:
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(BUCKET_NAME)


def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    BASE_PATH = "imdb"
    cloud_file_path = join(BASE_PATH, cloud_file_path)
    # if local_file_path is not specified saving in home dir
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        assert home_dir is not None
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)
    # check if file already exists
    if os.path.exists(local_file_path):
        return local_file_path

    bucket = _connect_to_gcs()
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path


def load_tokenizer(tokenizer_path: str) -> TokenizerType:
    with open(tokenizer_path, 'r') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
    return tokenizer


def download_load_assets() -> Tuple[TokenizerType, DataFrameType]:
    cloud_path = join("assets", "imdb.csv")
    local_path = _download(cloud_path)
    df = pd.read_csv(local_path)
    cloud_path = join("assets", "tokenizer_v2.json")
    local_path = _download(cloud_path)
    tokenizer = load_tokenizer(local_path)
    return tokenizer, df


# Preprocess Function
def preprocess_func() -> List[PreprocessResponse]:
    tokenizer, df = download_load_assets()
    train_label_size = int(0.9 * NUMBER_OF_SAMPLES / 2)
    val_label_size = int(0.1 * NUMBER_OF_SAMPLES / 2)
    df = df[df['subset'] == 'train']
    train_df = pd.concat([df[df['gt'] == 'pos'][:train_label_size], df[df['gt'] == 'neg'][:train_label_size]], ignore_index=True)
    val_df = pd.concat([df[df['gt'] == 'pos'][train_label_size:train_label_size + val_label_size], df[df['gt'] == 'neg'][train_label_size:train_label_size + val_label_size]], ignore_index=True)
    ohe = {"pos": [1.0, 0.], "neg": [0., 1.0]}

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data dictionary.
    # In this example we pass the dataframe, the tokenizer and the one-hot mapping, which the encoders later use to build the inputs and outputs
    train = PreprocessResponse(length=2 * train_label_size, data={"df": train_df, "tokenizer": tokenizer, "ohe": ohe})
    val = PreprocessResponse(length=2 * val_label_size, data={"df": val_df, "tokenizer": tokenizer, "ohe": ohe})
    response = [train, val]

    # Adding custom data to leap_binder for later usage within the visualizer function
    leap_binder.custom_tokenizer = tokenizer
    
    return response


# Input Encoder Helper Functions
def standardize(comment: str) -> str:
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    punctuation_stripped = re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)
    return punctuation_stripped


def prepare_input(tokenizer: TokenizerType, input_text: str, sequence_length: int = 250) -> np.ndarray:
    standard_text = standardize(input_text)
    tokenized_input = tokenizer.texts_to_sequences([standard_text])
    padded_input = pad_sequences(tokenized_input, maxlen=sequence_length)
    return padded_input[0, ...]

# Input Encoder - fetches the text with the index `idx` from the `paths` array set in
# the PreprocessResponse's data. Returns a numpy array containing padded tokenized input. 
def input_tokens(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    comment_path = preprocess.data['df']['paths'][idx]
    local_path = _download(comment_path)
    with open(local_path, 'r') as f:
        comment = f.read()
    tokenizer = preprocess.data['tokenizer']
    padded_input = prepare_input(tokenizer, comment)
    return padded_input

# Ground Truth Encoder - fetches the label with the index `idx` from the `gt` array set in
# the PreprocessResponse's data. Returns the one-hot label vector (a list of floats) of the sample.
def gt_sentiment(idx: int, preprocess: PreprocessResponse) -> List[float]:
    gt_str = preprocess.data['df']['gt'][idx]
    return preprocess.data['ohe'][gt_str]


# Metadata functions let you add extra data for later use in analysis.
# This metadata adds the ground truth of each sample as a string (not a one-hot vector).
def gt_metadata(idx: int, preprocess: PreprocessResponse) -> str:
    if preprocess.data['df']['gt'][idx] == "pos":
        return "positive"
    else:
        return "negative"

# Visualizer functions define how to interpret the data and visualize it.
# In this example we define a tokens-to-text visualizer.
def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    texts = tokenizer.sequences_to_texts([data])[0]
    return LeapText(texts)

# This metadata reads the reviewer's score from the sample's file path.
def score_metadata(idx: int, preprocess: PreprocessResponse) -> int:
    return int(preprocess.data['df']['paths'][idx].split("_")[1].split(".")[0])


# Binders
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_tokens, name='tokens')
leap_binder.set_ground_truth(function=gt_sentiment, name='sentiment')
leap_binder.set_metadata(function=gt_metadata, metadata_type=DatasetMetadataType.string, name='gt')
leap_binder.set_metadata(function=score_metadata, metadata_type=DatasetMetadataType.int, name='score')
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')
leap_binder.add_prediction(name='sentiment', labels=['positive','negative'], metrics=[Metric.BinaryAccuracy])
```

</details>

Once you add the code to the script, click <img src="/files/mfsLBdUu3rxdhDhnfHZ1" alt="" data-size="line"> to save the **Dataset**.

### Dataset Block

After updating and saving the script, our dataset block needs to be updated. To do so, follow these steps:

1. Open the `IMDB` project.
2. From the **Versions** view, position your cursor over the `dense-nn` model revision and click <img src="/files/uG4lQceLsLKBgiqjLMTA" alt="" data-size="line"> to **Open Commit**.
3. On the **Dataset Block** in the **Network** view, click the **Update** button. More info at [**Script Version**](/user-interface/project/network/network-mapping/create-a-mapping-deprecated/input-node.md#script-version).
4. To save the version with the updated dataset block, click the <img src="/files/KufD8iWbuVi1Smd314P4" alt="" data-size="line"> button and set the `Revision Name` to `dense-nn-extra`. More info at [**Versions**](/user-interface/project/versions.md).
5. To train the updated model, click <img src="/files/EjRb7Kqxt5ZYUOF3Xy3L" alt="" data-size="line"> from the top bar. We'll set the `Number of Epochs` to `10` and click <img src="/files/Is9sXxG5yWqlFXZb0k6P" alt="" data-size="line">. More info at [**Evaluate/Train Model**](/user-interface/project/menu-bar/evaluate-a-model.md).
6. Under the `dense-nn-extra` revision on the **Versions** view, click <img src="/files/Y6gaIy1hktwIqCf4NHEy" alt="" data-size="line"> to display the new version's metrics on the dashboard.

Repeat steps 2-6 for the `imdb_cnn` model we imported earlier in the [**Model Perception Analysis**](/guides/full-guides/imdb-guide/model-perception-analysis.md) section of this tutorial, using `imdb_cnn-extra` as the `Revision Name`.

### Add Custom Dashlets

In this section, you will add custom **Dashlets** that use the added metadata.

Open the `imdb` [**Dashboard**](/user-interface/dashboards/dashlets/metrics-dashboard.md) that was created in the [**Model Integration**](/guides/full-guides/imdb-guide/model-integration.md#add-a-dashboard-and-dashlets) step and follow the steps below.

#### Loss by Sample

1. To add a dashlet, click <img src="/files/SnUBPK2c5ltW0v4DMJAT" alt="" data-size="line"> at the top right.
2. Choose the **Table** type **Dashlet** by clicking <img src="/files/dnFP0fjm2QtwstXKvHOC" alt="" data-size="line"> on the left side of the **Dashlet**.
3. Set the **Dashlet Name** to `Sample Loss`.
4. Under **Metrics** add a field and set `metrics.loss` with `average` aggregation.
5. Under **Metadata** add these fields:
   * `sample_identity.index`
   * `dataset_slice.keyword`
6. Close the dashlet options panel to fully view the table.

#### Loss vs Score

1. To add a dashlet, click <img src="/files/SnUBPK2c5ltW0v4DMJAT" alt="" data-size="line"> at the top right. The **Bar** dashlet option should be the first to open up.
2. Set the **X-Axis** to `metadata.score`.
3. Set the **Interval** to `1`.
4. Turn on the **Split series by subset** and the **Show only last epoch** options.
5. Close the dashlet options panel to fully view the chart.
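
To make the settings above more concrete, the sketch below reproduces the kind of aggregation such a chart displays: loss averaged per score bin, split by subset, using only the last epoch. This is an illustration under stated assumptions, not Tensorleap's implementation. The record fields, the subset names, and the function `loss_by_score` are hypothetical, and it assumes the dashlet plots the average loss on the Y-axis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample records; field names and values are illustrative only.
records = [
    {"epoch": 0, "subset": "training", "score": 1, "loss": 0.9},
    {"epoch": 1, "subset": "training", "score": 1, "loss": 0.25},
    {"epoch": 1, "subset": "training", "score": 1, "loss": 0.75},
    {"epoch": 1, "subset": "training", "score": 10, "loss": 0.5},
    {"epoch": 1, "subset": "validation", "score": 4, "loss": 1.5},
]


def loss_by_score(rows, interval=1):
    # "Show only last epoch": keep records from the highest epoch only.
    last_epoch = max(row["epoch"] for row in rows)
    buckets = defaultdict(list)
    for row in rows:
        if row["epoch"] != last_epoch:
            continue
        # "Interval": map each score to the start of its bin.
        bin_start = (row["score"] // interval) * interval
        # "Split series by subset": one series per subset.
        buckets[(row["subset"], bin_start)].append(row["loss"])
    # Average loss per (subset, score bin), ordered by subset and bin.
    return {key: mean(values) for key, values in sorted(buckets.items())}


result = loss_by_score(records)
for (subset, score_bin), avg_loss in result.items():
    print(subset, score_bin, avg_loss)
```

With an interval of `1`, every integer score gets its own bar, which suits a discrete rating scale such as the one in this dataset.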

#### Dashboard

You can reposition and resize each dashlet within the dashboard. Here is the final layout:

![Custom Dashboard and Dashlets](/files/yoHce81wTdAKSP5WHRyF)

## Conclusion

This section concludes our tutorial on the IMDB dataset.

We also have another tutorial on building and training a classification model using the `mnist` database. If you haven't gone through it yet, go to our [**MNIST Guide**](/guides/full-guides/mnist-guide.md).

You can also check out reference documentation for the Tensorleap UI and Command Line Interface (CLI) in [**Reference**](/user-interface.md).

