As the model generalizes various characteristics in the data, samples with similar metadata cluster together in the similarity map. Furthermore, we can use additional sample metadata to identify correlations between these characteristics and the model's performance.
In this section, we'll add custom metadata to our dataset and inspect such correlations using the Metrics dashboard.
Add Custom Metadata
As an example, we will be adding the following metadata:
Length - the number of words in a sample.
Score - the IMDB score a user gave the target movie.
These metadata functions calculate and return the length and score, respectively, of each sample in the IMDB dataset. We will add them to our Integration Script; for more information, see Metadata Function.
Integration Script
In the Resources Management view, click the imdb dataset and add the code below to its script.
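For reference, here is a sketch of the two metadata functions and their binder registrations. `score_metadata` is the same function that appears in the full script below; `length_metadata` is an illustrative word-count implementation (it reuses the `_download` helper already defined in the script), and the `length` metadata name is an assumption - adapt it as needed:

```python
# Length metadata (sketch) - downloads the sample's text and counts its words.
# Illustrative only; the full script below registers the `gt` and `score` metadata.
def length_metadata(idx: int, preprocess: PreprocessResponse) -> int:
    comment_path = preprocess.data['df']['paths'][idx]
    local_path = _download(comment_path)
    with open(local_path, 'r') as f:
        comment = f.read()
    return len(comment.split())


# Score metadata - extracts the IMDB score from the sample's file name (as in the full script below).
def score_metadata(idx, preprocess: PreprocessResponse) -> int:
    return int(preprocess.data['df']['paths'][idx].split("_")[1].split(".")[0])


# Register the metadata with the leap_binder
leap_binder.set_metadata(function=length_metadata, metadata_type=DatasetMetadataType.int, name='length')
leap_binder.set_metadata(function=score_metadata, metadata_type=DatasetMetadataType.int, name='score')
```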
For convenience, you can find the full script with additional metadata below:
Full Script
```python
from typing import List, Optional, Callable, Tuple, Dict
import json, os, re, string
from os.path import basename, dirname, join
import pandas as pd
import numpy as np
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage
from google.cloud.storage import Bucket
from keras_preprocessing.text import Tokenizer as TokenizerType
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from pandas.core.frame import DataFrame as DataFrameType

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType, LeapDataType, Metric
from code_loader.contract.visualizer_classes import LeapText

NUMBER_OF_SAMPLES = 20000
BUCKET_NAME = 'example-datasets-47ml982d'
PROJECT_ID = 'example-dev-project-nmrksf0o'


### Helper Functions: ###
def _connect_to_gcs() -> Bucket:
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(BUCKET_NAME)


def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    BASE_PATH = "imdb"
    cloud_file_path = join(BASE_PATH, cloud_file_path)
    # if local_file_path is not specified, save to the home dir
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        assert home_dir is not None
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)

    # check if the file already exists
    if os.path.exists(local_file_path):
        return local_file_path

    bucket = _connect_to_gcs()
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path


def load_tokenizer(tokenizer_path: str) -> TokenizerType:
    with open(tokenizer_path, 'r') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
    return tokenizer


def download_load_assets() -> Tuple[TokenizerType, DataFrameType]:
    cloud_path = join("assets", "imdb.csv")
    local_path = _download(cloud_path)
    df = pd.read_csv(local_path)
    cloud_path = join("assets", "tokenizer_v2.json")
    local_path = _download(cloud_path)
    tokenizer = load_tokenizer(local_path)
    return tokenizer, df


# Preprocess Function
def preprocess_func() -> List[PreprocessResponse]:
    tokenizer, df = download_load_assets()
    train_label_size = int(0.9 * NUMBER_OF_SAMPLES / 2)
    val_label_size = int(0.1 * NUMBER_OF_SAMPLES / 2)
    df = df[df['subset'] == 'train']
    train_df = pd.concat([df[df['gt'] == 'pos'][:train_label_size],
                          df[df['gt'] == 'neg'][:train_label_size]], ignore_index=True)
    val_df = pd.concat([df[df['gt'] == 'pos'][train_label_size:train_label_size + val_label_size],
                        df[df['gt'] == 'neg'][train_label_size:train_label_size + val_label_size]], ignore_index=True)
    ohe = {"pos": [1.0, 0.], "neg": [0., 1.0]}

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data dictionary.
    # In this example we pass the dataframe, tokenizer and one-hot mapping that are later
    # used by the input and ground-truth encoders.
    train = PreprocessResponse(length=2 * train_label_size, data={"df": train_df, "tokenizer": tokenizer, "ohe": ohe})
    val = PreprocessResponse(length=2 * val_label_size, data={"df": val_df, "tokenizer": tokenizer, "ohe": ohe})
    response = [train, val]

    # Adding custom data to leap_binder for later usage within the visualizer function
    leap_binder.custom_tokenizer = tokenizer
    return response


# Input Encoder Helper Functions
def standardize(comment: str) -> str:
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    punctuation_stripped = re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)
    return punctuation_stripped


def prepare_input(tokanizer: TokenizerType, input_text: str, sequence_length: int = 250) -> np.ndarray:
    standard_text = standardize(input_text)
    tokanized_input = tokanizer.texts_to_sequences([standard_text])
    padded_input = pad_sequences(tokanized_input, maxlen=sequence_length)
    return padded_input[0, ...]


# Input Encoder - fetches the text with the index `idx` from the `paths` array set in
# the PreprocessResponse's data. Returns a numpy array containing the padded tokenized input.
def input_tokens(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    comment_path = preprocess.data['df']['paths'][idx]
    local_path = _download(comment_path)
    with open(local_path, 'r') as f:
        comment = f.read()
    tokenizer = preprocess.data['tokenizer']
    padded_input = prepare_input(tokenizer, comment)
    return padded_input


# Ground Truth Encoder - fetches the label with the index `idx` from the `gt` array set in
# the PreprocessResponse's data. Returns a one-hot vector label correlated with the sample.
def gt_sentiment(idx: int, preprocess: PreprocessResponse) -> List[float]:
    gt_str = preprocess.data['df']['gt'][idx]
    return preprocess.data['ohe'][gt_str]


# Metadata functions allow adding extra data for later use in analysis.
# This metadata adds the ground truth of each sample (not a one-hot vector).
def gt_metadata(idx: int, preprocess: PreprocessResponse) -> str:
    if preprocess.data['df']['gt'][idx] == "pos":
        return "positive"
    else:
        return "negative"


# Visualizer functions define how to interpret the data and visualize it.
# In this example we define a tokens-to-text visualizer.
def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    texts = tokenizer.sequences_to_texts([data])[0]
    return LeapText(texts)


def score_metadata(idx, preprocess: PreprocessResponse) -> int:
    return int(preprocess.data['df']['paths'][idx].split("_")[1].split(".")[0])


# Binders
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_tokens, name='tokens')
leap_binder.set_ground_truth(function=gt_sentiment, name='sentiment')
leap_binder.set_metadata(function=gt_metadata, metadata_type=DatasetMetadataType.string, name='gt')
leap_binder.set_metadata(function=score_metadata, metadata_type=DatasetMetadataType.int, name='score')
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')
leap_binder.add_prediction(name='sentiment', labels=['positive', 'negative'], metrics=[Metric.BinaryAccuracy])
```
Dataset Block
After saving the updated script, the Dataset Block must be updated as well. To do so, follow these steps:
Open the IMDB project.
On the Dataset Block in the Network view, click the Update button. More info at Script Version.
Follow the steps above also for the imdb_cnn model we imported earlier in the Model Perception Analysis section of this tutorial, using imdb_cnn-extra as the Revision Name.
Add Custom Dashlets
In this section, we will add custom dashlets that use the added metadata.
Open the imdb Dashboard that was created in the Model Integration step and follow the steps below.
Loss by Sample
Set the Dashlet Name to Sample Loss.
Under Metrics, add a field and set it to metrics.loss with average aggregation.
Under Metadata add these fields:
sample_identity.index
dataset_slice.keyword
Close the dashlet options panel to fully view the table.
Loss vs Score
Set the X-Axis to metadata.score.
Set the Interval to 1.
Turn on the Split series by subset and the Show only last epoch options.
Close the dashlet options panel to fully view the chart.
Dashboard
You can reposition and resize each dashlet within the dashboard. Here is the final layout:
Conclusion
This section concludes our tutorial on the IMDB dataset.
We also have another tutorial on building and training a classification model using the MNIST dataset. If you haven't gone through it yet, see our MNIST Guide.
You can also check out reference documentation for the Tensorleap UI and Command Line Interface (CLI) in Reference.