This section covers the integration of the imdb dataset into Tensorleap. We'll later use this dataset with a classification model.
Dataset Script
Below is the full dataset script to be used in the integration. More information about the structure of this script can be found under Dataset Script.
from typing import List, Optional, Callable, Tuple, Dict
import json, os, re, string
from os.path import basename, dirname, join
import pandas as pd
import numpy as np
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage
from google.cloud.storage import Bucket
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from pandas.core.frame import DataFrame as DataFrameType

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType, LeapDataType, Metric
from code_loader.contract.visualizer_classes import LeapText

NUMBER_OF_SAMPLES = 20000
BUCKET_NAME = 'example-datasets-47ml982d'
PROJECT_ID = 'example-dev-project-nmrksf0o'


### Helper Functions: ###
def _connect_to_gcs() -> Bucket:
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(BUCKET_NAME)


def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    BASE_PATH = "imdb"
    cloud_file_path = join(BASE_PATH, cloud_file_path)
    # if local_file_path is not specified, save in the home dir
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        assert home_dir is not None
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)

    # check if the file already exists
    if os.path.exists(local_file_path):
        return local_file_path

    bucket = _connect_to_gcs()
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path


def load_tokenizer(tokenizer_path: str):
    with open(tokenizer_path, 'r') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
    return tokenizer


def download_load_assets():
    cloud_path = join("assets", "imdb.csv")
    local_path = _download(cloud_path)
    df = pd.read_csv(local_path)
    cloud_path = join("assets", "tokenizer_v2.json")
    local_path = _download(cloud_path)
    tokenizer = load_tokenizer(local_path)
    return tokenizer, df


# Preprocess Function
def preprocess_func() -> List[PreprocessResponse]:
    tokenizer, df = download_load_assets()
    train_label_size = int(0.9 * NUMBER_OF_SAMPLES / 2)
    val_label_size = int(0.1 * NUMBER_OF_SAMPLES / 2)
    df = df[df['subset'] == 'train']
    train_df = pd.concat([df[df['gt'] == 'pos'][:train_label_size],
                          df[df['gt'] == 'neg'][:train_label_size]], ignore_index=True)
    val_df = pd.concat([df[df['gt'] == 'pos'][train_label_size:train_label_size + val_label_size],
                        df[df['gt'] == 'neg'][train_label_size:train_label_size + val_label_size]],
                       ignore_index=True)
    ohe = {"pos": [1.0, 0.], "neg": [0., 1.0]}

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data dictionary.
    # In this example we pass the dataframe, tokenizer, and one-hot mapping that are
    # later used to encode the inputs and outputs.
    train = PreprocessResponse(length=2 * train_label_size,
                               data={"df": train_df, "tokenizer": tokenizer, "ohe": ohe})
    val = PreprocessResponse(length=2 * val_label_size,
                             data={"df": val_df, "tokenizer": tokenizer, "ohe": ohe})
    response = [train, val]

    # Adding custom data to leap_binder for later usage within the visualizer function
    leap_binder.custom_tokenizer = tokenizer
    return response


# Input Encoder Helper Functions
def standardize(comment: str) -> str:
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    punctuation_stripped = re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)
    return punctuation_stripped


def prepare_input(tokenizer, input_text: str, sequence_length: int = 250) -> np.ndarray:
    standard_text = standardize(input_text)
    tokenized_input = tokenizer.texts_to_sequences([standard_text])
    padded_input = pad_sequences(tokenized_input, maxlen=sequence_length)
    return padded_input[0, ...]


# Input Encoder - fetches the text with the index `idx` from the `paths` array set in
# the PreprocessResponse's data. Returns a numpy array containing the padded tokenized input.
def input_tokens(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    comment_path = preprocess.data['df']['paths'][idx]
    local_path = _download(comment_path)
    with open(local_path, 'r') as f:
        comment = f.read()
    tokenizer = preprocess.data['tokenizer']
    padded_input = prepare_input(tokenizer, comment)
    return padded_input


# Ground Truth Encoder - fetches the label with the index `idx` from the `gt` array set in
# the PreprocessResponse's data. Returns a one-hot vector label correlated with the sample.
def gt_sentiment(idx: int, preprocess: PreprocessResponse) -> List[float]:
    gt_str = preprocess.data['df']['gt'][idx]
    return preprocess.data['ohe'][gt_str]


# Metadata functions allow adding extra data for later use in analysis.
# This metadata adds the ground truth of each sample (not a one-hot vector).
def gt_metadata(idx: int, preprocess: PreprocessResponse) -> str:
    if preprocess.data['df']['gt'][idx] == "pos":
        return "positive"
    else:
        return "negative"


# Visualizer functions define how to interpret the data and visualize it.
# In this example we define a tokens-to-text visualizer.
def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    texts = tokenizer.sequences_to_texts([data])
    return LeapText(texts[0].split(' '))


# Binders
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_tokens, name='tokens')
leap_binder.set_ground_truth(function=gt_sentiment, name='sentiment')
leap_binder.set_metadata(function=gt_metadata, metadata_type=DatasetMetadataType.string, name='gt')
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')
leap_binder.add_prediction(name='sentiment', labels=['positive', 'negative'])
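The text-cleaning step in the script is easy to sanity-check in isolation. The sketch below repeats the standardize helper (it depends only on the Python standard library) and runs it on a made-up review; the sample comment is illustrative and not part of the dataset.

```python
import re
import string

def standardize(comment: str) -> str:
    # Mirror the cleaning used by the dataset script above:
    # lowercase, replace HTML <br /> tags with spaces, strip punctuation.
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    punctuation_stripped = re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)
    return punctuation_stripped

print(standardize("Great movie!<br />Loved it."))  # -> great movie loved it
```

The cleaned string is then what the tokenizer sees before padding, so punctuation and HTML line breaks never reach the model's vocabulary.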
Add Integration Script
Add a Dataset Using UI
To add a new Dataset:
In the Dataset Editor, enter these properties:
Dataset Name: imdb
Script: copy and paste the script from the Dataset Script above
Click Save.
After saving the imdb dataset, the platform will automatically parse the dataset script. This process evaluates the script and verifies that all of its functions work as expected, including the ability to successfully read the data.
Upon successful parsing, the details of the imdb dataset will be displayed on the right. In case of unsuccessful parsing, errors will be shown instead.
Initial CLI Setup
Verify that leapcli is installed. For more information, see Installing Leap CLI.
Project Folder Setup
Create a folder for our imdb project.
mkdir imdb
cd imdb
Initialize and synchronize the created folder with the Tensorleap platform. The command leap init (PROJECT) (DATASET) (--h5/--onnx) sets up the .tensorleap folder within the project folder, and should be run with the following parameters:
PROJECT = IMDB (project name)
DATASET = imdb (dataset name)
(--h5/--onnx) = model format, --h5 for TensorFlow (H5) and --onnx for PyTorch (ONNX)
leap init IMDB myorg imdb --h5
Next, set your credentials in the Leap CLI by running the following command:
leap login [API_ID] [API_KEY]
Push Dataset
When using the CLI, the Dataset Script is defined within the .tensorleap/dataset.py file, and the Dataset Instance is created/updated upon performing leap push.
By default, the .tensorleap/dataset.py file has a sample template. Let's replace it with our Dataset Script above. One way to do it is with vim:
vim .tensorleap/dataset.py
Next, we'll push our dataset to the Tensorleap platform using the following command:
leap push --dataset
It should print out:
New dataset detected. Dataset name: imdb
Push command successfully complete
Congrats! You have successfully created the imdb Dataset Instance and integrated the Dataset Script. You can view it in the UI in the Resources Management view.
Up Next - Model Integration
The purpose of this section was to help you define a dataset script and create a dataset instance in Tensorleap.
Now that the imdb dataset has been integrated into Tensorleap, we can use it with a classification model, which is exactly what we'll build in the next section.