Dataset Integration


This section covers the integration of the imdb dataset into Tensorleap. We'll later use this dataset with a classification model.

Dataset Script

Below is the full dataset script used in this integration. More information about the structure of this script can be found in the Integration Script section.

from typing import List, Optional, Callable, Tuple, Dict

import json, os, re, string
from os.path import basename, dirname, join

import pandas as pd
import numpy as np
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage
from google.cloud.storage import Bucket

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from pandas.core.frame import DataFrame as DataFrameType

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import DatasetMetadataType, LeapDataType, Metric
from code_loader.contract.visualizer_classes import LeapText

NUMBER_OF_SAMPLES = 20000
BUCKET_NAME = 'example-datasets-47ml982d'
PROJECT_ID = 'example-dev-project-nmrksf0o'

### Helper Functions: ###
def _connect_to_gcs() -> Bucket:
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(BUCKET_NAME)


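# Download a file from the 'imdb' folder of the example GCS bucket and cache it locally; returns the local path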
def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    BASE_PATH = "imdb"
    cloud_file_path = join(BASE_PATH, cloud_file_path)
    # if local_file_path is not specified, save under the home directory
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        assert home_dir is not None
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)
    # check if file already exists
    if os.path.exists(local_file_path):
        return local_file_path

    bucket = _connect_to_gcs()
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path


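# Load the fitted Keras tokenizer from its JSON file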
def load_tokenizer(tokenizer_path: str):
    with open(tokenizer_path, 'r') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
    return tokenizer


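# Download the IMDB metadata CSV and the fitted tokenizer from the bucket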
def download_load_assets():
    cloud_path = join("assets", "imdb.csv")
    local_path = _download(cloud_path)
    df = pd.read_csv(local_path)
    cloud_path = join("assets", "tokenizer_v2.json")
    local_path = _download(cloud_path)
    tokenizer = load_tokenizer(local_path)
    return tokenizer, df


# Preprocess Function
def preprocess_func() -> List[PreprocessResponse]:
    tokenizer, df = download_load_assets()
    train_label_size = int(0.9 * NUMBER_OF_SAMPLES / 2)
    val_label_size = int(0.1 * NUMBER_OF_SAMPLES / 2)
    df = df[df['subset'] == 'train']
    train_df = pd.concat([df[df['gt'] == 'pos'][:train_label_size], df[df['gt'] == 'neg'][:train_label_size]], ignore_index=True)
    val_df = pd.concat([df[df['gt'] == 'pos'][train_label_size:train_label_size + val_label_size], df[df['gt'] == 'neg'][train_label_size:train_label_size + val_label_size]], ignore_index=True)
    ohe = {"pos": [1.0, 0.], "neg": [0., 1.0]}

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data dictionary.
    # In this example we pass the dataframe, the tokenizer and the one-hot mapping, later used by the encoders
    train = PreprocessResponse(length=2 * train_label_size, data={"df": train_df, "tokenizer": tokenizer, "ohe": ohe})
    val = PreprocessResponse(length=2 * val_label_size, data={"df": val_df, "tokenizer": tokenizer, "ohe": ohe})
    response = [train, val]

    # Adding custom data to leap_binder for later usage within the visualizer function
    leap_binder.custom_tokenizer = tokenizer
    
    return response


# Input Encoder Helper Functions
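# standardize - lowercase the comment, strip <br /> tags and remove punctuation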
def standardize(comment: str) -> str:
    lowercase = comment.lower()
    html_stripped = re.sub('<br />', ' ', lowercase)
    punctuation_stripped = re.sub('[%s]' % re.escape(string.punctuation), '', html_stripped)
    return punctuation_stripped


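# prepare_input - standardize, tokenize and pad the raw text to a fixed-length (250) token sequence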
def prepare_input(tokenizer, input_text: str, sequence_length: int = 250) -> np.ndarray:
    standard_text = standardize(input_text)
    tokenized_input = tokenizer.texts_to_sequences([standard_text])
    padded_input = pad_sequences(tokenized_input, maxlen=sequence_length)
    return padded_input[0, ...]

# Input Encoder - fetches the path stored at index `idx` in the dataframe's `paths` column, downloads
# the comment text, and returns a numpy array containing the padded, tokenized input.
def input_tokens(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    comment_path = preprocess.data['df']['paths'][idx]
    local_path = _download(comment_path)
    with open(local_path, 'r') as f:
        comment = f.read()
    tokenizer = preprocess.data['tokenizer']
    padded_input = prepare_input(tokenizer, comment)
    return padded_input

# Ground Truth Encoder - fetches the label with the index `idx` from the `gt` column set in
# the PreprocessResponse's data. Returns a one-hot vector label correlated with the sample.
def gt_sentiment(idx: int, preprocess: PreprocessResponse) -> List[float]:
    gt_str = preprocess.data['df']['gt'][idx]
    return preprocess.data['ohe'][gt_str]


# Metadata functions allow adding extra data for later use in analysis.
# This metadata stores the ground truth of each sample as a string rather than a one-hot vector.
def gt_metadata(idx: int, preprocess: PreprocessResponse) -> str:
    if preprocess.data['df']['gt'][idx] == "pos":
        return "positive"
    else:
        return "negative"

# Visualizer functions define how to interpret the data and visualize it.
# In this example we define a tokens-to-text visualizer.
def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    texts = tokenizer.sequences_to_texts([data])
    return LeapText(texts[0].split(' '))


# Binders
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_tokens, name='tokens')
leap_binder.set_ground_truth(function=gt_sentiment, name='sentiment')
leap_binder.set_metadata(function=gt_metadata, metadata_type=DatasetMetadataType.string, name='gt')
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')
leap_binder.add_prediction(name='sentiment', labels=['positive','negative'])
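Before adding the script to the platform, you can optionally sanity-check it locally. The snippet below is a minimal sketch, not part of the integration itself: it assumes the script above was saved as dataset.py in the working directory and that its dependencies (code-loader, tensorflow, pandas, google-cloud-storage) are installed, then calls the preprocess function and encoders directly and prints the first training sample.

# Hypothetical local smoke test - not required by Tensorleap, shown only for illustration
from dataset import preprocess_func, input_tokens, gt_sentiment, gt_metadata, text_visualizer_func

responses = preprocess_func()                    # [train, val] PreprocessResponse objects
train = responses[0]
print("train samples:", train.length)            # 2 * train_label_size

tokens = input_tokens(0, train)                  # padded token ids, shape (250,)
print("input shape:", tokens.shape)
print("ground truth:", gt_sentiment(0, train))   # one-hot label, e.g. [1.0, 0.0]
print("gt metadata:", gt_metadata(0, train))     # "positive" / "negative"

# The visualizer uses leap_binder.custom_tokenizer, which preprocess_func() has just set
print(text_visualizer_func(tokens))              # LeapText wrapping the decoded tokens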

Add Integration Script

Add a Dataset Using UI

Navigate to the Resources Management view and click the button to add a new dataset.

To add a new Dataset:

  1. In the Dataset Editor, enter these properties:

    • Dataset Name: imdb

    • Script: copy and paste the Dataset Script from above

  2. Click Save.

After saving the imdb dataset, the platform will automatically parse the dataset script. This process evaluates the script and ensures that all of its functions, including the ability to successfully read the data, work as expected.

Upon successful parsing, the details of the imdb dataset will be displayed on the right. In case of unsuccessful parsing, errors will be shown instead.

Initial CLI Setup

Verify that leapcli is installed. For more information, see Installing Leap CLI.

Project Folder Setup

  1. Create a folder for our imdb project.

    mkdir imdb
    cd imdb
  2. Initialize and synchronize the created folder with the Tensorleap platform. This sets up the .tensorleap folder within the project folder. The command leap init (PROJECT) (DATASET) (--h5/--onnx) should be run with the following parameters:

    • PROJECT = IMDB (project name)

    • DATASET = imdb (dataset name)

    • (--h5/--onnx) = model format: --h5 for TensorFlow (H5) and --onnx for PyTorch (ONNX)

    leap init IMDB myorg imdb --h5
  3. Next, set your credentials for the leap CLI by running the following command:

    leap login [API_ID] [API_KEY]

The API_ID, API_KEY and the ORIGIN, along with the full command, can easily be found in the Resources Management view in the UI.

Push Dataset

When using the CLI, the Dataset Script is defined within the .tensorleap/dataset.py file, and the Dataset Instance is created/updated upon performing leap push.

By default, the .tensorleap/dataset.py file contains a sample template. Let's replace it with our Dataset Script from above. One way to do this is from the command line:

  1. rm .tensorleap/dataset.py
    cat > .tensorleap/dataset.py
    << paste the dataset script above, then press CTRL-D >>
  2. Let's test our dataset script using leap check:

    leap check --dataset
  3. Next, we'll push our dataset to the Tensorleap platform using the following command:

    leap push --dataset

It should print out:

New dataset detected. Dataset name: imdb
Push command successfully complete

Congrats! You have successfully created the imdb Dataset Instance and integrated the Dataset Script. You can view it in the UI in the Resources Management view.

Up Next - Model Integration

The purpose of this section was to help you define a dataset script and create a dataset instance in Tensorleap.

Now that the imdb dataset has been integrated into Tensorleap, we can use it with a classification model, which is exactly what we'll build in the next section.

When ready, move on to Model Integration.

