Wikipedia Toxicity (using TensorFlow Datasets)

This dataset contains text comments from Wikipedia talk pages that have been labeled for toxicity. Each comment is labeled for overall toxicity and for five subtypes: severe toxicity, obscenity, threatening language, insulting language, and identity attack. The dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge.

In this example, we will load data from TensorFlow Datasets, use Persistent Storage, and integrate the Wikipedia Toxicity dataset.

In the description below, we will explain each part of the script, while the full script can be found at the end of this page.

Setup

In the first part of the script, we import all the relevant modules:

Common modules
texthero - text processing module
transformers.AutoTokenizer - text tokenizer module
code_loader - Tensorleap's integration module

In addition, MAX_LENGTH (the maximum number of tokens per sample) is set and a pre-trained BERT tokenizer is loaded.

import os
from typing import List, Union, Callable

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import texthero as hero
from transformers import AutoTokenizer

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import Metric, DatasetMetadataType, LeapDataType
from code_loader.contract.visualizer_classes import LeapText


MAX_LENGTH = 250
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Preprocess Function

The preprocess_func (custom name) is a preprocess function that is called just once before the training/evaluating process. It prepares the data for later use in input encoders, output encoders, and metadata functions. More info at Preprocess Function.

The implementation below loads the wikipedia_toxicity_subtypes dataset using TensorFlow Datasets and converts it to a DataFrame for easier handling. The tfds.load function is given the PERSISTENT_DIR path for caching.

Once the dataset is fetched and loaded, a preprocessing step is performed: decoding the raw bytes as UTF-8, lowercasing, and removing URLs, digits, punctuation, and HTML tags.

Next, the word-to-index mapping is taken from the tokenizer's vocabulary and the tokenizer is attached to leap_binder, so that the Tensorleap platform can translate the tokens back into words for visualization purposes.

Lastly, the PreprocessResponse objects are set for the train and validation data slices. These objects are passed to the encoder and metadata functions later on.

def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'

    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text" 
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)              
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)

    return [train, val]
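
To make the cleaning steps concrete, here is a minimal standard-library sketch of a similar sequence of operations (decode, lowercase, then strip URLs, HTML tags, digits, and punctuation). It is not texthero's implementation: the function name clean_text and the regular expressions are assumptions chosen for illustration, and texthero's exact output may differ. The sketch removes HTML tags before punctuation, because stripping punctuation first would also strip the angle brackets that mark a tag.

```python
import re
import string

# Illustrative patterns only; texthero's own rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: bytes) -> str:
    """Decode a raw comment and strip URLs, HTML tags, digits, and punctuation."""
    text = raw.decode("utf-8").lower()
    text = URL_RE.sub(" ", text)    # URLs first, while their punctuation is intact
    text = TAG_RE.sub(" ", text)    # tags next, while '<' and '>' are still present
    text = DIGIT_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()  # collapse the whitespace left behind


print(clean_text(b"See <b>THIS</b> page: https://example.com/x?id=42 (3 times)!"))
```

The same function could be applied to a DataFrame column with Series.map, which keeps the train and validation slices on one shared code path.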

Input Encoder

The input encoder generates an input component of a sample with index idx from the preprocessing PreprocessResponse object. This sample will later be fetched as input by the network. The function is called for every evaluated sample. More info at Input Encoder.

In the example below, the text with index idx is retrieved from the preprocessing's data. This text is tokenized by our tokenizer and turned into a list of word IDs, which is fetched as our model's input.

def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    text = preprocess.data["text"].iloc[idx]
    tokens = tokenizer(text, return_tensors='tf', truncation=True, padding='max_length', max_length=MAX_LENGTH,  add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids
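
The fixed-length behavior requested through truncation=True and padding='max_length' can be illustrated without loading the tokenizer. The sketch below is a simplified stand-in, not the Hugging Face implementation: the helper name to_fixed_length is hypothetical, and the IDs 101, 102, and 0 are the [CLS], [SEP], and [PAD] IDs of the bert-base-uncased vocabulary. It assumes max_length is at least 2.

```python
from typing import List

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token IDs


def to_fixed_length(token_ids: List[int], max_length: int) -> List[int]:
    """Wrap token IDs in [CLS]/[SEP], then truncate or right-pad to max_length."""
    # Reserve two positions for the special tokens before truncating.
    body = token_ids[: max_length - 2]
    sequence = [CLS_ID] + body + [SEP_ID]
    # Right-pad with [PAD] so every sample has exactly max_length entries.
    return sequence + [PAD_ID] * (max_length - len(sequence))


padded = to_fixed_length([7592, 2088], max_length=6)
truncated = to_fixed_length([1, 2, 3, 4, 5, 6, 7], max_length=6)
print(padded)      # [101, 7592, 2088, 102, 0, 0]
print(truncated)   # [101, 1, 2, 3, 4, 102]
```

Every sample therefore reaches the network with the same shape, which is what allows MAX_LENGTH to define the model's input dimension.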

Ground Truth Encoder

The ground truth encoder generates a ground truth component of a sample with index idx, from the preprocessing. This function is called for each evaluated sample. It will later be used as the ground truth for the loss function. More info at Ground Truth Encoder.

In the code below, the to_predict list contains the keys for multi-label prediction. Because each label value is either 0 or 1, we use a binary cross-entropy loss function with a sigmoid activation on the last layer.

def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    to_predict = ['identity_attack', 'insult','obscene', 'severe_toxicity', 'threat', 'toxicity']
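    # Select the label columns first so the result is a numeric array rather than an object array.
    return preprocess.data[to_predict].iloc[idx].to_numpy(dtype=np.float32)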
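
Because each of the six labels is an independent 0/1 target, the loss is computed per label and then averaged. The sketch below works through sigmoid plus binary cross-entropy using only the standard library. The function names, the example targets, and the example logits are made up for illustration; a real model would rely on its framework's built-in loss.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(targets: Sequence[float], logits: Sequence[float]) -> float:
    """Mean binary cross-entropy over independent labels, given raw logits."""
    eps = 1e-7  # keeps log() away from 0
    total = 0.0
    for y, z in zip(targets, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)


targets = [0, 1, 1, 0, 0, 1]  # hypothetical multi-label ground truth
uninformed = binary_cross_entropy(targets, [0.0] * 6)
confident = binary_cross_entropy(targets, [-4.0, 4.0, 4.0, -4.0, -4.0, 4.0])
print(uninformed, confident)
```

With all logits at zero every label is predicted at probability 0.5, so the loss equals ln 2; logits that agree with the targets push it toward zero.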

Metadata Functions

For each sample, Tensorleap allows extra data to be added for future analysis. Each defined metadata is wrapped in a metadata function.

In the code below, we created a function that adds the label toxic or non-toxic for each sample. Additionally, we added the word count metadata.

def metadata_toxicity(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'toxic' if preprocess.data['toxicity'].iloc[idx] > 0 else 'non-toxic'

def metadata_word_count(idx: int, preprocess: Union[PreprocessResponse, list]) -> int:
    return len(preprocess.data.iloc[idx]['text'].split())

Visualizers

Visualizer functions translate encoded data, derived from a tensor such as an input or a ground truth, into a format that can be visualized. See Visualizers for more info.

In this example, the visualizer function receives data in the form of tokenized text and returns the decoded text sequence. The LeapText data class can later be read and visualized within the platform.

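def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Hugging Face tokenizers have no `sequences_to_texts` method; decode the token IDs instead.
    token_ids = np.asarray(data).astype(int).flatten().tolist()
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    return LeapText(text.split(' '))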
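
Conceptually, the visualizer inverts the word-to-index mapping: each ID is looked up in the vocabulary and padding is dropped. The toy example below shows the idea with a hand-written five-entry vocabulary. The vocabulary, the IDs, and the helper ids_to_words are hypothetical and unrelated to the real BERT vocabulary.

```python
from typing import Dict, List

# Hypothetical toy vocabulary, for illustration only.
toy_vocab: Dict[str, int] = {"[PAD]": 0, "this": 1, "comment": 2, "is": 3, "fine": 4}
index_to_word = {index: word for word, index in toy_vocab.items()}


def ids_to_words(token_ids: List[int], pad_id: int = 0) -> List[str]:
    """Map token IDs back to words, skipping padding positions."""
    return [index_to_word[i] for i in token_ids if i != pad_id]


print(ids_to_words([1, 2, 3, 4, 0, 0]))  # ['this', 'comment', 'is', 'fine']
```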

Binding Functions

For the Tensorleap platform to register the encoders and functions, we use the leap_binder object:

leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='text')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_toxicity, metadata_type=DatasetMetadataType.string, name='toxicity')
leap_binder.set_metadata(function=metadata_word_count, metadata_type=DatasetMetadataType.int, name='word_count')
leap_binder.add_prediction(name='classes', labels=['non-toxic','toxic'], metrics=[Metric.Accuracy])
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')

The add_prediction function provides information about the prediction type of the current use-case and its metrics. This information will later be used for calculating selected metrics and visualizations.

Extra Metadata

Our dataset includes extra metadata such as identity_attack, insult, threat, and more. These fields are exposed through the factory function metadata_encoder, which generates a metadata function for each extra field.

At the end of this code snippet, we register the generated metadata function for each extra field with the leap_binder object.

# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    # `extra_metadata_key` is an index into EXTRA_METADATA.
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])
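
The factory pattern matters here because a function defined directly inside a loop captures the loop variable by reference, so every generated function would end up reading the last column. The sketch below demonstrates the difference with plain dictionaries standing in for DataFrame rows; rows, columns, and make_getter are illustrative names only.

```python
from typing import Callable, Dict, List

# Stand-in data: two samples with two label columns each.
rows: List[Dict[str, int]] = [{"insult": 1, "threat": 0}, {"insult": 0, "threat": 1}]
columns = ["insult", "threat"]

# Pitfall: each lambda looks up `col` when it is called, after the loop has finished.
late_bound = [lambda idx: rows[idx][col] for col in columns]


def make_getter(column: str) -> Callable[[int], int]:
    """Return a function bound to one specific column."""
    def getter(idx: int) -> int:
        return rows[idx][column]

    getter.__name__ = column  # gives each generated function a distinct name
    return getter


getters = [make_getter(col) for col in columns]

print([f(0) for f in late_bound])     # [0, 0]  both read 'threat'
print([g(0) for g in getters])        # [1, 0]  one value per column
print([g.__name__ for g in getters])  # ['insult', 'threat']
```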

Full Script

For your convenience, the full script is given below:

import os
from typing import List, Union, Callable

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import texthero as hero
from transformers import AutoTokenizer

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import Metric, DatasetMetadataType
from code_loader.contract.visualizer_classes import LeapText


MAX_LENGTH = 250
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Preprocessing Function:
def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'

    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text" 
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)              
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)

    return [train, val]

# Input encoder fetches the text with the index `idx` from the data set in
# the PreprocessResponse's data. Returns an array containing the sample's token IDs.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    text = preprocess.data["text"].iloc[idx]
    tokens = tokenizer(text, return_tensors='tf', truncation=True, padding='max_length', max_length=MAX_LENGTH,  add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids

# Ground truth encoder fetches the labels with the index `idx` from the label columns set in
# the PreprocessResponse's data. Returns a numpy array containing the numeric multi-label vector.
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    to_predict = ['identity_attack', 'insult','obscene', 'severe_toxicity', 'threat', 'toxicity']
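    # Select the label columns first so the result is a numeric array rather than an object array.
    return preprocess.data[to_predict].iloc[idx].to_numpy(dtype=np.float32)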

# Metadata functions add extra data for later use in analysis.
# This metadata function adds the toxicity label as a string.
def metadata_toxicity(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'toxic' if preprocess.data['toxicity'].iloc[idx] > 0 else 'non-toxic'

def metadata_word_count(idx: int, preprocess: Union[PreprocessResponse, list]) -> int:
    return len(preprocess.data.iloc[idx]['text'].split())

# Visualizers
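def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Hugging Face tokenizers have no `sequences_to_texts` method; decode the token IDs instead.
    token_ids = np.asarray(data).astype(int).flatten().tolist()
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    return LeapText(text.split(' '))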

# Binding functions to bind the functions above to Tensorleap.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='text')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_toxicity, metadata_type=DatasetMetadataType.string, name='toxicity')
leap_binder.set_metadata(function=metadata_word_count, metadata_type=DatasetMetadataType.int, name='word_count')
leap_binder.add_prediction(name='classes', labels=['non-toxic','toxic'], metrics=[Metric.Accuracy])
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapText.type, name='text_from_token')

# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])

Last updated 9 months ago
