# Wikipedia Toxicity (using TensorFlow Datasets)

This dataset contains text comments from Wikipedia talk pages that have been labeled for toxicity. Besides an overall toxicity label, the comments are classified into several subtypes: severe toxicity, obscenity, threatening language, insulting language, and identity attack. This dataset is a replica of the data released for the [**Jigsaw Toxic Comment Classification**](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) challenge.

In this example, we will load data from [**TensorFlow Datasets**](https://www.tensorflow.org/datasets), use [**Persistent Storage**](/tensorleap-integration/writing-integration-code.md#persistent-storage), and integrate the Wikipedia Toxicity dataset.

In the description below, we will explain each part of the script, while the full script can be found at the end of this page.

## Setup

In the first part of the script, we import all the relevant modules:

* Common modules
* `texthero` - text processing module
* `transformers.AutoTokenizer` - text tokenizer module
* `code_loader` - Tensorleap's integration module

In addition, the `MAX_LENGTH` is set and a pre-trained BERT tokenizer is loaded.

```python
import os
from typing import List, Union, Callable

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import texthero as hero
from transformers import AutoTokenizer

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import Metric, DatasetMetadataType, LeapDataType
from code_loader.contract.visualizer_classes import LeapText


MAX_LENGTH = 250
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

## Preprocess Function

The `preprocess_func` *(custom name)* is a **preprocess** function that is called just **once** before the training/evaluating process. It prepares the data for later use in **input encoders**, **output encoders**, and **metadata** functions. More info at [**Preprocess Function**](/tensorleap-integration/writing-integration-code/preprocess-function.md).

The implementation below loads the [**wikipedia\_toxicity\_subtypes**](https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes) dataset by using [**TensorFlow Datasets**](https://www.tensorflow.org/datasets), and converts it to a DataFrame for easier handling. The `tfds.load` function is provided with the `PERSISTENT_DIR` path for caching.

Once the dataset is fetched and loaded, a preprocessing step is performed: decoding the raw bytes as *UTF-8*, lowercasing, and removing URLs, digits, punctuation, and HTML tags.

Next, the tokenizer's vocabulary is used as the `word_to_index` mapping, and the tokenizer itself is attached to `leap_binder`, so that the Tensorleap platform can translate the tokens back into words for visualization purposes.

Lastly, the [**PreprocessResponse**](/tensorleap-integration/python-api/code_loader/datasetclasses/preprocessresponse.md) objects are set for the *train* and *validation* data slices. These objects are passed to the encoder and metadata functions later on.

```python
def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'

    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text"
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)

    return [train, val]
```
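The cleaning steps described above can be approximated with the standard library alone. The following is a minimal sketch, not the `texthero` implementation: `clean_comment` and the regular expressions are hypothetical stand-ins chosen for illustration. Note that this sketch strips HTML tags before punctuation, since removing punctuation first would delete the angle brackets that the tag pattern relies on.

```python
import re

# Hypothetical patterns approximating the cleaning steps; texthero's own rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
PUNCT_RE = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace
SPACE_RE = re.compile(r"\s+")


def clean_comment(raw: bytes) -> str:
    """Decode a raw comment and apply lowercase / URL / HTML / digit / punctuation cleaning."""
    text = raw.decode("utf-8").lower()
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions
    return SPACE_RE.sub(" ", text).strip()


example = b"<b>Check</b> http://example.com NOW, it's 100% wrong!"
print(clean_comment(example))
```

Replacing matches with a space rather than an empty string keeps neighboring words from being glued together, at the cost of a final whitespace-collapsing pass.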

## Input Encoder

The input encoder generates an input component of a sample with index `idx` from the `preprocess` [**PreprocessResponse**](/tensorleap-integration/python-api/code_loader/datasetclasses/preprocessresponse.md) object. The network later receives this sample as its input. The function is called for every evaluated sample. More info at [**Input Encoder**](/tensorleap-integration/writing-integration-code/input-encoder.md).

In the example below, the *text* with index `idx` is retrieved from the preprocessing's data. This *text* is tokenized by our `tokenizer` and turned into a fixed-length sequence of token IDs (padded or truncated to `MAX_LENGTH`), which serves as our model's input.

```python
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    text = preprocess.data["text"].iloc[idx]
    # Request NumPy tensors so the returned value matches the declared np.ndarray return type
    tokens = tokenizer(text, return_tensors='np', truncation=True, padding='max_length', max_length=MAX_LENGTH, add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids
```
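To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a toy sketch that mimics the shape of the output without downloading BERT. Everything in it is an assumption for illustration: `TOY_VOCAB`, `toy_encode`, the whitespace splitting, and the small `MAX_LENGTH` are hypothetical and do not reproduce BERT's WordPiece tokenization.

```python
from typing import Dict, List

MAX_LENGTH = 8  # deliberately small for the demo; the script above uses 250

# Hypothetical vocabulary; the real tokenizer ships its own.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "you": 4, "are": 5, "great": 6,
}


def toy_encode(text: str, max_length: int = MAX_LENGTH) -> List[int]:
    """Map words to IDs, then truncate and pad to exactly `max_length` entries."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.split()]
    ids = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    ids += [TOY_VOCAB["[PAD]"]] * (max_length - len(ids))  # right-pad short inputs
    return ids


short_ids = toy_encode("you are great")
long_ids = toy_encode("you are great " * 5)
print(short_ids)
print(long_ids)
```

Both calls return lists of the same length, which is what lets every sample be stacked into a single fixed-shape input tensor.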

## Ground Truth Encoder

The ground truth encoder generates a ground truth component of a sample with index `idx` from the `preprocess` object. This function is called for each evaluated sample. It will later be used as the ground truth for the **loss** function. More info at [**Ground Truth Encoder**](/tensorleap-integration/writing-integration-code/ground-truth-encoder.md).

In the code below, the `to_predict` list contains the keys for multi-label prediction. Because each label value is either 0 or 1 and several labels can be active at once, we will use a **binary cross-entropy** loss function and a **sigmoid** activation on the last layer.

```python
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    to_predict = ['identity_attack', 'insult','obscene', 'severe_toxicity', 'threat', 'toxicity']
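    # Select the label columns first so the result is a numeric vector rather than an object array
    return preprocess.data[to_predict].iloc[idx].to_numpy(dtype=np.float32)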
```
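As a worked illustration of why sigmoid plus binary cross-entropy suits this ground truth, the sketch below scores each of the six labels independently and averages the per-label losses. It is a plain-Python approximation written for this page, not the loss implementation used by TensorFlow or Tensorleap; `sigmoid`, `binary_cross_entropy`, and the example logits are hypothetical.

```python
import math
from typing import List

LABELS = ["identity_attack", "insult", "obscene", "severe_toxicity", "threat", "toxicity"]


def sigmoid(logit: float) -> float:
    """Squash a raw score into an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(y_true: List[float], logits: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    losses = []
    for y, z in zip(y_true, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# A comment that is an insult, obscene, and toxic at the same time
y_true = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
good_logits = [-4.0, 3.0, 2.5, -5.0, -6.0, 4.0]  # agrees with y_true
bad_logits = [-z for z in good_logits]            # disagrees on every label

print(binary_cross_entropy(y_true, good_logits))
print(binary_cross_entropy(y_true, bad_logits))
```

Unlike softmax, the sigmoid outputs do not have to sum to one, so a single comment can be flagged under several labels simultaneously.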

## Metadata Functions

For each sample, Tensorleap allows extra data to be added for future analysis. Each defined metadata is wrapped in a **metadata function**.

In the code below, we created a function that adds the label `toxic` or `non-toxic` for each sample. Additionally, we added the word count metadata.

```python
def metadata_toxicity(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'toxic' if preprocess.data['toxicity'].iloc[idx] > 0 else 'non-toxic'

def metadata_word_count(idx: int, preprocess: Union[PreprocessResponse, list]) -> int:
    return len(preprocess.data.iloc[idx]['text'].split())
```

## Visualizers

Visualizer functions translate encoded `data`, which is derived from a tensor such as an input or a ground truth, into a chosen format that can be visualized. See [**Visualizers**](/user-interface/project/network/network-mapping/create-a-mapping-deprecated/visualizer-node.md) for more info.

In this example, the visualizer function receives `data` in the form of tokenized text and returns the decoded text sequence. The `LeapText` data class can later be read and visualized within the platform.

```python
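def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Decode the first sequence of token IDs back to text, dropping special tokens such as [PAD]
    token_ids = np.atleast_2d(data)[0].astype(int).tolist()
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    return LeapText(text.split(' '))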
```
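Conceptually, the visualizer inverts the word-to-index mapping and discards the special tokens added during encoding. The toy sketch below shows that idea in isolation; `TOY_VOCAB`, `ID_TO_TOKEN`, and `toy_decode` are hypothetical names for illustration and are not part of `transformers` or `code_loader`.

```python
from typing import Dict, List

# Hypothetical vocabulary; the real mapping comes from the pre-trained tokenizer.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "you": 4, "are": 5, "great": 6,
}
ID_TO_TOKEN: Dict[int, str] = {idx: token for token, idx in TOY_VOCAB.items()}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}


def toy_decode(token_ids: List[int]) -> List[str]:
    """Translate IDs back to words, skipping padding and sequence markers."""
    tokens = [ID_TO_TOKEN.get(idx, "[UNK]") for idx in token_ids]
    return [token for token in tokens if token not in SPECIAL_TOKENS]


words = toy_decode([2, 4, 5, 6, 3, 0, 0, 0])
print(words)
```

Returning a list of words, rather than one joined string, mirrors the list that the visualizer above passes to `LeapText`.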

## Binding Functions

For the Tensorleap platform to register the encoders and functions, we use the [**leap\_binder**](/tensorleap-integration/python-api/code_loader/leap_binder.md) object:

```python
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='text')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_toxicity, metadata_type=DatasetMetadataType.string, name='toxicity')
leap_binder.set_metadata(function=metadata_word_count, metadata_type=DatasetMetadataType.int, name='word_count')
leap_binder.add_prediction(name='classes', labels=['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity'], metrics=[Metric.Accuracy])
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapDataType.Text, name='text_from_token')
```

The `add_prediction` function provides information about the prediction type of the current use-case and its metrics. This information will later be used for calculating selected metrics and visualizations.

## Extra Metadata

Our dataset includes extra metadata such as `identity_attack`, `insult`, `threat`, and more. These fields are exposed through the wrapper function `metadata_encoder`, which generates a separate metadata function for each extra field.

At the end of this code snippet, we set the generated metadata functions to the [**leap\_binder**](/tensorleap-integration/python-api/code_loader/leap_binder.md) object for each of the extra fields.

```python
# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])
```
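The wrapper matters because each generated function must remember its own column. The sketch below illustrates the pattern with a plain dictionary standing in for the DataFrame and contrasts it with lambdas created directly in a loop, which all end up reading the last column. `Rows`, `make_metadata_func`, and the toy data are hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the DataFrame: column name -> list of values
Rows = Dict[str, List[int]]

EXTRA_METADATA = ["identity_attack", "insult", "threat"]


def make_metadata_func(column: str) -> Callable[[int, Rows], int]:
    """Return a function bound to one specific column."""
    def func(idx: int, data: Rows) -> int:
        return data[column][idx]

    func.__name__ = column  # gives every generated function a distinct name
    return func


metadata_funcs = [make_metadata_func(column) for column in EXTRA_METADATA]

toy_data: Rows = {"identity_attack": [0, 1], "insult": [1, 1], "threat": [0, 0]}
results = {func.__name__: func(1, toy_data) for func in metadata_funcs}
print(results)

# Without the factory, every lambda shares the loop variable and sees its final value.
naive_funcs = [lambda idx, data: data[column][idx] for column in EXTRA_METADATA]
print([func(1, toy_data) for func in naive_funcs])
```

The factory call creates a new scope per column, so the value is captured at creation time instead of being looked up when the function is eventually called.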

## Full Script

For your convenience, the full script is given below:

```python
import os
from typing import List, Union, Callable

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import texthero as hero
from transformers import AutoTokenizer

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import Metric, DatasetMetadataType
from code_loader.contract.visualizer_classes import LeapText


MAX_LENGTH = 250
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function
def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'

    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text"
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)

    return [train, val]

# Input encoder fetches the text with the index `idx` from the data set in
# the PreprocessResponse's data. Returns an ndarray containing the sample's token IDs.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    text = preprocess.data["text"].iloc[idx]
    # Request NumPy tensors so the returned value matches the declared np.ndarray return type
    tokens = tokenizer(text, return_tensors='np', truncation=True, padding='max_length', max_length=MAX_LENGTH, add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids

# Ground truth encoder fetches the labels with the index `idx` from the label columns set in
# the PreprocessResponse's data. Returns a numpy array containing a numeric multi-label vector.
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    to_predict = ['identity_attack', 'insult','obscene', 'severe_toxicity', 'threat', 'toxicity']
    # Select the label columns first so the result is a numeric vector rather than an object array
    return preprocess.data[to_predict].iloc[idx].to_numpy(dtype=np.float32)

# Metadata functions allow adding extra data for later use in analysis.
# This metadata adds the label as a string.
def metadata_toxicity(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'toxic' if preprocess.data['toxicity'].iloc[idx] > 0 else 'non-toxic'

def metadata_word_count(idx: int, preprocess: Union[PreprocessResponse, list]) -> int:
    return len(preprocess.data.iloc[idx]['text'].split())

# Visualizers
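def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Decode the first sequence of token IDs back to text, dropping special tokens such as [PAD]
    token_ids = np.atleast_2d(data)[0].astype(int).tolist()
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    return LeapText(text.split(' '))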

# Binding functions to bind the functions above to Tensorleap.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='text')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_toxicity, metadata_type=DatasetMetadataType.string, name='toxicity')
leap_binder.set_metadata(function=metadata_word_count, metadata_type=DatasetMetadataType.int, name='word_count')
leap_binder.add_prediction(name='classes', labels=['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity'], metrics=[Metric.Accuracy])
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapText.type, name='text_from_token')

# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])

```


