This dataset contains text comments from Wikipedia talk pages that have been labeled for toxicity. The comments are labeled across several toxicity subtypes: severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge.
The preprocess_func (the name is customizable) is a preprocessing function that is called just once, before the training/evaluation process. It prepares the data for later use in the input encoders, output encoders, and metadata functions. More info at Preprocess Function.
The implementation below loads the wikipedia_toxicity_subtypes dataset using TensorFlow Datasets and converts it to a DataFrame for easier handling. The tfds.load function is provided with the PERSISTENT_DIR path for caching.
Once the dataset is fetched and loaded, a preprocessing step is performed: decoding to UTF-8, lowercasing, and removing URLs, digits, punctuation, and HTML tags.
After tokenizing the data, the word_to_index mapping is set so that the Tensorleap platform can translate the tokens into words for visualization purposes.
Lastly, the PreprocessResponse objects are set for the train and validation data slices. These objects are passed to the encoder and metadata functions later on.
def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'
    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text"
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)

    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer  # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)
    return [train, val]
Input Encoder
The input encoder generates an input component of a sample with index idx from the preprocessing's PreprocessResponse object. This sample will later be fetched as input by the network. The function is called for every evaluated sample. More info at Input Encoder.
In the example below, the text with index idx is retrieved from the preprocessing's data. The text is tokenized by our tokenizer into a fixed-length list of token IDs, which serves as the model's input.
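Below is the input encoder as it appears in the full script at the end of this page; it relies on the module-level tokenizer and MAX_LENGTH defined there:

def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    # Fetch the preprocessed text and tokenize it into a fixed-length sequence of token ids
    text = preprocess.data["text"].iloc[idx]
    tokens = tokenizer(text, return_tensors='tf', truncation=True, padding='max_length', max_length=MAX_LENGTH, add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids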
The ground truth encoder generates a ground truth component of a sample with index idx, from the preprocessing. This function is called for each evaluated sample. It will later be used as the ground truth for the loss function. More info at Ground Truth Encoder.
In the code below, the to_predict list contains the keys used for multi-label prediction. Since the multi-label values are either 0 or 1, we use a binary cross-entropy loss function and a sigmoid activation on the last layer.
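The ground truth encoder, as implemented in the full script below:

def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    # One binary value per toxicity subtype, in a fixed order
    to_predict = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']
    return np.array(preprocess.data.iloc[idx][to_predict])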
Visualizer functions translate encoded data, derived from a tensor, an input, or a ground truth, into a chosen format that can be visualized. See Visualizers for more info.
In this example, the visualizer function receives data in the form of tokenized text and returns the decoded text sequence. The LeapText data class can later be read and visualized within the platform.
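The visualizer below matches the one in the full script. Since a Hugging Face BERT tokenizer is bound as custom_tokenizer, the token ids are decoded back to text with its batch_decode method:

def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Decode the token ids back to a text sequence
    texts = tokenizer.batch_decode(np.atleast_2d(data).astype(int))
    return LeapText(texts[0].split(' '))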
The add_prediction function provides information about the prediction type of the current use-case and its metrics. This information will later be used for calculating selected metrics and visualizations.
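In the full script below, the prediction is registered with its labels and tracked metrics:

leap_binder.add_prediction(name='classes', labels=['non-toxic', 'toxic'], metrics=[Metric.Accuracy])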
Extra Metadata
Our dataset includes extra metadata such as identity_attack, insult, threat, and more. These fields are implemented using the wrapper function metadata_encoder, which generates a metadata function for each extra field.
At the end of this code snippet, we set the generated metadata functions to the leap_binder object for each of the extra fields.
# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]
    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])
Full Script
For your convenience, the full script is given below:
import os
from typing import List, Union, Callable

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import texthero as hero
from transformers import AutoTokenizer

# Tensorleap imports
from code_loader import leap_binder
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader.contract.enums import Metric, DatasetMetadataType
from code_loader.contract.visualizer_classes import LeapText

MAX_LENGTH = 250
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function:
def preprocess_func() -> List[PreprocessResponse]:
    PERSISTENT_DIR = '/nfs/'
    train_ds = tfds.load('wikipedia_toxicity_subtypes', split='train', data_dir=PERSISTENT_DIR)
    val_ds = tfds.load('wikipedia_toxicity_subtypes', split='test', data_dir=PERSISTENT_DIR)
    train_df = tfds.as_dataframe(train_ds)
    val_df = tfds.as_dataframe(val_ds)

    # Text preprocessing
    feature_col = "text"
    train_df[feature_col] = train_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)
    val_df[feature_col] = val_df[feature_col].str.decode('utf-8').pipe(hero.lowercase).pipe(hero.remove_urls).pipe(hero.remove_digits).pipe(hero.remove_punctuation).pipe(hero.remove_html_tags)

    # Bind `word to index` mapping
    word_index = tokenizer.vocab
    word_index[""] = word_index.pop("[PAD]")
    leap_binder.custom_tokenizer = tokenizer  # to be used within the visualizer

    # Generate a PreprocessResponse for each data slice, to later be read by the encoders.
    # The length of each data slice is provided, along with the data frame.
    train = PreprocessResponse(length=len(train_df), data=train_df)
    val = PreprocessResponse(length=len(val_df), data=val_df)
    return [train, val]

# Input encoder fetches the text with the index `idx` from the `text` column set in
# the PreprocessResponse's data. Returns an ndarray containing the sample's tokens.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    text = preprocess.data["text"].iloc[idx]
    tokens = tokenizer(text, return_tensors='tf', truncation=True, padding='max_length', max_length=MAX_LENGTH, add_special_tokens=True)
    input_ids = tokens["input_ids"][0]
    return input_ids

# Ground truth encoder fetches the labels with the index `idx` from the toxicity columns set in
# the PreprocessResponse's data. Returns a numpy array containing a numeric multi-label.
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    to_predict = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']
    return np.array(preprocess.data.iloc[idx][to_predict])

# Metadata functions allow adding extra data for later use in analysis.
# This metadata adds the label as a string.
def metadata_toxicity(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'toxic' if preprocess.data['toxicity'].iloc[idx] > 0 else 'non-toxic'

def metadata_word_count(idx: int, preprocess: Union[PreprocessResponse, list]) -> int:
    return len(preprocess.data.iloc[idx]['text'].split())

# Visualizers
def text_visualizer_func(data: np.ndarray) -> LeapText:
    tokenizer = leap_binder.custom_tokenizer
    # Decode the token ids back to a text sequence (HuggingFace tokenizers use
    # `batch_decode`; the Keras-style `sequences_to_texts` is not available on them)
    texts = tokenizer.batch_decode(np.atleast_2d(data).astype(int))
    return LeapText(texts[0].split(' '))

# Binding the functions above to Tensorleap.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='text')
leap_binder.set_ground_truth(function=gt_encoder, name='classes')
leap_binder.set_metadata(function=metadata_toxicity, metadata_type=DatasetMetadataType.string, name='toxicity')
leap_binder.set_metadata(function=metadata_word_count, metadata_type=DatasetMetadataType.int, name='word_count')
leap_binder.add_prediction(name='classes', labels=['non-toxic', 'toxic'], metrics=[Metric.Accuracy])
leap_binder.set_visualizer(function=text_visualizer_func, visualizer_type=LeapText.type, name='text_from_token')

# Extra metadata
EXTRA_METADATA = ['identity_attack', 'insult', 'obscene', 'severe_toxicity', 'threat', 'toxicity']

def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]
    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(function=metadata_encoder(i), metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i])