In this example, the CelebA dataset is integrated into the Tensorleap platform and prepared for use with a classification model. The predicted field is whether or not the person in the image wears glasses.
In addition, we will show how to download and cache data from a Google Cloud Storage bucket.
Data Preparation
In this section, the CelebA data will be placed in a Google Cloud Storage bucket for later use by the Dataset Script.
Download the CelebFaces Attributes dataset from here. Extract the file and upload its folder to your Google Cloud Storage bucket. This can be done with gsutil as follows:
    gsutil -m rsync -r celebA/ gs://<<yourbucket>>/celebA/
The example below will show how the files are read, cached, and preprocessed.
For your convenience, we set up a public bucket with the data, which is then accessed using the dataset script below.
Setup
In the first part of the script, we import all the relevant modules:
Common modules
PIL.Image - image processing module
google.cloud - google cloud access module
code_loader - Tensorleap's integration module
In addition, set the following constants:
PROJECT_ID, BUCKET_NAME - point to the Google Cloud project and bucket where we store the data
IMAGE_SIZE - the input image size
MAIN_ATTRIBUTE - the attribute we would like to predict, currently set to Eyeglasses
This code section contains helper functions that are used for fetching and caching data from the Google Cloud Storage bucket.
In this example, downloaded files are cached under the HOME path. Alternatively, you can point the local path to the persistent folder so that the cache persists.
    # Helper functions
    @lru_cache()
    def _connect_to_gcs_and_return_bucket(bucket_name: str) -> Bucket:
        gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
        return gcs_client.bucket(bucket_name)


    def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
        print("download data from GC")

        # If local_file_path is not specified, save under the home dir
        if local_file_path is None:
            home_dir = os.getenv("HOME")
            local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)

        # Check whether the file already exists locally
        if os.path.exists(local_file_path):
            return local_file_path

        bucket = _connect_to_gcs_and_return_bucket(BUCKET_NAME)
        dir_path = os.path.dirname(local_file_path)
        os.makedirs(dir_path, exist_ok=True)
        blob = bucket.blob(cloud_file_path)
        blob.download_to_filename(local_file_path)
        return local_file_path
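The check-then-download pattern used by _download can be tried out without any cloud access. The sketch below is only an illustration, not the script's implementation: cached_fetch, fake_fetch, and cache_root are hypothetical names, and fake_fetch stands in for the blob download. It also shows how a configurable cache root could be used to point the cache at a folder other than HOME.

```python
import os
import tempfile
from typing import Callable, List


def cached_fetch(remote_path: str, cache_root: str,
                 fetch: Callable[[str, str], None]) -> str:
    """Return a local copy of remote_path, calling fetch only on a cache miss."""
    local_path = os.path.join(cache_root, remote_path)
    if os.path.exists(local_path):
        return local_path  # cache hit: nothing is downloaded
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    fetch(remote_path, local_path)
    return local_path


calls: List[str] = []


def fake_fetch(remote_path: str, local_path: str) -> None:
    # Hypothetical stand-in for the real download: records the call and writes a file.
    calls.append(remote_path)
    with open(local_path, "w") as f:
        f.write("dummy content")


cache_root = tempfile.mkdtemp()  # any writable folder works, e.g. a persistent one
first = cached_fetch("celebA/list_attr_celeba.csv", cache_root, fake_fetch)
second = cached_fetch("celebA/list_attr_celeba.csv", cache_root, fake_fetch)
print(first == second, len(calls))  # True 1
```

The second call returns the same path without invoking the fetch function again, which is the behavior the dataset script relies on when the same file is requested repeatedly.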
Preprocess Function
The preprocess_func (a custom name) is a preprocessing function that is called just once, before the training or evaluation process. It prepares the data for later use in the input encoders, output encoders, and metadata functions. More info at Preprocess Function.
The implementation (included in the full script at the end of this page) downloads list_attr_celeba.csv (contains the attributes) and list_eval_partition.csv (contains the train/validation/test data slices), loads each into a DataFrame, and joins them into df_attr. It then splits the data into train, validation, and test according to the partition column of list_eval_partition.
Lastly, a PreprocessResponse object is created for each of the train, validation, and test data slices. These objects are later passed on to the encoder and metadata functions.
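The join-and-split step described above can be sketched on a few made-up rows. This is an illustration under assumptions: the index column name image_id and the attribute values are invented here, only two attribute columns are shown, and the wrapping of each slice in a PreprocessResponse is omitted.

```python
import io

import pandas as pd

# Tiny stand-ins for list_attr_celeba.csv and list_eval_partition.csv (made-up rows)
attr_csv = io.StringIO(
    "image_id,Eyeglasses,Smiling\n"
    "000001.jpg,-1,1\n"
    "000002.jpg,1,-1\n"
    "000003.jpg,-1,-1\n"
)
partition_csv = io.StringIO(
    "image_id,partition\n"
    "000001.jpg,0\n"
    "000002.jpg,1\n"
    "000003.jpg,2\n"
)

# index_col=0 makes the image file name the row index of both frames
df_attr = pd.read_csv(attr_csv, index_col=0)
df_partition = pd.read_csv(partition_csv, index_col=0)

# join aligns the rows of both frames on their shared index
df = df_attr.join(df_partition)

# partition codes: 0 = train, 1 = validation, 2 = test
splits = {
    name: df[df["partition"] == code]
    for name, code in [("train", 0), ("validation", 1), ("test", 2)]
}
print({name: list(part.index) for name, part in splits.items()})
```

Because the file name is the index, each row's name attribute gives the image file to fetch, which is what the input encoder uses to build the image path.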
Input Encoder
The input encoder generates the input component of the sample with index idx from the PreprocessResponse object. The function is called for every evaluated sample, and its output is later fetched by the network as input. More info at Input Encoder.
In the example below, the image file name with index idx is retrieved from the preprocessing's data. The image is then downloaded and opened. Additionally, the image is center cropped and resized before it is fetched as the model's input.
    # The input encoder fetches the image with the index `idx` from the data set in
    # the PreprocessResponse. Returns an ndarray containing the sample's image.
    def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
        sample = preprocess.data.iloc[idx]
        fpath = f'celebA/img_align_celeba/img_align_celeba/{sample.name}'
        fpath = _download(fpath)
        image = Image.open(fpath)

        # center crop
        celeba_face_size = 178
        width, height = image.size
        left = (width - celeba_face_size) / 2
        top = (height - celeba_face_size) / 2
        right = (width + celeba_face_size) / 2
        bottom = (height + celeba_face_size) / 2
        image = image.crop((left, top, right, bottom))

        image = image.resize((IMAGE_SIZE, IMAGE_SIZE))
        # Convert the PIL image to an ndarray to match the declared return type
        return np.array(image)
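To make the center-crop arithmetic concrete, here is a small sketch that computes only the crop box. The helper name center_crop_box is hypothetical; the 178x218 size is that of the aligned CelebA images.

```python
from typing import Tuple


def center_crop_box(width: int, height: int, size: int) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) of a size x size box centered in the image."""
    left = (width - size) / 2
    top = (height - size) / 2
    right = (width + size) / 2
    bottom = (height + size) / 2
    return left, top, right, bottom


# Aligned CelebA images are 178 pixels wide and 218 pixels high
box = center_crop_box(178, 218, 178)
print(box)  # (0.0, 20.0, 178.0, 198.0)
```

For these images the crop keeps the full width and trims 20 pixels from the top and from the bottom, leaving a square region that is then resized to IMAGE_SIZE x IMAGE_SIZE.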
Ground Truth Encoder
The ground truth encoder generates the ground truth component of the sample with index idx from the PreprocessResponse object. The function is called for every evaluated sample, and its output is later used as the ground truth for the loss function. More info at Ground Truth Encoder.
The implementation below extracts the MAIN_ATTRIBUTE of the sample with index idx and returns it as a one-hot vector.
Note: The CelebA dataset's attributes are stored as -1 for negative and 1 for positive.
    # The ground truth encoder fetches the label with the index `idx` from the MAIN_ATTRIBUTE column
    # of the PreprocessResponse's data and returns its one-hot vector representation.
    def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
        is_positive = preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1
        # Wrap in an ndarray to match the declared return type
        return np.array([0.0, 1.0] if is_positive else [1.0, 0.0])
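The mapping from CelebA's -1/1 attribute values to a one-hot vector can be checked in isolation. This is a minimal sketch with a hypothetical helper name (one_hot_from_celeba), using plain lists instead of the DataFrame lookup.

```python
from typing import List


def one_hot_from_celeba(value: int) -> List[float]:
    # CelebA stores attributes as -1 (negative) or 1 (positive).
    # Index 0 of the vector marks the negative class, index 1 the positive class.
    return [0.0, 1.0] if value == 1 else [1.0, 0.0]


positive = one_hot_from_celeba(1)
negative = one_hot_from_celeba(-1)
print(positive, negative)  # [0.0, 1.0] [1.0, 0.0]
print(positive.index(1.0), negative.index(1.0))  # 1 0
```

Note the resulting class order: a sample with the attribute present is encoded with the 1.0 at index 1.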
Metadata Function
For each sample, Tensorleap allows extra data to be added for future analysis. Each metadata field is defined by its own metadata function.
The metadata function below adds the label glasses or no-glasses as metadata to each sample.
    # Metadata functions allow adding extra data for later use in analysis.
    # This metadata adds the label as a string.
    def metadata_label(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
        return 'glasses' if preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1 else 'no-glasses'
Binding Functions
For the Tensorleap platform to register the encoders and functions, we use the leap_binder object:
    # Leap binding functions to bind the functions above to the `Dataset`.
    leap_binder.set_preprocess(function=preprocess_func)
    leap_binder.set_input(function=input_encoder, name='image')
    leap_binder.set_ground_truth(function=gt_encoder, name='glasses')
    leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.string, name='label')
    # Label order follows the one-hot encoding in gt_encoder:
    # index 0 is no-glasses, index 1 is glasses.
    leap_binder.add_prediction(name='prediction', labels=['no-glasses', 'glasses'], metrics=[Metric.Accuracy])
The add_prediction function provides information about the prediction tensor of the current use-case, and its metrics. This information will later be used for calculating selected metrics and visualizations.
Extra Metadata
Our dataset includes extra metadata such as Bald, Young, Smiling, and more. These fields are implemented using the wrapper function metadata_encoder, which generates a metadata function for each extra field.
At the end of this code snippet, we register the generated metadata functions with the leap_binder object, one for each of the extra fields.
    # Extra metadata
    EXTRA_METADATA = ['5_o_Clock_Shadow', 'Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs',
                      'Big_Lips', 'Big_Nose', 'Black_Hair', 'Blond_Hair', 'Blurry', 'Brown_Hair',
                      'Bushy_Eyebrows', 'Chubby', 'Double_Chin', 'Eyeglasses', 'Goatee', 'Gray_Hair',
                      'Heavy_Makeup', 'High_Cheekbones', 'Male', 'Mouth_Slightly_Open', 'Mustache',
                      'Narrow_Eyes', 'No_Beard', 'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 'Receding_Hairline',
                      'Rosy_Cheeks', 'Sideburns', 'Smiling', 'Straight_Hair', 'Wavy_Hair', 'Wearing_Earrings',
                      'Wearing_Hat', 'Wearing_Lipstick', 'Wearing_Necklace', 'Wearing_Necktie', 'Young']


    # extra_metadata_key is the index of the field in EXTRA_METADATA
    def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
        def func(idx: int, preprocess: PreprocessResponse) -> int:
            return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

        func.__name__ = EXTRA_METADATA[extra_metadata_key]
        return func


    for i in range(len(EXTRA_METADATA)):
        leap_binder.set_metadata(
            function=metadata_encoder(i),
            metadata_type=DatasetMetadataType.int,
            name=EXTRA_METADATA[i]
        )
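One likely reason for generating the functions through a wrapper is Python's closure behavior: a function defined directly inside a loop looks up the loop variable when it is called, not when it is defined. The sketch below illustrates the difference with hypothetical names (make_getter, data, keys) and a plain dictionary standing in for the preprocessed data.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the preprocessed data: column name -> values per sample
data: Dict[str, List[int]] = {"Bald": [-1, 1], "Young": [1, 1]}
keys = ["Bald", "Young"]


def make_getter(key: str) -> Callable[[int], int]:
    # Each call creates a new scope, so `key` is fixed for the returned function
    def func(idx: int) -> int:
        return data[key][idx]

    func.__name__ = key
    return func


factory_made = [make_getter(k) for k in keys]

# Without a factory, every lambda reads `k` at call time and sees its last value
late_bound = []
for k in keys:
    late_bound.append(lambda idx: data[k][idx])

print(factory_made[0](0), late_bound[0](0))  # -1 1
print([f.__name__ for f in factory_made])    # ['Bald', 'Young']
```

The factory-made function for Bald returns the Bald value, while the loop-defined lambda returns the Young value because it resolves k only when called. Setting __name__ additionally gives each generated function a distinct, readable name.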
Full Script
For your convenience, the full script is given below:
    import os
    from typing import Optional, List, Union, Tuple, Callable
    from pathlib import Path
    from functools import lru_cache

    from google.cloud import storage
    from google.cloud.storage import Bucket
    from google.auth.credentials import AnonymousCredentials
    import numpy as np
    import pandas as pd
    import PIL.Image as Image

    # Tensorleap imports
    from code_loader.contract.enums import DatasetMetadataType, Metric
    from code_loader.contract.datasetclasses import PreprocessResponse
    from code_loader import leap_binder

    PROJECT_ID = 'example-dev-project-nmrksf0o'
    BUCKET_NAME = 'example-datasets-47ml982d'
    IMAGE_SIZE = 64
    MAIN_ATTRIBUTE = 'Eyeglasses'


    # Helper functions
    @lru_cache()
    def _connect_to_gcs_and_return_bucket(bucket_name: str) -> Bucket:
        gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
        return gcs_client.bucket(bucket_name)


    def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
        print("download data from GC")

        # If local_file_path is not specified, save under the home dir
        if local_file_path is None:
            home_dir = os.getenv("HOME")
            local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)

        # Check whether the file already exists locally
        if os.path.exists(local_file_path):
            return local_file_path

        bucket = _connect_to_gcs_and_return_bucket(BUCKET_NAME)
        dir_path = os.path.dirname(local_file_path)
        os.makedirs(dir_path, exist_ok=True)
        blob = bucket.blob(cloud_file_path)
        blob.download_to_filename(local_file_path)
        return local_file_path


    # Preprocess function
    def preprocess_func() -> List[PreprocessResponse]:
        annotations_path = _download("celebA/list_attr_celeba.csv")
        partition_path = _download("celebA/list_eval_partition.csv")

        df_attr = pd.read_csv(annotations_path, index_col=0)
        df_partition = pd.read_csv(partition_path, index_col=0)
        df_attr = df_attr.join(df_partition)

        # partition codes: 0 = train, 1 = validation, 2 = test
        df_train = df_attr[df_attr.partition == 0]
        df_valid = df_attr[df_attr.partition == 1]
        df_test = df_attr[df_attr.partition == 2]

        train = PreprocessResponse(length=len(df_train), data=df_train)
        val = PreprocessResponse(length=len(df_valid), data=df_valid)
        test = PreprocessResponse(length=len(df_test), data=df_test)
        # Return a list to match the declared return type
        return [train, val, test]


    # The input encoder fetches the image with the index `idx` from the data set in
    # the PreprocessResponse. Returns an ndarray containing the sample's image.
    def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
        sample = preprocess.data.iloc[idx]
        fpath = f'celebA/img_align_celeba/img_align_celeba/{sample.name}'
        fpath = _download(fpath)
        image = Image.open(fpath)

        # center crop
        celeba_face_size = 178
        width, height = image.size
        left = (width - celeba_face_size) / 2
        top = (height - celeba_face_size) / 2
        right = (width + celeba_face_size) / 2
        bottom = (height + celeba_face_size) / 2
        image = image.crop((left, top, right, bottom))

        image = image.resize((IMAGE_SIZE, IMAGE_SIZE))
        # Convert the PIL image to an ndarray to match the declared return type
        return np.array(image)


    # The ground truth encoder fetches the label with the index `idx` from the MAIN_ATTRIBUTE column
    # of the PreprocessResponse's data and returns its one-hot vector representation.
    def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
        is_positive = preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1
        # Wrap in an ndarray to match the declared return type
        return np.array([0.0, 1.0] if is_positive else [1.0, 0.0])


    # Metadata functions allow adding extra data for later use in analysis.
    # This metadata adds the label as a string.
    def metadata_label(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
        return 'glasses' if preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1 else 'no-glasses'


    # Dataset binding functions to bind the functions above to the `Dataset`.
    leap_binder.set_preprocess(function=preprocess_func)
    leap_binder.set_input(function=input_encoder, name='image')
    leap_binder.set_ground_truth(function=gt_encoder, name='glasses')
    leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.string, name='label')
    # Label order follows the one-hot encoding in gt_encoder:
    # index 0 is no-glasses, index 1 is glasses.
    leap_binder.add_prediction(name='prediction', labels=['no-glasses', 'glasses'], metrics=[Metric.Accuracy])

    # Extra metadata
    EXTRA_METADATA = ['5_o_Clock_Shadow', 'Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs',
                      'Big_Lips', 'Big_Nose', 'Black_Hair', 'Blond_Hair', 'Blurry', 'Brown_Hair',
                      'Bushy_Eyebrows', 'Chubby', 'Double_Chin', 'Eyeglasses', 'Goatee', 'Gray_Hair',
                      'Heavy_Makeup', 'High_Cheekbones', 'Male', 'Mouth_Slightly_Open', 'Mustache',
                      'Narrow_Eyes', 'No_Beard', 'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 'Receding_Hairline',
                      'Rosy_Cheeks', 'Sideburns', 'Smiling', 'Straight_Hair', 'Wavy_Hair', 'Wearing_Earrings',
                      'Wearing_Hat', 'Wearing_Lipstick', 'Wearing_Necklace', 'Wearing_Necktie', 'Young']


    # extra_metadata_key is the index of the field in EXTRA_METADATA
    def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
        def func(idx: int, preprocess: PreprocessResponse) -> int:
            return preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx]

        func.__name__ = EXTRA_METADATA[extra_metadata_key]
        return func


    for i in range(len(EXTRA_METADATA)):
        leap_binder.set_metadata(
            function=metadata_encoder(i),
            metadata_type=DatasetMetadataType.int,
            name=EXTRA_METADATA[i]
        )