CelebA Classification (using GCS)
The CelebFaces Attributes (CelebA) dataset contains images of faces and 40 attribute annotations per image.
In this example, the CelebA dataset is integrated into the Tensorleap platform and prepared for use with a classification model. The predicted field is whether or not the person in the image wears glasses.
In this section, the CelebA data is placed in a Google Cloud Storage bucket for later use by the dataset script. For example, the following command syncs a local celebA/ folder to your bucket:
gsutil -m rsync -r celebA/ gs://<<your bucket>>/celebA/
The dataset script below shows how the files are read, cached, and preprocessed.
For your convenience, we have set up a public bucket with the data, and the dataset script reads from it.
In the first part of the script, we import all the relevant modules:
- Common modules
- PIL.Image - image processing module
- google.cloud - Google Cloud access module
- code_loader - Tensorleap's integration module
In addition, set the following constants:
- PROJECT_ID, BUCKET_NAME - point to the Google Cloud project and bucket where we store the data
- IMAGE_SIZE - the input image size
- MAIN_ATTRIBUTE - the attribute we would like to predict, currently set to Eyeglasses
import os
from typing import Optional, List, Union, Tuple, Callable
from pathlib import Path
from functools import lru_cache
from google.cloud import storage
from google.cloud.storage import Bucket
from google.auth.credentials import AnonymousCredentials
import numpy as np
import pandas as pd
import PIL.Image as Image
# Tensorleap imports
from code_loader.contract.enums import DatasetMetadataType, Metric
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader import leap_binder
PROJECT_ID = 'example-dev-project-nmrksf0o'
BUCKET_NAME = 'example-datasets-47ml982d'
IMAGE_SIZE = 64
MAIN_ATTRIBUTE = 'Eyeglasses'
This code section contains the helper functions used for fetching and caching data from the Google Cloud Storage bucket.
In this example, downloaded files are stored under the HOME path, so each file is downloaded only once. Similarly, you can point the local path to a persistent folder for persistent caching.
# Helper functions:
@lru_cache()
def _connect_to_gcs_and_return_bucket(bucket_name: str) -> Bucket:
    # The bucket is public, so anonymous credentials are used.
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(bucket_name)

def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    # If local_file_path is not specified, save the file under the home dir.
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)
    # If the file already exists locally, reuse it instead of downloading again.
    if os.path.exists(local_file_path):
        return local_file_path
    print("download data from GC")
    bucket = _connect_to_gcs_and_return_bucket(BUCKET_NAME)
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path
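The download-if-missing pattern used by the helper can be tried without any cloud access. The following is a minimal sketch under stated assumptions: `cached_fetch` and `fake_fetch` are hypothetical names introduced only for illustration, and `fake_fetch` stands in for the real blob download by writing a dummy file.

```python
import os
import tempfile
from typing import Callable, List


def cached_fetch(remote_path: str, cache_root: str,
                 fetch: Callable[[str, str], None]) -> str:
    """Return a local path for remote_path, calling fetch only on a cache miss."""
    local_path = os.path.join(cache_root, remote_path)
    if os.path.exists(local_path):
        return local_path  # cache hit: nothing to download
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    fetch(remote_path, local_path)
    return local_path


# Hypothetical stand-in for the real download call: it writes a dummy file
# and records every invocation so that cache hits can be observed.
calls: List[str] = []


def fake_fetch(remote_path: str, local_path: str) -> None:
    calls.append(remote_path)
    with open(local_path, "w") as f:
        f.write("dummy")


cache_root = tempfile.mkdtemp()
first = cached_fetch("celebA/list_attr_celeba.csv", cache_root, fake_fetch)
second = cached_fetch("celebA/list_attr_celeba.csv", cache_root, fake_fetch)
print(first == second, len(calls))  # True 1
```

The second call returns the same path without invoking the fetcher, which is the behavior the encoders rely on when they request the same file repeatedly.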
The preprocess_func (custom name) is a preprocessing function that is called just once before the training/evaluating process. It prepares the data for later use in input encoders, ground truth encoders, and metadata functions. More info at Preprocess Function.
The implementation below downloads list_attr_celeba.csv (contains the attributes) and list_eval_partition.csv (contains the train/validation/test data slices), loads them into DataFrames, and joins them into df_attr. Then it splits the data into train, validation, and test according to the partition column of list_eval_partition.
Lastly, a PreprocessResponse object is created for each of the train, validation, and test data slices. These objects are later passed on to the encoder and metadata functions.
# Preprocess function:
def preprocess_func() -> List[PreprocessResponse]:
    annotations_path = _download("celebA/list_attr_celeba.csv")
    partition_path = _download("celebA/list_eval_partition.csv")
    # Both files are indexed by the image file name.
    df_attr = pd.read_csv(annotations_path, index_col=0)
    df_partition = pd.read_csv(partition_path, index_col=0)
    df_attr = df_attr.join(df_partition)
    # Partition codes: 0 = train, 1 = validation, 2 = test.
    df_train = df_attr[df_attr.partition == 0]
    df_valid = df_attr[df_attr.partition == 1]
    df_test = df_attr[df_attr.partition == 2]
    train = PreprocessResponse(length=len(df_train), data=df_train)
    val = PreprocessResponse(length=len(df_valid), data=df_valid)
    test = PreprocessResponse(length=len(df_test), data=df_test)
    # Return a list to match the declared return type.
    return [train, val, test]
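To make the join-and-split step concrete without downloading anything, here is a plain-Python sketch that uses the standard csv module rather than pandas. The rows are made up, and the `image_id` header and the `read_indexed` helper are assumptions for illustration; only the `partition` column and its 0/1/2 codes follow the script above.

```python
import csv
import io
from typing import Dict

# Made-up rows that mimic the shape of the two CSV files.
ATTR_CSV = """image_id,Eyeglasses,Smiling
000001.jpg,-1,1
000002.jpg,1,-1
000003.jpg,-1,-1
000004.jpg,1,1
"""
PARTITION_CSV = """image_id,partition
000001.jpg,0
000002.jpg,0
000003.jpg,1
000004.jpg,2
"""


def read_indexed(text: str) -> Dict[str, Dict[str, int]]:
    """Read CSV text into {first-column value: {column name: int value}}."""
    reader = csv.DictReader(io.StringIO(text))
    key = reader.fieldnames[0]
    return {row[key]: {k: int(v) for k, v in row.items() if k != key}
            for row in reader}


attrs = read_indexed(ATTR_CSV)
partitions = read_indexed(PARTITION_CSV)

# Join on the file name, analogous to joining two DataFrames on their index.
joined = {name: {**cols, **partitions[name]} for name, cols in attrs.items()}

# Split by partition code: 0 = train, 1 = validation, 2 = test.
splits = {code: [name for name, cols in joined.items() if cols["partition"] == code]
          for code in (0, 1, 2)}
print({code: len(names) for code, names in splits.items()})  # {0: 2, 1: 1, 2: 1}
```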
The input encoder generates an input component of a sample with index idx from the PreprocessResponse object. This input component will later be fetched as input by the network. The function is called for every evaluated sample. More info at Input Encoder.
In the example below, the image file name with index idx is retrieved from the PreprocessResponse's data. The image is then downloaded and opened. Additionally, the image is center cropped and resized before it is returned as the model's input.
# Input encoder fetches the image with the index `idx` from the data set in
# the PreprocessResponse's data. Returns an ndarray containing the sample's image.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    sample = preprocess.data.iloc[idx]
    fpath = f'celebA/img_align_celeba/img_align_celeba/{sample.name}'
    fpath = _download(fpath)
    image = Image.open(fpath)
    # center crop
    celeba_face_size = 178
    width, height = image.size
    left = (width - celeba_face_size) // 2
    top = (height - celeba_face_size) // 2
    right = left + celeba_face_size
    bottom = top + celeba_face_size
    image = image.crop((left, top, right, bottom))
    image = image.resize((IMAGE_SIZE, IMAGE_SIZE))
    # Convert the PIL image to an ndarray to match the declared return type.
    return np.asarray(image)
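The center-crop arithmetic is easy to check in isolation. The sketch below computes only the crop box; `center_crop_box` is a hypothetical helper, and the 178x218 input size is the standard size of the aligned CelebA images, assumed here for the worked example.

```python
from typing import Tuple


def center_crop_box(width: int, height: int, size: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a size x size box centered in the image."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


# A 178x218 image cropped to a 178x178 square loses 20 rows at the top
# and 20 rows at the bottom; the width is kept as is.
box = center_crop_box(178, 218, 178)
print(box)  # (0, 20, 178, 198)
```

Deriving right and bottom from left and top guarantees that the box is exactly size x size even when the margin is odd.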
The ground truth encoder generates a ground truth component of a sample with index idx from the PreprocessResponse object. It will later be used as the ground truth for the loss function. This function is called for each evaluated sample. More info at Ground Truth Encoder.
The implementation below extracts the MAIN_ATTRIBUTE of a sample with index idx and returns a one-hot vector. Note: The CelebA dataset stores each attribute as -1 for negative and 1 for positive.
# Ground truth encoder fetches the label with the index `idx` from the MAIN_ATTRIBUTE column set in
# the PreprocessResponse's data and returns its one-hot vector representation.
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    # Index 1 marks the positive class (glasses), index 0 the negative class.
    has_glasses = preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1
    return np.array([0.0, 1.0]) if has_glasses else np.array([1.0, 0.0])
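The mapping from CelebA's -1/1 encoding to a one-hot vector can be sketched in a few lines of plain Python. `one_hot_from_celeba` is a hypothetical helper written for illustration; it follows the same convention as the encoder above, with index 1 for the positive class.

```python
from typing import List


def one_hot_from_celeba(value: int, num_classes: int = 2) -> List[float]:
    """Map a CelebA attribute value (-1 or 1) to a one-hot vector."""
    class_index = 1 if value == 1 else 0  # 1 -> positive class, -1 -> negative class
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector


print(one_hot_from_celeba(1))   # [0.0, 1.0]
print(one_hot_from_celeba(-1))  # [1.0, 0.0]
```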
For each sample, Tensorleap allows extra data to be added for future analysis. Each metadata field is defined by its own metadata function.
The metadata function below adds the label glasses or no-glasses as metadata to each sample.
# Metadata functions add extra data for later use in analysis.
# This metadata adds the label as a string.
def metadata_label(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'glasses' if preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1 else 'no-glasses'
# Leap binding functions to bind the functions above to the `Dataset`.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='image')
leap_binder.set_ground_truth(function=gt_encoder, name='glasses')
leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.string, name='label')
# The label order follows the ground truth encoding: index 0 is 'no-glasses', index 1 is 'glasses'.
leap_binder.add_prediction(name='prediction', labels=['no-glasses', 'glasses'], metrics=[Metric.Accuracy])
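For intuition about the accuracy metric bound above, the following sketch shows what classification accuracy over one-hot targets usually means: the fraction of samples whose highest-scoring prediction index matches the ground truth index. This is a generic illustration with made-up numbers, not Tensorleap's implementation; `argmax` and `accuracy` are hypothetical helpers.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(preds: List[Sequence[float]], targets: List[Sequence[float]]) -> float:
    """Fraction of samples where the predicted class equals the target class."""
    hits = sum(argmax(p) == argmax(t) for p, t in zip(preds, targets))
    return hits / len(targets)


# Made-up one-hot targets and model scores for four samples.
targets = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
preds = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
print(accuracy(preds, targets))  # 0.75
```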
The add_prediction function provides information about the prediction tensor of the current use case and its metrics. This information will later be used for calculating the selected metrics and visualizations.
Our dataset includes extra metadata such as Bald, Young, Smiling, and more. These fields are implemented using the wrapper function metadata_encoder, which generates a metadata function for each extra field.
At the end of this code snippet, we set the generated metadata functions to the leap_binder object for each of the extra fields.
# Extra metadata
EXTRA_METADATA = ['5_o_Clock_Shadow', 'Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 'Big_Lips',
                  'Big_Nose', 'Black_Hair', 'Blond_Hair', 'Blurry', 'Brown_Hair', 'Bushy_Eyebrows', 'Chubby', 'Double_Chin', 'Eyeglasses',
                  'Goatee', 'Gray_Hair', 'Heavy_Makeup', 'High_Cheekbones', 'Male', 'Mouth_Slightly_Open', 'Mustache', 'Narrow_Eyes', 'No_Beard',
                  'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 'Receding_Hairline', 'Rosy_Cheeks', 'Sideburns', 'Smiling', 'Straight_Hair',
                  'Wavy_Hair', 'Wearing_Earrings', 'Wearing_Hat', 'Wearing_Lipstick', 'Wearing_Necklace', 'Wearing_Necktie', 'Young']

# extra_metadata_key is the position of the field in EXTRA_METADATA.
def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        # Cast the pandas/NumPy integer to a plain Python int.
        return int(preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx])
    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(
        function=metadata_encoder(i),
        metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i]
    )
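One likely reason for generating the metadata functions through a wrapper, rather than defining them directly inside the loop, is Python's late binding of loop variables in closures. The sketch below illustrates the difference with made-up rows; `rows`, `FIELDS`, and `make_getter` are hypothetical names used only for this example.

```python
from typing import Callable, Dict, List

# Made-up attribute rows standing in for the dataset.
rows: List[Dict[str, int]] = [{"Bald": -1, "Young": 1}, {"Bald": 1, "Young": -1}]
FIELDS = ["Bald", "Young"]


def make_getter(field: str) -> Callable[[int], int]:
    """Factory: each returned function keeps its own copy of `field`."""
    def getter(idx: int) -> int:
        return rows[idx][field]
    getter.__name__ = field
    return getter


getters = [make_getter(name) for name in FIELDS]

# Without the factory, every lambda looks up `field_name` when it is called,
# so all of them see the last value assigned by the loop ("Young").
late_bound = []
for field_name in FIELDS:
    late_bound.append(lambda idx: rows[idx][field_name])

print(getters[0].__name__, getters[0](0))  # Bald -1
print(late_bound[0](0))                    # 1 (reads "Young", not "Bald")
```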
For your convenience, the full script is given below:
import os
from typing import Optional, List, Union, Tuple, Callable
from pathlib import Path
from functools import lru_cache
from google.cloud import storage
from google.cloud.storage import Bucket
from google.auth.credentials import AnonymousCredentials
import numpy as np
import pandas as pd
import PIL.Image as Image
# Tensorleap imports
from code_loader.contract.enums import DatasetMetadataType, Metric
from code_loader.contract.datasetclasses import PreprocessResponse
from code_loader import leap_binder
PROJECT_ID = 'example-dev-project-nmrksf0o'
BUCKET_NAME = 'example-datasets-47ml982d'
IMAGE_SIZE = 64
MAIN_ATTRIBUTE = 'Eyeglasses'
# Helper functions:
@lru_cache()
def _connect_to_gcs_and_return_bucket(bucket_name: str) -> Bucket:
    # The bucket is public, so anonymous credentials are used.
    gcs_client = storage.Client(project=PROJECT_ID, credentials=AnonymousCredentials())
    return gcs_client.bucket(bucket_name)

def _download(cloud_file_path: str, local_file_path: Optional[str] = None) -> str:
    # If local_file_path is not specified, save the file under the home dir.
    if local_file_path is None:
        home_dir = os.getenv("HOME")
        local_file_path = os.path.join(home_dir, "Tensorleap_data", BUCKET_NAME, cloud_file_path)
    # If the file already exists locally, reuse it instead of downloading again.
    if os.path.exists(local_file_path):
        return local_file_path
    print("download data from GC")
    bucket = _connect_to_gcs_and_return_bucket(BUCKET_NAME)
    dir_path = os.path.dirname(local_file_path)
    os.makedirs(dir_path, exist_ok=True)
    blob = bucket.blob(cloud_file_path)
    blob.download_to_filename(local_file_path)
    return local_file_path

# Preprocess function:
def preprocess_func() -> List[PreprocessResponse]:
    annotations_path = _download("celebA/list_attr_celeba.csv")
    partition_path = _download("celebA/list_eval_partition.csv")
    # Both files are indexed by the image file name.
    df_attr = pd.read_csv(annotations_path, index_col=0)
    df_partition = pd.read_csv(partition_path, index_col=0)
    df_attr = df_attr.join(df_partition)
    # Partition codes: 0 = train, 1 = validation, 2 = test.
    df_train = df_attr[df_attr.partition == 0]
    df_valid = df_attr[df_attr.partition == 1]
    df_test = df_attr[df_attr.partition == 2]
    train = PreprocessResponse(length=len(df_train), data=df_train)
    val = PreprocessResponse(length=len(df_valid), data=df_valid)
    test = PreprocessResponse(length=len(df_test), data=df_test)
    # Return a list to match the declared return type.
    return [train, val, test]

# Input encoder fetches the image with the index `idx` from the data set in
# the PreprocessResponse's data. Returns an ndarray containing the sample's image.
def input_encoder(idx: int, preprocess: PreprocessResponse) -> np.ndarray:
    sample = preprocess.data.iloc[idx]
    fpath = f'celebA/img_align_celeba/img_align_celeba/{sample.name}'
    fpath = _download(fpath)
    image = Image.open(fpath)
    # center crop
    celeba_face_size = 178
    width, height = image.size
    left = (width - celeba_face_size) // 2
    top = (height - celeba_face_size) // 2
    right = left + celeba_face_size
    bottom = top + celeba_face_size
    image = image.crop((left, top, right, bottom))
    image = image.resize((IMAGE_SIZE, IMAGE_SIZE))
    # Convert the PIL image to an ndarray to match the declared return type.
    return np.asarray(image)

# Ground truth encoder fetches the label with the index `idx` from the MAIN_ATTRIBUTE column set in
# the PreprocessResponse's data and returns its one-hot vector representation.
def gt_encoder(idx: int, preprocess: Union[PreprocessResponse, list]) -> np.ndarray:
    # Index 1 marks the positive class (glasses), index 0 the negative class.
    has_glasses = preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1
    return np.array([0.0, 1.0]) if has_glasses else np.array([1.0, 0.0])

# Metadata functions add extra data for later use in analysis.
# This metadata adds the label as a string.
def metadata_label(idx: int, preprocess: Union[PreprocessResponse, list]) -> Union[int, float, str, bool]:
    return 'glasses' if preprocess.data.iloc[idx][MAIN_ATTRIBUTE] == 1 else 'no-glasses'

# Dataset binding functions to bind the functions above to the `Dataset`.
leap_binder.set_preprocess(function=preprocess_func)
leap_binder.set_input(function=input_encoder, name='image')
leap_binder.set_ground_truth(function=gt_encoder, name='glasses')
leap_binder.set_metadata(function=metadata_label, metadata_type=DatasetMetadataType.string, name='label')
# The label order follows the ground truth encoding: index 0 is 'no-glasses', index 1 is 'glasses'.
leap_binder.add_prediction(name='prediction', labels=['no-glasses', 'glasses'], metrics=[Metric.Accuracy])

# Extra metadata
EXTRA_METADATA = ['5_o_Clock_Shadow', 'Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 'Big_Lips',
                  'Big_Nose', 'Black_Hair', 'Blond_Hair', 'Blurry', 'Brown_Hair', 'Bushy_Eyebrows', 'Chubby', 'Double_Chin', 'Eyeglasses',
                  'Goatee', 'Gray_Hair', 'Heavy_Makeup', 'High_Cheekbones', 'Male', 'Mouth_Slightly_Open', 'Mustache', 'Narrow_Eyes', 'No_Beard',
                  'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 'Receding_Hairline', 'Rosy_Cheeks', 'Sideburns', 'Smiling', 'Straight_Hair',
                  'Wavy_Hair', 'Wearing_Earrings', 'Wearing_Hat', 'Wearing_Lipstick', 'Wearing_Necklace', 'Wearing_Necktie', 'Young']

# extra_metadata_key is the position of the field in EXTRA_METADATA.
def metadata_encoder(extra_metadata_key: int) -> Callable[[int, PreprocessResponse], int]:
    def func(idx: int, preprocess: PreprocessResponse) -> int:
        # Cast the pandas/NumPy integer to a plain Python int.
        return int(preprocess.data[EXTRA_METADATA[extra_metadata_key]].iloc[idx])
    func.__name__ = EXTRA_METADATA[extra_metadata_key]
    return func

for i in range(len(EXTRA_METADATA)):
    leap_binder.set_metadata(
        function=metadata_encoder(i),
        metadata_type=DatasetMetadataType.int, name=EXTRA_METADATA[i]
    )