Gender Bias in Multimodal Embeddings: An Example of OpenAI CLIP

11 minute read


Introduction

Artificial intelligence has made significant strides in understanding and processing multimodal information, such as images and text, simultaneously. OpenAI’s CLIP (Contrastive Language–Image Pretraining) model is a prime example of this advancement, showcasing the ability to learn from diverse image and text data to perform various tasks, including image classification based on textual descriptions. However, as with any AI model, CLIP is not immune to biases present in its training data, raising concerns about its performance and behavior in real-world applications.

In this blog post, I investigate gender bias in multimodal embeddings, focusing on an analysis of OpenAI CLIP with the UTKFace dataset. By examining how CLIP processes and represents gender-related information, I aim to shed light on potential biases and their implications. The methodology compares the embeddings of face images against the embeddings of gender-related words to predict the gender of each face. I also examine CLIP’s tendency to associate certain attributes with specific genders, highlighting the challenges and considerations involved in developing fair and unbiased AI models.

Packages and Dataset

To conduct this analysis, I used Google Colab, which provides convenient GPU access and easy installation of the necessary libraries. Let’s begin by installing the required packages.

!pip -q install ftfy regex tqdm
!pip -q install git+https://github.com/openai/CLIP.git

Next, we download the UTKFace dataset, a diverse collection of face images labeled with age, gender, and race. The dataset is split into three archives, which are downloaded and extracted into a directory named utk_face for further processing.

!gdown 1mb5Z24TsnKI3ygNIlX6ZFiwUj0_PmpAW
!gdown 19vdaXVRtkP-nyxz1MYwXiFsh_m_OL72b
!gdown 1oj9ZWsLV2-k2idoW_nRSrLQLUP3hus3b
!mkdir utk_face
!tar -xf  'part1.tar.gz' -C './utk_face'
!tar -xf  'part2.tar.gz' -C './utk_face'
!tar -xf  'part3.tar.gz' -C './utk_face'
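If the downloads and extraction succeed, the utk_face directory should contain a few tens of thousands of face images across the three parts. As a quick optional sanity check (not part of the original notebook), we can count the extracted files:

!find utk_face -type f | wc -l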

Creating the Dataset

To create a PyTorch dataset for UTKFace, I defined a custom UTKFace class that inherits from torch.utils.data.Dataset. The class scans the dataset directory, parses each file name into its age, gender, and race labels, and loads the corresponding image lazily in __getitem__. It also includes mappings from the integer gender and race labels to human-readable names.

import glob
import os

from PIL import Image
from torch.utils import data


class UTKFace(data.Dataset):

    gender_map = {0: 'male', 1: 'female'}
    race_map = {0: 'white', 1: 'black', 2: 'asian', 3: 'indian', 4: 'others'}

    def __init__(self, root, transform=None):
        self.root = root
        self.samples = self._prepare_samples(root)
        self.transform = transform

    def __getitem__(self, index):
        path, label = self.samples[index]
        image = Image.open(path)

        if self.transform is not None:
            image = self.transform(image)

        return image, label

    def __len__(self):
        return len(self.samples)

    def _prepare_samples(self, root):
        samples = []
        paths = glob.glob(os.path.join(root, '*/*'))

        for path in paths:
            try:
                label = self._load_label(path)
            except Exception as e:
                # Skip files whose names do not follow the [age]_[gender]_[race]_* pattern
                print(f'path: {path}, exception: {e}')
                continue

            samples.append((path, label))

        return samples

    def _load_label(self, path):
        # File names are formatted as [age]_[gender]_[race]_[date&time].jpg
        str_list = os.path.basename(path).split('.')[0].strip().split('_')
        age, gender, race = map(int, str_list[:3])
        label = dict(age=age, gender=gender, race=race)
        return label

To visualize an example image from the dataset, I randomly selected one image and displayed it using matplotlib.

import random
import matplotlib.pyplot as plt
from PIL import Image

utkface = UTKFace(root='utk_face')
print(f'num images: {len(utkface)}')
sample_image, sample_label = random.choice(utkface)

plt.imshow(sample_image)
plt.axis("off")
plt.title(str(sample_label))
plt.show()

Model and DataLoader Setup

For this analysis, I selected the “ViT-B/32” CLIP architecture, a Vision Transformer with a patch size of 32x32 pixels, and set up a DataLoader to batch and preprocess the UTKFace images for inference.

import torch
import clip
from PIL import Image
from tqdm import tqdm

print('Available Models: ', clip.available_models())

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'device is {device}')
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

utkface_dataset = UTKFace(root='utk_face', transform=preprocess)
utkface_dataloader = data.DataLoader(utkface_dataset, batch_size=512, shuffle=False, num_workers=2)
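As a quick optional sanity check of the setup (not in the original notebook), we can encode the two gender prompts once and inspect the resulting embedding shape; for the ViT-B/32 checkpoint, the joint image-text embedding should be 512-dimensional.

with torch.no_grad():
    sanity_tokens = clip.tokenize(["male", "female"]).to(device)
    sanity_feats = model.encode_text(sanity_tokens)
print(sanity_feats.shape)  # expected: torch.Size([2, 512]) for ViT-B/32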

Gender Prediction

Here’s the code for predicting gender from a chosen pair of gender keywords. For every image, CLIP scores the image against both text prompts, and a softmax over the two scores yields a probability for each gender; when several keyword pairs are used, the probabilities are averaged across pairs.

import numpy as np

# Change this to use one or more keyword pairs; probabilities are averaged over the pairs
text_tokens = []
text_tokens.append(clip.tokenize(["male", "female"]).to(device))
# text_tokens.append(clip.tokenize(["man", "woman"]).to(device))
# text_tokens.append(clip.tokenize(["boy", "girl"]).to(device))

n = len(text_tokens)

age = []
gender = []
race = []
gender_p = []  # predicted probability of 'female' for each image

with torch.no_grad():
    for images, label in tqdm(utkface_dataloader):
        age.extend(label["age"].tolist())
        gender.extend(label["gender"].tolist())
        race.extend(label["race"].tolist())

        probability_g = np.zeros((images.shape[0], 2))

        # Average the softmax probabilities over the keyword pairs
        for tt in text_tokens:
            logits_per_image, logits_per_text = model(images.to(device), tt)
            probs = logits_per_image.softmax(dim=-1).cpu().numpy()
            probability_g += probs / n

        # Column 1 corresponds to the second keyword in each pair ("female")
        gender_p.extend(probability_g[:, 1])

This code calculates the gender probabilities for each image in the dataset based on the provided keywords. The gender_p list will contain the predicted probabilities of being female for each image. You can adjust the text_tokens list to include different sets of keywords for gender prediction.
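For context, the call model(images.to(device), tt) essentially encodes the images and the text separately, L2-normalizes both embeddings, and scales their cosine similarities by CLIP’s learned temperature before the softmax. Below is a minimal sketch of that computation for a single batch; sample_batch is just an illustrative name for one batch pulled from the data loader.

with torch.no_grad():
    sample_batch, _ = next(iter(utkface_dataloader))  # one batch of preprocessed images
    tokens = clip.tokenize(["male", "female"]).to(device)

    image_features = model.encode_image(sample_batch.to(device))
    text_features = model.encode_text(tokens)

    # L2-normalize so the dot product becomes cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Scale by the learned temperature, then softmax over the two prompts
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    female_prob = logits.softmax(dim=-1)[:, 1]  # column 1 corresponds to "female"

print(female_prob[:5])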

Generating Ground Truth Data

To ensure the integrity of the analysis, let’s filter out samples with invalid gender or race labels. Only samples whose parsed gender and race labels fall within the expected ranges are kept (the code below uses gender < 3 and race < 5 as sanity checks on the parsed values; valid genders are 0 and 1, valid races 0 through 4). This filtering leaves 24,105 samples for which we have both ground-truth gender labels and predicted gender probabilities.

age = np.array(age)
gender = np.array(gender)
race = np.array(race)
gender_p = np.array(gender_p)

ind_keep = (gender < 3) & (race < 5)

age = age[ind_keep]
gender = gender[ind_keep]
race = race[ind_keep]
gender_p = gender_p[ind_keep]
print(f'Number of samples after filtering: {gender_p.shape[0]}')
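Before evaluating the predictions, it is also worth checking the class balance of the filtered labels, since the results below compare predicted and true male percentages. A short optional check (not in the original notebook):

# True gender distribution after filtering (0 = male, 1 = female)
vals, counts = np.unique(gender, return_counts=True)
for v, c in zip(vals, counts):
    print(f'{UTKFace.gender_map[v]}: {c / len(gender) * 100:.2f}%')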

Optimizing Threshold and Calculating Metrics

To evaluate gender prediction from the CLIP probabilities, I first threshold the predicted female probability at 0.5 to obtain a predicted class and check which gender the model favors. I then sweep the threshold and keep the value that maximizes the F-score, a metric that balances precision and recall, averaged over the two classes.

from sklearn.metrics import precision_recall_fscore_support
from scipy import stats

gender_p_class = (gender_p > 0.5).astype(int)

# Most frequently predicted class; int(np.atleast_1d(...)) handles both old and new scipy APIs
mode_value = int(np.atleast_1d(stats.mode(gender_p_class).mode)[0])
mode_class = UTKFace.gender_map[mode_value]
print(f'Gender has bias towards {mode_class}.')
print(f'Predicted #{mode_class}: {np.mean(gender_p_class == mode_value) * 100:.2f}% / True value of #{mode_class}: {np.mean(gender == mode_value) * 100:.2f}%')

# Optimize the threshold by maximizing the F-score averaged over both classes
fmax = 0
threshold = None
for th in np.arange(0.05, 1, 0.05):
    p, r, f, s = precision_recall_fscore_support(gender, gender_p > th)
    if np.mean(f) > fmax:
        fmax = np.mean(f)
        threshold = th

print(f'\nF-score max: {fmax:.4f} with threshold: {threshold:.2f}')
print(f'Accuracy (Threshold: {threshold:.2f}): {np.mean((gender_p > threshold) == gender):.4f}')
print(f'Accuracy (Threshold: 0.5): {np.mean((gender_p > 0.5) == gender):.4f}')

This snippet reports which class the model favors at the default 0.5 threshold and then searches for the threshold that maximizes the class-averaged F-score. The resulting F-score and accuracy metrics provide a baseline for the bias analysis that follows.
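The grid search above steps over thresholds in increments of 0.05. An alternative sketch, using scikit-learn’s precision_recall_curve to consider every distinct predicted probability as a candidate threshold, is shown below. Note that it maximizes the F1-score of the female class only, whereas the loop above averages the F-score over both classes, so the selected threshold can differ slightly.

from sklearn.metrics import precision_recall_curve

# gender_p is treated as the score for the positive class (female = 1)
precision, recall, thresholds = precision_recall_curve(gender, gender_p)

# F1 at every candidate threshold; the last precision/recall pair has no threshold, so drop it
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best = np.argmax(f1)
print(f'Best F1 (female class): {f1[best]:.4f} at threshold {thresholds[best]:.4f}')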

Error Analysis by Race

To understand how gender prediction errors are distributed across races, I look at the misclassified samples (those whose predicted gender class differs from the ground truth) and compute the fraction of those errors that belongs to each racial group. Note that this is each group’s share of the total errors, not a per-group error rate, so larger groups naturally account for a larger share.

gender_p_class = (gender_p > threshold).astype(int)

# Races of the misclassified samples, and each race's share of the total errors
race_error = race[gender != gender_p_class]
unique, counts = np.unique(race_error, return_counts=True)
for u, c in zip(unique, counts):
    print(f'{UTKFace.race_map[u]} : {c / len(race_error):.4f}')

This breakdown shows whether errors concentrate in particular racial groups. To account for group size as well, I also compute a confusion matrix for each race, which gives a detailed breakdown of correct and incorrect gender predictions within each group.

from sklearn.metrics import confusion_matrix

gender_p_class = (gender_p > threshold).astype(int)

# Rows are the true gender (0 = male, 1 = female), columns the predicted gender
for ur in np.unique(race):
    print(UTKFace.race_map[ur])
    print(confusion_matrix(gender[race == ur], gender_p_class[race == ur]))
    print('\n')
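The shares reported above tell us which groups the errors come from, but not how error-prone the model is within each group. As a complementary view, a short sketch (not in the original notebook) computes the per-race error rate, i.e. the fraction of each group’s samples that is misclassified:

# Per-race error rate: fraction of that group's samples whose predicted gender is wrong
for ur in np.unique(race):
    mask = race == ur
    error_rate = np.mean(gender[mask] != gender_p_class[mask])
    print(f'{UTKFace.race_map[ur]}: error rate {error_rate:.4f} (n={mask.sum()})')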

Results: Analysis of Gender Bias in CLIP

Male vs. Female

When using the keywords “male” and “female” to predict gender, the model exhibited a bias towards predicting male gender. The predicted percentage of males in the dataset was 66.30%, compared to the true value of 52.20%. This bias highlights the need to carefully select keywords and thresholds for gender prediction.

Optimizing the threshold to 0.2 gave a maximum F-score of 0.9603: an image is classified as female once its predicted female probability exceeds 0.2. That such a low cut-off works best confirms that the raw probabilities are skewed towards male. At this threshold, the accuracy was 0.9604, substantially higher than the 0.8503 obtained at the default threshold of 0.5.

The errors were not evenly distributed across racial groups. The white group accounted for the largest share of misclassifications (0.3309), while the black, asian, indian, and other groups each accounted for between 0.1173 and 0.2785 of the errors. Keeping in mind that these shares are not normalized by group size, the results still suggest that the model’s gender predictions may be influenced by race.

Man vs. Woman

Using the keywords “man” and “woman” to predict gender, the model again leaned towards predicting male, though much less strongly: the predicted percentage of males was 53.65%, compared to the true value of 52.20%. Even this milder bias underscores the importance of carefully selecting keywords and thresholds for gender prediction.

By optimizing the threshold to 0.3, we achieved a maximum F-score of 0.9487. This threshold indicates that for an image to be classified as female, the probability of being female should be at least 0.3. At this threshold, the accuracy was 0.9487, which is slightly higher than the accuracy at the default threshold of 0.5 (0.9421).

The distribution of errors again varied across racial groups. The asian group accounted for the largest share of misclassifications (0.3641), followed by the white group (0.2921), while the black, indian, and other groups each accounted for between 0.0963 and 0.1472 of the errors.

Boy vs. Girl

Using the keywords “boy” and “girl” to predict gender, the model exhibited a bias towards predicting male gender. The predicted percentage of males was 61.34%, compared to the true value of 52.20%.

Optimizing the threshold to 0.2 gave a maximum F-score of 0.9412: again, an image is classified as female once its predicted female probability exceeds 0.2, pointing to a strong skew towards male. At this threshold, the accuracy was 0.9413, compared with 0.8865 at the default threshold of 0.5.

The distribution of errors across racial groups was similar: the white group accounted for the largest share of misclassifications (0.4064), followed by the asian group (0.2615), while the black, indian, and other groups each accounted for between 0.0777 and 0.1569 of the errors.

Conclusion

In this blog post, we analyzed gender bias in OpenAI CLIP using different keyword pairs (“male” and “female”, “man” and “woman”, “boy” and “girl”) to predict gender in images. We investigated the model’s bias, optimized thresholds for gender prediction, and evaluated its performance across different racial groups.

  • Gender Bias: Across all keyword pairs, the model exhibited a bias towards predicting male gender. The predicted percentages of males were consistently higher than the true values, indicating a systematic bias in the model’s predictions.

  • Optimized Thresholds: By optimizing the threshold for gender prediction, we were able to achieve higher accuracies and F-scores compared to the default threshold of 0.5. This suggests that adjusting the threshold can improve the model’s performance and reduce bias.

  • Error Analysis by Race: The distribution of errors varied across racial groups, with the white and asian groups accounting for the largest shares of misclassifications, indicating potential challenges in accurately predicting gender for these groups.

What We Learned

  • In multimodal models, the choice of keywords and thresholds is crucial in mitigating bias and improving the performance of gender prediction models.
  • Racial diversity in the dataset can impact the model’s performance, highlighting the importance of diverse and representative datasets.

Colab

You can find the code at Colab.

References

[1] CLIP

[2] UTKFace

[3] ChatGPT