Weak Supervision with Snorkel: Image Classification Example

Introduction

In the world of machine learning, data is often hailed as the crown jewel that powers models and drives innovation. Yet, obtaining high-quality, labeled data remains a significant challenge, often demanding painstaking manual efforts from human annotators. This is where the concept of weak supervision emerges as a beacon of hope for machine learning engineers and practitioners.

Weak supervision is the art of leveraging various sources of noisy or imprecise supervision to label a large amount of data efficiently. It lifts the burden of exhaustive manual labeling and opens the door to scaling up projects that would otherwise be resource-intensive. In this post, we embark on a journey to explore Snorkel, a powerful tool that empowers us to automate the labeling process, saving time and effort without compromising on results.

In this tutorial, tailored for machine learning engineers and enthusiasts alike, we’ll unveil the advantages of weak supervision using a practical example: Image Classification. By the end of this guide, you’ll have a basic understanding of how to harness the potential of Snorkel to streamline your image classification pipelines and achieve impressive results with reduced labeling efforts.

Whether you’re a seasoned practitioner seeking to optimize your workflow or a newcomer eager to unlock the potential of weak supervision, this tutorial will equip you with the knowledge and skills needed to elevate your machine learning projects. So, let’s dive into the world of weak supervision and see how Snorkel can revolutionize the way we approach labeling and, ultimately, supercharge our machine learning models.

Are you ready to embark on this exciting journey? Let’s begin!

Data Download: Exploring the Open Images Dataset V7

Before we dive into the exciting world of weak supervision and Snorkel for image classification, we need to set the stage by obtaining the necessary data. In this tutorial, we’ll be using the Open Images Dataset V7, a rich collection of images spanning a wide array of categories. This dataset is a treasure trove for machine learning tasks, providing a diverse range of visuals that will help us showcase the power of weak supervision.

To get started, we download the essential files from the Open Images Dataset V7. These files contain crucial information about class labels, class descriptions, and human-verified image-level annotations.
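
A minimal Python sketch of this step is shown below; the base URL is an assumption based on the Open Images download page, so verify it against the official download instructions before running:

```python
import os
import urllib.request

# Assumed base URL for the Open Images V7 metadata files; verify before use.
BASE_URL = "https://storage.googleapis.com/openimages/v7/"

FILES = [
    "oidv7-classes-trainable.txt",
    "oidv7-class-descriptions.csv",
    "oidv7-train-annotations-human-imagelabels.csv",
    "oidv7-val-annotations-human-imagelabels.csv",
    "oidv7-test-annotations-human-imagelabels.csv",
]

os.makedirs("oiv7", exist_ok=True)  # directory that will hold the metadata
for name in FILES:
    urllib.request.urlretrieve(BASE_URL + name, os.path.join("oiv7", name))
```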

In this snippet, we create a directory named oiv7 to neatly organize the downloaded files. The downloaded files include:

  1. oidv7-classes-trainable.txt: A list of trainable (verified) class labels.
  2. oidv7-class-descriptions.csv: A CSV file containing class descriptions.
  3. oidv7-train-annotations-human-imagelabels.csv: Annotations for training images.
  4. oidv7-val-annotations-human-imagelabels.csv: Annotations for validation images.
  5. oidv7-test-annotations-human-imagelabels.csv: Annotations for test images.

Now that we have the essential data files in place, it’s time to turn our attention to the actual image files. In this section, we’ll walk through the process of downloading labeled images that are trainable according to the Open Images Dataset V7.

The provided Python code streamlines this image download process, ensuring that we only retrieve images that are relevant and fit for training.

Let’s break down the key components of the code (a condensed sketch of the download helper follows the list):

  1. download_one_image: This function downloads a single image from the specified split (train, val, or test) and saves it to the specified path. It uses a boto3 S3 Bucket resource (BUCKET) to fetch files from the public Open Images bucket.

  2. get_class_label: This function retrieves class labels for the requested class names. It ensures that the requested classes are trainable, as per the dataset specifications.

  3. get_image_ids: This function retrieves image IDs for the requested splits and labels. It identifies images that match the requested class labels and have a confidence level of 1.0 (i.e., verified by a human annotator).

  4. download_images: This function orchestrates the image download process based on requested labels, splits, and paths. It uses concurrent futures to speed up the download process by using multiple threads.
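
Here is a condensed sketch of the download helper. The bucket name is the one used by the official Open Images downloader; the function signature is an illustrative assumption:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 access: the Open Images bucket is public, so no credentials
# are required.
BUCKET = boto3.resource(
    "s3", config=Config(signature_version=UNSIGNED)
).Bucket("open-images-dataset")

def download_one_image(split: str, image_id: str, dest_dir: str) -> None:
    """Fetch one image from the given split and save it under dest_dir."""
    # Note: the bucket names its splits "train", "validation", and "test".
    BUCKET.download_file(f"{split}/{image_id}.jpg", f"{dest_dir}/{image_id}.jpg")
```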

By combining these components, the code provides a way to download labeled images from the Open Images Dataset V7. Next, we’ll use the code to download labeled images, specifically focusing on the classes “Sea” and “Jungle.” These images will serve as our starting point for weak supervision, demonstrating how we can leverage Snorkel to automatically label and train an image classifier.
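
For illustration, an invocation could look like the following (the argument names are hypothetical and should be adapted to the actual signature in the repository):

```python
# Hypothetical call; argument names are assumptions.
download_images(
    class_names=["Sea", "Jungle"],
    splits=["train", "val", "test"],
    dest_root="data/raw",
)
```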

Organizing and Splitting the Dataset

In this section, we’ll walk through a code snippet that performs data splitting and organization. This step is pivotal in setting the stage for robust model training and evaluation.

Let’s delve into the key components of the code (a sketch of the split logic follows the list):

  1. stat: This function computes statistics on the data by counting the number of images for each class. It returns a dictionary mapping class names to lists of image paths.

  2. split_manual: This function splits the data into different subsets (train, val, test) based on specified ratios, ensuring that each subset maintains a proportional representation of the classes. We split the data into train, val, and test with ratios of 0.7, 0.2, and 0.1, respectively.
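
Here is a minimal sketch of the split, assuming the dictionary produced by stat (class name mapped to a list of image paths); the exact bookkeeping in the repository may differ:

```python
import random

def split_manual(data, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split each class's image list into train/val/test with the given ratios."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls, paths in data.items():
        paths = paths[:]  # copy so the shuffle is side-effect free
        rng.shuffle(paths)
        n_train = int(ratios[0] * len(paths))
        n_val = int(ratios[1] * len(paths))
        splits["train"] += [(p, cls) for p in paths[:n_train]]
        splits["val"] += [(p, cls) for p in paths[n_train:n_train + n_val]]
        splits["test"] += [(p, cls) for p in paths[n_train + n_val:]]
    return splits
```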

Labeling Functions

The heart of Snorkel lies in creating labeling functions that generate noisy labels for our data. In this section, we’ll go through a script that defines a set of labeling functions, each contributing to the creation of our weakly labeled dataset.

Here’s an overview of the key components of the provided code (a sketch of one labeling function follows the list):

  1. check_color: This function classifies images based on their dominant color on average.

  2. check_pixel_color: This function classifies images based on the mode of max color per pixel.

  3. check_with_efficientNet: This function leverages EfficientNet predictions and FastText embeddings to classify images as “SEA” or “JUNGLE.” First, the image is classified with EfficientNet. Then, the FastText embedding of the predicted label is compared against embeddings of several words related to sea or jungle. If the predicted label’s meaning is closer to the sea-related words, the image is labeled SEA.

  4. Adding labeling functions: The script uses the add_func decorator to add each labeling function to the LABELING_FUNCS list, which will be used later.
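
As a concrete illustration, here is a minimal sketch of the first heuristic written as a Snorkel labeling function. The label constants and the blue-versus-green rule are illustrative, and the repository registers its functions through its own add_func decorator rather than directly:

```python
import numpy as np
from PIL import Image
from snorkel.labeling import labeling_function

# Illustrative label constants; -1 is Snorkel's convention for abstaining.
SEA, JUNGLE, ABSTAIN = 0, 1, -1

@labeling_function()
def check_color(path):
    """Vote by average color: more blue suggests SEA, more green suggests JUNGLE."""
    mean_rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3).mean(axis=0)
    if mean_rgb[2] > mean_rgb[1]:
        return SEA
    if mean_rgb[1] > mean_rgb[2]:
        return JUNGLE
    return ABSTAIN
```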

By combining these labeling functions, we will generate a set of noisy labels for our images. These labels are the cornerstone of our weak supervision approach, allowing us to utilize Snorkel’s capabilities.

Weak Supervision Labeling with Snorkel

The power of weak supervision comes to life when we leverage labeling functions to create noisy labels for our dataset. Now, we’ll explore a script that performs weak supervision labeling with Snorkel.

Here’s a breakdown of the key elements in the provided code (a condensed sketch of the pipeline follows the list):

  1. DATA PREPARATION: The script starts by specifying the root directory containing the input images and the splits (e.g., ‘train’, ‘val’) to process. Additionally, it defines the root directory to save the labeled data.

  2. LFApplier: The labeling functions (LABELING_FUNCS) defined in the previous code script are applied to all images in the specified split. The result is a label matrix (L_train) where each row corresponds to an image and each column corresponds to a labeling function.

  3. LFAnalysis: This step provides an analysis of the labeling functions’ performance on the data. It generates a summary that indicates how well the labeling functions agree or disagree on assigning labels to images.

  4. label_model: A LabelModel is trained using the label matrix (L_train). This model learns to estimate the true underlying labels by accounting for the noise introduced by the labeling functions.

  5. Label Prediction: The label model predicts probabilities of labels for each image based on the noisy labels from the labeling functions.

  6. Saving Labeled Data: The labeled data, including the predicted labels and image paths, is saved to pickle files. This data will serve as the input for our model training process.
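
Putting these pieces together, a condensed sketch of the pipeline looks like this; image_paths and LABELING_FUNCS come from the earlier steps, and the hyperparameters are illustrative:

```python
from snorkel.labeling import LFAnalysis, LFApplier
from snorkel.labeling.model import LabelModel

# Apply every labeling function to every image path.
applier = LFApplier(lfs=LABELING_FUNCS)
L_train = applier.apply(image_paths)  # label matrix: one row per image

# Summarize coverage, overlaps, and conflicts among the labeling functions.
print(LFAnalysis(L=L_train, lfs=LABELING_FUNCS).lf_summary())

# Fit a LabelModel that denoises the label matrix.
label_model = LabelModel(cardinality=2, verbose=True)  # two classes: SEA, JUNGLE
label_model.fit(L_train=L_train, n_epochs=500, seed=123)

probs = label_model.predict_proba(L=L_train)  # soft labels (probabilities)
preds = probs.argmax(axis=1)                  # hard labels (class indices)
```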

By executing this script, we perform the crucial step of labeling our data using weak supervision techniques. Snorkel helps us manage the uncertainty introduced by the labeling functions, creating a labeled dataset that reflects the inherent noise in the weakly supervised data.

Dataloaders

Next, we implement our dataloaders for the fully supervised and the weakly supervised training procedures.

Here’s a breakdown of the key elements in the provided code (a sketch of the Snorkel dataset class follows the list):

  1. get_transforms(): This function provides data transformation pipelines tailored for different dataset splits: ‘train’, ‘val’, and ‘test’. These transformations include resizing, cropping, flipping, rotation, normalization, and tensor conversion.

  2. SnorkelDataset: This custom dataset class is designed for weakly supervised learning using Snorkel labels. It takes a path to a pickled data file and a label type (‘hard’ or ‘soft’) as inputs. In weakly supervised learning, “hard labels” refer to discrete, definite labels assigned to data points, indicating clear categorization (e.g., ‘SEA’ or ‘JUNGLE’). On the other hand, “soft labels” represent probabilistic or continuous assignments, reflecting the uncertainty or ambiguity in classification. The class loads images and corresponding labels from the data file and applies the specified transformations.

  3. get_data_loader(): This function creates and returns data loaders for different dataset splits. It utilizes the ImageFolder dataset from torchvision, which organizes data into class folders. The dataloaders are configured with appropriate transformations and batch sizes for training, validation, and testing.

  4. get_data_loader_snorkel(): This function generates data loaders for the specified dataset splits using Snorkel-generated labels. It utilizes the SnorkelDataset class to load images and labels from pickled data files, enabling weakly supervised learning. The dataloaders are configured similarly to those in get_data_loader(), tailored for Snorkel-labeled data.
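
Here is a minimal sketch of SnorkelDataset, assuming the pickle stores image paths alongside the LabelModel probabilities (the field names are illustrative):

```python
import pickle

import torch
from PIL import Image
from torch.utils.data import Dataset

class SnorkelDataset(Dataset):
    def __init__(self, pkl_path, label_type="hard", transform=None):
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
        self.paths = data["paths"]  # image file paths
        self.probs = data["probs"]  # (N, num_classes) from the LabelModel
        self.label_type = label_type
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        probs = torch.tensor(self.probs[idx], dtype=torch.float)
        if self.label_type == "hard":
            return img, probs.argmax()  # discrete class index
        return img, probs               # full probability vector
```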

By leveraging these utility functions and classes, we ensure that our data is well-prepared and ready to be fed into our CNN model.

Training function

Here is the training function. I skip the description of this part, since it follows a common pattern of model training with PyTorch.

NOTE: The condition if len(labels.shape) > 1: deals with soft labels.
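
For reference, here is a sketch of how such a branch typically handles both label types (the function name is illustrative):

```python
import torch.nn.functional as F

def compute_loss(outputs, labels):
    """Cross-entropy for hard (batch,) or soft (batch, num_classes) labels."""
    if len(labels.shape) > 1:
        # Soft labels: cross-entropy against the full probability vectors.
        return -(labels * F.log_softmax(outputs, dim=1)).sum(dim=1).mean()
    # Hard labels: standard cross-entropy against class indices.
    return F.cross_entropy(outputs, labels)
```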

Inference

With the following code, we can evaluate all weights saved in the weight_folder on the test data.
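
A minimal sketch of this evaluation loop, assuming the checkpoints are saved as .pth state dicts:

```python
from pathlib import Path

import torch

def evaluate_all(model, weight_folder, test_loader, device="cuda"):
    """Report test accuracy for every checkpoint found in weight_folder."""
    for ckpt in sorted(Path(weight_folder).glob("*.pth")):
        model.load_state_dict(torch.load(ckpt, map_location=device))
        model.to(device).eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in test_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"{ckpt.name}: accuracy = {correct / total:.4f}")
```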

Results

Now let’s see the results. With the following script, we can train the model with the original supervised labels or with either of the two weakly supervised variants (soft and hard Snorkel labels).
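
A hypothetical driver, assuming the loader and training functions defined in the previous sections (the actual script lives in the repository):

```python
# Hypothetical driver; names are assumptions based on the earlier sections.
mode = "snorkel_soft"  # one of: "original", "snorkel_soft", "snorkel_hard"

if mode == "original":
    loaders = get_data_loader()  # ImageFolder-based, original labels
else:
    label_type = mode.split("_")[1]  # "soft" or "hard"
    loaders = get_data_loader_snorkel(label_type=label_type)

train_model(loaders)  # the training function from the previous section
```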

The results are as follows.

Original Labels:

  • Precision: For the original labels, the model achieves high precision scores for both classes (‘SEA’ and ‘JUNGLE’), indicating that when it predicts a class, it’s usually correct. Specifically, the precision values of approximately 0.94 and 0.92 for ‘SEA’ and ‘JUNGLE’ respectively demonstrate the model’s accuracy in its predictions.
  • Accuracy (ACC): The overall accuracy of 0.926 suggests that the model is successful in correctly classifying approximately 92.6% of the images in the dataset.

Snorkel Soft Labels:

  • Precision: With soft labels, where the labeling functions provide probabilistic or continuous assignments, the precision scores remain relatively high but show a slight decrease compared to the original labels. The values of around 0.93 for ‘SEA’ and 0.909 for ‘JUNGLE’ indicate a minor decrease in precision.
  • Accuracy (ACC): The accuracy of 0.9132, while slightly lower than the original labels, still demonstrates a strong performance, capturing approximately 91.3% of the dataset correctly.

Snorkel Hard Labels:

  • Precision: When using hard labels (discrete, definite labels) provided by Snorkel, there is a more noticeable decrease in precision for the ‘SEA’ class, dropping to approximately 0.882. However, the precision for ‘JUNGLE’ remains high at around 0.923.
  • Accuracy (ACC): The overall accuracy of 0.9144, although slightly lower than the original labels, showcases the model’s ability to maintain a strong classification performance with Snorkel hard labels.

Conclusion

In the pursuit of harnessing the power of weak supervision, our journey has traversed a landscape where precision meets ambiguity and accuracy coexists with uncertainty. The application of labeling functions in the Snorkel framework has enabled us to encode our prior knowledge about the data, offering a nuanced perspective on image classification.

We also observed an advantage of using soft labels: their ability to mitigate the impact of class imbalances. In our dataset, where the ‘SEA’ and ‘JUNGLE’ classes had differing numbers of instances, soft labels allowed for a more nuanced representation of uncertainty. This ensured that the precision for the two classes stayed relatively close, compared to the more discrete hard labels. In an imbalanced dataset, the imprecision introduced by labeling functions might disproportionately affect the minority class. Soft labels, by representing class assignments probabilistically, provided the flexibility for the model to balance precision between the classes more effectively. This balancing act is crucial, especially in applications where misclassifying the minority class carries significant consequences.

It’s important to note that while our labeling functions in this example were relatively straightforward to implement, many real-world scenarios pose complex challenges. Designing accurate labeling functions can be intricate, requiring domain expertise and careful consideration. Despite these challenges, our study demonstrates that noisy labels, when harnessed intelligently through weak supervision techniques, can still offer valuable insights and contribute to robust model training.

Our exploration underscores the resilience of machine learning models in the face of noisy or uncertain labels. Even when labeling functions are not perfect, the intelligent integration of these noisy annotations can lead to significant advancements in model performance. Embracing the inherent noise in weakly supervised data and leveraging techniques like Snorkel not only expands the scope of feasible applications but also highlights the adaptability and learning potential of modern machine learning systems.

GitHub

You can find the code at the project GitHub repository.
