Fine-Tuning LLaMA 3.2-11B-Vision for Product Descriptions

The complete implementation is available on GitHub.

Introduction

Large vision-language models (LVLMs) like LLaMA 3.2-11B-Vision have revolutionized AI-powered content generation. However, their performance can be significantly enhanced when fine-tuned on domain-specific data. In my recent project, I fine-tuned LLaMA 3.2-11B-Vision to generate high-quality product descriptions for images of girls’ clothing.

The goal was to create engaging, detailed, and accurate descriptions that could be used for e-commerce platforms, enhancing product listings and improving customer experience. By leveraging a carefully curated dataset and optimizing the fine-tuning process, I was able to achieve impressive results. In this blog post, I’ll walk through the entire process—dataset preparation, fine-tuning methodology, challenges faced, and key takeaways.

Read on to learn how you can fine-tune LLaMA 3.2-11B-Vision for your own vision-language tasks!

Setup

Either open the Colab version: Open In Colab

or clone the repo:

git clone https://github.com/ramintoosi/product_description
cd product_description

and follow the Unsloth installation guide.

Dataset Preparation

Before fine-tuning LLaMA 3.2-11B-Vision, the first step is to acquire and prepare the dataset. The dataset consists of images of girls’ clothing along with corresponding product descriptions, which will be used to train the model to generate high-quality text based on image inputs. I gathered this dataset by crawling a web store, ensuring that it contains diverse and well-structured product listings. While I cannot share the crawling code, the dataset itself is publicly available.

To get started, we need to download the dataset from Google Drive and extract it into a working directory. The following commands will accomplish this:

# Install gdown if not already installed
pip install gdown  

# Download the dataset from Google Drive
gdown --id 14PptNxqI7D6YuTiPOjt1H0uLaa8Qr0qF  

# Unzip the dataset into the 'data' directory
unzip -q product_description_data.zip -d data  

Once extracted, the dataset will be available in the data/ directory, ready for preprocessing. In the next section, we will explore the dataset structure and prepare it for fine-tuning.
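Before moving on, a quick sanity check helps confirm everything landed where we expect. This is a minimal sketch, assuming the archive unpacks to data/data.csv plus an image folder at data/image/ (the paths used later in this post):

import os
import pandas as pd

# Hypothetical layout check: data/data.csv plus images under data/image/
df = pd.read_csv("data/data.csv")
print(df.shape)                    # number of rows and columns
print(df.columns.tolist())         # should include name, brand, image_path
print(len(os.listdir("data/image")), "image files found")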

Loading and Converting the Data

With the dataset downloaded and extracted, the next step is to load the data, clean it, and convert it into a format suitable for fine-tuning LLaMA 3.2-11B-Vision with Unsloth. This involves:

  1. Loading the dataset from a CSV file that contains product names, brand names, and image paths.
  2. Cleaning the data by removing unnecessary codes, model numbers, and redundant information from product names.
  3. Converting the data into a structured conversation format that aligns with the input-output style expected by LLaMA 3.2-11B-Vision.

Step 1: Loading and Cleaning the Data

The following Python function reads the dataset, drops irrelevant columns, removes duplicate entries, and applies text cleaning rules to refine product names:

import os
from typing import TypedDict
import pandas as pd

class Data(TypedDict):
    name: str
    brand: str
    image_path: str

def load_data(data_root: str) -> list[Data]:
    """
    Load data from csv file
    :param data_root: data root folder where data.csv and images are stored
    :return: list of dictionaries, keys are column names and values are data
    """
    data = pd.read_csv(os.path.join(data_root, 'data.csv'))
    data.drop(columns=['site_category', 'supply_category'], inplace=True)
    clean_data(data)
    return data.to_dict(orient='records')

def clean_data(data: pd.DataFrame):
    """
    Clean data by removing product codes and model numbers.
    :param data: data frame
    """
    # Remove duplicate rows
    data.drop_duplicates(inplace=True)

    # Remove codes, model numbers, and specific patterns
    pattern = r'\b(کد|مدل)\b(\s+[A-Za-z0-9]+)?|\bمجموعه\b\s+(\d+)\s+(\w+)'
    data["name"] = data["name"].str.replace(pattern, '', regex=True)

    # Remove any remaining standalone alphanumeric codes in English
    pattern2 = r'\b[A-Za-z0-9_-]+\b'
    data["name"] = data["name"].str.replace(pattern2, '', regex=True)

data = load_data('./data')
print(data[0])  # Sample output

Sample Output:

{
    "name": "ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش  ماشین پلیس",
    "brand": "سپیدپوش",
    "image_path": "data/dataset/image/59e8b029-864c-4788-bf4f-25d9f0ad494d.jpg"
}

Step 2: Converting Data to a Conversation Format

LLaMA 3.2-11B-Vision expects data in a structured conversational format, where the user provides an instruction along with an image, and the model generates a response. The following function transforms each data sample into this format:

from PIL import Image

instruction = """Create a Short Product description based on the provided ##PRODUCT BRAND NAME## and the image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT BRAND NAME##: {brand_name}
"""

def convert_to_conversation(sample):
    image_path = sample["image_path"].replace('dataset/', '')
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction.format(brand_name=sample["brand"])},
            {"type" : "image", "image_url" : f'file://{image_path}'} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["name"]} ]
        },
    ]
    return { "messages" : conversation }

converted_dataset = [convert_to_conversation(sample) for sample in data]
print(converted_dataset[0])  # Sample output

Sample Output:

{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Create a Short Product description based on the provided ##PRODUCT BRAND NAME## and the image.\nOnly return description. The description should be SEO optimized and for a better mobile search experience.\n\n##PRODUCT BRAND NAME##: سپیدپوش\n"
                },
                {
                    "type": "image",
                    "image_url": "file://data/image/59e8b029-864c-4788-bf4f-25d9f0ad494d.jpg"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش  ماشین پلیس"
                }
            ]
        }
    ]
}

Efficient Image Handling: Avoiding Memory Overload

In the original Unsloth library tutorial, image inputs are loaded into memory using:

{"type": "image", "image": Image(image_path)}

This method loads all images at once, which can cause memory crashes when dealing with large datasets like ours.

To avoid excessive memory usage, we store image file paths instead of loading images into memory. This way, images are loaded dynamically during training rather than being kept in RAM.

{"type": "image", "image_url": f'file://{image_path}'}

Loading and Fine-tuning the LLaMA 3.2-11B-Vision Model with Unsloth

Now that we have our dataset ready, the next step is to load and configure the model for fine-tuning. We will use Unsloth’s FastVisionModel, which provides optimized loading and memory-efficient training.

Step 1: Import Required Libraries

from unsloth import FastVisionModel  # FastLanguageModel for LLMs
import torch

Step 2: Load Pretrained Model and Tokenizer

We initialize the LLaMA 3.2-11B-Vision-Instruct model with 4-bit quantization to optimize memory usage.

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,  # Use 4-bit quantization to reduce memory usage
    use_gradient_checkpointing="unsloth",  # Activates checkpointing for long context
)

4-bit quantization reduces memory usage significantly, allowing us to fine-tune on consumer-grade GPUs.
Gradient checkpointing helps handle long-context sequences efficiently.
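As a rough way to see this in practice, you can check how much GPU memory the quantized model occupies right after loading. A minimal sketch, assuming a single CUDA device (the exact figure depends on your GPU and driver):

import torch

# Approximate GPU memory held by the 4-bit model right after loading
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")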

Checking the Model’s Pretrained Vision Capabilities

Before fine-tuning, it’s useful to check whether LLaMA 3.2-11B-Vision already understands and analyzes images effectively. We do this by running an inference test using the pretrained model.

Step 2.1: Enable Inference Mode

FastVisionModel.for_inference(model)  # Switch to inference mode

Step 2.2: Prepare Image Input

from PIL import Image

image = Image.open(data[0]["image_path"].replace('dataset/', ''))

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

Step 2.3: Tokenize Input for the Model

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

Step 2.4: Generate the Model’s Response

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)

Model-generated description (note that we passed the raw instruction here, so the literal {brand_name} placeholder appears in the output):

“The image showcases a children’s pajama set from the brand {brand_name}. The pajama shirt is a long-sleeved grey and white striped shirt featuring a playful police car design on the front. The police car is depicted in blue, with a white dome on top, adorned with a red siren light, and sporting black wheels. Below the car, the word ‘POLICE’ is written in blue text. The pajama bottoms are solid black, made from a stretchy fabric designed to move with the wearer.”

Original dataset caption:

“ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش ماشین پلیس” (children’s long-sleeved T-shirt and pants set by Sepidpoosh, with a police car design)

Step 3: Enable Parameter-Efficient Fine-Tuning (PEFT)

To fine-tune the model efficiently, we use LoRA (Low-Rank Adaptation). This method only trains specific layers, drastically reducing GPU memory consumption.

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,  # Vision layers are frozen to focus on text generation
    finetune_language_layers=True,  # Enable fine-tuning of language layers
    finetune_attention_modules=True,  # Fine-tune attention layers for better adaptation
    finetune_mlp_modules=True,  # Fine-tune MLP layers for better generalization

    r=16,  # Controls rank of LoRA adaptation; higher values improve accuracy but increase overfitting risk
    lora_alpha=16,  # LoRA scaling factor (recommended: equal to `r`)
    lora_dropout=0,  # No dropout for stable fine-tuning
    bias="none",  # No additional bias parameters
    random_state=3407,  # Ensures reproducibility
    use_rslora=False,  # Rank-stabilized LoRA disabled (can improve LoRA stability in some cases)
    loftq_config=None,  # LoftQ disabled (for further quantization efficiency)
    # target_modules="all-linear",  # Optional: Specifies which layers to adapt
)
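To confirm that only a small fraction of the weights will be updated, you can print the trainable-parameter count. A minimal check, assuming the object returned by get_peft_model exposes the standard PEFT helper:

# With vision layers frozen and r=16, only a small share of parameters is trainable
model.print_trainable_parameters()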

Step 4: Fine-Tuning the Model

Now that we’ve confirmed the pretrained model can analyze images, we move on to fine-tuning it to generate concise, SEO-optimized product descriptions. We fine-tune on the converted dataset with a supervised fine-tuning trainer; the trainer object has to be constructed before training starts, as sketched below.
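A typical setup, following the Unsloth vision fine-tuning tutorial, uses TRL's SFTTrainer together with Unsloth's UnslothVisionDataCollator. The hyperparameters below (batch size, max_steps, learning rate) are illustrative placeholders, so check the repo for the exact values used in this project:

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model)  # switch the model back into training mode

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # batches images and text together
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,   # illustrative values; adjust to your GPU
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,                    # or set num_train_epochs for a full run
        learning_rate=2e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
        # Required for vision fine-tuning with Unsloth:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=4,
        max_seq_length=2048,
    ),
)

With the trainer in place, training is launched with: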

trainer_stats = trainer.train()

Evaluating the Fine-Tuned Model

Now that our model is trained, let’s test it on a sample image and compare its new description with the original dataset caption.

Step 1: Select a Sample & Enable Inference Mode

sample_idx = 157
FastVisionModel.for_inference(model)  # Switch back to inference mode

Step 2: Load and Prepare the Image

from PIL import Image

image = Image.open(data[sample_idx]["image_path"].replace('dataset/', ''))

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction.format(brand_name=data[sample_idx]["brand"])}
    ]}
]

Step 3: Tokenize Input for the Model

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

Step 4: Generate a Description

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)

Results

Fine-tuned model caption:

جوراب ساق بلند دخترانه کاتامینا (Katamina girls’ knee-high socks)

Original dataset caption:

جوراب دخترانه کاتامینا (Katamina girls’ socks)

GitHub Repository

The complete implementation is available on GitHub and Colab.

Reference

[1] Unsloth Tutorials

[2] ChatGPT