Fine-Tuning LLaMA 3.2-11B-Vision for Product Descriptions
The complete implementation is available on GitHub: https://github.com/ramintoosi/product_description
Introduction
Large vision-language models (LVLMs) like LLaMA 3.2-11B-Vision have revolutionized AI-powered content generation. However, their performance can be significantly enhanced when fine-tuned on domain-specific data. In my recent project, I fine-tuned LLaMA 3.2-11B-Vision to generate high-quality product descriptions for images of girls’ clothing.
The goal was to create engaging, detailed, and accurate descriptions that could be used for e-commerce platforms, enhancing product listings and improving customer experience. By leveraging a carefully curated dataset and optimizing the fine-tuning process, I was able to achieve impressive results. In this blog post, I’ll walk through the entire process—dataset preparation, fine-tuning methodology, challenges faced, and key takeaways.
Stay tuned to learn how you can fine-tune LLaMA 3.2-11B-Vision for your own vision-language tasks!
Setup
Clone the repo:
git clone https://github.com/ramintoosi/product_description
cd product_description
Then follow the Unsloth installation guide to install the dependencies.
Dataset Preparation
Before fine-tuning LLaMA 3.2-11B-Vision, the first step is to acquire and prepare the dataset. The dataset consists of images of girls’ clothing along with corresponding product descriptions, which will be used to train the model to generate high-quality text based on image inputs. I gathered this dataset by crawling a web store, ensuring that it contains diverse and well-structured product listings. While I cannot share the crawling code, the dataset itself is publicly available.
To get started, we need to download the dataset from Google Drive and extract it into a working directory. The following commands will accomplish this:
# Install gdown if not already installed
pip install gdown
# Download the dataset from Google Drive
gdown --id 14PptNxqI7D6YuTiPOjt1H0uLaa8Qr0qF
# Unzip the dataset into the 'data' directory
unzip -q product_description_data.zip -d data
Once extracted, the dataset will be available in the data/ directory, ready for preprocessing. In the next section, we will explore the dataset structure and prepare it for fine-tuning.
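Before writing any preprocessing code, it is worth taking a quick look at the CSV that ships with the archive. Below is a minimal sketch, assuming the file is named data.csv and sits directly under data/ (as the loading code in the next section expects):

import pandas as pd

# Peek at the raw CSV before any cleaning
df = pd.read_csv("data/data.csv")
print(df.shape)             # number of product listings and columns
print(df.columns.tolist())  # expected: name, brand, image_path, site_category, supply_category
print(df.head(3))           # a few raw rows, including uncleaned product names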
Loading and Converting the Data
With the dataset downloaded and extracted, the next step is to load the data, clean it, and convert it into a format suitable for fine-tuning LLaMA 3.2-11B-Vision with Unsloth. This involves:
- Loading the dataset from a CSV file that contains product names, brand names, and image paths.
- Cleaning the data by removing unnecessary codes, model numbers, and redundant information from product names.
- Converting the data into a structured conversation format that aligns with the input-output style expected by LLaMA 3.2-11B-Vision.
Step 1: Loading and Cleaning the Data
The following Python function reads the dataset, drops irrelevant columns, removes duplicate entries, and applies text cleaning rules to refine product names:
import os
from typing import TypedDict

import pandas as pd


class Data(TypedDict):
    name: str
    brand: str
    image_path: str


def load_data(data_root: str) -> list[Data]:
    """
    Load data from csv file
    :param data_root: data root folder where data.csv and images are stored
    :return: list of dictionaries, keys are column names and values are data
    """
    data = pd.read_csv(os.path.join(data_root, 'data.csv'))
    data.drop(columns=['site_category', 'supply_category'], inplace=True)
    clean_data(data)
    return data.to_dict(orient='records')


def clean_data(data: pd.DataFrame):
    """
    Clean data by removing product codes and model numbers.
    :param data: data frame
    """
    # Remove duplicate rows
    data.drop_duplicates(inplace=True)

    # Remove codes, model numbers, and specific patterns
    pattern = r'\b(کد|مدل)\b(\s+[A-Za-z0-9]+)?|\bمجموعه\b\s+(\d+)\s+(\w+)'
    data["name"] = data["name"].str.replace(pattern, '', regex=True)

    # Remove any remaining standalone alphanumeric codes in English
    pattern2 = r'\b[A-Za-z0-9_-]+\b'
    data["name"] = data["name"].str.replace(pattern2, '', regex=True)


data = load_data('./data')
print(data[0])  # Sample output
Sample Output:
{
    "name": "ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش ماشین پلیس",
    "brand": "سپیدپوش",
    "image_path": "data/dataset/image/59e8b029-864c-4788-bf4f-25d9f0ad494d.jpg"
}
Step 2: Converting Data to a Conversation Format
LLaMA 3.2-11B-Vision expects data in a structured conversational format, where the user provides an instruction along with an image, and the model generates a response. The following function transforms each data sample into this format:
from PIL import Image

instruction = """Create a Short Product description based on the provided ##PRODUCT BRAND NAME## and the image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT BRAND NAME##: {brand_name}
"""


def convert_to_conversation(sample):
    image_path = sample["image_path"].replace('dataset/', '')
    conversation = [
        {"role": "user",
         "content": [
             {"type": "text", "text": instruction.format(brand_name=sample["brand"])},
             {"type": "image", "image_url": f'file://{image_path}'}]
         },
        {"role": "assistant",
         "content": [
             {"type": "text", "text": sample["name"]}]
         },
    ]
    return {"messages": conversation}

converted_dataset = [convert_to_conversation(sample) for sample in data]
print(converted_dataset[0]) # Sample output
Sample Output:
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Create a Short Product description based on the provided ##PRODUCT BRAND NAME## and the image.\nOnly return description. The description should be SEO optimized and for a better mobile search experience.\n\n##PRODUCT BRAND NAME##: سپیدپوش\n"
                },
                {
                    "type": "image",
                    "image_url": "file://data/image/59e8b029-864c-4788-bf4f-25d9f0ad494d.jpg"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش ماشین پلیس"
                }
            ]
        }
    ]
}
Efficient Image Handling: Avoiding Memory Overload
In the original Unsloth library tutorial, image inputs are loaded into memory using:
{"type": "image", "image": Image.open(image_path)}
This method loads all images at once, which can cause memory crashes when dealing with large datasets like ours.
To avoid excessive memory usage, we store image file paths instead of loading images into memory. This way, the model loads images dynamically during training instead of keeping them in RAM.
{"type": "image", "image_url": f'file://{image_path}'}
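To make the idea concrete, here is a minimal, hypothetical helper (not part of Unsloth) that resolves such a file:// reference to a PIL image only when it is actually needed, for example inside a data collator, so that only the current batch's images sit in RAM:

from PIL import Image

def resolve_image(content_entry: dict) -> Image.Image:
    """Open an image lazily from the 'file://' URI stored in a message entry.

    Hypothetical helper for illustration: a data collator would call this per
    batch, so images are read from disk on demand instead of being preloaded.
    """
    path = content_entry["image_url"].removeprefix("file://")
    return Image.open(path)

# Example: resolve the image of the first training sample on demand
image_entry = converted_dataset[0]["messages"][0]["content"][1]
image = resolve_image(image_entry)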
Loading and Fine-tuning the LLaMA 3.2-11B-Vision Model with Unsloth
Now that we have our dataset ready, the next step is to load and configure the model for fine-tuning. We will use Unsloth’s FastVisionModel, which provides optimized loading and memory-efficient training.
Step 1: Import Required Libraries
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
Step 2: Load Pretrained Model and Tokenizer
We initialize the LLaMA 3.2-11B-Vision-Instruct model with 4-bit quantization to optimize memory usage.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,                     # Use 4-bit quantization to reduce memory usage
    use_gradient_checkpointing="unsloth",  # Activates checkpointing for long context
)
✅ 4-bit quantization reduces memory usage significantly, allowing us to fine-tune on consumer-grade GPUs.
✅ Gradient checkpointing helps handle long-context sequences efficiently.
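To see the effect on your own hardware, you can check how much GPU memory is reserved right after loading the 4-bit model. A rough, illustrative check:

import torch

# Rough check of GPU memory usage right after loading the 4-bit model
gpu = torch.cuda.get_device_properties(0)
reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"{gpu.name}: {reserved_gb:.2f} GB reserved of {gpu.total_memory / 1024**3:.2f} GB total")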
Checking the Model’s Pretrained Vision Capabilities
Before fine-tuning, it’s useful to check whether LLaMA 3.2-11B-Vision already understands and analyzes images effectively. We do this by running an inference test using the pretrained model.
Step 2.1: Enable Inference Mode
FastVisionModel.for_inference(model) # Switch to inference mode
Step 2.2: Prepare Image Input
from PIL import Image
image = Image.open(data[0]["image_path"].replace('dataset/', ''))
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        # Note: the raw template is passed here without filling {brand_name},
        # which is why the placeholder appears verbatim in the output below.
        {"type": "text", "text": instruction}
    ]}
]
Step 2.3: Tokenize Input for the Model
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
Step 2.4: Generate the Model’s Response
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
Model-generated description:
“The image showcases a children’s pajama set from the brand {brand_name}. The pajama shirt is a long-sleeved grey and white striped shirt featuring a playful police car design on the front. The police car is depicted in blue, with a white dome on top, adorned with a red siren light, and sporting black wheels. Below the car, the word ‘POLICE’ is written in blue text. The pajama bottoms are solid black, made from a stretchy fabric designed to move with the wearer.”
Original dataset caption:
“ست تی شرت آستین بلند و شلوار بچگانه سپیدپوش ماشین پلیس” (roughly: “Sepidpoosh children’s long-sleeve T-shirt and pants set, police car”)
Step 3: Enable Parameter-Efficient Fine-Tuning (PEFT)
To fine-tune the model efficiently, we use LoRA (Low-Rank Adaptation). This method only trains specific layers, drastically reducing GPU memory consumption.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,     # Vision layers are frozen to focus on text generation
    finetune_language_layers=True,    # Enable fine-tuning of language layers
    finetune_attention_modules=True,  # Fine-tune attention layers for better adaptation
    finetune_mlp_modules=True,        # Fine-tune MLP layers for better generalization
    r=16,               # Rank of the LoRA adaptation; higher values improve accuracy but increase overfitting risk
    lora_alpha=16,      # LoRA scaling factor (recommended: equal to `r`)
    lora_dropout=0,     # No dropout for stable fine-tuning
    bias="none",        # No additional bias parameters
    random_state=3407,  # Ensures reproducibility
    use_rslora=False,   # Rank-stabilized LoRA disabled (can improve LoRA stability in some cases)
    loftq_config=None,  # LoftQ disabled (for further quantization efficiency)
    # target_modules="all-linear",  # Optional: specifies which layers to adapt
)
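A quick sanity check after wrapping the model with LoRA adapters is to count how many parameters are actually trainable; with r=16 and the vision layers frozen, this should be a small fraction of the full 11B:

# Count how many parameters LoRA actually trains (works for any torch module)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")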
Step 4: Fine-Tuning the Model
Now that we’ve confirmed the pretrained model can analyze images, we move on to fine-tuning it for generating concise, SEO-optimized product descriptions. We use the Trainer API to fine-tune the model on our dataset.
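The trainer itself is defined in the repository; for reference, a minimal sketch following the standard Unsloth vision fine-tuning recipe (TRL's SFTTrainer plus Unsloth's vision data collator) is shown below. The hyperparameters here are illustrative placeholders, not necessarily the exact values used in this project:

from trl import SFTTrainer, SFTConfig
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator

FastVisionModel.for_training(model)  # Switch from inference back to training mode

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # Required for image inputs
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,   # Illustrative; adjust to your GPU memory
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
        # The following settings are required for vision fine-tuning with the Unsloth collator:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=4,
        max_seq_length=2048,
    ),
)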
trainer_stats = trainer.train()
Evaluating the Fine-Tuned Model
Now that our model is trained, let’s test it on a sample image and compare its new description with the original dataset caption.
Step 1: Select a Sample & Enable Inference Mode
sample_idx = 157
FastVisionModel.for_inference(model) # Switch back to inference mode
Step 2: Load and Prepare the Image
from PIL import Image
image = Image.open(data[sample_idx]["image_path"].replace('dataset/', ''))
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction.format(brand_name=data[sample_idx]["brand"])}
    ]}
]
Step 3: Tokenize Input for the Model
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
Step 4: Generate a Description
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
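If you need the generated description as a plain string (for example, to write it into a product listing) rather than streamed to stdout, you can drop the streamer and decode only the newly generated tokens. A small sketch, assuming the processor returned by Unsloth exposes the usual decode method:

# Generate without a streamer and decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True,
                            temperature=1.5, min_p=0.1)
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
description = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(description)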
Results
Model Caption:
جوراب ساق بلند دخترانه کاتامینا (roughly: “Catamina girls’ knee-high socks”)
Original Caption:
جوراب دخترانه کاتامینا (roughly: “Catamina girls’ socks”)
GitHub Repository
The complete implementation is available on GitHub: https://github.com/ramintoosi/product_description