Google recently introduced PaliGemma, a new lightweight vision-language model. The model was released on May 14, 2024 and has multimodal capabilities.
A vision-language model (VLM) is an advanced type of artificial intelligence that integrates visual and textual data to perform tasks that require understanding and generating both images and language. These models combine techniques from computer vision and natural language processing, enabling them to analyze images, generate descriptive captions, answer questions about visual content, and even engage in complex visual reasoning.
VLMs can understand context, infer relationships, and produce coherent multimodal outputs by leveraging large-scale datasets and sophisticated neural architectures. This makes them powerful tools for applications in fields such as image recognition, automated content creation, and interactive AI systems.
Gemma is a family of lightweight, cutting-edge open models developed using the same research and technology as the Gemini models. PaliGemma, a powerful open vision-language model (VLM), was recently added to this family. Inspired by PaLI-3 and built from the SigLIP vision model and the Gemma language model, it is designed for top-tier performance in tasks like image and short video captioning, visual question answering, text recognition in images, object detection, and segmentation.
Both the pretrained and fine-tuned checkpoints are open-sourced in various resolutions, plus task-specific ones for immediate use.
PaliGemma combines SigLIP-So400m as the image encoder with Gemma-2B as the text decoder. SigLIP is a state-of-the-art model capable of understanding both images and text; like CLIP, it features jointly trained image and text encoders. The combined PaliGemma model, inspired by PaLI-3, is pre-trained on image-text data and can be easily fine-tuned for downstream tasks like captioning and referring segmentation. Gemma, a decoder-only model, handles text generation. By integrating SigLIP's image encoding with Gemma through a linear adapter, PaliGemma becomes a powerful vision-language model.
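To make this concrete, below is a rough, self-contained sketch of the idea using illustrative tensor sizes only; it is not the actual implementation, and the real adapter and dimensions live inside the pretrained model.

import torch
import torch.nn as nn

# Illustrative sizes: SigLIP-So400m uses a hidden size of 1152 and Gemma-2B uses 2048;
# 256 patch embeddings correspond to a 224x224 image with 14x14 patches.
num_patches, siglip_dim, gemma_dim = 256, 1152, 2048
image_features = torch.randn(1, num_patches, siglip_dim)  # output of the SigLIP image encoder
text_embeddings = torch.randn(1, 12, gemma_dim)           # output of Gemma's token embedding layer

adapter = nn.Linear(siglip_dim, gemma_dim)                 # the linear adapter between the two models
projected_image = adapter(image_features)

# The projected image embeddings and the text embeddings are fed together to the Gemma decoder
decoder_input = torch.cat([projected_image, text_embeddings], dim=1)
print(decoder_input.shape)  # torch.Size([1, 268, 2048])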
The PaliGemma release includes several kinds of checkpoints and configurations:
Mix Checkpoints: pretrained models fine-tuned on a mixture of tasks. They are suited to general-purpose inference with free-text prompts and are intended for research purposes only.
FT Checkpoints: fine-tuned models, each specialized on a different academic benchmark. They are available in multiple resolutions and are intended for research purposes only.
Model Resolutions: 224x224, 448x448, and 896x896.
Model Precisions: bfloat16, float16, and float32.
Repository Structure: each repository contains the checkpoints for a given resolution and task, with one revision per precision; the main branch holds the float32 checkpoints.
Compatibility: the repositories work with both the original JAX implementation and Hugging Face transformers.
Memory Considerations: the higher-resolution models (448x448 and 896x896) require considerably more memory. They mainly help with fine-grained tasks such as OCR; for most tasks, the 224x224 versions are a good choice.
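For reference, the checkpoints follow a consistent naming scheme on the Hugging Face Hub. A few representative model ids are listed below; the mix checkpoint at 224x224 resolution is the one used later in this article.

# A few representative PaliGemma checkpoint ids on the Hugging Face Hub:
#   google/paligemma-3b-pt-224    pretrained, 224x224
#   google/paligemma-3b-pt-448    pretrained, 448x448
#   google/paligemma-3b-mix-224   mix (multi-task fine-tuned), 224x224
#   google/paligemma-3b-mix-448   mix (multi-task fine-tuned), 448x448
model_id = "google/paligemma-3b-mix-224"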
We will explore how to use 🤗 transformers for PaliGemma inference.
Let us first install the necessary libraries with the update flag to ensure we are using the latest versions of 🤗 transformers and other dependencies.
!pip install -q -U accelerate bitsandbytes git+https://github.com/huggingface/transformers.git
To use PaliGemma, you need to accept the Gemma license. Visit the repository to request access. If you’ve already accepted the Gemma license, you’re good to go. Once you have access, log in to the Hugging Face Hub using notebook_login() and enter your access token by running the cell below.
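The cell below is a minimal example using the huggingface_hub library:

from huggingface_hub import notebook_login

# Prompts for your Hugging Face access token inside the notebook
notebook_login()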
input_text = "how many dogs are there in the image?"
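The prompt above asks a question about an image, so we also need to load one. Here is a minimal sketch using PIL; the filename is a placeholder, and any image will do.

from PIL import Image

# Placeholder path: replace with the image you want to ask questions about
input_image = Image.open("dog.jpg").convert("RGB")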
Next, we will import the necessary classes: AutoTokenizer, PaliGemmaForConditionalGeneration, and PaliGemmaProcessor from the transformers library. Once the imports are done, we will load the pre-trained PaliGemma model with the torch.bfloat16 data type, which provides a good balance between performance and precision on modern hardware.
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
Once the code is executed, the processor will preprocess both the image and the text.
inputs = processor(text=input_text, images=input_image,
                   padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
model.to(device)
inputs = inputs.to(dtype=model.dtype)
Next, use the model to generate the text based on the input question,
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

print(processor.decode(output[0], skip_special_tokens=True))
Output:
how many dogs are there in the image? 1
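Beyond free-text questions, the mix checkpoints also respond to short task prefixes such as captioning, detection, and OCR prompts. The sketch below asks for a caption instead; the prompt string follows the conventions described in the PaliGemma documentation, and the exact output will depend on your image.

caption_inputs = processor(text="caption en", images=input_image,
                           padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
caption_inputs = caption_inputs.to(dtype=model.dtype)

with torch.no_grad():
    caption_output = model.generate(**caption_inputs, max_new_tokens=50)

print(processor.decode(caption_output[0], skip_special_tokens=True))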
We can also load the model in 4-bit or 8-bit precision to reduce the computational and memory resources required for training and inference. First, initialize the BitsAndBytesConfig.
from transformers import BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
Next, reload the model and pass in the above object as quantization_config,
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
device="cuda"
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16,
quantization_config=nf4_config, device_map={"":0})
processor = PaliGemmaProcessor.from_pretrained(model_id)
Generate the output,
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

print(processor.decode(output[0], skip_special_tokens=True))
Output:
how many dogs are there in the image? 1
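To get a rough sense of the memory savings from 4-bit loading, you can check the model's footprint. This is a quick sketch; exact numbers will vary with hardware and transformers version.

# Reports the approximate memory taken by the model's parameters, in bytes
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")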
To understand what happens under the hood when you run inference with PaliGemma, it helps to walk through the steps the processor and the model perform:
Tokenizing the Input Text: the text is tokenized as usual. A <bos> token is added at the beginning, and a newline token (\n) is appended. This newline is important, as it was part of the model's training input prompt.
Adding Image Tokens: the tokenized text is prefixed with a fixed number of <image> tokens. The number of <image> tokens depends on the input image resolution and the SigLIP model's patch size of 14. For PaliGemma models: the 224x224 checkpoints use 256 <image> tokens (224/14 * 224/14), the 448x448 checkpoints use 1024 <image> tokens, and the 896x896 checkpoints use 4096 <image> tokens.
Memory Considerations: larger images produce much longer input sequences, and therefore require significantly more memory to pass through the language model.
Generating Token Embeddings: the full input sequence (image tokens plus text tokens) is passed through the language model's embedding layer to produce token embeddings.
Processing the Image: the input image is resized to the required size and passed through the SigLIP image encoder, which produces one embedding per image patch. These image embeddings are then projected with the linear adapter so they match the dimensionality of the text embeddings.
Combining Image and Text Embeddings: the projected image embeddings replace the placeholder <image> embeddings and are merged with the text embeddings.
Autoregressive Text Generation: text generation then proceeds autoregressively as usual, conditioned on the full prompt (<bos> + prompt + \n).
Simplified Inference: all of this is handled for you by the processor and the model, so in practice you can simply call the high-level transformers API as shown in the examples above.
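You can verify the image-token count from the inputs prepared earlier. This is a quick sanity check, assuming the 224-resolution checkpoint loaded above.

# 224/14 = 16 patches per side, so 16 * 16 = 256 <image> tokens are prepended
image_size, patch_size = 224, 14
num_image_tokens = (image_size // patch_size) ** 2
print(num_image_tokens)           # 256
print(inputs["input_ids"].shape)  # (1, 256 + number of text tokens)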
Vision-language models like PaliGemma have a wide range of applications across various industries. A few examples are listed below:
Image captioning: automatically generating descriptions for photos, products, or media libraries, improving accessibility and searchability.
Visual question answering: answering natural-language questions about an image, useful for customer support, education, and assistive tools.
Text recognition in images: reading text from documents, signs, and screenshots for digitization and data-entry workflows.
Object detection and segmentation: locating and outlining objects in images for retail, robotics, and quality-inspection use cases.
These are just a few examples, and the potential applications of vision-language models continue to expand as researchers and developers explore new use cases and integrate these technologies into various domains.
In conclusion, PaliGemma represents a significant advancement in the field of vision-language models, offering a powerful tool for understanding and generating content based on images. With its ability to seamlessly integrate visual and textual information, PaliGemma opens up new paths for research and application across a wide range of industries. From image captioning to optical character recognition and beyond, PaliGemma’s capabilities hold promise for driving innovation and addressing complex problems in the digital age.
We hope you enjoyed reading the article!