In this article, we will learn how to run inference with a quantized version of IDEFICS and fine-tune IDEFICS-9b, a variant of this innovative visual language model, on an A100 GPU. The fine-tuning process uses parameter-efficient techniques like LoRA to improve the model's performance in targeted areas.
When running inference and fine-tuning IDEFICS-9b, the high processing power of A100 GPUs can significantly speed up the process, allowing you to iterate and experiment more quickly.
- torch>=2.0
- torchvision
- transformers
- datasets
IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) is an open-access visual language model that can process sequences of images and text to produce text outputs, similar to GPT-4.
The model is based on DeepMind’s Flamingo model, which is not publicly available. IDEFICS utilizes publicly accessible data and models, specifically LLaMA v1 and OpenCLIP. It is offered in two versions: a base version and an instructed version. Each of these versions is available in two sizes—9 billion parameters and 80 billion parameters.
Recently, IDEFICS2 was released. The model is trained to answer questions about images, such as "What color is the car?" or "How many people are in the picture?". It can also describe visual content, create stories grounded in multiple images, extract information from documents, and perform basic arithmetic operations.
Fine-tuning is a process that involves taking a pre-trained model and training it for a specific task using a specific dataset. This process involves updating the model’s weights based on the new task’s data, typically with a lower learning rate, to make slight adjustments without drastically altering the pre-trained knowledge.
Pre-trained models are typically trained on large, diverse datasets and capture a wide range of general features. Fine-tuning adapts them to perform well on a specific task, such as sentiment analysis, image classification, or any domain-specific application. Using data specific to the new task, the model learns the nuances and specifics of that data, leading to improved accuracy and performance.
Fine-tuning requires significantly less computational resources and time than training a model from scratch.
In this case, we will use ‘TheFusion21/PokemonCards’ to fine-tune the model. Here is the data structure of this dataset.
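Once the datasets library (installed in the next section) is available, you can load the dataset and inspect its structure. A minimal sketch, assuming the dataset exposes columns such as image_url, caption, and name (which the transform step later relies on):

```python
from datasets import load_dataset

# Download the Pokemon cards dataset from the Hugging Face Hub
ds = load_dataset("TheFusion21/PokemonCards")

print(ds)              # available splits and row counts
print(ds["train"][0])  # one example record with its fields
```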
We will start by installing a few necessary packages. We recommend that our users spin up a Jupyter Notebook and start working.
We will install the datasets library. This library provides tools for accessing and managing datasets for training and evaluating machine learning models. We are also installing transformers and bitsandbytes for efficient fine-tuning.
Bitsandbytes is an incredible library that allows loading models in 4-bit precision, making it extremely useful for fine-tuning large language models with QLoRA.
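In a notebook cell, the installation might look like this (we also include accelerate, which the automatic device mapping below assumes, and peft for the LoRA step):

```python
# Install libraries for data loading, the model, 4-bit quantization, and LoRA
!pip install -q datasets transformers bitsandbytes accelerate peft
```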
Once these packages are successfully installed, we will import the necessary libraries.
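The imports below cover everything used in the rest of this walkthrough:

```python
import torch
import torchvision.transforms as transforms
from PIL import Image

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    IdeficsForVisionText2Text,
    Trainer,
    TrainingArguments,
)
```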
Next, we will load the quantized version of the model. We will select the device as 'cuda'; if CUDA is unavailable, we will fall back to the CPU.
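A short sketch of the device selection, using the base IDEFICS-9b checkpoint from the Hugging Face Hub:

```python
# Pick the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained IDEFICS-9b checkpoint on the Hugging Face Hub
checkpoint = "HuggingFaceM4/idefics-9b"
```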
We will load the model with 4-bit precision, which reduces memory usage and can speed up processing. Additionally, we will use double quantization, which can improve the accuracy of the 4-bit quantized model. Next, a processor is initialized to handle the model’s inputs and outputs using the pre-trained checkpoint.
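The quantization settings and processor initialization might look like the following; NF4 quantization with double quantization and float16 compute is a common choice for QLoRA-style loading:

```python
# 4-bit quantization: NF4 data type, double quantization for extra accuracy,
# and float16 as the compute data type
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# The processor bundles the tokenizer and image processor for IDEFICS
processor = AutoProcessor.from_pretrained(checkpoint)
```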
This code initializes the IdeficsForVisionText2Text model by loading a pre-trained version from the specified checkpoint. Next, we will apply the quantization settings defined in bnb_config to load the model in an efficient 4-bit precision format.
Additionally, the code uses automatic device mapping to distribute the model’s components across the available hardware, optimizing for performance and resource utilization.
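Putting that together:

```python
# Load IDEFICS-9b in 4-bit precision and let accelerate map layers
# across the available devices automatically
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)
```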
Once the downloads are over, we will print the model using print(model). This prints the entire model pipeline with the layer and embedding details.
We will use this model for inference and test the model.
The function processes input prompts, generates text using the model while filtering out unwanted tokens, and prints the resulting text. The function utilizes the tokenizer and processor to handle text tokenization and decoding, which ensures that the generated text adheres to specified constraints.
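Below is a sketch of such an inference helper, modeled on Hugging Face's IDEFICS examples; the bad-words filter keeps the model from emitting IDEFICS' special image tokens, and the image URL in the usage example is a placeholder for any picture you want to test:

```python
def check_inference(model, processor, prompts, max_new_tokens=50):
    tokenizer = processor.tokenizer
    # Prevent the special image tokens from appearing in the generated text
    bad_words_ids = tokenizer(
        ["<image>", "<fake_token_around_image>"], add_special_tokens=False
    ).input_ids
    eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    inputs = processor(prompts, return_tensors="pt").to(device)
    generated_ids = model.generate(
        **inputs,
        eos_token_id=[eos_token_id],
        bad_words_ids=bad_words_ids,
        max_new_tokens=max_new_tokens,
    )
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Placeholder URL: substitute any image you would like to ask about
image_url = "https://example.com/puppy.jpg"
check_inference(model, processor, [image_url, "Question: What's on the picture? Answer:"])
```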
Question: What’s on the picture? Answer: A puppy.
We will prepare the dataset that we will use for our fine-tuning task.
The convert_to_rgb function ensures images are in RGB format to handle different types of images. The ds_transforms function processes a batch of examples by transforming images, preparing text prompts, and converting everything into a format suitable for model training or inference. The function helps to apply necessary transformations, tokenizes the prompts, and sets up the inputs and labels for the model.
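A sketch of both functions, following the Hugging Face IDEFICS fine-tuning example; the column names (image_url, caption, name) match the PokemonCards dataset:

```python
def convert_to_rgb(image):
    # Card images are PNGs that may carry an alpha channel;
    # composite them onto a white background before converting to RGB
    if image.mode == "RGB":
        return image
    image_rgba = image.convert("RGBA")
    background = Image.new("RGBA", image_rgba.size, (255, 255, 255))
    return Image.alpha_composite(background, image_rgba).convert("RGB")


def ds_transforms(example_batch):
    image_size = processor.image_processor.image_size
    image_mean = processor.image_processor.image_mean
    image_std = processor.image_processor.image_std

    image_transform = transforms.Compose([
        convert_to_rgb,
        transforms.RandomResizedCrop(
            (image_size, image_size),
            scale=(0.9, 1.0),
            interpolation=transforms.InterpolationMode.BICUBIC,
        ),
        transforms.ToTensor(),
        transforms.Normalize(mean=image_mean, std=image_std),
    ])

    prompts = []
    for i in range(len(example_batch["caption"])):
        # Keep only the first sentence of the caption as the training target
        caption = example_batch["caption"][i].split(".")[0]
        prompts.append([
            example_batch["image_url"][i],
            f"Question: What's on the picture? Answer: This is {example_batch['name'][i]}. {caption}</s>",
        ])

    inputs = processor(prompts, transform=image_transform, return_tensors="pt").to(device)
    # For causal language modeling, the labels are the input ids themselves
    inputs["labels"] = inputs["input_ids"]
    return inputs
```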
As suggested by Hugging Face, we will load 'TheFusion21/PokemonCards' to fine-tune the model. However, please feel free to use any dataset with the correct format.
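Loading the dataset, holding out a small evaluation slice, and attaching the transform might look like this:

```python
ds = load_dataset("TheFusion21/PokemonCards")

# Hold out a small fraction of the training split for evaluation
ds = ds["train"].train_test_split(test_size=0.002)
train_ds = ds["train"]
eval_ds = ds["test"]

# Apply ds_transforms lazily whenever examples are accessed
train_ds.set_transform(ds_transforms)
eval_ds.set_transform(ds_transforms)
```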
Low-rank adaptation (LoRA) is a PEFT technique that reduces a large matrix into two smaller low-rank matrices within the attention layers, significantly reducing the number of parameters that need fine-tuning.
This code configures and applies Low-Rank Adaptation (LoRA) to our IDEFICS-9b model:
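A sketch of the configuration, targeting the query, key, and value projections in the attention layers; the rank and alpha values below are common starting points rather than definitive choices:

```python
# Low-rank adapters on the attention projections; only these small
# matrices are trained, while the 4-bit base weights stay frozen
config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor for the adapter outputs
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
```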
trainable params: 19,750,912 || all params: 8,949,430,544 || trainable%: 0.2206946230030432
Next, we will fine-tune the model.
This code will start the training process according to the specified parameters, such as the learning rate, precision, batch sizes, gradient accumulation, and checkpointing strategy. We use 16-bit floating-point precision for faster and more efficient training, along with an 8-bit optimizer.
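A sketch of the training setup using the Hugging Face Trainer; the hyperparameters below are reasonable starting points rather than definitive values (note that evaluation_strategy has been renamed eval_strategy in newer transformers releases):

```python
model_name = checkpoint.split("/")[1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-pokemon",
    learning_rate=2e-4,
    fp16=True,                       # 16-bit floating-point training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    dataloader_pin_memory=False,
    save_total_limit=3,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=40,
    eval_steps=20,
    logging_steps=20,
    max_steps=40,
    remove_unused_columns=False,
    label_names=["labels"],
    load_best_model_at_end=True,
    optim="paged_adamw_8bit",        # 8-bit optimizer from bitsandbytes
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

trainer.train()
```

Once training finishes, running check_inference again on a card image produces output along these lines: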
Question: What’s in the picture? Answer: This is Lucario. A Stage 2 Pokemon Card of type Fighting with the title Lucario and 90 HP of rarity Rare evolved from Pikachu from the set Neo Destiny and the flavor text: It can use its tail as a whip.
This concludes our exploration of fine-tuning a visual language model. We successfully fine-tuned the model on the Pokemon cards dataset and used it for inference. The model can now be pushed to Hugging Face and utilized for various applications.
Fine-tuning multimodal models demands carefully balancing computational resources and training strategies to achieve the best outcomes.
The NVIDIA A100 GPU’s high performance, large memory capacity, and scalability make it an excellent choice for fine-tuning multimodal models. These features enable it to efficiently handle the complex, large-scale tasks involved in integrating visual and textual data, leading to faster and more effective model training and deployment. Additionally, GPU Droplets present another great option for fine-tuning large language models.