LLaVA-1.5 was released as an open-source, multimodal language model on October 5th, 2023. This was great news for AI developers, who could now experiment and innovate with models that handle different types of information, not just words, using a completely open-source model.
This article explores the LLaVA-1.5 model with a code demo, walking through several experimental examples and their results. It also covers the latest LLaVA models AI developers can use to build their applications.
According to Grand View Research, the global multimodal AI market size was estimated at USD 1.34 billion in 2023 and is projected to grow at a compound annual growth rate (CAGR) of 35.8% from 2024 to 2030.
A multimodal LLM, or multimodal Large Language Model, is an AI model designed to understand and process information from various data modalities, not just text. This means it can handle a mix of data types, including text, images, audio, and video.
Traditional language AI models often focus on processing textual data. Multimodality breaks free from this limitation, enabling models to analyze images and videos, and process audio.
LLaVA stands for Large Language and Vision Assistant. LLaVA is an open-source model developed by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model built on the transformer architecture. LLaVA-1.5 achieves approximately state-of-the-art (SoTA) performance on 11 benchmarks, with just simple modifications to the original LLaVA, using only publicly available data.
LLaVA models are designed for tasks like video question answering, image captioning, or generating creative text formats based on complex images. They require substantial computational resources to process and integrate information across different modalities. H100 GPUs can cater to these demanding computations.
LLaVA alone demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors similar to multimodal GPT-4 on unseen images/instructions, and yields a 90.92% accuracy score on a synthetic multimodal instruction-following dataset. When combined with GPT-4, LLaVA delivers the highest performance compared with other models.
There are other LLaVA models out there. We have listed a few open-source models that can be tried:
1. LLaVA-HR: A high-resolution MLLM with strong performance and remarkable efficiency. LLaVA-HR greatly outperforms LLaVA-1.5 on multiple benchmarks.
2. LLaVA-NeXT: This model improves reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.
3. MoE-LLaVA (Mixture of Experts for Large Vision-Language Models): A novel approach that tackles a significant challenge in multimodal AI: training massive LLaVA models (Large Language and Vision Assistants) efficiently.
4. Video-LLaVA: Video-LLaVA builds upon the foundation of Large Language Models (LLMs), such as LLaVA, and extends their capabilities to the realm of video. Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively.
5. LLaVA-RLHF: An open-source, RLHF-trained large multimodal model for general-purpose visual and language understanding. It achieves impressive visual reasoning and perception capabilities in the spirit of multimodal GPT-4, and is claimed to yield a 96.6% relative score (vs. LLaVA's 85.1%) compared with GPT-4 on a synthetic multimodal instruction-following dataset.
We tested LLaVA 1.5 on different prompts and got the following results.
Prompt: Give an insightful explanation of this image.
We tested the abilities of LLaVA 1.5 by giving it an image that represents the global chatbot market. The model gave complete, insightful information about the breakdown of the chatbot market, as shown in the figure above.
To further test LLaVA-1.5's image understanding abilities, we gave it the following prompt.
Prompt: How many books are seen in this image? Also tell what you observed in this image?
The model gave the correct answer about the number of books and described the image accurately while focusing on minute details.
Prompt: Return the coordinates of the image in x_min, y_min, x_max, and y_max format.
The model returned the coordinates by assuming that the logo is centered and gave the answer shown above in the image. The answer was satisfactory.
We have implemented LLaVA-1.5 (7 billion parameters) in this demo.
!pip install python-dotenv transformers torch
from dotenv import load_dotenv
from transformers import pipeline
load_dotenv() # take environment variables from .env.
import os
import pathlib
import textwrap
from PIL import Image
import torch
import requests
This includes os for interacting with the operating system, pathlib for filesystem paths, textwrap for text wrapping and filling, PIL.Image for opening, manipulating, and saving many different image file formats, torch for deep learning, and requests for making HTTP requests.
model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text",
                model=model_id,
                model_kwargs={})
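If GPU memory is limited, the empty model_kwargs dictionary is a natural place to pass a quantization configuration. The sketch below is one possible variation, assuming the bitsandbytes package is installed; it is not part of the original demo.

import torch
from transformers import BitsAndBytesConfig, pipeline

# Hypothetical variation (not in the original demo): load LLaVA-1.5-7B in 4-bit
# via bitsandbytes so it fits on a smaller GPU.
model_id = "llava-hf/llava-1.5-7b-hf"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

pipe = pipeline("image-to-text",
                model=model_id,
                model_kwargs={"quantization_config": quantization_config})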
The load_dotenv() function call loads environment variables from a .env file located in the same directory as your script. Sensitive information, like a Hugging Face API key (which starts with "hf_"), is then read with os.getenv() instead of being hard-coded. We have not shown the full Hugging Face API key here for security reasons.
How to store API keys separately?
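As a quick illustration, here is a minimal sketch of this pattern. It assumes the variable is named HF_TOKEN in a .env file next to the script; that variable name is our assumption, not part of the original demo.

import os
from dotenv import load_dotenv

# The .env file (kept out of version control) would contain a line like:
# HF_TOKEN=hf_xxxxxxxxxxxxxxxx

load_dotenv()                      # read variables from the .env file
api_key = os.getenv("HF_TOKEN")    # fetch the token without hard-coding it

if api_key is None:
    raise RuntimeError("HF_TOKEN is not set; add it to your .env file.")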
Setting Up the Pipeline: The pipeline function from the transformers library is used to create a pipeline for the “image-to-text” task. This pipeline is a ready-to-use tool for processing images and generating text descriptions.
image_url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTdtdz2p9Rh46LN_X6m5H9M5qmToNowo-BJ-w&usqp=CAU"
image = Image.open(requests.get(image_url, stream=True).raw)
image
# Example user question for this run (you can replace it with any prompt)
input_text = "Give an insightful explanation of this image."

prompt_instructions = """
Act as an expert writer who can analyze an image and explain it in a descriptive way, using as much detail as possible from the image. The content should be 300 words minimum. Respond to the following prompt:
""" + input_text
The line image = Image.open(requests.get(image_url, stream=True).raw) fetches the image from the URL and opens it using PIL. Next, write the prompt, specify what type of explanation is needed, and set the word limit.
prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 300})  # run the image-to-text pipeline
print(outputs[0]["generated_text"])
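Note that the pipeline's generated_text includes the full prompt followed by the model's reply. If you only want the answer, one option is to split on the ASSISTANT: marker from the prompt template; this small post-processing step is our addition, not part of the original demo.

# Keep only the text after the "ASSISTANT:" marker from the prompt template.
full_text = outputs[0]["generated_text"]
answer = full_text.split("ASSISTANT:")[-1].strip()
print(answer)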
Assignment: As mentioned above, there are other LLaVA models available. Try running the same prompts through one of them and compare the results, as shown in the sketch below.
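One way to start is to swap the model_id in the pipeline above for another checkpoint hosted on Hugging Face. The sketch below assumes the llava-hf/llava-v1.6-mistral-7b-hf checkpoint (a LLaVA-NeXT variant); newer checkpoints may expect a different prompt template, so check the model card before reusing the USER:/ASSISTANT: format.

from transformers import pipeline

# Hypothetical swap: a LLaVA-NeXT (v1.6) checkpoint instead of LLaVA-1.5.
next_model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
next_pipe = pipeline("image-to-text", model=next_model_id)

# Reuse the same image and prompt_instructions from earlier; the [INST] ... [/INST]
# template follows the Mistral-style format described on the model card.
next_prompt = "[INST] <image>\n" + prompt_instructions + " [/INST]"
next_outputs = next_pipe(image, prompt=next_prompt,
                         generate_kwargs={"max_new_tokens": 300})
print(next_outputs[0]["generated_text"])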
This article explored the potential of LLaVA-1.5, showcasing its ability to analyze images and generate insightful text descriptions. We delved into the code used for this demonstration, providing a glimpse into the inner workings of these models. We also highlighted the availability of various advanced LLaVA models like LLaVA-HR and LLaVA-NeXT, encouraging exploration and experimentation.
The future of multimodality is bright, with continuous advancements in foundation vision models and the development of even more powerful LLMs.