The Evolution of Generative Artificial Intelligence: LLaVA 1.5

The landscape of generative artificial intelligence (AI) is advancing rapidly with the emergence of large multimodal models (LMMs). These models have changed how we interact with AI systems by accepting both images and text as input. While OpenAI’s GPT-4 Vision is a leading example of this technology, its closed-source and commercial nature can impose limitations in certain applications.

Fortunately, the open-source community has risen to the challenge and introduced LLaVA 1.5 as a promising alternative to GPT-4 Vision. LLaVA 1.5 combines various generative AI components and has been fine-tuned to deliver high accuracy while being computationally efficient. This open-source LMM has the potential to set a new direction for future research in the field.

Understanding LMM Architecture

LMMs typically consist of several pre-existing components, including:

  • A pre-trained model for encoding visual features
  • A pre-trained large language model (LLM) for understanding user instructions and generating responses
  • A vision-language cross-modal connector for aligning the vision encoder and the language model

Training an instruction-following LMM involves a two-stage process. The first stage, vision-language alignment pretraining, aligns visual features with the language model’s word embedding space using image-text pairs. The second stage, visual instruction tuning, enables the model to follow and respond to prompts involving visual content. This stage can be challenging due to its computational demands and the requirement for a large dataset of carefully curated examples.
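The division of labor between the two stages can be sketched as follows. This is an illustrative summary of which components typically receive gradient updates at each stage under this recipe (the vision encoder stays frozen throughout), not LLaVA's actual training code; the component and stage names are ours.

```python
# Hypothetical sketch of the two-stage LMM training recipe described above.
# Component and stage names are illustrative labels, not real identifiers
# from the LLaVA codebase.

def trainable_components(stage: str) -> set:
    """Return which parts of the model receive gradient updates per stage."""
    if stage == "alignment_pretraining":
        # Stage 1: only the cross-modal connector is trained, so that the
        # frozen vision encoder's features land in the LLM's embedding space.
        return {"connector"}
    if stage == "visual_instruction_tuning":
        # Stage 2: the connector and the LLM are fine-tuned together on
        # instruction-following examples; the vision encoder stays frozen.
        return {"connector", "llm"}
    raise ValueError(f"unknown stage: {stage}")

print(trainable_components("alignment_pretraining"))  # → {'connector'}
print(trainable_components("visual_instruction_tuning"))
```

Keeping the vision encoder frozen in both stages is what makes the first stage cheap: only the small connector is updated while the large pre-trained components are reused as-is.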

LLaVA 1.5 incorporates the CLIP (Contrastive Language–Image Pre-training) model as its visual encoder. CLIP, developed by OpenAI in 2021, associates images and text by training on a substantial dataset of image-description pairs. It is utilized in advanced text-to-image models like DALL-E 2. Vicuna, a version of Meta’s open-source LLaMA model fine-tuned for instruction-following, serves as LLaVA’s language model.
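The core idea behind CLIP — embedding images and text into a shared space where matching pairs score highly — can be illustrated with a toy example. The vectors below are made up for demonstration; real CLIP embeddings are high-dimensional and learned from image-description pairs.

```python
import numpy as np

# Toy sketch of CLIP-style image-text matching: both modalities live in a
# shared embedding space, and cosine similarity scores candidate captions.
# These three-dimensional vectors are invented for illustration only.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image_embedding = np.array([0.9, 0.1, 0.0])  # pretend: a photo of a dog
text_embeddings = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1]),
    "a photo of a cat": np.array([0.1, 0.9, 0.2]),
}

# Pick the caption whose embedding best aligns with the image embedding.
best = max(text_embeddings,
           key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]))
print(best)  # → a photo of a dog
```

In LLaVA, CLIP's vision encoder plays exactly this role on the image side: it produces embeddings that already carry semantics aligned with natural language, which makes them a good starting point for connecting to an LLM.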

Advancements in LLaVA 1.5

LLaVA 1.5 expands upon its predecessor by:

  • Connecting the language model and vision encoder through a multi-layer perceptron (MLP), a fully connected deep learning model
  • Incorporating several open-source visual question-answering datasets into the training data
  • Scaling the input image resolution
  • Gathering data from ShareGPT, an online platform for sharing conversations with ChatGPT
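The first item — replacing a single linear projection with an MLP connector — can be sketched in a few lines. This is a minimal illustration of a two-layer MLP projecting vision-encoder features into the LLM's embedding space; the dimensions, the random weights, and the GELU activation are our assumptions for the sketch, not the model's actual parameters.

```python
import numpy as np

# Hedged sketch of an MLP cross-modal connector: two linear layers with a
# GELU nonlinearity map per-patch visual features into the LLM embedding
# space. All sizes and weights here are illustrative.

rng = np.random.default_rng(0)
vision_dim, hidden_dim, llm_dim = 1024, 4096, 4096

W1 = rng.standard_normal((vision_dim, hidden_dim)) * 0.01
W2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.01

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(visual_tokens: np.ndarray) -> np.ndarray:
    """Map [num_patches, vision_dim] features to [num_patches, llm_dim]."""
    return gelu(visual_tokens @ W1) @ W2

patches = rng.standard_normal((576, vision_dim))  # e.g. a 24x24 patch grid
projected = project(patches)
print(projected.shape)  # → (576, 4096)
```

Once projected, the visual tokens can be concatenated with the text token embeddings and fed to the language model like ordinary input tokens — which is what makes a fully connected connector such a simple yet effective bridge.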

The training process for LLaVA 1.5 involved approximately 600,000 examples and took only a day on eight A100 GPUs, costing a few hundred dollars. According to the researchers, LLaVA 1.5 outperforms other open-source LMMs in 11 out of 12 multimodal benchmarks. It is important to note that performance measurements of LMMs can be complex and may not necessarily reflect real-world applications.

An online demo of LLaVA 1.5 is available, showcasing impressive results from a small model that can be trained and operated on a tight budget. Additionally, the code and dataset are accessible, encouraging further development and customization. Users have shared interesting examples where LLaVA 1.5 successfully handles complex prompts.

However, there is a caveat with LLaVA 1.5. As it has been trained on data generated by ChatGPT, it is not suitable for commercial purposes due to ChatGPT’s terms of use. Developers are prohibited from using LLaVA 1.5 to train competing commercial models.

While LLaVA 1.5 may not yet rival GPT-4 Vision in terms of convenience, ease of use, and integration with other OpenAI tools, it offers cost-effectiveness and scalability for generating training data for visual instruction tuning. Several open-source alternatives to ChatGPT can fulfill this purpose, and it is only a matter of time before others replicate the success of LLaVA 1.5 and explore new possibilities, including permissive licensing and application-specific models.

LLaVA 1.5 provides a glimpse into the future of open-source LMMs. As the open-source community continues to innovate, we can anticipate the development of more efficient and accessible models, further democratizing the new wave of generative AI technologies.
