Nous Research Introduces Hermes 2 Vision: A Lightweight Vision-Language Model

Nous Research, a private applied research group known for publishing open-source work in the large language model (LLM) domain, has recently unveiled a new addition to their collection – Nous Hermes 2 Vision. This lightweight vision-language model, available through Hugging Face, builds upon the company’s previous OpenHermes-2.5-Mistral-7B model and offers enhanced vision capabilities.

One of the key features of Hermes 2 Vision is its ability to accept image prompts and extract useful text information from visual content. Users can submit an image and receive detailed answers in natural language. Nous co-founder Teknium (as he is known on X) shared a striking example: a screenshot showed the model analyzing a photo of a burger and assessing its potential health impact. This kind of analysis brings a new level of sophistication to language models.

“Named after Hermes, the Greek messenger of Gods, the Nous vision model is designed to be a system that navigates ‘the complex intricacies of human discourse with celestial finesse.'”

The Nous vision model, named after the Greek messenger deity Hermes, aims to navigate the intricacies of human conversation with that same finesse. By combining user-supplied image data with its training, the model can interpret various aspects of an image and explain them clearly and concisely.

Like ChatGPT, Nous Hermes 2 Vision supports image prompts, but it differentiates itself with two significant enhancements. First, the model uses the SigLIP-400M vision encoder instead of the traditional 3B-parameter encoders, which makes the architecture lighter and more efficient while improving performance on vision-language tasks. Second, the model has been trained on a custom dataset enriched with function calling, which allows users to extract written information from images, such as menus or billboards, by using a specific tag.
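The article does not reproduce the tag format itself, which is defined on the model's Hugging Face page. As a rough illustration of the idea, here is a minimal Python sketch of how a function-calling prompt might be assembled and a JSON reply parsed; the `<fn_call>` tag name, the schema layout, and the simulated reply are all assumptions for illustration, not the model's documented interface:

```python
import json

# Hypothetical tag and schema layout -- the real format is specified
# on the Nous Hermes 2 Vision model card, not reproduced here.
def build_fn_prompt(instruction: str, schema: dict) -> str:
    """Wrap a user instruction and a JSON function schema in a tag."""
    return f"<fn_call>{json.dumps(schema)}</fn_call>\n{instruction}"

schema = {
    "name": "extract_text",
    "description": "Return all written text visible in the image",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
    },
}

prompt = build_fn_prompt("Read the billboard in this photo.", schema)

# The model would answer with JSON conforming to the schema;
# simulate such a reply and parse it.
reply = '{"text": "GRAND OPENING - FREE COFFEE ALL WEEK"}'
extracted = json.loads(reply)["text"]
print(extracted)
```

The appeal of this pattern is that the model's output is machine-parseable, so downstream automations can consume the extracted text directly rather than scraping free-form prose.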

“Developers now have a versatile tool at their disposal, primed for crafting a myriad of ingenious automations.”

The inclusion of function calling in Nous Hermes 2 Vision transforms it into a powerful Vision-Language Action Model, giving developers a versatile tool for automations and innovative applications. The company expressed its excitement about the model's potential on the Hugging Face page.

Several datasets were used to train the model, including LVIS-INSTRUCT4V, ShareGPT4V, and conversations from OpenHermes-2.5. These diverse sources contribute to the model’s comprehensive understanding of language and visual context.

While Nous Hermes 2 Vision is available for research and development purposes, early usage has shown that it is not without flaws. The co-founder acknowledged the model's imperfections, noting that the initial release hallucinates and spams end-of-sequence (EOS) tokens. Consequently, the model was renamed Hermes 2 Vision Alpha. The company plans to address these issues and release a more stable version in the near future.

“I see people talk about ‘hallucinations’ and yes, it is quite bad. I will make an updated version of this by the end of the month to resolve these problems.”

Quan Nguyen, the research fellow leading the AI efforts at Nous, expressed awareness of the model’s shortcomings and reassured users that an updated version is in the works. Feedback and user experiences play a crucial role in the development of the model.

Despite the challenges, Nous Research remains committed to advancing the field of AI. With a total of 41 open-source models under their belt, including the Hermes, YaRN, Capybara, Puffin, and Obsidian series, they continue to push the boundaries with various architectures and capabilities.
