
Understanding LLaVA: Large Language and Vision Assistant

Blog post from Voxel51

Post Details
Company: Voxel51
Date Published:
Author: Dan Gural
Word Count: 1,584
Company Posts That Month: 10
Language: English
Hacker News Points: -
Summary

LLaVA (Large Language and Vision Assistant) is an open-source project developed by researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It aims to build a novel end-to-end trained large multimodal model that can compete with even the largest models, such as GPT-4. The LLaVA team created 150K image-instruction pairs from images in the COCO Train2017 dataset, using GPT-4 to generate conversations about each image cheaply and efficiently. For training, they combined the widely used CLIP ViT-L/14 visual encoder with Vicuna, an LLM based on Llama 2. In the reported results, LLaVA achieved an overall relative score of 85% compared with GPT-4. The training data has since been expanded beyond COCO to include other datasets and now contains over 665K conversations.
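
One way to read the 85% figure is as a ratio of ratings. The summary does not define the metric, so the sketch below assumes that a judge rates each answer from the candidate model and from the reference model on the same questions, and that the relative score is the candidate's total rating as a percentage of the reference's total. The function name `relative_score` and the ratings are hypothetical and chosen only for illustration; they are not taken from the post.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's total rating as a percentage of the reference's.

    Assumption: both lists hold per-question ratings from the same judge,
    in the same question order. This is an illustrative definition, not
    necessarily the exact metric behind the reported 85%.
    """
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("rating lists must have the same length")
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings must not sum to zero")
    return 100.0 * sum(candidate_ratings) / reference_total


# Made-up ratings for three questions (illustration only).
candidate = [6, 8, 7]   # hypothetical ratings for the candidate model
reference = [8, 9, 8]   # hypothetical ratings for the reference model

print(f"{relative_score(candidate, reference):.1f}%")
```

Under this reading, a score below 100% means the judge rated the reference model's answers higher in aggregate on that question set.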

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
LLM | 9 | 1,884 | 250 | 103 | -28%
AI Guardrails | 1 | 44 | 24 | 15 | -71%
Vector Search | 1 | 906 | 144 | 68 | -61%