Understanding LLaVA: Large Language and Vision Assistant

Post Details

Company

Voxel51

Date Published

Dec. 11, 2023

Author

Dan Gural

Word Count

1,584

Company Posts That Month

10

Language

English

Hacker News Points

-

Source URL

voxel51.com/blog/understanding-llava-large-language-and-vision-assistant

Summary

LLaVA (Large Language and Vision Assistant) is an open-source project developed by researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It aims to build an end-to-end trained large multimodal model that can compete with even the largest models, such as GPT-4. The LLaVA team created 150K image-instruction pairs from images in the COCO Train2017 dataset, using GPT-4 to generate conversations about each image cheaply and efficiently. For training, they combined the widely used CLIP ViT-L/14 visual encoder with Vicuna, an LLM based on Llama 2. In the reported results, LLaVA achieved an overall relative score of 85% compared to GPT-4. The training data has since been expanded beyond COCO to include other datasets and now contains over 665K conversations.
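
To make the "relative score" figure concrete: one common reading is that a judge rates the candidate model's answers and the reference model's answers to the same questions, and the relative score is the ratio of the two average ratings. The sketch below illustrates that reading only. The function name `relative_score`, the rating scale, and the sample ratings are all hypothetical and are not taken from the post or from LLaVA's evaluation data.

```python
from statistics import mean


def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's mean rating as a percentage of the reference's mean rating.

    Assumption: both lists hold judge ratings for the same questions, in the same order.
    """
    if not candidate_ratings or len(candidate_ratings) != len(reference_ratings):
        raise ValueError("rating lists must be non-empty and of equal length")
    reference_mean = mean(reference_ratings)
    if reference_mean == 0:
        raise ValueError("reference mean rating must be non-zero")
    return 100.0 * mean(candidate_ratings) / reference_mean


if __name__ == "__main__":
    # Hypothetical ratings, invented purely for illustration.
    candidate = [6, 8, 7]
    reference = [8, 9, 10]
    print(f"relative score: {relative_score(candidate, reference):.1f}%")
```

Under this reading, a relative score of 100% would mean the candidate's answers were rated as highly, on average, as the reference model's.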

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	9	1,884	250	103	-28%
AI Guardrails	1	44	24	15	-71%
Vector Search	1	906	144	68	-61%