Understanding LLaVA: Large Language and Vision Assistant
Blog post from Voxel51
LLaVA (Large Language and Vision Assistant) is an open-source project developed by researchers at the University of Wisconsin, Microsoft Research, and Columbia University. It aims to create a novel end-to-end trained large multimodal model that can compete against even the giants of models such as GPT-4. The LLaVA team created 150k image-instruction pairs using images from the COCO Train2017 dataset and leveraged GPT-4 to form conversations about the image in a cheap and efficient manner. They used the widely popular CLIP VIT-L/14 visual encoder model and Vicuna, an LLM based on Llama 2, for their model training. The results show that LLaVA was able to capture overall an 85% relative score compared to GPT-4. The dataset has been updated to include more datasets to train on other than COCO, bringing in over 665K conversations now.
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| LLM | 9 | 1,884 | 250 | 103 | -28% |
| AI Guardrails | 1 | 44 | 24 | 15 | -71% |
| Vector Search | 1 | 906 | 144 | 68 | -61% |