The emergence of multimodal AI chatbots, exemplified by OpenAI's GPT-4 and LLaVA (developed by researchers at the University of Wisconsin–Madison and Microsoft Research), marks a significant advance in AI-human interaction by integrating language and visual processing in a single model. GPT-4, built on a transformer-based architecture, excels at natural language processing and has been extended to accept visual inputs, showing strong performance across academic benchmarks and a wide range of languages, though it remains accessible primarily through a paid subscription. LLaVA, which combines the Vicuna language model with a CLIP visual encoder, stands out for its instruction-following ability and competitive performance in multimodal settings despite being trained on a much smaller dataset; unlike GPT-4, it is fully open-source. Both models handle many computer vision tasks well, but both struggle with fine-grained object detection and remain vulnerable to prompt injection. GPT-4 tends to outperform LLaVA in mathematical reasoning and OCR, while LLaVA performs strongly in conversational settings and in interpreting visual content. Each model's strengths and limitations underscore both the rapid pace of development and the open security concerns in the field of AI chatbots.
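
To make the architectural contrast concrete, the sketch below outlines the LLaVA-style design mentioned above: a CLIP-like vision encoder produces patch features, a learned projection maps them into the language model's embedding space, and the projected image tokens are prepended to the text tokens before the decoder runs. This is a minimal illustrative sketch under stated assumptions; the class names, stub modules, and toy dimensions are inventions for illustration and not the actual LLaVA or Vicuna implementation.

```python
# Minimal, illustrative sketch of a LLaVA-style pipeline: a (frozen) CLIP-like
# vision encoder -> linear projection -> language model over image + text tokens.
# Module names and dimensions are toy assumptions, not the real LLaVA/Vicuna code.
import torch
import torch.nn as nn

class VisionEncoderStub(nn.Module):
    """Stand-in for a CLIP ViT: maps an image batch to a sequence of patch features."""
    def __init__(self, num_patches=16, vision_dim=64):
        super().__init__()
        self.num_patches, self.vision_dim = num_patches, vision_dim

    def forward(self, images):                                  # images: (B, 3, H, W)
        return torch.randn(images.shape[0], self.num_patches, self.vision_dim)

class LlavaStyleModel(nn.Module):
    """Projects image patch features into the LLM embedding space and prepends
    them to the text embeddings, so the decoder attends over both modalities."""
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        self.vision_encoder = VisionEncoderStub(vision_dim=vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)         # trainable "bridge"
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the Vicuna decoder (the real model is a causal transformer).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, images, input_ids):
        vis = self.projector(self.vision_encoder(images))       # (B, P, llm_dim)
        txt = self.token_embed(input_ids)                       # (B, T, llm_dim)
        hidden = self.llm(torch.cat([vis, txt], dim=1))         # joint image+text sequence
        return self.lm_head(hidden[:, vis.shape[1]:, :])        # logits for text positions

# Toy forward pass: two images and two 8-token prompts.
model = LlavaStyleModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```

The key design point this illustrates is that only a lightweight projection bridges the vision and language components, which is why LLaVA can reach competitive multimodal performance while training on far less data than a model trained end to end on both modalities.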