Large language models (LLMs) like GPT-4 and PaLM are evaluated for their zero-shot predictive accuracy and generative ability on a custom dataset derived from the Wikipedia Movie Plots data. Despite their impressive capabilities, these models face challenges such as outdated responses and hallucinations, which make businesses hesitant to adopt them in their workflows. The blog post details an evaluation on 100 data points from the dataset, covering two tasks: genre prediction and concise plot summarization. Both models are assessed using precision, recall, F1 scores, and confusion matrices. The findings show that PaLM excels in precision while GPT-4 performs better on recall, with each model showing strengths and weaknesses across different genres. For summarization, PaLM produces shorter summaries, whereas GPT-4 includes more detailed descriptions. PaLM additionally returns safety attribute scores, which are useful for content moderation. Overall, both models perform well in a zero-shot setting, but prompt tuning or fine-tuning on specific datasets may further improve their performance in real-world applications.
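The classification side of an evaluation like this can be sketched with scikit-learn's standard metrics. The genre labels and the toy predictions below are illustrative placeholders, not values from the post; they only show how per-genre precision, recall, F1, and a confusion matrix would be computed from the models' outputs:

```python
# Sketch of per-genre metric computation, assuming string genre labels.
# `genres`, `y_true`, and `y_pred` are hypothetical examples, not data
# from the actual 100-point evaluation described in the post.
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

genres = ["drama", "comedy", "horror"]          # label set (illustrative)
y_true = ["drama", "comedy", "horror", "drama", "comedy"]   # gold genres
y_pred = ["drama", "comedy", "drama", "drama", "horror"]    # model output

# Per-class precision/recall/F1 (average=None keeps one score per genre).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=genres, average=None, zero_division=0
)

# Rows are true genres, columns are predicted genres, in `genres` order.
cm = confusion_matrix(y_true, y_pred, labels=genres)
```

Comparing two models then reduces to running the same computation on each model's `y_pred` and inspecting where the per-genre scores and confusion-matrix off-diagonals differ.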