Interest in long-context large language models (LLMs) is growing as context windows expand to accommodate up to 1 million tokens, prompting the development of new benchmarks to evaluate their capabilities. One such benchmark, Multi-Needle + Reasoning, tests an LLM's ability to retrieve and reason over multiple facts ("needles") placed within a large context. The results show that performance degrades as the number of needles increases, and the challenge intensifies when reasoning over the retrieved facts is required. Notably, models such as GPT-4 tend to retrieve facts placed toward the end of the context while missing those near the beginning, a pattern observed in both single-needle and multi-needle setups. As context length grows, both retrieval and reasoning performance decline, highlighting potential limitations of long-context LLMs in retrieval-augmented generation (RAG) applications. Understanding these limitations is crucial for using long-context LLMs effectively: they do not guarantee retrieval of multiple facts, especially as context size increases, and specific prompting strategies may be needed to improve performance.
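
To make the setup concrete, here is a minimal sketch of how a multi-needle test could be assembled: a set of needle facts is inserted at evenly spaced depths inside filler text, the model is asked a question that requires all of them, and the answer is scored by the fraction of needles it recovers. The filler text, needle sentences, question, and the `ask_model` callable are illustrative placeholders, not the benchmark's actual implementation.

```python
from typing import Callable, List


def build_context(filler: str, needles: List[str], target_tokens: int) -> str:
    """Insert each needle at an evenly spaced depth inside repeated filler text.

    Tokens are approximated by whitespace splitting; a real harness would use
    the model's own tokenizer.
    """
    base = filler.split()
    words = (base * (target_tokens // max(len(base), 1) + 1))[:target_tokens]
    step = len(words) // (len(needles) + 1)
    for i, needle in enumerate(needles, start=1):
        words.insert(i * step, needle)  # later needles sit deeper in the context
    return " ".join(words)


def score_answer(answer: str, keywords: List[str]) -> float:
    """Fraction of needle keywords that appear in the model's answer."""
    hits = sum(1 for kw in keywords if kw.lower() in answer.lower())
    return hits / len(keywords)


def run_trial(ask_model: Callable[[str, str], str],
              needles: List[str], keywords: List[str],
              question: str, filler: str, target_tokens: int) -> float:
    """Build one long context, query the model, and return the retrieval score."""
    context = build_context(filler, needles, target_tokens)
    answer = ask_model(context, question)
    return score_answer(answer, keywords)


if __name__ == "__main__":
    # Illustrative needles and question; any long-context model client can be
    # plugged in via `ask_model`.
    needles = [
        "The secret ingredient in the first dish is saffron.",
        "The secret ingredient in the second dish is cardamom.",
        "The secret ingredient in the third dish is sumac.",
    ]
    keywords = ["saffron", "cardamom", "sumac"]
    question = "What are the secret ingredients in the three dishes?"
    filler = "The quick brown fox jumps over the lazy dog. " * 50

    def fake_model(context: str, q: str) -> str:
        # Stub standing in for a real LLM API call.
        return "The ingredients mentioned are saffron and cardamom."

    print(run_trial(fake_model, needles, keywords, question, filler, target_tokens=5_000))
```

Sweeping `target_tokens` and the number of needles in a loop reproduces the benchmark's two main axes; plotting the resulting scores by needle depth is what surfaces the tendency to miss facts placed early in the context.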