Long Context Reasoning

For AI systems to perform knowledge work effectively, they must be able to synthesise information and draw conclusions from multiple long documents.
Prior literature has over-indexed on retrieval capability as a measure of model performance over long contexts. This is an incomplete basis for assessing model capabilities on practical tasks over large knowledge bases. Variants of needle-in-a-haystack questions effectively demonstrate a model's ability to find relevant information, yet they provide minimal insight into its ability to reason over that context. For most knowledge-worker tasks, reasoning across information retrieved from multiple sources is the core capability needed. Retrieval is a necessary, but insufficient, component of task completion.
Via AragoAI, I worked with Artificial Analysis to develop AA-LCR, a benchmark that:
...includes 100 hard text-based questions that require reasoning across multiple real-world documents, with each document set averaging ~100k input tokens. Questions are designed such that answers cannot be directly retrieved from documents and must instead be reasoned from multiple information sources...
AA-LCR was released in 2025 and has been integrated into the Artificial Analysis Intelligence Index. For more detail on the benchmark and an overview of current model performance, see the Artificial Analysis website. The dataset has also been released publicly via HuggingFace.
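
As a minimal sketch, the dataset can be loaded with the Hugging Face `datasets` library. The dataset ID below is an assumption based on the organisation and benchmark names, and the field names will depend on the released schema; check the actual HuggingFace dataset page before use.

```python
from datasets import load_dataset

# Assumed dataset ID (organisation/benchmark naming); verify on HuggingFace.
ds = load_dataset("ArtificialAnalysis/AA-LCR")

print(ds)                # available splits and row counts
split = next(iter(ds))   # first available split name
example = ds[split][0]   # one question record
print(example.keys())    # field names depend on the released schema
```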
