
Large language models (LLMs) are evolving rapidly, enabling applications such as chatbots, code generation, and knowledge extraction. One crucial factor influencing their effectiveness is the context length: the number of tokens a model can attend to at once. While theoretical context lengths continue to grow, it is the practical, effective context length (ECL) that determines real-world usability.

In this blog post, we explore the ECL of the Granite-3.1-8b instruct model and validate its capabilities across various tasks. The study takes its inspiration from the paper "Why Does the Effective Context Length of LLMs Fall Short?" (Li et al., 2024) [1], which introduces methodologies for assessing how much context a model actually utilizes.

What is effective context length and why does it matter?

While modern LLMs boast extended context windows (e.g., 100K tokens or more), not all tokens contribute equally to model predictions. Effective context length refers to the portion of the context window that significantly influences the model's responses; beyond a model's ECL, users may start to see a performance impact. Li et al. (2024) propose a systematic evaluation method that examines performance degradation as tokens recede further into the context. Understanding the ECL is important because developers need to balance performance with computational cost: longer contexts require more memory, potentially limiting the batch sizes that can be computed together.
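To see the memory tradeoff concretely, here is a small back-of-the-envelope sketch of how the key-value (KV) cache grows with context length. The layer and head dimensions are illustrative placeholders, not Granite-3.1-8b's actual architecture.

```python
# Back-of-the-envelope KV-cache sizing, to illustrate why longer contexts limit
# batch size. The layer/head dimensions below are illustrative placeholders, not
# Granite-3.1-8b's actual architecture.
def kv_cache_bytes(seq_len, batch_size, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):  # 2 bytes for fp16/bf16
    # Keys and values (hence the factor of 2) are cached at every layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(ctx, batch_size=1) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.1f} GiB of KV cache per sequence")
```

With these placeholder dimensions, a single 128K-token sequence needs roughly 32 times more KV-cache memory than a 4K-token sequence, which is exactly the pressure that limits batch size.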

Experimental Setup and Evaluation:

We used the ECL benchmarking methodology described in the RULER paper [2] to calculate the ECL of the Granite-3.1-8b instruct model. RULER is a diverse suite of synthetic tasks designed to evaluate a model's ability to retain and retrieve information across long contexts. The benchmark includes multiple variants of needle-in-a-haystack as well as additional tasks focusing on variable tracking, counting, and long-context question answering. We used the code and metrics from the official RULER repository. The primary results for the Granite-3.1-8b instruct model are presented below for each task category.
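To give a concrete sense of how these synthetic tasks are constructed, below is a simplified sketch of a single-needle retrieval probe. It is not the official RULER implementation (the reported results come from the official code), and the prompt wording, helper names, and the llm_generate call are illustrative.

```python
import random

def build_needle_prompt(filler_sentences, needle_key, needle_value, depth=0.5):
    """Hide a key-value "needle" at a relative depth inside filler text and ask
    the model to retrieve it (simplified single-needle retrieval probe)."""
    needle = f"The special magic number for {needle_key} is {needle_value}."
    insert_at = int(len(filler_sentences) * depth)
    context = filler_sentences[:insert_at] + [needle] + filler_sentences[insert_at:]
    question = f"What is the special magic number for {needle_key}?"
    return " ".join(context) + "\n\n" + question

def score(model_answer, needle_value):
    # Exact-match style scoring: the needle value must appear in the answer.
    return float(str(needle_value) in model_answer)

# Usage: pad to the target context length, embed the needle mid-context,
# send the prompt to the model, and score the returned value.
filler = ["The grass is green and the sky is blue."] * 3000
value = random.randint(100_000, 999_999)
prompt = build_needle_prompt(filler, "alpha", value, depth=0.5)
# answer = llm_generate(prompt)     # hypothetical inference call
# accuracy = score(answer, value)
```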

Model performance across various task categories

Single Key Retrieval: This category includes multiple tasks that measure the model’s ability to accurately find a single piece of information (a “needle”) within a long context.

Chart showing the model's ability to find a single piece of information

Multi-Key Retrieval: Includes subtasks evaluating the model's ability to retrieve multiple pieces of information (“needles”) dispersed within long sequences.

Chart showing the model's ability to retrieve multiple pieces of information dispersed within long sequences.

Variable Tracking: The task under this category assesses whether the model can correctly follow changes in variables throughout a long sequence.

A chart that assesses whether the model can correctly follow changes in variables.
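To make the variable-tracking task concrete, here is a simplified sketch of such a probe (again, not the RULER implementation; the naming and wording are illustrative):

```python
import random

def build_variable_tracking_prompt(num_hops=3, filler="The grass is green.", filler_per_hop=200):
    """Chain variable assignments through filler text; the model must report
    every variable that ends up holding the original value (simplified sketch)."""
    value = random.randint(10_000, 99_999)
    names = [f"X{i}" for i in range(1, num_hops + 1)]
    assignments = [f"VAR {names[0]} = {value}"]
    assignments += [f"VAR {cur} = {prev}" for prev, cur in zip(names, names[1:])]
    context = []
    for line in assignments:
        context.extend([filler] * filler_per_hop)  # bury each assignment in filler
        context.append(line)
    question = f"Which variables are assigned the value {value}? List all of them."
    return " ".join(context) + "\n\n" + question, names  # names == expected answer

prompt, expected_variables = build_variable_tracking_prompt(num_hops=4)
```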

Common/Frequent Word Extraction: The task under this category assesses whether the model can correctly identify commonly or frequently appearing words in a long context.

A chart showing whether the model can correctly identify frequently appearing terms and words.

Question Answering: Includes tasks that evaluate the model’s capability to answer questions based on information scattered across an extensive context.

A chart showing the model's capability to answer questions based on scattered information.

Overall model performance:

Rating the effective context length performance of the model overall.

Granite-3.1-8b Model Performance at its Effective Context Length (32k):

| Model | Claimed Length | Effective Length | NIAH | VT | CWE/FWE | QnA | Avg (13 tasks) |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8b | 128k | 32k | 96.6 | 98.6 | 68.85 | 68 | 88.1 |

Key Observations

The effective context length of Granite-3.1-8B, defined as the longest context at which it maintains an average performance above the 85.6% accuracy baseline set by Llama-2-7B at 4K context, is 32K tokens. Beyond 32K, performance drops significantly on certain tasks.

  • Performance drops at very long context lengths:
    • At the 128K context window, the model struggles particularly on cwe (Common Word Extraction) (0.8%), niah_multikey_3 (20%), and qa_2 (48%).
    • It is important to note that the performance degradation at longer context lengths is primarily driven by the Common Word Extraction (CWE) task. While most tasks maintain relatively high performance at 32K and 64K context lengths, the CWE task shows a drastic drop in accuracy as context length increases.
    • This suggests that while the model supports long contexts, “retention and retrieval” mechanisms degrade at extreme scales.
  • Between 32K and 64K, we see some drop-offs, particularly in complex multi-key or multi-value retrieval tasks. However, most tasks perform well up to 32K.
  • Tasks like niah_multi_query and vt (Variable Tracking) maintain high accuracy (>90%) up to 64K tokens.
    • This suggests Granite-3.1-8B is well-suited for tasks involving structured retrieval or multi-turn interactions.
  • The results are consistent with the findings of [1], emphasizing the importance of effective context benchmarking in real-world applications.
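For readers who want to see how the 32K figure is derived, here is a minimal sketch of the thresholding step: average the 13 task scores at each tested context length and keep the largest length whose average stays at or above the 85.6% baseline. The per-length averages below are illustrative placeholders; only the 32K value matches the table above.

```python
BASELINE = 85.6  # Llama-2-7B accuracy at 4K context, the RULER reference point

# context length (tokens) -> average accuracy (%) over the 13 RULER tasks.
# Values other than 32K are illustrative placeholders, not measured results.
avg_score_by_context = {
    4_096: 95.0,
    8_192: 94.0,
    16_384: 92.0,
    32_768: 88.1,
    65_536: 82.0,
    131_072: 70.0,
}

def effective_context_length(scores, baseline=BASELINE):
    """Largest context length whose average score meets or exceeds the baseline."""
    passing = [ctx for ctx, avg in scores.items() if avg >= baseline]
    return max(passing) if passing else None

print(effective_context_length(avg_score_by_context))  # -> 32768 with these numbers
```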

Implications for LLM Development

These insights have practical implications for optimizing LLM architectures. By fine-tuning positional embeddings and memory mechanisms, developers can enhance context retention without merely increasing token limits. Future work will explore adaptive strategies to extend effective context length further.
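As one illustration of the kind of positional-embedding adjustment mentioned above, here is a minimal sketch of RoPE scaling with Hugging Face transformers. This is not the technique used to produce the results in this post; the exact config keys vary by transformers version and model architecture, and the model id and scaling factor are shown only as examples.

```python
# Minimal sketch: extend the usable position range by scaling rotary position
# embeddings (RoPE). Config keys and support differ across transformers versions
# and architectures; the factor below is an arbitrary example.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "ibm-granite/granite-3.1-8b-instruct"  # illustrative model id
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 2.0}  # hypothetical setting

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```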

While Granite-3.1-8B can theoretically support 128K context lengths, the effective context length of the model is approximately 32K tokens, meaning it maintains reliable performance up to that threshold. Understanding effective context length is essential for evaluating and improving LLM performance. As LLMs continue advancing with memory-efficient architectures and long-context tuning, rigorous benchmarking will remain a cornerstone for pushing their contextual understanding to new limits.

References

[1] Li, X., Zhang, Y., & Chen, L. (2024). Why Does the Effective Context Length of LLMs Fall Short? arXiv Preprint, arXiv:2410.18745. https://arxiv.org/pdf/2410.18745

[2] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., & Ginsburg, B. (2024). RULER: What’s the real context size of your long-context language models? arXiv Preprint, arXiv:2404.06654. https://arxiv.org/abs/2404.06654


About the authors

Nikhil Palaskar is a senior software engineer and a member of Performance and Scale Engineering at Red Hat. Nikhil's current focus is primarily on serving LLMs for inference and optimizing the performance of their deployments in Red Hat OpenShift AI and Red Hat Enterprise Linux AI environments. Nikhil is also actively engaged in performance experimentation and tuning of large language models running on AMD hardware with the ROCm software stack. Previously, he worked on building a benchmarking and performance analysis framework (Pbench).

Nikhil's professional interests revolve around AI/ML and deep learning, statistics, performance engineering, and application profiling. When he is not working, he likes to go on hikes.
