Handling long context remains a challenging problem for LLMs and other AI systems. That's why more and more approaches for optimizing how LLMs process long context keep appearing.
Most new approaches break long texts into smaller, manageable chunks. However, chunking can lose parts of the context or the text's core idea. It can also be slow, so developers often process multiple chunks in parallel, which in turn requires more compute. Because chunking alone is not always effective, researchers keep developing other techniques for working with long context windows.
Here are 10 novel methods for efficient processing of long contexts in LLMs:
Dolphin, a decoder-decoder architecture for energy-efficient long-context processing in LMs, uses a small, energy-saving model to condense extensive context into a compact form. The Dolphin architecture reduces the workload for the main model, lowers energy use, and speeds up processing while maintaining accuracy. → Read more
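Below is a minimal, hypothetical PyTorch sketch of the decoder-decoder idea: a small model condenses the long context into a few "memory" embeddings that are projected into the main model's input space, so the large model never attends over the full context. All module names, layer choices, and sizes are illustrative assumptions, not Dolphin's actual implementation.

```python
# Hypothetical sketch of Dolphin's decoder-decoder idea: a small model turns a long
# context into a handful of "memory" embeddings that a larger model consumes
# alongside the short query. Module names and sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class SmallContextCompressor(nn.Module):
    """Stands in for the small, energy-efficient model that condenses the context."""
    def __init__(self, vocab_size=32000, d_small=256, n_memory=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_small)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_small, nhead=4, batch_first=True), num_layers=2
        )
        # Learnable "memory" queries that pool the long context into n_memory vectors.
        self.memory_queries = nn.Parameter(torch.randn(n_memory, d_small))
        self.pool = nn.MultiheadAttention(d_small, num_heads=4, batch_first=True)

    def forward(self, context_ids):                      # (batch, long_len)
        h = self.encoder(self.embed(context_ids))        # (batch, long_len, d_small)
        q = self.memory_queries.expand(context_ids.size(0), -1, -1)
        memory, _ = self.pool(q, h, h)                   # (batch, n_memory, d_small)
        return memory

class MainDecoderStub(nn.Module):
    """Stands in for the large model; it sees projected memory + query embeddings."""
    def __init__(self, vocab_size=32000, d_small=256, d_main=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_main)
        self.project = nn.Linear(d_small, d_main)        # align compressor and main dims
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_main, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_main, vocab_size)

    def forward(self, memory, query_ids):                # memory: (batch, n_memory, d_small)
        inputs = torch.cat([self.project(memory), self.embed(query_ids)], dim=1)
        return self.lm_head(self.backbone(inputs))       # next-token logits

compressor, main_model = SmallContextCompressor(), MainDecoderStub()
long_context = torch.randint(0, 32000, (1, 4096))        # long context, cheap model only
short_query = torch.randint(0, 32000, (1, 32))           # short query, expensive model
logits = main_model(compressor(long_context), short_query)
print(logits.shape)  # the main model never attends over the full 4096-token context
```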
The Writing in the Margins (WiM) method breaks long texts into smaller chunks, and the LLM then writes "margin notes" that highlight the key information in each chunk. These notes help the model handle long context in retrieval tasks. → Read more
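A rough sketch of this chunk-and-annotate workflow is shown below. The `call_llm` function and the prompt wording are placeholders for whatever model and prompts you use; the relevance-filtering step is a simplifying assumption about how the notes are kept or discarded.

```python
# Hypothetical sketch of a Writing-in-the-Margins-style workflow: process a long document
# chunk by chunk, ask the model for a short "margin note" about each chunk, keep only
# notes judged relevant to the query, and answer from those notes plus the query.
from typing import Callable, List

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an HTTP request to a serving endpoint)."""
    return "stub response"

def chunk_text(text: str, chunk_size: int = 2000) -> List[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def writing_in_the_margins(document: str, query: str, llm: Callable[[str], str] = call_llm) -> str:
    margin_notes = []
    for chunk in chunk_text(document):
        # 1) Extractive step: summarize what in this chunk matters for the query.
        note = llm(f"Context chunk:\n{chunk}\n\nQuery: {query}\n"
                   f"Write a short margin note with any information relevant to the query.")
        # 2) Filtering step: drop notes the model itself deems irrelevant.
        verdict = llm(f"Query: {query}\nMargin note: {note}\nIs this note relevant? Answer YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            margin_notes.append(note)
    # 3) The final answer is produced from the accumulated margin notes, not the raw document.
    notes_block = "\n".join(f"- {n}" for n in margin_notes) or "- (no relevant notes found)"
    return llm(f"Margin notes:\n{notes_block}\n\nAnswer the query: {query}")

print(writing_in_the_margins("some very long document ...", "What is the main finding?"))
```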
ReMamba enhances the original Mamba architecture with selective compression and a two-stage process to handle long texts better at minimal extra cost. It comes close to the results of similarly sized transformer-based models. → Read more
FocusLLM extends the context length of decoder-only LLMs by chunking long texts and appending the local context to each chunk. It uses a parallel decoding mechanism to extract and integrate essential information from the chunks, performing well on contexts of up to 400K tokens. → Read more
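The sketch below illustrates the parallel-decoding pattern under strong simplifications: every chunk is paired with the same local context, candidates are decoded in parallel, and a simple majority vote stands in for FocusLLM's learned aggregation. The `generate` function is a placeholder for a real decoder call.

```python
# Hypothetical sketch of FocusLLM-style parallel decoding: every chunk of the long text
# is paired with the same local context (the prompt), candidate continuations are
# decoded from each chunk in parallel, and the results are merged. The voting-based
# merge and the `generate` placeholder are simplifying assumptions, not the paper's design.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List

def generate(prompt: str) -> str:
    """Placeholder for a decoder-only LLM generation call."""
    return "candidate answer"

def split_into_chunks(text: str, chunk_size: int = 4000) -> List[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def focusllm_answer(long_text: str, local_context: str) -> str:
    chunks = split_into_chunks(long_text)
    # Each call sees only (one chunk + the local context), so per-call length stays short.
    prompts = [f"{chunk}\n\n{local_context}" for chunk in chunks]
    with ThreadPoolExecutor(max_workers=8) as pool:
        candidates = list(pool.map(generate, prompts))
    # Integrate the per-chunk candidates; a majority vote is used here for simplicity.
    return Counter(candidates).most_common(1)[0][0]

print(focusllm_answer("x" * 20000, "Question: what does the report conclude?"))
```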
ChatQA 2, a Llama3-based model by NVIDIA, extends Llama3-70B's context window from 8K to 128K tokens and enhances performance via a three-stage instruction-tuning process. It matches GPT-4-Turbo's accuracy on long-context tasks and outperforms it on RAG tasks. → Read more
EM-LLM (Episodic Memory LLM) mimics how human memory works, allowing LLMs to handle much longer texts efficiently. It organizes text into meaningful "episodes" and retrieves relevant information when needed, similar to how human brains recall memories. → Read more
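Here is a very rough sketch of such an episodic-memory pipeline. A drop in bag-of-words similarity stands in for EM-LLM's surprise-based boundary detection, and simple similarity search stands in for its retrieval; the threshold and similarity measure are illustrative assumptions only.

```python
# Hypothetical sketch of an episodic-memory pipeline in the spirit of EM-LLM:
# the text stream is segmented into "episodes" at points of low similarity
# (a crude stand-in for surprise-based boundaries), and at answer time the
# most relevant episodes are retrieved and placed in the prompt.
import re
from collections import Counter
from typing import List

def bow(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in a)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def segment_into_episodes(text: str, boundary_threshold: float = 0.2) -> List[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    episodes, current = [], [sentences[0]]
    for sent in sentences[1:]:
        # Low similarity to the running episode ~ "surprising" content ~ new episode.
        if cosine(bow(" ".join(current)), bow(sent)) < boundary_threshold:
            episodes.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    episodes.append(" ".join(current))
    return episodes

def retrieve_episodes(episodes: List[str], query: str, k: int = 2) -> List[str]:
    scored = sorted(episodes, key=lambda ep: cosine(bow(ep), bow(query)), reverse=True)
    return scored[:k]

doc = ("The rover landed in the crater. It began drilling samples. "
       "Meanwhile, the budget committee met in Vienna. They approved new funding. "
       "The rover later found traces of clay minerals.")
episodes = segment_into_episodes(doc)
print(retrieve_episodes(episodes, "What did the rover discover?"))
```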
The LazyLLM method speeds up transformer-based LMs by selectively processing only the most important tokens during the prefilling and decoding inference stages, while maintaining accuracy. → Read more
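The snippet below sketches the core pruning idea in PyTorch: attention scores from the final prompt token rank the context tokens, and only the top fraction is passed to later layers. The real method prunes progressively across layers and can revive tokens later; this one-shot version and the dimensions are simplifying assumptions.

```python
# Hypothetical sketch of LazyLLM-style token pruning: attention scores from the final
# prompt token rank the context tokens, and only the top fraction is kept for the
# later (more expensive) layers.
import torch

def prune_tokens_by_attention(hidden: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """hidden: (seq_len, d_model) hidden states after an early transformer layer."""
    query = hidden[-1]                                        # last prompt token drives the ranking
    scores = torch.softmax(hidden @ query / hidden.size(-1) ** 0.5, dim=0)
    keep = max(1, int(keep_ratio * hidden.size(0)))
    top_idx = torch.topk(scores, keep).indices.sort().values  # keep original token order
    return hidden[top_idx]                                    # only these tokens reach later layers

torch.manual_seed(0)
early_hidden = torch.randn(1024, 512)                         # 1024 prompt tokens, d_model=512
pruned = prune_tokens_by_attention(early_hidden)
print(early_hidden.shape, "->", pruned.shape)                 # later layers see ~30% of the tokens
```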
LongRAG boosts RAG performance with long-context LLMs, using a "long retriever" and a "long reader" in its architecture. LongRAG groups entire Wikipedia articles into much longer retrieval units, reducing the total number of units and making retrieval more efficient. Here is our article about the LongRAG approach. → Read more
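A toy version of this retriever-reader split is sketched below. The keyword-overlap scoring and the `long_reader` placeholder are assumptions for illustration; LongRAG's actual retriever and grouping strategy are more sophisticated.

```python
# Hypothetical sketch of the LongRAG pattern: group whole articles into long retrieval
# units, retrieve only a few such units, and hand their full text to a long-context
# "long reader" model. Scoring and the reader call are placeholders.
from typing import Dict, List

def long_reader(prompt: str) -> str:
    """Placeholder for a long-context LLM that can read tens of thousands of tokens."""
    return "answer drawn from the retrieved long units"

def build_long_units(articles: Dict[str, str], group_size: int = 3) -> List[str]:
    """Merge several whole articles into one retrieval unit, so far fewer units exist."""
    texts = list(articles.values())
    return ["\n\n".join(texts[i:i + group_size]) for i in range(0, len(texts), group_size)]

def score(unit: str, query: str) -> int:
    return sum(unit.lower().count(w) for w in query.lower().split())   # crude keyword overlap

def longrag_answer(articles: Dict[str, str], query: str, top_k: int = 1) -> str:
    units = build_long_units(articles)
    retrieved = sorted(units, key=lambda u: score(u, query), reverse=True)[:top_k]
    # The reader gets a handful of very long units instead of many short passages.
    return long_reader("\n\n".join(retrieved) + f"\n\nQuestion: {query}")

corpus = {"Article A": "Solar panels convert sunlight ...",
          "Article B": "Wind turbines generate power ...",
          "Article C": "Hydroelectric dams store energy ..."}
print(longrag_answer(corpus, "How do solar panels work?"))
```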
The DeepSeek-V2 model implements DeepSeek's advanced developments, such as the DeepSeekMoE (Mixture-of-Experts) architecture and Multi-Head Latent Attention (MLA), to manage extremely long inputs of up to 128,000 tokens. We have made a detailed overview of these technologies in our AI 101 episode. → Read more
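To give a feel for MLA's key idea, the simplified PyTorch sketch below caches one small latent vector per token and up-projects it to keys and values at attention time, instead of caching full per-head K and V. RoPE handling and other details of DeepSeek-V2's actual MLA are omitted, and the dimensions are illustrative assumptions.

```python
# Hypothetical, simplified sketch of Multi-Head Latent Attention's core idea: cache a
# small latent per token and reconstruct K and V from it when attention is computed.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64     # latent is much smaller than K+V

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress token state to a latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys from the latent
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values from the latent
q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)

hidden = torch.randn(4096, d_model)                        # 4096 cached context tokens
latent_cache = down_kv(hidden)                             # (4096, 64): this is all we store

# At decode time, rebuild K/V from the latent cache and attend with the new token's query.
new_token = torch.randn(1, d_model)
q = q_proj(new_token).view(n_heads, d_head)                # (heads, d_head)
k = up_k(latent_cache).view(-1, n_heads, d_head)           # (seq, heads, d_head)
v = up_v(latent_cache).view(-1, n_heads, d_head)
attn = torch.softmax(torch.einsum("hd,shd->hs", q, k) / d_head ** 0.5, dim=-1)
out = torch.einsum("hs,shd->hd", attn, v).reshape(1, -1)   # attention output for the new token

full_kv_floats = 4096 * n_heads * d_head * 2               # what a standard KV cache would store
print(f"latent cache: {latent_cache.numel()} floats vs full KV cache: {full_kv_floats} floats")
```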
LONGWRITER introduces AgentWrite, a method that overcomes output-length limitations by breaking ultra-long writing tasks into smaller parts. It allows existing models to produce outputs of up to 20,000 words. → Read more
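The sketch below shows a plan-then-write pipeline of this kind: the model first drafts an outline, then writes one section per call while seeing the plan and the text written so far. `call_llm` and the prompt wording are placeholders for whatever model and prompts you would actually use.

```python
# Hypothetical sketch of an AgentWrite-style pipeline: plan an outline for the ultra-long
# piece, then write it section by section, each time conditioning on the plan and on
# what has already been written.
from typing import Callable, List

def call_llm(prompt: str) -> str:
    """Placeholder for any instruction-following LLM call."""
    return "stub section text"

def plan_outline(task: str, n_sections: int, llm: Callable[[str], str]) -> List[str]:
    plan = llm(f"Task: {task}\nWrite a numbered outline with {n_sections} sections, "
               f"one line per section, including a target word count for each.")
    lines = [l.strip() for l in plan.splitlines() if l.strip()]
    return lines or [f"Section {i + 1}" for i in range(n_sections)]  # fall back if the stub is empty

def agent_write(task: str, n_sections: int = 10, llm: Callable[[str], str] = call_llm) -> str:
    outline = plan_outline(task, n_sections, llm)
    written: List[str] = []
    for section in outline:
        # Each call only has to produce one section, so no single output hits the length
        # limit, while the growing draft keeps later sections consistent with earlier ones.
        draft_so_far = "\n\n".join(written)
        written.append(llm(f"Task: {task}\nOutline item: {section}\n"
                           f"Already written:\n{draft_so_far}\n\nWrite this section now."))
    return "\n\n".join(written)

print(len(agent_write("Write a 15,000-word report on renewable energy").split()))
```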
