Tags: rag, retrieval, colbert, agentic-rag, vector-search

RAG Is Dead, Long Live RAG: Where Retrieval Is Going

Max P

Every few months a Twitter thread argues that retrieval-augmented generation is over. The argument usually runs like this: context windows are huge now, models can read whole books, so why bother with a retriever? Just stuff everything in.

I have run this experiment more than once. It does not work, except in narrow cases. Long context is great when you have one document and you want to ask many questions about it. It is terrible when you have ten thousand documents, none of which fit, and the right answer is in three paragraphs scattered across four of them. The interesting question is not whether RAG dies. The interesting question is what the next generation of retrieval looks like, and that generation is being built right now.

This post sketches the four trends I find most useful to follow: hybrid retrieval, late-interaction models, agentic retrieval, and contextual chunking. Along the way it touches on RAGatouille, Onyx, AnythingLLM, and Qdrant as concrete examples.

What long context actually replaces

Before we move on, an honest accounting of what long context does fix:

  • It lets you skip retrieval for single-document workflows.
  • It absorbs more recent context within a session, so multi-turn agents do not need to re-fetch as often.
  • It softens the penalty for poor chunking, because more chunks fit even with a noisy retriever.

What it does not fix:

  • Cost and latency at scale. Pulling 200K tokens of context into every call is not free.
  • Retrieval over millions of documents. The set you would need to stuff into context does not fit.
  • Provenance and citations. Even a perfect long-context model will not tell you which paragraphs it leaned on unless you build the trail yourself.
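To make the cost point concrete, here is a back-of-the-envelope sketch. The price, the call volume, and the 4,000-token retrieved context are all placeholder assumptions, not real figures for any provider; only the 200K number comes from the list above.

```python
def context_cost_per_day(tokens_per_call, calls_per_day, usd_per_million_tokens):
    """Daily input-token spend in USD for a fixed context size per call."""
    return tokens_per_call * calls_per_day * usd_per_million_tokens / 1_000_000

# Placeholder inputs (assumptions for illustration only).
PRICE = 1.0      # hypothetical USD per million input tokens
CALLS = 10_000   # hypothetical calls per day

stuffed = context_cost_per_day(200_000, CALLS, PRICE)   # stuff everything in
retrieved = context_cost_per_day(4_000, CALLS, PRICE)   # a handful of retrieved chunks

print(f"stuffed:   ${stuffed:,.0f}/day")
print(f"retrieved: ${retrieved:,.0f}/day")
print(f"ratio:     {stuffed / retrieved:.0f}x")
```

The ratio does not depend on the price you plug in. It is just the ratio of context sizes, which is why the gap survives any price cut.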

For more on the foundations, our self-hosted AI coding tools piece covers the broader open-source landscape.

Trend 1: hybrid retrieval is the new default

Pure dense retrieval, where every document and every query becomes a single vector, has limits. It is bad at exact-token matches like product SKUs, error codes, function names, and regulatory citations. BM25 keyword retrieval, the old workhorse, handles those cases trivially but misses paraphrase.

Hybrid search combines both. Modern vector databases now support sparse and dense vectors in the same collection. Qdrant treats sparse vectors as a generalization of BM25 and TF-IDF, and lets you combine them with dense vectors in one query. Most production RAG systems I look at run hybrid by default, and the lift over pure dense retrieval is usually meaningful on technical content.
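One common way to merge a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration of that idea, not the internals of Qdrant or any other database. The document ids and the two result lists are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for one query from the two retrievers.
bm25_hits = ["doc_sku", "doc_errors", "doc_faq"]         # exact-token matches
dense_hits = ["doc_payments", "doc_errors", "doc_sku"]   # paraphrase matches

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document found by only one side still makes the list. RRF works on ranks alone, so you never have to reconcile BM25 scores with cosine similarities.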

If you are building one of these systems, our writeup on building a private RAG stack walks through the moving parts.

Trend 2: late-interaction models like ColBERT

Dense retrieval squeezes a whole document into one vector. That is a lossy operation. Late-interaction models like ColBERT take a different bet: they keep token-level embeddings and do the matching at the token level at query time. Per the RAGatouille project, this gives better generalization to new domains, better data efficiency during training, and notably better multilingual performance, especially for low-resource non-English languages.
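The token-level matching is easier to see in code. Below is a minimal sketch of the MaxSim-style scoring used in late interaction, written in plain Python with toy 2-dimensional embeddings. Real models produce these embeddings with a trained encoder; the vectors here are invented for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best cosine match among the document's
    tokens, then sum those best matches over all query tokens."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy token embeddings (assumed values, not output from a real encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one blended vector, like a pooled embedding

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

doc_a wins because each query token finds its own exact counterpart. doc_b behaves like the single-vector case: everything has been averaged into one direction, so no query token matches it perfectly.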

RAGatouille is the Python library that has done the most to make ColBERT accessible. The project describes itself as bridging the gap between state-of-the-art retrieval research and practical RAG pipelines. You install it with pip, point it at your documents, and it indexes them and serves queries with sensible defaults.

The downside of late interaction is index size. You are storing per-token embeddings instead of per-document, and that costs storage. For collections in the hundreds of thousands or low millions of documents, the trade is usually worth it. At very high scale you are likely combining ColBERT-style retrieval with a cheaper first-stage retriever.
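A rough sizing exercise shows where the storage goes. Every number below is an illustrative assumption (collection size, tokens per document, embedding dimensions, precision), and the calculation ignores index overhead and compression, so treat the result as an order-of-magnitude sketch.

```python
def index_size_bytes(num_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value

NUM_DOCS = 1_000_000  # assumed collection size

# Assumed single-vector setup: one 768-dim float32 vector per document.
single_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=1, dim=768, bytes_per_value=4)

# Assumed late-interaction setup: 200 tokens per document, 128-dim float16 each.
late_interaction = index_size_bytes(NUM_DOCS, vectors_per_doc=200, dim=128, bytes_per_value=2)

print(f"single-vector:    {single_vector / 1e9:.1f} GB")
print(f"late-interaction: {late_interaction / 1e9:.1f} GB")
print(f"ratio:            {late_interaction / single_vector:.1f}x")
```

Under these assumptions the per-token index is roughly 51 GB against roughly 3 GB for the single-vector one. Compression can narrow that gap in practice, but the direction of the trade stays the same.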

Trend 3: agentic retrieval

The third trend is the one the RAG label undersells most. Once you let the model decide what to retrieve, retrieval becomes a tool call rather than a fixed step in a pipeline. The model can issue several queries, refine its question after seeing initial results, switch to a different connector when the first one comes up dry, and so on.
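Here is a minimal sketch of that loop. Everything in it is hypothetical: the two in-memory "connectors", the substring search, and the scripted planner that stands in for the model's decision. A real system would replace the planner with an LLM call that picks a tool and a query.

```python
def make_tool(store):
    """Hypothetical connector: substring match over a tiny in-memory store."""
    def tool(query):
        return [text for key, text in store.items() if query in key]
    return tool

def agentic_retrieve(goal, planner, tools, budget=3):
    """Let the planner choose tool calls until it stops or the budget runs out."""
    evidence, history = [], []
    for _ in range(budget):
        action = planner(goal, history)
        if action is None:          # planner decides it has enough
            break
        tool_name, query = action
        results = tools[tool_name](query)
        history.append((tool_name, query, len(results)))
        evidence.extend(results)
    return evidence, history

def scripted_planner(goal, history):
    """Stand-in for the model: try the wiki first, fall back to tickets."""
    if not history:
        return ("wiki", goal)
    last_tool, _, hits = history[-1]
    if last_tool == "wiki" and hits == 0:
        return ("tickets", goal)    # first connector came up dry, switch
    return None

tools = {
    "wiki": make_tool({"refund policy": "Refunds are issued within 14 days."}),
    "tickets": make_tool({"refund delay": "Refund delayed by a bank holiday."}),
}

evidence, history = agentic_retrieve("refund delay", scripted_planner, tools)
print(history)
print(evidence)
```

The important part is the shape: retrieval is an action chosen inside a loop with a budget, and the history of earlier calls feeds the next decision.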

Onyx is a useful concrete example. The project describes itself as the application layer for LLMs, and ships agentic RAG with hybrid search, multi-step deep research flows, code execution sandboxes, and 50-plus connectors that the agent can choose between. The shape of the system is no longer query-in, chunks-out. It is closer to: the agent has a goal, a research budget, and a set of tools, and it spends them.

For a softer entry point, AnythingLLM offers similar agent-style behavior in a smaller, more single-tenant package, and remains one of the easiest ways to get up and running.

Trend 4: contextual chunking

The last trend is on the ingestion side. The classic chunking strategy is dumb: split every 500 tokens with a 50-token overlap. That works fine until a chunk boundary lands in the middle of a sentence or, worse, cuts through a concept that stretches across sections of a long technical doc.
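For reference, the classic strategy fits in a few lines. This sketch takes any list of tokens; the integer list in the example is a stand-in for real tokenizer output.

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # this window reached the end
            break
        start += step
    return chunks

tokens = list(range(1200))                # stand-in for 1,200 real tokens
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])
```

Notice that nothing in the function looks at the content. Boundaries fall wherever the arithmetic puts them, which is exactly the weakness described above.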

Contextual chunking refers to a class of strategies that try to be smarter:

  • Section-aware splitting that respects markdown headings, code fences, table boundaries, and list structures.
  • Document-level summaries prepended to each chunk so a chunk fragment retains some of the parent context.
  • Question-aware enrichment where ingestion generates likely questions for each chunk and adds them as a side channel for retrieval.

The cost is higher ingestion time. The win is meaningful retrieval improvements on long, structured documents.
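As a sketch of the first two strategies, the function below splits on markdown headings, refuses to split inside a code fence, and prefixes each chunk with a document summary and its section heading. The function name, the bracketed prefix format, and the sample document are all assumptions made for this example. In practice the summary would typically come from an LLM at ingestion time.

```python
FENCE = "`" * 3  # markdown code fence marker

def section_chunks(markdown_text, doc_summary=""):
    """One chunk per heading-delimited section, with parent context prepended."""
    sections = [("", [])]                 # list of (heading, body lines)
    in_code = False
    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code         # track whether we are inside a fence
        if line.startswith("#") and not in_code:
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue                      # skip empty sections
        context = " | ".join(part for part in (doc_summary, heading) if part)
        chunks.append(f"[{context}]\n{body}" if context else body)
    return chunks

doc = "\n".join([
    "# Install",
    "Run the installer.",
    "# Usage",
    FENCE,
    "# this comment is not a heading",
    "run()",
    FENCE,
])

chunks = section_chunks(doc, doc_summary="Setup guide")
for chunk in chunks:
    print(chunk, end="\n\n")
```

The comment line inside the fence starts with "#" but stays in the Usage chunk, because the splitter knows it is inside code. A fixed-size splitter has no way to make that distinction.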

Putting it together

So, is RAG dead? No. The naive 2023 version of RAG (single-vector dense retrieval, fixed chunking, a single round of retrieval, no reranking) is dead. Good. What is replacing it is more flexible, more powerful, and more interesting to build:

  • Hybrid retrieval as the default.
  • Late-interaction models like ColBERT for higher recall in domains where data is scarce.
  • Agentic retrieval where the model chooses what to fetch and when.
  • Contextual chunking that respects structure.

If you are starting fresh today, pick a vector database that supports hybrid search, an application layer like Onyx or AnythingLLM, and consider RAGatouille for late interaction on the slices of your corpus where it matters most.

You can read the source for the projects mentioned: RAGatouille on GitHub, Onyx, AnythingLLM, and Qdrant.

Tools mentioned in this post

  • RAGatouille: Python library for ColBERT-style late-interaction retrieval in RAG pipelines.
  • Onyx: open-source application layer for LLMs with agentic RAG and many connectors.
  • AnythingLLM: single-container chat app for documents with agent-style features.
  • Qdrant: open-source vector database with hybrid sparse plus dense retrieval.
