Context Is Gold to Find the Gold Passage: Evaluating and Training Contextual Document Embeddings

Community Article Published June 2, 2025


Figure: Traditional embedding methods (right, top) produce embeddings that do not include essential contextual information. Our lightweight training recipe (InSeNT) trains contextualized embedding models that are aware of document-wide context when encoding a passage (right, bottom) and can integrate document-wide information into individual chunk representations, improving embedding relevance and downstream retrieval performance without increasing inference cost.

TL;DR

Dense retrievers typically embed each passage in isolation. When the relevant clues spill across passage boundaries, those models mis-rank results. ConTEB (the Contextual Text Embedding Benchmark) quantifies this weakness; InSeNT combined with late-chunking pooling is a promising way to fix it, delivering large gains on ConTEB with only a light fine-tuning phase and almost no runtime overhead.

Why does context matter?

Search applications rarely deal with tweet-length texts. Technical manuals, contracts, scientific papers, and support tickets easily run into thousands of tokens. Users, however, still expect an answer at the passage level. When the decisive evidence sits partially outside the passage boundary, a model that “sees” only a single chunk is likely to fail. Context can help resolve ambiguity, such as distinguishing between multiple meanings of a word or resolving pronouns and entity references. It is also crucial when documents have a structured format, as is common in legal or scientific texts, where knowing where a passage sits within the table of contents is essential to understanding it.

Figure: an example passage whose bolded sentence "They extended [...]" is ambiguous without its surrounding document context.

In the example above, embedding the bolded sentence "They extended [...]" without leveraging document context is ambiguous: are we talking about Napoleonic armies or Brazilian football?

How retrieval systems actually chunk documents

Before embedding, virtually every production pipeline breaks each document into smaller, model and reader-friendly units. Common strategies include:

| Strategy | Typical parameters | Rationale & trade-offs |
|---|---|---|
| Fixed-length sliding window | k ≈ 128–1024 tokens, some overlap | Simple to implement; overlap reduces boundary effects but multiplies index size. |
| Structure-aware blocks | Headings, paragraphs, list items | Preserves semantic units but yields highly variable lengths. |
| Hybrid | Fixed window inside structural blocks | Combines predictability with some respect for discourse structure. |

Designers must balance (i) respecting the Transformer’s maximum input, (ii) keeping enough context for downstream reading comprehension, and (iii) controlling index growth and latency. In practice, no chunking scheme can guarantee that every question’s evidence is entirely self-contained.
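As a concrete illustration of the first strategy, here is a minimal sliding-window chunker. This is a sketch, not the paper's pipeline: the tokenizer, window size, and overlap below are arbitrary example choices.

```python
# Minimal fixed-length sliding-window chunker (illustrative sketch).
# Tokenizer, window, and overlap are arbitrary examples, not recommended settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def sliding_window_chunks(text: str, window: int = 256, overlap: int = 32) -> list[str]:
    """Split `text` into ~`window`-token chunks, sharing `overlap` tokens between neighbors."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(token_ids):
        piece = token_ids[start : start + window]
        chunks.append(tokenizer.decode(piece))
        if start + window >= len(token_ids):
            break
        start += window - overlap  # slide forward, keeping some shared context
    return chunks
```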

ConTEB: a benchmark that penalises context blindness

ConTEB introduces eight retrieval tasks where answering requires information beyond any single chunk. Some datasets are synthetic and controlled (e.g. Football, Geography); others are derived from realistic RAG workloads such as NarrativeQA or Covid-QA. A “sanity test” (NanoBEIR) ensures that improvements do not come at the expense of traditional self-contained tasks.


Getting into the gist of it: How can we add context to embeddings?

Early results on ConTEB show that standard retrieval methods struggle in settings in which context is key! Our approach attempts to integrate contextual information through two key components: the recently proposed Late Chunking technique, and a custom training recipe we call InSeNT.

Late chunking: pooling after embedding the full document

As previously stated, dense retrievers usually break a long document d into smaller, fixed-length chunks

$$ d \;\longrightarrow\; \{c_1, c_2, \dots, c_{N_d}\}, $$

encode each chunk independently, and later treat the set of chunk vectors as the “representation” of the document:

$$ \phi(d) \;=\; \bigl[\,\phi(c_1),\;\phi(c_2),\;\dots,\;\phi(c_{N_d})\bigr]. $$

That early chunking strategy prevents any token in $c_i$ from seeing tokens from other chunks during encoding.
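For contrast, here is what this standard "embed each chunk in isolation" pipeline typically looks like. This is a sketch: the sentence-transformers model and the toy chunks (echoing the ambiguous "They extended [...]" example above) are illustrative, not taken from the paper.

```python
# Standard "early chunking": each chunk is encoded independently, so the pronoun
# in the second chunk can never attend to the context given in the first one.
# The model name and chunk texts are arbitrary examples.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "Napoleon reorganised his armies before the 1805 campaign.",
    "They extended their reach far beyond the borders.",
]
chunk_embeddings = model.encode(chunks)  # shape (N_d, dim); no cross-chunk context
```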

The Late Chunking Idea

  1. One long forward pass. Concatenate all chunks and let the encoder contextualize across the entire document:

$$ H \;=\; \phi\bigl(c_1 \oplus c_2 \oplus \dots \oplus c_{N_d}\bigr), \qquad H = [h_1, h_2, \dots, h_T]. $$

  2. Recover chunk-level vectors after the fact. For each original chunk $c_i$, average its token embeddings inside $H$:

$$ \phi_{\text{LC}}(c_i) \;=\; \frac{1}{|c_i|} \sum_{t\in c_i} h_t, \qquad i=1,\dots,N_d. $$

The resulting set

$$ \phi_{\text{LC}}(d) = \bigl[\,\phi_{\text{LC}}(c_1),\dots,\phi_{\text{LC}}(c_{N_d})\bigr] $$

retains the same shape expected by downstream retrieval code, but each chunk vector now benefits from full-document context: tokens in earlier paragraphs can influence the representation of later ones (and vice-versa), capturing long-range dependencies without any training changes.

Late Chunking computes token embeddings over the full document first and then pools back to chunk vectors, letting every chunk “know” about every other chunk before it ever meets the retrieval engine.
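Here is a minimal sketch of that pooling, assuming a Hugging Face encoder with a fast tokenizer and character-level chunk spans. The model name is illustrative and the paper's own implementation may differ.

```python
# Late chunking sketch: one forward pass over the whole document, then mean-pool
# the contextualized token states back into per-chunk vectors.
# Assumptions: a fast tokenizer (for offset mappings) and character-level chunk spans.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "answerdotai/ModernBERT-base"  # illustrative choice of long-context encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def late_chunk(document: str, chunk_char_spans: list[tuple[int, int]]) -> torch.Tensor:
    enc = tokenizer(document, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0]           # (T, 2) character span of each token
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (T, dim), contextualized over the full document

    chunk_vectors = []
    for start, end in chunk_char_spans:
        # keep tokens whose character span lies inside this chunk (special tokens have empty spans)
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vectors.append(hidden[mask].mean(dim=0))
    return torch.stack(chunk_vectors)                # same shape as chunk-by-chunk encoding
```

The output has one vector per chunk, so it can be dropped into an existing retrieval index in place of the independently encoded chunk embeddings.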

InSeNT: In-Sequence Negative Training

Late Chunking is primarily designed to be used without retraining the underlying embedding model. We find that while it often enables information propagation between chunks, this information flow can be substantially improved through a well-designed, lightweight training phase.


During contrastive fine-tuning, InSeNT mixes the usual in-batch negatives (text sequences from other documents) with in-sequence negatives — other chunks from the same (late-chunked) document.

Intuitively, training late-chunked models contrastively against chunks from other documents encourages information propagation within each document and improves document identification. The contrastive term between same-document chunks, on the other hand, ensures each chunk retains its specificity and remains identifiable w.r.t. its neighbors.

A small mixing weight (λ ≈ 0.1) balances these two complementary yet somewhat contradictory objectives.
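To make the mixed objective concrete, here is a hedged sketch of how in-batch and in-sequence negatives could be combined with a weight λ. The actual loss, temperature, and weighting used in the paper may differ; all tensor names and defaults below are illustrative.

```python
# Sketch of a mixed contrastive objective in the spirit of InSeNT (not the exact paper loss).
import torch
import torch.nn.functional as F

def insent_loss(query_emb, chunk_emb, doc_ids, pos_idx, lambda_=0.1, temperature=0.02):
    """
    query_emb: (B, d)  query embeddings
    chunk_emb: (C, d)  late-chunked passage embeddings in the batch
    doc_ids:   (C,)    document id of each chunk
    pos_idx:   (B,)    index in chunk_emb of each query's positive chunk
    """
    sim = (F.normalize(query_emb, dim=-1) @ F.normalize(chunk_emb, dim=-1).T) / temperature  # (B, C)

    # In-batch term: softmax over every chunk in the batch (chunks of other documents act as negatives).
    loss_batch = F.cross_entropy(sim, pos_idx)

    # In-sequence term: softmax restricted to chunks from the positive chunk's own document.
    same_doc = doc_ids.unsqueeze(0) == doc_ids[pos_idx].unsqueeze(1)   # (B, C) mask
    sim_seq = sim.masked_fill(~same_doc, float("-inf"))
    loss_seq = F.cross_entropy(sim_seq, pos_idx)

    # Small lambda_ keeps the in-sequence term as a regularizer (assumed weighting, per the text above).
    return (1 - lambda_) * loss_batch + lambda_ * loss_seq
```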


An interesting finding of this work is that having both in-sequence and in-batch negatives during training is absolutely crucial to good downstream performance.

Empirical gains

The results are clear: context is gold when it comes to constructing relevant embeddings for ConTEB documents. Late Chunking alone yields large nDCG gains (+9.0), and InSeNT improves performance further, for an average boost of 23.6 nDCG@10 over standard embedding methods. Performance on NanoBEIR (self-contained queries) decreases slightly, indicating a minor regression on conventional benchmarks that could likely be corrected by adding replay data to the training mix.


We further show that contextualized embeddings scale better with corpus size and are more robust to sub-optimal chunking strategies.

Figure: Contextualized models trained with InSeNT are more robust to aggressive chunking strategies that remove essential information from chunks (left), and scale better with corpus size and ambiguity (right).

What about Late Interaction models?

Late Interaction models (ColBERT-style) keep multiple vectors per passage (usually one per token) and use a multi-vector to multi-vector matching operation called MaxSim to obtain a scalar query-document relevance score. Those familiar with the work from our group know we are big fans! These models tend to perform better than their dense counterparts at the expense of increased storage cost, and especially shine in long-context settings.
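For reference, here is a minimal sketch of the MaxSim operation, assuming L2-normalized token embeddings; variable names are illustrative.

```python
# MaxSim scoring used by ColBERT-style Late Interaction models: each query token
# is matched to its most similar document token, and the maxima are summed.
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (Q, d), doc_tokens: (T, d); assumed L2-normalized
    sim = query_tokens @ doc_tokens.T        # (Q, T) token-to-token similarities
    return sim.max(dim=1).values.sum()       # best document token per query token, summed
```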

Interestingly, while LI models are good at long-context retrieval, they are poorly suited to out-of-the-box late chunking (-0.3 nDCG@10 w.r.t. ModernColBERT without late chunking). We posit that since token embeddings are trained without pooling, these models learn very local features and cannot leverage information from neighboring tokens. Once trained with our method, ModernColBERT+InSeNT displays large performance gains across the board (+11.5 nDCG@10 w.r.t. ModernColBERT + Late Chunking), showcasing an increased ability to leverage external context.

In practice, this means that to perform late-chunking contextualization with LI models, you really need to train them!

As a side note, this project stems from disappointing attempts to contextualize documents with late chunking on ColPali models without retraining. The results obtained through this work are a promising avenue for improving visual retrievers such as ColPali by enabling them to integrate information beyond the single page they embed.

Resources

Contact the first authors

Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:

@misc{conti2025contextgoldgoldpassage,
      title={Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings}, 
      author={Max Conti and Manuel Faysse and Gautier Viaud and Antoine Bosselut and Céline Hudelot and Pierre Colombo},
      year={2025},
      eprint={2505.24782},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.24782}, 
}

Acknowledgments

This work is partially supported by ILLUIN Technology and by a grant from ANRT France. This work was performed using HPC resources from the GENCI Jean Zay supercomputer under grant AD011016393.

Community

super informative! @tomaarsen check this ^

Very cool work! Reminds me a bit of Contextual Document Embeddings, which you very reasonably cited as well. cc @jxm

Late chunking is very cool, but it worries me in terms of memory/disk space costs. Unlike with Late Interaction models which are often mapped down to smaller dimensions, late chunking on dense embedding models presumably just results in storing rather large token embeddings, right? We'll be adding Sparse embedding models to Sentence Transformers soon, that might be an interesting angle as well: Sparse late chunking.

Also, nice work on the (re-)usability of your work, the two trained models have a nice usage to them.


Hey Tom! Thanks a lot!
Late Chunking actually doesn't change a thing compared to classical bi-encoders in terms of storage/inference cost! In both cases, the token embeddings are averaged over each chunk. In classic LC, you could decide to keep all token embeddings to truly be able to chunk dynamically - but most often you already know how you are going to chunk, so you can use the same chunks standard bi-encoders would use, and get better contextualization (leading to better robustness to bad chunking - see the ablations in section 6).

In our case, we have [sep] tokens between the chunks, and find chunk representations largely improve when they learn to stay different from adjacent document chunks. This could not be done without the model knowing the chunk boundaries beforehand. After averaging tokens between the [sep] tokens, we thus get exactly the same embedding sizes as a standard bi-encoder.

This is nice because it really is a drop-in replacement. Having said that, we believe the models we trained still need to be improved a bit for production use cases; we were mainly happy to show the research direction looked very promising!

Cheers @tomaarsen and thanks for the kind words as always @merve!
