CoRAG-Llama3.1-8B-MultihopQA

This is the CoRAG-8B model from the paper Chain-of-Retrieval Augmented Generation, fine-tuned on MultihopQA data.

Model Evaluation

Scores are exact match (EM) and F1 on four multi-hop QA benchmarks. In the CoRAG-8B rows, L is the maximum retrieval chain length, followed by the decoding strategy.

Model	2WikiQA EM	2WikiQA F1	HotpotQA EM	HotpotQA F1	Bamboogle EM	Bamboogle F1	MuSiQue EM	MuSiQue F1
3-shot Llama-3.1-8B-Inst.	30.7	39.9	34.1	46.6	28.0	37.3	7.7	15.4
3-shot GPT-4o	49.0	56.2	45.8	59.4	53.6	63.8	15.7	25.8
Fine-tuned Llama-8B w/ E5_large	55.1	60.7	50.3	63.5	40.8	53.7	17.4	28.1
CoRAG-8B (Ours)
> L=1, greedy	56.5	62.3	50.1	63.2	37.6	51.4	18.6	29.3
> L=6, greedy	70.6	75.5	54.4	67.5	48.0	63.5	27.7	38.5
> L=6, best-of-4	71.7	76.5	55.3	68.5	51.2	63.1	28.1	39.7
> L=6, tree search	71.7	76.4	55.8	69.0	48.8	64.4	29.0	40.3
> L=10, best-of-8	72.5	77.3	56.3	69.8	54.4	68.3	30.9	42.4
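
This card does not define the EM and F1 columns. A common convention for these benchmarks is SQuAD-style scoring: exact match after answer normalization, and token-overlap F1. The block below is a minimal sketch under that assumption, not the scoring code behind the table; the function names are illustrative, and the repository linked below remains the authoritative reference. The table values look like per-dataset averages of such scores, scaled to percentages.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, gold_answers: list[str]) -> dict[str, float]:
    """Best EM and F1 over all acceptable gold answers (assumed convention)."""
    return {
        "em": max(exact_match(prediction, g) for g in gold_answers),
        "f1": max(token_f1(prediction, g) for g in gold_answers),
    }
```

For example, score("Paris, France", ["Paris"]) gives an EM of 0 and an F1 of about 0.67, which is why the F1 column is consistently higher than EM: partially correct answers earn partial credit.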

Please refer to https://github.com/microsoft/LMOps/tree/main/corag for evaluation instructions.

Model predictions are available in the predictions field of the dataset at https://huggingface.co/datasets/corag/multihopqa

Disclaimer

This model has been specifically trained for the task of MultihopQA. It may not perform well on other tasks.

References

@article{wang2025chain,
  title={Chain-of-Retrieval Augmented Generation},
  author={Wang, Liang and Chen, Haonan and Yang, Nan and Huang, Xiaolong and Dou, Zhicheng and Wei, Furu},
  journal={arXiv preprint arXiv:2501.14342},
  year={2025}
}