---
language: en
library_name: FlexRAG
tags:
- FlexRAG
- retrieval
- search
- lexical
- RAG
---

# The BM25SRetriever for the wiki2021 corpus

The corpus was created by the [Atlas](https://github.com/facebookresearch/atlas) project, and the index was built with the [FlexRAG](https://github.com/ictnlp/flexrag) library.

| Corpus Attribute | Value                                                           |
| ---------------- | --------------------------------------------------------------- |
| Language         | English                                                         |
| Domain           | Wikipedia                                                       |
| Size             | 37.5M (33.1M text, 4.3M infobox)                                |
| Dump Date        | Dec 2021                                                        |
| Provider         | [Atlas](https://github.com/facebookresearch/atlas)              |
| License          | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |

| Index Attribute | Value                                                           |
| --------------- | --------------------------------------------------------------- |
| Index Type      | BM25S                                                           |
| Index Method    | Lucene                                                          |
| Preprocessing   | LengthFilter(min_char=10, max_char=4096)                        |
| Provider        | [FlexRAG](https://github.com/ictnlp/flexrag)                    |
| License         | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |

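To make the `Preprocessing` row concrete, the following is a minimal sketch of what a character-length filter with these bounds might do. The `length_filter` function, the `text` field name, and the inclusive bounds are assumptions for illustration. This is not FlexRAG's `LengthFilter` implementation.

```python
def length_filter(passages, min_char=10, max_char=4096):
    """Keep passages whose text length lies in [min_char, max_char] (assumed inclusive)."""
    return [p for p in passages if min_char <= len(p["text"]) <= max_char]


# Toy passages: one too short, one acceptable, one too long.
passages = [
    {"title": "Stub", "text": "Too short"},
    {"title": "Batman", "text": "Bruce Wayne is a fictional character."},
    {"title": "Huge", "text": "x" * 5000},
]

kept = length_filter(passages)
print([p["title"] for p in kept])
```

Under these assumptions, only passages between 10 and 4096 characters would have been indexed, so very short stubs and very long passages cannot be retrieved.
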
## Installation

You can install the `FlexRAG` library with `pip`:

```bash
pip install flexrag
```

## Loading a `FlexRAG` retriever

You can use this retriever for information retrieval tasks. Here is an example:

```python
from flexrag.retriever import LocalRetriever

# Load the retriever from the Hugging Face Hub
retriever = LocalRetriever.load_from_hub("FlexRAG/wiki2021_atlas_bm25s")

# Retrieve passages for a query
results = retriever.search("Who is Bruce Wayne?")
```

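Because the index is lexical, retrieved passages are ranked by term overlap with the query rather than by embedding similarity. The toy scorer below sketches the Lucene variant of the BM25 formula, assuming that is what the `Lucene` index method refers to. The whitespace tokenizer and the values `k1=1.5` and `b=0.75` are assumptions. The source does not state the tokenizer or parameters used to build this index.

```python
import math
from collections import Counter


def bm25_lucene_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a Lucene-style BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes longer-than-average documents.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    "bruce wayne is the secret identity of batman",
    "wayne enterprises is a fictional company",
    "gotham city is a fictional city",
]
scores = bm25_lucene_scores("bruce wayne", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document matches both query terms and scores highest. The third shares no term with the query and scores zero, which is the usual failure mode of lexical retrieval when a query is paraphrased.
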
## Running the RAG application with the retriever

You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:

```bash
python -m flexrag.entrypoints.run_interactive \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.response_type=original \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False
```

You can also run **FlexRAG's RAG evaluation pipeline** with this retriever. Here is an example that evaluates the **ModularAssistant** with the retriever on the *Natural Questions* test split:

```bash
OUTPUT_PATH=<path_to_output>
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>

python -m flexrag.entrypoints.run_assistant \
    name=nq \
    split=test \
    output_path=${OUTPUT_PATH} \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer]
```

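For readers unfamiliar with the three metrics named in the command, here is a rough sketch of how retrieval success rate, exact match, and token-level F1 are commonly computed for open-domain QA. The function names and the SQuAD-style `normalize` helper are hypothetical stand-ins. The behavior of FlexRAG's `simplify_answer` processor and its metric implementations may differ.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, golds):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))


def token_f1(prediction, golds):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for g in golds:
        gold_tokens = normalize(g).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def retrieval_success(passages, golds):
    """1.0 if any gold answer occurs in the text of any retrieved passage."""
    return float(any(normalize(g) in normalize(p) for p in passages for g in golds))


golds = ["Batman"]
em = exact_match("The Batman.", golds)
f1 = token_f1("Bruce Wayne is Batman", golds)
hit = retrieval_success(["Bruce Wayne is the secret identity of Batman."], golds)
print(em, f1, hit)
```

Averaging each value over all questions in the split gives the corpus-level score. Retrieval success depends only on the retriever, while exact match and F1 also reflect the generator.
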
## License

Because the corpus is released under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the retriever is distributed under the same license.

## Related Links

FlexRAG Related Links:

* [Documentation](https://flexrag.readthedocs.io/en/latest/)
* [GitHub Repository](https://github.com/ictnlp/flexrag)