Commit f40cd39 by jinhyuklee (1 parent: 62e8598)

Update README.md

README.md CHANGED
@@ -1,3 +1,53 @@
---
license: apache-2.0
---

# XTR: Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

This README shows how to run [XTR](https://arxiv.org/abs/2304.01982) with PyTorch.

We thank Mujeen Sung (https://github.com/mjeensung/xtr-pytorch) for providing this PyTorch implementation.

## Installation

```
$ git clone git@github.com:mjeensung/xtr-pytorch.git
$ cd xtr-pytorch
$ pip install -e .
```
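
After installing, a quick environment check can help, since the usage example passes `device="cuda"`. The sketch below is only an illustration: `check_environment` is a hypothetical helper, not part of xtr-pytorch, and it assumes that the retriever would also accept `"cpu"` as a device when no GPU is visible.

```python
import importlib.util


def check_environment() -> dict:
    """Report whether PyTorch is installed and which device looks usable."""
    # find_spec returns None when the package cannot be found.
    torch_installed = importlib.util.find_spec("torch") is not None
    device = "cpu"
    if torch_installed:
        import torch

        # Prefer the GPU only when PyTorch can actually see one.
        if torch.cuda.is_available():
            device = "cuda"
    return {"torch_installed": torch_installed, "device": device}


if __name__ == "__main__":
    print(check_environment())
```

The returned `device` string could then be passed in place of the hard-coded `"cuda"`, assuming the retriever supports CPU execution.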

## Usage
```python
from nltk.tokenize import sent_tokenize  # needs NLTK's "punkt" tokenizer data
# Assumed import path; check the xtr-pytorch repository for the actual module.
from xtr.modeling_xtr import XtrRetriever

# Create the dataset: split the document into lowercased sentence chunks
sample_doc = "Google LLC (/ˈɡuːɡəl/ (listen)) is an American multinational technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence..."
chunks = [chunk.lower() for chunk in sent_tokenize(sample_doc)]

# Load the XTR retriever
xtr = XtrRetriever(model_name_or_path="google/xtr-base-en", use_faiss=False, device="cuda")

# Build the index
xtr.build_index(chunks)

# Retrieve top-3 documents given the query
query = "Who founded google"
retrieved_docs, metadata = xtr.retrieve_docs([query], document_top_k=3)
for rank, (did, score, doc) in enumerate(retrieved_docs[0]):
    print(f"[{rank}] doc={did} ({score:.3f}): {doc}")

"""
>> [0] doc=0 (0.925): google llc (/ˈɡuːɡəl/ (listen)) is an american multinational technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics.
>> [1] doc=1 (0.903): it has been referred to as "the most powerful company in the world" and one of the world's most valuable brands due to its market dominance, data collection, and technological advantages in the area of artificial intelligence.
>> [2] doc=2 (0.900): its parent company alphabet is considered one of the big five american information technology companies, alongside amazon, apple, meta, and microsoft.
"""
```
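
The loop in the usage example unpacks each hit as a `(did, score, doc)` tuple. As a minimal sketch of working with that shape, the hypothetical helper below (`format_hits`, not part of xtr-pytorch) formats one query's hits and optionally drops those under a score threshold. The sample data is a stand-in for `retrieved_docs[0]`: ids and scores are copied from the sample output, and the document texts are abbreviated.

```python
from typing import List, Tuple

# Assumed shape of one hit, taken from the loop in the usage example.
Hit = Tuple[int, float, str]  # (document id, score, document text)


def format_hits(hits: List[Hit], min_score: float = 0.0) -> List[str]:
    """Format one query's hits, skipping those scoring below min_score."""
    lines = []
    for rank, (did, score, doc) in enumerate(hits):
        if score < min_score:
            continue
        lines.append(f"[{rank}] doc={did} ({score:.3f}): {doc}")
    return lines


# Stand-in for retrieved_docs[0]; document texts are abbreviated.
sample_hits = [
    (0, 0.925, "google llc ..."),
    (1, 0.903, "it has been referred to as ..."),
    (2, 0.900, "its parent company alphabet ..."),
]

for line in format_hits(sample_hits, min_score=0.902):
    print(line)
```

The threshold value here is arbitrary; whether a fixed score cutoff is meaningful for XTR scores depends on the data and is not something the source specifies.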

## Citing this work

```bibtex
@article{lee2024rethinking,
  title={Rethinking the role of token retrieval in multi-vector retrieval},
  author={Lee, Jinhyuk and Dai, Zhuyun and Duddu, Sai Meher Karthik and Lei, Tao and Naim, Iftekhar and Chang, Ming-Wei and Zhao, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```