samheym committed
Commit d84587b · verified · 1 Parent(s): e363b49

Update README.md

Files changed (1)
  1. README.md +7 -155
README.md CHANGED
@@ -6,9 +6,12 @@ tags:
  - PyLate
  - sentence-transformers
  - sentence-similarity
- - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: PyLate
+ datasets:
+ - samheym/ger-dpr-collection
+ base_model:
+ - deepset/gbert-base
  ---

  # GerColBERT
@@ -19,29 +22,16 @@ This is a [PyLate](https://github.com/lightonai/pylate) model trained. It maps s

  ### Model Description
  - **Model Type:** PyLate model
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+ - **Base model:** [deepset/gbert-base](https://huggingface.co/deepset/gbert-base)
  - **Document Length:** 180 tokens
  - **Query Length:** 32 tokens
  - **Output Dimensionality:** 128
  - **Similarity Function:** MaxSim
- <!-- - **Training Dataset:** Unknown -->
+ - **Training Dataset:** samheym/ger-dpr-collection
  - **Language:** de
  <!-- - **License:** Unknown -->

- ### Model Sources
-
- - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
-
- ### Full Model Architecture
-
- ```
- ColBERT(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
- )
- ```
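
The MaxSim similarity listed in the model description is the standard ColBERT late-interaction score: every 128-dimensional query token embedding is compared with every document token embedding, each query token keeps only its best match, and those maxima are summed. A minimal sketch in plain PyTorch (illustrative only; the function name, random inputs, and shapes are assumptions for demonstration, not part of the PyLate API):

```python
import torch

def maxsim_score(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    """MaxSim late-interaction score between one query and one document.

    Both inputs hold one L2-normalised 128-dimensional vector per token:
    query_embeddings    -> (num_query_tokens, 128)
    document_embeddings -> (num_document_tokens, 128)
    """
    # Cosine similarity of every query token against every document token
    similarities = query_embeddings @ document_embeddings.T
    # Each query token keeps its best-matching document token; the maxima are summed
    return similarities.max(dim=1).values.sum()

# Toy example using the lengths from the model description (32 query / 180 document tokens)
query = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
document = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(query, document))
```
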

  ## Usage
  First install the PyLate library:
@@ -54,10 +44,6 @@ pip install -U pylate

  PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

- #### Indexing documents
-
- First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
-
  ```python
  from pylate import indexes, models, retrieve
 
@@ -65,143 +51,9 @@ from pylate import indexes, models, retrieve
  model = models.ColBERT(
      model_name_or_path="samheym/GerColBERT",
  )
-
- # Step 2: Initialize the Voyager index
- index = indexes.Voyager(
-     index_folder="pylate-index",
-     index_name="index",
-     override=True,  # This overwrites the existing index if any
- )
-
- # Step 3: Encode the documents
- documents_ids = ["1", "2", "3"]
- documents = ["document 1 text", "document 2 text", "document 3 text"]
-
- documents_embeddings = model.encode(
-     documents,
-     batch_size=32,
-     is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
-     show_progress_bar=True,
- )
-
- # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
- index.add_documents(
-     documents_ids=documents_ids,
-     documents_embeddings=documents_embeddings,
- )
- ```
-
- Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
-
- ```python
- # To load an index, simply instantiate it with the correct folder/name and without overriding it
- index = indexes.Voyager(
-     index_folder="pylate-index",
-     index_name="index",
- )
  ```
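
As a quick sanity check after loading the model, you can encode a query and a document and inspect the per-token embeddings. This is a hypothetical sketch: the example strings are invented and the exact return container of `encode` depends on its settings, but each entry should be a matrix with one 128-dimensional row per token.

```python
# Hypothetical sanity check, reusing the `model` loaded above.
query_embeddings = model.encode(
    ["Wer hat die Relativitätstheorie entwickelt?"],
    is_query=True,   # queries are padded/truncated to 32 tokens
)
document_embeddings = model.encode(
    ["Albert Einstein veröffentlichte 1905 die spezielle Relativitätstheorie."],
    is_query=False,  # documents are truncated to 180 tokens
)

# One embedding matrix per input text, one 128-dimensional vector per token.
print(query_embeddings[0].shape)
print(document_embeddings[0].shape)
```
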

- #### Retrieving top-k documents for queries
-
- Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
- To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matching ids and relevance scores:
-
- ```python
- # Step 1: Initialize the ColBERT retriever
- retriever = retrieve.ColBERT(index=index)
-
- # Step 2: Encode the queries
- queries_embeddings = model.encode(
-     ["query for document 3", "query for document 1"],
-     batch_size=32,
-     is_query=True,  # Ensure that it is set to True to indicate that these are queries
-     show_progress_bar=True,
- )
-
- # Step 3: Retrieve top-k documents
- scores = retriever.retrieve(
-     queries_embeddings=queries_embeddings,
-     k=10,  # Retrieve the top 10 matches for each query
- )
- ```

- ### Reranking
- If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:
-
- ```python
- from pylate import rank, models
-
- queries = [
-     "query A",
-     "query B",
- ]
-
- documents = [
-     ["document A", "document B"],
-     ["document 1", "document C", "document B"],
- ]
-
- documents_ids = [
-     [1, 2],
-     [1, 3, 2],
- ]
-
- model = models.ColBERT(
-     model_name_or_path="samheym/GerColBERT",
- )
-
- queries_embeddings = model.encode(
-     queries,
-     is_query=True,
- )
-
- documents_embeddings = model.encode(
-     documents,
-     is_query=False,
- )
-
- reranked_documents = rank.rerank(
-     documents_ids=documents_ids,
-     queries_embeddings=queries_embeddings,
-     documents_embeddings=documents_embeddings,
- )
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

  ## Training Details

@@ -215,7 +67,7 @@ You can finetune this model on your own dataset.
  - Datasets: 2.21.0
  - Tokenizers: 0.21.0

-
+ <!--
  ## Citation

  ### BibTeX