Cyrile committed
Commit 37f957a · verified · 1 Parent(s): 97defb7

Update README.md

Files changed (1):
1. README.md +5 -27
README.md CHANGED
@@ -12,9 +12,9 @@ pipeline_tag: sentence-similarity
  ## Evaluation
  
  To assess the performance of the reranker, we will utilize the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. We will select
- the first question from each paragraph, along with the paragraph constituting the excerpt that should be ranked Top-1 for an Oracle modeling. What's intriguing is that
- the number of themes is limited, and each excerpt from a corresponding theme that does not match the question forms a hard negative (other excerpts outside the theme are
- simple negatives). Thus, we can construct the following table, with each theme showing the number of excerpts and associated questions:
+ the first question from each paragraph, along with the paragraph constituting the context that should be ranked Top-1 for an Oracle modeling. Notably, the number of
+ themes is limited, and every context from the same theme that does not match the query forms a hard negative (contexts from other themes are simple negatives). Thus,
+ we can construct the following table, with each theme showing the number of contexts and associated queries:
  
  | Theme name | Context number |
  |---------------------------------------------:|:---------------|
@@ -54,7 +54,7 @@ simple negatives). Thus, we can construct the following table, with each theme s
  | French_and_Indian_War | 46 |
  | Force | 44 |
  
- The evaluation corpus consists of 1204 pairs of question/context to be ranked.
+ The evaluation corpus consists of 1204 pairs of query/context to be ranked.
  
  Initially, the evaluation scores will be calculated in cases where both the query and the context are in the same language (French/French).
  
@@ -69,7 +69,7 @@ Initially, the evaluation scores will be calculated in cases where both the quer
  | [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
  
  
- Next, we evaluate the model in a cross-language context, with queries in English and contexts in French.
+ Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.
  
  | Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
  |:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
@@ -84,25 +84,3 @@ Next, we evaluate the model in a cross-language context, with queries in English
  As observed, the cross-language context does not significantly impact the behavior of our models. If the model is used in a reranking context along with filtering of the
  Top-K results from a search, a threshold of 0.8 could be applied to filter the contexts outputted by the retriever, thereby reducing noise issues present in the contexts
  for RAG-type applications.
-
-
-
- | Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
- |:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
- | BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
- | [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
- | [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
- | [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
- | [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
- | [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
- | [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
-
- | Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
- |:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
- | BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
- | [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
- | [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
- | [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
- | [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
- | [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
- | [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
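As context for the change above, the corpus construction the README describes can be sketched in a few lines. This is a hypothetical reconstruction rather than the authors' actual script: it pairs each unique paragraph of the SQuAD "validation" split with its first question and keeps the theme (the `title` field) so that same-theme contexts can later serve as hard negatives. The French translation used for the French/French setting is assumed to happen separately.

```python
# Hypothetical reconstruction of the evaluation corpus described above:
# one query per unique paragraph, grouped by theme for hard-negative mining.
from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="validation")

pairs, seen = [], set()
for ex in squad:
    if ex["context"] in seen:
        continue  # keep only the first question attached to each paragraph
    seen.add(ex["context"])
    pairs.append(
        {"theme": ex["title"], "query": ex["question"], "context": ex["context"]}
    )

print(len(pairs))  # the README reports 1204 query/context pairs
```

Every other context sharing a pair's theme then acts as a hard negative for that query; contexts from other themes are simple negatives.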
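The columns of the results tables are rank statistics, so they can all be derived from one vector of gold ranks. Below is a minimal sketch, assuming `ranks` holds the 1-based rank of the gold context for each query; the `mean score Top` / `std score Top` columns would additionally require the raw scores assigned to the gold contexts.

```python
import numpy as np

def ranking_metrics(ranks: np.ndarray) -> dict:
    """Rank-based metrics matching the table columns; `ranks` holds the
    1-based position of the gold context for each of the 1204 queries."""
    return {
        "Top-mean": ranks.mean(),                  # average rank of the gold context
        "Top-std": ranks.std(),                    # spread of that rank
        "Top-1 (%)": 100 * (ranks == 1).mean(),    # gold context ranked first
        "Top-10 (%)": 100 * (ranks <= 10).mean(),
        "Top-100 (%)": 100 * (ranks <= 100).mean(),
        "MRR (x100)": 100 * (1.0 / ranks).mean(),  # mean reciprocal rank
    }

# Example: an Oracle ranker puts every gold context at rank 1.
print(ranking_metrics(np.ones(1204)))  # Top-1 = 100, MRR = 100
```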
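The closing recommendation, filtering the retriever's Top-K with a 0.8 threshold before the contexts reach the generator, could be wired up as below. This assumes the reranker is queried as a cross-encoder through the transformers text-classification pipeline and that the returned score is a relevance probability in [0, 1]; `filter_contexts` is a hypothetical helper, and the model card's usage section remains the authoritative reference.

```python
from transformers import pipeline

# Load the reranker as a cross-encoder (assumed usage; see the model card).
reranker = pipeline("text-classification", model="cmarkea/bloomz-560m-reranking")

def filter_contexts(query: str, contexts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only retriever outputs the reranker scores above `threshold`,
    assuming the pipeline returns the relevance probability as `score`."""
    results = reranker([{"text": query, "text_pair": c} for c in contexts])
    return [c for c, r in zip(contexts, results) if r["score"] >= threshold]
```

Passing only the surviving contexts to the generator is what reduces the noise issues mentioned above for RAG-type applications.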