Update README.md
README.md CHANGED
```diff
@@ -12,9 +12,9 @@ pipeline_tag: sentence-similarity
 ## Evaluation
 
 To assess the performance of the reranker, we will utilize the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. We will select
-the first question from each paragraph, along with the paragraph constituting the
-the number of themes is limited, and each
-simple negatives). Thus, we can construct the following table, with each theme showing the number of
+the first question from each paragraph, along with the paragraph constituting the context that should be ranked Top-1 for an Oracle modeling. What's intriguing is that
+the number of themes is limited, and each context from a corresponding theme that does not match the query forms a hard negative (other contexts outside the theme are
+simple negatives). Thus, we can construct the following table, with each theme showing the number of contexts and the associated queries:
 
 | Theme name | Context number |
 |---------------------------------------------:|:---------------|
@@ -54,7 +54,7 @@ simple negatives). Thus, we can construct the following table, with each theme s
 | French_and_Indian_War | 46 |
 | Force | 44 |
 
-The evaluation corpus consists of 1204 pairs of
+The evaluation corpus consists of 1204 query/context pairs to be ranked.
 
 Initially, the evaluation scores will be calculated in cases where both the query and the context are in the same language (French/French).
 
@@ -69,7 +69,7 @@ Initially, the evaluation scores will be calculated in cases where both the quer
 | [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
 
 
-Next, we evaluate the model in a cross-language context, with queries in
+Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.
 
 | Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
 |:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
@@ -84,25 +84,3 @@ Next, we evaluate the model in a cross-language context, with queries in English
 As observed, the cross-language context does not significantly impact the behavior of our models. If the model is used in a reranking context along with filtering of the
 Top-K results from a search, a threshold of 0.8 could be applied to filter the contexts outputted by the retriever, thereby reducing noise issues present in the contexts
 for RAG-type applications.
-
-
-
-| Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
-|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
-| BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
-| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
-| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
-| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
-| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
-| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
-| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
-
-| Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
-|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
-| BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
-| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
-| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
-| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
-| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
-| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
-| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
```
## Evaluation

To assess the performance of the reranker, we will utilize the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. We will select the first question from each paragraph, along with the paragraph constituting the context that should be ranked Top-1 for an Oracle modeling. What's intriguing is that the number of themes is limited, and each context from a corresponding theme that does not match the query forms a hard negative (other contexts outside the theme are simple negatives). Thus, we can construct the following table, with each theme showing the number of contexts and the associated queries:

| Theme name | Context number |
|---------------------------------------------:|:---------------|
| ... | ... |
| French_and_Indian_War | 46 |
| Force | 44 |

The evaluation corpus consists of 1204 query/context pairs to be ranked.
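The corpus construction described above can be sketched in a few lines. This is a toy illustration rather than the exact evaluation script: the records mimic SQuAD's `title`/`context`/`question` fields (the field names and the helper functions are assumptions), with `title` playing the role of the theme.

```python
# Toy sketch of the evaluation-corpus construction: keep the first question per
# paragraph, then treat same-theme contexts as hard negatives and contexts from
# other themes as simple negatives. Field names mimic SQuAD ("title" = theme).

def build_corpus(records):
    """Keep the first question seen for each distinct paragraph (context)."""
    corpus, seen = [], set()
    for rec in records:
        if rec["context"] not in seen:
            seen.add(rec["context"])
            corpus.append({"theme": rec["title"],
                           "query": rec["question"],
                           "context": rec["context"]})
    return corpus

def negatives(pair, corpus):
    """Same-theme contexts are hard negatives; other themes are simple negatives."""
    hard = [c["context"] for c in corpus
            if c["theme"] == pair["theme"] and c["context"] != pair["context"]]
    simple = [c["context"] for c in corpus if c["theme"] != pair["theme"]]
    return hard, simple

records = [
    {"title": "Force", "context": "p1", "question": "q1"},
    {"title": "Force", "context": "p1", "question": "q1-bis"},  # same paragraph: skipped
    {"title": "Force", "context": "p2", "question": "q2"},
    {"title": "French_and_Indian_War", "context": "p3", "question": "q3"},
]
corpus = build_corpus(records)
hard, simple = negatives(corpus[0], corpus)
print(len(corpus), hard, simple)  # 3 ['p2'] ['p3']
```

On the real "validation" split, the records would come from the SQuAD dataset linked above, yielding the 1204 pairs used here.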
Initially, the evaluation scores will be calculated in cases where both the query and the context are in the same language (French/French).

| Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
| BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |

Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.

| Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
| BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
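For reference, the rank-based columns in these tables can be derived from the (1-based) rank that the gold context obtains after reranking: Top-mean/Top-std are the mean and standard deviation of that rank, Top-k is the share of queries whose gold context lands in the first k positions, and MRR is the mean reciprocal rank. This reading of the columns, and the sketch below on toy ranks, are our interpretation rather than the exact evaluation code:

```python
# Rank-based metrics computed from the 1-based rank of the gold context for
# each query. The ranks here are toy values, not the real evaluation output.

def ranking_metrics(ranks, ks=(1, 10, 100)):
    n = len(ranks)
    top_mean = sum(ranks) / n                                       # Top-mean
    top_std = (sum((r - top_mean) ** 2 for r in ranks) / n) ** 0.5  # Top-std
    top_k = {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks} # Top-k (%)
    mrr = 100.0 * sum(1.0 / r for r in ranks) / n                   # MRR (x100)
    return top_mean, top_std, top_k, mrr

top_mean, top_std, top_k, mrr = ranking_metrics([1, 1, 2, 10])
print(top_mean, top_k[1], round(mrr, 2))  # 3.5 50.0 65.0
```

The "mean score Top" / "std score Top" columns are the statistics of the reranker's own scores, which is what makes the score threshold discussed below usable.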
As observed, the cross-language context does not significantly impact the behavior of our models. If the model is used in a reranking context along with filtering of the Top-K results from a search, a threshold of 0.8 could be applied to filter the contexts outputted by the retriever, thereby reducing noise issues present in the contexts for RAG-type applications.
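The 0.8 threshold suggested above can be dropped into a RAG pipeline as a post-retrieval filter. The sketch below assumes the reranker scores have already been computed (how exactly depends on your serving setup); the filtering step itself is plain Python, and the context names and scores are illustrative:

```python
# Post-retrieval filtering for RAG: keep only contexts whose reranker score
# clears the threshold, then pass at most top_k of them to the generator.

def filter_contexts(scored, threshold=0.8, top_k=4):
    """scored: list of (context, reranker_score) pairs from the retriever's Top-K."""
    kept = [(ctx, s) for ctx, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best-scored first
    return kept[:top_k]

retrieved = [("ctx_a", 0.94), ("ctx_b", 0.31), ("ctx_c", 0.85), ("ctx_d", 0.79)]
print(filter_contexts(retrieved))  # [('ctx_a', 0.94), ('ctx_c', 0.85)]
```

Lowering the threshold trades noise for recall; with the score distributions reported above (mean score Top around 0.9, std around 0.1), 0.8 keeps most gold contexts while discarding clearly off-topic ones.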