# Using on your data

Source code is available as a pip-installable Python package.

## Installation

Use of a virtual environment is recommended.
```bash
conda create -n selfrank python=3.10 
```
Activate the virtual environment:
```bash
conda activate selfrank
```

and then install the package:
```bash
pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```

## Usage

Start by gathering model inferences for the same set of questions/prompts across all the models you want to rank. The ranking method expects a pandas DataFrame with a row for each prompt and a column for each model, e.g.
|            | M1   | M2   | M3   | ...   |
|:-----------|:-----|:-----|:-----|:------|
| Q1         | a    | a    | b    | ...   |
| Q2         | a    | b    | b    | ...   |
| ...        | ...  | ...  | ...  | ...   |
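
For reference, here is a minimal sketch of how such a dataframe might be assembled, using three hypothetical models (`M1`–`M3`) and toy answers:

```python
import pandas as pd

# Hypothetical inferences: one row per prompt, one column per model.
inferences = {
    "M1": ["a", "a", "b"],
    "M2": ["a", "b", "b"],
    "M3": ["b", "b", "a"],
}
df = pd.DataFrame(inferences, index=["Q1", "Q2", "Q3"])
df.to_csv("inferences.csv", index=False)  # same file read in the snippet below
```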


With this data, the self-ranking procedure can be invoked as follows:

```python
import pandas as pd
from selfrank.algos.iterative import SelfRank # The full ranking algorithm
from selfrank.algos.greedy import SelfRankGreedy # The greedy version
from selfrank.algos.triplet import rouge, equality

f = "inferences.csv"
df = pd.read_csv(f)

models_to_rank = df.columns.tolist()
evaluator = rouge 
true_ranking = None

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)
r.fit(adf)
print(r.ranking)
```

This should output the estimated ranking (best to worst): `['M5', 'M2', 'M1', ...]`. If true rankings are known, evaluation measures can be computed with `r.measure(metric='rbo')` (for rank-biased overlap) or `r.measure(metric='mapk')` (for mean average precision).
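
If a ground-truth ordering is available (e.g. from human evaluation), it can be passed in at construction time. The following is a sketch with a purely hypothetical `true_ranking`:

```python
# Hypothetical ground-truth order (best to worst), for illustration only.
true_ranking = ["M3", "M1", "M2"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)

print(r.ranking)                 # estimated order, best to worst
print(r.measure(metric="rbo"))   # rank-biased overlap against true_ranking
print(r.measure(metric="mapk"))  # mean average precision against true_ranking
```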

We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While `rouge` is recommended for generative tasks like summarization, `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
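
For instance, to rank models on a multiple-choice benchmark where each cell holds a single answer label, you might swap in `equality` (a sketch reusing the names from the snippet above):

```python
from selfrank.algos.triplet import equality

# Exact-match comparison suits discrete outputs (e.g. MMLU answer letters)
# better than ROUGE.
r = SelfRank(models_to_rank, equality, None)
r.fit(df)
print(r.ranking)
```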

You can also pass any arbitrary function to the ranker, as long as it has the following signature:
```python
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df: dataframe with the inferences of all models (one column per model).
    Returns 1 if a is preferred, 0 if b is preferred.
    """

    # In this example, we count the number of times a's (or b's) answer
    # matches c's answer, ignoring prompts where a and b agree.
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)

    if a_wins >= b_wins:
        return 1
    else:
        return 0

```
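
The custom function can then be passed to the ranker in place of the built-in evaluators, e.g. (a sketch, assuming the same positional arguments as in the earlier example):

```python
# Use the custom pairwise evaluator instead of rouge/equality.
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)
```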