from src.display.utils import ModelType
TITLE = """<img src="https://i.postimg.cc/250G53CJ/src-display-SIL-logo.png" style="width:30%;display:block;margin-left:auto;margin-right:auto">"""
INTRODUCTION_TEXT = f"""
Welcome to the Self-Improving Leaderboard (SIL) - A Revolutionary Platform for Evaluating Large Language Models
The SIL offers a dynamic approach to assessing and ranking open-source LLMs and chatbots. Our innovative system continuously updates test datasets and recalculates rankings daily, ensuring evaluations reflect the rapid evolution of language processing capabilities.
Key Features:
- Daily-refreshed test datasets
- Adaptive ranking system
- Real-world language processing challenges
- Comprehensive model performance insights
Explore our cutting-edge evaluation process, gain deep insights into model capabilities, and see how different LLMs compare in this ever-changing landscape.
Ready to participate? Submit your model for evaluation on the 'Submit' page and join the forefront of LLM advancement. For a detailed look at our methodology, visit the 'About' page.
The SIL is proudly developed and maintained by [Your Organization/Team Name]. Together, let's push the boundaries of language AI!
"""
LLM_BENCHMARKS_TEXT = f"""
# How it works
🔄 The Self-Improving Leaderboard (SIL) operates on a dynamic evaluation system that continuously evolves to reflect real-world language processing challenges. Here's an overview of our process:
## Daily Dataset Refresh
- Our system generates new test data daily from diverse, reputable sources.
- Advanced LLMs are used to synthesize additional relevant content.
- The dataset is divided into two sections:
  - A primary dataset that preserves the integrity of the sourced data
  - A noise-injected dataset that simulates real-world data complexities (a sketch of the idea follows this list)
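The exact noise-injection method is not documented here. As a purely hypothetical illustration (the function name, noise rate, and perturbation strategy are all assumptions, not SIL's actual pipeline), a noisy split could be derived from a clean one like this:
```python
import random

def inject_noise(text, noise_rate=0.05, seed=None):
    # Hypothetical illustration only, NOT SIL's actual method:
    # randomly replace a small fraction of letters to mimic
    # real-world corruption such as typos or OCR errors.
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < noise_rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

primary = ["The quick brown fox jumps over the lazy dog."]
noisy = [inject_noise(t, seed=42) for t in primary]
```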
## Model Evaluation
- Participating models are rigorously re-evaluated against the refreshed dataset every 24 hours.
- We employ a comprehensive set of metrics aligned with industry-standard benchmarks.
- Our evaluation framework is built on the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), ensuring a robust and consistent assessment (see the sketch below).
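For reference, a minimal programmatic run of the harness might look like the sketch below. It assumes lm-evaluation-harness v0.4+ (`pip install lm-eval`); the model and task names are placeholders, not SIL's actual configuration:
```python
from lm_eval import simple_evaluate

# Evaluate one model on one task; names below are placeholders.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name,revision=main",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"])  # per-task metric values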
## Ranking System
- Model rankings are updated daily based on their performance across various tasks.
- The leaderboard reflects not only the latest scores but also tracks consistency and adaptability over time (a simplified illustration follows this list).
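The precise ranking formula is internal to SIL; as a simplified, hypothetical illustration, daily scores could be aggregated and ordered like this:
```python
import pandas as pd

# Hypothetical illustration, NOT SIL's actual formula:
# average each model's daily scores, then order by the mean.
rows = [
    ("model-a", "2025-01-01", 71.2),
    ("model-a", "2025-01-02", 70.8),
    ("model-b", "2025-01-01", 69.5),
    ("model-b", "2025-01-02", 72.1),
]
df = pd.DataFrame(rows, columns=["model", "date", "score"])
leaderboard = df.groupby("model")["score"].mean().sort_values(ascending=False)
print(leaderboard)  # position in this ordering = rank
```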
## Quarterly Comprehensive Evaluation
- Every three months, we conduct an in-depth analysis of model performance.
- This evaluation considers long-term trends, adaptability to evolving data, and overall efficacy.
- Special recognition (e.g., medals or badges) may be awarded for sustained excellence.
By continuously refreshing our test data and evaluation criteria, SIL aims to provide a more accurate representation of model performance in real-world scenarios, driving innovation in the field of Natural Language Processing.
## Icons
{ModelType.PT.to_str(" : ")} model
{ModelType.IFT.to_str(" : ")} model
{ModelType.RL.to_str(" : ")} model
If a model has no icon, we do not yet have enough information about it.
Please open an issue to share the model's details! 🤩
## Details and Logs
- Detailed numerical results in the `results` dataset: https://huggingface.co/datasets/junkim100/SIL_results
- Community queries and running status in the `requests` dataset: https://huggingface.co/datasets/junkim100/SIL_requests
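For example, to peek at the `results` dataset locally (assuming the repo loads with the default `datasets` configuration):
```python
from datasets import load_dataset

# Split and configuration names are assumptions; print the
# returned object to see what the repo actually contains.
results = load_dataset("junkim100/SIL_results")
print(results)
```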
"""
EVALUATION_QUEUE_TEXT = f"""
# Evaluation Queue for the 🔄 Self-Improving Leaderboard
## Some good practices before submitting a model
### 1️⃣ Make sure you can load your model and tokenizer using AutoClasses
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # the branch name or commit hash you want evaluated
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
⚠️ Make sure your model is public!
⚠️ Make sure your model runs with the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)!
### 2️⃣ Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
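If your checkpoint is still stored as PyTorch `.bin` files, one simple way to convert it (a sketch, assuming a standard `transformers` checkpoint) is to reload it and re-save it with safe serialization, then upload the resulting folder to the Hub (e.g. with `huggingface_hub.upload_folder`):
```python
from transformers import AutoModel

# Reload the checkpoint, then write it back out as safetensors.
model = AutoModel.from_pretrained("your model name")
model.save_pretrained("your-model-safetensors", safe_serialization=True)
```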
### 3️⃣ Make sure your model has an open license!
We'd love for as many people as possible to know they can use your model!
### 4️⃣ Fill out your model card
When we add extra information about models to the leaderboard, it is taken automatically from your model card.
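You can check what metadata your card exposes with `huggingface_hub` (a minimal sketch; the model name is a placeholder):
```python
from huggingface_hub import ModelCard

# Load the card for a Hub model and inspect its YAML metadata.
card = ModelCard.load("your model name")
print(card.data)  # license, tags, and other structured fields
```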
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped. Make sure you have followed the above steps first. If everything is done, check that you can run the Eleuther AI Harness on your model locally, for example with the sketch below (the harness's `limit` option caps the number of examples per task).
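A quick local sanity check might look like this (assumes lm-evaluation-harness v0.4+; the task name is a placeholder):
```python
from lm_eval import simple_evaluate

# `limit` caps the number of examples per task for fast debugging.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name,revision=main",
    tasks=["hellaswag"],
    limit=10,
)
print(results["results"])
```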
"""