update readme

README.md

@@ -13,44 +13,36 @@ short_description: The chatbot arena for software engineering
# SE Arena: Explore and Test the Best SE Chatbots with Long-Context Interactions

Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused chatbots. SE Arena benchmarks foundation models (FMs), such as large language models (LLMs), in the iterative, context-rich workflows characteristic of software engineering (SE) tasks.

## Key Features

- **Advanced Pairwise Comparisons**: Assess chatbots using Elo score, PageRank, and Newman modularity to understand both global performance and task-specific strengths (see the sketch after this list).
- **Interactive Evaluation**: Test chatbots in multi-round conversations tailored to SE tasks such as debugging, code generation, and requirement refinement.
- **Open-Source**: Built on [Hugging Face Spaces](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena), enabling transparency and fostering community-driven innovation.
- **Transparent Leaderboard**: View real-time model rankings across diverse SE workflows, updated using the evaluation metrics above.
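To make the graph-based metrics concrete, here is a minimal sketch of how PageRank and Newman modularity can be computed over a "who-beat-whom" graph built from pairwise votes, using `networkx`. The vote data and model names are invented for illustration; this is an assumed approach, not SE Arena's actual pipeline.

```python
# Hypothetical sketch: ranking models from pairwise votes with graph metrics.
# The votes below are made-up examples, not SE Arena data.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

votes = [
    ("model_a", "model_b"),  # (winner, loser) for one comparison
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

# Point each edge from loser to winner, so "importance" flows toward strong models.
G = nx.DiGraph()
for winner, loser in votes:
    G.add_edge(loser, winner)

# PageRank: models that beat strong opponents score higher.
scores = nx.pagerank(G)
print(sorted(scores.items(), key=lambda kv: -kv[1]))

# Newman modularity: group models into densely connected comparison clusters.
communities = greedy_modularity_communities(G.to_undirected())
print([sorted(c) for c in communities])
```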
25 |
## Why SE Arena?
|
26 |
|
27 |
+
Existing evaluation frameworks often do not address the complex, iterative nature of SE tasks. SE Arena fills this gap by:
|
28 |
|
29 |
+
- Supporting long-context, multi-turn evaluations to capture iterative workflows.
|
30 |
+
- Allowing anonymous model comparisons to prevent bias.
|
31 |
+
- Providing rich, multidimensional metrics for more nuanced model evaluations.
|
32 |
|
33 |
## How It Works

1. **Submit a Prompt**: Sign in and input your SE-related task (e.g., debugging, code reviews).
2. **Compare Responses**: Two anonymous chatbots respond to your query.
3. **Vote**: Choose the better response, mark the pair as tied, or select "Can't Decide." Votes drive the leaderboard rankings (see the sketch after this list).
4. **Iterative Testing**: Continue the conversation with follow-up prompts to test contextual understanding over multiple rounds.
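As an illustration of how a single vote could move a rating, the sketch below applies a textbook Elo update. The K-factor, starting ratings, and model names are assumptions for the example; SE Arena's actual scoring may differ.

```python
# Illustrative standard Elo update for one pairwise vote.
# K-factor and starting ratings are assumed values, not SE Arena's.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a vote where "model_a" beats "model_b".
ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model_a gains exactly the points model_b loses
```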
## Getting Started

### Prerequisites

- A [Hugging Face](https://huggingface.co) account.
- Basic understanding of software engineering workflows.

### Usage

@@ -69,15 +61,15 @@ We welcome contributions from the community! Here's how you can help:
## Privacy Policy

Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our [Terms of Service](#).

## Future Plans

- **Enhanced Metrics**: Add round-wise analysis and context-aware evaluation metrics.
- **Domain-Specific Sub-Leaderboards**: Rankings focused on tasks such as debugging and requirement refinement.
- **Advanced Context Compression**: Techniques such as LongRoPE and SelfExtend to manage long-term memory.
- **Support for Multimodal Models**: Evaluate models that integrate text, code, and other modalities.

## Contact

For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineer-Arena/issues/new) in this repository. We welcome your contributions and suggestions!