zhiminy committed on
Commit b979763 · 1 Parent(s): 7a10f2f

update readme

Files changed (1)
  1. README.md +18 -26
README.md CHANGED
@@ -13,44 +13,36 @@ short_description: The chatbot arena for software engineering

  # SE Arena: Explore and Test the Best SE Chatbots with Long-Context Interactions

- Welcome to **SE Arena**, an open-source platform for evaluating software engineering-focused chatbots. SE Arena is designed to benchmark foundation models (FMs), including large language models (LLMs), in iterative and context-rich workflows characteristic of software engineering (SE) tasks.

  ## Key Features

- - **Interactive Evaluation**: Test chatbots in multi-round conversations tailored for debugging, code generation, and requirement refinement.
- - **Transparent Leaderboard**: View model rankings across diverse SE workflows, updated in real-time using advanced metrics.
- - **Advanced Pairwise Comparisons**: Evaluate chatbots using metrics like Elo score, PageRank, and Newman modularity to understand their global dominance and task-specific strengths.
- - **Open-Source**: Built on [Hugging Face Spaces](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena), fostering transparency and community-driven innovation.

  ## Why SE Arena?

- Existing evaluation frameworks often fall short in addressing the complex, iterative nature of SE tasks. SE Arena fills this gap by:

- - Supporting long-context, multi-turn evaluations.
- - Allowing comparisons of anonymous models without bias.
- - Providing rich, multidimensional metrics for nuanced evaluations.

  ## How It Works

  1. **Submit a Prompt**: Sign in and input your SE-related task (e.g., debugging, code reviews).
- 2. **Compare Responses**: Two chatbots respond to your query side-by-side.
  3. **Vote**: Choose the better response, mark as tied, or select "Can't Decide."
- 4. **Iterative Testing**: Continue the conversation with follow-up prompts to test long-context understanding.
-
- ## Metrics Used
-
- SE Arena goes beyond traditional Elo scores by incorporating:
-
- - **Eigenvector Centrality**: Highlights models that perform well against high-quality competitors.
- - **PageRank**: Accounts for cyclic dependencies and emphasizes importance in dense sub-networks.
- - **Newman Modularity**: Groups models into clusters based on similar performance patterns, helping users identify task-specific expertise.

  ## Getting Started

  ### Prerequisites

  - A [Hugging Face](https://huggingface.co) account.
- - Basic knowledge of software engineering workflows.

  ### Usage

@@ -69,15 +61,15 @@ We welcome contributions from the community! Here's how you can help:

  ## Privacy Policy

- Your interactions are anonymized and used solely for improving SE Arena and foundation model benchmarking. By using SE Arena, you agree to our [Terms of Service](#).

  ## Future Plans

- - **Enhanced Metrics**: Add round-wise analysis and context-aware metrics.
- - **Domain-Specific Sub-Leaderboards**: Focused rankings for debugging, requirement refinement, etc.
- - **Integration of Advanced Context Compression**: Techniques like LongRope and SelfExtend for long-term memory.
- - **Support for Multimodal Models**: Evaluate models integrating text, code, and other modalities.

  ## Contact

- For inquiries or feedback, please [open an issue](https://github.com/zhimin-z/SE-Arena/issues/new) in this repository. We welcome your contributions and suggestions!
 

  # SE Arena: Explore and Test the Best SE Chatbots with Long-Context Interactions

+ Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused chatbots. SE Arena benchmarks foundation models (FMs), such as large language models (LLMs), in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.

  ## Key Features

+ - **Advanced Pairwise Comparisons**: Assess chatbots using Elo score, PageRank, and Newman modularity to understand both global performance and task-specific strengths (an illustrative sketch follows this list).
+ - **Interactive Evaluation**: Test chatbots in multi-round conversations tailored for SE tasks like debugging, code generation, and requirement refinement.
+ - **Open-Source**: Built on [Hugging Face Spaces](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena), enabling transparency and fostering community-driven innovation.
+ - **Transparent Leaderboard**: View real-time model rankings across diverse SE workflows, updated using advanced evaluation metrics.
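
The pairwise votes behind these metrics can be viewed as a graph over models. The sketch below shows, purely as an illustration, how PageRank and Newman modularity could be computed from such votes with `networkx`; the model names, vote records, and edge construction are assumptions, not SE Arena's actual implementation:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Hypothetical vote records: (winner, loser) pairs from anonymous pairwise battles.
votes = [
    ("model_a", "model_b"),
    ("model_c", "model_a"),
    ("model_c", "model_b"),
    ("model_b", "model_a"),
]

# Build a directed "lost-to" graph: an edge loser -> winner lets importance
# flow toward models that beat strong opponents.
G = nx.DiGraph()
for winner, loser in votes:
    if G.has_edge(loser, winner):
        G[loser][winner]["weight"] += 1
    else:
        G.add_edge(loser, winner, weight=1)

# PageRank: a global score that rewards wins over highly ranked opponents.
pagerank_scores = nx.pagerank(G, weight="weight")

# Newman modularity: cluster models with similar win/loss patterns
# (computed here on the undirected projection of the vote graph).
undirected = G.to_undirected()
communities = greedy_modularity_communities(undirected, weight="weight")
q = modularity(undirected, communities, weight="weight")

print(pagerank_scores)
print([set(c) for c in communities], q)
```

In this graph view, PageRank captures global standing, while the modularity clusters group models with similar win/loss patterns, which is what makes them useful for spotting task-specific strengths.
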

  ## Why SE Arena?

+ Existing evaluation frameworks often do not address the complex, iterative nature of SE tasks. SE Arena fills this gap by:

+ - Supporting long-context, multi-turn evaluations to capture iterative workflows.
+ - Allowing anonymous model comparisons to prevent bias.
+ - Providing rich, multidimensional metrics for more nuanced model evaluations.

  ## How It Works

  1. **Submit a Prompt**: Sign in and input your SE-related task (e.g., debugging, code reviews).
+ 2. **Compare Responses**: Two anonymous chatbots respond to your query side by side.
  3. **Vote**: Choose the better response, mark as tied, or select "Can't Decide."
+ 4. **Iterative Testing**: Continue the conversation with follow-up prompts to test contextual understanding over multiple rounds.
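
Each vote in step 3 can be read as one pairwise comparison feeding the leaderboard. Below is a minimal, illustrative Elo update for a single vote; the K-factor, starting ratings, and function name are assumptions rather than SE Arena's actual code:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one vote.

    score_a is 1.0 if model A wins, 0.0 if it loses, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: model A (rated 1000) beats model B (rated 1100); A gains points, B loses them.
print(elo_update(1000.0, 1100.0, score_a=1.0))
```

Under this sketch, a tie maps to `score_a = 0.5`, while a "Can't Decide" vote could simply be left out of the rating update.
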
  ## Getting Started

  ### Prerequisites

  - A [Hugging Face](https://huggingface.co) account.
+ - Basic understanding of software engineering workflows.

  ### Usage

  ## Privacy Policy

+ Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our [Terms of Service](#).

  ## Future Plans

+ - **Enhanced Metrics**: Add round-wise analysis and context-aware evaluation metrics.
+ - **Domain-Specific Sub-Leaderboards**: Rankings focused on tasks like debugging, requirement refinement, etc.
+ - **Advanced Context Compression**: Techniques like LongRope and SelfExtend to manage long-term memory.
+ - **Support for Multimodal Models**: Evaluate models that integrate text, code, and other modalities.

  ## Contact

+ For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineer-Arena/issues/new) in this repository. We welcome your contributions and suggestions!