zhiminy committed on
Commit 065faaf · 1 Parent(s): adfb223

Update README.md

Files changed (1)
  1. README.md +35 -28
README.md CHANGED
@@ -11,64 +11,71 @@ pinned: false
  short_description: The chatbot arena for software engineering
  ---

- # SE Arena: Evaluate Best SE Chatbots with Long-Context Interactions

- Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused chatbots. SE Arena benchmarks foundation models (FMs), such as large language models (LLMs), in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.

  ## Key Features

- - **Advanced Pairwise Comparisons**: Assess chatbots using Elo score, PageRank, and Newman modularity to understand both global performance and task-specific strengths.
- - **Interactive Evaluation**: Test chatbots in multi-round conversations tailored for SE tasks like debugging, code generation, and requirement refinement.
- - **Transparent Leaderboard**: View real-time model rankings across diverse SE workflows, updated using advanced evaluation metrics.

  ## Why SE Arena?

- Existing evaluation frameworks often do not address the complex, iterative nature of SE tasks. SE Arena fills this gap by:

- - Supporting long-context, multi-turn evaluations to capture iterative workflows.
- - Allowing anonymous model comparisons to prevent bias.
- - Providing rich, multidimensional metrics for more nuanced model evaluations.

  ## How It Works

- 1. **Submit a Prompt**: Sign in and input your SE-related task (e.g., debugging, code reviews).
- 2. **Compare Responses**: Two anonymous chatbots provide responses to your query.
- 3. **Vote**: Choose the better response, mark as tied, or select "Can't Decide."
- 4. **Iterative Testing**: Continue the conversation with follow-up prompts to test contextual understanding over multiple rounds.

  ## Getting Started

  ### Prerequisites

- - A [Hugging Face](https://huggingface.co) account.
- - Basic understanding of software engineering workflows.

  ### Usage

- 1. Navigate to the [SE Arena platform](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena).
- 2. Sign in with your Hugging Face account.
- 3. Enter your SE task prompt and start evaluating model responses.
- 4. Vote on the better response or continue multi-round interactions to test contextual understanding.

  ## Contributing

  We welcome contributions from the community! Here's how you can help:

- 1. **Submit Prompts**: Share your SE-related tasks to enrich our evaluation dataset.
- 2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository.
- 3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request.

  ## Privacy Policy

- Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our [Terms of Service](#).

  ## Future Plans

- - **Enhanced Metrics**: Add round-wise analysis and context-aware evaluation metrics.
- - **Domain-Specific Sub-Leaderboards**: Rankings focused on tasks like debugging, requirement refinement, etc.
- - **Advanced Context Compression**: Techniques like LongRope and SelfExtend to manage long-term memory.
- - **Support for Multimodal Models**: Evaluate models that integrate text, code, and other modalities.

  ## Contact

- For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineer-Arena/issues/new) in this repository. We welcome your contributions and suggestions!

  short_description: The chatbot arena for software engineering
  ---

+ # SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

+ Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SE Arena benchmarks models in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.

  ## Key Features

+ - **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
+ - **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
+ - **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics (see the sketch after this list), including:
+   - Traditional metrics: Elo score and average win rate
+   - Network-based metrics: Eigenvector centrality, PageRank score
+   - Community detection: Newman modularity score
+   - **Consistency score**: Quantify model determinism and reliability through self-play matches
+ - **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
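
For readers unfamiliar with how pairwise votes turn into leaderboard rankings, the sketch below shows one way metrics like these can be computed. It is a minimal illustration under assumed inputs: the vote-record format, the model names, the Elo constants, and the use of `networkx` are choices made for this example, not SE Arena's actual implementation (the modularity and consistency scores are omitted for brevity).

```python
# Minimal sketch of turning pairwise votes into leaderboard metrics.
# Everything here is illustrative: the vote schema, the K-factor, and the
# example model names are assumptions, not SE Arena's actual implementation.
from collections import defaultdict

import networkx as nx

# Hypothetical vote records: (model_a, model_b, winner), winner in {"a", "b", "tie"}.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-z", "model-x", "a"),
    ("model-x", "model-z", "tie"),
]

def elo_ratings(votes, k=32, base=1000.0):
    """Online Elo updates over the vote stream (a 'traditional' metric)."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def network_metrics(votes):
    """Network-based metrics on a directed win graph (edge: loser -> winner),
    so models that beat strong opponents rank higher."""
    graph = nx.DiGraph()
    for a, b, winner in votes:
        if winner == "a":
            graph.add_edge(b, a)
        elif winner == "b":
            graph.add_edge(a, b)
    return nx.pagerank(graph), nx.eigenvector_centrality(graph, max_iter=1000)

print(elo_ratings(votes))
print(network_metrics(votes))
```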
 
  ## Why SE Arena?

+ Existing evaluation frameworks (like Chatbot Arena, WebDev Arena, and Copilot Arena) often don't address the complex, iterative nature of SE tasks. SE Arena fills critical gaps by:

+ - Supporting context-rich, multi-turn evaluations to capture iterative workflows
+ - Integrating repository-level context through RepoChat to simulate real-world development scenarios
+ - Providing multidimensional metrics for nuanced model comparisons
+ - Focusing on the full breadth of SE tasks beyond just code generation

  ## How It Works

+ 1. **Submit a Prompt**: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context, as sketched below)
+ 2. **Compare Responses**: Two anonymous models provide responses to your query
+ 3. **Continue the Conversation**: Test contextual understanding over multiple rounds
+ 4. **Vote**: Choose the better model at any point, with the ability to re-assess after multiple turns
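
To make the RepoChat idea above concrete, here is a rough sketch of how repository context could be gathered and prepended to a user's task. Only the public GitHub REST API endpoints are real; the helper names, digest format, and prompt layout are hypothetical and are not SE Arena's actual RepoChat code.

```python
# Rough sketch of repository-context injection in the spirit of RepoChat.
# Helper names and prompt layout are hypothetical; only the public GitHub
# REST API endpoints used below are real.
import requests

def fetch_repo_digest(owner: str, repo: str, n: int = 5) -> str:
    """Summarize recent issues and commits of a public GitHub repository."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    issues = requests.get(f"{base}/issues", params={"per_page": n}, timeout=10).json()
    commits = requests.get(f"{base}/commits", params={"per_page": n}, timeout=10).json()
    lines = [f"Repository: {owner}/{repo}", "Recent issues:"]
    lines += [f"- #{i['number']} {i['title']}" for i in issues if "number" in i]
    lines.append("Recent commits:")
    lines += [
        f"- {c['sha'][:7]} {(c['commit']['message'].splitlines() or [''])[0]}"
        for c in commits if "sha" in c
    ]
    return "\n".join(lines)

def build_prompt(task: str, repo_url: str | None = None) -> str:
    """Prepend a repository digest to the user's SE task when a URL is given."""
    if repo_url:
        owner, repo = repo_url.rstrip("/").split("/")[-2:]
        return fetch_repo_digest(owner, repo) + "\n\nTask:\n" + task
    return task

print(build_prompt(
    "Why does the CI pipeline fail intermittently?",
    "https://github.com/SE-Arena/Software-Engineering-Arena",
))
```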
 
  ## Getting Started

  ### Prerequisites

+ - A [Hugging Face](https://huggingface.co) account
+ - Basic understanding of software engineering workflows

  ### Usage

+ 1. Navigate to the [SE Arena platform](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena)
+ 2. Sign in with your Hugging Face account
+ 3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
+ 4. Engage in multi-round interactions and vote on model performance

  ## Contributing

  We welcome contributions from the community! Here's how you can help:

+ 1. **Submit SE Tasks**: Share your real-world SE problems to enrich our evaluation dataset
+ 2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository
+ 3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request

  ## Privacy Policy

+ Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our Terms of Service.

  ## Future Plans

+ - **Analysis of Real-World SE Workloads**: Identify common patterns and challenges in user-submitted tasks
+ - **Multi-Round Evaluation Metrics**: Develop specialized metrics for assessing model adaptation over successive turns
+ - **Enhanced Community Engagement**: Enable broader participation through voting and contributions
+ - **Expanded FM Coverage**: Include domain-specific and multimodal foundation models
+ - **Advanced Context Compression**: Integrate techniques like LongRope and SelfExtend to manage long-term memory

  ## Contact

+ For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineering-Arena/issues/new) in this repository. We welcome your contributions and suggestions!