brunneis committed
Commit e9d6a57
1 Parent(s): 36eddfe

Update about page

Files changed (2)
  1. README.md +1 -1
  2. src/about.py +56 -4
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: IQ Code | Solidity Leaderboard
  emoji: 🧠 🏆
  colorFrom: pink
  colorTo: purple
 
  ---
+ title: SolidityBench Leaderboard
  emoji: 🧠 🏆
  colorFrom: pink
  colorTo: purple
src/about.py CHANGED
@@ -30,18 +30,70 @@ class Tasks(Enum):
 
  # Your leaderboard name
  TITLE = """<br><img src="file/images/soliditybench.svg" width="500" style="display: block; margin-left: auto; margin-right: auto;">
- <h3 align="center" id="space-title">Solidity Leaderboard | Powered by IQ</h3>"""
 
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = ""
 
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = """
- ## How it works
 
- ## Reproducibility
- To reproduce our results, here is the commands you can run:
 
  """
 
  EVALUATION_REQUESTS_TEXT = """
 
 
  # Your leaderboard name
  TITLE = """<br><img src="file/images/soliditybench.svg" width="500" style="display: block; margin-left: auto; margin-right: auto;">
+ <h3 align="center" id="space-title">Solidity Leaderboard by IQ</h3>"""
 
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = ""
 
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = """
+ # SolidityBench: Evaluating LLM Solidity Code Generation
 
+ SolidityBench is the first leaderboard for evaluating and ranking the ability of LLMs to generate Solidity code. It was developed by BrainDAO as part of [IQ Code](https://iqcode.ai/), which aims to create a suite of AI models for generating and auditing smart contract code.
 
+ We introduce two benchmarks designed specifically for Solidity: NaïveJudge and HumanEval for Solidity.
+
+ ## Benchmarks
+
+ ### 1. NaïveJudge
+
+ NaïveJudge is a novel approach to smart contract evaluation built around a dataset of audited smart contracts from [OpenZeppelin](https://huggingface.co/datasets/braindao/soliditybench-naive-judge-openzeppelin-v1).
+
+ #### Evaluation Process:
53
+ - LLMs implement smart contracts based on detailed specifications.
54
+ - Generated code is compared to audited reference implementations.
55
+ - Evaluation is performed by SOTA LLMs (OpenAI GPT-4 and Claude 3.5 Sonnet) acting as impartial code reviewers.
56
+
57
+ #### Evaluation Criteria:
58
+ 1. Functional Completeness (0-60 points)
59
+ - Implementation of key functionality
60
+ - Handling of edge cases
61
+ - Appropriate error management
62
+
63
+ 2. Solidity Best Practices and Security (0-30 points)
64
+ - Correct and up-to-date Solidity syntax
65
+ - Adherence to best practices and design patterns
66
+ - Appropriate use of data types and visibility modifiers
67
+ - Code structure and maintainability
68
+
69
+ 3. Optimization and Efficiency (0-10 points)
70
+ - Gas efficiency
71
+ - Avoidance of unnecessary computations
72
+ - Storage efficiency
73
+ - Overall performance compared to expert implementation
74
+
75
+ The final score ranges from 0 to 100, calculated by summing the points from each criterion.
76
+
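The commit does not include the judge prompt or its parsing code, so the snippet below is only a sketch of the arithmetic described above: each criterion score returned by the judge is clamped to its range, and the three are summed into the 0-100 total. All names are illustrative.

```python
# Sketch of the scoring arithmetic described above (names are illustrative):
# the judge model is assumed to return one integer per criterion, which is
# clamped to its maximum and summed into the final 0-100 score.
CRITERIA_MAX = {
    "functional_completeness": 60,  # 0-60 points
    "best_practices_security": 30,  # 0-30 points
    "optimization_efficiency": 10,  # 0-10 points
}

def aggregate_score(criterion_scores: dict) -> int:
    """Clamp each criterion score to [0, max] and sum into the 0-100 total."""
    return sum(
        max(0, min(int(criterion_scores.get(name, 0)), max_points))
        for name, max_points in CRITERIA_MAX.items()
    )

# Example: a contract judged 52/60, 24/30 and 7/10 scores 83 overall.
print(aggregate_score({
    "functional_completeness": 52,
    "best_practices_security": 24,
    "optimization_efficiency": 7,
}))  # -> 83
```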
+ ### 2. HumanEval for Solidity
+
+ [HumanEval for Solidity](https://huggingface.co/datasets/braindao/humaneval-for-solidity-25) is an adaptation of OpenAI's original HumanEval benchmark, ported from Python to Solidity.
+
+ #### Dataset:
+ - 25 tasks of varying difficulty
+ - Each task includes corresponding tests designed for use with Hardhat
+
+ #### Evaluation Process:
+ - A custom server built on top of Hardhat compiles and tests the generated Solidity code (see the sketch after this list)
+ - Evaluates the model's ability to produce fully functional smart contracts
+
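The evaluation server itself is not part of this commit, so the following is only a plausible local stand-in for the step described in the list above: write the generated contract into an existing Hardhat project, then shell out to `npx hardhat compile` and `npx hardhat test`. The project layout and file names are assumptions.

```python
# Rough local stand-in for the custom Hardhat-based server described above.
# Assumes an existing Hardhat project with the task's tests already in test/;
# the directory layout and naming convention are assumptions, not the real server.
import subprocess
from pathlib import Path

def evaluate_candidate(project_dir: str, task_name: str, solidity_source: str) -> bool:
    """Return True if the generated contract compiles and its Hardhat tests pass."""
    project = Path(project_dir)
    (project / "contracts" / f"{task_name}.sol").write_text(solidity_source)

    # Compilation failure means the candidate cannot score at all.
    if subprocess.run(["npx", "hardhat", "compile"], cwd=project).returncode != 0:
        return False

    # Run only the tests that belong to this task.
    test_file = project / "test" / f"{task_name}.test.js"
    return subprocess.run(["npx", "hardhat", "test", str(test_file)], cwd=project).returncode == 0
```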
+ #### Metrics:
+ 1. pass@1 (Score: 0-100)
+    - Measures the model's success on the first attempt
+    - Assesses precision and efficiency
+
+ 2. pass@3 (Score: 0-100)
+    - Allows up to three attempts at solving each task
+    - Provides insight into the model's problem-solving capabilities over multiple tries (see the estimator sketch below)
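The commit does not spell out how pass@1 and pass@3 are computed from the attempts. For reference, the snippet below is the conventional unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), scaled to 0-100; whether SolidityBench uses this exact form or a simpler "any of k attempts passed" rule is left open here.

```python
# Conventional unbiased pass@k estimator from the original HumanEval paper,
# scaled to 0-100. Whether SolidityBench uses this exact form is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k for one task, given n sampled solutions of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: three samples per task -> pass@1 and pass@3 on a 0-100 scale.
results = [(3, 1), (3, 0), (3, 3)]  # (samples, passing samples) per task
pass1 = 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
pass3 = 100 * sum(pass_at_k(n, c, 3) for n, c in results) / len(results)
print(round(pass1, 1), round(pass3, 1))  # -> 44.4 66.7
```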
  """
 
  EVALUATION_REQUESTS_TEXT = """