brunneis committed on
Commit
292fab8
1 Parent(s): e9d6a57

Update about page

Files changed (2):
  1. app.py +8 -8
  2. src/about.py +8 -10
app.py CHANGED
@@ -150,14 +150,14 @@ with demo:
 
         with gr.TabItem("🧠 About", elem_id="llm-benchmark-tab-table", id=2):
            gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-            with gr.Accordion(
-                "Evaluation script",
-                open=False,
-            ):
-                gr.Markdown(
-                    EVALUATION_SCRIPT,
-                    elem_classes="markdown-text",
-                )
+            # with gr.Accordion(
+            #     "Evaluation script",
+            #     open=False,
+            # ):
+            #     gr.Markdown(
+            #         EVALUATION_SCRIPT,
+            #         elem_classes="markdown-text",
+            #     )
 
         with gr.TabItem("🧪 Submissions", elem_id="llm-benchmark-tab-table", id=3):
            with gr.Column():
src/about.py CHANGED
@@ -37,24 +37,22 @@ INTRODUCTION_TEXT = ""
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = """
-# SolidityBench: Evaluating LLM Solidity Code Generation
+# Evaluating LLM Solidity Code Generation
 
 SolidityBench is the first leaderboard for evaluating and ranking the ability of LLMs in Solidity code generation. Developed by BrainDAO as part of [IQ Code](https://iqcode.ai/), which aims to create a suite of AI models designed for generating and auditing smart contract code.
 
 We introduce two benchmarks specifically designed for Solidity: NaïveJudge and HumanEval for Solidity.
 
-## Benchmarks
-
-### 1. NaïveJudge
+## NaïveJudge
 
 NaïveJudge is a novel approach to smart contract evaluation, integrating a dataset of audited smart contracts from [OpenZeppelin](https://huggingface.co/datasets/braindao/soliditybench-naive-judge-openzeppelin-v1).
 
-#### Evaluation Process:
+### Evaluation Process:
 - LLMs implement smart contracts based on detailed specifications.
 - Generated code is compared to audited reference implementations.
 - Evaluation is performed by SOTA LLMs (OpenAI GPT-4 and Claude 3.5 Sonnet) acting as impartial code reviewers.
 
-#### Evaluation Criteria:
+### Evaluation Criteria:
 1. Functional Completeness (0-60 points)
    - Implementation of key functionality
    - Handling of edge cases
@@ -74,19 +72,19 @@ NaïveJudge is a novel approach to smart contract evaluation, integrating a data
 
 The final score ranges from 0 to 100, calculated by summing the points from each criterion.
 
-### 2. HumanEval for Solidity
+## HumanEval for Solidity
 
 [HumanEval for Solidity](https://huggingface.co/datasets/braindao/humaneval-for-solidity-25) is an adaptation of OpenAI's original HumanEval benchmark, ported from Python to Solidity.
 
-#### Dataset:
+### Dataset:
 - 25 tasks of varying difficulty
 - Each task includes corresponding tests designed for use with Hardhat
 
-#### Evaluation Process:
+### Evaluation Process:
 - Custom server built on top of Hardhat compiles and tests the generated Solidity code
 - Evaluates the AI model's ability to produce fully functional smart contracts
 
-#### Metrics:
+### Metrics:
 1. pass@1 (Score: 0-100)
    - Measures the model's success on the first attempt
    - Assesses precision and efficiency
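For orientation beyond the diff: the new About text states that the NaïveJudge total is the sum of the per-criterion points, with Functional Completeness worth 0-60 of the 100 available. A minimal sketch of that arithmetic, assuming the criteria truncated out of this hunk account for the remaining 40 points; the function name and example values are illustrative, not taken from the commit.

# Hypothetical aggregation of a NaïveJudge result. Only "Functional Completeness"
# (0-60 points) is visible in this hunk, so the other criteria are modeled
# generically as a list of point values that together may award up to 40 points.
def naive_judge_total(functional_completeness: int, other_criteria: list[int]) -> int:
    assert 0 <= functional_completeness <= 60
    total = functional_completeness + sum(other_criteria)
    assert 0 <= total <= 100, "per-criterion points must sum to at most 100"
    return total

# Example: a submission the LLM judges scored 55 + 20 + 15 across the criteria.
print(naive_judge_total(55, [20, 15]))  # -> 90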
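Likewise, the HumanEval for Solidity metric described above is pass@1 on a 0-100 scale over the 25 Hardhat-tested tasks. A hedged sketch of that metric as described, i.e. the share of tasks whose generated contract passes its Hardhat tests on the first attempt; the names below are illustrative rather than taken from the evaluation server.

# Hypothetical pass@1 computation: fraction of the 25 tasks solved on the first
# attempt, scaled to the 0-100 range used in the leaderboard text.
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    assert len(first_attempt_passed) == 25, "HumanEval for Solidity has 25 tasks"
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)

print(pass_at_1([True] * 20 + [False] * 5))  # -> 80.0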