brunneis committed on
Commit
292fab8
1 Parent(s): e9d6a57

Update about page

Files changed (2):
  1. app.py +8 -8
  2. src/about.py +8 -10
app.py CHANGED
@@ -150,14 +150,14 @@ with demo:
 
         with gr.TabItem("🧠 About", elem_id="llm-benchmark-tab-table", id=2):
            gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-            with gr.Accordion(
-                "Evaluation script",
-                open=False,
-            ):
-                gr.Markdown(
-                    EVALUATION_SCRIPT,
-                    elem_classes="markdown-text",
-                )
+            # with gr.Accordion(
+            #     "Evaluation script",
+            #     open=False,
+            # ):
+            #     gr.Markdown(
+            #         EVALUATION_SCRIPT,
+            #         elem_classes="markdown-text",
+            #     )
 
         with gr.TabItem("🧪 Submissions", elem_id="llm-benchmark-tab-table", id=3):
            with gr.Column():
src/about.py CHANGED
@@ -37,24 +37,22 @@ INTRODUCTION_TEXT = ""
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = """
-# SolidityBench: Evaluating LLM Solidity Code Generation
+# Evaluating LLM Solidity Code Generation
 
 SolidityBench is the first leaderboard for evaluating and ranking the ability of LLMs in Solidity code generation. Developed by BrainDAO as part of [IQ Code](https://iqcode.ai/), which aims to create a suite of AI models designed for generating and auditing smart contract code.
 
 We introduce two benchmarks specifically designed for Solidity: NaïveJudge and HumanEval for Solidity.
 
-## Benchmarks
-
-### 1. NaïveJudge
+## NaïveJudge
 
 NaïveJudge is a novel approach to smart contract evaluation, integrating a dataset of audited smart contracts from [OpenZeppelin](https://huggingface.co/datasets/braindao/soliditybench-naive-judge-openzeppelin-v1).
 
-#### Evaluation Process:
+### Evaluation Process:
 - LLMs implement smart contracts based on detailed specifications.
 - Generated code is compared to audited reference implementations.
 - Evaluation is performed by SOTA LLMs (OpenAI GPT-4 and Claude 3.5 Sonnet) acting as impartial code reviewers.
 
-#### Evaluation Criteria:
+### Evaluation Criteria:
 1. Functional Completeness (0-60 points)
    - Implementation of key functionality
    - Handling of edge cases
@@ -74,19 +72,19 @@ NaïveJudge is a novel approach to smart contract evaluation, integrating a data
 
 The final score ranges from 0 to 100, calculated by summing the points from each criterion.
 
-### 2. HumanEval for Solidity
+## HumanEval for Solidity
 
 [HumanEval for Solidity](https://huggingface.co/datasets/braindao/humaneval-for-solidity-25) is an adaptation of OpenAI's original HumanEval benchmark, ported from Python to Solidity.
 
-#### Dataset:
+### Dataset:
 - 25 tasks of varying difficulty
 - Each task includes corresponding tests designed for use with Hardhat
 
-#### Evaluation Process:
+### Evaluation Process:
 - Custom server built on top of Hardhat compiles and tests the generated Solidity code
 - Evaluates the AI model's ability to produce fully functional smart contracts
 
-#### Metrics:
+### Metrics:
 1. pass@1 (Score: 0-100)
    - Measures the model's success on the first attempt
    - Assesses precision and efficiency
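For orientation beyond the diff: the new About text states that the NaïveJudge total is the sum of the per-criterion points, with Functional Completeness worth 0-60 of the 100 available. A minimal sketch of that arithmetic, assuming the criteria truncated out of this hunk account for the remaining 40 points; the function name and example values are illustrative, not taken from the commit.

# Hypothetical aggregation of a NaïveJudge result. Only "Functional Completeness"
# (0-60 points) is visible in this hunk, so the other criteria are modeled
# generically as a list of point values that together may award up to 40 points.
def naive_judge_total(functional_completeness: int, other_criteria: list[int]) -> int:
    assert 0 <= functional_completeness <= 60
    total = functional_completeness + sum(other_criteria)
    assert 0 <= total <= 100, "per-criterion points must sum to at most 100"
    return total

# Example: a submission the LLM judges scored 55 + 20 + 15 across the criteria.
print(naive_judge_total(55, [20, 15]))  # -> 90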
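Likewise, the HumanEval for Solidity metric described above is pass@1 on a 0-100 scale over the 25 Hardhat-tested tasks. A hedged sketch of that metric as described, i.e. the share of tasks whose generated contract passes its Hardhat tests on the first attempt; the names below are illustrative rather than taken from the evaluation server.

# Hypothetical pass@1 computation: fraction of the 25 tasks solved on the first
# attempt, scaled to the 0-100 range used in the leaderboard text.
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    assert len(first_attempt_passed) == 25, "HumanEval for Solidity has 25 tasks"
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)

print(pass_at_1([True] * 20 + [False] * 5))  # -> 80.0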