Commit 2be1969 by chenwuml (parent 1c7860f): initial commit, README.md (+2 −4)
All open-weight models were tested using [vLLM](https://github.com/vllm-project/vllm).

### Result analysis

We provide the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breaks down as follows:

- Success: 42 cases
- Failure (code runs but incorrect): 37 cases
- Fail to compile: 8 cases
- No code: 220 cases

Analysis of the response-length distribution shows that successful solutions are typically concise, around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. Although some correct solutions exceed 20,000 tokens, the vast majority of long responses fall into the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed to develop training objectives that better distinguish useful deliberation from unproductive verbosity.
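As a minimal sketch of how such a breakdown could be reproduced from the released CSV, the snippet below tallies the `status` field and averages `response_length` per status. Only the column names come from the file description above; the `summarize` helper and the sample rows are hypothetical stand-ins, not values from the released file.

```python
import csv
import io
from collections import Counter
from statistics import mean

def summarize(rows):
    """Tally `status` and compute the mean `response_length` per status."""
    statuses = Counter(r["status"] for r in rows)
    lengths = {
        s: mean(int(r["response_length"]) for r in rows if r["status"] == s)
        for s in statuses
    }
    return statuses, lengths

# Made-up miniature of the released CSV; the real data lives in
# codefu-7b-v0.1_usaco.csv.tgz and uses the same column names.
sample = io.StringIO(
    "problem_name,status,response_length\n"
    "problem_a,Success,4800\n"
    "problem_b,No code,32768\n"
    "problem_c,Failure (code runs but incorrect),12000\n"
)
rows = list(csv.DictReader(sample))
statuses, lengths = summarize(rows)
```

On the real file, `statuses` should match the breakdown above, and comparing `lengths` across buckets makes the long-output pattern in the "No code" category easy to spot.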

## Usage