Commit 2be1969 by chenwuml (parent 1c7860f): initial commit, README.md (+2 −4)
All open-weight models were tested using [vLLM](https://github.com/vllm-project/vllm).

### Result analysis

We provide the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breaks down as follows:

- Success: 42 cases
- Failure (code runs but incorrect): 37 cases
- Fail to compile: 8 cases
- No code: 220 cases

Analysis of the response-length distribution shows that successful solutions are typically concise, around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. Although some correct solutions exceed 20,000 tokens, the vast majority of long responses fall into the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed to develop training objectives that better distinguish useful deliberation from unproductive verbosity.
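As a minimal sketch of how such a breakdown could be reproduced from the released CSV, the snippet below tallies the `status` field and averages `response_length` per status. Only the column names come from the file description above; the `summarize` helper and the sample rows are hypothetical stand-ins, not values from the released file.

```python
import csv
import io
from collections import Counter
from statistics import mean

def summarize(rows):
    """Tally `status` and compute the mean `response_length` per status."""
    statuses = Counter(r["status"] for r in rows)
    lengths = {
        s: mean(int(r["response_length"]) for r in rows if r["status"] == s)
        for s in statuses
    }
    return statuses, lengths

# Made-up miniature of the released CSV; the real data lives in
# codefu-7b-v0.1_usaco.csv.tgz and uses the same column names.
sample = io.StringIO(
    "problem_name,status,response_length\n"
    "problem_a,Success,4800\n"
    "problem_b,No code,32768\n"
    "problem_c,Failure (code runs but incorrect),12000\n"
)
rows = list(csv.DictReader(sample))
statuses, lengths = summarize(rows)
```

On the real file, `statuses` should match the breakdown above, and comparing `lengths` across buckets makes the long-output pattern in the "No code" category easy to spot.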

## Usage