sriting committed
Commit 657bd6c · Parent: d72937b

update README

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -121,24 +121,24 @@ foundation for next-generation language model agents to reason and tackle real-w
 
 \* conducted on the text-only HLE subset.
 
-Our models are evaluated with temperature=1.0, top_p=0.95.
+Our models are evaluated with `temperature=1.0`, `top_p=0.95`.
 
 ### SWE-bench methodology
 We report results derived from the Agentless scaffold. Departing from the original pipeline, our methodology employs a two-stage localization process (without any embedding-based retrieval mechanisms): initial coarse-grained file localization followed by fine-grained localization to specific files and code elements. The values for our models are calculated on the subset of n=486 verified tasks which work on our infrastructure. The excluded 14 test cases that were incompatible with our internal infrastructure are:
-"astropy__astropy-7606",
-"astropy__astropy-8707",
-"astropy__astropy-8872",
-"django__django-10097",
-"matplotlib__matplotlib-20488",
-"psf__requests-2317",
-"psf__requests-2931",
-"psf__requests-5414",
-"pylint-dev__pylint-6528",
-"pylint-dev__pylint-7277",
-"sphinx-doc__sphinx-10435",
-"sphinx-doc__sphinx-7985",
-"sphinx-doc__sphinx-8269",
-"sphinx-doc__sphinx-8475"
+`"astropy__astropy-7606"`,
+`"astropy__astropy-8707"`,
+`"astropy__astropy-8872"`,
+`"django__django-10097"`,
+`"matplotlib__matplotlib-20488"`,
+`"psf__requests-2317"`,
+`"psf__requests-2931"`,
+`"psf__requests-5414"`,
+`"pylint-dev__pylint-6528"`,
+`"pylint-dev__pylint-7277"`,
+`"sphinx-doc__sphinx-10435"`,
+`"sphinx-doc__sphinx-7985"`,
+`"sphinx-doc__sphinx-8269"`,
+`"sphinx-doc__sphinx-8475"`
 
 ### TAU-bench methodology
 We evaluate TAU-Bench with GPT-4.1 as user model and without any custom tools. The maximum number of interaction steps is 40.
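
For concreteness, a minimal sketch of applying the sampling settings above (`temperature=1.0`, `top_p=0.95`) through an OpenAI-compatible endpoint; the base URL and model name here are placeholders, not values taken from this repository:

```python
# Minimal sketch: querying the model with the evaluation sampling settings.
# The endpoint URL and model name are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,   # evaluation setting from the README
    top_p=0.95,        # evaluation setting from the README
)
print(response.choices[0].message.content)
```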
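The two-stage localization described in the SWE-bench section might look schematically like the sketch below; the `llm` callable, the prompts, and the helper structure are invented for illustration and are not the authors' pipeline:

```python
# Illustrative schematic only -- not the authors' Agentless-derived code.
# `llm` is a hypothetical callable that sends a prompt to the model and
# returns a parsed answer (a list of file paths in stage 1, text in stage 2).
def localize(repo_tree: str, issue: str, llm) -> dict:
    # Stage 1: coarse-grained file localization from the repository layout
    # alone (no embedding-based retrieval).
    candidate_files = llm(
        f"Issue:\n{issue}\n\nRepository tree:\n{repo_tree}\n\n"
        "List the files most likely to need edits."
    )
    # Stage 2: fine-grained localization to specific code elements
    # (classes, functions, lines) inside the shortlisted files.
    elements = {}
    for path in candidate_files:
        with open(path) as f:
            source = f.read()
        elements[path] = llm(
            f"Issue:\n{issue}\n\nFile {path}:\n{source}\n\n"
            "Name the code elements that must change."
        )
    return elements
```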
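The n=486 subset itself is reproducible from the exclusion list: SWE-bench Verified contains 500 tasks, and dropping the 14 IDs above leaves 486. A sketch, assuming the public `princeton-nlp/SWE-bench_Verified` dataset on Hugging Face:

```python
# Sketch: reproduce the n=486 evaluation subset by removing the 14 excluded
# instance IDs from SWE-bench Verified (500 tasks).
from datasets import load_dataset

EXCLUDED = {
    "astropy__astropy-7606", "astropy__astropy-8707", "astropy__astropy-8872",
    "django__django-10097", "matplotlib__matplotlib-20488",
    "psf__requests-2317", "psf__requests-2931", "psf__requests-5414",
    "pylint-dev__pylint-6528", "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10435", "sphinx-doc__sphinx-7985",
    "sphinx-doc__sphinx-8269", "sphinx-doc__sphinx-8475",
}

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
subset = verified.filter(lambda row: row["instance_id"] not in EXCLUDED)
assert len(subset) == 486  # 500 verified tasks minus the 14 excluded
```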
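And purely as an illustration of the TAU-bench setup's 40-step cap, a schematic interaction loop; `agent_respond` and `simulated_user_respond` are hypothetical stand-ins for the agent under test and the GPT-4.1-backed user simulator, not tau-bench's actual API:

```python
# Schematic illustration (not the tau-bench harness itself): an agent/user
# interaction loop terminated after at most 40 steps.
MAX_STEPS = 40

def run_episode(agent_respond, simulated_user_respond, first_user_msg: str):
    history = [{"role": "user", "content": first_user_msg}]
    for _ in range(MAX_STEPS):
        agent_msg = agent_respond(history)          # agent under evaluation
        history.append({"role": "assistant", "content": agent_msg})
        user_msg = simulated_user_respond(history)  # GPT-4.1 user simulator
        if user_msg is None:                        # user ends the conversation
            break
        history.append({"role": "user", "content": user_msg})
    return history
```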