update README (README.md)
\* conducted on the text-only HLE subset.

Our models are evaluated with `temperature=1.0`, `top_p=0.95`.
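As a minimal sketch, the sampling settings above could be passed through an OpenAI-compatible chat-completions request; the model name here is a placeholder, not a value from this README:

```python
# Sketch: an OpenAI-compatible chat-completions payload using the
# evaluation settings stated above. "our-model" is a placeholder.

def build_request(prompt: str) -> dict:
    """Assemble a request payload with the README's sampling settings."""
    return {
        "model": "our-model",  # placeholder model identifier
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,    # evaluation setting from this README
        "top_p": 0.95,         # evaluation setting from this README
    }

payload = build_request("Solve the task.")
```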
### SWE-bench methodology

We report results derived from the Agentless scaffold. Departing from the original pipeline, our methodology employs a two-stage localization process (without any embedding-based retrieval mechanisms): coarse-grained file localization followed by fine-grained localization to specific files and code elements. The values for our models are calculated on the subset of n=486 verified tasks that work on our infrastructure. The 14 excluded test cases that were incompatible with our internal infrastructure are:
- `"astropy__astropy-7606"`
- `"astropy__astropy-8707"`
- `"astropy__astropy-8872"`
- `"django__django-10097"`
- `"matplotlib__matplotlib-20488"`
- `"psf__requests-2317"`
- `"psf__requests-2931"`
- `"psf__requests-5414"`
- `"pylint-dev__pylint-6528"`
- `"pylint-dev__pylint-7277"`
- `"sphinx-doc__sphinx-10435"`
- `"sphinx-doc__sphinx-7985"`
- `"sphinx-doc__sphinx-8269"`
- `"sphinx-doc__sphinx-8475"`
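The subset selection above amounts to filtering the 14 listed instance IDs out of SWE-bench Verified's 500 tasks, leaving n=486. A minimal sketch (loading of the actual task list is stubbed out):

```python
# Sketch: selecting the n=486 evaluated subset by dropping the 14
# instance IDs listed in this README as infrastructure-incompatible.

EXCLUDED = {
    "astropy__astropy-7606", "astropy__astropy-8707", "astropy__astropy-8872",
    "django__django-10097", "matplotlib__matplotlib-20488",
    "psf__requests-2317", "psf__requests-2931", "psf__requests-5414",
    "pylint-dev__pylint-6528", "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10435", "sphinx-doc__sphinx-7985",
    "sphinx-doc__sphinx-8269", "sphinx-doc__sphinx-8475",
}

def evaluated_subset(instance_ids):
    """Keep only tasks that run on the evaluation infrastructure."""
    return [iid for iid in instance_ids if iid not in EXCLUDED]

# SWE-bench Verified has 500 instances; removing all 14 leaves 486.
```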
### TAU-bench methodology

We evaluate TAU-bench with GPT-4.1 as the user model and without any custom tools. The maximum number of interaction steps is 40.
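The step cap above can be sketched as a simple turn loop; `agent_step` and `user_step` are stand-ins for the evaluated agent and the GPT-4.1 user simulator, neither of which is specified in this README:

```python
# Sketch: a TAU-bench-style interaction loop capped at 40 steps.
# The agent and user-simulator callables are hypothetical stand-ins.

MAX_STEPS = 40  # interaction-step limit stated above

def run_episode(agent_step, user_step, first_message):
    """Alternate agent and simulated-user turns until the user ends
    the episode or the step budget runs out; return steps used."""
    message = first_message
    for step in range(1, MAX_STEPS + 1):
        reply = agent_step(message)       # evaluated agent's turn
        message, done = user_step(reply)  # user simulator's turn
        if done:
            return step
    return MAX_STEPS

# Toy stub: the simulated user ends the conversation on its third turn.
state = {"turns": 0}
def stub_user(reply):
    state["turns"] += 1
    return "ok", state["turns"] >= 3

print(run_episode(lambda m: "agent reply", stub_user, "book a flight"))  # 3
```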