ServiceNow/browsergym-leaderboard

ligang-orby committed on Feb 25

Commit 3011461 (verified) · 1 parent: a35cef0

orby-agent-v0 (#8)

Files changed (6)

results/OrbyAgent-ActIO-72b/README.md ADDED

+### OrbyAgent-ActIO-72b
+This agent is developed by [Orby AI](https://www.orby.ai/).
+The agent does not use any benchmark-specific information in the prompts. For the WebArena benchmark, we use the original evaluator and task definitions for fair comparison.
+It uses the 72B-parameter ActIO model as a backend, with both screenshot and HTML as inputs. More details can be found in our [research blog](https://www.orby.ai/resources/elevating-automation-orby-ais-generic-agent-framework-and-self-adaptive-interface-learning-technique).

results/OrbyAgent-ActIO-72b/miniwob.json ADDED

+[
+    {
+        "agent_name": "OrbyAgent-ActIO-72b",
+        "study_id": "orby-agent-v0-actio-v0-miniwob",
+        "benchmark": "MiniWoB",
+        "score": 64.2,
+        "std_err": 1.4,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "NA",
+        "original_or_reproduced": "Original",
+        "date_time": "2025-02-21 15:03:35"
+    }
+]

results/OrbyAgent-ActIO-72b/webarena.json ADDED

+[
+    {
+        "agent_name": "OrbyAgent-ActIO-72b",
+        "study_id": "b5fc5be7-54cc-4fc1-a9ee-73447b9c3eae",
+        "benchmark": "WebArena",
+        "score": 34.7,
+        "std_err": 0.25,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Use original WebArena eval protocol and task definitions",
+        "original_or_reproduced": "Original",
+        "date_time": "2025-02-21 15:05:12"
+    }
+]

results/OrbyAgent-Claude-3.5-Sonnet/README.md ADDED

+### OrbyAgent-Claude-3.5-Sonnet
+This agent is developed by [Orby AI](https://www.orby.ai/).
+The agent does not use any benchmark-specific information in the prompts. For the WebArena benchmark, we use the original evaluator and task definitions for fair comparison.
+It uses Claude-3.5-sonnet-20241022 as a backend, with both screenshot and HTML as inputs. More details can be found in our [research blog](https://www.orby.ai/resources/elevating-automation-orby-ais-generic-agent-framework-and-self-adaptive-interface-learning-technique).

results/OrbyAgent-Claude-3.5-Sonnet/miniwob.json ADDED

+[
+    {
+        "agent_name": "OrbyAgent-Claude-3.5-Sonnet",
+        "study_id": "orby-agent-v0-claude-3.5-miniwob",
+        "benchmark": "MiniWoB",
+        "score": 74.9,
+        "std_err": 1.2,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "NA",
+        "original_or_reproduced": "Original",
+        "date_time": "2025-02-21 14:54:16"
+    }
+]

results/OrbyAgent-Claude-3.5-Sonnet/webarena.json ADDED

+[
+    {
+        "agent_name": "OrbyAgent-Claude-3.5-Sonnet",
+        "study_id": "orby-agent-v0-claude-3.5-webarena",
+        "benchmark": "WebArena",
+        "score": 36.5,
+        "std_err": 0,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Use original WebArena eval protocol and task definitions",
+        "original_or_reproduced": "Original",
+        "date_time": "2025-02-21 15:00:22"
+    }
+]
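
The four result files added in this commit share the same twelve fields, so a submission could be sanity-checked before opening a pull request. The following is a minimal sketch, not the leaderboard's own validation: the field list, the Yes/No convention, and the timestamp format are inferred from the files above, and `EXPECTED_FIELDS`, `check_entry`, and `check_file_text` are hypothetical names introduced for illustration.

```python
import json
from datetime import datetime

# Field names and value types as they appear in the result files of this
# commit. ASSUMPTION: inferred from the diff, not an official schema.
EXPECTED_FIELDS = {
    "agent_name": str,
    "study_id": str,
    "benchmark": str,
    "score": (int, float),
    "std_err": (int, float),
    "benchmark_specific": str,
    "benchmark_tuned": str,
    "followed_evaluation_protocol": str,
    "reproducible": str,
    "comments": str,
    "original_or_reproduced": str,
    "date_time": str,
}

# ASSUMPTION: these fields only take "Yes" or "No", as in the files above.
YES_NO_FIELDS = (
    "benchmark_specific",
    "benchmark_tuned",
    "followed_evaluation_protocol",
    "reproducible",
)


def check_entry(entry):
    """Return a list of problems found in one result entry (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif isinstance(entry[name], bool) or not isinstance(entry[name], expected_type):
            # bool is rejected explicitly because it is a subclass of int.
            problems.append(f"wrong type for field: {name}")
    for name in YES_NO_FIELDS:
        value = entry.get(name)
        if isinstance(value, str) and value not in ("Yes", "No"):
            problems.append(f"expected Yes or No in field: {name}")
    date_time = entry.get("date_time")
    if isinstance(date_time, str):
        try:
            datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append("date_time is not in YYYY-MM-DD HH:MM:SS format")
    return problems


def check_file_text(text):
    """Check the text of one result file: a JSON list of entry objects."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top-level JSON value is not a list"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: not a JSON object")
            continue
        problems.extend(f"entry {index}: {p}" for p in check_entry(entry))
    return problems


# Content of results/OrbyAgent-ActIO-72b/miniwob.json from this commit.
SAMPLE = """
[
    {
        "agent_name": "OrbyAgent-ActIO-72b",
        "study_id": "orby-agent-v0-actio-v0-miniwob",
        "benchmark": "MiniWoB",
        "score": 64.2,
        "std_err": 1.4,
        "benchmark_specific": "No",
        "benchmark_tuned": "No",
        "followed_evaluation_protocol": "Yes",
        "reproducible": "Yes",
        "comments": "NA",
        "original_or_reproduced": "Original",
        "date_time": "2025-02-21 15:03:35"
    }
]
"""

print(check_file_text(SAMPLE))  # prints []
```

Returning a list of problems rather than raising on the first one lets a submitter fix every issue in a file in one pass.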