---
license: apache-2.0
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
datasets:
- stanfordnlp/nnetnav-live
---

# Model Card for Llama8b-NNetNav-Live

Llama8b-NNetNav-Live is a [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model that is instruction-tuned on [NNetNav-Live](https://huggingface.co/datasets/stanfordnlp/nnetnav-live) data, which was collected by a larger [Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model through unsupervised exploration on 15 live websites.

The live websites used for exploration are:
- allrecipes.com
- amazon.com
- apple.com
- bbcnews.com
- booking.com
- dictionary.cambridge.org
- coursera.org
- espn.com
- github.com
- google.com/flights
- google.com/maps
- google.com
- huggingface.co
- wolframalpha.com

More details about this model can be found in our paper: [NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild](https://arxiv.org/abs/2410.02907).

## Table of Contents

- [Model Card for Llama8b-NNetNav-Live](#model-card-for-llama8b-nnetnav-live)
  - [Table of Contents](#table-of-contents)
  - [Model Details](#model-details)
  - [Results on Benchmarks](#results-on-benchmarks)
  - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
  - [How to Get Started with the Model](#how-to-get-started-with-the-model)
  - [Training Details](#training-details)
    - [Training Data](#training-data)
    - [Training Procedure](#training-procedure)
  - [Environmental Impact](#environmental-impact)
  - [Technical Specifications](#technical-specifications)
    - [Hardware](#hardware)
    - [Software](#software)
  - [Model Card Authors](#model-card-authors)
  - [Model Card Contact](#model-card-contact)

## Model Details
This model is intended to be used as a **web agent**: given an instruction such as "Upvote the post by user smurty123 on subreddit r/LocalLLaMA" and a starting URL such as "reddit.com", the model performs the task by executing a sequence of actions.

The action space of the model is as follows:
```plaintext
Page Operation Actions:
`click [id]`: This action clicks on an element with a specific id on the webpage.
`type [id] [content] [press_enter_after=0|1]`: Use this to type the content into the field with id. By default, the "Enter" key is pressed after typing unless press_enter_after is set to 0.
`hover [id]`: Hover over an element with id.
`press [key_comb]`: Simulates the pressing of a key combination on the keyboard (e.g., Ctrl+v).
`scroll [down|up]`: Scroll the page up or down.

Tab Management Actions:
`new_tab`: Open a new, empty browser tab.
`tab_focus [tab_index]`: Switch the browser's focus to a specific tab using its index.
`close_tab`: Close the currently active tab.

URL Navigation Actions:
`goto [url]`: Navigate to a specific URL.
`go_back`: Navigate to the previously viewed page.
`go_forward`: Navigate to the next page (if a previous 'go_back' action was performed).

Completion Action:
`stop [answer]`: Issue this action when you believe the task is complete. If the objective is to find a text-based answer, provide the answer in the bracket. If you believe the task is impossible to complete, provide the answer as "N/A" in the bracket.
```

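Code that drives the model has to turn each emitted action string into something a browser controller can execute. The following is a minimal sketch of such a parser, assuming the model emits exactly one action per string in the bracketed syntax listed above. The `Action` class and `parse_action` function are hypothetical names used for illustration and are not part of the NNetNav codebase.

```python
import re
from dataclasses import dataclass
from typing import List

# Action names copied from the action space above.
KNOWN_ACTIONS = {
    "click", "type", "hover", "press", "scroll",
    "new_tab", "tab_focus", "close_tab",
    "goto", "go_back", "go_forward", "stop",
}


@dataclass
class Action:
    """Hypothetical container for one parsed action."""
    name: str
    args: List[str]


def parse_action(text: str) -> Action:
    """Split a string such as 'type [12] [hello] [1]' into a name and its arguments."""
    text = text.strip()
    match = re.match(r"^(\w+)", text)
    if match is None or match.group(1) not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {text!r}")
    # Each argument is the text between one pair of square brackets.
    args = re.findall(r"\[(.*?)\]", text[match.end():])
    return Action(match.group(1), args)


if __name__ == "__main__":
    print(parse_action("type [12] [hello world] [1]"))
    print(parse_action("go_back"))
```

Because the argument pattern is non-greedy, content that itself contains a `]` character would be split incorrectly; a production parser would need a stricter grammar.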
## Results on Benchmarks

This model achieves the following success rates (SR, in %) on WebArena and WebVoyager:

| Model                  | WebArena (SR) | WebVoyager (SR) |
|------------------------|--------------:|----------------:|
| **GPT-4**              | **14.1**      | **33.5**        |
| **llama8b-nnetnav-wa** | **9.5**       | **35.2**        |

## Bias, Risks, and Limitations

### **Bias**
As with all ML models, **Llama8b-NNetNav-Live** inherits biases from its training data. Because the dataset was collected via unsupervised exploration on live websites during a specific time period (December 2024 to January 2025), it reflects the biases present on these websites at that time.

- **Selection Bias:** The model is trained on interactions from a specific set of live websites and may struggle on out-of-domain websites, e.g. government websites.
- **Demographic Bias:** Our set of live websites over-represents Western, English-speaking users, and the model may perform worse on non-English or culturally distinct websites.
  - Example: A model trained mostly on U.S. e-commerce sites may navigate amazon.com effectively but struggle with Flipkart (India) or Rakuten (Japan).

If you are interested in training an NNetNav-based agent for your own domain, please check out our [codebase](https://github.com/MurtyShikhar/NNetnav).

### **Risks**
#### **1. Unintended Actions**
The model executes web actions based on a textual observation of the page (the accessibility tree, or AXTree), which may lead to unintended consequences on ambiguous or poorly structured websites.

- If instructed to "delete all spam messages in my inbox," but the website has unusual button placement in the AXTree, the model might mistakenly delete important emails instead.
- If asked to "buy the cheapest laptop on Amazon," the model might select an accessory instead of an actual laptop if the AXTree of the listing page has a misleading layout.

#### **2. Security & Privacy Risks**
Since the model interacts with external web content, there are significant risks related to unintentional data exposure, credential leaks, and interaction with harmful content.

- If asked to "log into my Gmail and check unread emails," the model may type and submit credentials without realizing it, potentially exposing passwords.
- A user asking the model to "search for free software downloads" might inadvertently lead it to interact with phishing or malware-hosting sites.

#### **3. Adversarial Manipulation**
Malicious websites can deceive the model by using **dark patterns**, i.e. UI/UX tricks that mislead users (or bots).

- A fraudulent website may create **fake "Close" buttons** in the AXTree that actually trigger **downloads or pop-ups**. The model, thinking it's closing a window, may instead **click a malicious link**.
- If asked to "unsubscribe from a newsletter," but the page uses **misleading button labels** in the AXTree (e.g., "Unsubscribe" actually means "Resubscribe"), the model could perform the opposite action.

#### **4. Legal & Ethical Considerations**
Web navigation often involves handling user-generated content, news, and e-commerce transactions, all of which pose ethical and legal challenges.

- If instructed to "find the latest election results," the model might click on a misleading news source, potentially spreading misinformation.
- If asked to "find the cheapest flight ticket," it could unintentionally violate terms of service by scraping restricted airline data.

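One way to reduce exposure to the risks above is to gate the model's actions behind a host allowlist and a human confirmation step. The sketch below illustrates the idea only; it is not part of this release. The names `ALLOWED_HOSTS`, `RISKY_ACTIONS`, `host_allowed`, and `needs_confirmation` are hypothetical, and the listed hosts and actions are placeholder values.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical policy values, chosen for illustration only.
ALLOWED_HOSTS = {"github.com", "huggingface.co"}
RISKY_ACTIONS = {"type", "press"}  # actions that can submit data


def host_allowed(url: str) -> bool:
    """True if the URL's host is an allowed host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)


def needs_confirmation(action_name: str, current_url: str,
                       target_url: Optional[str] = None) -> bool:
    """Return True when a human should approve the action before it runs."""
    if not host_allowed(current_url):
        return True
    if action_name == "goto":
        # Navigation is only auto-approved toward another allowed host.
        return target_url is None or not host_allowed(target_url)
    return action_name in RISKY_ACTIONS


if __name__ == "__main__":
    print(needs_confirmation("click", "https://github.com/explore"))
    print(needs_confirmation("type", "https://github.com/login"))
```

URLs without a scheme (for example `github.com`) have no parseable host here and are therefore treated as not allowed, which errs on the side of asking for confirmation.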
### **Limitations**
#### **1. Generalization to Unseen Websites**
This model was trained on interactions with 15 live websites, carried out between December 2024 and January 2025. If you intend to use the model on websites that differ substantially from these 15 sites, we strongly suggest training your own NNetNav model. See our [codebase](https://github.com/MurtyShikhar/NNetnav) for more information.

#### **2. Instruction Sensitivity**
Vague instructions can lead to unintended actions.

- "Find me the best laptop for gaming" is **subjective**, and the model might select a **random option** instead of applying concrete criteria (e.g., GPU, refresh rate).

#### **3. Performance on Long-Horizon Tasks**
The model may struggle when tasks require **deep memory retention, complex multi-step planning, or backtracking**.

- *Example:* When booking a hotel on a travel website, the model might navigate **through multiple filters and options** but forget previous selections by the time it reaches the checkout page.

#### **4. Token Limitations**
The model's **maximum sequence length of 20k tokens** limits its ability to handle long, continuous web interactions.

- *Example:* When filling in a very long multi-step form, the model might forget earlier responses, leading to errors.

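The token limit means that calling code must decide what to drop once an interaction grows long. The following sketch shows one possible strategy, keeping the most recent actions that fit within the budget. It makes two assumptions: `fit_history` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer, whose counts will differ.

```python
from typing import Callable, List

MAX_TOKENS = 20_000  # maximum sequence length stated in this card


def whitespace_token_count(text: str) -> int:
    # Crude stand-in for the model tokenizer (an assumption for this sketch).
    return len(text.split())


def fit_history(instruction: str, observation: str, history: List[str],
                count_tokens: Callable[[str], int] = whitespace_token_count,
                budget: int = MAX_TOKENS) -> List[str]:
    """Keep the most recent history entries that fit next to the fixed parts."""
    used = count_tokens(instruction) + count_tokens(observation)
    kept: List[str] = []
    # Walk from newest to oldest and stop at the first entry that does not fit.
    for step in reversed(history):
        cost = count_tokens(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    print(fit_history("a b", "c d", ["x y z", "p q", "r"], budget=7))
```

Dropping the oldest steps first is only one choice; it is exactly the kind of truncation that can make the model forget earlier form responses, so summarizing old steps may work better for long tasks.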
## How to Get Started with the Model

TODO

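The interaction described under Model Details can be pictured as a simple observe-act loop. The sketch below is illustrative only and makes several assumptions: `policy` stands in for a call to the model, `BrowserEnv` is a hypothetical browser wrapper with `observe()` and `step()` methods, and neither the prompt format nor these interfaces come from this release.

```python
import re
from typing import Callable, List, Optional, Protocol


class BrowserEnv(Protocol):
    """Hypothetical browser wrapper; not an API from the NNetNav codebase."""

    def observe(self) -> str: ...

    def step(self, action: str) -> None: ...


# Matches the completion action `stop [answer]` from the action space.
STOP_PATTERN = re.compile(r"^stop\s*\[(.*)\]\s*$", re.DOTALL)

Policy = Callable[[str, str, List[str]], str]


def run_episode(instruction: str, env: BrowserEnv, policy: Policy,
                max_steps: int = 20) -> Optional[str]:
    """Query the policy until it emits `stop [answer]` or the step limit is hit."""
    history: List[str] = []
    for _ in range(max_steps):
        observation = env.observe()
        action = policy(instruction, observation, history).strip()
        stop = STOP_PATTERN.match(action)
        if stop:
            return stop.group(1)  # the answer inside the brackets
        env.step(action)
        history.append(action)
    return None  # no stop action within the step limit
```

Here `policy` would wrap whatever inference stack serves the model, and `max_steps` is an arbitrary default that callers should tune to their task.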
## Training Details

### Training Data

This model was trained on the [NNetNav-Live](https://huggingface.co/datasets/stanfordnlp/nnetnav-live) dataset, which comprises roughly 5k synthetic demonstrations from 15 live websites.

### Training Procedure

This model was trained for 2 epochs (roughly 4k gradient steps) with a batch size of 128 and a maximum sequence length of 20,000 tokens.

## Environmental Impact

- **Hardware Type:** 4 H100 GPUs (80 GB)
- **Hours used:** Roughly 2 days.
- **Cloud Provider:** Stanford compute.
- **Compute Region:** Stanford energy grid.

## Technical Specifications

This model was trained on a Slurm cluster.

### Hardware

This model was trained on 4 H100 GPUs.

### Software

This model was fine-tuned with [Open-Instruct](https://github.com/allenai/open-instruct/tree/main).

## Model Card Authors

Shikhar Murty

## Model Card Contact