# Offline Mode and Security:

## TL;DR

To run offline, use either the smart download or the manual download approach.

* Smart Download
    1) Run online with a command that downloads the model for you (i.e. using the HF repo name, not the file name)
    2) Go offline and run using the file directly, or use the UI to select the model
E.g.
```bash
# online do:
python generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr --max_seq_len=4096 --add_disk_models_to_ui=False
# Then use h2oGPT as you normally would for any tasks you expect to do offline.
# Once offline do:
TRANSFORMERS_OFFLINE=1 python generate.py --base_model=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
# or:
TRANSFORMERS_OFFLINE=1 python generate.py --base_model=llama --model_path_llama=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
# or if choosing in UI do (be sure to choose correct prompt_type too):
TRANSFORMERS_OFFLINE=1 python generate.py --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
```

* Manual Download
    1) Download the model file you want and place it into `llamacpp_path` (i.e. download the URL to a local file)
    2) Go offline and run using the file directly, or use the UI to select the model

```bash
# online do:
wget https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf?download=true -O llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf
# Then use h2oGPT normally for any tasks you expect to do offline.
# Once offline do:
TRANSFORMERS_OFFLINE=1 python generate.py --base_model=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
# or:
TRANSFORMERS_OFFLINE=1 python generate.py --base_model=llama --model_path_llama=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
# or if choosing in UI do (be sure to choose correct prompt_type too):
TRANSFORMERS_OFFLINE=1 python generate.py --gradio_offline_level=2 --share=False --add_disk_models_to_ui=False
```

NOTE: If you set `--prepare_offline_level=2` for the first online call, h2oGPT will fetch its standard set of models for offline use, which may be more than you require.  You can tune the code in `../src/prepare_offline.py` to get only the models you need.

### Easy Way:

Run h2oGPT as you would in offline mode, making sure to use the same LLM and to upload documents with the same parsers you want to use offline.  The `~/.cache` folder will be filled, and you can use it in offline mode.
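
For example, a minimal sketch of warming the cache online (the model choice and the `du` sanity check are illustrative; use whichever model and document parsers you intend to run offline):
```bash
# Warm the cache online with the same model you plan to use offline,
# then sanity-check that the expected cache folders were populated.
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --score_model=None
du -sh ~/.cache/huggingface ~/.cache/torch ~/nltk_data 2>/dev/null
```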

### Moderately Easy Way:

If you can run on the same (or an equivalent) system as the one that will be used offline, you can run the following and collect all needed items in the `~/.cache/` and `~/nltk_data` folders, specifically:
* `~/.cache/selenium/`
* `~/.cache/huggingface/`
* `~/.cache/torch/`
* `~/.cache/clip/`
* `~/.cache/doctr/`
* `~/.cache/chroma/`
* `~/.cache/ms-playwright/`
* `~/nltk_data/`
```bash
python generate.py --score_model=None --gradio_size=small --model_lock="[{'base_model': 'h2oai/h2ogpt-4096-llama2-7b-chat'}]" --save_dir=save_fastup_chat --prepare_offline_level=2 --add_disk_models_to_ui=False
# below are already in docker
python -m nltk.downloader all
playwright install --with-deps
```
Some of these locations can be controlled, but others cannot, so it is best to make a fresh local `~/.cache` (e.g. move the original out of the way), run the preceding command, archive the result for the offline system, restore the old `~/.cache`, and then go offline.  If it is the same system, those steps are not required; you can just go fully offline.
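
A minimal sketch of that cache-capture workflow (the archive name is illustrative, and it assumes the offline machine uses the same home-directory layout):
```bash
# Sketch only: capture a fresh cache for transfer to the offline system.
mv ~/.cache ~/.cache.orig                        # set the original cache aside
python generate.py --score_model=None --prepare_offline_level=2 \
    --model_lock="[{'base_model': 'h2oai/h2ogpt-4096-llama2-7b-chat'}]"   # fill a fresh ~/.cache
python -m nltk.downloader all                    # populate ~/nltk_data (already done inside docker)
playwright install --with-deps                   # populate ~/.cache/ms-playwright (already done inside docker)
tar czf h2ogpt_offline_cache.tar.gz -C "$HOME" .cache nltk_data   # archive for the offline system
rm -rf ~/.cache && mv ~/.cache.orig ~/.cache     # restore the original cache
# On the offline machine, unpack into the user's home directory:
tar xzf h2ogpt_offline_cache.tar.gz -C "$HOME"
```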

If you are only concerned with what h2oGPT itself needs, not any inference servers, you can run with `--prepare_offline_level=1`, which will not obtain models associated with inference servers (e.g. vLLM or TGI).
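
For example, a minimal invocation might look like this (flags other than `--prepare_offline_level=1` are just illustrative):
```bash
# Fetch only what generate.py itself needs; skips models tied to inference servers (vLLM/TGI).
python generate.py --score_model=None --prepare_offline_level=1 --add_disk_models_to_ui=False
```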

If you have a GGUF/GGML file, you should download it ahead of time and place it at a path that you provide to `--llamacpp_dict` via its `model_path_llama` dict entry.
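
For example, a hedged sketch (the file path is illustrative):
```bash
# Point llama.cpp at a pre-downloaded GGUF file via the model_path_llama entry of --llamacpp_dict.
python generate.py --base_model=llama --prompt_type=llama2 \
    --llamacpp_dict="{'model_path_llama': '/path/to/llama-2-7b-chat.Q6_K.gguf'}"
```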

### Hard Way:

Identify and download all needed models.  Note that the following list is not exhaustive, because the set of models changes frequently and each uses a different download mechanism.

Note that when running `generate.py` and asking your first question, it will download the model(s); for the 6.9B model this takes about 15 minutes per 3 PyTorch bin files at a 10MB/s download speed.

If all data has already been put into `~/.cache` by HF transformers, and GGUF/GGML files have been downloaded and are pointed to directly (e.g. with `--model_path_llama=llama-2-7b-chat.Q6_K.gguf` from pre-downloaded `https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf`), then the following steps related to downloading HF models are not required.

* Download the model and tokenizer of choice
    
    ```python
    from transformers import AutoTokenizer, AutoModelForCausalLM
    model_name = 'h2oai/h2ogpt-oasst1-512-12b'
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.save_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(model_name)
    ```
    If using GGUF files, those should be downloaded separately and manually, e.g.:
    ```bash
    wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf
    ```
    and point to the file path, e.g. `--base_model=llama --model_path_llama=llama-2-7b-chat.Q6_K.gguf`.

* Download the reward model, unless you pass `--score_model='None'` to `generate.py`
    ```python
    # and reward model
    reward_model = 'OpenAssistant/reward-model-deberta-v3-large-v2'
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    model = AutoModelForSequenceClassification.from_pretrained(reward_model)
    model.save_pretrained(reward_model)
    tokenizer = AutoTokenizer.from_pretrained(reward_model)
    tokenizer.save_pretrained(reward_model)
    ```
    
* For LangChain support, download the embedding model:
    ```python
    hf_embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
    model_kwargs = dict(device='cpu')
    from langchain.embeddings import HuggingFaceEmbeddings
    embedding = HuggingFaceEmbeddings(model_name=hf_embedding_model, model_kwargs=model_kwargs)
    ```
    
* For the HF inference server and OpenAI, the following downloads the tokenizers used for the Hugging Face text generation inference server and gpt-3.5-turbo:
    ```python
    import tiktoken
    encoding = tiktoken.get_encoding("cl100k_base")
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    ```
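
    tiktoken caches downloaded encodings on disk; one option for capturing that cache in a known, transferable location is to set `TIKTOKEN_CACHE_DIR` (an environment variable tiktoken honors) before the call above, e.g. (the path is illustrative):
    ```bash
    # Capture tiktoken's encoding files in a directory you can copy to the offline system.
    TIKTOKEN_CACHE_DIR="$HOME/.cache/tiktoken" python -c 'import tiktoken; tiktoken.get_encoding("cl100k_base")'
    ```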

* Get the gpt-2 tokenizer for summarization token counting
    ```python
    from transformers import AutoTokenizer
    model_name = 'gpt2'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(model_name)
    ```

### Run h2oGPT in offline mode

```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --base_model='h2oai/h2ogpt-oasst1-512-12b' --gradio_offline_level=2 --share=False
```
For more info on offline mode for transformers, see [Offline Mode](https://huggingface.co/docs/transformers/installation#offline-mode).

Some code that involves uploads outside of user control is always disabled: Hugging Face telemetry, gradio telemetry, and chromadb posthog telemetry.

The additional option `--gradio_offline_level=2` changes fonts to avoid downloading Google fonts. Downloading fonts is less intrusive than uploading data, but disabling it is still required in the air-gapped case.  The fallback fonts don't look as nice as Google fonts, but they ensure fully offline behavior.

If the front-end can still access the internet but only the backend should not, you can use `--gradio_offline_level=1` for slightly better-looking fonts.

Note that gradio attempts to download [iframeResizer.contentWindow.min.js](https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.1/iframeResizer.contentWindow.min.js),
but nothing prevents gradio from working without it, so a simple firewall block is sufficient.  For more details, see: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/10324.
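
If you want that request to fail fast during testing rather than hang, one illustrative approach (assuming root access; a truly air-gapped network needs nothing extra) is to blackhole the CDN host:
```bash
# Resolve the CDN host to an unroutable address so the request fails immediately.
echo "0.0.0.0 cdnjs.cloudflare.com" | sudo tee -a /etc/hosts
```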

For non-HF models, you must specify the file name, since the HF name cannot be mapped to a file name for GGUF/GPTQ etc. files automatically without internet access.  E.g. after running one of the offline preparation methods above, run:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --gradio_offline_level=2 --base_model=llama --model_path_llama=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr
```
That is, you cannot do:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --gradio_offline_level=2 --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr
```
since the mapping from that name to the actual file is not trivial and is only possible with internet access.

It is a good idea to also set `--prompt_type`, since the given form of the model name may not be in the prompt dictionary lookup.

#### Run vLLM offline

In order to use vLLM offline, use the absolute path to the model weights, which can be a locally downloaded model or one already sitting in the `.cache` folder, e.g.:
```bash
python -m vllm.entrypoints.openai.api_server --port=5000 --host=0.0.0.0 --model "/home/hemant/.cache/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496" --tokenizer=hf-internal-testing/llama-tokenizer --tensor-parallel-size=1 --seed 1234 --max-num-batched-tokens=4096
```
Otherwise, vLLM will try to contact Hugging Face servers.
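
The snapshot directory follows the standard `huggingface_hub` cache layout; for example, to find the absolute path to pass to vLLM (the model name and snapshot hash are illustrative, yours will differ):
```bash
# List the cached snapshot(s) for a model to get the absolute path for --model.
ls ~/.cache/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/
```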

You can do the same for h2oGPT, but note that if you pass an absolute path for the base model, you have to specify the `--prompt_type`.
```bash
python generate.py --inference_server="vllm:0.0.0.0:5000" --base_model='$HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496' --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048 --concurrency_count=64 --batch_size=16 --prompt_type=llama2 --add_disk_models_to_ui=False
```

See [README_docker](README_docker.md) for more details on running h2oGPT in offline mode for docker.

### Disable access or port

To ensure nobody can access your gradio server, disable the port via a firewall.  If that is a hassle, you can instead enable authentication by adding this to the CLI when running `python generate.py`:
```
--auth=[('jon','password')]
```
with no spaces.  Run `python generate.py --help` for more details.

### Fully disable Chroma telemetry

Chroma's documented options still do not disable all telemetry, so run:

```bash
sp=`python -c 'import site; print(site.getsitepackages()[0])'`
sed -i 's/posthog\.capture/return\n            posthog.capture/' $sp/chromadb/telemetry/posthog.py
```
or the equivalent for Windows/Mac.  Alternatively, edit the file manually so that the `capture` function just returns.

This is automatically done if using `linux_install.sh` or `linux_install_full.sh`.

### Disable h2oGPT telemetry

To prevent h2oGPT from monitoring which UI elements are clicked, set the environment variable `H2OGPT_ENABLE_HEAP_ANALYTICS=False` or pass
```bash
python generate.py --enable-heap-analytics=False ...
```
Note that no data or user inputs are included; only raw Svelte UI element IDs are recorded.