alea-institute committed
Commit 0772425
1 Parent(s): 9d81011
Upload folder using huggingface_hub

Files changed:
- README.md +177 -0
- config.json +34 -0
- generation_config.json +8 -0
- model.safetensors +3 -0
- special_tokens_map.json +31 -0
- tokenizer.json +0 -0
- tokenizer_config.json +0 -0

README.md
ADDED
@@ -0,0 +1,177 @@
---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- kl3m-002
- patent
- all the patents
- slm
date: '2024-03-12T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- text: "# Title\n"
- temperature: 0.3
- do_sample: True
---

# All the Patents 170m Model

`kl3m-002-170m-patent` is a (very) small language model (SLM) fine-tuned from `kl3m-002-170m` to generate "realistic" patent text. For more information about the base model, please see [its model page](https://huggingface.co/alea-institute/kl3m-002-170m).

# All the Patents

## Why?

#### If a GPT2-sized model can generate a valid set of claims, should anyone be able to monopolize the invention?

At their heart, patents are a temporary, sanctioned monopoly on an invention through a license to sue. This monopoly is justified by the public good created by encouraging innovation and the long-term impact of that innovation being shared in the public domain.

Unfortunately, this worthy policy goal has been lost in the chaos and misuse of the patent system.

One of the most common sources of frustration is the granting of "obvious" patents. While some inventions are clearly novel and non-obvious, many are not - but still slip through the examination process. These obvious but granted patents then loom large over the market, creating a "thicket" that discourages use or subsequent invention in the area of the granted patent. "Undoing" the grant of a patent is a costly and time-consuming process with possible negative consequences, and so many of these patents simply sit as prior art on the books, even if the patentholder knows they could never enforce them.

Congress and various stakeholders have discussed and proposed changes over time, including most recently the America Invents Act (AIA), but the problem of obvious patents persists.

But what if someone were to generate all the obvious inventions and make them public?

What if we shared the means of producing these obvious inventions so that everyone could help generate them on a normal CPU or consumer GPU?

And what if we could then make those obvious inventions easily searchable for anyone, including PTO examiners themselves, to use?

## How it Works

We start with a small, GPT2-sized language model - [kl3m-170](https://273ventures.com/kl3m-the-first-legal-large-language-model/) - which was trained on a clean, copyright-free dataset. This helps us ensure that generations do not include copyrighted text, which would allow third parties to interfere with the project via DMCA takedown requests.

Next, we fine-tune this model on two simultaneous tasks (a construction sketch follows the two templates below):

1. **Top-down drafting**: We start from the most abstract parts of the patent - the title and abstract - and then generate the detailed claims. This is a traditional next-token prediction order.

```text
# Patent

## Title
{title}

## Abstract
{abstract}

## Claims

1. {claim 1}

2. {claim 2}

...
```

2. **Bottom-up**: We start from the most detailed part of the patent - the claims - and then generate the abstract and title. This reversed order can be thought of as similar to traditional extractive/abstractive summarization tasks.

```text
# Patent

## Claims

1. {claim 1}

2. {claim 2}

...

## Abstract
{abstract}

## Title
{title}
```

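The exact preprocessing code is not published in this card; the following is a minimal sketch of how a single record might be rendered into the two training strings above, assuming a hypothetical record with `title`, `abstract`, and `claims` fields:

```python
# Minimal sketch: build the two fine-tuning strings from one record.
# The record structure and field names are assumptions for illustration.
def build_top_down(record: dict) -> str:
    claims = "\n\n".join(f"{i}. {c}" for i, c in enumerate(record["claims"], start=1))
    return (
        "# Patent\n\n"
        f"## Title\n{record['title']}\n\n"
        f"## Abstract\n{record['abstract']}\n\n"
        f"## Claims\n\n{claims}\n"
    )

def build_bottom_up(record: dict) -> str:
    claims = "\n\n".join(f"{i}. {c}" for i, c in enumerate(record["claims"], start=1))
    return (
        "# Patent\n\n"
        f"## Claims\n\n{claims}\n\n"
        f"## Abstract\n{record['abstract']}\n\n"
        f"## Title\n{record['title']}\n"
    )

# Toy example record (illustrative only).
example = {
    "title": "Adjustable widget mount",
    "abstract": "A mount for widgets that adjusts along two axes.",
    "claims": ["A mount comprising...", "The mount of claim 1, wherein..."],
}
print(build_top_down(example))
print(build_bottom_up(example))
```
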
Once this fine-tuning is complete, we can then generate new patents using either technique by prompting the model as follows:

1. **Top-down prompt**: `"# Patent\n\n## Title"`

2. **Bottom-up prompt**: `"# Patent\n\n## Claims"`

It's critical that generation occurs with sufficient randomness and diversity to ensure that the generated patents are not simply reproductions of the training data. This is a key area of ongoing research and development.
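
As an illustration (not the project's evaluation harness), both prompt styles can be sampled with the same `transformers` pipeline; the specific temperatures, `top_p`, and token budget below are assumptions rather than tuned settings:

```python
from transformers import pipeline

# Illustrative sampling sweep over both prompt styles.
# Parameter values are assumptions, not recommended settings.
p = pipeline("text-generation", "alea-institute/kl3m-002-170m-patent", device="cpu")

prompts = {
    "top-down": "# Patent\n\n## Title",
    "bottom-up": "# Patent\n\n## Claims",
}

for name, prompt in prompts.items():
    for temperature in (0.3, 0.5, 0.8):
        result = p(
            prompt,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            num_return_sequences=1,
            max_new_tokens=256,
        )
        # Print a rough size indicator for each sampled draft.
        print(name, temperature, len(result[0]["generated_text"]))
```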

**Much like the real process of invention, most of the "ideas" generated by this process will be either nonsense or otherwise unpatentable. Our goal is to estimate the "hit rate" of the model and continue to improve the efficiency and accessibility of the generation process so that the "cost per obvious invention" is as low as possible.**

## Current Status

This project is still in its infancy. We're doing R&D to develop prototype tools to demonstrate the possibility and cost of generating and sharing these obvious inventions. This R&D is currently focused on data collection, data curation, model training, and model evaluation.

## Generation

You can generate your own examples as follows:

```python
import json
from transformers import pipeline

# Load the model and tokenizer on CPU
p = pipeline('text-generation', 'alea-institute/kl3m-002-170m-patent', device='cpu')

# Example usage on CPU
text = "# Title\n"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=1024)
        ],
        indent=2
    )
)
```
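
The pipeline call above is the simplest path; for finer control over decoding, the model should also load through the standard `AutoModelForCausalLM` and `AutoTokenizer` interfaces. A sketch follows; the prompt and decoding parameters are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Direct (non-pipeline) loading; decoding parameters below are illustrative.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-170m-patent")
model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-002-170m-patent")

# Bottom-up prompt: generate claims first, then abstract and title.
inputs = tokenizer("# Patent\n\n## Claims", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```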

### Related Material

* https://www.federalregister.gov/documents/2024/02/27/2024-03967/updated-guidance-for-making-a-proper-determination-of-obviousness

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai

## Acknowledgments

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.

## Citation

Tokenizer, dataset, and model publications are pending.

## Contact

For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
config.json
ADDED
@@ -0,0 +1,34 @@
{
  "_name_or_path": "kl3m-002-170m-patent",
  "architectures": [
    "GPTNeoXForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.1,
  "eos_token_id": 1,
  "hidden_act": "gelu",
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4096,
  "model_type": "gpt_neox",
  "num_attention_heads": 16,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000,
  "rotary_emb_base": 10000,
  "rotary_pct": 0.25,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.38.0",
  "use_cache": false,
  "use_parallel_residual": true,
  "vocab_size": 32768
}
generation_config.json
ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 2,
  "transformers_version": "4.38.0",
  "use_cache": false
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9d4b0f72f7f1d6c9dbe753cf4b20f6f3e922771e121acef7902875071904584f
size 671774872
special_tokens_map.json
ADDED
@@ -0,0 +1,31 @@
{
  "bos_token": {
    "content": "<|start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<|mask|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|end|>",
  "unk_token": {
    "content": "<|unk|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
The diff for this file is too large to render.