metadata
language:
  - en
library_name: transformers
license: cc-by-4.0
tags:
  - kl3m
  - kl3m-002
  - patent
  - all the patents
  - slm
date: '2024-03-12T00:00:00.000Z'
pipeline_tag: text-generation
widget:
  - text: |
      # Title
  - temperature: 0.3
  - do_sample: true

All the Patents 170m Model

kl3m-002-170m-patent is a (very) small language model (SLM) fine-tuned from kl3m-002-170m to generate "realistic" patent text. For more information about the base model, please see its model page.

All the Patents

Why?

If a GPT2-sized model can generate a valid set of claims, should anyone be able to monopolize the invention?

At their heart, patents are a temporary, sanctioned monopoly on an invention through a license to sue. This monopoly is justified by the public good created by encouraging innovation and the long-term impact of that innovation being shared in the public domain.

Unfortunately, this worthy policy goal has been lost in the chaos and misuse of the patent system.

One of the most common sources of frustration is the granting of "obvious" patents. While some inventions are clearly novel and non-obvious, many are not, yet they still slip through the examination process. These obvious but granted patents then loom large over the market, creating a "thicket" that discourages use or subsequent invention in the area of the granted patent. "Undoing" the grant of a patent is a costly and time-consuming process with possible negative consequences, and so many of these patents simply sit as prior art on the books, even if the patent holder knows they could never enforce them.

Congress and various stakeholders have discussed and proposed changes over time, including most recently the America Invents Act (AIA), but the problem of obvious patents persists.

But what if someone were to generate all the obvious inventions and make them public?

What if we shared the means of producing these obvious inventions so that everyone could help generate them on a normal CPU or consumer GPU?

And what if we could then make those obvious inventions easily searchable for anyone, including PTO examiners themselves, to use?

How it Works

We start with a small, GPT2-sized language model, kl3m-002-170m, which was trained on a clean, copyright-free dataset. This helps ensure that generations do not include copyrighted text, which would otherwise allow third parties to interfere with the project via DMCA takedown requests.

Next, we fine-tune this model on two simultaneous tasks:

  1. Top-down drafting: We start from the most abstract parts of the patent - the title and abstract - and then generate the detailed claims. This is a traditional next-token prediction order.
# Patent

## Title
{title}

## Abstract
{abstract}

## Claims

1. {claim 1}

2. {claim 2}

...
  2. Bottom-up drafting: We start from the most detailed part of the patent - the claims - and then generate the abstract and title. This reversed order can be thought of as similar to traditional extractive/abstractive summarization tasks.
# Patent

## Claims

1. {claim 1}

2. {claim 2}

...

## Abstract
{abstract}

## Title
{title}
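
The two layouts above can be sketched as a pair of formatter functions. This is a minimal illustration only: the `PatentRecord` container and the function names are hypothetical and do not come from the project's training code, and the exact whitespace used during fine-tuning may differ.

```python
from dataclasses import dataclass


@dataclass
class PatentRecord:
    """Hypothetical container for one training example (not from the source)."""
    title: str
    abstract: str
    claims: list[str]


def _claims_block(claims: list[str]) -> str:
    # Number the claims from 1 and separate them with blank lines,
    # mirroring the "1. {claim 1}" / "2. {claim 2}" layout shown above.
    return "\n\n".join(f"{i}. {claim}" for i, claim in enumerate(claims, start=1))


def format_top_down(rec: PatentRecord) -> str:
    """Title, then abstract, then claims."""
    return (
        "# Patent\n\n"
        f"## Title\n{rec.title}\n\n"
        f"## Abstract\n{rec.abstract}\n\n"
        f"## Claims\n\n{_claims_block(rec.claims)}\n"
    )


def format_bottom_up(rec: PatentRecord) -> str:
    """Claims, then abstract, then title."""
    return (
        "# Patent\n\n"
        f"## Claims\n\n{_claims_block(rec.claims)}\n\n"
        f"## Abstract\n{rec.abstract}\n\n"
        f"## Title\n{rec.title}\n"
    )
```

Because both layouts share the same `# Patent` header, the first section heading that follows it is what tells the model which drafting direction to continue in.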

Once this fine-tuning is complete, we can then generate new patents using either technique by prompting the model as follows:

  1. Top-down prompt: "# Patent\n\n## Title"

  2. Bottom-up prompt: "# Patent\n\n## Claims"

It's critical that generation occurs with sufficient randomness and diversity to ensure that the generated patents are not simply reproductions of the training data. This is a key area of ongoing research and development.
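
One simple way to spot-check whether a generation reproduces known text is word n-gram overlap. The sketch below is an assumption for illustration, not the project's evaluation method: the function names, the lowercase whitespace tokenization, and the choice of `n` are all hypothetical.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(generated: str, reference_ngrams: set, n: int = 5) -> float:
    """Fraction of the generation's n-grams that also appear in the reference set.

    Returns 0.0 when the generation is shorter than n words.
    """
    grams = word_ngrams(generated, n)
    if not grams:
        return 0.0
    return len(grams & reference_ngrams) / len(grams)


# Hypothetical usage: build the reference set once from known texts,
# then score each new generation against it.
reference = word_ngrams("a method for manufacturing a polyurethane composition", n=3)
score = overlap_fraction("a method for manufacturing a widget", reference, n=3)
```

A score near 1.0 suggests the text is largely copied from the reference material, while a score near 0.0 suggests it is not. Where to set a threshold is a judgment call that this sketch does not settle.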

Much like the real process of invention, most of the "ideas" generated by this process will be either nonsense or otherwise unpatentable. Our goal is to estimate the "hit rate" of the model and continue to improve the efficiency and accessibility of the generation process so that the "cost per obvious invention" is as low as possible.
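
The relationship between hit rate and cost per obvious invention can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the figures in the usage lines are made-up placeholders rather than measurements from this project.

```python
def estimate_hit_rate(labels: list[bool]) -> float:
    """Share of reviewed generations judged to be a usable 'hit'."""
    if not labels:
        raise ValueError("need at least one reviewed generation")
    return sum(labels) / len(labels)


def cost_per_hit(total_cost: float, n_generated: int, hit_rate: float) -> float:
    """Total generation cost divided by the expected number of hits."""
    expected_hits = n_generated * hit_rate
    if expected_hits <= 0:
        raise ValueError("expected number of hits must be positive")
    return total_cost / expected_hits


# Placeholder numbers: 2 hits among 100 reviewed samples,
# and 10.0 cost units spent generating 1,000 candidates.
rate = estimate_hit_rate([True, True] + [False] * 98)
unit_cost = cost_per_hit(total_cost=10.0, n_generated=1000, hit_rate=rate)
```

Under these assumptions, either raising the hit rate or lowering the cost of each generation reduces the cost per obvious invention.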

Current Status

This project is still in its infancy. We're doing R&D to develop prototype tools to demonstrate the possibility and cost of generating and sharing these obvious inventions. This R&D is currently focused on data collection, data curation, model training, and model evaluation.

Generation

You can generate your own examples as follows. For a "complete" patent, increase max_new_tokens to the largest value that fits in your available memory (VRAM on a GPU, or system RAM when running on CPU as in the example below).

import json
from transformers import pipeline

# Load the model and tokenizer on CPU
p = pipeline('text-generation', 'alea-institute/kl3m-002-170m-patent', device='cpu')

# Example usage on CPU
text = "# Patent\n\n## Title"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ], 
        indent=2
    )
)

Example output:

[
  "# Patent\n\n## Title\nMethod for manufacturing a temperature-controllable polyurethane composition and method",
  "# Patent\n\n## Title\nElectronic device\n\n## Abstract\nAn electronic device includes a display panel and a",
  "# Patent\n\n## Title\nMethods and devices for tissue repair using a neural network\n\n## Abstract"
]
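
Generated text follows the same heading layout as the training templates, so it can be split back into sections for storage or search. The helper below is a minimal sketch; the name `split_sections` is hypothetical, and it assumes sections are introduced by `## ` headings exactly as in the output above.

```python
import re


def split_sections(text: str) -> dict[str, str]:
    """Map each '## Heading' (lowercased) to the text that follows it."""
    # re.split with a capture group returns:
    # [preamble, heading_1, body_1, heading_2, body_2, ...]
    parts = re.split(r"^##[ \t]+(.+?)[ \t]*$", text, flags=re.MULTILINE)
    sections = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        sections[name.strip().lower()] = body.strip()
    return sections


sample = (
    "# Patent\n\n## Title\nElectronic device\n\n"
    "## Abstract\nAn electronic device includes a display panel and a"
)
sections = split_sections(sample)
```

Because generation stops at max_new_tokens, the last section is often truncated or empty. Downstream code should treat a missing or empty section as incomplete rather than as an error.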

Related Material

License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

Contact

The KL3M model family is now maintained by the ALEA Institute. For technical support, collaboration opportunities, or general inquiries, see the contact details at the end of this page.

Acknowledgments

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.

Citation

Tokenizer, dataset, and model publications are pending.

Contact

For any questions, please contact ALEA Institute at [email protected] or create an issue on this repository or GitHub.

https://aleainstitute.ai