License-Llama3-8B

Introduction

We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We achieved this by first constructing a domain-specific dataset based on 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.

License-Llama3-8B identifies 27 common license terms and, for each term, one of three attitudes (CAN, CANNOT, MUST). In our experiments, License-Llama3-8B achieves a precision of 92.63% and a recall of 83.89% in license term identification. In the combined task of term and attitude identification, it achieves a precision of 90.04% and a recall of 81.55%.
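
For readers who prefer a single combined number, the precision and recall figures above can be folded into an F1 score (their harmonic mean). The sketch below only does that arithmetic; the resulting F1 values (roughly 88.0% and 85.6%) are derived here from the reported figures and are not results reported by the authors. The helper name `f1_score` is ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above, in percent.
term_f1 = f1_score(92.63, 83.89)           # license term identification
term_attitude_f1 = f1_score(90.04, 81.55)  # combined term and attitude identification

print(f"Term identification F1: {term_f1:.2f}%")
print(f"Term and attitude identification F1: {term_attitude_f1:.2f}%")
```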

Use with transformers

With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers.
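
If you are unsure which version is installed, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are ours, and the comparison looks only at the leading numeric part of the version, so pre-release suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.42.4"  # minimum version stated above


def version_tuple(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '4.43.0.dev0' -> (4, 43, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    print("transformers is not installed; run: pip install --upgrade transformers")
else:
    status = "OK" if meets_minimum(installed) else "too old, please upgrade"
    print(f"transformers {installed}: {status}")
```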

import transformers
import torch
import json

# Definition of license terms and attitudes
Terms = {
    'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
    'Add License Terms': 'provide additional license terms',
    'Add Own Notices': 'add own notices in derivative works',
    'Ask Distribution Fee': 'ask a fee to distribute a copy',
    'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
    'Copy': 'reproduce the original work in copies',
    'Distribute': 'distribute original or modified derivative works',
    'Modify': 'modify the software and create derivatives',
    'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
    'Publicly Display': 'display the original work publicly',
    'Publicly Perform': 'perform the original work publicly',
    'Sublicense': 'incorporate the work into something that has a more restrictive license',
    'Commercial Use': 'use the software for commercial purposes',
    'Private Use': 'use or modify the software freely or privately without distributing it',
    'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
    'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
    'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
    'Include License': 'include the full text of license(license copy) in modified software',
    'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
    'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
    'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
    'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
    'Use TradeMark': 'use contributor’s name, trademark or logo',
    'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
    'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
    'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
    'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
}


Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
             "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
             "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}


# Create the Prompt
def create_prompt(term_definition, attitude_definition, license_text):
    exm = {
    "Distribute": "CAN",
    "Use": "CAN",
    "Modify": "CANNOT"
    }

    prompt = f"""### OBJECTIVE
Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.

### DEFINITION OF TERMS
{term_definition}

### DEFINITION OF ATTITUDES
{attitude_definition}

### LICENSE TEXT
{license_text}

### RESPONSE
Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.

### Output Format Example
```
{json.dumps(exm, indent=2)}
```
"""
    return prompt


# Load model and create a pipeline
model_id = "AnonymousAuthors/License-Llama3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # let Accelerate place the model on the available device(s)
)

# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."

prompt = create_prompt(Terms, Attitudes, license_text)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    pad_token_id=pipeline.tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.3,
    top_p=0.7,
)

response = outputs[0]["generated_text"][len(prompt):]

print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
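
The prompt asks for JSON key-value pairs, but the generated text may wrap the JSON in code fences or surround it with extra text. The sketch below shows one possible way to turn the raw response into a Python dictionary. It assumes the response contains a single flat JSON object; the function name `parse_license_response` is hypothetical, and the sample string is hand-written for illustration, not actual model output.

```python
import json
import re

VALID_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_license_response(response: str, valid_terms=None) -> dict:
    """Extract the first flat JSON object from the response and keep only valid pairs."""
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    result = {}
    for term, attitude in parsed.items():
        # Drop attitudes outside CAN/CANNOT/MUST and, if a term list is given, unknown terms.
        if not isinstance(attitude, str) or attitude not in VALID_ATTITUDES:
            continue
        if valid_terms is not None and term not in valid_terms:
            continue
        result[term] = attitude
    return result


# Hand-written sample in the requested output format (not actual model output).
sample = 'Result:\n```\n{\n  "Distribute": "CAN",\n  "State Changes": "MUST",\n  "Unknown Term": "MAYBE"\n}\n```'
print(parse_license_response(sample))
```

Passing `valid_terms=Terms` restricts the result to the 27 term names defined earlier.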

Use with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.
Install vLLM with pip:

pip install vllm==0.3.1

Run the following command to start the vLLM server:

python -m vllm.entrypoints.openai.api_server \
       --served-model-name llama3-8b \
       --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
       --gpu-memory-utilization 0.9 \
       --tensor-parallel-size 1 \
       --host 0.0.0.0 \
       --port 8000
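
Before sending requests, it can be useful to confirm that the server is up. The sketch below queries the `/v1/models` route of the OpenAI-compatible API using only the standard library; the helper names `models_url` and `list_served_models` are ours, and the base URL assumes the host and port used in the command above. The returned list is expected to include the name passed via `--served-model-name`.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from the API base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str = "http://127.0.0.1:8000/v1", timeout: float = 5.0) -> list:
    """Return the ids of the models the server reports as available."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Usage (requires the server started above to be running):
# print(list_served_models())
```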

Then you can request the server to identify license terms:

from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)


def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
    resp = client.completions.create(
        model=model_type,
        prompt=query,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        seed=42)
    
    response = resp.choices[0].text
    return response


# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."

# For the definition of Terms and Attitudes, please refer to the previous section
prompt = create_prompt(Terms, Attitudes, license_text)

response = license_extract(prompt, model_type='llama3-8b', 
                            max_tokens=1500, temperature=0.3, top_p=0.7)

print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
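
Once the response has been converted to a dictionary of term and attitude pairs, it can be easier to read when grouped by attitude. The following is a small illustrative sketch; the function name `group_by_attitude` is ours, and the example dictionary is made up in the expected output format rather than taken from an actual model run.

```python
from collections import defaultdict


def group_by_attitude(result: dict) -> dict:
    """Group term names under their attitude, e.g. {'CAN': [...], 'MUST': [...]}."""
    grouped = defaultdict(list)
    for term, attitude in result.items():
        grouped[attitude].append(term)
    return {attitude: sorted(terms) for attitude, terms in grouped.items()}


# Made-up example in the expected output format (not actual model output).
example = {"Distribute": "CAN", "Modify": "CAN", "State Changes": "MUST"}

for attitude, terms in group_by_attitude(example).items():
    print(f"{attitude}: {', '.join(terms)}")
```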

⚠️ NOTE: In our repeated tests, the model performed better when using vLLM for inference under the same inference parameters.
