---
license: llama3
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- llama
- llama-3
datasets:
- AnonymousAuthors/OSS-License-Terms
base_model:
- meta-llama/Meta-Llama-3-8B
---

# License-Llama3-8B

## Introduction
We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. To build it, we first constructed a domain-specific dataset from 3,238 OSS licenses, and then performed domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model. [Our work has been accepted for publication in the proceedings of the 47th International Conference on Software Engineering (ICSE 2025).](https://conf.researchr.org/details/icse-2025/icse-2025-industry-challenge-track/6/Exploring-Large-Language-Models-for-Analyzing-Open-Source-License-Conflicts-How-Far-)

License-Llama3-8B identifies 27 common license terms and, for each term, one of three attitudes (CAN, CANNOT, or MUST). In our experiments, it achieves a precision of 92.63% and a recall of 83.89% in license term identification. In the combined task of term and attitude identification, it achieves a precision of 90.04% and a recall of 81.55%.

## Use with transformers
With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the `generate()` function.

Make sure to update your transformers installation via `pip install --upgrade transformers`.

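Before loading the model, it can help to confirm that the installed version meets the minimum stated above. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison only looks at the leading numeric part of each version component.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.42.4"  # minimum version stated in this section


def version_tuple(version: str) -> tuple:
    """Turn '4.42.4' or '4.45.0.dev0' into a tuple of its leading numeric parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


try:
    installed = metadata.version("transformers")
    print(f"transformers {installed}: meets minimum = {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```
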
````python
import transformers
import torch
import json

# Definition of license terms and attitudes
Terms = {
    'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
    'Add License Terms': 'provide additional license terms',
    'Add Own Notices': 'add own notices in derivative works',
    'Ask Distribution Fee': 'ask a fee to distribute a copy',
    'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
    'Copy': 'reproduce the original work in copies',
    'Distribute': 'distribute original or modified derivative works',
    'Modify': 'modify the software and create derivatives',
    'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
    'Publicly Display': 'display the original work publicly',
    'Publicly Perform': 'perform the original work publicly',
    'Sublicense': 'incorporate the work into something that has a more restrictive license',
    'Commercial Use': 'use the software for commercial purposes',
    'Private Use': 'use or modify the software freely or privately without distributing it',
    'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
    'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
    'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
    'Include License': 'include the full text of license(license copy) in modified software',
    'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
    'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
    'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
    'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
    'Use TradeMark': 'use contributor’s name, trademark or logo',
    'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
    'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
    'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
    'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
}


Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
             "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
             "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}


# Create the Prompt
def create_prompt(term_definition, attitude_definition, license_text):
    # Example of the expected output format, embedded in the prompt
    exm = {
        "Distribute": "CAN",
        "Use": "CAN",
        "Modify": "CANNOT"
    }

    prompt = f"""### OBJECTIVE
Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.

### DEFINITION OF TERMS
{term_definition}

### DEFINITION OF ATTITUDES
{attitude_definition}

### LICENSE TEXT
{license_text}

### RESPONSE
Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.

### Output Format Example
```
{json.dumps(exm, indent=2)}
```
"""
    return prompt


# Load model and create a pipeline
model_id = "AnonymousAuthors/License-Llama3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"  # "auto" is a device_map value, not a valid `device`
)

# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."

prompt = create_prompt(Terms, Attitudes, license_text)

# Stop on either the regular EOS token or the Llama 3 end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    pad_token_id=pipeline.tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.3,
    top_p=0.7,
)

# The pipeline returns prompt + completion; keep only the completion
response = outputs[0]["generated_text"][len(prompt):]

print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
````
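
The prompt asks for JSON key-value pairs, so the raw `response` string usually needs to be parsed before it can be used downstream. The following is a minimal sketch of one way to do that. The helper name `parse_response` is hypothetical, the sample text is made up for illustration and is not real model output, and the sketch assumes the model emits a single flat JSON object (no nesting) somewhere in its reply.

```python
import json
import re

ALLOWED_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_response(response: str, known_terms=None) -> dict:
    """Extract the first flat JSON object from the text and keep valid pairs."""
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return {}

    result = {}
    for term, attitude in data.items():
        if not isinstance(attitude, str):
            continue
        attitude = attitude.strip().upper()
        if attitude not in ALLOWED_ATTITUDES:
            continue  # drop anything outside CAN / CANNOT / MUST
        if known_terms is not None and term not in known_terms:
            continue  # drop terms that are not in the supplied definitions
        result[term] = attitude
    return result


# Made-up reply used only to exercise the parser
sample = 'Result:\n{\n  "Distribute": "CAN",\n  "State Changes": "MUST"\n}\n'
print(parse_response(sample))
```

Passing `known_terms=Terms` filters out any key that is not one of the 27 defined terms, which can be useful because the output format example in the prompt includes a key (`Use`) that is not in the `Terms` dictionary.
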

## Use with vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.

Install vLLM with pip:
```bash
pip install vllm==0.3.1
```

Run the following command to start the vLLM server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --served-model-name llama3-8b \
    --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000
```
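
Once the server is up, a quick check that it is reachable and serving the expected model name can save debugging time. The following is a minimal sketch using only the standard library. It assumes the OpenAI-compatible server exposes a model listing at `/v1/models` on the host and port used above; the helper names `served_model_names` and `list_models` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def served_model_names(payload: dict) -> list:
    """Pull model ids out of an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str = "http://127.0.0.1:8000/v1", timeout: float = 5.0) -> list:
    """Return the served model names, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return served_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return []


if __name__ == "__main__":
    names = list_models()
    print("server ready" if "llama3-8b" in names else "server not ready", names)
```
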

Then you can request the server to identify license terms:
```python
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)


def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
    # Send the prompt to the completions endpoint of the vLLM server
    resp = client.completions.create(
        model=model_type,
        prompt=query,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        seed=42)

    response = resp.choices[0].text
    return response


# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."

# Terms, Attitudes, and create_prompt are defined in the previous section
prompt = create_prompt(Terms, Attitudes, license_text)

response = license_extract(prompt, model_type='llama3-8b',
                           max_tokens=1500, temperature=0.3, top_p=0.7)

print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
```
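
The example above analyzes a single clause. For a longer license, one possible approach is to split the text into chunks, run the extraction on each chunk, and merge the results. The following is a sketch under stated assumptions, not the authors' evaluated pipeline: splitting on blank lines and keeping every attitude seen for a term are illustrative choices, and the names `split_clauses`, `merge_results`, and `analyze_license` are hypothetical. The `extract` argument stands for any callable that maps a clause to a `{term: attitude}` dictionary.

```python
from collections import defaultdict


def split_clauses(license_text: str) -> list:
    """Split a license into paragraph-level chunks on blank lines."""
    return [chunk.strip() for chunk in license_text.split("\n\n") if chunk.strip()]


def merge_results(per_clause: list) -> dict:
    """Merge per-clause {term: attitude} dicts into {term: sorted list of attitudes}."""
    merged = defaultdict(set)
    for result in per_clause:
        for term, attitude in result.items():
            merged[term].add(attitude)
    return {term: sorted(attitudes) for term, attitudes in merged.items()}


def analyze_license(license_text: str, extract) -> dict:
    """Run extract(clause) on every chunk and merge the outcomes."""
    return merge_results([extract(clause) for clause in split_clauses(license_text)])


# Stand-in extractor used only to demonstrate the merge; replace it with a
# function that builds the prompt, queries the model, and parses the reply.
def fake_extract(clause: str) -> dict:
    return {"Distribute": "MUST"} if "provided" in clause else {"Distribute": "CAN"}


print(analyze_license("You may distribute.\n\nProvided that notices are kept, provided.", fake_extract))
```

Keeping a list of attitudes per term makes it visible when different clauses assign different attitudes to the same term, instead of silently overwriting one with another.
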

⚠️ **NOTE**: In repeated tests with identical inference parameters, we found that the model performs better when served with vLLM.