---
license: llama3
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- llama
- llama-3
datasets:
- AnonymousAuthors/OSS-License-Terms
base_model:
- meta-llama/Meta-Llama-3-8B
---
# License-Llama3-8B
## Introduction
We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We built it by first constructing a domain-specific dataset from 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model. [Our work has been accepted for publication in the proceedings of the 47th International Conference on Software Engineering (ICSE 2025).](https://conf.researchr.org/details/icse-2025/icse-2025-industry-challenge-track/6/Exploring-Large-Language-Models-for-Analyzing-Open-Source-License-Conflicts-How-Far-)
License-Llama3-8B identifies 27 common license terms and assigns each one of three attitudes (CAN, CANNOT, MUST). In our experiments, License-Llama3-8B achieves a precision of 92.63% and a recall of 83.89% in license term identification. In the combined task of term and attitude identification, it achieves a precision of 90.04% and a recall of 81.55%.
## Use with transformers
With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the `generate()` function.
Make sure to update your transformers installation via `pip install --upgrade transformers`.
````python
import transformers
import torch
import json
# Definition of license terms and attitudes
Terms = {
    'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
    'Add License Terms': 'provide additional license terms',
    'Add Own Notices': 'add own notices in derivative works',
    'Ask Distribution Fee': 'ask a fee to distribute a copy',
    'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
    'Copy': 'reproduce the original work in copies',
    'Distribute': 'distribute original or modified derivative works',
    'Modify': 'modify the software and create derivatives',
    'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
    'Publicly Display': 'display the original work publicly',
    'Publicly Perform': 'perform the original work publicly',
    'Sublicense': 'incorporate the work into something that has a more restrictive license',
    'Commercial Use': 'use the software for commercial purposes',
    'Private Use': 'use or modify the software freely or privately without distributing it',
    'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
    'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
    'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
    'Include License': 'include the full text of license(license copy) in modified software',
    'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
    'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
    'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
    'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
    'Use TradeMark': 'use contributor’s name, trademark or logo',
    'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
    'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
    'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
    'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
}
Attitudes = {
    "CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
    "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
    "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"
}
# Create the Prompt
def create_prompt(term_definition, attitude_definition, license_text):
    # Example of the expected output format, embedded in the prompt below
    exm = {
        "Distribute": "CAN",
        "Use": "CAN",
        "Modify": "CANNOT"
    }
    prompt = f"""### OBJECTIVE
Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.
### DEFINITION OF TERMS
{term_definition}
### DEFINITION OF ATTITUDES
{attitude_definition}
### LICENSE TEXT
{license_text}
### RESPONSE
Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.
### Output Format Example
```
{json.dumps(exm, indent=2)}
```
"""
    return prompt
# Load model and create a pipeline
model_id = "AnonymousAuthors/License-Llama3-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"  # "auto" is a device_map value, not a valid device
)
# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
prompt = create_prompt(Terms, Attitudes, license_text)
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    pad_token_id=pipeline.tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.3,
    top_p=0.7,
)
response = outputs[0]["generated_text"][len(prompt):]
print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
````
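Because the prompt asks for JSON key-value pairs, the raw response usually needs a small post-processing step before it can be used downstream. The following is a minimal sketch, not part of the released code: the helper name `parse_license_response` and its `known_terms` parameter are hypothetical, and it assumes the model returns a single flat JSON object, possibly surrounded by a code fence or extra text.

```python
import json
import re

# The three attitude labels defined in this card.
VALID_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_license_response(response, known_terms=None):
    """Extract the first flat JSON object from a response and validate it.

    Returns a dict mapping term name -> attitude. Entries whose attitude is
    not CAN/CANNOT/MUST are dropped, as are terms outside known_terms when
    that argument is given. Returns {} if no JSON object can be decoded.
    """
    # Non-greedy match: grabs the first {...} span (assumes no nested objects).
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    result = {}
    for term, attitude in raw.items():
        if attitude not in VALID_ATTITUDES:
            continue
        if known_terms is not None and term not in known_terms:
            continue
        result[term] = attitude
    return result
```

With the snippet above, `parse_license_response(response, known_terms=Terms.keys())` would keep only the pairs whose term appears in `Terms` and whose attitude is one of the three defined labels.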
## Use with vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
Install vLLM with pip:
```bash
pip install vllm==0.3.1
```
Run the following command to start the vLLM server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --served-model-name llama3-8b \
    --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000
```
```
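Loading the model can take a while, so it may help to confirm that the server is up before sending requests. The sketch below is an optional check, not part of the original instructions: it queries the OpenAI-compatible `/v1/models` route and looks for the served model name. The helper names `served_model_ids` and `server_is_ready` are hypothetical, and the host, port, and model name are assumed to match the command above.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Return the model ids listed in an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def server_is_ready(base_url="http://127.0.0.1:8000/v1",
                    expected="llama3-8b", timeout=5.0):
    """Return True if the server responds and lists the expected model name."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as not ready.
        return False
    return expected in served_model_ids(payload)
```

Calling `server_is_ready()` in a retry loop with a short sleep is one way to wait for startup before issuing the first completion request.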
Then you can request the server to identify license terms:
```python
from openai import OpenAI
client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)
def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
    # Send the prompt to the vLLM completions endpoint and return the generated text
    resp = client.completions.create(
        model=model_type,
        prompt=query,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        seed=42)
    response = resp.choices[0].text
    return response
# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
# For the definition of Terms and Attitudes, please refer to the previous section
prompt = create_prompt(Terms, Attitudes, license_text)
response = license_extract(prompt, model_type='llama3-8b',
                           max_tokens=1500, temperature=0.3, top_p=0.7)
print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
```
⚠️ **NOTE**: In our repeated tests, the model performed better when served with vLLM under the same inference parameters.