AnonymousAuthors committed on
Commit
066ce0b
·
verified ·
1 Parent(s): ba17241

Update README

Files changed (1)
  1. README.md +191 -187
README.md CHANGED
@@ -1,188 +1,192 @@
1
- ---
2
- license: llama3
3
- language:
4
- - en
5
- pipeline_tag: text-generation
6
- tags:
7
- - pytorch
8
- - llama
9
- - llama-3
10
- ---
11
-
12
- # License-Llama3-8B
13
-
14
- ## Introduction
15
- We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We achieved this by first constructing a domain-specific dataset based on 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.
16
-
17
- License-Llama3-8B supports the identification of 27 common license terms and their corresponding three types of attitudes.
18
-
19
- ## Use with transformers
20
- Starting with transformers >= 4.42.4 onward, you can run inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.
21
-
22
- Make sure to update your transformers installation via pip install --upgrade transformers.
23
-
24
- ````python
25
- import transformers
26
- import torch
27
- import json
28
-
29
- # Definition of license terms and attitudes
30
- Terms = {
31
- 'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
32
- 'Add License Terms': 'provide additional license terms',
33
- 'Add Own Notices': 'add own notices in derivative works',
34
- 'Ask Distribution Fee': 'ask a fee to distribute a copy',
35
- 'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
36
- 'Copy': 'reproduce the original work in copies',
37
- 'Distribute': 'distribute original or modified derivative works',
38
- 'Modify': 'modify the software and create derivatives',
39
- 'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
40
- 'Publicly Display': 'display the original work publicly',
41
- 'Publicly Perform': 'perform the original work publicly',
42
- 'Sublicense': 'incorporate the work into something that has a more restrictive license',
43
- 'Commercial Use': 'use the software for commercial purposes',
44
- 'Private Use': 'use or modify the software freely or privately without distributing it',
45
- 'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
46
- 'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
47
- 'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
48
- 'Include License': 'include the full text of license(license copy) in modified software',
49
- 'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
50
- 'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
51
- 'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
52
- 'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
53
- 'Use TradeMark': 'use contributor’s name, trademark or logo',
54
- 'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
55
- 'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
56
- 'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
57
- 'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
58
- }
59
-
60
-
61
- Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
62
- "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
63
- "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}
64
-
65
-
66
- # Create the Prompt
67
- def create_prompt(term_definition, attitude_definition, license_text):
68
-     exm = {
69
-         "Distribute": "CAN",
70
-         "Use": "CAN",
71
-         "Modify": "CANNOT"
72
-     }
73
-
74
-     prompt = f"""### OBJECTIVE
75
- Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.
76
-
77
- ### DEFINITION OF TERMS
78
- {term_definition}
79
-
80
- ### DEFINITION OF ATTITUDES
81
- {attitude_definition}
82
-
83
- ### LICENSE TEXT
84
- {license_text}
85
-
86
- ### RESPONSE
87
- Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.
88
-
89
- ### Output Format Example
90
- ```
91
- {json.dumps(exm, indent=2)}
92
- ```
93
- """
94
-     return prompt
95
-
96
-
97
- # Load model and create a pipeline
98
- model_id = "AnonymousAuthors/License-Llama3-8B"
99
-
100
- pipeline = transformers.pipeline(
101
- "text-generation",
102
- model=model_id,
103
- model_kwargs={"torch_dtype": torch.bfloat16},
104
- device="auto"
105
- )
106
-
107
- # An example of extracting license terms
108
- license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
109
-
110
- prompt = create_prompt(Terms, Attitudes, license_text)
111
-
112
- terminators = [
113
- pipeline.tokenizer.eos_token_id,
114
- pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
115
- ]
116
-
117
- outputs = pipeline(
118
- prompt,
119
- max_new_tokens=512,
120
- eos_token_id=terminators,
121
- pad_token_id=pipeline.tokenizer.eos_token_id,
122
- do_sample=True,
123
- temperature=0.3,
124
- top_p=0.7,
125
- )
126
-
127
- response = outputs[0]["generated_text"][len(prompt):]
128
-
129
- print(f"License Text: {license_text}\n")
130
- print(f"LLM Response: {response}\n")
131
- ````
132
-
133
- ## Use with vLLM
134
- vLLM is a fast and easy-to-use library for LLM inference and serving.
135
- Install vLLM with pip:
136
- ```bash
137
- pip install vllm == 0.3.1
138
- ```
139
-
140
- Run the following command to start the vLLM server:
141
- ```bash
142
- python -m vllm.entrypoints.openai.api_server \
143
- --served-model-name llama3-8b \
144
- --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
145
- --gpu-memory-utilization 0.9 \
146
- --tensor-parallel-size 1 \
147
- --host 0.0.0.0 \
148
- --port 8000
149
- ```
150
-
151
- Then you can request the server to identify license terms:
152
- ```python
153
- from openai import OpenAI
154
-
155
- client = OpenAI(
156
- api_key='EMPTY',
157
- base_url='http://127.0.0.1:8000/v1',
158
- )
159
-
160
-
161
- def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
162
-     resp = client.completions.create(
163
-         model=model_type,
164
-         prompt=query,
165
-         max_tokens=max_tokens,
166
-         temperature=temperature,
167
-         top_p=top_p,
168
-         seed=42)
169
-
170
-     response = resp.choices[0].text
171
-     return response
172
-
173
-
174
- # An example of extracting license terms
175
- license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
176
-
177
- # For the definition of Terms and Attitudes, please refer to the previous section
178
- prompt = create_prompt(Terms, Attitudes, license_text)
179
-
180
- response = license_extract(prompt, model_type='llama3-8b',
181
- max_tokens=1500, temperature=0.3, top_p=0.7)
182
-
183
- print(f"License Text: {license_text}\n")
184
- print(f"LLM Response: {response}\n")
185
- ```
186
-
187
-
 
 
 
 
188
  ⚠️ **NOTE**: In repeated tests with identical inference parameters, the model performed better when inference was run with vLLM.
 
1
+ ---
2
+ license: llama3
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - pytorch
8
+ - llama
9
+ - llama-3
10
+ datasets:
11
+ - AnonymousAuthors/OSS-License-Terms
12
+ base_model:
13
+ - meta-llama/Meta-Llama-3-8B
14
+ ---
15
+
16
+ # License-Llama3-8B
17
+
18
+ ## Introduction
19
+ We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We achieved this by first constructing a domain-specific dataset based on 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.
20
+
21
+ License-Llama3-8B identifies 27 common license terms and assigns each identified term one of three attitudes (CAN, CANNOT, MUST). Experimental results show a precision of 92.63% and a recall of 83.89% for license term identification, and a precision of 90.04% and a recall of 81.55% for the combined task of term and attitude identification.
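The evaluation protocol behind these figures is not described here. As a rough illustration only, the sketch below assumes set-based matching on a single license text: term-level scores compare term names, and combined scores compare (term, attitude) pairs. The helper name `precision_recall` and the sample labels are hypothetical and are not taken from the authors' evaluation.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations for one license text (illustration only).
gold = {
    "Distribute": "CAN",
    "Modify": "CAN",
    "State Changes": "MUST",
    "Include Notice": "MUST",
}
pred = {
    "Distribute": "CAN",
    "Modify": "CANNOT",
    "State Changes": "MUST",
}

# Term identification: only the term names have to match.
term_p, term_r = precision_recall(pred.keys(), gold.keys())

# Combined task: both the term and its attitude have to match.
pair_p, pair_r = precision_recall(pred.items(), gold.items())

print(f"terms: precision={term_p:.2f} recall={term_r:.2f}")
print(f"pairs: precision={pair_p:.2f} recall={pair_r:.2f}")
```

In this toy case the wrong attitude for `Modify` lowers the combined scores but leaves the term-level scores unchanged, which is why the combined figures can never exceed the term-only figures.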
22
+
23
+ ## Use with transformers
24
+ With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.
25
+
26
+ Make sure to update your transformers installation via pip install --upgrade transformers.
27
+
28
+ ````python
29
+ import transformers
30
+ import torch
31
+ import json
32
+
33
+ # Definition of license terms and attitudes
34
+ Terms = {
35
+ 'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
36
+ 'Add License Terms': 'provide additional license terms',
37
+ 'Add Own Notices': 'add own notices in derivative works',
38
+ 'Ask Distribution Fee': 'ask a fee to distribute a copy',
39
+ 'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
40
+ 'Copy': 'reproduce the original work in copies',
41
+ 'Distribute': 'distribute original or modified derivative works',
42
+ 'Modify': 'modify the software and create derivatives',
43
+ 'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
44
+ 'Publicly Display': 'display the original work publicly',
45
+ 'Publicly Perform': 'perform the original work publicly',
46
+ 'Sublicense': 'incorporate the work into something that has a more restrictive license',
47
+ 'Commercial Use': 'use the software for commercial purposes',
48
+ 'Private Use': 'use or modify the software freely or privately without distributing it',
49
+ 'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
50
+ 'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
51
+ 'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
52
+ 'Include License': 'include the full text of license(license copy) in modified software',
53
+ 'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
54
+ 'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
55
+ 'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
56
+ 'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
57
+ 'Use TradeMark': 'use contributor’s name, trademark or logo',
58
+ 'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
59
+ 'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
60
+ 'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
61
+ 'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
62
+ }
63
+
64
+
65
+ Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
66
+ "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
67
+ "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}
68
+
69
+
70
+ # Create the Prompt
71
+ def create_prompt(term_definition, attitude_definition, license_text):
72
+     exm = {
73
+         "Distribute": "CAN",
74
+         "Use": "CAN",
75
+         "Modify": "CANNOT"
76
+     }
77
+
78
+     prompt = f"""### OBJECTIVE
79
+ Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.
80
+
81
+ ### DEFINITION OF TERMS
82
+ {term_definition}
83
+
84
+ ### DEFINITION OF ATTITUDES
85
+ {attitude_definition}
86
+
87
+ ### LICENSE TEXT
88
+ {license_text}
89
+
90
+ ### RESPONSE
91
+ Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.
92
+
93
+ ### Output Format Example
94
+ ```
95
+ {json.dumps(exm, indent=2)}
96
+ ```
97
+ """
98
+     return prompt
99
+
100
+
101
+ # Load model and create a pipeline
102
+ model_id = "AnonymousAuthors/License-Llama3-8B"
103
+
104
+ pipeline = transformers.pipeline(
105
+     "text-generation",
106
+     model=model_id,
107
+     model_kwargs={"torch_dtype": torch.bfloat16},
108
+     device_map="auto"  # `device` does not accept "auto"; `device_map` does (requires accelerate)
109
+ )
110
+
111
+ # An example of extracting license terms
112
+ license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
113
+
114
+ prompt = create_prompt(Terms, Attitudes, license_text)
115
+
116
+ terminators = [
117
+ pipeline.tokenizer.eos_token_id,
118
+ pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
119
+ ]
120
+
121
+ outputs = pipeline(
122
+ prompt,
123
+ max_new_tokens=512,
124
+ eos_token_id=terminators,
125
+ pad_token_id=pipeline.tokenizer.eos_token_id,
126
+ do_sample=True,
127
+ temperature=0.3,
128
+ top_p=0.7,
129
+ )
130
+
131
+ response = outputs[0]["generated_text"][len(prompt):]
132
+
133
+ print(f"License Text: {license_text}\n")
134
+ print(f"LLM Response: {response}\n")
135
+ ````
136
+
137
+ ## Use with vLLM
138
+ vLLM is a fast and easy-to-use library for LLM inference and serving.
139
+ Install vLLM with pip:
140
+ ```bash
141
+ pip install vllm==0.3.1
142
+ ```
143
+
144
+ Run the following command to start the vLLM server:
145
+ ```bash
146
+ python -m vllm.entrypoints.openai.api_server \
147
+ --served-model-name llama3-8b \
148
+ --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
149
+ --gpu-memory-utilization 0.9 \
150
+ --tensor-parallel-size 1 \
151
+ --host 0.0.0.0 \
152
+ --port 8000
153
+ ```
154
+
155
+ Then you can request the server to identify license terms:
156
+ ```python
157
+ from openai import OpenAI
158
+
159
+ client = OpenAI(
160
+ api_key='EMPTY',
161
+ base_url='http://127.0.0.1:8000/v1',
162
+ )
163
+
164
+
165
+ def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
166
+     resp = client.completions.create(
167
+         model=model_type,
168
+         prompt=query,
169
+         max_tokens=max_tokens,
170
+         temperature=temperature,
171
+         top_p=top_p,
172
+         seed=42)
173
+
174
+     response = resp.choices[0].text
175
+     return response
176
+
177
+
178
+ # An example of extracting license terms
179
+ license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
180
+
181
+ # For the definition of Terms and Attitudes, please refer to the previous section
182
+ prompt = create_prompt(Terms, Attitudes, license_text)
183
+
184
+ response = license_extract(prompt, model_type='llama3-8b',
185
+ max_tokens=1500, temperature=0.3, top_p=0.7)
186
+
187
+ print(f"License Text: {license_text}\n")
188
+ print(f"LLM Response: {response}\n")
189
+ ```
190
+
191
+
192
  ⚠️ **NOTE**: In repeated tests with identical inference parameters, the model performed better when inference was run with vLLM.
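
Both usage examples print the raw model output, while the prompt asks for JSON key-value pairs. A minimal sketch of turning that text into a validated dictionary might look like the following. The names `parse_response` and `ALLOWED_ATTITUDES` are hypothetical helpers, not part of the README, and the sketch assumes the model returns a single flat JSON object, possibly surrounded by other text.

```python
import json
import re

# The three attitude names defined in the README's `Attitudes` dictionary.
ALLOWED_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_response(response, known_terms):
    """Extract the first flat JSON object from the model output and validate it.

    Returns (valid, rejected): `valid` maps known term names to allowed
    attitudes; `rejected` holds every other key-value pair for inspection.
    """
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}, {}
    try:
        pairs = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}, {}

    valid, rejected = {}, {}
    for term, attitude in pairs.items():
        if (term in known_terms
                and isinstance(attitude, str)
                and attitude in ALLOWED_ATTITUDES):
            valid[term] = attitude
        else:
            rejected[term] = attitude
    return valid, rejected


# Made-up model output for illustration; real output may differ.
sample = (
    "Result:\n"
    '{\n  "Distribute": "CAN",\n  "State Changes": "MUST",\n  "Use": "CAN"\n}\n'
)
valid, rejected = parse_response(sample, {"Distribute", "State Changes", "Modify"})
print(valid)
print(rejected)
```

In practice `known_terms` would be `Terms.keys()` from the usage example. Keeping rejected pairs separate, rather than silently dropping them, makes it easier to spot outputs such as `Use`, which appears in the prompt's format example but is not one of the 27 defined terms.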