AnonymousAuthors
/

License-Llama3-8B

@@ -1,188 +1,192 @@
----
-license: llama3
-language:
-- en
-pipeline_tag: text-generation
-tags:
-- pytorch
-- llama
-- llama-3
----
-# License-Llama3-8B
-## Introduction
-We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We achieved this by first constructing a domain-specific dataset based on 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.
-License-Llama3-8B supports the identification of 27 common license terms and their corresponding three types of attitudes.
-## Use with transformers
-Starting with transformers >= 4.42.4 onward, you can run inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.
-Make sure to update your transformers installation via pip install --upgrade transformers.
-````python
-import transformers
-import torch
-import json
-# Definition of license terms and attitudes
-Terms = {
-	'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
-	'Add License Terms': 'provide additional license terms',
-	'Add Own Notices': 'add own notices in derivative works',
-	'Ask Distribution Fee': 'ask a fee to distribute a copy',
-	'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
-	'Copy': 'reproduce the original work in copies',
-	'Distribute': 'distribute original or modified derivative works',
-	'Modify': 'modify the software and create derivatives',
-	'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
-	'Publicly Display': 'display the original work publicly',
-	'Publicly Perform': 'perform the original work publicly',
-	'Sublicense': 'incorporate the work into something that has a more restrictive license',
-	'Commercial Use': 'use the software for commercial purposes',
-	'Private Use': 'use or modify the software freely or privately without distributing it',
-	'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
-	'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
-	'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
-	'Include License': 'include the full text of license(license copy) in modified software',
-	'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
-	'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
-	'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
-	'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
-	'Use TradeMark': 'use contributor’s name, trademark or logo',
-	'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
-	'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
-	'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
-	'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
-}
-Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
-             "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
-             "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}
-# Create the Prompt
-def create_prompt(term_definition, attitude_definition, license_text):
-    exm = {
-    "Distribute": "CAN",
-    "Use": "CAN",
-    "Modify": "CANNOT"
-    }
-    prompt = f"""### OBJECTIVE
-Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.
-### DEFINITION OF TERMS
-{term_definition}
-### DEFINITION OF ATTITUDES
-{attitude_definition}
-### LICENSE TEXT
-{license_text}
-### RESPONSE
-Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.
-### Output Format Example
-```
-{json.dumps(exm, indent=2)}
-```
-"""
-    return prompt
-# Load model and create a pipeline
-model_id = "AnonymousAuthors/License-Llama3-8B"
-pipeline = transformers.pipeline(
-    "text-generation",
-    model=model_id,
-    model_kwargs={"torch_dtype": torch.bfloat16},
-    device="auto"
-)
-# An example of extracting license terms
-license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
-prompt = create_prompt(Terms, Attitudes, license_text)
-terminators = [
-    pipeline.tokenizer.eos_token_id,
-    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
-]
-outputs = pipeline(
-    prompt,
-    max_new_tokens=512,
-    eos_token_id=terminators,
-    pad_token_id=pipeline.tokenizer.eos_token_id,
-    do_sample=True,
-    temperature=0.3,
-    top_p=0.7,
-)
-response = outputs[0]["generated_text"][len(prompt):]
-print(f"License Text: {license_text}\n")
-print(f"LLM Response: {response}\n")
-````
-## Use with vLLM
-vLLM is a fast and easy-to-use library for LLM inference and serving.
-Install vLLM with pip:
-```bash
-pip install vllm == 0.3.1
-```
-Run the following command to start the vLLM server:
-```bash
-python -m vllm.entrypoints.openai.api_server \
-       --served-model-name llama3-8b \
-       --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
-       --gpu-memory-utilization 0.9 \
-       --tensor-parallel-size 1 \
-       --host 0.0.0.0 \
-       --port 8000
-```
-Then you can request the server to identify license terms:
-```python
-from openai import OpenAI
-client = OpenAI(
-    api_key='EMPTY',
-    base_url='http://127.0.0.1:8000/v1',
-)
-def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
-    resp = client.completions.create(
-        model=model_type,
-        prompt=query,
-        max_tokens=max_tokens,
-        temperature=temperature,
-        top_p=top_p,
-        seed=42)
-    response = resp.choices[0].text
-    return response
-# An example of extracting license terms
-license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
-# For the definition of Terms and Attitudes, please refer to the previous section
-prompt = create_prompt(Terms, Attitudes, license_text)
-response = license_extract(prompt, model_type='llama3-8b',
-                            max_tokens=1500, temperature=0.3, top_p=0.7)
-print(f"License Text: {license_text}\n")
-print(f"LLM Response: {response}\n")
-```
 ⚠️ **NOTE**: In repeated tests with the same inference parameters, the model performed better when run with vLLM.

+---
+license: llama3
+language:
+- en
+pipeline_tag: text-generation
+tags:
+- pytorch
+- llama
+- llama-3
+datasets:
+- AnonymousAuthors/OSS-License-Terms
+base_model:
+- meta-llama/Meta-Llama-3-8B
+---
+# License-Llama3-8B
+## Introduction
+We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We achieved this by first constructing a domain-specific dataset based on 3,238 OSS licenses, and then performing domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.
+License-Llama3-8B supports the identification of 27 common license terms and their corresponding three types of attitudes. Experimental results show that License-Llama3-8B achieves a precision of 92.63% and a recall of 83.89% in license term identification. In the combined task of term and attitude identification, it achieves a precision of 90.04% and a recall of 81.55%.
+## Use with transformers
+With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.
+Make sure to update your transformers installation via `pip install --upgrade transformers`.
+````python
+import transformers
+import torch
+import json
+# Definition of license terms and attitudes
+Terms = {
+	'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
+	'Add License Terms': 'provide additional license terms',
+	'Add Own Notices': 'add own notices in derivative works',
+	'Ask Distribution Fee': 'ask a fee to distribute a copy',
+	'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
+	'Copy': 'reproduce the original work in copies',
+	'Distribute': 'distribute original or modified derivative works',
+	'Modify': 'modify the software and create derivatives',
+	'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
+	'Publicly Display': 'display the original work publicly',
+	'Publicly Perform': 'perform the original work publicly',
+	'Sublicense': 'incorporate the work into something that has a more restrictive license',
+	'Commercial Use': 'use the software for commercial purposes',
+	'Private Use': 'use or modify the software freely or privately without distributing it',
+	'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
+	'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
+	'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
+	'Include License': 'include the full text of license(license copy) in modified software',
+	'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
+	'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
+	'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
+	'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
+	'Use TradeMark': 'use contributor’s name, trademark or logo',
+	'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
+	'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
+	'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
+	'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
+}
+Attitudes = {"CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
+             "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
+             "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"}
+# Create the Prompt
+def create_prompt(term_definition, attitude_definition, license_text):
+    exm = {
+    "Distribute": "CAN",
+    "Use": "CAN",
+    "Modify": "CANNOT"
+    }
+    prompt = f"""### OBJECTIVE
+Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.
+### DEFINITION OF TERMS
+{term_definition}
+### DEFINITION OF ATTITUDES
+{attitude_definition}
+### LICENSE TEXT
+{license_text}
+### RESPONSE
+Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.
+### Output Format Example
+```
+{json.dumps(exm, indent=2)}
+```
+"""
+    return prompt
+# Load model and create a pipeline
+model_id = "AnonymousAuthors/License-Llama3-8B"
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=model_id,
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device_map="auto"
+)
+# An example of extracting license terms
+license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
+prompt = create_prompt(Terms, Attitudes, license_text)
+terminators = [
+    pipeline.tokenizer.eos_token_id,
+    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
+]
+outputs = pipeline(
+    prompt,
+    max_new_tokens=512,
+    eos_token_id=terminators,
+    pad_token_id=pipeline.tokenizer.eos_token_id,
+    do_sample=True,
+    temperature=0.3,
+    top_p=0.7,
+)
+response = outputs[0]["generated_text"][len(prompt):]
+print(f"License Text: {license_text}\n")
+print(f"LLM Response: {response}\n")
+````
+## Use with vLLM
+vLLM is a fast and easy-to-use library for LLM inference and serving.
+Install vLLM with pip:
+```bash
+pip install vllm==0.3.1
+```
+Run the following command to start the vLLM server:
+```bash
+python -m vllm.entrypoints.openai.api_server \
+       --served-model-name llama3-8b \
+       --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
+       --gpu-memory-utilization 0.9 \
+       --tensor-parallel-size 1 \
+       --host 0.0.0.0 \
+       --port 8000
+```
+Then you can request the server to identify license terms:
+```python
+from openai import OpenAI
+client = OpenAI(
+    api_key='EMPTY',
+    base_url='http://127.0.0.1:8000/v1',
+)
+def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
+    resp = client.completions.create(
+        model=model_type,
+        prompt=query,
+        max_tokens=max_tokens,
+        temperature=temperature,
+        top_p=top_p,
+        seed=42)
+    response = resp.choices[0].text
+    return response
+# An example of extracting license terms
+license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
+# For the definition of Terms and Attitudes, please refer to the previous section
+prompt = create_prompt(Terms, Attitudes, license_text)
+response = license_extract(prompt, model_type='llama3-8b',
+                            max_tokens=1500, temperature=0.3, top_p=0.7)
+print(f"License Text: {license_text}\n")
+print(f"LLM Response: {response}\n")
+```
 ⚠️ **NOTE**: In repeated tests with the same inference parameters, the model performed better when run with vLLM.
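
Both examples above print the raw model response. Since the prompt asks for JSON key-value pairs, a small post-processing step can turn that text into a dictionary. The following is a minimal sketch, not part of the model card: `parse_response` and `VALID_ATTITUDES` are hypothetical names, and it assumes the response contains a single flat JSON object, possibly surrounded by other text.

```python
import json
import re

# Assumption: the three attitude names defined in the Attitudes dictionary above.
VALID_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_response(response, valid_terms=None):
    """Extract the first flat JSON object from a response and keep valid pairs.

    Hypothetical helper: returns {} when no parsable object is found.
    """
    # Non-greedy match from the first "{" to the first "}" (flat objects only).
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}

    result = {}
    for term, attitude in parsed.items():
        # Drop values that are not one of the three known attitudes.
        if not isinstance(attitude, str) or attitude not in VALID_ATTITUDES:
            continue
        # Optionally drop keys that are not known term names.
        if valid_terms is not None and term not in valid_terms:
            continue
        result[term] = attitude
    return result


# Made-up response text used only to exercise the parser.
sample = 'Result:\n{\n  "Distribute": "CAN",\n  "State Changes": "MUST",\n  "Foo": "MAYBE"\n}\nDone.'
print(parse_response(sample))
```

Passing `valid_terms=set(Terms)` would additionally discard any key that is not one of the 27 defined term names.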