---
license: mit
license_link: https://huggingface.co/microsoft/phi-1/resolve/main/LICENSE
language:
- en
pipeline_tag: text-generation
tags:
- code
---
## Model Summary
The language model Phi-1 is a Transformer with 1.3 billion parameters, specialized for basic Python coding. Its training involved a variety of data sources, including subsets of Python code from [The Stack v1.2](https://huggingface.co/datasets/bigcode/the-stack), Q&A content from [StackOverflow](https://archive.org/download/stackexchange), competition code from [code_contests](https://github.com/deepmind/code_contests), and synthetic Python textbooks and exercises generated by [gpt-3.5-turbo-0301](https://platform.openai.com/docs/models/gpt-3-5). Even though the model and the datasets are relatively small compared to contemporary Large Language Models (LLMs), Phi-1 has demonstrated an accuracy exceeding 50% on the simple Python coding benchmark HumanEval.
## How to Use
Phi-1 has been integrated into `transformers` as of version 4.37.0. Please ensure that you are using that version or a later one.
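The version requirement above can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `release_tuple` and `version_at_least` are hypothetical, and the comparison looks only at the numeric release components, so a pre-release such as `4.37.0.dev0` is treated the same as `4.37.0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.37.0"  # minimum version stated in this model card


def release_tuple(ver: str) -> tuple:
    """Return the leading numeric components of a version string, e.g. (4, 37, 0)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as "dev0"
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed: str, required: str) -> bool:
    """Compare numeric release components, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
elif version_at_least(installed, MIN_TRANSFORMERS):
    print(f"transformers {installed} is new enough")
else:
    print(f"transformers {installed} is older than {MIN_TRANSFORMERS}; please upgrade")
```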
## Intended Uses
Given the nature of the training data, Phi-1 is best suited for prompts using the code format:
### Code Format:
```python
def print_prime(n):
    """
    Print all primes between 1 and n
    """
    for num in range(2, n+1):
        for i in range(2, num):
            if num % i == 0:
                break
        else:
            print(num)
```
where the model generates the code after the docstring. (Note: the `else` clause attached to the inner `for` loop is legitimate Python; it runs only when the loop finishes without hitting `break`.)
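To make the loop `else` behavior concrete, here is a small sketch of the same logic that returns a list instead of printing. The function name `primes_up_to` is hypothetical and not part of the model card's prompt.

```python
def primes_up_to(n: int) -> list:
    """Collect all primes between 2 and n using a for/else loop."""
    primes = []
    for num in range(2, n + 1):
        for i in range(2, num):
            if num % i == 0:
                break  # a divisor was found, so the else clause is skipped
        else:
            # Runs only when the inner loop completed without break,
            # including the case where the inner range is empty (num == 2).
            primes.append(num)
    return primes


print(primes_up_to(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```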
**Notes:**
* Phi-1 is intended for code purposes. The model-generated code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing this model in their applications.
* Direct adoption for production coding tasks is out of the scope of this research project. As a result, Phi-1 has not been tested to ensure that it performs adequately for production-level code. Please refer to the limitation sections of this document for more details.
## Sample Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
inputs = tokenizer('''def print_prime(n):
    """
    Print all primes between 1 and n
    """''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
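With `max_length=200`, the decoded text may continue past the end of the requested function. The sketch below is one heuristic way to keep only the first top-level block; the helper `keep_first_function` is hypothetical, it assumes the end-of-text marker decodes as `<|endoftext|>`, and it will cut too early if a docstring contains unindented lines.

```python
def keep_first_function(generated: str) -> str:
    """Keep the first top-level block of the decoded text and drop what follows."""
    # Assumption: the tokenizer renders its end-of-text token as "<|endoftext|>".
    lines = generated.split("<|endoftext|>")[0].splitlines()
    kept = []
    for index, line in enumerate(lines):
        # A non-empty, non-indented line after the first one starts new top-level code.
        if index > 0 and line.strip() and not line[0].isspace():
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


# Stand-in for `text` from the sample above, so this runs without the model.
sample = 'def f(n):\n    """Add one."""\n    return n + 1\n\nprint(f(2))\n<|endoftext|>'
print(keep_first_function(sample))
```

Applied to the sample above, this would be called as `keep_first_function(text)`.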
## Limitations of Phi-1
* Limited Scope: 99.8% of the Python scripts in our fine-tuning dataset use only the packages "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages, we strongly recommend users manually verify all API uses.
* Replicate Scripts Online: As our model is trained on Python scripts found online, there is a small chance it may replicate such scripts, especially if they appear repetitively across different online sources.
* Generate Inaccurate Code: The model frequently generates incorrect code. We suggest that users view these outputs as a source of inspiration rather than definitive solutions.
* Unreliable Responses to Alternate Formats: Despite appearing to comprehend instructions in formats like Q&A or chat, the model often responds with inaccurate answers, even when seeming confident. Its capabilities with non-code formats are significantly more limited.
* Limitations on Natural Language Comprehension: As a coding bot, Phi-1's main focus is to help with coding-related questions. While it may have some natural language comprehension capabilities, its primary function is not to engage in general conversations or demonstrate common sense like a general AI assistant. Its strength lies in providing assistance and guidance in the context of programming and software development.
* Potential Biases: Phi-1, like other AI models, is trained on web and synthetic data. This data can contain biases and errors that might affect the AI's performance. Biases could stem from various sources like unbalanced representation, stereotypes, or controversial opinions present in the training data. As a result, the model might sometimes generate responses that reflect these biases or errors.
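As a starting point for the manual verification recommended under "Limited Scope", generated code can be scanned for imports outside the six packages listed there. This is a minimal sketch using the standard `ast` module; the function name `unfamiliar_imports` is hypothetical, and flagging an import only marks it for review, it does not say the API use is wrong.

```python
import ast

# Packages named in the "Limited Scope" item of this model card.
COMMON_PACKAGES = {"typing", "math", "random", "collections", "datetime", "itertools"}


def unfamiliar_imports(source: str) -> set:
    """Return top-level package names imported by `source` that are not in COMMON_PACKAGES."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are ignored here.
            found.add(node.module.split(".")[0])
    return found - COMMON_PACKAGES


generated = (
    "import math\n"
    "import requests\n"
    "from collections import Counter\n"
    "from os import path\n"
)
print(sorted(unfamiliar_imports(generated)))  # ['os', 'requests']
```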
## Warning about Security Risks
When leveraging Phi-1, it's paramount to be vigilant. The model, though powerful, can inadvertently introduce security vulnerabilities in the generated code. Examples include, but are not limited to:
* Directory Traversal: The code might fail to implement safe checks against directory traversal attacks, potentially allowing unauthorized access to sensitive files on your system.
* Injection Attacks: There could be lapses in escaping strings properly, making the application susceptible to SQL, OS command, or other injection attacks.
* Misunderstanding Requirements: The model might sometimes misunderstand or oversimplify user requirements, leading to incomplete or insecure solutions.
* Lack of Input Validation: In some cases, the model might neglect to incorporate input validation or sanitize user inputs, opening doors to attacks like Cross-Site Scripting (XSS).
* Insecure Defaults: The model might recommend or generate code with insecure default settings, such as weak password requirements or unencrypted data transmissions.
* Failure in Error Handling: Improper error handling can inadvertently reveal sensitive information about the system or the application's internal workings.
Given these potential pitfalls, and others not explicitly mentioned, it's essential to thoroughly review, test, and verify the generated code before deploying it in any application, especially those that are security-sensitive. Always consult with security experts or perform rigorous penetration testing when in doubt.
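As one illustration of the kind of check a reviewer might look for, the sketch below guards against the directory traversal issue listed above. It is an example written for this card, not output from Phi-1; the function name `resolve_inside` is hypothetical, and a real application would likely need further checks (permissions, symlink policy, allowed file types).

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, raising ValueError if it escapes base_dir."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so the check below still applies.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Generated code that opens files from user-supplied names without an equivalent check deserves a closer look during review.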
## Training
### Model
* Architecture: a Transformer-based model with next-word prediction objective
* Training tokens: 54B tokens (7B unique tokens)
* Precision: fp16
* GPUs: 8 A100
* Training time: 6 days
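The figures above imply some rough derived quantities. The arithmetic below is a back-of-the-envelope sketch under two assumptions not stated in this card: that every unique token was seen equally often, and that all 8 GPUs were in use for the full 6 days.

```python
total_tokens = 54e9   # training tokens listed above
unique_tokens = 7e9   # unique tokens listed above
gpus = 8              # A100 GPUs listed above
days = 6              # training time listed above

# Average number of passes over the unique data (assumes uniform repetition).
passes = total_tokens / unique_tokens

# Upper bound on GPU-hours (assumes all GPUs were busy for the whole period).
gpu_hours = gpus * days * 24

print(f"average passes over unique tokens: {passes:.1f}")  # 7.7
print(f"upper bound on GPU-hours: {gpu_hours}")  # 1152
```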
### Software
* [PyTorch](https://github.com/pytorch/pytorch)
* [DeepSpeed](https://github.com/microsoft/DeepSpeed)
* [Flash-Attention](https://github.com/HazyResearch/flash-attention)
### License
The model is licensed under the [MIT license](https://huggingface.co/microsoft/phi-1/resolve/main/LICENSE).
### Citation
```bib
@article{gunasekar2023textbooks,
title={Textbooks Are All You Need},
author={Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio C{\'e}sar Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and others},
journal={arXiv preprint arXiv:2306.11644},
year={2023}
}
```
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.