#### Table of contents

1. [Introduction](#introduction)
2. [Using PhoBERT with `transformers`](#transformers)
    - [Installation](#install2)
    - [Pre-trained models](#models2)
    - [Example usage](#usage2)
3. [Using PhoBERT with `fairseq`](#fairseq)
4. [Notes](#vncorenlp)

# <a name="introduction"></a> PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam):

- The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our [paper](https://www.aclweb.org/anthology/2020.findings-emnlp.92/):

```
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```

**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.
## <a name="transformers"></a> Using PhoBERT with `transformers`

### Installation <a name="install2"></a>
- Install `transformers` with pip: `pip install transformers`, or [install `transformers` from source](https://huggingface.co/docs/transformers/installation#installing-from-source). <br />
Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in [this pull request](https://github.com/huggingface/transformers/pull/17254#issuecomment-1133932067). If you would like to use the fast tokenizer, you can install `transformers` as follows:

```
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
```

- Install `tokenizers` with pip: `pip3 install tokenizers`
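
To check which tokenizer implementation ended up active (a quick sanity check, not part of the original instructions), `transformers` exposes an `is_fast` attribute on loaded tokenizers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
# True only when a Rust-backed fast tokenizer is available,
# e.g. after installing from the branch above; False with the slow tokenizer
print(tokenizer.is_fast)
```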

### Pre-trained models <a name="models2"></a>

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`vinai/phobert-base` | 135M | base | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-large` | 370M | large | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-base-v2` | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301
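
All three checkpoints share a maximum input length of 256 subword tokens (including special tokens), so over-length inputs should be truncated; a minimal sketch, assuming a standard `transformers` install:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# Truncate any over-length, word-segmented input to the 256-token limit
encoding = tokenizer("Chúng_tôi là những nghiên_cứu_viên .",
                     truncation=True, max_length=256)
print(len(encoding["input_ids"]))  # at most 256
```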

### Example usage <a name="usage2"></a>

```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextualized token embeddings

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
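
The per-token features can be pooled into a single sentence vector in several ways; a mean-pooling sketch (an illustration, not part of the original example), continuing from the code above:

```python
# Mean-pool the token embeddings into one sentence vector;
# the hidden size is 768 for the base models and 1024 for large
sentence_embedding = features.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```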

## <a name="fairseq"></a> Using PhoBERT with `fairseq`

Please see details at [HERE](https://github.com/VinAIResearch/PhoBERT/blob/master/README_fairseq.md)!

## <a name="vncorenlp"></a> Notes

In case the input texts are raw, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before they are fed to PhoBERT. As PhoBERT employed the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process the pre-training data (including [Vietnamese tone normalization](https://github.com/VinAIResearch/BARTpho/blob/main/VietnameseToneNormalization.md) and word and sentence segmentation), it is recommended to use the same word segmenter on the raw input texts of PhoBERT-based downstream applications.

#### Installation

```
pip install py_vncorenlp
```

#### Example usage <a name="example"></a>

```python
import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in a local folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
```
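
Putting the two sections together, a minimal end-to-end sketch (assuming `torch`, `phobert` and `tokenizer` are loaded as in the `transformers` example above) that segments raw text and then extracts PhoBERT features per sentence:

```python
# Segment raw text into word-segmented sentences, then encode each with PhoBERT
for segmented_sentence in rdrsegmenter.word_segment(text):
    input_ids = torch.tensor([tokenizer.encode(segmented_sentence)])
    with torch.no_grad():
        features = phobert(input_ids)
    print(segmented_sentence, features.last_hidden_state.shape)
```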

## License

MIT License

Copyright (c) 2020 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.