---
language:
  - "eng"
thumbnail: "url to a thumbnail used in social sharing"
tags:
  - "pytorch"
  - "tensorflow"
license: "apache-2.0"
---

# vBERT-2021-BASE

### Model Info:

<ul>
<li> Authors: R&D AI Lab, VMware Inc.
<li> Model date: April 2022
<li> Model version: 2021-base
<li> Model type: Pretrained language model
<li> License: Apache 2.0
</ul>

#### Motivation

Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words; see <a href="https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99">Weaknesses of WordPiece Tokenization</a>.

We pretrained our vBERT model to address these issues using our <a href="https://medium.com/vmware-data-ml-blog/pretraining-a-custom-bert-model-6e37df97dfc4">BERT Pretraining Library</a>.

To create a modified vocabulary, we replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms. We then pretrained the 'bert-base-uncased' model for an additional 78k steps (71k with MSL_128 and 7k with MSL_512, where MSL is the maximum sequence length), approximately 5 epochs, on VMware domain data.

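The vocabulary change described above can be sketched in a few lines. This is a minimal illustration, not the authors' pretraining library: the function name `replace_unused_tokens`, its parameters, and the toy vocabulary are hypothetical. It assumes the vocabulary is a list of tokens in ID order and that placeholder tokens follow BERT's `[unusedN]` naming.

```python
import re

# BERT vocabularies reserve placeholder entries named [unused0], [unused1], ...
UNUSED = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab, new_terms, max_replacements=1000):
    """Return a copy of `vocab` whose first [unusedN] slots hold `new_terms`.

    Hypothetical helper for illustration only. Terms already present in the
    vocabulary (and duplicate terms) are skipped, and at most
    `max_replacements` slots are overwritten.
    """
    seen = set(vocab)
    pending = []
    for term in new_terms:
        if term not in seen:
            seen.add(term)
            pending.append(term)
    pending = pending[:max_replacements]

    updated = list(vocab)
    slot = 0
    for index, token in enumerate(updated):
        if slot == len(pending):
            break  # every new term has been placed
        if UNUSED.match(token):
            updated[index] = pending[slot]
            slot += 1
    return updated


# Toy vocabulary; a real BERT vocab.txt has roughly 30k entries.
toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "the", "[unused2]"]
print(replace_unused_tokens(toy_vocab, ["tanzu", "the", "vsphere"]))
# ['[PAD]', 'tanzu', 'vsphere', '[UNK]', 'the', '[unused2]']
```

Because each new term takes over an existing slot, token IDs and the vocabulary size stay the same, so the pretrained embedding matrix keeps its shape and only the reused rows have to be learned during the additional pretraining.
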
#### Intended Use

The model functions as a VMware-specific language model.

#### How to Use

Here is how to use this model to get the features of a given text in PyTorch:

```
from transformers import BertTokenizer, BertModel

# Load the vBERT tokenizer (with the modified vocabulary) and the encoder
tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-base')
model = BertModel.from_pretrained('VMware/vbert-2021-base')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')  # PyTorch tensors
output = model(**encoded_input)  # output.last_hidden_state holds the token features
```

and in TensorFlow:

```
from transformers import BertTokenizer, TFBertModel

# Same tokenizer, TensorFlow version of the encoder
tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-base')
model = TFBertModel.from_pretrained('VMware/vbert-2021-base')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')  # TensorFlow tensors
output = model(encoded_input)  # output.last_hidden_state holds the token features
```

### Training

#### - Datasets

Publicly available VMware text data, such as VMware Docs and blogs, was used to create the pretraining corpus (~320,000 documents). The data was sourced in May 2021.

#### - Preprocessing

<ul>
<li>Decoding HTML
<li>Decoding Unicode
<li>Stripping repeated characters
<li>Splitting compound words
<li>Spelling correction
</ul>

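A rough approximation of the first three steps can be written with the Python standard library. The vNLP Preprocessor itself is not public, so everything here is an assumption: the function name `rough_preprocess` is hypothetical, "decoding Unicode" is interpreted as NFKC normalization, and runs of three or more identical letters are collapsed to two. Compound-word splitting and spelling correction are left out because they depend on domain dictionaries.

```python
import html
import re
import unicodedata

# Three or more repeats of the same letter (digits and punctuation are left
# alone so that values such as "1000" are not altered).
_REPEATED_LETTERS = re.compile(r"([^\W\d_])\1{2,}")


def rough_preprocess(text: str) -> str:
    """Approximate the first three listed steps (assumed behavior, not vNLP)."""
    text = html.unescape(text)                   # decode HTML entities
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = _REPEATED_LETTERS.sub(r"\1\1", text)  # "sooooo" -> "soo"
    return " ".join(text.split())                # collapse whitespace


print(rough_preprocess("vSphere &amp; Tanzu are sooooo   good"))
# vSphere & Tanzu are soo good
```

Collapsing repeated letters this way also shortens legitimate strings such as "www", so a production pipeline would likely need an allow-list for such cases.
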
#### - Model performance measures

We benchmarked vBERT on various VMware-specific downstream NLP tasks (information retrieval, classification, etc.).

The model scored higher than the 'bert-base-uncased' model on all benchmarks.

### Limitations and bias

Since the model is further pretrained from the original BERT model, it may carry the same biases as that model.

To maximize performance, input data needs to be preprocessed with our internal vNLP Preprocessor, which is not available to the public.