Spaces:
Sleeping
Sleeping
ask-arthur
/
files
/arthur-docs-markdown
/user-guide
/walkthroughs
/model-onboarding
/nlp_onboarding.md.txt
# NLP Onboarding | |
This page walks through the basics of setting up a natural language processing (NLP) model and onboarding it to the | |
Arthur system to monitor language-specific performance. | |
## Getting Started | |
The first step is to import functions from the `arthurai` package and establish a connection with Arthur. | |
```python | |
# Arthur imports | |
from arthurai import ArthurAI | |
from arthurai.common.constants import InputType, OutputType, Stage | |
arthur = ArthurAI(url="https://app.arthur.ai", | |
login="<YOUR_USERNAME_OR_EMAIL>") | |
``` | |
## Registering an NLP Model | |
Each NLP model is created with a name and with `input_type = InputType.NLP`. Here, we register a classification model | |
on text specifying a `text_delimiter` of `NOT_WORD`: | |
```python | |
arthur_nlp_model = arthur.model(name="NLPQuickstart", | |
input_type=InputType.NLP, | |
model_type=OutputType.Multiclass, | |
text_delimiter=TextDelimiter.NOT_WORD) | |
``` | |
The different `OutputType` values currently supported for NLP models are classification, multi-labeling, and regression. | |
### Text Delimiter | |
NLP models also require specifying a `text_delimiter`, which specifies how a raw document is split into tokens. | |
A common choice for `text_delimiter` will be `TextDelimiter.NOT_WORD`, since that will ignore punctuation and tokenize | |
text based only on the words present. However, if punctuation and non-word text needs to be considered by your model, | |
you should consider using other options for a delimiter to ensure those other pieces of text are processed by your NLP | |
model. | |
For a full list of available text delimiters with examples, see the | |
[TextDelimiter constant documentation](https://docs.arthur.ai/sdk/sdk_v3/apiref/arthurai.common.constants.TextDelimiter.html) | |
in our SDK reference. | |
## Formatting Reference/Inference Data | |
Column names can contain only alphanumeric and underscore characters. The rest of the string values can have | |
additional characters as raw text. | |
```python | |
text_attr pred_value ground_truth non_input_1 | |
0 'Here-is some text' 0.1 0 0.2 | |
1 'saying a whole lot' 0.05 0 -0.3 | |
2 'of important things!' 0.02 1 0.7 | |
3 'With all kinds of chars?!' 0.2 0 0.1 ... | |
4 'But attribute/column names' 0.6 1 -0.6 | |
5 'can only use underscore.' 0.9 1 -0.9 | |
... | |
``` | |
### Reviewing the Model Schema | |
Before you register your model with Arthur by calling `arthur_model.save()`, you can call `arthur_model.review()` the | |
model schema to check that your data is parsed correctly. | |
For an NLP model, the model schema should look like this: | |
```python | |
name stage value_type categorical is_unique | |
0 text_attr PIPELINE_INPUT UNSTRUCTURED_TEXT False True | |
1 pred_value PREDICTED_VALUE FLOAT False False ... | |
2 ground_truth GROUND_TRUTH INTEGER True False | |
3 non_input_1 NON_INPUT_DATA FLOAT False False | |
... | |
``` | |
## Finishing Onboarding | |
Once you have finished formatting your reference data and your model schema looks correct using `arthur_model.review()`, | |
you are finished registering your model and its attributes - so you are ready to complete onboarding your model. | |
To finish onboarding your NLP model, the following steps apply, which is the same for NLP models as it is for models | |
of any `InputType` and `OutputType`: | |
```{include} finishing_onboarding.md | |
``` | |
## Enrichments | |
For an overview of configuring enrichments for NLP models, see the {doc}`/user-guide/walkthroughs/enrichments` guide. | |
For a step-by-step walkthrough of setting up the explainability Enrichment for NLP models, see | |
{ref}`nlp_explainability`. | |