	# NLP Onboarding

	This page walks through the basics of setting up a natural language processing (NLP) model and onboarding it to the
	Arthur system to monitor language-specific performance.

	## Getting Started

	The first step is to import the required classes and constants from the `arthurai` package and establish a connection with Arthur.

	```python
	# Arthur imports
	from arthurai import ArthurAI
	from arthurai.common.constants import InputType, OutputType, Stage, TextDelimiter

	# Connect to Arthur; replace the placeholder with your own username or email
	arthur = ArthurAI(url="https://app.arthur.ai",
	                  login="<YOUR_USERNAME_OR_EMAIL>")
	```

	## Registering an NLP Model

	Each NLP model is created with a name and with `input_type = InputType.NLP`. Here, we register a multiclass text
	classification model with a `text_delimiter` of `TextDelimiter.NOT_WORD`:

	```python
	arthur_nlp_model = arthur.model(name="NLPQuickstart",
	                                input_type=InputType.NLP,
	                                model_type=OutputType.Multiclass,
	                                text_delimiter=TextDelimiter.NOT_WORD)
	```

	The `OutputType` values currently supported for NLP models cover classification, multi-label classification, and regression.

	### Text Delimiter

	NLP models also require a `text_delimiter`, which determines how a raw document is split into tokens.

	A common choice for `text_delimiter` is `TextDelimiter.NOT_WORD`, which ignores punctuation and tokenizes text based
	only on the words present. However, if your model needs to consider punctuation and other non-word text, choose a
	different delimiter so that those pieces of text are also processed.
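To make the effect of the delimiter concrete, here is a minimal sketch using Python's standard `re` module. It is not Arthur's implementation: the two patterns and the `tokenize` helper are hypothetical, chosen only to illustrate how splitting on non-word characters differs from splitting on whitespace.

```python
import re
from typing import List

# Hypothetical patterns for illustration only; they are not taken from the
# Arthur SDK. NOT_WORD_PATTERN splits on runs of non-word characters,
# WHITESPACE_PATTERN splits on runs of whitespace.
NOT_WORD_PATTERN = r"\W+"
WHITESPACE_PATTERN = r"\s+"


def tokenize(document: str, delimiter_pattern: str) -> List[str]:
    """Split a document on a regex delimiter and drop empty tokens."""
    return [token for token in re.split(delimiter_pattern, document) if token]


document = "With all kinds of chars?!"

# Splitting on non-word characters discards the punctuation.
print(tokenize(document, NOT_WORD_PATTERN))
# ['With', 'all', 'kinds', 'of', 'chars']

# Splitting on whitespace keeps punctuation attached to the neighboring word.
print(tokenize(document, WHITESPACE_PATTERN))
# ['With', 'all', 'kinds', 'of', 'chars?!']
```

If your model treats `chars?!` differently from `chars`, the second behavior is likely closer to what you want, and a delimiter other than `NOT_WORD` may be the better fit.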

	For a full list of available text delimiters with examples, see the
	[TextDelimiter constant documentation](https://docs.arthur.ai/sdk/sdk_v3/apiref/arthurai.common.constants.TextDelimiter.html)
	in our SDK reference.

	## Formatting Reference/Inference Data

	Column names can contain only alphanumeric and underscore characters. The string values in the text column, by
	contrast, are treated as raw text and can contain any other characters.

	```python
	   text_attr                     pred_value  ground_truth  non_input_1
	0  'Here-is some text'           0.1         0             0.2
	1  'saying a whole lot'          0.05        0             -0.3
	2  'of important things!'        0.02        1             0.7
	3  'With all kinds of chars?!'   0.2         0             0.1          ...
	4  'But attribute/column names'  0.6         1             -0.6
	5  'can only use underscore.'    0.9         1             -0.9
	...
	```
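One way to catch naming problems before handing data to Arthur is to check the column names yourself. The sketch below uses only the standard library. The `invalid_column_names` helper is hypothetical (it is not part of the `arthurai` package), and it assumes "alphanumeric" means ASCII letters and digits.

```python
import re
from typing import Iterable, List

# Assumption: a valid name consists only of ASCII letters, digits, and underscores.
VALID_COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def invalid_column_names(columns: Iterable[str]) -> List[str]:
    """Return the column names that contain disallowed characters."""
    return [name for name in columns if not VALID_COLUMN_NAME.fullmatch(name)]


# The column names from the example data above all pass the check.
print(invalid_column_names(["text_attr", "pred_value", "ground_truth", "non_input_1"]))
# []

# Spaces and hyphens are flagged.
print(invalid_column_names(["text attr", "pred-value", "ground_truth"]))
# ['text attr', 'pred-value']
```

With a pandas DataFrame, the same helper could be applied to `df.columns` before building the model schema.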

	### Reviewing the Model Schema

	Before you register your model with Arthur by calling `arthur_model.save()`, you can call `arthur_model.review()` to
	inspect the model schema and check that your data is parsed correctly.

	For an NLP model, the model schema should look like this:

	```python
	   name          stage            value_type         categorical  is_unique
	0  text_attr     PIPELINE_INPUT   UNSTRUCTURED_TEXT  False        True
	1  pred_value    PREDICTED_VALUE  FLOAT              False        False      ...
	2  ground_truth  GROUND_TRUTH     INTEGER            True         False
	3  non_input_1   NON_INPUT_DATA   FLOAT              False        False
	...
	```

	## Finishing Onboarding

	Once you have formatted your reference data and confirmed with `arthur_model.review()` that the model schema looks
	correct, your model and its attributes are fully defined and you are ready to complete onboarding.

	The remaining steps are the same for NLP models as for models of any other `InputType` and `OutputType`:

	```{include} finishing_onboarding.md
	```

	## Enrichments

	For an overview of configuring enrichments for NLP models, see the {doc}`/user-guide/walkthroughs/enrichments` guide.

	For a step-by-step walkthrough of setting up the explainability enrichment for NLP models, see
	{ref}`nlp_explainability`.