# Advanced Example: Unique Identifiers
Initial analyses that treat inferences as independent of one another can provide tremendous value. But over time,
models often make multiple predictions about the same real-world entities. No matter what you're predicting,
it can be helpful to compare the inputs and outputs of your model on an entity-by-entity basis.
For example, let's say that your model makes predictions about whether customers will make a purchase in the next 30
days. You might have the following attributes:
- `customer_id`: a non-input attribute that uniquely identifies each customer
- `will_purchase_pred`: the prediction attribute: whether a customer will make a purchase in the next 30 days
- `will_purchase_gt`: the ground truth attribute: whether a customer actually did make a purchase within 30 days
- `recent_purchase_count`: an input attribute with the total number of purchases the customer made in the last 90 days
- `newsletter_subscriber`: an input attribute indicating whether the customer subscribes to the deals newsletter
Your model might be run on the full universe of Customer IDs at some regular interval. With Arthur's powerful Query
API, you can follow inferences for each Customer ID through time and answer questions like:
- How does `recent_purchase_count` tend to change for each customer, from the first to last time inference is conducted?
- What is the per-customer variance of `recent_purchase_count` across time?
- How many customers changed their newsletter subscription status, from one month ago to today?
- What is the distribution of the lifetimes of Customer IDs?
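
To make the last question concrete, here is a minimal Python sketch of one possible reading of "lifetime": the time between a customer's first and last inference. It runs on hypothetical in-memory records rather than against the Query API, and the record layout and the `customer_lifetimes` helper are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical inference records: (customer_id, inference_timestamp)
inferences = [
    ("c1", datetime(2023, 1, 1)),
    ("c1", datetime(2023, 3, 1)),
    ("c2", datetime(2023, 2, 1)),
    ("c2", datetime(2023, 2, 15)),
    ("c3", datetime(2023, 3, 10)),
]


def customer_lifetimes(rows):
    """Return {customer_id: timedelta between first and last inference}."""
    first, last = {}, {}
    for customer_id, timestamp in rows:
        first[customer_id] = min(timestamp, first.get(customer_id, timestamp))
        last[customer_id] = max(timestamp, last.get(customer_id, timestamp))
    return {cid: last[cid] - first[cid] for cid in first}


lifetimes = customer_lifetimes(inferences)
for customer_id, lifetime in sorted(lifetimes.items()):
    print(customer_id, lifetime.days, "days")
```

A customer seen only once has a lifetime of zero under this definition; whether that is the right convention depends on the analysis.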
## Example Queries
We'll walk through some example queries for these entity-by-entity comparisons, exploring the sample case outlined
above.
### Per-Customer Variance
We can look at how consistent `recent_purchase_count` is _for each customer_ across time. We'll compute the variance in
`recent_purchase_count` for each customer across all their inferences, and then roll those individual variances up into
a distribution.
```json
{
  "select": [
    {
      "function": "distribution",
      "alias": "recent_purchase_count_variance_distribution",
      "parameters": {
        "property": {
          "nested_function": {
            "function": "variance",
            "parameters": {
              "property": "recent_purchase_count"
            }
          }
        },
        "num_bins": 20
      }
    }
  ],
  "subquery": {
    "select": [
      {
        "property": "recent_purchase_count"
      },
      {
        "property": "customer_id"
      }
    ],
    "group_by": [
      {
        "property": "customer_id"
      }
    ]
  }
}
```
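
As a rough illustration of what this query computes, the sketch below reproduces the two steps (a variance per customer, then a histogram of those variances) in plain Python on hypothetical data. It does not call the Query API. The helper names are made up for this example, and the choice of population variance and equal-width bins is an assumption, since the conventions used by the `variance` and `distribution` functions may differ.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical inference records: (customer_id, recent_purchase_count)
inferences = [
    ("c1", 2), ("c1", 4), ("c1", 6),
    ("c2", 3), ("c2", 3),
    ("c3", 0), ("c3", 10),
]


def per_customer_variance(rows):
    """Group counts by customer_id and return the population variance per customer."""
    grouped = defaultdict(list)
    for customer_id, count in rows:
        grouped[customer_id].append(count)
    return {cid: pvariance(values) for cid, values in grouped.items()}


def histogram(values, num_bins):
    """Bucket values into num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * num_bins
    for value in values:
        index = min(int((value - lo) / width), num_bins - 1)  # clamp the max into the last bin
        counts[index] += 1
    return counts


variances = per_customer_variance(inferences)
bins = histogram(list(variances.values()), num_bins=5)
print(variances)
print(bins)
```

Customers whose `recent_purchase_count` never changes land in the lowest bin, so a distribution concentrated near zero suggests the feature is stable per customer.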
### Change Across Batches
If our model is a batch model, we might want to compare the values for each customer between two different batches.
We'll again look at the distribution of change in the `recent_purchase_count`, but this time look at the difference for
each customer between two specific batches.
```json
{
  "select": [
    {
      "function": "distribution",
      "alias": "recent_purchase_count_difference_distribution",
      "parameters": {
        "property": {
          "nested_function": {
            "function": "subtract",
            "parameters": {
              "left": "batch1_recent_purchase_count",
              "right": "batch2_recent_purchase_count"
            }
          }
        },
        "num_bins": 20
      }
    }
  ],
  "subquery": {
    "select": [
      {
        "property": "customer_id"
      },
      {
        "property": "batch1_recent_purchase_count"
      },
      {
        "property": "batch2_recent_purchase_count"
      }
    ],
    "subquery": {
      "select": [
        {
          "property": "customer_id"
        },
        {
          "function": "anyIf",
          "parameters": {
            "result": "recent_purchase_count",
            "property": "batch_id",
            "comparator": "eq",
            "value": "batch1"
          },
          "alias": "batch1_recent_purchase_count"
        },
        {
          "function": "anyIf",
          "parameters": {
            "result": "recent_purchase_count",
            "property": "batch_id",
            "comparator": "eq",
            "value": "batch2"
          },
          "alias": "batch2_recent_purchase_count"
        }
      ],
      "group_by": [
        {
          "property": "customer_id"
        }
      ]
    },
    "where": [
      {
        "property": "batch1_recent_purchase_count",
        "comparator": "NotNull"
      },
      {
        "property": "batch2_recent_purchase_count",
        "comparator": "NotNull"
      }
    ]
  }
}
```
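
The nested structure of this query can be easier to follow as three plain steps: pivot each customer's rows into one value per batch, drop customers missing from either batch, then subtract. The Python sketch below mirrors those steps on hypothetical data. It is not Query API code, and it assumes that `anyIf` yields some matching value for the customer, or null when there is none. The helper names are illustrative.

```python
# Hypothetical inference records: (customer_id, batch_id, recent_purchase_count)
inferences = [
    ("c1", "batch1", 5), ("c1", "batch2", 7),
    ("c2", "batch1", 3),                       # c2 is missing from batch2
    ("c3", "batch1", 4), ("c3", "batch2", 1),
]


def pivot_by_batch(rows, batch_a, batch_b):
    """Return {customer_id: (value_in_batch_a, value_in_batch_b)}, one value per batch."""
    pivot = {}
    for customer_id, batch_id, count in rows:
        a, b = pivot.get(customer_id, (None, None))
        if batch_id == batch_a and a is None:
            a = count  # keep the first matching value, standing in for anyIf
        elif batch_id == batch_b and b is None:
            b = count
        pivot[customer_id] = (a, b)
    return pivot


def batch_differences(rows, batch_a="batch1", batch_b="batch2"):
    """Subtract batch_b from batch_a for customers present in both batches."""
    pivot = pivot_by_batch(rows, batch_a, batch_b)
    return {
        cid: a - b
        for cid, (a, b) in pivot.items()
        if a is not None and b is not None  # mirrors the two NotNull filters
    }


differences = batch_differences(inferences)
print(differences)
```

The subtraction order follows the query's `left` and `right` parameters, so a negative value means the count was higher in `batch2`.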
### Change Across First to Last Inference Per Customer
We can again compare two points in time, but instead of comparing fixed batches, we compute the difference between the
earliest and latest inference for each customer:
```json
{
  "select": [
    {
      "function": "distribution",
      "alias": "recent_purchase_count_difference_distribution",
      "parameters": {
        "property": {
          "nested_function": {
            "function": "subtract",
            "parameters": {
              "left": "newest_recent_purchase_count",
              "right": "oldest_recent_purchase_count"
            }
          }
        },
        "num_bins": 20
      }
    }
  ],
  "subquery": {
    "select": [
      {
        "property": "customer_id"
      },
      {
        "function": "argMax",
        "parameters": {
          "argument": "inference_timestamp",
          "value": "recent_purchase_count"
        },
        "alias": "newest_recent_purchase_count"
      },
      {
        "function": "argMin",
        "parameters": {
          "argument": "inference_timestamp",
          "value": "recent_purchase_count"
        },
        "alias": "oldest_recent_purchase_count"
      }
    ],
    "group_by": [
      {
        "property": "customer_id"
      }
    ]
  }
}
```
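
For intuition, the `argMax` and `argMin` steps can be read as "take the `recent_purchase_count` from the row with the latest (or earliest) `inference_timestamp`". The following Python sketch applies that reading to hypothetical data. It is an illustration under that assumption, not Query API code, and `first_to_last_change` is a made-up helper name.

```python
from datetime import datetime

# Hypothetical inference records: (customer_id, inference_timestamp, recent_purchase_count)
inferences = [
    ("c1", datetime(2023, 1, 1), 2),
    ("c1", datetime(2023, 2, 1), 5),
    ("c1", datetime(2023, 3, 1), 4),
    ("c2", datetime(2023, 1, 15), 6),
    ("c2", datetime(2023, 3, 15), 1),
]


def first_to_last_change(rows):
    """Return {customer_id: newest count minus oldest count}."""
    oldest, newest = {}, {}
    for customer_id, timestamp, count in rows:
        # Track the (timestamp, count) pair at the earliest and latest timestamps.
        if customer_id not in oldest or timestamp < oldest[customer_id][0]:
            oldest[customer_id] = (timestamp, count)
        if customer_id not in newest or timestamp > newest[customer_id][0]:
            newest[customer_id] = (timestamp, count)
    return {cid: newest[cid][1] - oldest[cid][1] for cid in oldest}


changes = first_to_last_change(inferences)
print(changes)
```

Intermediate inferences are ignored: for `c1`, only the January and March values contribute to the difference.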
### Change in Categorical Variables
We can also look at change in categorical variables on an entity-by-entity basis. Let's look at the distribution of
customers who remained subscribed, remained unsubscribed, newly subscribed, or newly unsubscribed from one batch to the
next.
```json
{
  "select": [
    {
      "alias": "batch1_not_subscribed",
      "function": "equals",
      "parameters": {
        "left": "batch1_newsletter_subscriber",
        "right": 0
      }
    },
    {
      "alias": "batch1_is_subscribed",
      "function": "equals",
      "parameters": {
        "left": "batch1_newsletter_subscriber",
        "right": 1
      }
    },
    {
      "alias": "batch2_not_subscribed",
      "function": "equals",
      "parameters": {
        "left": "batch2_newsletter_subscriber",
        "right": 0
      }
    },
    {
      "alias": "batch2_is_subscribed",
      "function": "equals",
      "parameters": {
        "left": "batch2_newsletter_subscriber",
        "right": 1
      }
    },
    {
      "alias": "stayed_unsubscribed_count",
      "function": "and",
      "parameters": {
        "left": {
          "alias_ref": "batch1_not_subscribed"
        },
        "right": {
          "alias_ref": "batch2_not_subscribed"
        }
      }
    },
    {
      "alias": "did_subscribe_count",
      "function": "and",
      "parameters": {
        "left": {
          "alias_ref": "batch1_not_subscribed"
        },
        "right": {
          "alias_ref": "batch2_is_subscribed"
        }
      }
    },
    {
      "alias": "stayed_subscribed_count",
      "function": "and",
      "parameters": {
        "left": {
          "alias_ref": "batch1_is_subscribed"
        },
        "right": {
          "alias_ref": "batch2_is_subscribed"
        }
      }
    },
    {
      "alias": "did_unsubscribe_count",
      "function": "and",
      "parameters": {
        "left": {
          "alias_ref": "batch1_is_subscribed"
        },
        "right": {
          "alias_ref": "batch2_not_subscribed"
        }
      }
    }
  ],
  "subquery": {
    "select": [
      {
        "property": "customer_id"
      },
      {
        "property": "batch1_newsletter_subscriber"
      },
      {
        "property": "batch2_newsletter_subscriber"
      }
    ],
    "subquery": {
      "select": [
        {
          "property": "customer_id"
        },
        {
          "function": "anyIf",
          "parameters": {
            "result": "newsletter_subscriber",
            "property": "batch_id",
            "comparator": "eq",
            "value": "batch1"
          },
          "alias": "batch1_newsletter_subscriber"
        },
        {
          "function": "anyIf",
          "parameters": {
            "result": "newsletter_subscriber",
            "property": "batch_id",
            "comparator": "eq",
            "value": "batch2"
          },
          "alias": "batch2_newsletter_subscriber"
        }
      ],
      "group_by": [
        {
          "property": "customer_id"
        }
      ]
    },
    "where": [
      {
        "property": "batch1_newsletter_subscriber",
        "comparator": "NotNull"
      },
      {
        "property": "batch2_newsletter_subscriber",
        "comparator": "NotNull"
      }
    ]
  }
}
```
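
The four transition flags in this query correspond to the four possible `(batch1, batch2)` pairs of a binary attribute. The Python sketch below tallies those pairs on hypothetical data to show the intended result. It assumes `newsletter_subscriber` is encoded as `0` or `1`, as the query's `equals` comparisons suggest, and it is an illustration rather than Query API code. The `LABELS` mapping and `transition_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical pivoted data: customer_id -> (batch1_newsletter_subscriber, batch2_newsletter_subscriber)
subscriptions = {
    "c1": (0, 0),
    "c2": (0, 1),
    "c3": (1, 1),
    "c4": (1, 0),
    "c5": (1, 1),
    "c6": (None, 1),  # absent from batch1, so excluded like the NotNull filters do
}

# Map each (before, after) pair to a transition label.
LABELS = {
    (0, 0): "stayed_unsubscribed",
    (0, 1): "did_subscribe",
    (1, 1): "stayed_subscribed",
    (1, 0): "did_unsubscribe",
}


def transition_counts(pairs):
    """Count customers in each subscription transition, skipping incomplete pairs."""
    counts = Counter()
    for before, after in pairs.values():
        if before is None or after is None:
            continue
        counts[LABELS[(before, after)]] += 1
    return counts


counts = transition_counts(subscriptions)
print(dict(counts))
```

The same pattern extends to attributes with more than two categories by adding one label per `(before, after)` pair.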