## SparkML Integration

This guide provides an example of integrating with the ArthurAI platform to monitor a SparkML model. We'll use an example dataset to train a SparkML model from scratch, but you could also use an existing Spark Pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
import pyspark.sql.functions as f
from pyspark.ml.classification import LogisticRegression

import pandas as pd
import numpy as np

from arthurai import ArthurAI
from arthurai.client.apiv3 import InputType, OutputType, Stage
```

### Train and Save SparkML Model

First, we'll instantiate a Spark session and load a sample dataset. In this example, we'll use a dataset derived from the famous Boston Housing dataset to build a simple model.

```python
spark = SparkSession.builder.appName('app').getOrCreate()
data = spark.read.csv('./data/boston_housing.csv', header=True, inferSchema=True)
train, test = data.randomSplit([0.7, 0.3])
```

We'll use an L1-regularized (LASSO) logistic regression model to predict the `is_expensive` column from all the others. This column encodes whether a property's value was above or below the local average. As preprocessing, we'll use the VectorAssembler class to pull the input columns together into a single numeric feature vector.

```python
feature_columns = data.columns[:-1]  # omit the final column, the is_expensive label
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
lasso_classifier = LogisticRegression(featuresCol="features", labelCol="is_expensive",
                                      maxIter=10, regParam=0.3, elasticNetParam=1.0)
```

Using a Pipeline, we'll combine our preprocessing steps and our ML model, fit it to the training data, and save the result. If you have an existing Spark Pipeline, you can load it from disk instead.

```python
pipeline = Pipeline(stages=[assembler, lasso_classifier])
fitted_pipeline = pipeline.fit(train)
fitted_pipeline.write().overwrite().save('./data/models/boston_housing_spark_model_pipeline')
```

### Onboard to Arthur

```python
arthur = ArthurAI(url='https://app.arthur.ai',
                  login="<YOUR_USERNAME_OR_EMAIL>",
                  password="<YOUR_PASSWORD>")
```

To onboard our model with Arthur, we'll register the schema of the data coming into and out of the model. For simplicity, you can use a pandas DataFrame for this step: we'll collect a sample of the Spark DataFrame to the driver and use it to register the model with Arthur.

```python
sample_df = train.limit(5000).toPandas()
sample_Y = sample_df[['is_expensive']]
sample_X = sample_df.drop('is_expensive', axis=1)
```

```python
# instantiate basic model
arthur_model = arthur.model({
    "partner_model_id": "Boston Housing",
    "input_type": InputType.Tabular,
    "output_type": OutputType.Multiclass,
    "is_batch": True})

# use pandas DataFrames to register the data schema
arthur_model.from_dataframe(sample_X, Stage.ModelPipelineInput)
arthur_model.add_binary_classifier_output_attributes(
    positive_predicted_attr='expensive',
    pred_to_ground_truth_map={
        'prediction_expensive': 'ground_truth_expensive',
        'prediction_cheap': 'ground_truth_cheap'
    },
    threshold=0.75
)
```
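The attribute names registered by `add_binary_classifier_output_attributes()` ("prediction_expensive"/"prediction_cheap" and "ground_truth_expensive"/"ground_truth_cheap") are the column names Arthur will expect when predictions and labels are sent later. As a rough sketch (assuming `is_expensive` is encoded as 0/1), the one-hot ground truth columns can be derived from the label like this:

```python
# sketch: build one-hot ground truth columns from the 0/1 label
# (assumes is_expensive == 1 means the property was expensive)
labeled_df = sample_df.copy()
labeled_df["ground_truth_expensive"] = labeled_df["is_expensive"]
labeled_df["ground_truth_cheap"] = 1 - labeled_df["is_expensive"]
```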
The `from_dataframe()` method will inspect your dataset and infer the input schema, datatypes, and sample statistics. You can review the model structure and see whether any fixes are needed.

```python
arthur_model.review()
```

```python
# chas and rad were inferred as categorical; let's change those to be continuous
arthur_model.get_attribute('chas', Stage.ModelPipelineInput).set(categorical=False)
arthur_model.get_attribute('rad', Stage.ModelPipelineInput).set(categorical=False)

arthur_model.review()
```

**Monitoring for bias**

For any attributes that you want to monitor for bias, set the `monitor_for_bias` boolean. These don't have to be model inputs; they can also be attributes of stage NonInputData.

```python
sensitive_attributes = ["Gender", "Race", "Income_Bracket"]
for attribute_name in sensitive_attributes:
    arthur_model.get_attribute(attribute_name, Stage.ModelPipelineInput).monitor_for_bias = True
```

**Save**

Now you're ready to save your model and finish onboarding.

```python
arthur_model.save()
```

### Set reference data

You can set a baseline dataset to speed up the calculation of data drift and inference anomaly scores. This reference set is typically the training set the model was fitted to, or a subsample of it. You can use either a pandas DataFrame or a directory of parquet files. The reference data can include model input features, ground truth features, and model predictions on the training set; however, it is recommended that only model input features are provided.

```python
arthur_model.set_reference_data(directory_path="./data/august_training_data/")
```

(sparkml_explainability)=
### Enabling Explainability for SparkML

To enable explainability, you'll supply a python file that implements a `predict()` function, which takes observations as a numpy array. This predict function can contain anything you need, including loading a serialized model, preprocessing/transformations, and making a final prediction; the returned result should be a numpy array. You'll also supply a requirements file listing the dependencies needed to run an inference through your model. For more details around enabling explainability, see the {doc}`/user-guide/walkthroughs/explainability` guide. Below we provide a Spark-specific example.

The first step is to save your SparkML model and pipeline so they can be imported for use in the `predict()` function.

```python
fitted_pipeline.write().overwrite().save('./data/models/boston_housing_spark_model_pipeline')
```
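The requirements file passed to `enable_explainability()` below should cover everything the `predict()` function imports. For this example, a minimal `requirements.txt` might look something like the following; versions are omitted here, and in practice you should pin them to match your training environment:

```text
pyspark
pandas
numpy
```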
Next is to create your `predict()` function.

```python
# entrypoint.py
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

# To start the Spark session on the model server, specify the master URL as local.
# By default this runs Spark with 1 thread; to increase threads you can specify
# local[x], where x is the number of threads. When allocating more compute and memory
# to the Spark session, be sure to also increase the amount allocated to the model server
# when calling ArthurModel.enable_explainability() in the SDK (by default 1 CPU and 1 GB
# of memory are allocated to the model server).
spark = SparkSession.builder.master('local').appName('app').getOrCreate()
loaded_pipeline = PipelineModel.load("./data/models/boston_housing_spark_model_pipeline")

def predict(input_data):
    col_names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
                 'rad', 'tax', 'ptratio', 'b', 'lstat']
    input_df = pd.DataFrame(input_data, columns=col_names)
    spark_df = spark.createDataFrame(input_df)
    predictions = loaded_pipeline.transform(spark_df)
    return np.array([float(x.prediction) for x in predictions.select('prediction').collect()])
```

You are then ready to {ref}`enable explainability <enabling_explainability>`.

```python
arthur_model.enable_explainability(
    df=sample_df,
    project_directory='.',
    user_predict_function_import_path='entrypoint',
    requirements_file='requirements.txt')
```

### Send Batch of Inferences

Once your model has been onboarded, it is ready to receive inferences and model telemetry. There are some standard inputs needed to identify inferences and batches:

* First, each inference needs a unique identifier so that it can later be joined with ground truth. Include a column named **partner_inference_id** and ensure these IDs are unique across batches. For example, if you run predictions across your customer base on a daily-batch cadence, a unique identifier could be composed of your customer_id plus the date (see the sketch at the end of this guide).
* Second, each inference needs to be associated with a **batch_id**, though this ID will be shared among one or more inferences.
* Finally, each inference needs an **inference_timestamp**; these don't have to be unique.

Additionally, the predictions/scores from your model should match the column names in the registered schema. Looking back at `arthur_model.review()`, we'll recall that the columns we created correspond to the classifier's output probabilities over the classes ("prediction_cheap" and "prediction_expensive") and the corresponding ground truth over the possible classes in one-hot form ("ground_truth_cheap" and "ground_truth_expensive").

We will process a batch of datapoints through the Pipeline and save the inputs (and predictions) to parquet. We will do the same for the ground truths.

```python
import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType
from pyspark.ml import PipelineModel

loaded_pipeline = PipelineModel.load("./data/models/boston_housing_spark_model_pipeline")

# the `probability` column is a vector of class probabilities; extract the positive-class
# probability as a float (assumes class 1 corresponds to "expensive")
prob_expensive = udf(lambda v: float(v[1]), DoubleType())
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())

inferencesDF = (loaded_pipeline.transform(test)
                .withColumn("prediction_expensive", prob_expensive("probability"))
                .withColumn("prediction_cheap", 1 - f.col("prediction_expensive"))
                .withColumn("partner_inference_id", uuidUdf())
                .withColumn("inference_timestamp", f.current_timestamp())
                .withColumn("batch_id", f.lit("inferences_batch_001"))
                # drop Spark's vector columns and the raw label so the output matches the registered schema
                .drop("features", "rawPrediction", "probability", "prediction", "is_expensive"))

# write inferences
inferencesDF.write.parquet("./data/inference_files/inferences.parquet")

# write ground truths, deriving the one-hot columns from the 0/1 label
ground_truth_DF = (test
                   .withColumn("ground_truth_expensive", f.col("is_expensive"))
                   .withColumn("ground_truth_cheap", 1 - f.col("is_expensive"))
                   .select("ground_truth_cheap", "ground_truth_expensive"))
ground_truth_DF = ground_truth_DF.withColumn("partner_inference_id", ...)  # must match the IDs sent with the inferences
ground_truth_DF = ground_truth_DF.withColumn("ground_truth_timestamp", f.current_timestamp())
ground_truth_DF = ground_truth_DF.withColumn("batch_id", f.lit("gt_batch_001"))

ground_truth_DF.write.parquet("./data/ground_truth_files/ground_truth.parquet")
```
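Before uploading, a quick sanity check can confirm that the written files contain the column names registered with Arthur. This is an optional sketch; it assumes pandas with pyarrow installed can read the Spark-written parquet directory locally:

```python
# optional check: read the parquet directory back and inspect the columns
check_df = pd.read_parquet("./data/inference_files/inferences.parquet")
print(check_df.columns.tolist())
```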
With our model's inputs and outputs saved as parquet, we upload a batch by pointing to the directory containing one or more parquet files. The directory will be traversed and all parquet files will be joined into the corresponding batch. Note that the model inputs and predictions are uploaded separately from the ground truth.

```python
arthur_model.send_bulk_inferences(directory_path='./data/inference_files/')
```

You can separately upload ground truths for each inference. Every row in the ground truth file(s) should have a `partner_inference_id` column that matches the IDs you created for the inferences.

```python
arthur_model.send_bulk_ground_truths(directory_path='./data/ground_truth_files/')
```
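Ground truth often arrives well after the predictions are made, so the join key needs to be reproducible. Instead of random UUIDs, a deterministic `partner_inference_id` can be composed as suggested earlier, for example from a customer identifier plus the batch date. A rough sketch, assuming a hypothetical `customer_id` column in your data:

```python
from datetime import date

# hypothetical: build a reproducible ID from a customer_id column plus the batch date,
# so the same ID can be regenerated later when the ground truth label becomes available
batch_date = date.today().isoformat()  # e.g. the daily batch date
inferencesDF = inferencesDF.withColumn(
    "partner_inference_id",
    f.concat_ws("_", f.col("customer_id"), f.lit(batch_date))  # customer_id is illustrative
)
```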