# Arthur Platform Backup and Restore

## Contents

* Warnings
* Overview
    * Overview - clickhouse-backup
    * Overview - Velero
    * Overview - Arthur (Argo) Workflows
    * Overview - S3
* Pre-requisites
* Installing - Velero
    * Create the Velero Configuration
    * Install Velero
    * Confirm Velero is installed and configured correctly
    * Configure the Backup Storage Destination to Point to S3
* Configuring - clickhouse-backup remote storage
* Backing up
* Restoring
* Appendix
    * Running the Velero CLI
    * Working with Velero - Backup
    * Working with Velero - Restore
    * Backup Architecture

## WARNINGS

### PLEASE READ - FOLLOW THESE INSTRUCTIONS EXACTLY

These instructions have been tested as written. If you find they do not work for your use-case, please reach out to Arthur Support before modifying them. We cannot guarantee reliable operation if these instructions are not followed exactly as written.

### PLEASE READ - TAKE CARE WHEN RESTORING INTO A NEW CLUSTER

When restoring into a new cluster, you must ensure that the new cluster cannot communicate with any services or data stores in the old cluster.

For example, if you took a backup on cluster "Apple" and performed a restore into cluster "Banana", cluster "Banana" must point to its own RDS instance, ClickHouse database, and Kafka store (note: it is okay, though not ideal, for clusters to share an S3 bucket).

To ensure this, you must re-configure via the Admin Interface when restoring into a new cluster. Failure to do this **WILL CAUSE DATA CORRUPTION** on both clusters that is unrecoverable.

### PLEASE READ - ENSURE CONSISTENCY WITH BACKUPS

If you are either manually taking a backup or scheduling a backup, you **MUST** take a backup of the full platform. You **CANNOT** use a ClickHouse snapshot taken at midnight with an RDS snapshot taken at 04:00 (or any other time). All backup operations must be performed at the same time, and when restoring, the data you are using must all belong to the same backup operation. This is to ensure data consistency across the different data stores. **IGNORING THIS WILL CAUSE DATA CORRUPTION**.

## Overview

The overall backup and restore process for the Arthur Platform is as follows:

* Backup a cluster
	* Take a backup of ClickHouse Data
	* Take a backup of Kubernetes Deployment State and Persistent Volumes
		* Enrichments infrastructure
			* Model Servers
			* Data Pipeline Services
			* Enrichment / Delete Enrichment Workflows
		* Kafka Deployment State and EBS Volumes (using EBS Snapshots)
	* Take a backup of RDS Postgres
* Restore the cluster
	* Restore RDS Postgres
	* Update configuration and install the platform
	* Restore ClickHouse Data
	* Restore the Kafka Deployment State and Persistent Volumes
	* Restore Enrichments infrastructure
	* Restore Workflows
* Smoke Tests and Validation

### Overview - clickhouse-backup

The Arthur Platform stores inference data, data built from the enrichments pipeline, reference data, and ground truth data in ClickHouse. ClickHouse is an open-source OLAP database which supports SQL-like query execution, replication, sharding, and many additional features.

To back up ClickHouse, the Arthur Platform uses a tool called [clickhouse-backup](https://github.com/AlexAkulov/clickhouse-backup). clickhouse-backup is a sidecar container included on the ClickHouse pods and is responsible for taking backups, performing restores, and coordinating with remote storage (in this case S3) to store and retrieve backups. clickhouse-backup uses built-in functionality of ClickHouse to take backups and perform restores.
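As a quick orientation, the sidecar can be invoked directly with `kubectl exec` to list the backups it knows about, both local and remote. This is a minimal sketch that assumes the ClickHouse pod and sidecar container names (`backup`) match the restore example later in this document:

```bash
# Find a ClickHouse pod, then list backups known to the clickhouse-backup sidecar.
# Pod and container names are assumptions; adjust them to your installation.
kubectl get pods -n arthurai | grep olap
kubectl exec chi-olap-installation-arthur-0-0-0 -n arthurai \
    -c backup -- clickhouse-backup list
```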


### Overview - Velero

The Arthur Platform uses [Velero](https://velero.io/) as its Backup and Restore tool. Velero is an industry-standard, battle-tested tool for backing up Kubernetes Resources as well as Persistent Volumes.

Arthur uses Velero to back up most namespaced Kubernetes resources, as well as EBS Volume Snapshots for each PersistentVolume claimed by the StatefulSets (eg: via PVCs).

Backup data (not including EBS Volume Snapshots) is stored in an S3 bucket which is accessible via a ServiceAccount that is provisioned for the Backup and Restore agent. Backups and restores are managed by Velero using Kubernetes Custom Resource Definitions (CRDs), which are consumed by the Velero Backup Controller and Restic Agents.

Velero also allows backups to be scheduled using a cron-like configuration. It also provides `ServiceMonitors` which expose metrics via Prometheus, so that operators can monitor backup and restore status and set up alerts for when backups or restores fail.
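For example, a schedule can be created with the Velero CLI's `schedule create` command. The sketch below is illustrative only (the schedule name and cron expression are not Arthur defaults); note that a scheduled Velero backup covers only Kubernetes resources and volumes, so it must be coordinated with the ClickHouse and RDS backups per the consistency warning above:

```bash
# Illustrative only: create a daily Velero schedule for the arthurai namespace.
velero schedule create arthur-daily \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --schedule="0 0 * * *"
```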

### Overview - Arthur (Argo) Workflows

The Arthur Platform uses [Argo Workflows](https://argoproj.github.io/argo-workflows/) as a workflow orchestration engine for running certain jobs. Argo installs a handful of Custom Resource Definitions (CRDs) which enable the Argo Workflow services to schedule, execute and update these jobs.

Workflows are dynamically managed, meaning that their definitions are not stored in the Arthur installer script. The Backup and Restore operation accounts for this by treating restoration of Workflows on a case-by-case basis, as follows:

* Enrichments and Delete Enrichments workflows
	* These workflows create and tear down the infrastructure necessary for processing enrichments data (eg: kafka topics, pods which manage the data pipeline for enrichments, etc.)
	* These workflows are idempotent and safe to recover
	* Therefore, these workflows are backed up and restored just like any other Kubernetes Resource during the backup stage
* Batch workflows
	* These workflows are created to manage batch jobs, which are used by clients when uploading large data files to models (inferences and/or ground truths).
	* These workflows are sometimes safe to recover
	* Therefore, these workflows are restored selectively based on what state they were in when the backup was taken
		* Workflows for which Arthur received all of the data from the client are resumed by manually re-submitting them (this is done via an Administrative HTTP endpoint that needs to be called manually)
		* Workflows for which Arthur did not receive all the data from the client will need to be re-submitted. Operators restoring the cluster will need to reach out to affected clients to communicate that their batch workflows should be re-submitted.
* Reference and Cron Workflows
	* Reference Workflows are created for monitoring the upload of reference datasets to S3
		* Reference datasets that were in-flight during a backup will need to be re-uploaded with the SDK.
	* Cron Workflows are scheduled workflows which perform some regular processing (eg: triggering alerts for non-batch inferences)
		* Cron Workflows are meant to be run on a regular schedule. It is safe to wait for the next workflow to be triggered, and therefore, these workflows are neither backed up nor restored.
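Because Argo registers Workflows as a Kubernetes resource, you can inspect which workflows exist and what state they are in at any time (for example, before taking a backup). A minimal check:

```bash
# List Argo Workflows in the Arthur namespace along with their current phase.
kubectl get workflows -n arthurai
```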

### Overview - S3

The Arthur Platform uses AWS S3 as object storage for storing inference data, reference data, as well as data and trained models for the enrichments pipeline.

Arthur recommends ensuring that the AWS S3 bucket used for this storage is configured with Live Cross-Region Replication so that objects are available in the event of an AWS region outage.

The Arthur Backup solution does not manage consistency with the S3 bucket and other backup data.
The data in S3 is only used in conjunction with data that is stored in Postgres (eg: model definitions), so it is okay if there is data in S3 that is not represented in Postgres.
Therefore, the S3 bucket for a cluster will always reflect the most up-to-date state, regardless of when a backup was taken.

To read more about S3 Bucket Replication, check out the AWS Documentation:
* https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
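If you want to confirm whether replication is already configured on your bucket, the AWS CLI can report the bucket's replication configuration. This is only a verification sketch; `$BUCKET` is a placeholder for your Arthur object-storage bucket:

```bash
# Show the replication configuration for the bucket, if one exists.
aws s3api get-bucket-replication --bucket "$BUCKET"
```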

## Pre-requisites


The following must be true of your deployment in order to use Arthur's Backup and Restore capabilities:

1. Arthur must be configured using external object storage, specifically, S3
2. The access to external storage must be configured using IRSA Annotations
3. In order to use IRSA Annotations, the cluster must be deployed using Amazon EKS

If any of these are not true or possible for your deployment, please reach out to Arthur Support so we can discuss.



## Installing - Velero

The only component that needs to be installed separately from Arthur to perform backups and restores is Velero. Below, instructions are provided for setting up Velero to store backups in S3 using Secret and Access keys.

The general overview of the installation is as follows:

1. Create the Velero configuration
	1. Create the policy for accessing the S3 Bucket and taking EBS Snapshots and attach it to an IAM User
	2. Generate the Secret and Access keys for the IAM User
	3. Create a Velero-specific credentials file
2. Install Velero
3. Confirm Velero is installed and configured correctly
4. Configure the Backup Storage Destination to Point to S3

### Create the Velero Configuration

The instructions here are taken from the Velero AWS Plugin Documentation, which can be found in Option 1 here:
https://github.com/vmware-tanzu/velero-plugin-for-aws#setup

1. Create the IAM user:

    ```bash
    aws iam create-user --user-name velero
    ```

    If you'll be using Velero to backup multiple clusters with multiple S3 buckets, it may be desirable to create a unique username per cluster rather than the default `velero`.

2. Attach policies to give `velero` the necessary permissions:

    ```bash
    cat > velero-policy.json <<EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeVolumes",
                    "ec2:DescribeSnapshots",
                    "ec2:CreateTags",
                    "ec2:CreateVolume",
                    "ec2:CreateSnapshot",
                    "ec2:DeleteSnapshot"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:PutObject",
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts"
                ],
                "Resource": [
                    "arn:aws:s3:::${BUCKET}/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::${BUCKET}"
                ]
            }
        ]
    }
    EOF
    ```
    ```bash
    aws iam put-user-policy \
      --user-name velero \
      --policy-name velero \
      --policy-document file://velero-policy.json
    ```

3. Create an access key for the user:

    ```bash
    aws iam create-access-key --user-name velero
    ```

    The result should look like:

    ```
    {
      "AccessKey": {
            "UserName": "velero",
            "Status": "Active",
            "CreateDate": "2017-07-31T22:24:41.576Z",
            "SecretAccessKey": <AWS_SECRET_ACCESS_KEY>,
            "AccessKeyId": <AWS_ACCESS_KEY_ID>
      }
    }
    ```

4. Create a Velero-specific credentials file (`credentials-velero`) in your local directory:

    ```
    [default]
    aws_access_key_id=<AWS_ACCESS_KEY_ID>
    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
    ```

    where the access key id and secret are the values returned from the `create-access-key` request.

### Install Velero

To install Velero, first install the Velero CLI. Instructions for how to do this can be found on the Velero Documentation site: https://velero.io/docs/v1.10/basic-install/#install-the-cli

Once the Velero CLI is installed, you can use it to install Velero on the cluster. Please ensure that your kubeconfig is pointing to the cluster, as this is what is used by the Velero CLI to communicate with the cluster.
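A quick way to confirm which cluster your kubeconfig currently points to:

```bash
# Verify the active kubectl context and cluster before installing Velero.
kubectl config current-context
kubectl cluster-info
```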

To install Velero, use the following command, ensuring that the path to the credentials file you generated in the last step is provided correctly:

```bash
velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.6.0 \
    --bucket $BUCKET \
    --backup-location-config region=$REGION \
    --snapshot-location-config region=$REGION \
    --secret-file ./credentials-velero
```
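After the install command completes, you can verify from the command line that the Velero deployment came up (Velero installs into the `velero` namespace by default):

```bash
# Check that the Velero controller is running and inspect its logs if needed.
kubectl get pods -n velero
kubectl logs deployment/velero -n velero | tail
```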

### Confirm Velero is installed and configured correctly

To confirm that Velero is installed and configured correctly:

1. Open the Kots Admin Interface and navigate to the "Snapshots" tab
2. Click the "Check for Velero" button (see the screenshot below)

![Check for Velero](/_static/images/velero-configuration-check.png)

### Configure the Backup Storage Destination to Point to S3

The final step in configuring Velero is to configure the Backup Storage Destination to point to the S3 Bucket where the backups will be stored.

To do this, add a new "Backup Storage Destination" in the Admin Interface and fill in the details for the S3 Bucket and for the Secret and Access keys.
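You can also confirm from the CLI that the backup storage location is registered and reachable (see the Appendix for how to run the Velero CLI):

```bash
# List configured backup storage locations; the output indicates whether
# Velero can reach the S3 bucket (eg: an "Available" phase).
velero backup-location get
```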

## Configuring - clickhouse-backup remote storage

Before continuing, please ensure that your cluster is setup and configured as described in the "Pre-requisites" section above.

Configuring clickhouse-backup to store backups in remote storage (eg: S3) can be done in the Kots Admin Interface.

If you've performed the configuration steps as mentioned in the "Pre-requisites" section, you should see the "Enable Olap Database Backup Capabilities" option in the "Olap Database" section (see the screenshot below).

![Configure Olap Backups](/_static/images/clickhouse-backup.png)

Ensure that:

1. The configuration that points to the bucket is correct
    * The Bucket Name
    * The Bucket Region
2. The ServiceAccount is the same ServiceAccount that you've configured with the IRSA Annotation (if you are not sure, enter the default value)
3. The IAM Role that you are using for the IRSA Annotation has the appropriate permissions to read/write/list from the S3 bucket
    * Note - you can use the same permissions that are described in the Velero Section, just be sure to update the bucket ARN that the permissions apply to
4. The S3 Path is where you want to be storing backups
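If you are unsure which IAM role a ServiceAccount is annotated with, you can inspect it directly. This is a verification sketch where `<service-account-name>` is a placeholder for the ServiceAccount you configured:

```bash
# Show the IRSA annotation (eks.amazonaws.com/role-arn) on the ServiceAccount.
kubectl get serviceaccount <service-account-name> -n arthurai -o yaml | grep role-arn
```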

Once you've configured clickhouse-backup, you can validate that the configuration is correct by trying to take a backup. Refer to the "Take a backup of ClickHouse Data" section below on how to do this.

## Backing up

Note that you can use the following script to run all of these steps together. See the "Running the Velero CLI" section in the Appendix for instructions on how to run Velero.

```bash
#!/bin/bash

set -euo pipefail
IFS=$'\n\t'

# You need to configure this by getting the name of the storage location
# using the velero CLI
# eg: `velero backup-location get`
storage_location="Put your storage location here"

backup_date=$(date +%Y-%m-%d-%H-%M-%S);
name=arthur-backup-$backup_date
echo "Creating a new backup with name $name"

echo "Taking a backup of CH data"
kubectl create job $name-clickhouse-backup \
    --namespace=arthurai \
    --from=cronjob/clickhouse-backup-cronjob
ch_backup_jobname=$(kubectl get jobs --namespace=arthurai -o name | grep "$name-clickhouse-backup")
kubectl wait $ch_backup_jobname \
    --namespace=arthurai \
    --for=condition=complete \
    --timeout=30m

echo "Taking a backup of the enrichments infrastructure"
velero backup create $name-enrichments \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='component in (kafka-mover-init-connector, model_server)' \
    --include-resources=deployments,services \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of workflows"
velero backup create $name-workflows \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --include-resources=workflows \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of Kafka/Kafka-ZK StatefulSets, their EBS Volumes, and related components"
velero backup create $name-stateful-sets \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='app in (cp-zookeeper,cp-kafka)' \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of the RDS database"
aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier $name-snapshot \
    --db-cluster-identifier RDS_DB_NAME \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```
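Once the script finishes, you can confirm that each piece completed successfully; the checks below reuse commands that appear elsewhere in this document, and `AWS_PROFILE_NAME`/`AWS_REGION` are the same placeholders as above:

```bash
# The ClickHouse backup job should show COMPLETIONS 1/1.
kubectl get jobs -n arthurai | grep clickhouse-backup

# Each Velero backup should show STATUS Completed with 0 errors.
velero backup get -n arthurai

# The RDS snapshot should eventually report a status of "available".
aws rds describe-db-cluster-snapshots \
    --db-cluster-snapshot-identifier $name-snapshot \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```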

### Take a backup of ClickHouse Data

By default, the Arthur Platform ships with a Kubernetes CronJob which takes a ClickHouse Backup each day at midnight.

Please see the warning `PLEASE READ - ENSURE CONSISTENCY WITH BACKUPS` at the top if attempting to take a manual backup.

To take a manual backup of ClickHouse Data, you can run the following commands:

```bash
$ kubectl get cronjobs -n arthurai
NAME                              SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
arthurai-cron-workflow-cron-job   1 0 * * *   False     0        14h             2d18h
arthurai-model-health-cron-job    5 * * * *   False     0        20m             2d18h
clickhouse-backup-cronjob         0 0 * * *   False     0        14h             2d18h
$ kubectl create job clickhouse-backup --from=cronjob/clickhouse-backup-cronjob -n arthurai
job.batch/clickhouse-backup created
$ kubectl get jobs -n arthurai
NAME                                       COMPLETIONS   DURATION   AGE
clickhouse-backup-cronjob-27735840         1/1           8m35s      14m
```
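To watch the manually created job and confirm it finished, you can wait on it and check its logs (the job name matches the `kubectl create job` command above):

```bash
# Wait for the manual ClickHouse backup job to complete, then inspect its logs.
kubectl wait job/clickhouse-backup -n arthurai --for=condition=complete --timeout=30m
kubectl logs job/clickhouse-backup -n arthurai | tail
```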

### Take a backup of Enrichments infrastructure and Enrichment / Delete Enrichment Workflows

The Arthur Platform uses Velero to take a backup of the Enrichments Infrastructure, as well as the Enrichments workflows. The backup is orchestrated manually and requires running a command.

The Enrichments infrastructure and Enrichment Workflows are orchestrated as separate backups and require running two separate commands.

To take a manual backup of the Enrichments infrastructure, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S);
$ name=arthur-backup-$backup_date
$ velero backup create $name-enrichments \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='component in (kafka-mover-init-connector, model_server)' \
    --include-resources=deployments,services \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait
```

To take a manual backup of the Enrichments Workflows, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S);
$ name=arthur-backup-$backup_date
velero backup create $name-workflows \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --include-resources=workflows \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait
```

### Take a backup of Kafka Deployment State and EBS Volumes (using EBS Snapshots)

The Arthur Platform uses Velero to take a backup of the Kafka Deployment State and EBS Volumes. The backup is orchestrated manually and requires running a command.

To take a manual backup of Kafka, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S);
$ name=arthur-backup-$backup_date
$ velero backup create $name-kafka \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='app in (cp-zookeeper,cp-kafka)' \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io \
    --storage-location=$storage_location \
    --wait
```

If you are unable to run the Velero CLI locally, you can execute it from the Velero container. See the section **Executing on the Velero Backup Controller Pod** for more details.


### Take a backup of RDS Postgres

The script above provides the command line instructions for taking a backup of an RDS database (also copied here). Please ensure that the values for `db-cluster-identifier`, `profile`, and `region` are filled in correctly.

Note that this command only applies to a Multi-AZ RDS DB cluster. If you are using a single RDS DB instance, the command to use is `aws rds create-db-snapshot`.

```bash
aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier $name-snapshot \
    --db-cluster-identifier RDS_DB_NAME \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```
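The snapshot is taken asynchronously; if you want the shell to block until it is ready, the AWS CLI provides a waiter (for a single DB instance, the equivalent is `aws rds wait db-snapshot-available`):

```bash
# Block until the cluster snapshot reaches the "available" state.
aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier $name-snapshot \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```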

For more information, please refer to the AWS Documentation:
* Multi-AZ RDS DB cluster:
    * https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateMultiAZDBClusterSnapshot.html
    * https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-cluster-snapshot.html
* Single RDS DB instance:
    * https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
    * https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-snapshot.html

## Restoring

### Restore the RDS Postgres

To restore RDS Postgres, use the following resource:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromMultiAZDBClusterSnapshot.Restoring.html

Please ensure that you correctly match the following configuration from the DB from which the snapshot was taken:
* The connection port
* The VPC and Security Group IDs
* DB Subnet Group
* DB Instance Type
* Any other configuration which might be overridden

This operation might take a while, and the DB must show as "Available" before you install the platform.
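As a rough sketch of the restore command for a Multi-AZ DB cluster (all identifiers below are placeholders; the port, subnet group, security groups, and instance class must be carried over from the original cluster as listed above, and the linked AWS documentation remains the authoritative procedure):

```bash
# Sketch only: restore a new DB cluster from the snapshot taken during backup.
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier NEW_RDS_DB_NAME \
    --snapshot-identifier arthur-backup-<timestamp>-snapshot \
    --engine postgres \
    --db-cluster-instance-class ORIGINAL_INSTANCE_CLASS \
    --db-subnet-group-name ORIGINAL_SUBNET_GROUP \
    --vpc-security-group-ids ORIGINAL_SECURITY_GROUP_IDS \
    --port ORIGINAL_PORT \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```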

### Update configuration and install the platform

**NOTE - Before completing this action, please check the following:**

1. The Restored RDS DB Instance is showing as "Available"
2. All pods which were restored are showing as "Running"

**NOTE - IMPORTANT -- Update configuration to point to the new DB Instance**

It is **CRITICALLY IMPORTANT** to update your configuration to point to the newly restored DB Instance. Failure to complete this step **WILL CAUSE DATA CORRUPTION**.

Please update the configuration in the "Meta Database" section of the Admin Interface to point to the newly restored DB instance.

If your new cluster is routed via a different ingress URL, please also update this in the "Network" section of the configuration.

Wait for the platform to come back online before proceeding to the next steps. All Deployments and StatefulSets should be completely stood up (eg: all Pods should be ready and "Running") and all Jobs should be "Completed".
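A simple way to check this from the command line:

```bash
# All pods should be Running/Completed and all jobs should show full completions.
kubectl get pods -n arthurai
kubectl get jobs -n arthurai
```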


### Restore ClickHouse Data

The Arthur Platform ships with a Kubernetes CronJob that executes a ClickHouse restore that is scheduled to never run (eg: it's scheduled to run on February 31st, which is an invalid date).

To run a ClickHouse data restore, execute the following commands:

First, we need to get the name of the clickhouse-backup that coincides with the kafka/enrichments/workflow backups that you are restoring:

```bash
$ kubectl get pods | grep olap
chi-olap-installation-arthur-0-0-0                           2/2     Running     0          6d23h
chi-olap-installation-arthur-0-1-0                           2/2     Running     0          6d23h
chi-olap-installation-arthur-0-2-0                           2/2     Running     0          6d23h
olap-installation-zookeeper-0                                3/3     Running     0          6d23h
olap-installation-zookeeper-1                                3/3     Running     0          6d22h
olap-installation-zookeeper-2                                3/3     Running     0          6d23h
olap-operator-8c867ddff-kc22x                                2/2     Running     0          6d23h
$ kubectl exec chi-olap-installation-arthur-0-0-0 -c backup -- clickhouse-backup list
2022/10/04 16:03:20.255378  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER'
2022/10/04 16:03:20.257582  info SELECT * FROM system.disks;
2022/10/04 16:03:20.267149  info SELECT count() AS is_macros_exists FROM system.tables WHERE database='system' AND name='macros'
2022/10/04 16:03:20.271359  info SELECT * FROM system.macros
chi-olap-installation-arthur-0-0-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote      tar, regular
chi-olap-installation-arthur-0-1-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote      tar, regular
chi-olap-installation-arthur-0-2-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote      tar, regular
```

Note that the backup names in the output above follow this format:
`$CLICKHOUSE_NODE_NAME-$ARTHUR_BACKUP_NAME`

So for example, `chi-olap-installation-arthur-0-0-arthur-clickhouse-backup-2022-09-26-19-50-43` can be parsed into:
* clickhouse node name: `chi-olap-installation-arthur-0-0`
* arthur backup name: `arthur-clickhouse-backup-2022-09-26-19-50-43`

Now we need to create the restoration job, and configure it to use the Arthur Backup Name from above:

```bash
$ kubectl create job --from=cronjob/clickhouse-restore-cronjob -o yaml clickhouse-restore --dry-run=client --save-config > clickhouse-restore.yaml

# update the value of the `BACKUP_NAME` environment variable in the `clickhouse-restore.yaml` file
# eg:
#        - name: BACKUP_NAME
#          value: "arthur-clickhouse-backup-2022-09-26-19-50-43"

$ kubectl apply -f clickhouse-restore.yaml
job.batch/clickhouse-restore created
```
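To monitor the restore job and confirm it finished:

```bash
# Wait for the ClickHouse restore job to complete, then review its logs.
kubectl wait job/clickhouse-restore -n arthurai --for=condition=complete --timeout=30m
kubectl logs job/clickhouse-restore -n arthurai | tail
```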

### Restore the Kafka Deployment State and Persistent Volumes

The Arthur Platform restores Kafka Deployment State and PVs using Velero.

To execute a restore, run the following commands:

```bash
$ velero backup get -n arthurai | grep kafka
NAME                                                    STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-kafka           Completed   0        0          2022-09-23 11:24:37 -0400 EDT   27d       default            app in (cp-kafka,cp-zookeeper)
$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-kafka" \
    --namespace arthurai \
    --restore-volumes=true \
    --existing-resource-policy=update
```

Velero will update the Pod specs to point to PVs created from the EBS Volume Snapshots and restore the Kubernetes resources associated with Kafka.

### Restore Enrichments infrastructure

The Arthur Platform uses Velero to restore the Enrichments infrastructure.

To restore, run the following commands:

```bash
$ velero backup get -n arthurai | grep enrichments
NAME                                                    STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-enrichments     Completed   0        0          2022-09-23 11:24:33 -0400 EDT   27d       default            component in (kafka-mover-init-connector,model_server)
$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-enrichments" \
    --namespace arthurai
```

### Restore Workflows

Restoring workflows is a 2-step process:

1. Restore the workflows from the Velero backup
2. Restore Batch Workflows which are recoverable using an Arthur Admin Endpoint

To restore workflows using the Velero backup, run the following commands:

```bash
$ velero backup get -n arthurai | grep workflows
NAME                                                    STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-workflows       Completed   0        0          2022-09-23 11:24:35 -0400 EDT   27d       default            <none>
$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-workflows" \
    --namespace arthurai
```

To restore Batch Workflows, run the following commands.

In one terminal window, port-forward to the dataset-service:

```bash
$ kubectl get pods | grep dataset-service
arthurai-dataset-service-69dd979d8c-72t7j                         1/1     Running            0          2d17h
$ kubectl port-forward -n arthurai arthurai-dataset-service-69dd979d8c-72t7j 7899
```

Then, in another terminal window, run the following commands:

```bash
$ curl -k -XPOST https://localhost:7899/api/v1/workflows/batch/recover
{"message":"success"}
```

### Smoke Tests and Validation

The restore should now be complete. All data should be restored and in a consistent state, as it was when the backup was taken. Any data sent during or after the backup will need to be re-sent. Perform any validation or smoke tests to ensure that the platform is operating correctly.

## Appendix

### Running the Velero CLI

Velero provides a Command-Line Interface (CLI) for taking backups and performing restores. The command line interface can be installed locally, or it can be invoked by `kubectl exec` on the Velero Backup Controller pod.

#### Local Installation

Read through the Velero Documentation for how to install velero on your platform:
https://velero.io/docs/v1.9/basic-install/#install-the-cli

Velero uses your KUBECONFIG file to connect to the cluster.

```
$ velero --help
Velero is a tool for managing disaster recovery, specifically for Kubernetes
cluster resources. It provides a simple, configurable, and operationally robust
way to back up your application state and associated data.

If you're familiar with kubectl, Velero supports a similar model, allowing you to
execute commands such as 'velero get backup' and 'velero create schedule'. The same
operations can also be performed as 'velero backup get' and 'velero schedule create'.

Usage:
  velero [command]

Available Commands:
  backup            Work with backups
  backup-location   Work with backup storage locations
  bug               Report a Velero bug
  client            Velero client related commands
  completion        Generate completion script
  create            Create velero resources
  debug             Generate debug bundle
  delete            Delete velero resources
  describe          Describe velero resources
  get               Get velero resources
  help              Help about any command
  install           Install Velero
  plugin            Work with plugins
  restic            Work with restic
  restore           Work with restores
  schedule          Work with schedules
  snapshot-location Work with snapshot locations
  uninstall         Uninstall Velero
  version           Print the velero version and associated image

Flags:
      --add_dir_header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --colorized optionalBool           Show colored output in TTY. Overrides 'colorized' value from $HOME/.config/velero/config.json if present. Enabled by default
      --features stringArray             Comma-separated list of features to enable for this Velero process. Combines with values from $HOME/.config/velero/config.json if present
  -h, --help                             help for velero
      --kubeconfig string                Path to the kubeconfig file to use to talk to the Kubernetes apiserver. If unset, try the environment variable KUBECONFIG, as well as in-cluster configuration
      --kubecontext string               The context to use to talk to the Kubernetes apiserver. If unset defaults to whatever your current-context is (kubectl config current-context)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --log_file string                  If non-empty, use this log file
      --log_file_max_size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                      log to standard error instead of files (default true)
  -n, --namespace string                 The namespace in which Velero should operate (default "velero")
      --skip_headers                     If true, avoid header prefixes in the log messages
      --skip_log_headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

Use "velero [command] --help" for more information about a command.
```

#### Executing on the Velero Backup Controller Pod

```
$ kubectl exec velero-699dc869d4-r24bh -c velero -- /velero
Velero is a tool for managing disaster recovery, specifically for Kubernetes
cluster resources. It provides a simple, configurable, and operationally robust
way to back up your application state and associated data.

If you're familiar with kubectl, Velero supports a similar model, allowing you to
execute commands such as 'velero get backup' and 'velero create schedule'. The same
operations can also be performed as 'velero backup get' and 'velero schedule create'.

Usage:
  velero [command]

Available Commands:
  backup            Work with backups
  backup-location   Work with backup storage locations
  bug               Report a Velero bug
  client            Velero client related commands
  completion        Generate completion script
  create            Create velero resources
  debug             Generate debug bundle
  delete            Delete velero resources
  describe          Describe velero resources
  get               Get velero resources
  help              Help about any command
  install           Install Velero
  plugin            Work with plugins
  restic            Work with restic
  restore           Work with restores
  schedule          Work with schedules
  snapshot-location Work with snapshot locations
  uninstall         Uninstall Velero
  version           Print the velero version and associated image

Flags:
      --add_dir_header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --colorized optionalBool           Show colored output in TTY. Overrides 'colorized' value from $HOME/.config/velero/config.json if present. Enabled by default
      --features stringArray             Comma-separated list of features to enable for this Velero process. Combines with values from $HOME/.config/velero/config.json if present
  -h, --help                             help for velero
      --kubeconfig string                Path to the kubeconfig file to use to talk to the Kubernetes apiserver. If unset, try the environment variable KUBECONFIG, as well as in-cluster configuration
      --kubecontext string               The context to use to talk to the Kubernetes apiserver. If unset defaults to whatever your current-context is (kubectl config current-context)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --log_file string                  If non-empty, use this log file
      --log_file_max_size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                      log to standard error instead of files (default true)
  -n, --namespace string                 The namespace in which Velero should operate (default "arthurai")
      --skip_headers                     If true, avoid header prefixes in the log messages
      --skip_log_headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

Use "velero [command] --help" for more information about a command.
```

### Working with Velero - Backup

To take a backup of Arthur, you would invoke the CLI as follows:

```bash
$ velero backup create arthur-backup-$(date -Iseconds | tr ":" "." | tr "T" "t") \
    --namespace=arthurai \
    --include-namespaces=arthurai
```

The command above will create a Velero Backup Resource named `arthur-backup-<timestamp>`, which you can check using the Velero CLI:

```bash
$ velero backup get -n arthurai
NAME                                      STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-08-25t09.59.35-04.00   Completed    0        0          2022-08-25 09:59:35 -0400 EDT   28d       default            <none>
arthur-backup-2022-08-26t10.37.52-04.00   InProgress   0        0          2022-08-26 10:37:52 -0400 EDT   29d       default            <none>
```

For debugging a backup, you can access the backup's logs using the Velero CLI:

```bash
$ velero backup logs arthur-backup-2022-08-25t09.59.35-04.00 -n arthurai | head
time="2022-08-25T13:59:35Z" level=info msg="Setting up backup temp file" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:587"
time="2022-08-25T13:59:35Z" level=info msg="Setting up plugin manager" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:594"
time="2022-08-25T13:59:35Z" level=info msg="Getting backup item actions" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:598"
time="2022-08-25T13:59:35Z" level=info msg="Setting up backup store to check for backup existence" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:608"
time="2022-08-25T13:59:36Z" level=info msg="Writing backup version file" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:192"
time="2022-08-25T13:59:36Z" level=info msg="Including namespaces: arthurai" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:198"
time="2022-08-25T13:59:36Z" level=info msg="Excluding namespaces: <none>" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:199"
time="2022-08-25T13:59:36Z" level=info msg="Including resources: *" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:202"
time="2022-08-25T13:59:36Z" level=info msg="Excluding resources: <none>" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:203"
time="2022-08-25T13:59:36Z" level=info msg="Backing up all pod volumes using Restic: false" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:204"

<truncated>
```

Finally, you can get an overview of the backup using the Velero CLI:

```bash
$ velero backup describe arthur-backup-2022-08-25t09.59.35-04.00 -n arthurai
Name:         arthur-backup-2022-08-25t09.59.35-04.00
Namespace:    arthurai
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.21.14-eks-18ef993
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=21+

Phase:  Completed

Errors:    0
Warnings:  0

Namespaces:
  Included:  arthurai
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0
Expiration:  2022-09-24 09:59:35 -0400 EDT

Total items to be backed up:  654
Items backed up:              654

Velero-Native Snapshots:  3 of 3 snapshots completed successfully (specify --details for more information)

Restic Backups (specify --details for more information):
  Completed:  23
```


### Working with Velero - Restore

Similar to backup, restore happens using the Velero CLI. A restore takes a Backup object and then executes the restore procedure.

You can execute a restore with the following Velero CLI command:

```bash
$ velero restore create \
    --from-backup arthur-backup-2022-08-25t09.59.35-04.00 \
    --namespace arthurai \
    --restore-volumes=true
```

Just like with the Backup, Velero will create a Restore Velero Resource, which you can inspect with the Velero CLI:

```bash
$ velero restore get -n arthurai
NAME                                                     BACKUP                                    STATUS            STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
arthur-backup-2022-08-26t10.37.52-04.00-20220826110520   arthur-backup-2022-08-26t10.37.52-04.00   Completed         2022-08-26 11:06:16 -0400 EDT   2022-08-26 11:08:00 -0400 EDT   0        35         2022-08-26 11:05:21 -0400 EDT   <none>
```