# Arthur Platform Backup and Restore

## Contents

* Warnings
* Overview
  * Overview - clickhouse-backup
  * Overview - Velero
  * Overview - Arthur (Argo) Workflows
  * Overview - S3
* Pre-requisites
* Installing - Velero
  * Setup ServiceAccount with IAM roles for Backup S3 Bucket
  * Setup CRDs + Cluster-level permissions + Backup Infrastructure
  * Confirm Velero is installed and configured correctly
  * Configure the Backup Storage Destination to Point to S3
* Configuring - clickhouse-backup remote storage
* Backing up
* Restoring
* Appendix
  * Running the Velero CLI
  * Working with Velero - Backup
  * Working with Velero - Restore
  * Backup Architecture

## WARNINGS

### PLEASE READ - FOLLOW THESE INSTRUCTIONS EXACTLY

These instructions have been tested as written. If you find they do not work for your use case, please reach out to Arthur Support before modifying them. We cannot guarantee reliable operation if these instructions are not followed exactly as written.

### PLEASE READ - TAKE CARE WHEN RESTORING INTO A NEW CLUSTER

When restoring into a new cluster, you must ensure that the new cluster is unable to communicate with any services or data stores in the old cluster. If you took a backup on cluster "Apple" and performed a restore into cluster "Banana", cluster "Banana" must point to its own RDS instance, ClickHouse database, and Kafka store (note: it is OK if clusters share an S3 bucket, but not ideal). To ensure this, you must re-configure via the Admin Interface when restoring into a new cluster. Failure to do this **WILL CAUSE DATA CORRUPTION** on both clusters that is unrecoverable.

### PLEASE READ - ENSURE CONSISTENCY WITH BACKUPS

Whether you are taking a manual backup or scheduling one, you **MUST** take a backup of the full platform. You **CANNOT** use a ClickHouse snapshot taken at midnight with an RDS snapshot taken at 04:00 (or any other time). All backup operations must be performed at the same time, and when restoring, the data you are using must all belong to the same backup operation. This ensures data consistency across the different data stores. **IGNORING THIS WILL CAUSE DATA CORRUPTION**.

## Overview

The overall backup and restore process for the Arthur Platform is as follows:

* Backup a cluster
  * Take a backup of ClickHouse Data
  * Take a backup of Kubernetes Deployment State and Persistent Volumes
    * Enrichments infrastructure
      * Model Servers
      * Data Pipeline Services
      * Enrichment / Delete Enrichment Workflows
    * Kafka Deployment State and EBS Volumes (using EBS Snapshots)
  * Take a backup of RDS Postgres
* Restore the cluster
  * Restore RDS Postgres
  * Update configuration and install the platform
  * Restore ClickHouse Data
  * Restore the Kafka Deployment State and Persistent Volumes
  * Restore Enrichments infrastructure
  * Restore Workflows
  * Smoke Tests and Validation

### Overview - clickhouse-backup

The Arthur Platform stores inference data, data built from the enrichments pipeline, and reference and ground truth data in ClickHouse. ClickHouse is an open-source OLAP database which enables SQL-like query execution, replication, sharding, and many additional features.

To back up ClickHouse, the Arthur Platform uses a tool called [clickhouse-backup](https://github.com/AlexAkulov/clickhouse-backup). clickhouse-backup is a sidecar container included on the ClickHouse pods and is responsible for taking backups, performing restores, and coordinating with remote storage (in this case S3) to store and retrieve backups. clickhouse-backup uses built-in functionality of ClickHouse to take backups and perform restores.
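As a quick orientation, the sidecar can be queried directly with `kubectl exec`; a minimal sketch (the pod and container names are illustrative here, and match the naming used in the restore section later in this guide):

```bash
# Ask the clickhouse-backup sidecar (the "backup" container) which local
# and remote backups it currently knows about.
kubectl exec chi-olap-installation-arthur-0-0-0 -n arthurai -c backup -- \
    clickhouse-backup list
```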
### Overview - Velero

The Arthur Platform uses [Velero](https://velero.io/) as its backup and restore tool. Velero is an industry-standard, battle-tested tool for backing up Kubernetes resources as well as Persistent Volumes. Arthur uses Velero to back up most namespaced Kubernetes resources, as well as EBS Volume Snapshots for each PersistentVolume claimed by the StatefulSets (eg: via PVCs). Backup data (not including EBS Volume Snapshots) is stored in an S3 bucket which is accessible via a ServiceAccount that is provisioned for the backup and restore agent.

Backups and restores are managed by Velero using Kubernetes Custom Resource Definitions (CRDs), which are consumed by the Velero Backup Controller and Restic agents. Velero also allows backups to be scheduled using a cron-like configuration, and it provides `ServiceMonitors` which expose metrics via Prometheus, so that operators can monitor backup and restore status and set up alerts for when backups or restores fail.
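For example, a nightly schedule could be expressed as follows; this is a generic Velero sketch (the schedule name and TTL are illustrative, not part of the Arthur installation):

```bash
# Create a Schedule resource that backs up the arthurai namespace every
# day at midnight and retains each backup for 30 days.
velero schedule create arthurai-nightly \
    --schedule="0 0 * * *" \
    --include-namespaces arthurai \
    --ttl 720h0m0s
```

Note that, per the consistency warning above, a Velero schedule alone does not cover the ClickHouse or RDS backups; all stores must be backed up together.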
### Overview - Arthur (Argo) Workflows

The Arthur Platform uses [Argo Workflows](https://argoproj.github.io/argo-workflows/) as a workflow orchestration engine for running certain jobs. Argo installs a handful of Custom Resource Definitions (CRDs) which enable the Argo Workflow services to schedule, execute, and update these jobs.

Workflows are dynamically managed, meaning that their definitions are not stored in the Arthur installer script. The backup and restore operation accounts for this by treating restoration of Workflows on a case-by-case basis, as follows:

* Enrichments and Delete Enrichments workflows
  * These workflows create and tear down the infrastructure necessary for processing enrichments data (eg: Kafka topics, pods which manage the data pipeline for enrichments, etc.)
  * These workflows are idempotent and safe to recover
  * Therefore, these workflows are backed up and restored just like any other Kubernetes resource during the backup stage
* Batch workflows
  * These workflows manage batch jobs, which are used by clients when uploading large data files to models (inferences and/or ground truths)
  * These workflows are sometimes safe to recover
  * Therefore, these workflows are restored selectively based on what state they were in when the backup was taken
    * Workflows for which Arthur received all of the data from the client are resumed by manually re-submitting them (this is done via an administrative HTTP endpoint that needs to be called manually)
    * Workflows for which Arthur did not receive all of the data from the client will need to be re-submitted. Operators restoring the cluster will need to reach out to affected clients to communicate that their batch workflows should be re-submitted.
* Reference and Cron Workflows
  * Reference Workflows are created for monitoring the upload of reference datasets to S3
    * Reference datasets that were in-flight during a backup will need to be re-uploaded with the SDK
  * Cron Workflows are scheduled workflows which perform some regular processing (eg: triggering alerts for non-batch inferences)
    * Cron Workflows are meant to run on a regular schedule. It is safe to wait for the next workflow to be triggered, and therefore these workflows are neither backed up nor restored.

### Overview - S3

The Arthur Platform uses AWS S3 as object storage for inference data, reference data, and the data and trained models for the enrichments pipeline. Arthur recommends ensuring that the AWS S3 bucket used for this storage is configured with Live Cross-Region Replication so that objects remain available in the event of an AWS region outage.

The Arthur backup solution does not manage consistency between the S3 bucket and the other backup data. The data in S3 is only used in conjunction with data that is stored in Postgres (eg: model definitions), so it is acceptable for S3 to contain data that is not represented in Postgres. Therefore, the S3 bucket for a cluster will always reflect the most up-to-date state, regardless of when a backup was taken.

To read more about S3 Bucket Replication, see the AWS Documentation:

* https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html

## Pre-requisites

The following items must be configured exactly as described in order to use Arthur's backup and restore capabilities:

1. Arthur must be configured using external object storage, specifically S3
2. Access to external storage must be configured using IRSA Annotations
3. In order to use IRSA Annotations, the cluster must be deployed using Amazon EKS

If any of these are not true/possible for your deployment, please reach out to Arthur Support so we can discuss.

## Installing - Velero

The only component that needs to be installed separately from Arthur to perform backups and restores is Velero. Below, instructions are provided for setting up Velero to store backups in S3 using secret and access keys. The general overview of the installation is as follows:

1. Create the Velero configuration
   1. Create the policy for accessing the S3 bucket and taking EBS Snapshots, and attach it to an IAM user
   2. Generate the secret and access keys for the IAM user
   3. Create a Velero-specific credentials file
2. Install Velero
3. Confirm Velero is installed and configured correctly
4. Configure the Backup Storage Destination to point to S3

### Create the Velero Configuration

The instructions here are taken from the Velero AWS Plugin documentation, which can be found under Option 1 here: https://github.com/vmware-tanzu/velero-plugin-for-aws#setup

1. Create the IAM user:

   ```bash
   aws iam create-user --user-name velero
   ```

   If you'll be using Velero to back up multiple clusters with multiple S3 buckets, it may be desirable to create a unique username per cluster rather than the default `velero`.

2. Attach policies to give `velero` the necessary permissions:

   ```bash
   cat > velero-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ec2:DescribeVolumes",
                   "ec2:DescribeSnapshots",
                   "ec2:CreateTags",
                   "ec2:CreateVolume",
                   "ec2:CreateSnapshot",
                   "ec2:DeleteSnapshot"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:DeleteObject",
                   "s3:PutObject",
                   "s3:AbortMultipartUpload",
                   "s3:ListMultipartUploadParts"
               ],
               "Resource": [
                   "arn:aws:s3:::${BUCKET}/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket"
               ],
               "Resource": [
                   "arn:aws:s3:::${BUCKET}"
               ]
           }
       ]
   }
   EOF
   ```

   ```bash
   aws iam put-user-policy \
     --user-name velero \
     --policy-name velero \
     --policy-document file://velero-policy.json
   ```

3. Create an access key for the user:

   ```bash
   aws iam create-access-key --user-name velero
   ```

   The result should look like:

   ```
   {
       "AccessKey": {
           "UserName": "velero",
           "Status": "Active",
           "CreateDate": "2017-07-31T22:24:41.576Z",
           "SecretAccessKey": <AWS_SECRET_ACCESS_KEY>,
           "AccessKeyId": <AWS_ACCESS_KEY_ID>
       }
   }
   ```

4. Create a Velero-specific credentials file (`credentials-velero`) in your local directory:

   ```
   [default]
   aws_access_key_id=<AWS_ACCESS_KEY_ID>
   aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
   ```

   where the access key id and secret are the values returned from the `create-access-key` request.

### Install Velero

To install Velero, first install the Velero CLI. Instructions for how to do this can be found on the Velero documentation site: https://velero.io/docs/v1.10/basic-install/#install-the-cli

Once the Velero CLI is installed, you can use it to install Velero on the cluster. Please ensure that your kubeconfig is pointing to the cluster, as this is what is used by the Velero CLI to communicate with the cluster.

To install Velero, use the following command, ensuring that the path to the credentials file you generated in the last step is provided correctly:

```
velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.6.0 \
    --bucket $BUCKET \
    --backup-location-config region=$REGION \
    --snapshot-location-config region=$REGION \
    --secret-file ./credentials-velero
```

### Confirm Velero is installed and configured correctly

To confirm that Velero is installed and configured correctly:

1. Open the Kots Admin Interface and navigate to the "Snapshots" tab
2. Click the "Check for Velero" button (see the screenshot below)

### Configure the Backup Storage Destination to Point to S3

The final step in configuring Velero is to configure the Backup Storage Destination to point to the S3 bucket where the backups will be stored. To do this, add a new "Backup Storage Destination" in the Admin Interface and fill in the details for the S3 bucket and for the secret and access keys.

## Configuring - clickhouse-backup remote storage

Before continuing, please ensure that your cluster is set up and configured as described in the "Pre-requisites" section above.

Configuring clickhouse-backup to store backups in remote storage (eg: S3) can be done in the Kots Admin Interface. If you've performed the configuration steps as mentioned in the "Pre-requisites" section, you should see the "Enable Olap Database Backup Capabilities" option in the "Olap Database" section (see screenshot).

Ensure that:

1. The configuration that points to the bucket is correct
   * The Bucket Name
   * The Bucket Region
2. The ServiceAccount is the same ServiceAccount that you've configured with the IRSA Annotation (if you are not sure, enter the default value)
3. The IAM Role that you are using for the IRSA Annotation has the appropriate permissions to read/write/list from the S3 bucket
   * Note - you can use the same permissions that are described in the Velero section; just be sure to update the bucket ARN that the permissions apply to
4. The S3 Path is where you want backups to be stored

Once you've configured clickhouse-backup, you can validate that the configuration is correct by taking a backup. Refer to the "Take a backup of ClickHouse Data" section below on how to do this.
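If you're unsure whether the ServiceAccount carries the IRSA annotation referenced above, it can be read directly off the cluster; a minimal sketch (the ServiceAccount name is illustrative — use the one from your configuration):

```bash
# Print the IAM role bound to the ServiceAccount via IRSA. An empty result
# means the eks.amazonaws.com/role-arn annotation is missing.
kubectl get serviceaccount arthurai -n arthurai \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```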
## Backing up

Note that you can use the following script to run all of these steps together. Read the `Appendix: Running the Velero CLI` for instructions on how to run Velero.

```bash
#!/bin/bash

set -euo pipefail
IFS=$'\n\t'

# You need to configure this by getting the name of the storage location
# using the velero CLI
# eg: `velero backup-location get`
storage_location="Put your storage location here"

backup_date=$(date +%Y-%m-%d-%H-%M-%S)
name=arthur-backup-$backup_date
echo "Creating a new backup with name $name"

echo "Taking a backup of CH data"
kubectl create job $name-clickhouse-backup \
    --from=cronjob/clickhouse-backup-cronjob
ch_backup_jobname=$(kubectl get jobs -o name | grep "$name-clickhouse-backup")
kubectl wait $ch_backup_jobname \
    --for=condition=complete \
    --timeout=30m

echo "Taking a backup of the enrichments infrastructure"
velero backup create $name-enrichments \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='component in (kafka-mover-init-connector, model_server)' \
    --include-resources=deployments,services \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of workflows"
velero backup create $name-workflows \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --include-resources=workflows \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of Kafka/Kafka-ZK StatefulSets, their EBS Volumes, and related components"
velero backup create $name-stateful-sets \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='app in (cp-zookeeper,cp-kafka)' \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io \
    --storage-location=$storage_location \
    --wait

echo "Taking a backup of the RDS database"
aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier $name-snapshot \
    --db-cluster-identifier RDS_DB_NAME \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```

### Take a backup of ClickHouse Data

By default, the Arthur Platform ships with a Kubernetes CronJob which takes a ClickHouse backup each day at midnight. Please see the warning `PLEASE READ - ENSURE CONSISTENCY WITH BACKUPS` at the top if attempting to take a manual backup.

To take a manual backup of ClickHouse data, you can run the following commands:

```bash
$ kubectl get cronjobs -n arthurai
NAME                              SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
arthurai-cron-workflow-cron-job   1 0 * * *   False     0        14h             2d18h
arthurai-model-health-cron-job    5 * * * *   False     0        20m             2d18h
clickhouse-backup-cronjob         0 0 * * *   False     0        14h             2d18h

$ kubectl create job clickhouse-backup --from=cronjob/clickhouse-backup-cronjob -n arthurai
job.batch/clickhouse-backup created

$ kubectl get jobs -n arthurai
NAME                                 COMPLETIONS   DURATION   AGE
clickhouse-backup-cronjob-27735840   1/1           8m35s      14m
```
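If you want the manual backup to block until it finishes, the same `kubectl wait` pattern from the combined script above applies; a small sketch using the job name created above:

```bash
# Block until the manual ClickHouse backup job completes (the 30m timeout
# mirrors the one used by the combined backup script).
kubectl wait job/clickhouse-backup -n arthurai \
    --for=condition=complete \
    --timeout=30m
```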
### Take a backup of Enrichments infrastructure and Enrichment / Delete Enrichment Workflows

The Arthur Platform uses Velero to take a backup of the Enrichments infrastructure, as well as the Enrichments workflows. The backup is orchestrated manually and requires running a command. The Enrichments infrastructure and the Enrichment Workflows are handled as separate backups, so two separate commands are required.

To take a manual backup of the Enrichments infrastructure, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S)
$ name=arthur-backup-$backup_date
$ velero backup create $name-enrichments \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='component in (kafka-mover-init-connector, model_server)' \
    --include-resources=deployments,services \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait
```

To take a manual backup of the Enrichments Workflows, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S)
$ name=arthur-backup-$backup_date
$ velero backup create $name-workflows \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --include-resources=workflows \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io,secrets,configmaps \
    --storage-location=$storage_location \
    --wait
```

### Take a backup of Kafka Deployment State and EBS Volumes (using EBS Snapshots)

The Arthur Platform uses Velero to take a backup of the Kafka deployment state and EBS Volumes. The backup is orchestrated manually and requires running a command.

To take a manual backup of Kafka, run the following commands:

```bash
$ backup_date=$(date +%Y-%m-%d-%H-%M-%S)
$ name=arthur-backup-$backup_date
$ velero backup create $name-kafka \
    --namespace=arthurai \
    --include-namespaces=arthurai \
    --selector='app in (cp-zookeeper,cp-kafka)' \
    --exclude-resources=clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io,controllerrevisions.apps,endpointslices.discovery.k8s.io,customresourcedefinitions.apiextensions.k8s.io \
    --storage-location=$storage_location \
    --wait
```

If you are unable to run the Velero CLI locally, you can execute it from the Velero container. See the section **Executing on the Velero Backup Controller Pod** for more details.
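Before moving on, it is worth confirming each backup finished cleanly; a quick sketch using the Kafka backup name from above (the same check applies to the enrichments and workflows backups):

```bash
# Phase should be Completed with zero errors; --details lists every
# resource and volume snapshot included in the backup.
velero backup describe $name-kafka --namespace arthurai --details
```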
### Take a backup of RDS Postgres

The script above provides the command line instructions for taking a backup of an RDS database (also copied here). Please ensure that the values for the `db-cluster-identifier`, `profile`, and `region` are filled in correctly. Note that this command applies only to a Multi-AZ RDS DB cluster. If you are using a single RDS DB instance, the command to use is `aws rds create-db-snapshot`.

```bash
aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier $name-snapshot \
    --db-cluster-identifier RDS_DB_NAME \
    --profile AWS_PROFILE_NAME \
    --region AWS_REGION
```

For more information, please refer to the AWS Documentation:

* Multi-AZ RDS DB cluster:
  * https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateMultiAZDBClusterSnapshot.html
  * https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-cluster-snapshot.html
* Single RDS DB instance:
  * https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
  * https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-snapshot.html

## Restoring

### Restore the RDS Postgres

To restore RDS Postgres, use the following resource: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromMultiAZDBClusterSnapshot.Restoring.html

Please ensure that you correctly match the following configuration from the DB from which the snapshot was taken:

* The connection port
* The VPC and Security Group IDs
* DB Subnet Group
* DB Instance Type
* Any other configuration which might be overridden

This operation might take a while, and the DB must show as "Available" before installing the platform.

### Update configuration and install the platform

**NOTE - Before completing this action, please check the following:**

1. The restored RDS DB instance is showing as "Available"
2. All pods which were restored are showing as "Running"

**NOTE - IMPORTANT -- Update configuration to point to the new DB Instance**

It is **CRITICALLY IMPORTANT** to update your configuration to point to the newly restored DB instance. Failure to complete this step **WILL CAUSE DATA CORRUPTION**. Please update the configuration in the "Meta Database" section of the Admin Interface to point to the newly restored DB instance. If your new cluster is routed via a different ingress URL, please also update this in the "Network" section of the configuration.

Wait for the platform to come back online before proceeding to the next steps. All Deployments and StatefulSets should be completely stood up (eg: all Pods should be ready and "Running") and all Jobs should be "Completed".
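A minimal sketch of these checks, assuming the AWS CLI v2 `db-cluster-available` waiter and using the placeholder cluster identifier from the backup script:

```bash
# Block until the restored DB cluster reports Available
# (RDS_DB_NAME is a placeholder for your restored cluster identifier).
aws rds wait db-cluster-available --db-cluster-identifier RDS_DB_NAME

# Workloads should show Running/Ready, and jobs should show Completed.
kubectl get deployments,statefulsets,jobs -n arthurai
```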
### Restore ClickHouse Data

The Arthur Platform ships with a Kubernetes CronJob that executes a ClickHouse restore and that is scheduled to never run (eg: it's scheduled to run on February 31st, which is an invalid date).

To run a ClickHouse data restore, execute the following commands.

First, get the name of the ClickHouse backup that coincides with the kafka/enrichments/workflow backups that you are restoring:

```bash
$ kubectl get pods | grep olap
chi-olap-installation-arthur-0-0-0   2/2   Running   0   6d23h
chi-olap-installation-arthur-0-1-0   2/2   Running   0   6d23h
chi-olap-installation-arthur-0-2-0   2/2   Running   0   6d23h
olap-installation-zookeeper-0        3/3   Running   0   6d23h
olap-installation-zookeeper-1        3/3   Running   0   6d22h
olap-installation-zookeeper-2        3/3   Running   0   6d23h
olap-operator-8c867ddff-kc22x        2/2   Running   0   6d23h

$ kubectl exec chi-olap-installation-arthur-0-0-0 -c backup -- clickhouse-backup list
2022/10/04 16:03:20.255378 info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER'
2022/10/04 16:03:20.257582 info SELECT * FROM system.disks;
2022/10/04 16:03:20.267149 info SELECT count() AS is_macros_exists FROM system.tables WHERE database='system' AND name='macros'
2022/10/04 16:03:20.271359 info SELECT * FROM system.macros
chi-olap-installation-arthur-0-0-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote   tar, regular
chi-olap-installation-arthur-0-1-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote   tar, regular
chi-olap-installation-arthur-0-2-arthur-clickhouse-backup-2022-09-26-19-50-43   3.64MiB   02/10/2022 00:00:16   remote   tar, regular
```

Note that the backup names above are in the following format: `$CLICKHOUSE_NODE_NAME-$ARTHUR_BACKUP_NAME`

So, for example, `chi-olap-installation-arthur-0-0-arthur-clickhouse-backup-2022-09-26-19-50-43` can be parsed into:

* clickhouse node name: `chi-olap-installation-arthur-0-0`
* arthur backup name: `arthur-clickhouse-backup-2022-09-26-19-50-43`

Now create the restoration job and configure it to use the Arthur backup name from above:

```bash
$ kubectl create job --from=cronjob/clickhouse-restore-cronjob -o yaml clickhouse-restore --dry-run=client --save-config > clickhouse-restore.yaml

# update the value of the `BACKUP_NAME` environment variable in the `clickhouse-restore.yaml` file
# eg:
#   - name: BACKUP_NAME
#     value: "arthur-clickhouse-backup-2022-09-26-19-50-43"

$ kubectl apply -f clickhouse-restore.yaml
job.batch/clickhouse-restore created
```

### Restore the Kafka Deployment State and Persistent Volumes

The Arthur Platform restores the Kafka deployment state and PVs using Velero. To execute a restore, run the following commands:

```bash
$ velero backup get -n arthurai | grep kafka
NAME                                            STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-kafka   Completed   0        0          2022-09-23 11:24:37 -0400 EDT   27d       default            app in (cp-kafka,cp-zookeeper)

$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-kafka" \
    --namespace arthurai \
    --restore-volumes=true \
    --existing-resource-policy=update
```

Velero will update the Pod specs to point to the PVs restored from the EBS Volume Snapshots, and will restore the Kubernetes resources associated with Kafka.
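Restores run asynchronously; one way to watch progress, sketched with a placeholder restore name:

```bash
# List restores and their phases; a restore name is the backup name plus
# a timestamp suffix (see the appendix for example output).
velero restore get -n arthurai

# Per-resource detail for one restore (<RESTORE_NAME> is a placeholder).
velero restore describe <RESTORE_NAME> -n arthurai --details
```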
### Restore Enrichments infrastructure

The Arthur Platform uses Velero to restore the Enrichments infrastructure. To restore, run the following commands:

```bash
$ velero backup get -n arthurai | grep enrichments
NAME                                                  STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-enrichments   Completed   0        0          2022-09-23 11:24:33 -0400 EDT   27d       default            component in (kafka-mover-init-connector,model_server)

$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-enrichments" \
    --namespace arthurai
```

### Restore Workflows

Restoring workflows is a 2-step process:

1. Restore the workflows from the Velero backup
2. Restore Batch Workflows which are recoverable, using an Arthur Admin endpoint

To restore workflows using the Velero backup, run the following commands:

```bash
$ velero backup get -n arthurai | grep workflows
NAME                                                STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
arthur-backup-2022-09-23t11.23.25-04.00-workflows   Completed   0        0          2022-09-23 11:24:35 -0400 EDT   27d       default            <none>

$ velero restore create \
    --from-backup "arthur-backup-2022-09-23t11.23.25-04.00-workflows" \
    --namespace arthurai
```

To restore Batch Workflows, run the following commands.

In one terminal window, port-forward to the dataset-service:

```bash
$ kubectl get pods | grep dataset-service
arthurai-dataset-service-69dd979d8c-72t7j   1/1   Running   0   2d17h

$ kubectl port-forward -n arthurai arthurai-dataset-service-69dd979d8c-72t7j 7899
```

Then, in another terminal window, run the following command:

```bash
$ curl -k -XPOST https://localhost:7899/api/v1/workflows/batch/recover
{"message":"success"}
```

### Smoke Tests and Validation

The restore should now be complete. All data should be restored and in a consistent state, as it was when the backup was taken. Any data sent during or after the backup will need to be re-sent. Perform any validation/smoke tests to ensure that the platform is operating.

## Appendix

### Running the Velero CLI

Velero provides a command-line interface (CLI) for taking backups and performing restores. The CLI can be installed locally, or it can be invoked via `kubectl exec` on the Velero Backup Controller pod.

#### Local Installation

Read through the Velero documentation for how to install Velero on your platform: https://velero.io/docs/v1.9/basic-install/#install-the-cli

Velero uses your KUBECONFIG file to connect to the cluster.
```
$ velero --help
Velero is a tool for managing disaster recovery, specifically for Kubernetes cluster resources.
It provides a simple, configurable, and operationally robust way to back up your application state and associated data.

If you're familiar with kubectl, Velero supports a similar model, allowing you to execute commands such as
'velero get backup' and 'velero create schedule'. The same operations can also be performed as
'velero backup get' and 'velero schedule create'.

Usage:
  velero [command]

Available Commands:
  backup            Work with backups
  backup-location   Work with backup storage locations
  bug               Report a Velero bug
  client            Velero client related commands
  completion        Generate completion script
  create            Create velero resources
  debug             Generate debug bundle
  delete            Delete velero resources
  describe          Describe velero resources
  get               Get velero resources
  help              Help about any command
  install           Install Velero
  plugin            Work with plugins
  restic            Work with restic
  restore           Work with restores
  schedule          Work with schedules
  snapshot-location Work with snapshot locations
  uninstall         Uninstall Velero
  version           Print the velero version and associated image

Flags:
      --add_dir_header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --colorized optionalBool           Show colored output in TTY. Overrides 'colorized' value from $HOME/.config/velero/config.json if present. Enabled by default
      --features stringArray             Comma-separated list of features to enable for this Velero process. Combines with values from $HOME/.config/velero/config.json if present
  -h, --help                             help for velero
      --kubeconfig string                Path to the kubeconfig file to use to talk to the Kubernetes apiserver. If unset, try the environment variable KUBECONFIG, as well as in-cluster configuration
      --kubecontext string               The context to use to talk to the Kubernetes apiserver. If unset defaults to whatever your current-context is (kubectl config current-context)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --log_file string                  If non-empty, use this log file
      --log_file_max_size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                      log to standard error instead of files (default true)
  -n, --namespace string                 The namespace in which Velero should operate (default "velero")
      --skip_headers                     If true, avoid header prefixes in the log messages
      --skip_log_headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

Use "velero [command] --help" for more information about a command.
```

#### Executing on the Velero Backup Controller Pod

```
$ kubectl exec velero-699dc869d4-r24bh -c velero -- /velero
```
The output is the same help text shown above for the local installation, except that `-n, --namespace` reports the namespace the controller is operating in (`"arthurai"` in this deployment).

### Working with Velero - Backup

To take a backup of Arthur, invoke the CLI as follows:

```bash
$ velero backup create arthur-backup-$(date -Iseconds | tr ":" "." | tr "T" "t") \
    --namespace=arthurai \
    --include-namespaces=arthurai
```
| tr "T" "t") \ --namespace=arthurai \ --include-namespaces=arthurai ``` The command above will create a Velero Backup Resource named `arthur-backup-<timestamp>`, which you can check using the Velero CLI: ```bash $ velero backup get -n arthurai NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR arthur-backup-2022-08-25t09.59.35-04.00 Completed 0 0 2022-08-25 09:59:35 -0400 EDT 28d default <none> arthur-backup-2022-08-26t10.37.52-04.00 InProgress 0 0 2022-08-26 10:37:52 -0400 EDT 29d default <none> ``` For debugging a backup, you can access the backup's logs using the Velero CLI: ```bash $ velero backup logs arthur-backup-2022-08-25t09.59.35-04.00 -n arthurai | head time="2022-08-25T13:59:35Z" level=info msg="Setting up backup temp file" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:587" time="2022-08-25T13:59:35Z" level=info msg="Setting up plugin manager" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:594" time="2022-08-25T13:59:35Z" level=info msg="Getting backup item actions" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:598" time="2022-08-25T13:59:35Z" level=info msg="Setting up backup store to check for backup existence" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/controller/backup_controller.go:608" time="2022-08-25T13:59:36Z" level=info msg="Writing backup version file" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:192" time="2022-08-25T13:59:36Z" level=info msg="Including namespaces: arthurai" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:198" time="2022-08-25T13:59:36Z" level=info msg="Excluding namespaces: <none>" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:199" time="2022-08-25T13:59:36Z" level=info msg="Including resources: *" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:202" time="2022-08-25T13:59:36Z" level=info msg="Excluding resources: <none>" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:203" time="2022-08-25T13:59:36Z" level=info msg="Backing up all pod volumes using Restic: false" backup=arthurai/arthur-backup-2022-08-25t09.59.35-04.00 logSource="pkg/backup/backup.go:204" <truncated> ``` Finally, you can get an overview of the backup using the Velero CLI: ```bash $ velero backup describe arthur-backup-2022-08-25t09.59.35-04.00 -n arthurai Name: arthur-backup-2022-08-25t09.59.35-04.00 Namespace: arthurai Labels: velero.io/storage-location=default Annotations: velero.io/source-cluster-k8s-gitversion=v1.21.14-eks-18ef993 velero.io/source-cluster-k8s-major-version=1 velero.io/source-cluster-k8s-minor-version=21+ Phase: Completed Errors: 0 Warnings: 0 Namespaces: Included: arthurai Excluded: <none> Resources: Included: * Excluded: <none> Cluster-scoped: auto Label selector: <none> Storage Location: default Velero-Native Snapshot PVs: auto TTL: 720h0m0s Hooks: <none> Backup Format Version: 1.1.0 szExpiration: 2022-09-24 09:59:35 -0400 EDT Total items to be backed up: 654 Items backed up: 654 Velero-Native Snapshots: 3 of 3 snapshots completed successfully (specify --details for more information) Restic Backups (specify --details for more information): Completed: 23 ``` ### Working with Velero - Restore Similar to backup, restore happens using the Velero 
A restore takes a Backup object and then executes the restore procedure. You can execute a restore with the following Velero CLI command:

```bash
$ velero restore create \
    --from-backup arthur-backup-2022-08-25t09.59.35-04.00 \
    --namespace arthurai \
    --restore-volumes=true
```

Just like with a backup, Velero will create a Restore resource, which you can inspect with the Velero CLI:

```bash
$ velero restore get -n arthurai
NAME                                                     BACKUP                                    STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
arthur-backup-2022-08-26t10.37.52-04.00-20220826110520   arthur-backup-2022-08-26t10.37.52-04.00   Completed   2022-08-26 11:06:16 -0400 EDT   2022-08-26 11:08:00 -0400 EDT   0        35         2022-08-26 11:05:21 -0400 EDT   <none>
```
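For debugging, restore logs can be pulled the same way as backup logs; a quick sketch using the restore name from the listing above:

```bash
# Stream the logs recorded for a completed (or failed) restore.
velero restore logs arthur-backup-2022-08-26t10.37.52-04.00-20220826110520 -n arthurai
```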