High availability (HA) and disaster recovery (DR)
Vault running on the HashiCorp Cloud Platform (HCP) is fully managed by HashiCorp and provides push-button deployment, managed clusters and upgrades, backups, and monitoring. HCP Vault is designed to avoid downtime wherever possible by using cloud architecture best practices to deliver a highly available environment. HCP Vault’s critical operational infrastructure is hosted on AWS across multiple availability zones (AZs), with data resiliency and on-call support to minimize downtime and support disaster recovery.
This document provides an overview of key built-in capabilities that support recovery efforts in the event of downtime or a disaster:
- 3-Node Highly Available Clusters
- Data Resiliency
- Encryption Key Ownership
- Backup and Recovery Features
- Support Coverage
- Incident Response
- Disaster Recovery Use Cases Not in Scope
Users who choose HCP Vault entrust HashiCorp to manage Disaster Recovery (DR) and High Availability (HA) of the Vault servers. As part of this managed offering, HashiCorp will use commercially reasonable efforts to maximize the availability of HashiCorp Cloud services, and provide uptime guarantees based on service level agreements (SLA). For details on HCP Vault support packages, visit the Enterprise Support website.
3-node HA clusters
All enterprise production-grade HCP Vault clusters (i.e. Starter, Standard, and Plus) consist of 3 High Availability (HA) nodes spanning different AZs within one region. HCP Vault has a number of orchestration mechanisms in place, including integrated monitoring with leading observability platforms, to ensure nodes are health-checked regularly. Unhealthy nodes are quickly identified, triaged, and replaced or remediated as needed. This fault-tolerant architecture allows the cluster to withstand failures of individual nodes, such as an isolated hardware failure or an issue within one of the cloud provider's data centers. Regional outages are discussed in more detail below.
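The health checks described above run inside the managed platform, but a client can observe node state in a similar way. The following is a minimal sketch, not HashiCorp's internal mechanism: it assumes the cluster's public or private address is known (the `base_url` argument is a placeholder you supply) and relies on the default status codes of Vault's `/v1/sys/health` endpoint.

```python
import urllib.error
import urllib.request

# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",               # initialized, unsealed, and active
    429: "standby",              # unsealed standby node
    472: "dr-secondary",         # disaster recovery secondary
    473: "performance-standby",  # performance standby node
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code):
    """Map a sys/health HTTP status code to a short state name."""
    return HEALTH_STATES.get(status_code, "unknown")


def check_cluster(base_url, timeout=5.0):
    """Query sys/health on base_url (an assumed, caller-supplied address)."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes such as 429 or 503 still carry health information.
        return classify_health(err.code)
    except OSError:
        # Covers connection failures and timeouts (URLError is an OSError).
        return "unreachable"
```

A monitoring job could call `check_cluster` on a schedule and alert when the result is anything other than `"active"` or `"standby"`.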
Data resiliency
All HCP Vault nodes have attached encrypted volumes. Automated snapshots are taken daily for production-grade clusters and stored in encrypted blob storage in the control plane. Users can take additional snapshots on demand from the UI. Snapshots currently reside within the US only.
Encryption key ownership
A unique Key Management Service (KMS) cryptographic key is used for automatic unsealing of Vault and encrypting all user snapshots. This key is managed in the user’s dedicated HCP account using the cloud provider’s KMS and is configured to be trusted by the HCP Vault compute instances. This key is managed using carefully crafted, secure policies and all usage is audited. The key is not shared between clusters.
Backup and recovery features
HCP Vault includes several built-in resiliency features in response to outages. This section provides an overview of typical outage scenarios and best practices for users to consider in order to minimize the impact of an outage.
Platform outages
An HCP platform outage does not impact running clusters. The API and UI may be affected, but clusters will remain intact and continue to operate to support established machine-to-machine Vault use cases such as dynamic secrets generation. During a platform outage, snapshots, seal/unseal, and the ability to generate admin tokens will be unavailable. In addition, replication between existing clusters will continue to work, but setting up a new secondary (or any new cluster) will not be available. Production clusters are snapshotted automatically every day, and users can create additional snapshots as often as they choose. Snapshots are stored for up to 30 days after they are created. This retention period is not user configurable.
Best Practice for Admin Tokens: Users can mitigate the risk of not being able to generate admin tokens during a platform outage by setting up appropriate authentication methods in advance, so that access does not depend on the cluster admin token. See our documentation on Authentication methods for more information.
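As an illustration of setting up an authentication method ahead of time, the sketch below builds the HTTP request that enables an auth method through Vault's `sys/auth` API. The choice of `userpass`, the `admin` namespace default, and the address and token values are assumptions made for the example; pick the auth method that suits your organization.

```python
import json
import urllib.request


def build_enable_auth_request(vault_addr, token, auth_type="userpass",
                              namespace="admin"):
    """Describe the request that enables an auth method at its default path.

    vault_addr, token, auth_type, and namespace are illustrative inputs.
    """
    return {
        "method": "POST",
        "url": f"{vault_addr.rstrip('/')}/v1/sys/auth/{auth_type}",
        "headers": {
            "X-Vault-Token": token,
            "X-Vault-Namespace": namespace,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"type": auth_type}),
    }


def send(request_spec, timeout=10.0):
    """Send a request built above and return the HTTP status code."""
    req = urllib.request.Request(
        request_spec["url"],
        data=request_spec["body"].encode("utf-8"),
        headers=request_spec["headers"],
        method=request_spec["method"],
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

This needs to be run while an admin token is still obtainable, that is, before an outage rather than during one.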
Region outages
To understand the impact of a regional cloud provider outage on HCP Vault, it’s important to note how the HashiCorp Virtual Network (HVN) makes HCP networking possible. An HVN allows you to delegate an IPv4 CIDR range to HCP, which the platform then uses to automatically create a private network on the cloud provider. All clusters are placed in the same region in which the HVN is created. While HCP Vault is already a redundant cluster of three nodes split across availability zones, it does not support geographically redundant (multi-region) clusters in Active-Active or Active-Standby mode. As noted above, nodes are dispersed across multiple AZs to tolerate AZ failures.
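When choosing the IPv4 CIDR range to delegate to an HVN, it can help to check it against ranges already in use, assuming you intend to connect the HVN to existing networks. The sketch below uses Python's standard `ipaddress` module; the function name and the sample ranges are hypothetical.

```python
import ipaddress


def overlapping_networks(hvn_cidr, existing_cidrs):
    """Return the entries of existing_cidrs that overlap the proposed HVN range."""
    hvn = ipaddress.ip_network(hvn_cidr)
    if hvn.version != 4:
        raise ValueError("an HVN is delegated an IPv4 CIDR range")
    conflicts = []
    for cidr in existing_cidrs:
        net = ipaddress.ip_network(cidr)
        # overlaps() only compares networks of the same IP version.
        if net.version == 4 and hvn.overlaps(net):
            conflicts.append(cidr)
    return conflicts
```

For example, `overlapping_networks("172.25.16.0/20", ["10.0.0.0/16", "172.25.0.0/16"])` reports `172.25.0.0/16` as a conflict, because the proposed range falls inside it.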
Cluster deletion and snapshot restore
Once a cluster is deleted, all affiliated resources (including audit logs) are deleted with it, except snapshots, which are retained for 30 days after the cluster is deleted. Snapshots are only retained for Starter, Standard, and Plus tiers. Snapshots can be used to recover a deleted cluster, including restoring to a different region. To request a restore from a snapshot, file a support ticket.
Note
Currently, once a cluster is deleted in Azure, all affiliated resources are deleted with it, including audit logs and snapshots. It is currently not possible to restore a deleted cluster which was hosted in Azure. This functionality is planned.
Sentinel
On Plus-tier clusters, Sentinel policies can affect admin token generation. If both the admin token endpoint and the Sentinel deletion endpoints are blocked by a Sentinel policy, file a support ticket to have the offending policy deleted.
Support coverage
Clusters are monitored 24/7 with on-call staff available to debug production cluster issues. All production-grade clusters are coupled with either Silver or Gold level support.
Incident response
In the event of a critical incident, incident response times are stipulated in the support agreement of the SLA. HashiCorp will use commercially reasonable efforts to maximize the availability of HashiCorp Cloud services, and provide uptime guarantees based on service level agreements (SLA). Audit logs include key metrics that capture activity and performance. You can view a full list of metrics here. Platform events are reported in audit logs within a minute of the event, and can be viewed for up to 365 days. If a cluster is deleted, its audit logs are deleted with it.
Disaster recovery use cases not in scope
While HCP Vault covers a large amount of DR functionality through HA, we currently don't support cross-region failover or cross-region disaster replication. Production-grade clusters are isolated to three nodes within the same region. If a cluster goes down in one region, it cannot be restored to another region. DR secondaries and DR replication are not available at this time.
Conclusion
Organizations that select HCP Vault entrust our expert teams to manage areas of technical operations, including disaster recovery, monitoring, and upgrades. Based on our experience supporting thousands of commercial Vault clusters, HCP Vault brings this expertise directly to users and reduces the manual overhead required to successfully operate Vault. We will continue to invest in HCP Vault to cover more DR scenarios. HCP uptime and incidents can be viewed at https://status.hashicorp.com/.