How to: Cross-Region Vault Replication for AWS Security and Compliance

By Flux7 Labs
January 19, 2017

AWS Security

We have been working with a client in the healthcare industry whose needs raise an interesting use case we think you'll all be interested in: cross-region Vault replication. As you would imagine, this company is subject to HIPAA requirements as well as its own internal security standards to ensure that the healthcare records it keeps are safe and secure. In building an AWS microservices architecture for this client (additional background reading on microservices here), one of the challenges we faced early on was secret management. We chose HashiCorp Vault to manage their secrets, as it provides an interface to static secrets in encrypted form and to dynamic secrets with tight security controls, both of which this firm needed to manage efficiently.
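To illustrate the distinction, here is a sketch using the Vault CLI (the paths and values are hypothetical, and the commands assume an unsealed Vault server reachable from the CLI): a static secret is written once and returned as stored, while a dynamic secret is generated on demand by a configured backend.

```shell
# Static secret: stored encrypted, returned exactly as written
# (hypothetical path and values).
vault write secret/myapp/db username=app password=s3cr3t
vault read secret/myapp/db

# Dynamic secret: Vault mints short-lived credentials on demand from a
# previously configured backend (here, a hypothetical PostgreSQL mount).
vault read postgresql/creds/readonly
```

Dynamic credentials like the PostgreSQL example expire on a lease, which is what makes the tight security controls possible: a leaked credential is only valid for a bounded window.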

In addition, this company had a very specific RPO goal of five minutes or less in order to assure customer satisfaction and maintain standards for the highest levels of availability, and an RTO goal of under one minute. Said another way, even if a catastrophic event such as an entire AWS region failure occurred, the Vault secrets had to be available again within one minute, and the company could not afford to lose more than five minutes' worth of data. These are very stringent requirements.

To achieve this, we put in four layers of safeguards. First, Vault was set up in a multi-node, high-availability fashion with nodes distributed across two AWS Availability Zones. This gave us immunity against a single node failure or a single-AZ failure. Second, these Vault nodes sat behind an AWS ELB, so that if a single node went down, the ELB could forward traffic to one of the other nodes with no disruption. This shortened the RTO because while a node recovered from a failure, the service stayed up without interruption. Third, each Vault node was part of an AWS Auto Scaling group to ensure automated recovery if it went down; no intervention is necessary if one or even multiple Vault nodes fail. Fourth, and most unique, we set up cross-region replication, so that if an entire AWS region failed, the replica Vault cluster in the secondary region would be ready to take requests. We achieved automated failover by automatically switching DNS to the secondary cluster. We will focus the rest of this blog on this last point, the cross-region component.
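As a sketch, a Vault server configuration along these lines ties the first layers together; every address, path, and filename below is an assumption, and the `backend` stanza reflects Vault's configuration syntax at the time of writing (newer versions call it `storage`):

```hcl
# Each Vault node talks to a local Consul agent, which joins the
# regional Consul cluster; HA leader election happens through Consul.
backend "consul" {
  address       = "127.0.0.1:8500"    # local Consul agent (assumed)
  path          = "vault/"            # key prefix Vault uses in Consul
  redirect_addr = "https://vault.internal.example.com:8200"  # assumed ELB name
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"   # placeholder paths
  tls_key_file  = "/etc/vault/tls/vault.key"
}
```

With this shape, any node can become the active Vault instance, and the ELB and Auto Scaling group handle node-level failures without operator intervention.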

The challenge, as readers may well know, is that Vault does not support cross-cluster replication natively. (HT to Nelo-Thara Wallus for a good description of this problem.) However, Consul, its most popular storage backend, does support cross-cluster replication. At the highest level, we decided to build the following:

  • Consul clusters in two regions
  • Vault clusters in both regions, each using its nearest Consul cluster as its backend
  • HashiCorp's consul-replicate, set up to replicate the Consul keys stored by Vault
  • consul-replicate configured to copy the key/values created by Vault
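A first-pass consul-replicate configuration for this plan might look like the following sketch (the datacenter name and address are assumptions); run in the secondary region, it pulls keys from the primary datacenter's Consul into the local one:

```hcl
# Connect to the local (secondary-region) Consul agent.
consul = "127.0.0.1:8500"

# Replicate everything under Vault's key prefix from the primary
# datacenter into the same prefix locally ("prefix@datacenter").
prefix {
  source      = "vault@us-east-1"
  destination = "vault"
}
```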

While this solution seems simple on the face of it, it actually does not work. The reason is that Vault saves information about its cluster members, including which node is the master Vault node, in Consul itself. If we made a blind copy of all Consul keys to the secondary Consul cluster, the Vault cluster in the secondary region would get confused, as it would not find the members specified in the Consul storage.
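The idea can be seen with a toy example. Given a listing of Vault's keys in Consul (the key names below are hypothetical stand-ins for what Vault writes under its prefix), the cluster-state entries live under `vault/core/`, and those are exactly the ones a copy must skip:

```shell
# Hypothetical listing of Vault's keys in Consul: secret data plus
# cluster state. Filtering out vault/core/ leaves only what is safe
# to replicate to the secondary cluster.
printf '%s\n' \
  'vault/core/lock' \
  'vault/core/leader/7a9f' \
  'vault/logical/4b2c/myapp/db' \
  'vault/sys/policy/default' |
grep -v '^vault/core/'
# -> vault/logical/4b2c/myapp/db
# -> vault/sys/policy/default
```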

To address this, we had to make a selective copy of the key/values: copying only the secrets and their metadata, and skipping any key/values that save the cluster state itself. This allowed the secondary Vault cluster to function as if it were independent (the only mode Vault supports at the time of this writing). This part of the solution was inspired by the work of Justin Ellison.
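In consul-replicate terms, the selective copy can be expressed with an `exclude` stanza alongside the `prefix` stanza. The following is a sketch: the datacenter name is an assumption, the exclusion is based on where Vault keeps its leader/lock state, and `exclude` support depends on the consul-replicate version in use.

```hcl
prefix {
  source      = "vault@us-east-1"   # primary datacenter (assumed name)
  destination = "vault"
}

# Skip the keys that record the primary cluster's own state; copying
# them would make the secondary Vault look for the primary's members.
exclude {
  source = "vault/core/"
}
```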

However, when we implemented the proposed solution, we discovered that it did not work either. After a deep investigation, our engineers found that the issue lay in a change to how consul-replicate decided which keys to actually replicate. Once we had this insight, we traced back to the version of consul-replicate where this change was introduced, and then deployed the specific commit of consul-replicate that did not have the issue.

We ended up with the following Dockerfile for consul-replicate:
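The actual Dockerfile is not reproduced in this excerpt, but as a rough sketch of the approach (the base image tag, clone path, and commit hash below are all placeholders, not the real values we used):

```dockerfile
FROM golang:1.7

# Build consul-replicate from a pinned, known-good commit that predates
# the change in key-selection behavior (placeholder hash).
RUN git clone https://github.com/hashicorp/consul-replicate.git \
      /go/src/github.com/hashicorp/consul-replicate \
 && cd /go/src/github.com/hashicorp/consul-replicate \
 && git checkout <known-good-commit> \
 && go install

ENTRYPOINT ["consul-replicate"]
```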