How to: Cross-Region Vault Replication for AWS Security and Compliance
We have been working with a client in the healthcare industry whose needs raised an interesting use case we think you’ll all be interested in: cross-region Vault replication. As you would imagine, this company is subject to HIPAA requirements as well as its own internal security standards to ensure that the healthcare records it keeps are safe and secure. In building an AWS microservices architecture for this client (additional background reading on microservices here), one of the challenges we faced early on was secret management. We chose HashiCorp Vault to manage their secrets because it provides an interface to static secrets in encrypted form and to dynamic secrets with tight security controls, both of which this firm needed to manage efficiently.
In addition, this company had a specific RPO goal of 5 minutes or less in order to assure customer satisfaction and maintain the highest levels of availability, and an RTO goal of under 1 minute. Said another way, even if a catastrophic event such as an entire AWS region failure occurred, the Vault secrets had to be available again within one minute, and the company could not afford to lose more than 5 minutes’ worth of data. These are very stringent requirements.
To achieve this, we put four layers of safeguards in place. First, Vault was set up in a multi-node, high-availability configuration with nodes distributed across two AWS Availability Zones. This gave us immunity against a single node failure or a single AZ failure. Second, these Vault nodes sat behind an AWS ELB, so that if a single node went down, the ELB could forward traffic to one of the other nodes with no disruption. This shortened the RTO: while a node recovered from a failure, the service stayed up without interruption. Third, each Vault node was part of an AWS Auto Scaling group to ensure automated recovery if it went down, so no intervention is necessary if one or even multiple Vault nodes fail. Fourth, and most unique, we set up cross-region replication so that if an entire AWS region failed, the replica Vault cluster in the secondary region would be ready to take requests. We achieved automated failover by automatically switching DNS to the secondary cluster. This post focuses on that last point, the cross-region component.
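The DNS switch behind the automated failover can be expressed as a Route 53 record update. A sketch of the change batch follows; the record name, ELB hostname, and TTL are hypothetical placeholders, not values from the actual deployment:

```json
{
  "Comment": "Fail Vault traffic over to the secondary region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "vault.example.com.",
        "Type": "CNAME",
        "TTL": 30,
        "ResourceRecords": [
          { "Value": "vault-elb-secondary.us-west-2.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
```

A change batch like this can be applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://failover.json`; a low TTL keeps clients from caching the old region for long.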
The challenge, as readers may well know, is that Vault does not natively support cross-cluster replication. (HT to Nelo-Thara Wallus for a good description of this problem.) However, Consul, its most popular storage backend, does support cross-cluster replication. At the highest level, we decided to build the following:
- Consul clusters in two regions
- Vault clusters in both regions, each using its region-local Consul cluster as its backend
- HashiCorp’s consul-replicate, configured to replicate the Key/Values that Vault stores in Consul
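Concretely, each regional Vault cluster pointed at its region-local Consul cluster. A minimal sketch of the per-region Vault server configuration follows; the addresses, paths, and TLS file locations are illustrative placeholders:

```hcl
# Vault server config (one per region): use the region-local Consul
# cluster as the storage backend; the "consul" backend also provides
# Vault's HA coordination.
storage "consul" {
  address = "127.0.0.1:8500"   # local Consul agent in this region
  path    = "vault/"           # key prefix Vault writes under
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"
  tls_key_file  = "/etc/vault/tls/vault.key"
}
```

With this in place, everything Vault persists lives under the `vault/` prefix in Consul’s KV store, which is what makes the replication approach below possible.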
While this solution seems simple on the face of it, it does not actually work. The reason is that Vault saves information about its cluster members, including which Vault node is the active one, in Consul itself. If we made a blind copy of all Consul keys to the secondary Consul cluster, the Vault cluster in the secondary region would get confused, because it would not find the members specified in the Consul storage.
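To make the failure mode concrete, here is a hypothetical slice of the keys Vault writes under its Consul prefix (the names are illustrative, not an exact dump from a real cluster). The entries under `vault/core/` record cluster state that is meaningless, and actively harmful, to a different cluster:

```shell
# Hypothetical slice of Vault's keyspace in Consul; inspect your own
# cluster's key listing before relying on these exact prefixes.
keys="vault/core/lock
vault/core/leader/7c3f
vault/core/seal-config
vault/logical/b2d1/secret/db-password
vault/sys/policy/default"

# Cluster-state keys live under vault/core/; everything else is
# secret data or metadata that is safe to replicate.
cluster_state=$(printf '%s\n' "$keys" | grep '^vault/core/')
replicable=$(printf '%s\n' "$keys" | grep -v '^vault/core/')

printf 'cluster state (do not copy):\n%s\n' "$cluster_state"
printf 'replicable (copy):\n%s\n' "$replicable"
```

Filtering on the cluster-state prefix is the essence of the selective copy described next.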
To address this, we made a selective copy of the Key/Values: we copied only the secrets and their metadata, not the Key/Values that record the cluster state itself. This allowed the secondary Vault cluster to function as if it were independent (the only mode Vault supported at the time of this writing). This part of the solution was inspired by the work of Justin Ellison.
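A sketch of what the selective copy looks like as a consul-replicate configuration follows. The datacenter names and prefixes are illustrative, and the stanza names should be checked against the consul-replicate version you deploy:

```hcl
# Sketch of a consul-replicate config for the selective copy,
# running in the secondary region.
consul {
  address = "127.0.0.1:8500"   # local Consul agent in the secondary region
}

# Pull Vault's keyspace from the primary Consul datacenter...
prefix {
  source      = "vault@us-east-1"
  destination = "vault"
}

# ...but leave out the keys that record cluster state.
exclude {
  source = "vault/core/"
}
```

The important design choice is the `exclude` on the cluster-state prefix: the secrets flow across regions, while each Vault cluster keeps its own notion of membership and leadership.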
However, when we first implemented the proposed solution, we discovered that it did not work. After a deep investigation, our engineers found that a newer release of consul-replicate had changed how it decided which keys to actually replicate. Once we had this insight, we traced back to the consul-replicate version where this change was introduced, and then deployed the specific commit of consul-replicate that did not have the issue.
We ended up with the following Dockerfile for consul-replicate:
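A sketch of the shape such a Dockerfile takes; the base image is an assumption, and the commit hash is deliberately left as a build argument to fill in with the commit you have verified:

```dockerfile
# Build consul-replicate from a pinned commit that predates the
# key-selection change. Substitute a verified commit SHA below.
FROM golang:1.7-alpine
RUN apk add --no-cache git
ARG CONSUL_REPLICATE_COMMIT=replace-with-verified-sha
RUN git clone https://github.com/hashicorp/consul-replicate.git \
      /go/src/github.com/hashicorp/consul-replicate \
 && cd /go/src/github.com/hashicorp/consul-replicate \
 && git checkout "$CONSUL_REPLICATE_COMMIT" \
 && go build -o /usr/local/bin/consul-replicate
ENTRYPOINT ["/usr/local/bin/consul-replicate"]
```

Pinning the build to an exact commit, rather than a release tag, is what guards against the upstream behavior change silently coming back on rebuild.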
Once in place, the solution lit up. We just needed to unseal the Vault in the primary region and start using it. We simulated a failure of the primary region by terminating the cluster, and we were able to unseal the secondary cluster and access keys written even seconds before the primary cluster was destroyed. Both the recovery time and the recovery point were nearly zero.
In conclusion, the pairing of Consul with Vault helps this client achieve what at first might appear to be dueling goals: maintaining an aggressive RPO by deploying and replicating its system cross-region, while ensuring system security with Vault secret management. The combination of Consul’s highly scalable, fault-tolerant architecture with Vault’s advanced secret management features means that this healthcare company gets the best of both worlds. By using Consul as a backend to Vault, the organization gets Consul’s distributed storage of data at rest and coordination, together with Vault’s auditing and management of dynamic and static secrets.
If you are interested in learning more about this project, you can find it discussed here by Mitchell Hashimoto, Founder of HashiCorp, in the HashiConf 2016 keynote.
This client is just one of many for whom we have tailor-designed AWS security solutions to meet their specific technology and business goals. You can find additional background here on how you can balance security with aggressive availability and agility goals with Vault.
* Note that this is not a supported HashiCorp deployment as it breaks in certain circumstances.
Did you find the insights in this article helpful? Please sign up below to get regular news and analysis like this to your inbox.