Re:Invent Review: Deploying Scalable SAP Hybris Clusters using Docker
At this year’s re:Invent, Flux7’s CEO, Aater Suleman, had the great pleasure of presenting with Hemanth Jayaraman, Rent-A-Center’s director of DevOps. (You can watch the full presentation here.) We shared with the audience the story of how we worked with Rent-A-Center to help them address their challenge to architect, deploy, and manage a mission-critical SAP Hybris ecommerce platform that could scale to 6+ million users a month.
Together with Rent-A-Center, the DevOps consulting specialists at Flux7 created an AWS DevOps-based approach that helped deliver the solution to market faster, in a secure, highly available, PCI-compliant fashion. As proof of the solution’s flexibility, Rent-A-Center shared that the ecommerce system saw a 42% increase in traffic, with more than nine million hits, over Black Friday without missing a beat.
As readers likely know, there are two distinct schools of thought on the best approach to moving an organization to DevOps – one that touts the value of having DevOps built in across IT on day one and another advocating for a DevOps Center of Excellence that then spreads DevOps across the organization. Rent-A-Center felt that the latter was a good fit for their organization given their level of maturity and existing agile practices. With growing maturity based on prior AWS engagements, Rent-A-Center quickly felt confident migrating their ecommerce platform to the AWS cloud.
This strong foundation set the stage for the next business need. According to Mr. Jayaraman, “The business wanted a complete, 360 view of the customer and also wanted to be able to enable some self-service capabilities and for customers to be able to rent online for the first time.” Rent-A-Center selected SAP Hybris as its platform of choice to achieve these goals. Because SAP Hybris is a stateful application and you can’t auto scale a stateful solution, the team needed to make it stateless in order to take full advantage of AWS autoscaling features.
The project goals spanned business and technology. First and foremost, the site needed to support many millions of users per month, with the critical ability to scale for Black Friday. PCI compliance and the security requirements that go with it were also a top priority, as were specific development goals that would help increase the team’s maturity and agility in its ongoing effort to foster constant improvement.
The architecture was both unique to Rent-A-Center’s stated goals and incorporated many architecture and AWS security best practices. A few highlights that we found particularly interesting:
- Every single component is deployed across multiple AWS Availability Zones, which means that even if an entire data center were to fail, the system would not require human intervention.
- Backing up this containerized environment is an Aurora database, which was a natural fit given its high availability, scalability and security features.
- With an architecture that needed to be PCI audit ready, AWS WAF is in place and is used in conjunction with Lambda to build rules on the fly that automatically begin filtering traffic. Additionally, there are two levels of encryption for credentials and role separation. First, credentials are saved in an S3 bucket with server-side encryption enabled. Second, the team used AWS Key Management Service (KMS) to encrypt everything in S3, so that even someone with access to the bucket could not read its contents. Together these provide two layers of protection.
- AWS Certificate Manager was relied on heavily. Rent-A-Center and Flux7 used so many AWS services that they didn’t even have an outside SSL vendor; even the SSL certs were generated by AWS Certificate Manager.
- As an ecommerce platform, auto scaling is important due to changes in traffic patterns. Yet, with containers it’s not as easy as simply scaling EC2 instances. In the case of Rent-A-Center, there are two layers. The bottom layer is the number of EC2 instances on which the container cluster is running. The top layer is the number of containers running in a given service. The two layers needed to work in tight conjunction with one another to ensure effective auto scaling. We’ve written in depth on the AWS autoscaling solution (including the script used for this), so we refer you to the “Amazon ECS Service Auto Scaling Enables Rent-A-Center SAP Hybris Solution” blog from Troy Washburn, Sr. DevOps Manager at Rent-A-Center, and Ashay Chitnis, Flux7 architect, for additional details.
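The relationship between the two scaling layers can be illustrated with a small capacity calculation. This is only a sketch under stated assumptions, not the script referenced above: the function name, the per-container CPU and memory reservations, and the instance sizes are all hypothetical.

```python
import math


def instances_needed(desired_containers, container_cpu, container_mem_mb,
                     instance_cpu, instance_mem_mb):
    """Return how many EC2 instances are needed to place every container.

    Hypothetical units: CPU in ECS-style CPU units, memory in MiB.
    """
    # Containers that fit on one instance, limited by the scarcer resource.
    per_instance = min(instance_cpu // container_cpu,
                       instance_mem_mb // container_mem_mb)
    if per_instance < 1:
        raise ValueError("a single container does not fit on one instance")
    return math.ceil(desired_containers / per_instance)


# The top layer asks for 10 containers; the bottom layer must be able to host them.
print(instances_needed(10, container_cpu=1024, container_mem_mb=4096,
                       instance_cpu=4096, instance_mem_mb=15000))
```

The point of the sketch is that the container count cannot grow unless the instance count can absorb it, which is why the two layers have to scale in step.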
Hybris Node Discovery
Hybris is a stateful application, built with a “pet” (vs. “cattle”) mentality, where every node has a fixed IP address. Moreover, Hybris nodes need to be aware of each other. Given the nature of AWS and autoscaling, the team had to find a new way to manage node discovery. The solution was simultaneously creative and simple. The team loaded the IP addresses of the nodes into a database on the fly, and each node could read the addresses of the other nodes from there. Now, with the IP addresses of all the nodes available in the database, even as they change, the nodes are able to get what they need and continue operating from there.
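A database-backed registry of this kind might look roughly like the following. This is a minimal sketch, not the team's implementation: it uses an in-memory SQLite database as a stand-in for the real database, and the table name, the self-registration step, and the staleness timeout are all assumptions made for illustration.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 60  # hypothetical: nodes silent for this long are ignored


def open_registry(path=":memory:"):
    """Open the registry and create the (hypothetical) cluster_nodes table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cluster_nodes ("
        "ip TEXT PRIMARY KEY, last_seen REAL NOT NULL)"
    )
    return conn


def register(conn, ip, now=None):
    """Insert a node's IP, or refresh its timestamp if it is already present."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cluster_nodes (ip, last_seen) VALUES (?, ?)",
        (ip, now),
    )
    conn.commit()


def peers(conn, own_ip, now=None):
    """Return the IPs of the other nodes that have checked in recently."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT ip FROM cluster_nodes "
        "WHERE ip != ? AND last_seen >= ? ORDER BY ip",
        (own_ip, now - STALE_AFTER_SECONDS),
    )
    return [row[0] for row in rows]


conn = open_registry()
register(conn, "10.0.1.15", now=1000.0)
register(conn, "10.0.2.27", now=1030.0)
register(conn, "10.0.3.40", now=900.0)  # too old by the time we look
print(peers(conn, "10.0.1.15", now=1040.0))
```

Filtering on a last-seen timestamp is one way to keep terminated instances from lingering in the peer list; whether the original solution did this is not stated.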
While this solved the first problem, a second issue arose: Hybris needs to know the IP address of the host, not the IP address of the container it is running in. However, by design a container does not know the IP address of its host. The solution was a startup script in every container. On startup, the script queries the EC2 instance metadata (which is luckily available inside the container) for the IP address of the host, which we then made available to the Hybris application as part of a config file.
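A startup step along these lines could be sketched as follows. It is an illustration under assumptions rather than the actual script: the property key is a placeholder and not a real Hybris setting, and the token request uses IMDSv2, which is newer than this project, so the original most likely issued a plain GET. The metadata address and the `local-ipv4` path are the standard EC2 instance metadata endpoints.

```python
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest"


def fetch_host_ip(timeout=2):
    """Ask the EC2 instance metadata service for the host's private IPv4."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{METADATA_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    # Use the token to read the instance's private IP address.
    ip_req = urllib.request.Request(
        f"{METADATA_BASE}/meta-data/local-ipv4",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(ip_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def render_config(host_ip, key="cluster.node.host.ip"):
    """Build one properties line; the key is a placeholder, not a Hybris property."""
    return f"{key}={host_ip}\n"


def write_config(path, fetch=fetch_host_ip):
    """Write the host IP to a config file; `fetch` can be swapped out for testing."""
    with open(path, "w") as fh:
        fh.write(render_config(fetch()))


# Outside EC2 the metadata service is unreachable, so only the rendering is shown here.
print(render_config("10.0.1.15"), end="")
```

On an instance, calling `write_config` with a path of your choosing from the container's entrypoint, before the application starts, would produce the file. Note that with IMDSv2 and bridge networking, the instance's metadata hop limit typically needs to be raised above 1 for the token request to succeed from inside a container.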
With the solution, Rent-A-Center has made large strides toward having a 360-degree view of its customers. Now whether a customer is online or in the store, Rent-A-Center is able to identify them and has a single, unified view. In addition, the team incorporated many best practices for PCI and is now completely PCI audit ready. For additional background on Rent-A-Center’s use of AWS Security by Design principles, you’ll enjoy Hemanth Jayaraman’s AWS Summit presentation, “Compliance in the Cloud Using Security by Design”.
Last, many IT outcomes were achieved that further innovation. To help achieve team and business goals, the project helped Rent-A-Center establish Infrastructure as Code, become more agile with flexible infrastructure, and establish automated delivery of infrastructure, code, containers and security rules. This growth in innovation will allow the development team to be more productive, effective and efficient so they can deliver ongoing value to the business.
Q: As Hybris is a monolithic application, and as you are using auto scaling and bringing them down, how are you doing that?
A: We can do session replication. We can put the replicated sessions in an ElastiCache Redis cluster, and that’s how the session state gets maintained. With session replication we can still roll through code. And since the session is replicated, even if one of the containers goes down, the user does not lose their session. So, it is all about maintaining session state outside the containers, in a separate ElastiCache Redis cluster.
Q: What did you use for the shared media that’s created when you initialize Hybris? It’s a large container. Do you package it all up and ship it every time you build it?
A: That’s all in the S3 bucket. Right now, in the initial phase, Hybris is serving it, and in the next evolution we will serve static and media content directly from S3. As far as the size of it goes, it is unfortunately a monolithic application. It is a single JVM and you would have to ship it. We can break out Solr search and other things into different containers, but the core Hybris itself is a large JVM.
Q: Is software licensing affected at all?
A: No, Hybris is licensed by CPU core so in an auto scaling group, if we did have licensing restrictions, we would restrict that in the autoscaling group. We would say, ‘stop when you reach a certain, maximum number of CPUs’. From a licensing perspective, we didn’t run into any issues.
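The cap described in this answer comes down to simple arithmetic. A minimal sketch, assuming a hypothetical licensed core count and instance size (neither figure comes from the presentation):

```python
def max_instances_for_license(licensed_cores, cores_per_instance):
    """Largest instance count that stays within a per-core license limit."""
    if cores_per_instance < 1:
        raise ValueError("cores_per_instance must be at least 1")
    # Round down: a partially licensed instance is not allowed.
    return licensed_cores // cores_per_instance


# Hypothetical numbers: 64 licensed cores on instances with 8 cores each.
print(max_instances_for_license(64, 8))
```

The result would be used as the maximum size of the auto scaling group, so scaling stops once the licensed core count is reached.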
Q: How did you choose Ansible?
A: We decided to use Ansible because we are very familiar with it. We also found it to be a lot more flexible because it is agentless. Other similar products require the use of an agent, so the fact that Ansible is agentless and runs over SSH was a big factor in choosing it.
Q: Do you want to share the magnitude of traffic on Thanksgiving and what the system saw?
A: During Black Friday, Rent-A-Center saw a 42% increase in traffic to its ecommerce site, with over 9 million hits. We were able to scale this infrastructure to support that. Kudos go out to the architecture and business teams; we didn’t even notice a blip. Through scaling, we maintained the same page load and Hybris response times despite the large increase in traffic. And none of our engineers needed to spend the night!