According to the 2020 Gartner CIO Survey, 31% of CIO respondents expect artificial intelligence and machine learning (AI/ML) to be a game-changer for their organization, placing it among the Top 10 “game-changing” technologies. (Gartner, “2020 CIO Agenda: Industry Perspectives Overview,” Jan-Martin Lowendahl, Andy Rowsell-Jones, Chris Howard, Tomas Nielsen, Brad Holmes, 7 January 2020. Gartner subscription required.) While the list of potential use cases for AI and ML is extensive, much as in the parable of the man who built his house on rock rather than sand, a hastily built foundation can lead to unnecessary time and cost overruns, security gaps, and scalability problems.
Conversely, a sound AI/ML outcome is best built upon a solid infrastructure platform that has security, agility, and scalability built in. Moreover, as the business faces increasing pressure to rapidly and efficiently pivot to new priorities, a flexible framework that supports agility, innovation, and the latest technologies can be a critical component.
Cloud computing offers several advantages for AI/ML applications. Cloud providers deliver near-term impact without long ROI cycles, allowing your data teams to focus on driving innovation. The cloud does so with:
- Skilled teams that take care of hardware needs, abstracting the architecture away so you gain greater flexibility and can move faster from pilot to production. As Jan van der Vegt points out in his Towards Data Science article on Medium, “A slow deployment process hinders user acceptance and decreases trust in the application.” And a lack of trust in data applications can be disastrous for an ML or AI program.
- Fast adoption of new hardware technologies that can continuously speed model testing times and reduce the solution drift caused by outdated systems. For example, at re:Invent 2019, Andy Jassy announced that AWS is able to run frameworks like PyTorch, TensorFlow, and MXNet 20-22% faster than even the most advanced on-premises systems.
- Extreme resource scalability that allows your team to keep testing through any spikes in demand.
- Last, the cloud allows you to build advanced infrastructure with automation such as CI/CD pipelines, which further speeds output and enforces the uniformity that decreases risk.
Building the AI/ML Stack
While it may be obvious, it should be said that building a solid AI infrastructure requires the involvement of the IT or DevOps team. The team should focus on two core goals: Data Operations and the AI ecosystem.
The initial goal here is to prepare for data ingestion. Each of the public cloud providers offers a rich set of tools and infrastructure to easily ingest your data sets, and with AutoML, these tools can automatically analyze your data to identify important features. However, imported data is only part of the equation. As your data teams work with their models, they will require storage that can handle extremely large amounts of data. In AWS, a best-practice data operations approach would include autoscaling groups of storage nodes and an S3 bucket for data sets, as well as automated deployment of Amazon EFS file systems to store the output from different AI/ML jobs. Other cloud providers offer similar mechanisms under slightly different names.
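As a rough illustration, here is a minimal boto3 sketch of that storage layer, assuming AWS credentials are already configured; the bucket and file-system names are hypothetical placeholders:

```python
# Minimal sketch: provision the data-operations storage layer with boto3.
# Names and region are illustrative, not prescriptive.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
efs = boto3.client("efs", region_name="us-east-1")

# An S3 bucket to hold raw and prepared data sets.
s3.create_bucket(Bucket="example-aiml-datasets")

# Versioning helps keep training data sets reproducible over time.
s3.put_bucket_versioning(
    Bucket="example-aiml-datasets",
    VersioningConfiguration={"Status": "Enabled"},
)

# An EFS file system for AI/ML job output; bursting throughput scales
# with stored data, which suits intermittent training workloads.
fs = efs.create_file_system(
    CreationToken="aiml-job-output",
    PerformanceMode="generalPurpose",
    ThroughputMode="bursting",
    Tags=[{"Key": "Name", "Value": "aiml-job-output"}],
)
print("EFS file system:", fs["FileSystemId"])
```

In practice, calls like these live in infrastructure-as-code or CI/CD automation rather than being run by hand, which is what keeps environments uniform.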
Data Lake Driven Intelligence
A manufacturer of large equipment sought to deliver data from its connected devices to customers to improve quality and productivity in real time. To do so, it rapidly built an IoT infrastructure with an agile DevOps workflow that leverages AWS and Ansible while maintaining the tight security controls needed to meet privacy laws. With DevOps process automation and infrastructure as code, the manufacturer seamlessly streams device data to an AWS data lake, where the ingested data is analyzed and presented in a digital platform that customers use to drive continuous, data-driven improvement.
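The manufacturer’s exact pipeline isn’t published here, but a common pattern for landing device telemetry in an S3-backed data lake is a managed delivery stream. The sketch below assumes an Amazon Kinesis Data Firehose stream already pointed at the data lake bucket; the stream name and payload are invented for illustration:

```python
# Hypothetical sketch: delivering IoT telemetry to an S3-backed data lake
# via Amazon Kinesis Data Firehose. Stream name and payload are made up.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

telemetry = {
    "device_id": "press-042",
    "timestamp": "2020-06-01T12:00:00Z",
    "vibration_mm_s": 4.7,
    "temperature_c": 61.2,
}

# Firehose buffers records and delivers them to the data lake's S3 bucket,
# where they can be analyzed and surfaced to customers.
firehose.put_record(
    DeliveryStreamName="equipment-telemetry-to-datalake",
    Record={"Data": (json.dumps(telemetry) + "\n").encode("utf-8")},
)
```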
A key question for the infrastructure team to answer is how to connect your imported data sets and analytics pipelines into the AI environment. Let’s answer that question through an example.
The Toyota Research Institute IT Operations team created a self-service catalog that allows data scientists to automatically create and destroy dedicated AI/ML clusters using AWS Service Catalog.
When a data engineer selects a Service Catalog product, it automatically employs an AWS CloudFormation template to provision a P3 or P3dn GPU cluster. The cluster, in turn, pulls its data from Amazon FSx for Lustre, which is synced with data from Amazon S3 when the file system is created. The data engineers then use PyTorch to train and improve models. This fully automated process gives data scientists a common ecosystem for AI/ML needs, letting them provision infrastructure and leverage deep learning frameworks like PyTorch to recreate and improve models, rather than building new infrastructure for each unique business problem.
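As a hedged sketch of that self-service pattern, the boto3 call below shows how such a provisioning request might look; the product, version, and parameter names are hypothetical stand-ins for whatever the catalog actually defines:

```python
# Hypothetical sketch: a data scientist provisions a GPU training cluster
# from an AWS Service Catalog product. All names below are placeholders.
import boto3

sc = boto3.client("servicecatalog", region_name="us-west-2")

sc.provision_product(
    ProductName="ml-training-cluster",            # hypothetical product
    ProvisioningArtifactName="v3",                # hypothetical version
    ProvisionedProductName="vision-team-experiment-17",
    ProvisioningParameters=[
        {"Key": "InstanceType", "Value": "p3dn.24xlarge"},
        {"Key": "ClusterSize", "Value": "4"},
        {"Key": "FsxS3ImportPath", "Value": "s3://example-training-data"},
    ],
)

# Tearing the cluster down later is the same pattern in reverse:
# sc.terminate_provisioned_product(
#     ProvisionedProductName="vision-team-experiment-17")
```

Behind the scenes, Service Catalog hands the request to the CloudFormation template the IT team approved, so data scientists get self-service speed without bypassing governance.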
Personal Peptides Speeds Genomic Analytics with Optimized AWS Infrastructure
Personal Peptides sequences, analyzes, and compares DNA against large data sets, which requires massive amounts of computing resources to return information within a reasonable timeframe. Yet managing patient DNA data also requires meeting HIPAA requirements and protecting client information along the supply chain. To help Personal Peptides meet these dual challenges while maintaining competitive advantage through efficient use of computing resources, we helped the firm blueprint a cloud-based machine learning solution that allows the company to remain agile, with a highly secure infrastructure that provides an additional level of data protection along the document chain.
Specifically, the application architecture included compute clusters for machine learning models, scalable task queues, search indexers, and auto-scaling workers. The requirements of these applications, along with the business requirements, were carefully mapped, and an AWS architecture was configured to meet them. The final solution uses a combination of enterprise-grade messaging from RabbitMQ, the Celery distributed task queue, MongoDB, and a proprietary mix of Amazon Web Services offerings to ensure security, privacy, and scalability. By using a massively parallel compute cluster, NoSQL, and indexing within Amazon Web Services, Personal Peptides is able to conduct genomic analysis and produce recommendations within a viable timeframe and cost.
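To make the task-queue layer concrete, here is a minimal Celery sketch of the pattern described above, with hypothetical connection strings and a stub in place of the proprietary scoring step:

```python
# Minimal sketch of the task-queue layer: Celery workers consume genomics
# jobs from RabbitMQ and write results to MongoDB. Hosts are placeholders.
from celery import Celery
from pymongo import MongoClient

app = Celery("genomics", broker="amqp://guest@rabbitmq-host//")
mongo = MongoClient("mongodb://mongo-host:27017")

def run_alignment(sequence: str) -> float:
    """Stand-in for the proprietary alignment/scoring step."""
    return len(sequence) / 1000.0

@app.task
def compare_sequence(sample_id: str, sequence: str) -> float:
    """Score one DNA sample against the reference set and store the result."""
    score = run_alignment(sequence)
    mongo.genomics.results.insert_one({"sample_id": sample_id, "score": score})
    return score

# Auto-scaling workers each run something like:
#   celery -A genomics worker --autoscale=16,2
```

Because the broker decouples producers from workers, the worker fleet can grow or shrink with the job backlog without touching the application code.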
Clinical Research Firm Achieves Secure, Elastic, Machine Learning Environment
This privately-held clinical research organization provides services across the entire drug development process to pharmaceutical, biotechnology, and medical device organizations. The company wanted to update the system its internal team of research scientists used for data analysis, as the team’s large-data demands had outgrown its on-premises system.
As a result, the company created a service-agnostic landing zone into which its services deploy. To build the landing zone, the team created an AWS CloudFormation template that defines the VPC, subnets, NAT gateways, internet gateways, security groups, and NACLs. In this way, the company is able to track changes and reproduce the same landing zone across its production, development, and staging environments, improving consistency and reliability.
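A trimmed-down sketch of that pattern might look like the following, with one version-controlled template reused across environments. The template here defines only a VPC and subnet for brevity, whereas the real one also covers NAT gateways, internet gateways, security groups, and NACLs:

```python
# Sketch: deploy the same landing-zone template to each environment.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  EnvName:
    Type: String
Resources:
  Vpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      Tags:
        - Key: Name
          Value: !Sub '${EnvName}-vpc'
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.1.0/24
"""

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The same template reproduces the landing zone per environment.
for env in ("production", "staging", "development"):
    cfn.create_stack(
        StackName=f"landing-zone-{env}",
        TemplateBody=TEMPLATE,
        Parameters=[{"ParameterKey": "EnvName", "ParameterValue": env}],
    )
```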
CloudFormation templates are also used to define the infrastructure for RStudio and Galaxy, ensuring the right number of instances, correct S3 bucket configurations, and more. All CloudFormation templates are stored in the company’s new AWS CodeCommit repository. Based on work by Matt Chambers at Vanderbilt University, the team used CfnCluster to build a single Galaxy cluster within AWS that scales based on the number of jobs in the queue.
Using SLURM for scheduling, the cluster allows all the firm’s scientists to share a single environment that auto-scales on queued jobs, rather than on CPU usage or other compute metrics, ensuring scalability that directly addresses their research needs. In the past, the organization had to configure its environment manually and, when it ran out of capacity, tear it all down and rebuild it. With the new AWS Galaxy cluster, the company now has a fully scalable environment whose resources spin up and down based on jobs in the queue, maximizing scientist productivity, reducing costs, and ensuring long-term success.
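The case study doesn’t publish its scaling logic, but the idea of scaling on queued jobs rather than CPU can be sketched as below, polling SLURM’s squeue and resizing a hypothetical Auto Scaling group to match pending work:

```python
# Hypothetical sketch: job-based (not CPU-based) scaling. Poll the SLURM
# queue and resize the compute fleet's Auto Scaling group to match it.
import subprocess
import boto3

def pending_jobs() -> int:
    # squeue -h suppresses the header; one line per pending job ID.
    out = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

asg = boto3.client("autoscaling", region_name="us-east-1")

desired = min(pending_jobs(), 20)   # cap at 20 compute nodes
asg.set_desired_capacity(
    AutoScalingGroupName="galaxy-compute-fleet",   # hypothetical name
    DesiredCapacity=desired,
    HonorCooldown=False,
)
```

Run on a schedule, a loop like this keeps capacity proportional to actual research demand instead of to a proxy metric.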
With an automated AI cloud ecosystem, teams can create a solid infrastructure that lets data scientists concentrate on data science, not the technology that supports it. This approach saves data engineers time by removing their dependency on the infrastructure team for base infrastructure, allowing scientists to start testing models faster and, in turn, more quickly solve core business problems, whether in vehicle safety, algorithmic trading, medical diagnostics, or predictive maintenance.
Accelerate AI/ML Projects
Automation and cloud best practices are critical to accelerating progress. To start on the right foot and get to the finish line faster, many organizations elect to work with an experienced consultant who can design and implement a cloud AI/ML infrastructure for their unique needs while ensuring cloud security, automation and other best practices are built in.
About Flux7, an NTT DATA Company
Our experienced consultants can help your project get to market faster, with reduced maintenance and greater productivity that allows you to spend more time on strategic initiatives. We help you unleash the potential of AI/ML for your business by pairing your team with the right resources at the right time, meeting project goals with agility, and delivering a secure, scalable, reliable cloud infrastructure. Just as importantly, we teach your team along the way, leaving you with the tooling and skills to effectively analyze data, unearth trends, and surface business-impacting insights.
Are you ready to adopt AI/ML and looking for a partner that can help guide the journey? Start the discussion today.