April 5, 2023

AWS Comprehensive Cloud Machine Learning Package

AWS provides a comprehensive toolbox for building and deploying models, and managing MLOps, while being integrated into the broader AWS environment


Author: Luis Martinez

Today, we’re lucky to have many options for cloud-based machine learning. We previously discussed accelerating the machine learning lifecycle using various components of Google’s suite of offerings, as well as the low-code/no-code machine learning frameworks available in the cloud.

Likewise, Amazon Web Services (AWS) is one of the most popular cloud services available today, offering a variety of options for developing and deploying machine learning (ML) models. Analyst firm Gartner recognized AWS as an industry leader in its 2022 report evaluating cloud AI developer services, and AWS is used by many industry-leading organizations such as Intuit, 3M, and T-Mobile.

For any organization considering developing its machine learning capabilities, the AWS suite provides a range of solutions that can be tailored to the organization’s needs. These tools can be used either individually or as fully managed platforms, offering flexibility to organizations at different points in their ML journey. In this article, we explore a typical machine learning lifecycle and highlight some relevant AWS tools along the way.

1) Data Preparation / Engineering

Before any data science or machine learning work can be done, data values reflecting both input features and corresponding target values need to be gathered. These data sets often come from disparate sources and require labeling and transformation into a usable format for machine learning and analysis, all in a secure location.

Amazon Kinesis is a fully managed streaming data service used to collect, process, and analyze data. It is well suited to handling real-time data at scale, such as logs, metrics, and clickstreams, and consists of four components:

Kinesis Data Streams: low-latency streaming ingestion
Kinesis Data Analytics: perform analytics in real time on streams using SQL
Kinesis Data Firehose: write streams to S3, Elasticsearch, and Splunk
Kinesis Video Streams: solution for streaming video
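
As a rough illustration of how a producer might write a clickstream event to a Kinesis stream, the sketch below builds the arguments for the boto3 put_record call. The stream name clickstream-events and the event fields are hypothetical, and the send_event helper only works where boto3 and AWS credentials are available.

```python
import json


def build_put_record_kwargs(stream_name, event, key_field="user_id"):
    """Build keyword arguments for the Kinesis put_record API call.

    Records that share a partition key are routed to the same shard,
    so keying by user keeps each user's events together.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),  # payload is sent as bytes
        "PartitionKey": str(event[key_field]),
    }


def send_event(stream_name, event):
    """Send one event; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("kinesis")
    return client.put_record(**build_put_record_kwargs(stream_name, event))


# Hypothetical stream name and event, for illustration only.
kwargs = build_put_record_kwargs(
    "clickstream-events", {"user_id": 42, "page": "/pricing"}
)
```

Keeping the request builder separate from the network call makes the payload easy to inspect and test before anything is sent to AWS.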

Likewise, AWS Glue provides tools for cataloging data, inferring schemas, and transforming, cleaning, and enriching data. Glue Data Catalog is a service that can be used in conjunction with S3, Redshift, and Amazon RDS to build a metadata repository of the tabular data they hold. Cataloging can either be scheduled or performed on demand.

Glue ETL allows you to transform, clean, and enrich data. You can generate ETL code in Python or Scala, or provide your own Spark or PySpark scripts. Like Glue Data Catalog, this can be done against S3, Redshift, and Amazon RDS, as well as a Glue Data Catalog. The jobs can either be scheduled or run based on user-defined triggers. You can filter, join, map, and drop data, and perform format conversions between CSV, JSON, Avro, Parquet, ORC, and XML.
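
To make the transform vocabulary concrete, here is a plain-Python stand-in for a small ETL job that filters, joins, maps, and drops fields, then converts CSV to JSON. It does not use the Glue API; in Glue, the same steps would typically be expressed as Spark or PySpark transforms. The column names and sample rows are made up for illustration.

```python
import csv
import io
import json

# Made-up sample inputs standing in for two source tables.
ORDERS_CSV = (
    "order_id,customer_id,amount,internal_note\n"
    "1,10,25.50,ok\n"
    "2,11,0,test\n"
    "3,10,12.00,ok\n"
)
CUSTOMERS = {"10": "Acme", "11": "Globex"}


def run_etl(orders_csv, customers):
    """Filter, join, map, and drop fields, then emit JSON."""
    output = []
    for row in csv.DictReader(io.StringIO(orders_csv)):
        amount = float(row["amount"])
        if amount <= 0:  # filter: discard zero-value orders
            continue
        output.append({
            "order_id": int(row["order_id"]),           # map: cast to int
            "customer": customers[row["customer_id"]],  # join: look up name
            "amount": amount,
            # drop: internal_note is deliberately not carried over
        })
    return json.dumps(output)  # format conversion: CSV in, JSON out


result = json.loads(run_etl(ORDERS_CSV, CUSTOMERS))
```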

Finally, AWS provides a variety of data storage solutions, the cornerstone of which is Amazon S3 (Simple Storage Service), mentioned above and commonly used as the foundation of a data lake. This highly scalable object storage service is used to store and retrieve any type of data, including images, videos, documents, and, of course, the training and validation data used for machine learning. Amazon builds in flexibility here by supporting a variety of storage tiers for different use cases. You can also implement lifecycle rules to automatically move objects between storage classes to minimize storage costs.
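
A lifecycle rule of the kind described above might look like the following sketch, which builds the configuration passed to the boto3 put_bucket_lifecycle_configuration call. The prefix training-data/ and the 30-day and 90-day thresholds are assumptions chosen for illustration, not recommendations.

```python
def build_lifecycle_config(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects down over time."""
    return {
        "Rules": [{
            "ID": f"tier-down-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},  # only objects under this prefix
            "Status": "Enabled",
            "Transitions": [
                # Move to infrequent-access storage first, then to archive.
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
        }]
    }


def apply_lifecycle(bucket, config):
    """Apply the rules; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Hypothetical prefix, for illustration only.
config = build_lifecycle_config("training-data/")
```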

2) Exploratory Data Analysis & Data Labeling

Once data has been stored, ingested, and secured, your organization can begin analyzing and, if needed, labeling the data. Amazon offers powerful tools to accomplish both. Athena is an interactive serverless query service for S3. Unlike Redshift, Athena doesn’t require the data to be loaded first; it can query data in place in CSV, JSON, ORC, Parquet, or Avro format.
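
As a sketch of what querying S3 data through Athena can look like from code, the example below builds the arguments for the boto3 start_query_execution call. The database, table, and results bucket names are hypothetical. Athena runs queries asynchronously and writes the results to the S3 output location, so the call returns a query ID rather than rows.

```python
def build_athena_query(database, table, output_location, limit=10):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(params):
    """Submit the query; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    response = boto3.client("athena").start_query_execution(**params)
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket, for illustration only.
params = build_athena_query(
    "ml_catalog", "clickstream", "s3://example-athena-results/queries/"
)
```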

QuickSight is a serverless, cloud-based analytics service that lets employees across an organization build visualizations and perform ad hoc analysis to rapidly derive business insights from data, and it is accessible anytime on any device. The service is well suited to building dashboards and tracking KPIs, and it can also perform anomaly detection and forecasting.

Data labeling is the process of adding one or more meaningful labels to raw data. SageMaker Ground Truth is a service for labeling data for use in machine learning. This tool can integrate with your organization’s labeling workforce or use AWS Marketplace to access a global network of professional labelers. As the human labelers work, Ground Truth trains its own classification model and sends only the items the model is uncertain about to the human labelers. This can help reduce the cost of labeling.
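
The idea of sending only uncertain items to people can be sketched with a simple confidence threshold. This is a simplified illustration of the general approach, not Ground Truth’s actual algorithm; the 0.9 threshold, the item IDs, and the confidence scores are all made up.

```python
def route_for_labeling(predictions, threshold=0.9):
    """Split items into auto-labeled and human-review groups.

    predictions maps an item ID to (predicted_label, confidence).
    """
    auto_labeled, needs_human = {}, []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labeled[item_id] = label  # model is confident enough
        else:
            needs_human.append(item_id)    # uncertain: send to a labeler
    return auto_labeled, needs_human


# Made-up model outputs, for illustration only.
predictions = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.91),
}
auto_labeled, needs_human = route_for_labeling(predictions)
```

In this toy run only one of three items would reach a human labeler, which is where the cost saving comes from.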

3) AWS Machine Learning Model Training

For a full end-to-end experience, Amazon SageMaker offers a fully managed platform to build, train, and deploy ML models. An integral part of the ML process involves selecting or developing a suitable algorithm to train the model using the prepared data set. SageMaker includes pre-built algorithms and notebook instances for development, and it allows for easy scaling.
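
To give a sense of what a managed training run involves, the sketch below assembles a request for the boto3 create_training_job call. The job name, image URI, role ARN, and S3 paths are placeholders to be replaced with real values, and the instance type, volume size, and runtime limit are assumptions. Scaling out is largely a matter of changing the instance count.

```python
def build_training_job(job_name, image_uri, role_arn, train_s3_uri,
                       output_s3_uri, instance_type="ml.m5.xlarge",
                       instance_count=1):
    """Build a request for SageMaker's create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # container holding the algorithm
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,  # raise this to scale out
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def start_training(request):
    """Start the job; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    return boto3.client("sagemaker").create_training_job(**request)


# All identifiers below are placeholders, not real resources.
request = build_training_job(
    job_name="example-training-job",
    image_uri="<training-image-uri>",
    role_arn="<execution-role-arn>",
    train_s3_uri="s3://example-bucket/training-data/",
    output_s3_uri="s3://example-bucket/model-output/",
    instance_count=2,
)
```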

Alternatively, AWS Marketplace offers a host of pre-trained models and algorithms that can be easily integrated into your own applications. These cover a broad range of machine learning tasks, including:

Classification
Regression
Natural Language Processing (NLP)
Computer Vision (CV)
Forecasting
Clustering/Dimensionality Reduction
Anomaly Detection
Recommender Systems
Reinforcement Learning
Hyperparameter Tuning

For even more customization, Amazon offers you the flexibility to build your own custom algorithms and models using popular frameworks such as TensorFlow, Keras, and MXNet.

4) Evaluation and Deployment

Once the models have been trained, the next step is to evaluate them and deploy them to production. Amazon SageMaker includes several built-in evaluation metrics, such as accuracy, precision, recall, and F1 score, to use in model evaluation. Of course, users can also define custom metrics and integrate them into the training and validation process. Additionally, SageMaker includes built-in hyperparameter optimization features, which can automatically tune the hyperparameters of the model to improve its performance.
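
For readers less familiar with the metrics named above, the following plain-Python sketch shows how accuracy, precision, recall, and F1 score are defined for a binary classifier. It is a from-scratch illustration rather than SageMaker code, and the labels and predictions are made up.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```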

Once a model is trained and validated, Amazon SageMaker lets you create and manage endpoints to deploy the model and quickly scale it for inference. Endpoints allow you to define an inference pipeline that specifies how input data is preprocessed and flows through the trained model to generate an output.

These endpoints can then be used to generate predictions or recommendations for new data in real time. You can also configure endpoints to elastically scale to serve as much or as little traffic as needed. This allows for automatic changes in computational resources based on changes in demand.
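
One common way to configure this elasticity is a target-tracking policy through Application Auto Scaling. The sketch below builds the two requests involved, for the boto3 register_scalable_target and put_scaling_policy calls. The endpoint name, the variant name AllTraffic, the capacity limits, and the target of 100 invocations per instance are assumptions for illustration.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=4,
                               invocations_per_instance=100.0):
    """Build the scalable-target and scaling-policy requests for an endpoint."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Bounds within which the instance count may move.
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    # Policy that adds or removes instances to hold the metric near the target.
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "SageMakerVariantInvocationsPerInstance",
            },
        },
    )
    return target, policy


def apply_autoscaling(target, policy):
    """Register both requests; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


# Hypothetical endpoint name, for illustration only.
target, policy = build_autoscaling_requests("example-endpoint")
```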

5) Monitoring and Maintenance

Having made it this far in the process, you can now sit back and enjoy the fruits of your labor, keeping a watchful eye on model performance with tools such as Amazon CloudWatch. CloudWatch allows you to monitor and troubleshoot ML applications running on AWS infrastructure. It can collect and track metrics, logs, and events from a variety of AWS resources and applications. In addition to your ML models, CloudWatch can even provide real-time monitoring of custom applications and services running in the cloud or on premises.
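
Beyond the metrics AWS services emit on their own, an application can publish its own model-quality numbers to CloudWatch. The sketch below builds the arguments for the boto3 put_metric_data call. The namespace Custom/MLModels, the metric name, and the endpoint name are hypothetical.

```python
def build_metric_data(endpoint_name, metric_name, value, unit="None"):
    """Build keyword arguments for CloudWatch's put_metric_data call."""
    return {
        "Namespace": "Custom/MLModels",  # hypothetical custom namespace
        "MetricData": [{
            "MetricName": metric_name,
            # Dimensions let you slice the metric per endpoint.
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Value": float(value),
            "Unit": unit,
        }],
    }


def publish_metric(payload):
    """Publish the metric; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_data(**payload)


# Hypothetical endpoint and metric, for illustration only.
payload = build_metric_data("example-endpoint", "ValidationF1", 0.87)
```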

Finally, AWS CloudTrail provides a record of actions taken by a user, role, or AWS service in your AWS account. It can capture all API calls and other events made by or on behalf of your AWS account and deliver the resulting log files to a specified S3 bucket.
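
Once those log files land in S3, they can be inspected with ordinary tooling. The sketch below assumes a file has already been downloaded and reads it as gzip-compressed JSON with a top-level Records list, then pulls out the SageMaker API calls. The sample records, ARN, and account number are fabricated for illustration.

```python
import gzip
import json


def sagemaker_events(log_bytes):
    """Return (time, caller ARN, event name) for SageMaker calls in one file."""
    records = json.loads(gzip.decompress(log_bytes))["Records"]
    return [
        (r["eventTime"], r.get("userIdentity", {}).get("arn"), r["eventName"])
        for r in records
        if r.get("eventSource") == "sagemaker.amazonaws.com"
    ]


# Fabricated log content standing in for a downloaded CloudTrail file.
sample = gzip.compress(json.dumps({"Records": [
    {
        "eventTime": "2023-04-05T12:00:00Z",
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "CreateEndpoint",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"},
    },
    {
        "eventTime": "2023-04-05T12:05:00Z",
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutObject",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"},
    },
]}).encode("utf-8"))

events = sagemaker_events(sample)
```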

Combined, these tools create a powerful view of model performance and usage that can proactively identify and address anomalies before they become issues.

Summary

AWS provides an extensive range of machine learning and data science solutions that have been successfully implemented across various industries. Whether it is a fully managed or standalone solution you are looking for, AWS offers a comprehensive suite of tools that can seamlessly integrate into any organization’s existing infrastructure, allowing them to efficiently and effectively deploy ML models to address diverse business needs.

To learn more about leveraging the power of AWS and the cloud to propel your data science capabilities, contact our AWS experts and start your journey today, or visit our Technology Services page for more information.

Luis Martinez is a Manager at RevGen Partners, specializing in identifying business improvement opportunities and their impacts on the Customer and Employee experience. He is passionate about empowering our clients to navigate challenges and enabling change.