Author: Joseph Romani
Now more than ever, harnessing the power of data to find insights about operations, products, and customers is key for an organization to stay ahead of the competition.
Azure is a cloud computing offering from Microsoft that provides a variety of tools to build data science solutions including data storage, cloud computing, and machine learning. It’s used by organizations across a range of industries including banking, insurance, healthcare, and manufacturing.
The main components of its data science capabilities are:
Azure Machine Learning Studio allows for rapidly building and training machine learning models. It offers notebooks for coding in Python, no-code automated machine learning, and a drag-and-drop interface that combines prebuilt machine learning steps with custom Python functions.
Azure Databricks is a data analytics platform built on Apache Spark. It supports Python, Scala, R, Java, and SQL, and includes several popular data science frameworks and libraries such as TensorFlow, PyTorch, scikit-learn, and Shiny.
Databricks also lets team members share notebooks to facilitate collaboration, and it supports Horovod, an open-source distributed training framework for Python neural network frameworks such as TensorFlow, Keras, and PyTorch. Horovod is especially useful for big data that exceeds the memory of a single machine.
Which one is right for your situation? This comes down to the size of the dataset and the computational requirements. Azure Databricks is better suited to handling data distributed across multiple nodes, but it is generally overkill if the data fits on a single machine or easily fits in a pandas DataFrame. While Azure Databricks may be less intuitive for new users, the savings in computing time on large datasets will often make up for it.
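The rule of thumb above can be sketched as a small helper. This is only an illustration: `suggest_platform` is a hypothetical function, and the `overhead_factor` of 3 (working memory needed relative to raw data size) is an assumed value, not an Azure or pandas guideline.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def suggest_platform(dataset_bytes, machine_memory_bytes, overhead_factor=3.0):
    """Suggest a platform based on whether the data comfortably fits in memory.

    overhead_factor is an assumed multiplier for the working memory a
    single-machine workflow needs on top of the raw dataset size.
    """
    if dataset_bytes * overhead_factor <= machine_memory_bytes:
        # Data fits on one machine: a single-node workflow is usually enough.
        return "Azure Machine Learning Studio"
    # Data exceeds one machine: distributed processing is the better fit.
    return "Azure Databricks"


small_case = suggest_platform(2 * GIB, 16 * GIB)
large_case = suggest_platform(500 * GIB, 16 * GIB)
print(small_case)
print(large_case)
```

In practice the decision also depends on the computational cost of training, which a size check like this does not capture.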
Azure also allows us to put those machine learning models to use. Once trained, a machine learning model can easily be deployed from either Azure Databricks or Azure Machine Learning Studio. In Azure Machine Learning Studio, simply register a previously trained model and deploy it as either a real-time web service or a batch endpoint.
Finally, use the endpoint to make predictions on your data. For a web service, we send a request to the URL that was created when we deployed the model; for a batch endpoint, we pass it the location of the data we want predictions for. This data can be stored either locally or in the cloud.
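As a rough sketch of the web service case, the request could be built with Python's standard library as shown here. The URL, the key, and the `{"data": ...}` payload shape are all placeholders and assumptions; the real input format depends on the scoring script deployed with the model.

```python
import json
import urllib.request


def build_scoring_request(scoring_url, api_key, rows):
    """Build (but do not send) a JSON POST request for a deployed model."""
    # Assumed payload shape; the deployed scoring script defines the real one.
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder URL and key, for illustration only.
request = build_scoring_request(
    "https://example.invalid/score", "<your-key>", [[5.1, 3.5, 1.4, 0.2]]
)

# To send it against a real endpoint:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.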
Azure has a rich tool set for machine learning operations. Models and datasets can be tracked, which is key for understanding variations in model performance. Tracking model and dataset history and lineage is also necessary for auditability.
In both Azure Databricks pipelines and Azure Machine Learning Studio, we can use MLflow, an open-source library for managing the machine learning experiment lifecycle, to track experiment runs and metrics. An experiment is a collection of training runs, while metrics are measures of how well the model fits the data, such as root mean squared error, accuracy, and recall.
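To make the metrics concrete, here is a minimal pure-Python sketch of the three mentioned above, computed on made-up values. The function names and sample data are illustrative; in a real project these would typically come from a library such as scikit-learn.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for a regression model."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive=1):
    """Fraction of true positives that the model correctly identified."""
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    actual_pos = sum(a == positive for a in actual)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

metrics = {
    "accuracy": accuracy(y_true, y_pred),
    "recall": recall(y_true, y_pred),
    "rmse": rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]),
}

# With MLflow installed, each value could be recorded inside a run using
# mlflow.log_metric(name, value).
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Logging the same set of metrics for every run is what makes runs within an experiment comparable.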
Azure also has tools to help ensure fairness in models, a critical element of regulatory compliance. Model interpretability tools allow for model transparency. Azure supports using the Fairlearn Python package to analyze models and investigate disparities in predictions and prediction performance across sensitive features in a dataset.
A sensitive feature may be age, ethnicity, gender, or anything that is a proxy for a protected class. Fairlearn provides an interactive dashboard widget that integrates with Azure Machine Learning Studio so it can be shared across a team.
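The kind of disparity described above can be illustrated with a hand-rolled calculation. This sketch does not use Fairlearn itself; it only shows the arithmetic behind one common measure, the gap in positive-prediction rates between groups. The function names, predictions, and group labels are made up for illustration.

```python
from collections import defaultdict


def selection_rate_by_group(predictions, sensitive_feature):
    """Share of positive predictions (label 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, sensitive_feature):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def selection_rate_gap(rates):
    """Difference between the highest and lowest group selection rates."""
    return max(rates.values()) - min(rates.values())


# Made-up predictions and group labels for illustration.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = selection_rate_by_group(preds, groups)
gap = selection_rate_gap(rates)
print(rates)
print(gap)
```

A gap near zero means the model selects members of each group at similar rates; a large gap is a signal to investigate further, not proof of unfairness on its own.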
In summary, Azure offers powerful tools for developing and deploying machine learning models while also managing machine learning operations. These services provide a data science jump start, so your engineers and data scientists aren’t starting from scratch. For organizations considering data science cloud options, Azure provides a rich suite of services that can scale and extend as your needs grow and evolve.
If you would like to learn more about leveraging the power of Azure and the cloud to propel your data science capabilities, contact our Azure experts to advise you on your journey or visit our Technology Services page for more information.
Joseph Romani is a Senior Technical Consultant at RevGen and a Microsoft Certified Data Scientist Associate with over seven years of experience in data science.