Authors: Adam Gao & Jesse Henson
The first step in operationalizing data is setting the right goals and asking the right questions. While aligning on goals can be deceptively difficult, the concept itself is often straightforward. However, many companies tend to neglect the seemingly ambiguous next step: understanding the viability of their data.
Gauging data maturity prior to doing deep technical work can save a lot of time and frustration down the road. Knowing the readiness of a business's data infrastructure will help determine:
- Ability to define mission critical key metrics
- Ability to extract insights
- Ability to produce predictive models which drive better decisions
Consider the following scenario:
The company uses a price forecasting model so the sales team can improve margin. However, after some time, the pipelines from which the data is sourced are changed. Ultimately, stakeholders must decide if rebuilding the model according to the new infrastructure is worth the projected returns.
Note in the above that the business problem is well defined (i.e., improve sales margin), and that the viability of the data infrastructure directly impacts sales performance. In this business case, evaluating data maturity is key to the success of the sales initiative.
How can we evaluate data maturity?
There are multiple aspects of data maturity to consider, including runtime considerations, feature engineering, and integration of sources. The biggest issues when it comes to data maturity and viability are usually related to data quality.
In evaluating the data, we want to address the following issues:
Quantify data quality with metrics: How much faith can the business have in the accuracy of its data?
Take a business with multiple siloed data sources. After analysts comb through many data sources to get a general picture of what’s available for use, stakeholders need a way to quantify the business readiness of the available data.
Clarify unexpected caveats: Are there data quality questions that need to be asked?
A naïve inspection of the data may miss special conditions set by external vendors. A standardized method for raising questions about these conditions should be applied initially to save hours of fumbling down the road.
Prioritize: Which parts of the data do I want to spend my limited time looking at?
Prioritization of data initiatives is often complicated by complex architecture and seemingly arbitrary configuration of pipelines. Part of a good data offering is to convert a messy landscape into a concise map with clear directions.
Simple initial data quality checks include:
Anomaly detection for numerical data
Flag rows whose values lie more than 3 standard deviations from the mean (for time series, this could be a rolling mean). Often, there will be rows of data where anomalies exist. Sometimes they are expected, and a business simply has a service provider that “just does things a certain way”. Other times, anomalies need to be flagged or removed. Either way, decisions on anomalous data are best made at the data source before processing is done further down the pipeline.
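The three-standard-deviation rule described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `flag_anomalies`, the `threshold` parameter, and the sample prices are assumptions made for the example, not part of any specific pipeline. For time series, the same comparison could be made against a rolling mean instead of the global one.

```python
from statistics import mean, stdev

def flag_anomalies(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be anomalous.
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# Hypothetical daily unit prices with one suspicious entry at the end.
prices = [10.0 + 0.1 * (i % 5) for i in range(30)] + [95.0]
anomalies = flag_anomalies(prices)
print(anomalies)
```

The flagged rows are returned rather than dropped, so the decision to keep, flag, or remove them stays with whoever owns the data source.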
Cardinality checks for non-numerical data
Evaluate whether a non-numerical feature has too many unique values compared to the number of instances. Measures may eventually be taken to reduce the unique count of the feature, or the feature may be dropped. Evaluating cardinality is one quick way to assess the viability of a feature early on, clarifying the viability of future data science offerings.
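One way to express this check is as a ratio of unique values to populated values per feature. The sketch below assumes rows are held as a list of dictionaries; the column names and the 0.5 cutoff are hypothetical, and a sensible cutoff will depend on the dataset.

```python
def cardinality_ratio(values):
    """Unique non-missing values divided by the non-missing count."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return len(set(present)) / len(present)

# Hypothetical rows: an identifier-like column and a category-like column.
rows = [
    {"order_id": "A1", "region": "west"},
    {"order_id": "A2", "region": "west"},
    {"order_id": "A3", "region": "east"},
    {"order_id": "A4", "region": "east"},
]

ratios = {col: cardinality_ratio([r[col] for r in rows]) for col in rows[0]}
# Assumed cutoff: flag features where more than half of the values are unique.
flagged = [col for col, ratio in ratios.items() if ratio > 0.5]
print(ratios, flagged)
```

A ratio close to 1 suggests the feature behaves like an identifier and is unlikely to be useful as a categorical input without further grouping.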
Text answer quality
Frequently, there are multiple versions of the same value for a text feature: e.g., ‘fixed\shared’ and ‘fixed shared’ both represent the same value but have different formats. A custom data mapping may be devised to clean this kind of data if the meaning of the text is clear. How repairable text answers are varies from vendor to vendor, and ultimately affects the viability of data solutions.
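A custom mapping of this kind might combine a general normalization step with an explicit alias table for cases that normalization cannot resolve. The following is a sketch under assumptions: the separator set, the `ALIASES` table, and the sample values are illustrative only.

```python
import re

# Hypothetical alias table for spellings that normalization alone cannot fix.
ALIASES = {"fxd shared": "fixed shared"}

def normalize_label(text):
    """Lowercase, turn common separators into spaces, collapse whitespace."""
    cleaned = re.sub(r"[\\/_\-]+", " ", text.strip().lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def clean_label(text):
    """Normalize a label, then apply the explicit alias mapping if one exists."""
    normalized = normalize_label(text)
    return ALIASES.get(normalized, normalized)

raw_values = ["fixed\\shared", "fixed shared", "Fixed  Shared ", "Fxd-Shared"]
cleaned_values = {clean_label(v) for v in raw_values}
print(cleaned_values)
```

Keeping the alias table separate from the normalization logic makes it easy to review with the vendor or business owner who knows what each spelling actually means.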
Report missing data
Empty cells of tabular data are a common occurrence, and a decision needs to be made on how rows with empty cells are handled. Are any particular variables necessary? A bird’s-eye view of the populated row count per feature is an immediate way to help evaluate the quality of data before proceeding further.
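A bird’s-eye view like this can be produced with a simple per-feature count. The sketch below treats `None`, empty strings, and absent keys as missing, which is an assumption; the column names and rows are hypothetical.

```python
def missing_report(rows, columns):
    """For each column, count populated cells and the percentage missing."""
    total = len(rows)
    report = {}
    for col in columns:
        # None, empty strings, and absent keys all count as missing here.
        filled = sum(1 for r in rows if r.get(col) not in (None, ""))
        missing_pct = round(100 * (total - filled) / total, 1) if total else 0.0
        report[col] = {"filled": filled, "missing_pct": missing_pct}
    return report

# Hypothetical rows with gaps in both columns.
rows = [
    {"price": 10.5, "vendor": "acme"},
    {"price": None, "vendor": "acme"},
    {"price": 12.0, "vendor": ""},
    {"price": 11.0},
]
report = missing_report(rows, ["price", "vendor"])
print(report)
```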
While simple data quality evaluations are critical, a deeper evaluation of data maturity should also include feature inventory and feature analysis:
- Does the company have the sources necessary to get dependent and independent variables?
- How robust is the company’s data architecture and how viable are data engineering offerings? Does the infrastructure viably support the extraction of critical dependent/independent variables?
- What dependent variable reflects the business case directly (e.g., churn metrics, revenue)? How is its data quality? Help businesses define their dependent variables.
- What variables influence the dependent variables? Which independent variables matter most (e.g., user metrics for customer experience)? How is the data quality of the independent variables?
- Does the data structure need to be changed, and can it be done in a scalable way? For example, if a data source has a variable number of columns (e.g., one column per month) that needs to be melted into rows, this might not be simple to scale for the business requirement.
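To make the last point concrete, melting means reshaping a wide table with one column per month into a long table with one row per month. The sketch below does this in plain Python for illustration; the field names and figures are hypothetical. For data already held in a pandas DataFrame, `DataFrame.melt` performs the equivalent reshape.

```python
def melt_months(rows, id_fields):
    """Reshape wide rows (one column per month) into long rows
    with one (month, value) pair per output row."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_fields}
        for key, value in row.items():
            if key not in id_fields:
                long_rows.append({**ids, "month": key, "value": value})
    return long_rows

# Hypothetical wide data: the second store has an extra month column.
wide = [
    {"store": "north", "2023-01": 120, "2023-02": 135},
    {"store": "south", "2023-01": 90, "2023-02": 80, "2023-03": 101},
]
long_rows = melt_months(wide, id_fields=["store"])
print(long_rows)
```

Because the function does not hard-code the month columns, new months appearing in the source do not require a code change, which is the kind of scalability the question above is probing.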
What would an ideal outcome of a data viability evaluation look like?
A “yes” or “no” to answer the question of viability, and whether the offering has a Return on Investment (ROI) worth pursuing
Stakeholders often wonder “can we do it?” when it comes to the business problem in mind. Perhaps significant resources have already been invested into the data infrastructure. How does the remaining required effort stand in relation to the ROI?
A sizing of the ROI
A good high-level view includes scope. In addition to a specific number, the major components that go into costs, revenue, and money saved should be sized for the potential initiative. For example, stakeholders will appreciate a report on the costs and projected returns of a server migration, broken down by component.
Qualitative list of top pain points in data quality
Many business problems are going to have unique challenges that don’t fit cleanly into a universal framework. Generally, an investigation into the configuration and sourcing of available data should identify data quality bottlenecks, which helps keep the roadmap to success clear and actionable.
Data engineering/Data quality requirements and suggestions
After an initial diagnosis of pain points, actions may be recommended as prerequisites for the business initiative. This may include quality-of-life changes such as database configuration suggestions. Having a clear description of specifications helps data teams form plans with confidence.
Data quality metrics organized by feature and data source
Data quality metrics around completeness, variance, and anomalies assist in the prioritization of data sources. Metrics serve as strong arguments for business decisions.
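As a rough illustration, completeness, variance, and anomaly counts could be collected into a small scorecard keyed by data source and feature. The source names, feature names, and values below are hypothetical, and the three-standard-deviation anomaly rule is an assumed choice.

```python
from statistics import mean, stdev, variance

def feature_metrics(values, threshold=3.0):
    """Completeness, sample variance, and anomaly count for one numeric feature."""
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values) if values else 0.0
    if len(present) < 2:
        # Variance is undefined for fewer than two observations.
        return {"completeness": completeness, "variance": None, "anomalies": 0}
    mu, sigma = mean(present), stdev(present)
    anomalies = sum(
        1 for v in present if sigma > 0 and abs(v - mu) > threshold * sigma
    )
    return {
        "completeness": completeness,
        "variance": variance(present),
        "anomalies": anomalies,
    }

# Hypothetical sources, each holding one numeric feature.
sources = {
    "crm": {"monthly_spend": [100.0, 110.0, None, 105.0]},
    "billing": {"invoice_total": [50.0, 50.0, 50.0, 50.0]},
}
scorecard = {
    (source, feature): feature_metrics(values)
    for source, features in sources.items()
    for feature, values in features.items()
}
for key, metrics in scorecard.items():
    print(key, metrics)
```

A feature with zero variance, such as the hypothetical `invoice_total` here, carries no signal for modeling even though it is fully populated, which is why the metrics are best read together rather than individually.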
Feature inventory categorized by dependent and independent variables
Before initiatives and data sourcing can be prioritized in relation to each other, having a map of the data landscape in list form helps prevent personnel from overlooking crucial information.
If the data is considered viable for the business problem at hand, and the ROI is worthwhile, the next step is to consider methods to engineer the data to specifically solve the identified problem.
Not every effort will end with the best-case scenario where the data fits viability criteria. No matter the case, the company will have learned more about its data and the issues at hand and will have a more organized technical roadmap for future initiatives.
Adam Gao is a data scientist at RevGen, passionate about executing quantitative solutions to empower real-world decision making.
Jesse Henson has a master’s degree in AI and machine learning, and has several years of experience in the data industry. He is passionate about shaping the future of data and AI technologies.