Build a Governed Data Lake

Increase data agility and improve business intelligence

Author: Andy Vold

The demand for instant results is prominent in business today – just like in every other corner of our lives. Businesses depend heavily on data, and new sources of data are coming online faster than ever before. So, how do we create an environment in today’s fast-paced, fast-changing world where users can gain business value via reporting and analytics? Build a Governed Data Lake.

Moving Beyond Traditional Business Intelligence

Let’s face it, traditional business intelligence (BI) systems take months – and sometimes years – to develop into a usable platform. While these platforms remain crucial to business reporting, they are often slow to change, they apply business logic to the source data before users ever see it, and they contain only a fraction of an organization’s overall data.

These days, we need greater flexibility and agility to succeed. Users need a data environment that enables exploration without having to wait on IT; if this data environment does not exist within the organization, analysts will start to create their own. Doing so can lead to a handful of issues, such as data redundancy across the organization and multiple software purchases as each department or user selects a tool that works for them. More importantly, it takes users away from doing their job: the important task of asking questions of the data.

Data Governance Framework

One very viable option for improving access to data is to have your IT department build and support a Governed Data Lake. The lake is an environment where users can access large amounts of raw data. It contains an enterprise-wide data catalog that assists users in finding the data they need based on business-term keyword searches. It provides the tools and a sandbox environment to explore and develop analytic models, and – if needed – move these models into a production scenario. Furthermore, it enables the reuse of data transformations and queries.
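
To make the catalog idea concrete, here is a minimal sketch of a business-term keyword search. It is an illustration only: the `CatalogEntry` structure, the `search_catalog` function, and the dataset names are all hypothetical, and a real catalog product would add ranking, synonyms, and access controls.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One dataset registered in a hypothetical enterprise data catalog."""
    name: str                   # physical dataset name in the lake
    description: str            # plain-language summary for business users
    business_terms: tuple = ()  # glossary terms tagged on the dataset


def search_catalog(entries, query):
    """Return the entries that match every keyword in the query.

    Matching is a simple case-insensitive substring test against the
    dataset name, description, and business terms.
    """
    keywords = query.lower().split()
    matches = []
    for entry in entries:
        searchable = " ".join(
            [entry.name, entry.description, *entry.business_terms]
        ).lower()
        if all(keyword in searchable for keyword in keywords):
            matches.append(entry)
    return matches


# Hypothetical catalog contents, for illustration only.
catalog = [
    CatalogEntry("crm_accounts_raw",
                 "Customer accounts exported nightly from the CRM",
                 ("customer", "account")),
    CatalogEntry("web_clickstream_raw",
                 "Raw page-view events from the public website",
                 ("customer", "web traffic")),
    CatalogEntry("gl_postings_raw",
                 "General ledger postings from finance",
                 ("revenue", "finance")),
]

for entry in search_catalog(catalog, "customer web"):
    print(entry.name, "-", entry.description)
```

The point of the sketch is that users search in business language ("customer", "revenue") and never need to know the physical dataset names in advance.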

The Governed Data Lake is not a replacement for traditional BI systems or an operational data store (ODS). Instead, the lake complements them by providing a safe, self-service environment that also promotes collaboration among business units.

Since today’s data sources vary considerably, the Governed Data Lake should be able to support unstructured data. The key to agility and flexibility is an environment that does not require up-front modeling work. You want to land the data quickly in the Governed Data Lake and allow users to structure it based on the requirements of their specific use cases. The Hadoop Distributed File System (HDFS) and open source Hadoop tools such as Sqoop and Hive can handle the work. However, these tools are not a fit for every organization. Other available options include virtualization and cloud-based solutions. The important part is that the technology fits your organization and works for your users.
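
The idea of landing raw data first and letting each user structure it later can be sketched in a few lines. The example below is only an illustration under stated assumptions: the sample records, the `read_with_schema` helper, and both schemas are hypothetical, and a list of JSON strings stands in for files landed in the lake.

```python
import json

# Raw events landed as-is: no up-front model, and the records are not uniform.
# (Hypothetical sample data standing in for files in the lake.)
RAW_LANDING_ZONE = [
    '{"user": "a17", "event": "login", "ts": "2024-01-05T08:00:00"}',
    '{"user": "b42", "event": "purchase", "amount": "19.99", "ts": "2024-01-05T08:03:10"}',
    '{"user": "a17", "event": "purchase", "amount": "5.00", "ts": "2024-01-05T09:15:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a use-case-specific structure when the data is read.

    `schema` maps an output column to a (source_field, converter) pair.
    Records that lack any required source field are skipped.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if not all(source in record for source, _ in schema.values()):
            continue
        rows.append({
            column: convert(record[source])
            for column, (source, convert) in schema.items()
        })
    return rows


# Two users structure the same raw data differently for their own use cases.
sales_schema = {"customer_id": ("user", str), "sale_amount": ("amount", float)}
activity_schema = {"customer_id": ("user", str), "action": ("event", str)}

sales = read_with_schema(RAW_LANDING_ZONE, sales_schema)
activity = read_with_schema(RAW_LANDING_ZONE, activity_schema)

print(len(sales), "sales rows;", len(activity), "activity rows")
```

Nothing about the raw data changed between the two reads. Only the structure applied to it did, which is why new use cases do not have to wait on a remodeling effort.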

Data Lake Implementation

There are different approaches to building the Governed Data Lake. Typically, a business will have use cases that haven’t been implemented. Find one or two of these use cases and start building the foundation. Pick the tools and implement the first set of data. Train the end users and let them start fishing. Over time, you can add more data sources and more users. It is important to deliver value early and to recruit users who will adopt the Governed Data Lake and support it going forward.

The need for data within your company is not going to stop, and users will continue to find this data on their own if IT does not supply it to them. Considering these inevitabilities, the Governed Data Lake – if built correctly and adopted – will prove to be a great benefit to your organization.
