Data Lake Basics

May 27, 2021
Praecipio

With Atlassian's upcoming release of Jira Data Lake for Jira Software Cloud, it's a good time to review the jargon we might stumble on in the reporting and business intelligence (BI) space. So let's jump into the (data) lake!

One word of caution: the BI industry has many players with varied opinions, and some terms get used and reused in multiple ways. One example is the emerging term "lakehouse," a combination of "data lake" and "data warehouse." Here we'll stick as close to canonical usage as possible, but expect to see terms used differently as you research.

Why does BI even matter? What are KPIs?

Your organization has systems (e.g. software applications) that create and contain data. That data is extremely valuable for fact-based decision making.

A CTO or CIO, for example, can allocate help desk head count more effectively with ready access to accurate metrics (also called Key Performance Indicators, or KPIs) like Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR). (Note: MTTR is a tricky acronym. As Atlassian notes, there are at least four common incident management metrics that share this abbreviation! This stuff can be confusing...)
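
To make that concrete, here's a minimal sketch of how MTTA and MTTR might be computed from raw ticket timestamps. The tickets, field names, and values below are invented for illustration; they are not Jira's actual data model.

```python
from datetime import datetime

# Hypothetical help desk tickets. The field names (created, acknowledged,
# resolved) and timestamps are invented for this example.
tickets = [
    {"created": "2021-05-01 09:00", "acknowledged": "2021-05-01 09:12",
     "resolved": "2021-05-01 11:30"},
    {"created": "2021-05-02 14:05", "acknowledged": "2021-05-02 14:45",
     "resolved": "2021-05-03 10:00"},
]

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def mean_minutes(pairs) -> float:
    """Average elapsed minutes across (start, end) timestamp pairs."""
    deltas = [(parse(end) - parse(start)).total_seconds() / 60
              for start, end in pairs]
    return sum(deltas) / len(deltas)

mtta = mean_minutes((t["created"], t["acknowledged"]) for t in tickets)
mttr = mean_minutes((t["created"], t["resolved"]) for t in tickets)
print(f"MTTA: {mtta:.0f} minutes, MTTR: {mttr:.0f} minutes")
```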

To provide these valuable, up-to-date KPIs to decision makers, we turn to BI. This industry is a dizzying array of technology components which take various approaches to achieving BI's primary objective: turning raw data into actionable insight. Often, we need to integrate multiple BI components to get from point A (data in the source system) to point B (reports used for decision making).

BI solutions often leverage a data lake or data warehouse to store business data.

What is a data lake?

A data lake is a central store of raw business data. The data lake is not typically used by the source systems whose data it contains.

The lake is designed to be accessed by tools like Tableau, PowerBI, and Qlik, which analyze the data and produce insights from it. We'll call these analysis and presentation applications "BI tools." To continue the lake analogy: if the BI tool is a fishing rod, then the data is the fish.

A data lake typically uses a file store technology, but for Jira Data Lake we don't need to know much about the underlying tech: Atlassian Cloud takes care of choosing, configuring, hosting, and maintaining it for us. One less thing on our plate? Great!

All we need to do is connect our BI analysis and presentation tools (Tableau, PowerBI, Qlik, etc.) to Jira Data Lake. Boom! We're ready to start creating reports, graphs, dashboards, and whatever else we need to answer questions for our organization.
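
As a rough illustration of what a programmatic query against the lake might look like, assuming it is exposed through a SQL-compatible endpoint (the connection string, table, and column names below are hypothetical, not Jira Data Lake's actual schema):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- the real host, credentials, and
# driver come from whatever endpoint the lake exposes to your BI tooling.
engine = create_engine("postgresql://bi_user:secret@datalake.example.com/jira")

# Illustrative table and column names, not Jira Data Lake's actual schema.
issues = pd.read_sql(
    "SELECT project_key, status, created_date "
    "FROM issues WHERE created_date >= '2021-01-01'",
    engine,
)
print(issues.groupby(["project_key", "status"]).size())
```

In practice, BI tools like the ones named above handle this connection through point-and-click configuration rather than code.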

How is a lake different from data warehousing?

As mentioned earlier, some BI solutions use a data warehouse instead of a data lake, and some use both. While the line between the two has blurred, lakes are usually less structured than warehouses.

The initial data lake concept encouraged organizations to dump all of their raw data into the lake, including data from relational databases, flat files (e.g. CSV files), videos, and more. The promise was that smart software and ever-increasing computing horsepower would eventually make sense of the overwhelming amount of data in the lake. That promise hasn't come to fruition quickly enough, and many data lakes turned into data swamps. Lakes these days, like Jira Data Lake, are more purpose-built and better designed to prevent a descent into swampland.

A data warehouse is more structured, normally designed with transformation processes on the front and/or back end that clean, normalize, and otherwise standardize the data before presenting it to our BI tools. These processes are the "T" (Transform) in two more acronyms: ETL (Extract, Transform, Load) and ELT. The result is more predictable and accurate, but the cost and time to create these transformation processes are much higher.
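
Here's a toy version of that "T" step in Python, assuming a CSV extract with illustrative file and column names. Real transformation pipelines are far more involved, but the shape is the same:

```python
import os
import pandas as pd

# Extract: read a raw export. The file path and column names are illustrative.
raw = pd.read_csv("helpdesk_export.csv")

# Transform: the cleanup the "T" stands for -- standardize column names,
# normalize inconsistent status labels, and coerce timestamps into one
# canonical format, dropping rows that can't be parsed.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["status"] = raw["status"].str.strip().str.title().replace({"Done": "Resolved"})
raw["created"] = pd.to_datetime(raw["created"], errors="coerce")
clean = raw.dropna(subset=["created"])

# Load: write the standardized result to the warehouse's staging area.
os.makedirs("warehouse/staging", exist_ok=True)
clean.to_parquet("warehouse/staging/helpdesk.parquet", index=False)
```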

Why use a data lake?

Why invest in this effort to centralize data in lakes or warehouses? Our BI tools can often connect directly to our application's database. Wouldn't it be easier to skip the lake/warehouse?

Eliminating the data lake or warehouse would simplify our solution design, but experience has shown multiple issues with the direct-connect approach.

The most critical issue is often the load a BI tool can place on an application database. BI queries often require large swaths of data that can only be fulfilled by heavy workloads on the database, and BI tools often don't optimize their queries for performance. These workloads can cause database contention, and application stability should always be prioritized over BI needs. With today's easy-to-use BI tools accessible to a larger and less technical audience, the issue has only become more prevalent. Connecting our BI tools to a data lake instead removes this risk to the application entirely.

The next most common issue we see is the need to combine data from multiple systems. Since your organization doesn't use just one system, combining data across the organization is where many of the most powerful insights come from. For example, tying Jira KPIs to financial data is one way leaders can more easily understand technical metrics. But financial data is stored in the accounting system, not Jira. A direct connection to an application's database only allows access to that system's data, preventing cross-system analysis. While some BI tools allow you to perform "cross-database joins," performance is often unacceptable and some links are simply not possible. Data from different systems usually needs to be cleaned and standardized before it can be linked for analysis, and doing this in a data lake/warehouse is far more efficient than attempting it "at runtime" in BI tools. Once we centralize our data, we can combine data from as many systems as needed.
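
A sketch of that cross-system link, with a hypothetical shared "team" key and made-up values; the point is that the standardization happens once, centrally, rather than at runtime in each report:

```python
import pandas as pd

# Two extracts already landed in the lake. The shared "team" key and all
# field values are hypothetical.
jira = pd.DataFrame({
    "team": ["platform", "mobile"],
    "mttr_hours": [6.5, 11.2],
})
finance = pd.DataFrame({
    "team": ["Platform ", "MOBILE"],  # raw finance labels arrive messy
    "quarterly_cost": [420_000, 310_000],
})

# Standardize the join key once, centrally, instead of inside every report.
finance["team"] = finance["team"].str.strip().str.lower()

combined = jira.merge(finance, on="team")
print(combined)  # MTTR and cost, side by side, per team
```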

BI is all about trends over time, yet some applications don't maintain much, if any, historical data. A direct connection to these systems rules out time-based analysis: the historical data simply doesn't exist. Lakes let us snapshot data at regular intervals, building the history needed for valuable trend analysis.
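
The snapshot pattern is simple at its core. Here's a minimal sketch with illustrative paths and field names, where a scheduled job captures the current state into a date-stamped file:

```python
import os
from datetime import date
import pandas as pd

def snapshot_issues(issues: pd.DataFrame, lake_root: str = "lake/issues") -> str:
    """Write today's state of an issues table to a date-stamped file so
    later analysis can compare snapshots, even if the source system
    only keeps current state."""
    os.makedirs(lake_root, exist_ok=True)
    path = f"{lake_root}/snapshot_{date.today().isoformat()}.parquet"
    issues.to_parquet(path, index=False)
    return path

# Scheduled daily (cron, a workflow tool, etc.), each run adds one file;
# a month of runs yields thirty points on an "open issues over time" chart.
```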

Finally, with cloud apps like Jira Cloud, we don't have the option to connect directly to the application database at all. Often the only data access is through APIs, which can be slow for analysis and suffer from many of the same issues mentioned above. Jira Data Lake provides performant, safe data access.
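
For a sense of what API-based access involves, here's a simplified sketch that pages through Jira Cloud's issue search REST endpoint (the site URL and credentials are placeholders). Each round trip fetches at most one batch, which is exactly the kind of slow, chatty access the data lake spares us:

```python
import requests

BASE = "https://your-site.atlassian.net"  # placeholder site URL
AUTH = ("user@example.com", "api-token")  # placeholder credentials

def fetch_all_issues(jql: str) -> list:
    """Page through Jira Cloud's issue search endpoint one batch at a time."""
    issues, start = [], 0
    while True:
        resp = requests.get(
            f"{BASE}/rest/api/3/search",
            params={"jql": jql, "startAt": start, "maxResults": 100},
            auth=AUTH,
        )
        resp.raise_for_status()
        page = resp.json()
        issues.extend(page["issues"])
        start += len(page["issues"])
        if not page["issues"] or start >= page["total"]:
            break
    return issues

# Every 100 issues costs a round trip -- fine for small pulls, painful
# for the large scans BI analysis tends to need.
```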

Data lakes arose from the need for flexibility. No two organizations use the same systems or have the same data needs, and your organization's data needs will also change over time. A direct connection to an application database is too tightly coupled and doesn't offer enough agility to deliver BI insights.

If you're wondering whether this powerful new tool is a good fit for your organization, or have any questions about anything Atlassian, contact us; one of our experts would love to help!

Put Your Atlassian Tools to Work

Optimize Your Atlassian Stack with Praecipio