
Data Lake Basics

By Kye Hittle on May 27, 2021 9:02:00 AM


Terms and Themes

With Atlassian's upcoming release of Jira Data Lake for Jira Software Cloud, it's a good time to review the jargon we might stumble on in the reporting and business intelligence (BI) space. So let's jump into the (data) lake!

One word of caution: the BI industry has many players with varied opinions, and some terms get used and reused in multiple ways. One example is the emerging use of "lakehouse," a combination of "data lake" and "data warehouse." Here we'll stick as close to canonical usage as possible, but expect to see terms used differently as you research.

Why does BI even matter? What are KPIs?

Your organization has systems (e.g. computer applications) that create and contain data. That data is extremely valuable for fact-based decision making.

A CTO or CIO is able to more effectively allocate help desk head count with ready access to accurate metrics (also called Key Performance Indicators, or KPIs) like Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR). (Note: MTTR is a tricky acronym. As Atlassian notes, there are at least four common incident management metrics that share this abbreviation! This stuff can be confusing...)
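
To make these KPIs concrete, here is a minimal sketch in Python of how MTTA and MTTR (read here as Mean Time To Resolve) could be computed from incident timestamps. The records and field names (created, acknowledged, resolved) are hypothetical and are not taken from Jira's data model.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"created": "2021-05-03T09:00:00", "acknowledged": "2021-05-03T09:10:00", "resolved": "2021-05-03T11:00:00"},
    {"created": "2021-05-04T14:00:00", "acknowledged": "2021-05-04T14:20:00", "resolved": "2021-05-04T15:00:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Return the elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mean_minutes(records, start_field: str, end_field: str) -> float:
    """Average the elapsed minutes between two fields across all records."""
    return mean(minutes_between(r[start_field], r[end_field]) for r in records)


mtta = mean_minutes(incidents, "created", "acknowledged")  # Mean Time To Acknowledge
mttr = mean_minutes(incidents, "created", "resolved")      # Mean Time To Resolve
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```

Swapping the end field is all it takes to switch between the two metrics, which is also why it pays to agree on which "MTTR" a report means before anyone reads it.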

To provide these valuable, up-to-date KPIs to decision makers, we turn to BI. This industry is a dizzying array of technology components which take various approaches to achieving BI's primary objective: turning raw data into actionable insight. Often, we need to integrate multiple BI components to get from point A (data in the source system) to point B (reports used for decision making).

BI solutions often leverage a data lake or data warehouse to store business data.

What is a data lake?

A data lake is a central store of raw business data. The data lake is not typically used by the source systems whose data it contains.

The lake is designed to be accessed by tools like Tableau, PowerBI, and Qlik in order to analyze and produce insights from the data. We'll call these analysis and presentation applications "BI tools." To continue the lake analogy: if the BI tool is a fishing rod, then the data is the fish.

A data lake typically uses a file store technology, but when it comes to Jira Data Lake, we don't really need to know much about the underlying tech because Atlassian Cloud takes care of choosing, configuring, hosting, and maintaining it for us. One less thing on our plate? Great!

All we need to do is connect our BI analysis and presentation tools (Tableau, PowerBI, Qlik, etc.) to Jira Data Lake. Boom! We're ready to start creating reports, graphs, dashboards, and whatever else we need to answer questions for our organization.

How is a lake different from data warehousing?

As mentioned earlier, some BI solutions use a data warehouse instead of a data lake. Some use both. While the line between the two has blurred, lakes are usually less structured than warehouses.

The initial data lake concept encouraged organizations to dump all of their raw data into the lake, including data from relational databases, flat files (e.g. CSV files), videos, and more. The promise was that smart software and ever-increasing computing horsepower would eventually make the overwhelming amount of data in the lake accessible. That promise hasn't been fulfilled quickly enough, and many data lakes turned into data swamps. Lakes these days, like Jira Data Lake, are more purpose-built and better designed to prevent a descent into swampland.

A data warehouse is more structured and is normally designed with transformation processes on the front end and/or back end that clean, normalize, and otherwise standardize the data before presenting it to our BI tools. These processes are represented by the "T" (Transform) in two more acronyms: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). The result is more predictable and accurate, but the cost and time to create these transformation processes are much higher.
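
As a rough illustration of the "T" step, the Python sketch below cleans a few raw rows before they would be loaded for reporting. The rows, field names, and date formats are made up for this example. Real transformation pipelines handle far more cases.

```python
from datetime import datetime

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [
    {"key": " OPS-1 ", "status": "done", "created": "05/03/2021"},
    {"key": "OPS-2", "status": " In Progress", "created": "2021-05-04"},
    {"key": "", "status": "Done", "created": "2021-05-05"},  # no key: unusable
]

# Date formats we assume the source systems emit (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text: str):
    """Try each known format and return a date, or raise if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(rows):
    """Clean, normalize, and standardize rows; drop rows without a key."""
    cleaned = []
    for row in rows:
        key = row["key"].strip()
        if not key:
            continue
        cleaned.append({
            "key": key,
            "status": row["status"].strip().upper(),
            "created": parse_date(row["created"]).isoformat(),
        })
    return cleaned


clean_rows = transform(raw_rows)
for row in clean_rows:
    print(row)
```

Even this toy version hints at where the cost comes from: every new source format or data-quality quirk means another rule to write and maintain.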

Why use a data lake?

Why invest in this effort to centralize data in lakes or warehouses? Our BI tools can often connect directly to our application's database. Wouldn't it be easier to skip the lake/warehouse?

Eliminating the data lake or warehouse would simplify our solution design, but experience has shown multiple issues with the direct-connect approach.

The most critical issue is often the load a BI tool can place on an application database. BI queries often require large swaths of data, which can only be retrieved through heavy workloads on the database. In addition, BI tools often don't optimize queries for performance. BI workloads can cause database contention, and application stability should always be prioritized over BI needs. With today's easy-to-use BI tools accessible to a larger and less technical audience, this issue has only become more prevalent. Connecting our BI tools to a data lake instead keeps that load off the application database.

The next most common issue we see is the need to combine data from multiple systems. Your organization doesn't use just one system, and many of the most powerful insights come from combining data across the organization. For example, tying Jira KPIs to financial data is one way leaders can more easily understand technical metrics. But financial data is stored in the accounting system, not Jira. A direct connection to an application's database only allows access to that system's data, preventing cross-system data analysis. While some BI tools allow you to perform "cross-database joins," performance is often unacceptable and some links are just not possible. Often the data from different systems needs to be cleaned and standardized before it can be linked for analysis. Doing this in a data lake/warehouse is far more efficient than attempting it "at runtime" in BI tools. When we centralize our data first, we can combine data from as many systems as needed.
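
Here is a small Python sketch of that idea: two extracts, one from Jira and one from an accounting system, are standardized on a shared team name and then linked. The team names, cost centers, and figures are invented for illustration.

```python
# Hypothetical extracts from two systems; names and figures are made up.
jira_kpis = [
    {"team": "Help Desk", "resolved_issues": 120},
    {"team": "platform ", "resolved_issues": 45},
]
accounting_costs = [
    {"cost_center": "HELP DESK", "monthly_cost": 24000},
    {"cost_center": "Platform", "monthly_cost": 18000},
]


def normalize(name: str) -> str:
    """Standardize a name so both systems agree: collapse spaces, lowercase."""
    return " ".join(name.split()).lower()


# Index the accounting data by the standardized name.
costs_by_team = {
    normalize(row["cost_center"]): row["monthly_cost"] for row in accounting_costs
}

# Link each Jira row to its cost and derive a cross-system KPI.
cost_per_issue = {}
for row in jira_kpis:
    team = normalize(row["team"])
    if team in costs_by_team and row["resolved_issues"]:
        cost_per_issue[team] = costs_by_team[team] / row["resolved_issues"]

print(cost_per_issue)
```

Without the normalize step, "platform " and "Platform" would never match, which is the kind of cleanup that is much cheaper to do once in the lake or warehouse than in every report.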

BI is all about trends over time. Some applications don't maintain much, if any, historical data. A direct connection to these systems doesn't allow for time-based analysis. The historical data simply doesn't exist. Lakes allow us to snapshot data at regular intervals in order to perform valuable time-based analysis.
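
The snapshot idea can be sketched in a few lines of Python. We assume a source system that only reports the current number of open issues, and we record that number on a schedule so a trend can be calculated later. The dates, counts, and the take_snapshot helper are hypothetical.

```python
from datetime import date

# Stand-in for a table in the lake that accumulates one row per snapshot.
snapshots = []


def take_snapshot(on: date, open_issues: int) -> None:
    """Record the value the source system reports today, with its date."""
    snapshots.append({"date": on.isoformat(), "open_issues": open_issues})


# In practice a scheduler would call this at a regular interval.
take_snapshot(date(2021, 5, 3), 40)
take_snapshot(date(2021, 5, 10), 46)
take_snapshot(date(2021, 5, 17), 37)


def changes(history):
    """Return (date, change since previous snapshot) for each later snapshot."""
    return [
        (later["date"], later["open_issues"] - earlier["open_issues"])
        for earlier, later in zip(history, history[1:])
    ]


weekly_change = changes(snapshots)
print(weekly_change)
```

The source system never stored last week's count, so the trend exists only because the snapshots were kept.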

Finally, with cloud apps like Jira Cloud, we don't have the option to connect directly to the application database. Often the only data access is through APIs, which can be slow for analysis and suffer from many of the same issues mentioned above. Jira Data Lake provides performant, safe data access.

Data lakes arose from the need for flexibility. No two organizations use the same systems or have the same data needs. Your organization's data needs will also change over time. A direct connection to an application database is too tightly coupled to offer the agility that BI requires.

If you're wondering if this powerful new tool is a good fit for your organization, or have questions about anything Atlassian, contact us. One of our experts would love to help!


Get early access to Atlassian Data Lake for Jira Software

By Kye Hittle on Apr 23, 2021 2:00:00 PM


What's a data lake?

Read up on the basics in our explainer.

At Praecipio Consulting we understand that the data contained within your Atlassian tools is a critical asset for your organization. To help customers more easily access their Jira data, Atlassian has developed Data Lake! As of March 2021, Data Lake is available to preview in Jira Software Cloud Premium and Enterprise.

Warning! Beta software should not be used for production purposes. Breaking changes are likely as Atlassian tweaks this functionality based on user feedback. Not all Jira data is currently available and permission levels are limited, but Atlassian is working quickly through its roadmap. In addition, only English field names are available for now. Any information presented here is therefore subject to change.

Data Lake allows you to quickly connect the best-in-class business intelligence (BI) tools you've already invested in to query the lake directly.

Compatible BI Tools include:

  • Tableau
  • PowerBI
  • Qlik
  • Tibco Spotfire
  • SQL Workbench
  • Mulesoft
  • Databricks
  • DbVisualizer


Data Lake uses the JDBC standard supported by many BI vendors. Supporting an open standard provides tremendous flexibility and power in reporting on your Jira projects.

Once you've identified the components of your BI solution, you'll follow three basic setup steps:

  1. Configure the JDBC driver
  2. Connect your BI tool(s)
  3. Navigate the Jira data model

You'll need your org_id and an API token for your Jira Cloud instance. Except for creating an API token (if you haven't already), there's no config required within your Jira instance. There are instructions for connecting to various BI tools in the Atlassian community Data Lake Early Access group. In addition, you'll find posts and diagrams to assist in answering business questions using Jira's data model.

If you're a Premium or Enterprise customer and would like to join the Early Access Program for Data Lake, complete this form to request access. You can also post questions and feedback for the devs in this group.

Are you interested in unlocking the power of data stored in your Atlassian tools? We're a Platinum Atlassian partner with years of experience helping customers leverage their Atlassian investment for even more value, so get in touch!

