The True Cost of Data Storage

Mar 11, 2020 9:00:00 AM

Technology continues to increase the efficiency of our everyday lives. Take light bulbs, for instance. In my short life, a 60W incandescent bulb has been reduced to a 9W LED bulb. Eventually, technology reaches the point of affordability, which in turn increases the demand for the more efficient product.

Efficiency & Consumption

Efficiency gains lead to more consumption of a resource, as illustrated in the graph below depicting Jevons paradox.

Figure 1: Jevons Paradox 

I see Jevons paradox at play in the size of Atlassian's customers' home directories. The often-mistaken idea that "storage is cheap" is a common excuse to forgo storage diligence. "Hey, just get more storage," they say. Data hoarding (currently 2.5 quintillion bytes of data created per day!) extends far beyond the realm of Jira and Confluence, which are just two of the many places where we collect and store our data treasures. I’ve thought a lot about the business impact of storing all of that data, and most recently I have been contemplating its environmental impact as well (which I will get into later).

What Is Your Data Growth Rate?

The thing about year-over-year data growth is that it can't continue indefinitely when it consumes finite resources, and the biggest limiting factor is disk access speed. For example, we want our Jira data to be quickly accessible, but as data accumulates and takes up space, disk access slows down. Everyone expects technology to save the day when the status quo runs out, and there are some really interesting new ideas for storing information, like storing data in DNA. Regardless, the growth rate of our datasets is outpacing our ability to store them.

With growth, we focus on doubling periods, and you may know the rule of thumb that a doubling period ≈ 70 / (growth rate in percent). So, if your 401k grows at 7% per year, it will double in about 10 years, and if it grows at 35%, it'll double in about two years. This works in your favor when you're making money, but not when you're spending it. Another important thing to note is that the value after each doubling is greater than the sum of all the values that came before it:
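To make the doubling-period arithmetic concrete, here is a minimal Python sketch. The function names are my own, and the "exact" version assumes simple compound growth per period. It compares the rule-of-70 shortcut with the compound-growth calculation for the two rates mentioned above.

```python
import math


def doubling_time_rule_of_70(growth_percent: float) -> float:
    """Approximate number of periods to double, using the rule of 70."""
    return 70.0 / growth_percent


def doubling_time_exact(growth_percent: float) -> float:
    """Number of periods to double under compound growth."""
    return math.log(2) / math.log(1 + growth_percent / 100.0)


for rate in (7, 35):
    approx = doubling_time_rule_of_70(rate)
    exact = doubling_time_exact(rate)
    print(f"{rate}% growth: rule of 70 = {approx:.1f}, exact = {exact:.1f}")
```

The shortcut is close at modest growth rates and drifts at high ones, so treat it as a back-of-the-envelope estimate rather than a forecast.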

n | Total (2^n) | Sum of all that have ever been
0 | 1 | 1
1 | 2 | 3
2 | 4 | 7
3 | 8 | 15

Figure 2: Doubling value is greater than the sum of all previous values

The doubling quantity is greater than the total of all of the values that came before it (2^3 > 2^2 + 2^1 + 2^0, or 8 > 4 + 2 + 1), which means that in order to continue growing, one will need to consume more than ever before with each doubling period.
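The same property can be checked for more doubling periods with a few lines of Python. This is only an illustrative sketch and `doubling_table` is a name I made up for it. Each row holds the period n, the value 2^n, and the running sum of every value so far.

```python
from typing import List, Tuple


def doubling_table(periods: int) -> List[Tuple[int, int, int]]:
    """Return (n, 2**n, sum of 2**0 .. 2**n) for n = 0 .. periods - 1."""
    rows = []
    running_sum = 0
    for n in range(periods):
        total = 2 ** n
        running_sum += total
        rows.append((n, total, running_sum))
    return rows


rows = doubling_table(10)
for n, total, running_sum in rows:
    print(n, total, running_sum)

# Each new value exceeds the sum of everything before it by exactly 1.
for (_, _, previous_sum), (_, total, _) in zip(rows, rows[1:]):
    assert total == previous_sum + 1
```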

How is Your Data Serving You?

In my opinion, our customers overvalue their data and you probably do too. This is a result of habit-forming applications and people valuing their work more than that of others. Stop reading for a moment and ask yourself, "What data am I storing, and what has it done for me lately?"

For example, your Jira instances have been around for longer than a few sprints and most of your issues are closed, but you keep them anyway. After several years, Jira ends up filled with closed or abandoned issues, which requires performance tuning and even more hardware to keep scaling. Some of that need to perform at scale exists because you have big problems to solve, but not all of your issues necessarily bring you value. (We'd be happy to help you with scaling; difficult problems are a good use of expert consultants.)

The overwhelming majority of your issues are closed. They will never be looked at, and they will never serve you. However, they do cost you real money. Here's where you say, "But when I need to look back at that one thing, then it'll be the most important issue we have." Will it? Are stories from sprints four years ago serving you in the present? If you are not mindful of the data that you are holding onto, then things get cluttered and the quality of your data significantly diminishes. Eventually, your data becomes the proverbial needle in the haystack: the more hay you store, the less likely you are to find the needle lost within it.

You can’t foresee how future technologies will utilize old data, but that does not justify the cost of keeping data you’ll probably never use. The real costs of data hoarding add up quickly in the form of:

  • More complex software features

  • Bigger, faster, and more servers

  • Need to purchase additional storage

  • Expensive engineers to squeeze out ever-diminishing returns

Ultimately, our systems suffer because they’re expected to perform optimally while storing an enormous amount of old data. All of the computing power in the world will never be able to outrun the pace of exponential growth.

The Cost of Your Data

Data hoarding results in real costs, both financial and environmental. Making our data centers more efficient only drives higher consumption. Increased disk density and speed only encourage us to store more data. Only we, the human beings who fear the ramifications of the “delete” button, can decide whether what we store justifies its cost.

Take a look at the environmental impact that data storage can cause:

  • "In its 2013 sustainability report, Facebook stated its data centers used 986 million kilowatt-hours of electricity—around the same amount consumed by Burkina Faso in 2012." All of those data stores are probably 60% pictures of people's pets and 40% comment threads of people arguing with your aunt across the country. Again, low-value stuff.
  • "A 2015 report found that data centers and their massive energy consumption are responsible for about 2 percent of global greenhouse gas emissions, putting them on par with the aviation industry." Given my claim that most of this data no longer serves a purpose in active systems (not backups or other low-power media), holding on to it is comparable to flying empty airplanes around just so people can look for the neat, fluffy line across the sky.

Marie-Kondo Your Data

A general rule of thumb says that if you occasionally find yourself searching for something that you recently got rid of, then you are doing the right amount of purging. I would advocate for doing something similar with your data. If you want a softer approach, then archive old data into AWS Glacier or some other accessible and affordable storage, and set a reminder to delete it later. If you haven't looked at that data in six months, it’s likely that you’ll never need it again. Trust your gut on this one; it won't steer you in the wrong direction.
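As a starting point for the six-month rule, the sketch below lists files that have not been modified in roughly 180 days. It is an illustration under assumptions rather than a recommended tool. `stale_files` is a hypothetical helper, it uses modification time as a stand-in for "looked at" (access times are often not recorded reliably), and it only reports candidates without archiving or deleting anything. The demo runs against a throwaway temporary directory with made-up file names.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Iterator, Optional, Tuple

SECONDS_PER_DAY = 86_400


def stale_files(root: str, max_age_days: int = 180,
                now: Optional[float] = None) -> Iterator[Tuple[Path, int]]:
    """Yield (path, size_in_bytes) for files not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            info = path.stat()
            if info.st_mtime < cutoff:
                yield path, info.st_size


# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as root:
    old = Path(root, "export-2019.zip")
    new = Path(root, "export-today.zip")
    old.write_bytes(b"x" * 2048)
    new.write_bytes(b"x" * 1024)
    a_year_ago = time.time() - 365 * SECONDS_PER_DAY
    os.utime(old, (a_year_ago, a_year_ago))  # backdate the old file

    for path, size in stale_files(root):
        print(f"archive candidate: {path.name} ({size} bytes)")
```

Pointing `stale_files` at a real directory gives you a review list; what happens to those files afterward is still a human decision.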

Attachments and logs usually take up the most space, and you can use the handy tool logrotate to keep your log directories lean. Explore your home and shared home directories for the worst offenders that are clogging up your storage. 
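To find the worst offenders, something like the following Python sketch can rank the immediate subdirectories of a home directory by total size. The function names and the demo directory names are placeholders of mine, not anything Jira or Confluence ships. It sums apparent file sizes, so the numbers may differ slightly from what the filesystem reports as disk usage.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def directory_sizes(root: str) -> Dict[str, int]:
    """Map each immediate subdirectory of root to the total bytes beneath it."""
    sizes: Dict[str, int] = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                try:
                    if not os.path.islink(full_path):
                        total += os.path.getsize(full_path)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[entry.name] = total
    return sizes


def worst_offenders(root: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` largest subdirectories, biggest first."""
    ranked = sorted(directory_sizes(root).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Demo on a throwaway directory; the subdirectory names are placeholders.
with tempfile.TemporaryDirectory() as home:
    for name, size in (("attachments", 5000), ("log", 1200), ("plugins", 300)):
        os.makedirs(os.path.join(home, name))
        with open(os.path.join(home, name, "sample.bin"), "wb") as handle:
            handle.write(b"\0" * size)
    for name, size in worst_offenders(home, top=2):
        print(f"{name}: {size} bytes")
```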

Custom integrations are another source of inefficiency in large instances. It can get so bad that the standard recommendation is to relegate REST traffic to a single Data Center node so that humans don't have to suffer the performance impact. Scripts using the REST API are notoriously inefficient and poll far too often in order to get a pseudo-real-time user experience. Monitor your access logs and work with your developers to encourage them to be better API consumers. Event-based architectures are more efficient and provide higher-quality data.
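One rough way to monitor access logs for heavy REST consumers is to count REST requests per user. The sketch below assumes a Common Log Format style line (host, identity, user, timestamp, quoted request). Your actual access log layout may differ, so the regular expression is an assumption to adapt, and the sample lines and user names are invented for the demo.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed Common Log Format style line; adjust the pattern to your own logs.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"'
)


def rest_requests_per_user(lines: Iterable[str]) -> Counter:
    """Count requests whose path contains /rest/, grouped by user."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "/rest/" in match.group("path"):
            counts[match.group("user")] += 1
    return counts


# Made-up sample lines; "build-bot" and "alice" are hypothetical users.
SAMPLE_LOG = [
    '10.0.0.5 - build-bot [11/Mar/2020:09:00:01 +0000] '
    '"GET /rest/api/2/search?jql=project%3DOPS HTTP/1.1" 200 512',
    '10.0.0.5 - build-bot [11/Mar/2020:09:00:02 +0000] '
    '"GET /rest/api/2/search?jql=project%3DOPS HTTP/1.1" 200 512',
    '10.0.0.9 - alice [11/Mar/2020:09:00:03 +0000] '
    '"GET /browse/OPS-1 HTTP/1.1" 200 2048',
]

for user, hits in rest_requests_per_user(SAMPLE_LOG).most_common():
    print(f"{user}: {hits} REST requests")
```

An account that repeats the same search every second is a good candidate for a conversation about polling less often or switching to an event-based approach.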

Here are some ways that you can do a data purge in Jira and Confluence:

Confluence

Apps like ViewTracker provide insight into which content is actually used. With this tool, you can archive or, better yet, delete spaces that are unused and no longer relevant.

Jira

Closed issues, completed projects, and anything that is not active or still "warm" (e.g. items dating back to previous reorganizations) are unlikely to have any real value and should be archived or, better yet, deleted.

Thank you for making it this far. Now, take a deep breath, and let go of your attachments.

 

Resources:

(Fig 1) https://en.wikipedia.org/wiki/Jevons_paradox

(1) https://www.theatlantic.com/technology/archive/2015/12/there-are-no-clean-clouds/420744/

(2) https://www.youtube.com/watch?v=O133ppiVnWY

(3) https://www.youtube.com/watch?v=F8ZJCtL6bPs

(4) http://www.mnforsustain.org/bartlett_arithmetic_presentation_long.htm

(5) https://www.mic.com/p/the-environmental-impact-of-data-storage-is-more-than-you-think-its-only-getting-worse-18017662

 

Written by Christopher Pepe
