Learn how to take a more interpersonal approach to the DevOps journey. DevOps is a journey. Often, companies will place a strong emphasis on tools, automation, and making big changes - fast. To start the DevOps journey, teams don't need to begin with the tools or automation, or even with making big changes. The key is to start small and with an interpersonal approach. Chief Technology Officer Christopher Pepe will take us through a more interpersonal approach to DevOps, and how teams can begin their DevOps journey today.
Full Transcription:
Alright, let's go ahead and get started! Good morning everyone, it's been a while since I've got to do a webinar, and I always like it when I get to come back and do these. My name is Christopher Pepe; those of you that have watched these webinars in the past will know me from back in the day. I used to get to do a lot of them and now I kind of just get to do guest spots every once in a while. It's always fun to come back and tell some of the stories about what we're working on.
Today I'm excited to bring you the topic, really the win that we had, or the story that we had, in part of our DevOps journey. Before we get to the actual presentation itself, we'll run through our standard housekeeping items. Presentations are always a lot more interesting if you have questions along the way. They can be about technical details of what we did, or trouble that you're having with your instances, or anything else; just feel free to ask me anything. To do that you'll use the GoToWebinar questions panel, which you'll find in your GoToWebinar panel down near the bottom. Just go ahead and type your questions in there; once we get towards the end I'll take a look through those and try to answer as many of them as I can. If we get a lot of the same type of question I'll try to compact them all down into a single theme.
Our next webinar is coming up on November 7th; that's gonna be with Amanda Babb. She also presents a lot of these, and she's always got very rich content, especially in the Project Management and Portfolio Management space. This one specifically is going to be Planning with Portfolio, with the Atlassian Portfolio product, so there'll be a lot of really good content in there. Amanda is SAFe certified and she works in a lot of those types of projects, so if that's a topic that you're interested in, you'll certainly walk away from it with a lot of take-home info.
We're Praecipio Consulting; we've been around for well over a decade now, and we're an Atlassian Platinum Solution Partner. There's only a handful of those in the US, and we were amongst the first. The overwhelming majority of our projects are related to Atlassian, and we work a lot in the DevOps, IT Ops, ITSM, SAFe, and Agile spaces; that's where most of our projects really revolve. We also do a lot of the technical backend work as well: the merges, upgrades, migrations, all of that kind of stuff within the Atlassian suite. We also handle custom development and training, all those implementations I mentioned, licenses, and we do a lot of upgrades, merges, migrations, performance assessments, and performance tuning; and we have managed hosting and managed services as well. The story that we're gonna be talking through today is something that we did within our managed hosting space.
The majority of our clients are in the United States. We have hundreds of those ranging from coast to coast, and you can see the little yellow dots showing concentrations where a lot of those are: that's ranging from small and medium businesses up through Fortune 5 companies. Our clients include the world's top retailer, top media company, top automotive retailer, largest healthcare services company, and the largest electronics manufacturer, to name a few, as well as just about everyone else in between. The Atlassian tools really touch every size of company and every industry, so we really get to see a lot of different things.
So with that I wanted to bring you a simple story, and I do have a reason for wanting to bring a simple story rather than a big mind-blowing technical flurry. My real goal, my real intent in this presentation, is just to remind you that, especially when it comes to DevOps but really with most things, many small things are better than one big thing, and this is gonna be the story of many small things. The inspiration behind this is that we're in conference season right now. I just got back from .conf, which is Splunk's conference, I'll be going to re:Invent in a few weeks, and there's a couple more in the middle, so if any of you are going to any of those conferences and you want to meet up, feel free to drop me a line. I would love to get to meet up with you and chat about what you've got going on.
One of the things that always strikes me whenever I go to a DevOps-oriented tech conference in particular is that the keynotes and most of the presentations are presenting these really amazing ideas, really out on the edge of what people are doing. And it gets you really inspired and really fired up, and you want to go back and bring all these cool new things that you learned back into your organization: we're gonna have all this orchestration in place, and we're gonna set up canaries, and we're gonna do all this other stuff, we're gonna make everything awesome. But really what we observe in the majority of the projects that we work in is that most people don't have the mastery that they need to really take advantage of a lot of these very bleeding-edge things. It's a great goal to work towards, but first you really have to master a number of different domains before you're ready to add all that additional complexity.
One of the examples that I'll point to is a tool like Puppet. Say we're gonna use Puppet to manage the configuration of our Apache servers: in order to do that you have to very intimately understand how to configure Apache in the first place, so that you can abstract that out into Puppet code and Puppet can do it for you. Adding all of that automation on top adds another layer of complexity and doesn't make your life any easier: it makes scaling easier, but it makes everything else harder. So you really have to make sure that you have all of your tools and processes and knowledge dialed in before you go and add that additional complexity, because otherwise you're just making your problem even harder. So I don't dislike those talks; I love all of them and I walk away with a lot of ideas, but I often find that those bleeding-edge ideas only start to hit practical application in a lot of organizations sometimes even years later.
This again is just to bring you back down and ground you, and remind you of the things that you can do today and actually get real value from today. What's the next little step that you can take, the next increment that brings something better to your team? To highlight that, I found this quote from QASymphony, one of our other partners in the testing space: "DevOps stresses communication, collaboration and integration between software developers and other stakeholders in the software delivery lifecycle." If you stop and look at that quote, it says that DevOps is people and processes. Now, we're a process company, so obviously that speaks well to us, but I think it also highlights a lot of the pieces of DevOps that often get pushed aside because they're just not as sexy. The orchestration and all of that stuff, the tools, are really cool, but this quote has nothing to do with tools; tools aren't even mentioned in it. Really it's the people and the processes; that's what DevOps is about, getting people to communicate more effectively. I personally think that's probably the single most important part of DevOps, and it's the first thing for a company to master. Everything else is great, but until you can communicate effectively you're not gonna get anywhere. Then comes having good repeatable processes to track the work that you're doing, so that people can repeatedly deliver the same quality. Those are the first things to really get in place, and that's what I think doesn't get focused on enough, or that we should at least focus on more within DevOps.
With that, I want to share a story about a tool, even though I said that we shouldn't focus on the tools that much! But really it's the story of the process involved as well: the story of how we adopted Splunk internally. This is a tool that I knew we wanted to use; we'd been kicking the tires on it for a long time, and we had even installed it! We had it running, we were indexing stuff, we just weren't really looking at it. We were still stuck in our old ways of doing things, and then we had an incident come up, and it turned out that Splunk was a great fit for helping us solve and manage that problem.
It's a good real-life example because we were actually maintaining a production service and we had real-world considerations, which is better than the kind of game-day simulations that we'll often use to adopt a new tool or understand a new technology. It was also a pretty simple use case; we already had a pretty good idea how we would approach it the traditional way, so we knew what we would have done, and now we just had to apply new patterns to it and see if that was better for us. The other big thing was the reason we hadn't really gotten any traction with Splunk: it's actually a pretty intimidating tool to approach. It's incredibly powerful; it's a lot like Jira in that way. It's a very malleable tool, you can solve just about any problem with it, but it's also a lot to bite off, and staring at that big white canvas, knowing where to start is always the hardest part. Then once you actually start, you realize that all the instructions are in Russian, and I don't speak Russian! So that makes it all the harder. This was a really simple thing that we could stumble our way through to gain a little bit of confidence.
So here's the problem: we have a managed hosting service that we provide. It's a niche hosting service specifically for Atlassian applications: a purpose-built hosting environment where all of the tooling and the team are centered around managing Atlassian applications. It's a little bit different from, actually it's very different from, a SaaS model, but it's also different from other organizations that have a SaaS model, even Atlassian themselves, because we don't write the products that we're managing. We're managing, at a very high level, applications which we don't really have the ability to impact, so if there are bugs, or if there's additional tooling that we need, we have to figure out a way to get that in there without actually controlling the codebase itself.
In this particular case we had brought on a new customer: several Jira instances, a couple of Confluence instances, and a couple of other tools, which is what often happens when we take over an organization's hosting. We sit in between hosting it yourself and running in the cloud, and we're a good place for people to end up because we kind of let you do anything; even if you really, really insist on it, you can do things that are going to impact your performance, although we try to steer you away from that and give you the best advice possible. But oftentimes we're taking over instances of Jira and other tools that have grown wild in an organization for a long time and are very unstable and have a lot of problems. Because we're just forklifting over all of their data, we're also forklifting over all of their problems, and it's a pretty stressful time for our operations team. It's really hard to take on a new unstable instance and try to start to untangle that ball of spaghetti, make sense out of it, and make it stable again.
So as I mentioned, in this case we have at least five instances of Jira, plus other tools: all of them are unstable. Some of them are massive, and there's one particular instance which, at least for this presentation, I'm gonna pretend was a Data Center instance. I don't recall what it was back at that time, but it's a better story if it's Data Center, so that's what we'll tell today!
So we've got a Data Center instance of Jira which is, by all accounts, fine: APM, Application Performance Monitoring, is showing everything looks fine. The application is actually running fine, it's responding fine, users are having a great time with it; however, every day or so it just falls over with no real obvious cause. As I'm sure any of you that are Atlassian admins, or have had to deal with Atlassian admin work, are aware, in the Atlassian logs it is either instantly, immediately obvious exactly what the problem is, or you have no clue at all what the actual problem is and you have to use other techniques to figure it out and remediate it.
So in this case we had a good symptom in the log, but we didn't actually have any real understanding of what was going on, and again, it wasn't performance related. Most of the time what we'll see is a performance problem, and those are usually pretty easy to hunt down even if there's nothing in the logs, but in this case we just had a ghost, and every once in a while one of the nodes would fall over. This is a big problem for our user base, because whenever Jira falls over it tends to just stop working for whichever segment of the users is attached to that node. It doesn't actually shut down or do anything else that would cause it to come out of load balancer rotation or otherwise drop out of the cluster; it just sits there behaving poorly. So everyone that's attached to that node just has this terrible experience, so users are upset.
The other thing is this is happening at two, three, four, five in the morning. The team is already running ragged because all of these instances are falling over all the time and we have all these problems that we're trying to untangle. Getting woken up in the middle of the night to go restart a node is just one more thing to have to deal with. This is an ugly state to be in! This is where we decided that, rather than having to VPN in, SSH to the boxes, all five of them, grep through the Atlassian logs and the Apache logs and try to correlate all that stuff across them, let's try to use Splunk to do it, because Splunk is supposed to be good at doing that kind of stuff, we've heard.
So let's try to figure it out! The very first thing that we did was we found a thing called Search in Splunk, so we clicked on it and it presented us with a search bar. We've all used the Internet long enough to know that we can just start typing terms in there and see what comes back. Pretty quickly, through little hints in the user interface, we figured out that we could isolate our searches to a particular host or set of hosts, and then we could just throw terms in there, and that worked well enough. It's not the best way to search, but it worked well enough. So here we have a single node, and we're looking for the word "error" in it, and with that we get out of the logs that there's a "too many open files" error. This is something that some of you may even have seen before; it's not a very common error, but it's not all that uncommon of an error either. And it's not unreasonable: sometimes an application needs to be able to open more files at a given moment than the operating system is set by default to allow. So we always just try to increase that and see what happens; a lot of times that will fix whatever the problem is. That's certainly the old-school way, and typically still the first approach to anything: whenever there's a problem and you don't know what to do, just guess and see what happens.
In the early days, this is actually how we did all of our performance tuning. Jira is not running well? Then make the heap bigger, see what happens! Put it on a bigger box! Get more cores! Just guess, throw stuff at it, and see if the problem goes away. Now we have much better approaches and much better tooling, and we understand that that's actually not the correct approach, it's not the best approach, but in other areas that's still the first thing, the quick knee-jerk reaction. So we tried to guess. We upped the limit, and Jira stayed up a little bit longer, but it still fell over. Then we upped the limit again, and Jira stayed up for a little bit longer, but it fell over again. Then we upped it again, and it fell over again. At the point that we hit about 12,000 file handles allowed to be open, so Jira has up to 12,000 files open at any given moment, we stopped and said: you know what, we're not gonna outrun this problem no matter how big we make it. All we're doing is buying ourselves time, and probably injecting some other problem if we keep making this file handle limit bigger; that's going to start to impact other parts of the system.
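For those who haven't turned this knob before: the limit we kept raising is the operating system's per-process open-file limit, what `ulimit -n` reports in a shell. We adjusted it at the OS level, but as a minimal sketch of the same idea, here's how you can inspect and raise it from Python's standard library on Linux or another Unix:

```python
import resource

# The knob we kept turning: the per-process limit on open file
# descriptors (what "ulimit -n" reports in a shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# An unprivileged process may raise its soft limit up to the hard
# limit; raising the hard limit itself requires root. This is the
# "just make it bigger and see what happens" move from the story;
# it only buys time if something is leaking file handles.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

As the story shows, this only postpones the crash when handles are leaking; the limit is a symptom threshold, not the root cause.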
So we're not gonna be able to outrun this. This isn't just a case where some app likes to open a bunch of files and will close them all later when it gets around to it. We're gonna actually have to deal with this one! So that's where we say: if you can't outrun your problem, it's time to turn around and face it! How are we gonna actually approach this? We have a really good piece of information; we've got a good symptom. We actually have no idea of the root cause, but we have a great symptom. We know that something is opening files and not closing them, and we also know that we have this near-linear relationship between the number of file handles that are allowed to be open and how long Jira stays up.
So first, what we want to do is improve things for the users, and second, improve things for our admin team. This is the approach that we took: because we know what our file limit is, and because we can measure the number of files that are open at any given moment, we now know approximately how long we have until we fall over. We have this tool that we built called Moneo, and Moneo was our version of a ShipIt project: one of our engineers saw a problem, figured out a way to fix it, wrote this tool, and then presented it at our annual gathering. Since then we've used it as a point-in-time monitoring tool. So if any of you have done a performance assessment with us, we'll come out to your site and run this tool against your systems.
Moneo is basically a data aggregator. It pulls together a whole bunch of information: some things that your standard monitoring and APM tools would gather, some things that you can get from Atlassian like the disk speed tests, and then some other metrics that we've just noticed seem to strongly correlate with performance. The aggregator goes out, gathers all of that data for us, and dumps it into a report. Now, we wanted to find an easy way to build a measure of our open files, and it turns out that there's a Python module that does that, psutil. It's pretty easy to go and get the number of open files that a running process has open. So I could gather that data, and then I took advantage of the only thing that I knew about Splunk at the time: if you write your logs in key-value pairs, then Splunk will automatically take that key and expose it in the user interface as a field, so it becomes something that you can operate on.
Otherwise, a lot of times you have to tell Splunk how to interpret your log files, what the structure of them is, so it knows which pieces of data go with what thing and how to pull that data in. So here I've got "at jira open files" equals some number, and now I can feed that into Splunk, it should just pop up in there, and we can operate on it; now we have a way of visualizing how many open files there are at any given moment. I started off by saying that Moneo is an assessment tool that we wrote; we had adapted it for monitoring internally. It's something that we're still kicking around the idea of making into a product so that we can put it out into other people's environments as well. If you're interested in that, either ask questions at the end or feel free to follow up with me directly after the webinar.
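To make the pattern concrete, the whole trick is just a periodic log line whose payload is key=value. Here's a hedged sketch; the field name `at_jira_open_files` is a stand-in (Moneo's real field names may differ), and where our collector used psutil, this stdlib version reads `/proc` directly, so it's Linux-only:

```python
import os
import time

def count_open_files(pid: int) -> int:
    """Count the file descriptors a process currently has open.

    Our real collector used the psutil module; this stdlib stand-in
    reads /proc directly, so it only works on Linux.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))

def metric_line(count: int) -> str:
    """Format the measurement as a key=value pair. Splunk extracts
    the key as a searchable field automatically, with no custom
    parsing configuration needed. The field name is illustrative."""
    return f"{int(time.time())} at_jira_open_files={count}"

# Measure this process itself as a demo; Moneo would target the
# Jira JVM's pid instead.
print(metric_line(count_open_files(os.getpid())))
```

Emit a line like that on a schedule, point a Splunk forwarder at the log, and the field shows up in the UI ready to chart.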
That little trick in Splunk made it easy for us to take all of the metrics that we were gathering and pipe them directly into Splunk. And we piped them to other tools as well. So after waiting for a little while, I was able to just jump into Splunk and go look at a specific host. If you look in the bottom left corner of the image that's on your screen, you'll see "interesting fields". This is something that Splunk picks up out of the data that's flowing into it and suggests to you: you may also be interested in these fields. And one of them is "at jira open files": clicking on that brings up a pop-up window, a dialog, that has a bunch of information about that particular field, and since it's a numerical field there's a bunch of statistical information about it and other things like that. There's a bunch of canned reports that come with it: maximum value over time, minimum value over time, rare values, things like that.
I started clicking around into each of those, and none of the charts really looked like what I expected; it didn't look like what I was expecting to see. And this is a moment where I think, especially when you're adopting new tools or new methodologies, new approaches to solving problems, it's really important to start with a data set and a problem that you otherwise intuitively understand. I was classically trained as an engineer, and I spent a lot of my early career as an engineer, and one thing that old engineers are very fond of saying of young engineers is how they will just blindly believe anything that comes out of a computer, and how back in their day they had to do it a much harder way. But the real kernel of truth in that curmudgeonly statement is that if you don't really intuitively understand your data set, you will just blindly believe whatever comes out of it.
If you get some amazing-seeming result that seems too good to be true, you may just believe it anyway. But if you have a pretty good idea, if you have done this the hard way, or you know what you're expecting to get and you get something that's wildly different from what you expect, then you know to inspect a little bit closer. Is this really what my data is saying? Am I processing my data the wrong way? Is there something else that's going on? Even when you get the results that you expect, it's always good to question them, because they could still be wrong, even though they're what you expected. So I think it's really important to make sure that you know the results that you're getting are correct and accurate. And what I was seeing in all of these canned reports wasn't what I wanted.
But what it was doing was populating the search bar for me, and this is a trick that I used early on in Jira when JQL first got introduced. If you would run the standard canned searches and then switch to Advanced Mode, it would pre-populate the search bar with JQL for you. And then you could start to see what the syntax looks like, how you should be structuring it, how you should think about, in this case, SPL. At the time that we were doing this, I had never even heard the term SPL, which I think stands for Search Processing Language; it's the query language that you put into their search bar to do stuff like get all of your data, pre-process it however you want, and then present it in a visualization layer however you want. It can be really, really complicated. Getting it pre-populated and seeing generally how I should be structuring it, plus a little bit of searching around online, I was able to get just a time-series chart of our open files, which was really what I ultimately wanted: to just see how many files are open right now and what our trend looks like, so I could kind of predict in my brain how much longer we have.
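For reference, the kind of SPL that produces a time-series chart like that is a single timechart pipeline. This is a sketch from memory, not the actual search we ran; the host, source, and field names here are all stand-ins:

```
host="jira-node-1" source="moneo"
| timechart span=5m max(at_jira_open_files) AS open_files
```

Run from the Search view, something along those lines renders one point per five-minute span, showing the open-file count climbing toward the limit.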
And this is pretty much what we would see: the open file handles were growing largely linearly over time, then we would reach some threshold where we would go in and restart the JVM, that would reset the open file count to zero, and it would start growing again. Once we got to the point where we knew how many files were open at any given moment, we were able to do two things that were pretty impactful. First, because we could go and restart the node whenever we wanted, our admins were much happier. Rather than getting called in the middle of the night, or at the very least just having that fear of being called in the middle of the night and all the anxiety that produced, that went away, because now we know exactly the state of each of our nodes. We go and run this search, we can see how each of the nodes looks, and if one of them is getting kind of high, like this one at about 11,000 or so, we go and restart it right now. We can do it in the middle of the day and we can plan it out, so now it becomes a planned activity rather than a response to an incident. So our admins are a lot happier! The other thing is that when you shut down a Jira node and restart it, that actually causes it to drop out of load balancer rotation automatically, and once it's back up it rejoins the cluster, so any user that was attached to that node when we restarted it just gets redirected to a different node.
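Because the growth was roughly linear, the "how much longer do we have" estimate behind those planned restarts is just remaining headroom divided by growth rate. Here's a sketch with made-up numbers; the real instance's leak rate wasn't this tidy:

```python
def hours_until_limit(open_now: int, growth_per_hour: float, limit: int) -> float:
    """Project time until the open-file limit is hit, assuming the
    handle count keeps growing at a roughly constant rate."""
    headroom = limit - open_now
    if headroom <= 0:
        return 0.0  # already at or past the limit; restart now
    return headroom / growth_per_hour

# e.g. 7,000 handles open, leaking ~400/hour, 12,000-handle limit:
print(hours_until_limit(7_000, 400.0, 12_000))  # 12.5
```

That's all the math you need to turn a middle-of-the-night incident into a restart you schedule for lunchtime.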
The overwhelming majority of our customers that are running Data Center are also using some sort of single sign-on integration, so when they get redirected to a different node they don't get challenged for authentication; they don't really otherwise know in any way that they got redirected to another node. The only thing that we're suffering is that performance is potentially a little bit degraded; at the very least we have a little bit less capacity while that one node is restarting. But again, as long as we time that well, that usually isn't a problem either.
So our users are back to not knowing that anything is wrong, and the system is up and responsive. We still don't know what the problem is, but our admins are happier as well. The only downside is they have to come and run this particular query against each of the nodes. So the next thing that we figured out was how to build dashboards, and then we simply built a dashboard that contained the five charts of the five systems so we could watch those. We then took the most junior person on the ops team and gave them the job of monitoring the dashboards and going and restarting the nodes when they needed to. So now, rather than the whole team, including the most senior engineers, being in this highly reactive mode of just trying to keep things up so we could keep our customer as happy as we could, we have one person that's responsible for just making sure the cluster stays up, and everybody else can go off and either focus on whatever they were working on before they got pulled into this fire, or go into the actual root cause analysis, which is what we did.
At this point, now that the fire is under control, we can go and try to figure out what the problem is. Because of what this presentation was, I didn't put in any additional slides for any of this, but just quickly so that you know, the next steps we took were taking thread dumps and taking heap dumps. In this case the customer had Premier Support, so we also engaged Atlassian and worked with Premier Support. If you have Premier Support, that's a really powerful way to be able to work with a partner; so if you're engaged with a partner, if you're engaged with us, and you have something like Premier Support, that's a really powerful way to be able to solve your problems very quickly.
We've actually taken some incredibly technically difficult problems, teamed up with Atlassian, in some cases even teamed up with other app vendors, and collectively tackled some incident or problem that was occurring for one of our customers, and we're usually pretty quickly and easily able to get to a root cause and remediate it. And because we have Atlassian involved, or because we have an app vendor involved when it's their problem, we can also get a long-term solution put into place. So not just figure out what the problem is and restore the application, but also get that problem fixed. That's what we did in this case!
We were able to single out one app that was, I forget if it was actually even being actively used, I think it was just something that happened when pages were loading, some kind of wonky thing that you wouldn't expect, but there was an app that was opening temp files and never closing them, and it was just a bug in the app. Once we were able to isolate things to that particular app, we just went and disabled it. At that point we saw the number of open files level off and stabilize, and in turn, whenever those nodes got restarted, the levels dropped back down to zero and grew to the normal, more manageable levels with that app disabled.
We were able to talk to that app vendor and describe the bug to them, which they then fixed. We ran the new version through testing, and in the test environment everything looked good, so we installed it into the production environment, re-enabled it, and all of that functionality was restored as well. Now the entire system was back to its old capabilities! But it's a lot more stable now, and again, the users are happier and our admins are happier.
There are a couple of other things that really came out of this, beyond just how we fixed that one particular instance. For us, it was a good win; it was a good opportunity to see that changing your processes and trying different ways of doing things really can bring about positive change. In this particular instance, we were able to take this very reactive stance, where we had to constantly be reacting to problems in that instance, and move to a much more proactive one. Similarly, piece by piece, in all the other Jira instances that we're running, we were able to take each of the problems that were occurring and pivot them from reactive to proactive, so we got to a much more stable place. I think this was maybe even almost two years ago now that we did this, and all of those instances went from falling over every day, having lots of performance problems, and being very slow, to actually being very stable, even though they've continued to grow both in user base and in the amount of data that they're handling.
Tackling each of these problems one at a time and solving them to completion has given us a much better result. Moving into that proactive stance has allowed us to keep the applications a lot more performant. Obviously they still crash, they still have problems, but we have a lot better tooling in place and a lot of times we can detect those problems before they're impacting the entire user base, sometimes even any users.
The other big thing is that our operations team has a new way of working. We solved this particular problem, and we've applied this pattern to pretty much every other problem that we have seen. Whenever there's something that we can describe with SPL, or otherwise identify as the problem that's happening, we'll try to build a dashboard, and we use alerts a lot now as well. We've even started to move into a space where we can take automated action based on things that are occurring, so we can solve a couple of problems without anyone needing to intervene other than coming in afterwards to make sure everything looks OK. We still take a pretty hands-on approach, but as much as possible we try to automate those things away.
And now, rather than people having to VPN in and SSH and grep through logs, they have a single unified interface to go and search in. And because SPL allows us to describe all of our log files and how to operate on them in a similar way, we can look at Apache logs and Atlassian logs and SSH logs and every other kind of log, and every other kind of data that we're ingesting, with the same approach. That's been a really transformative thing for us, and I think a really powerful thing.
I intended this to be a pretty short presentation, so while we still have a lot of time left within the webinar, I don't have any more content. I'm happy to take any questions that you have! Let me go ahead and open that up and see what's there. I'm going to move on to the next slide, and I'll keep looking for questions as they come in, but I just want to remind you that Planning with Portfolio with Amanda Babb will be coming up November 7th. If you have any questions about what the content's gonna be ahead of time, feel free to reach out to us. Again, if you're interested in knowing anything more about any of the tools I talked about, any of the processes, or my thoughts, approaches, and philosophies on DevOps, feel free to reach out to me; I'd love to have a conversation with you.
Thanks for the comment, Brian; I'm glad to hear that you're using similar kinds of thoughts and processes internally. Just to reiterate, the biggest message that I have for folks when they start thinking about DevOps is: don't really worry about containers, orchestration, or tools, or taking your monolithic application and breaking it apart into microservices. All that stuff is really great to do, but if you really focus on communication, on small wins, on taking small incremental steps, on continuous improvement: those are all things that are gonna get you a lot farther a lot faster. Really, waiting for that one big bang is a goal you may never get to. You may, and that's great if you do, but you may never get there. If you take a lot of little steps along the way, then you'll find that you get a lot of things done. That's the "you move a mountain one stone at a time" kind of philosophy.
Well good, I don't see any other questions coming in, so I say we go ahead and wrap up. I thank you for your time and attention and I'll be happy to be back the next time I get to present and until then, please enjoy Amanda's presentation on Portfolio for Jira. Thank you everybody.