DevOps Digest: Episode 1

Welcome to DevOps Digest, brought to you by cloud consultancy, Vivanti. A news and analysis podcast, we cover the trends, thought leadership and announcements happening in today’s DevOps space. Topics span from Continuous Integration and Continuous Delivery (CI/CD), Microservices and Infrastructure as Code; to Monitoring and Logging, App Replatforming and Pipeline Automation, Collaboration and more.

Let’s get stuck into episode one. The recording and modified transcript can be found below:

And, we’re off!

Lachlan James 00:00

Welcome to DevOps Digest – a news and analysis podcast about everything DevOps, brought to you by cloud consultancy Vivanti. I’m your host, Lachlan James. I’m joined by Vivanti Principal Consultant and long-time DevOps aficionado, James Hunt. Thanks for joining us, James. Are you as psyched as I am for episode one?

James Hunt 00:19

As a long-time listener – first time hoster – yes, yes, I am Lachlan.

Broadcom buys VMware in $61B splurge: What does it mean for DevOps?

Lachlan James 00:23

Well that is the spirit we want to hear James. So on that front, let’s dive straight into it. So, the first article for today, comes from the Wall Street Journal, titled Broadcom to Buy VMware in $61 Billion Bet on Unsexy Software. And, the breaking news is that microchip powerhouse, Broadcom, has announced a $61B – that’s billion with a ‘B’ – deal to buy VMware.

Broadcom to Buy VMware in $61 Billion Bet on Unsexy Software (Wall Street Journal)

So, the Journal’s coverage kind of characterized the announcement as part of Broadcom CEO Hock Tan’s ability to build an empire out of acquiring ‘unsexy’ tech solutions and cutting costs basically – which doesn’t sound all that flattering. So, I think there’s four interesting questions coming out of that, which I’d like to kind of go through with you now, James. So what does this mean for the future of VMware’s offerings and its customers? I mean, is it time to jump ship or are there other sort of positive implications at play here?

James Hunt 01:13

I mean, anytime you have a major vendor acquired by another major vendor, most people say: ‘Oh, it must be time to leave, time to go off and finally implement that Open Source Xen Hypervisor’. The reality of it is, VMware has been sold how many times in the last decade? And people still use vSphere ESX. They still use vCenter, they still use vSAN, the tech family – the product family – is not going anywhere. There’s gonna be a whole bunch of coverage on what Wall Street’s doing, and the financial instruments at play, and who owns what and how much money does Michael Dell walk away from this sale stuffing into his pockets. [That’s to be expected], given that he owns something like 50.4%, between him and his investment firm Silverlight, I think it is. Silver something; silver lined pockets. But for most people, it’s not going to do much of anything, right? Your partners are still going to be your partners. You might get a better return as a shareholder. I mean, that’s kind of Hock Tan’s whole deal. It isn’t so much buying unsexy software, but buying software that people have given up on from [a development perspective]. The mindset is: ‘We can’t make this any better’, and [he then] somehow [is] squeezing more and more profitability out of it. He’s actually really good at it.

James Hunt 02:30

The article you’re referring to actually did a really great visualisation of that, where they showed the line graph of what their expenses were, up to the point of purchase, and then steadily decreasing until the next purchase of his debt load. It’s just amazing. The financial aspect is probably way more fascinating than the boots-on-the-ground, day-to-day changes that are going to happen for VMware customers. So I’m positive in [my] outlook on VMware. Nothing’s gonna change.

Lachlan James 03:01

Yeah, I think to be fair, when you’re talking about moving cogs that are that large, that’s probably a pretty fair assessment.

James Hunt 03:07

Yeah, you’re in the billions. Right?

Lachlan James 03:10

Yeah, I mean, nothing’s happening overnight. No, that makes sense. I mean, it’s still obviously a really significant sale, but I mean are things gonna fall apart or change catastrophically in the next, you know, two years?

James Hunt 03:18

I don’t think so. I think it’s beneficial to Broadcom to keep the software business in place, because the article says it will triple the size of their software business. And I mean, that’s a massive opportunity for them and their shareholders. So, I think they’re gonna double-down on some of the more risky projects that VMware has kind of shied away from in the past. Only time will tell, but it’s probably good stuff.

Lachlan James 03:47

Yeah. Well, that’s actually a good point, because the second question coming out of this was actually around what you’re just getting onto there. So devops.com actually did a little write-up about this, pre-empting what was happening two days before the official announcement: Could Buying VMware Bring Broadcom Hybrid Cloud Bona Fides?

Could Buying VMware Bring Broadcom Hybrid Cloud Bona Fides? (devops.com)

[They] are basically saying that this deal allows Broadcom, which is largely a hardware based business, the chance to buy its way into the software game. And, you know, at a time when many tech stocks are really under a bit of pressure, talk to me about why that’s important, James.

James Hunt 04:13

Yeah, I mean, tech stocks are under pressure, sure, but it’s really just the pendulum swinging from the bundling to the unbundling and back, right? We do this in the tech industry. And if you’ve been here long enough, as you and I both have, you’ve seen this before. We saw the massive ‘let’s pick-up all the startups and smash them into larger and larger companies’. And then the invariable swing back reaction is, well, now we need innovation labs, and we need startup incubators. And all of the people who got bought, you know, the people who started the companies that got bought by larger firms, they’ll go spin out new startups and the cycle repeats. I think we’re just in a phase now where, with the cooling economy, the continuation of pandemic times, you’re seeing a lot of Wall Street types go, ‘I would much rather have eggs in fewer baskets, that I’m much more confident about the stability [of]’. So you’re gonna see a lot more of this bundling of tech companies into larger conglomerates. And then in like two or three years, you’re gonna see another round of startups, all solving the same problems in different ways.

Lachlan James 05:21

Yeah, that’s probably true. And sort of, I think that leads us on to the next part, you’ve probably gone some way to answering it. But I mean, what do you think this means for the DevOps tools market generally? Does this play into a broader narrative about consolidation? Or maybe an abandonment of innovation? Or do you think it’s just one of those cyclical things? Is there a trend here?

James Hunt 05:41

I think it’s cyclical. As far as DevOps software, I’ve only ever seen one company actually build DevOps tooling software and make money at it. And that was HashiCorp. Right? And what they make their money on isn’t technically DevOps, right? It’s security, it’s Vault. They catapulted into the spotlight off of Terraform and Vagrant and the like, but really, I don’t think there’s much of a market in the business sense for DevOps tools. Because DevOps tools are so integrative, it’s difficult to build a generic solution that works for everybody that isn’t so insurmountably expensive that nobody can afford it. I think what you’re seeing, with stuff like the Broadcom purchase of VMware, is larger conglomerates trying to pan out their portfolio of offerings. And I think if we look back over the last 30 or 40 years, these things happen when we’re undergoing a commoditization of the lower levels of our stack. Right? VMware did this. They bought tonnes and tonnes of small companies, around late aughts into the early teens, in an attempt to build out their cloud platform. And now you’re just seeing Broadcom marry the software side of the virtualization, with the hardware side of delivery, to try and be the only one providing you compute – because everybody else has moved on. I mean, when’s the last time you spec’d a server requisition to rack it in a datacenter? Years, if not a decade or more, right? I mean, the only people doing that are the Amazons and the Googles of the world. So I think what we’re looking at is really a paradigm shift, as containers really start to take off in the far corners of industry of the IoT realm, people are less and less worried about virtualization and less and less worried about hardware.

Lachlan James 07:28

Yeah, I think that’s pretty fair. And just to pick up on one of the points that you were making there. You sort of made the interesting comment that not many people have actually managed to build big companies that turn a profit based off DevOps tools, which is an interesting thing. And so I just wanted to combine that idea together with something that was said about Hock Tan in the Wall Street Journal, so you know, if developing sexy software is a waste of time when it comes to DevOps, do DevOps solutions need to be sexy to be successful? Where’s the user experience come into process oriented tools, like those in the DevOps space? Because you’re right. That’s how he’s gone about building an empire. It is essentially through unsexy tools. So where does user experience come into this?

James Hunt 08:14

I mean, you’re awfully thirsty over there, Lachlan [laughing]. I think the sexiness of DevOps tools is like saying, the sexiness of information processing and why is Excel king in the financial world. In that, there’s a baseline of technology, right? There’s a bunch of utilities that people use, but really, the value in something like Excel isn’t in Pivot Tables, or in formulaic cell-based construction, or import / export from CSV. The value of Excel, for most people, is the freedom to build the thing they need with minimal outside involvement. And I feel like we’re now [exploring] what you might class as the DevOps tooling space. So CI/CD, version control… like when’s the last time we had a discussion about what version control system to use? Right? ‘Oh, cvs’, ‘no, svn’, ‘No, Git’. Like, no. It’s probably Git. It won. So I think the value isn’t so much in the commodity, and the pieces and the utility, so much as what people can now do with these [things]. Watching people basically glue together these very high-level components – to do crazy one-off automation stuff, just because it was fun and interesting, and it scratched an itch – is to me the defining moment of DevOps tooling. And for that, you’re not going to be able to build it as a software engineer. You have to take off the engineering hat and put on the end-user hat.

Does the way most DevOps teams approach Secrets Management need to change?

Lachlan James 09:46

Yeah, no, no, I think that’s fair. I think that makes sense. Alright. So, sticking with this kind of theme a little bit. I’m just gonna move on to the second article that caught my eye during the last week. It kind of begged the question: Does the way most DevOps teams approach Secrets Management need to change? So Don Macvittie, on devops.com, argues that while we have solutions to scan most aspects of software – from scanning apps for vulnerabilities, to source code for data leaks, to networks for holes – we need something better when it comes to Secrets Management. At least, that’s his argument. So he suggested that, while there’s apps that find secrets as part of a feature set, there’s nothing all-encompassing, which scans for secrets-based vulnerabilities outside a single domain.

The Scanner We Really Need

Lachlan James 10:37

This piece coincides with a study published by Rackspace Technology actually, which found that in 2022, only 42% of IT leaders apparently are willing to take a risk and implement cloud technology, which I kind of find nuts in the age that we’re in. And I’m gonna treat that with a little bit of caution. These studies that get done by vendors obviously have an agenda to them. But still, you can’t fudge numbers that high, right? It might be a little overblown, but still. As reported on devopsonline.co.uk, it then went on to say that 40% of IT leaders who were surveyed believe that security and data privacy concerns are preventing the business from benefiting from cloud strategy and innovation. So, again, in 2022, that kind of sounds pretty amazing to me.

UK companies to face challenges when adopting cloud tech

So James, I mean, in light of stuff like this, do we need a new tool, which helps to uncover you know, all the places that secrets might be hiding? From flat files, to databases, source code, emails? Or, is this an area that’s already sufficiently addressed by the market today, and DevOps teams just need to be more aware of how to customize existing Secret Management options they’ve got out there and scanning tech to better meet their needs? What’s going on here?

James Hunt 11:50

Yes and no. I would say we need better tools. But I would also say we have most of those tools. Specifically, one of the things that struck me about Macvittie’s article is the premise that we already have tools for scanning for security vulnerabilities, we have tools for scanning for all these other things, we just don’t have tools to check to make sure the developers are safely handling secrets in code. And I disagree with that premise. The security scanning we have now, that most teams are doing, is stuff like Snyk, right? It’s a bit like saying, we can detect viruses in the wild because we’re running McAfee or ClamAV. You’re looking in those cases with a tool that has seen everything that it’s been told about and can detect those [things]. It’s basically saying: ‘Oh, yes, I recognise this thing that has already happened’. Container image scanning is primarily around known package vulnerabilities: You’ve got this version of this npm package, you’ve got this crate in your Rust manifest, you’re using these system libraries in your OS image. Therefore, because we can cross reference those version numbers and names against a known disclosure of vulnerability lists, like a CVE or MITRE or something like that, we can say ‘you are probably vulnerable to this class of attack, this particular type of exploit’. What we don’t have, is things as simple as Buffer Overflow detection. I cannot take an image, build something, and then say: ‘Does this thing have any Buffer Overflows?’ and have any tool that I trust and believe say ‘yes’ or ‘no’, because static analysis of code is nowhere near where it needs to be. And dynamic analysis is essentially the halting problem. So we’ll never probably ever solve that one.
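
[Editor’s sketch] The cross-referencing James describes can be illustrated in a few lines of Python. This is only a minimal sketch, not how Snyk or any real scanner works internally: the package name, advisory ID and version data are made up, and the simple dotted-number version parsing is an assumption that real version schemes often break.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """One known-vulnerability record (hypothetical format)."""
    package: str
    fixed_in: tuple[int, ...]  # first version that contains the fix
    advisory_id: str


def parse_version(text: str) -> tuple[int, ...]:
    # Assumes plain dotted integers such as "2.3.9"; real schemes are messier.
    return tuple(int(part) for part in text.split("."))


def scan(installed: dict[str, str], advisories: list[Advisory]) -> list[str]:
    """Report installed packages older than the fixed version of an advisory."""
    findings = []
    for adv in advisories:
        version = installed.get(adv.package)
        if version is not None and parse_version(version) < adv.fixed_in:
            findings.append(f"{adv.package} {version}: {adv.advisory_id}")
    return findings


# Made-up data for illustration only.
ADVISORIES = [Advisory("examplelib", (2, 4, 1), "EXAMPLE-0001")]

if __name__ == "__main__":
    print(scan({"examplelib": "2.3.9", "otherlib": "1.0.0"}, ADVISORIES))
```

The scan can only flag what is already on the advisory list, which is the limitation James is pointing at: nothing here inspects the code for a new buffer overflow.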

James Hunt 13:45

So I would say we don’t actually have the level of security scanning, which Macvittie’s bringing in the image container sets, that he wants to have for Secrets Management. We do have things like detecting on a Git push, [to determine] if someone has dumped an AWS secret into the repo. We have things that, not only will detect that but, will also reach out to the systems involved and revoke that credential to try and stem the bleeding and contain the blast radius. The jump that he wants us to make in the article is that we really need to move to a point where we can point at a piece of software, or the source code for that software, and say: ‘You should do this instead of this because that will be more secure’. And I don’t know that we’ll ever get there because that’s a very, very difficult problem to solve. Now, that’s not to say we don’t need to improve the state of Secrets Management. We very much do.
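
[Editor’s sketch] The ‘detect on a Git push’ check mentioned above might look roughly like this in Python. It is a sketch under assumptions: it only looks for the widely documented `AKIA` plus 16 characters shape of long-term AWS access key IDs, it expects unified diff text as input, and the function name is hypothetical. The revocation step James mentions is not shown.

```python
from __future__ import annotations

import re

# Long-term AWS access key IDs are commonly matched as "AKIA" + 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_leaked_keys(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, key ID) for keys on added lines of a diff."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines being added matter; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((number, match.group()))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = (
        "+++ b/config.py\n"
        '+REGION = "us-east-1"\n'
        '+AWS_KEY = "AKIAIOSFODNN7EXAMPLE"\n'
        '-OLD_KEY = "AKIAIOSFODNN7EXAMPLE"\n'
    )
    for line_number, key in find_leaked_keys(sample):
        print(f"possible AWS access key ID on diff line {line_number}: {key}")
```

A pattern match like this catches a pasted credential, but it cannot say whether the surrounding code handles secrets well, which is the harder jump discussed above.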

Lachlan James 14:47

Right and this I think, for me, takes us on to like what I think was the second question coming out of this and it links to what you’re saying there about: ‘Will we ever get there?’ You know, what’s that look like? And so, if technology isn’t the answer, then how do you better address Secrets Management and the associated vulnerabilities? Is it more about instilling better cultural practices and expectations, rather than the need for more tools? Is that simply what it comes down to?

James Hunt 15:20

I think we have the tools that we need today, right? Every single cloud provider has some sort of Secrets Management, right? Whether it’s Google Secret Manager, Amazon’s KMS… They all have the ability to take some sort of encryption material or encrypting key, which they attest they will handle with the utmost caution and security. And they kind of have to, because they’ve got hundreds of thousands, if not millions, of customers depending on them. And your app team has your customers. So, I trust Google to take care of creds in KMS, I trust Amazon to take care of that, because their business is on the line. So we have the tools for that. I think what we lack is.. you said a culture. And I think ‘culture of practice’, or ‘culture of use’, is a really appropriate term there. Because it’s not so much that people don’t know they shouldn’t do this. It’s that they do it and nothing bad happens. Right? And that’s unfortunately the nature of security. It’s a bit like running a red light, right? You can run a red light hundreds of times before you hit somebody and either you die or they die or somebody is life alteringly injured. And up until that point, it seems like it’s okay. [You start thinking] it’s probably a bit overblown. Maybe people are making too big of a deal out of this. And then the bad thing happens. And we as human beings are really, really bad at those statistical eventualities that cause really bad things to happen. So I think it’s a culture of practice, and more… understanding of how simple it is to prevent a whole class of attacks, just by using Secrets Management in the worst possible way, which is to do it at boot, right? And I’m sure this isn’t the last time we’ll talk about secrets. I’m sure we’ll get to the Tower of Secrets eventually.

2022 ‘State of Developer-Driven Security’ survey

Lachlan James 17:16

Absolutely. And this is actually a super nice lead in to the next write-up I was having a look at – that whole idea about humans being really good at justifying risk, if something doesn’t immediately happen, without actually thinking about the consequences until it happens.

Lachlan James 17:42

So, sticking with this theme for the next story: Secure Code Warrior recently released its 2022 State of Developer-driven Security Survey. And this report basically found that developers’ actions and attitudes towards software security are really in a fair bit of conflict. So what I mean by this is, whilst most developers – according to this report – acknowledge the importance of a security-led approach to software development, only 29% of developers believe that the active practice of writing code free from vulnerabilities should be prioritised. That sounds pretty low. And 86%, apparently, do not view application security as a top priority when writing code.

Secure Code Warrior Survey Finds 86% of Developers Do Not View Application Security As a Top Priority

Lachlan James 18:30

So James, what’s the answer to this challenge? The survey most commonly cites management related barriers. Developers say that meeting unreasonable corporate deadlines, and a lack of in-house security training, are leading factors that saw 67% of devs knowingly shipping vulnerabilities in their code. So do you think that’s a fair assessment of many management cultures at software companies? Or do dev teams need to just go and take more responsibility for instilling a security-first mindset?

James Hunt 19:06

Yeah, it’s a very chilling statistic, and it’s one that doesn’t make me happy. But there’s a couple of things I think I’d like to unpack there, not so much with the management culture, but as an industry of software engineering. We’ve had this panacea, silver bullet approach to things like security for a very, very long time. One of the things I see – and I follow a lot of InfoSec people on Twitter because they’re absolutely wonderful people and they have fascinating side projects – is that things like languages being memory safe are simultaneously valuable and detrimental. For example, Rust makes it difficult to write memory leaks, whereas C, it’s kind of just a matter of course that you’re going to leak memory and that that’s going to eventually lead to something like Heartbleed. Or, a remote code execution or something that is so far outside of what you’re trying to build, you don’t notice it when it happens. And when it does, it really, really breaks stuff. But the downside of things like building memory safe languages… the downside is it has caused a culture, or an understanding among most practising software engineers, that security problems ought to be solved at lower levels. If we just had more secure libraries, the thinking goes, or if we just had operating systems that would enforce inter-process access control. Or if we had languages that prevent a whole class of bad security programming practices, then I, as the working software engineer, would not have to worry about it. And unfortunately, a lot of them see some of this going on, and they think: ‘I can do that now’. Because next week, next month, next year, things will get better under the hood. But that’s just not reality, unfortunately. And I think there’s something to be said for understanding the limits of what the tools can and cannot do from a security perspective. And then, taking a step back, and realizing that – as a company – nobody wants to say ‘no’ to a customer. No one wants to push-off a customer and say we need three more weeks because we’ve got to finish this thing through a security review. The customer is going to say: ‘Well, does it work? Well, can I have it now?’ And nobody wants to say ‘no’ to that.

James Hunt 21:28

So I do think it’s a failing of the management culture. But I think it’s a failing all around. Companies that value security; it’s not a deadline-impacting thing. It gets woven into the actual deliverable schedule. If you know you’re going to do security, you do security all throughout, right? And a lot of people, a lot of engineers, will say: ‘Oh, well, we don’t have time to come up-to-speed on what’s secure and what’s not secure’. Or, ‘We don’t have time to understand things like how does fuzzing work? Are we doing this correctly? Is this effective? Are we catching stuff? And, if we’re catching stuff, are we catching everything? And, if we stop catching stuff, does that mean there’s nothing left to catch?’.

James Hunt 22:41

I’ve written millions of lines of code in my career. And I have run fuzzing against a lot of it. And sometimes you get to a point where you’re not getting any additional crashes and bugs. And then you tweak one small thing in the input corpus, and a whole host of things gets flushed out. And knowing what to change, where, [and] how to tweak it, is an acquired skill that a lot of companies either can’t hire for or don’t have a compelling reason to hire for over the long-term. So I think there is a management thing, but it’s also an industry thing overall.
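
[Editor’s sketch] To make the point about the input corpus concrete, here is a toy mutation fuzzer in Python. Everything in it is an assumption for illustration: `parse_record` is a made-up target with a deliberate bug, and the single-byte mutator is far cruder than real fuzzers such as AFL or libFuzzer.

```python
from __future__ import annotations

import random


def parse_record(data: bytes) -> bytes:
    """Toy target: 'HDR' magic, one length byte, then the payload."""
    if len(data) < 4 or not data.startswith(b"HDR"):
        raise ValueError("not a record")  # clean rejection, not a bug
    length = data[3]
    payload = data[4:]
    if length > len(payload):
        # Deliberate bug stand-in: the declared length is trusted too far.
        raise IndexError("declared length exceeds payload")
    return payload[:length]


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte of a (non-empty) seed input."""
    data = bytearray(seed)
    data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def fuzz(corpus: list[bytes], iterations: int = 500, seed: int = 0) -> list[bytes]:
    """Return the mutated inputs that triggered the bug."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    crashes = []
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        try:
            parse_record(candidate)
        except ValueError:
            pass  # rejected early: the interesting code was never reached
        except IndexError:
            crashes.append(candidate)
    return crashes


if __name__ == "__main__":
    print("seed b'hello':", len(fuzz([b"hello"])), "crashes")
    print("seed with valid header:", len(fuzz([b"HDR\x02ab"])), "crashes")
```

With the `b"hello"` seed, a single-byte mutation can never produce the `HDR` header, so the fuzzer goes quiet without the bug being absent. Adding one well-formed record to the corpus lets the same mutator reach the length check, which mirrors the ‘tweak one small thing’ experience described above.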

Lachlan James 23:19

Yeah, I think that makes sense. And there’s one other thing that sort of sprung out in this study. It was around competing priorities; particularly pressure to create more features – we all know about that one, and having more devs routinely work across multiple projects, multiple apps. So lots of context switching going on there. These are cited as contributing factors in regards to shipping vulnerable code. So in your experience, though, what should DevOps teams be doing to address those specific factors, those competing priorities? Because they’re always going to be there, I think, and they’re probably going to escalate, if anything. So what should DevOps teams be looking to do to better handle those competing priorities?

James Hunt 23:58

I mean, if you’re in an adversarial situation, where you have a very hard time as a team pushing back on the structures that make it difficult for you to do secure coding, my advice is always to… build the security work into the work that you’re already doing. Your schedules will slip a little bit. Most people doing Agile, as long as they’re delivering features… if you slow down a little bit, you’ll be fine. Very few customers will get mad about a little bit of stuff here and there, especially if you can find and document them. And I like to run the ‘what if’ scenarios on those cases. [For example], I used to work in monitoring – what we now call observability. We didn’t have production impacting things because the monitoring found issues as the code and deployments worked their way through lower environments. We caught things like out of disk, out of memory, out of CPU, before they escalated and impacted whole clusters. And I feel like security practices and software engineering are that way too, you’re doing a thing, right? You’re doing more in depth code reviews, you’re taking a completely different approach to understanding the architecture than just a functional one or a performance one. And the payoff for that is invisible. Right? To use another analogy, it’s insurance. My Life Insurance Company; I never talked to them. And I’m happy for that. Because if I have to talk to my life insurance company, it means I’m dead. And I don’t want to go. So to fix that, as a working software engineer, the best advice I can give you is to build-in a way of doing a ‘what if’ projection as you’re doing those security reviews. We call it a Remote Code Execution, i.e.: Here’s what we think would have happened, had this been out into production. We could have had someone come in, pivot, and then they have access to the production database, they have access into the session storage, they can impersonate administrators. And then we fix that.

James Hunt 26:41

The other thing that comes out of that, going back to an earlier thing we talked about with secrets, is you can game that whole thing out and say: If there was a remote code execution, which eventually there will be – let’s not sugarcoat that, eventually someone will get in. Does the Secrets Management help them, hinder them, or does it have no effect on them? And you kind of get that as you game out the types of things that could have happened that you caught. And that lets the business and customers understand [that] this is part of software engineering. We had the same battle 10 years ago with unit tests, right? Well, why would you build unit tests? Just have customers test it. We have a QA department, they’ll test it. But the number of regressions that got caught, by just low level unit tests, was enough that people stopped worrying about deploying to production every hour of the day.

Happy inaugural DevOps Day!

Lachlan James 27:34

Alright, James, well, you might ask: What is this indeed!? So, next up, let me start by saying a slightly belated happy inaugural DevOps Day, James. Whoooo! Are you as pumped as I am?

James Hunt 27:46

Since I didn’t find out about it until we started to record this, I’m gonna go with moderately! National DevOps Day; it was May 24, they announced it the day before? Way to be agile. So talk to me about DevOps Day…

First Annual National DevOps Day

Lachlan James 28:10

So the date itself, I’m not sure is gonna catch on. But let me unpack this and explain why I bothered. Last week, Transposit, a company delivering connected workflow solutions in the DevOps space, released its 2022 State of DevOps Automation Report. So that’s really what we’re going to talk about here. But they also used the opportunity to invent a new day on our increasingly cluttered collective calendar! So they’ve dubbed it National DevOps Day. The May 24 date is designed to help, in their words, ‘elevate awareness around the topic’, thereby improving the way that DevOps philosophies are practised in the real world. It’s an interesting marketing ploy. You know, when in doubt, create a new national holiday. So, James: will you be kicking-up your feet on May 24 each year from this point onwards?

James Hunt 29:11

No, and I will tell you why. I observe the July 26 holiday, which you may or may not be familiar with. It is National Systems Administrator Day. And the only reason I celebrate it is because there was a Canadian comedy group, called Three Dead Trolls in a Baggie, that did a song to commemorate the first National Systems Administrator Day. Go Google it on YouTube. I believe there’s a clip somewhere out there and it’s a fantastic song about appreciating the people who can make your email go away and come back if you anger them. But I am on the fence on DevOps Day, primarily because I don’t know if I’ll be able to keep DevOps Day, a marketing holiday invented by one company, separate from DevOps Days – a long-running federation of conferences all over the US, Europe, and Asia Pacific. I laud the chutzpah of making your own national holiday.

State of DevOps Automation Report: Pandemic silver linings?

Lachlan James 30:38

I agree. I don’t think this is gonna catch on. But, the findings from their State of DevOps Automation Report are interesting, though. So if you read between the lines, because there’s a whole lot of marketing spin in the report, the survey found that 62.5% of the 1000-plus IT ops and DevOps professionals who responded experienced an increase in the frequency of service incidents that affected their customers over the last year. That’s the scary headline, right? Which they want to push. But, if you actually dig a little bit deeper, critically, this is actually a 27.5% decrease from the previous year’s report. So that’s probably the takeaway, right? So the report sort of links adjusting to the massive digital transformation, brought on by the pandemic, as an explainer [for that decrease].

2022 State of DevOps Automation Report

Lachlan James 32:05

So James, I think this gives rise to a really interesting question: The pandemic obviously has been horrible in a lot of ways. But, is it actually possible that it’s also had a positive influence on how companies are practising DevOps today?

James Hunt 32:20

That’s a heck of a take, Lachlan [laughing].

Lachlan James 32:25

I’m reading between those lines, right, like the takeaway from that is that there’s actually been a massive decrease during the pandemic.

James Hunt 32:31

Right. And I think what has happened there is that, because of work from home or remote work, people have realised that systems play a much larger role in their day-to-day operations than they had previously thought. So I think it’s two-pronged: One, there’s a greater realization that understanding the systems, and the interplay of those systems, is very important to how businesses function because the systems continued to be there when the pandemic started and throughout this whole hellish period of history. The other prong, I think, is that, when people got to working from home, they now had the time to do small things. To make their lives at work easier. Less time spent going to meetings, less time spent in hallways, less time spent driving to the office, coming home from the office, driving out to lunch, arguing about what to order, etc. [This] gives you probably two to two and-a-half extra hours a day. And if your existing output levels are kept the same, we’ve seen higher productivity. I think that’s what’s happened. People have done small things here and there. Things that have been piling up for years. Let’s automate this, let’s finally put a monitor on that. And given the necessity of empowerment, that working from home kind of foisted on an entire workforce, now all of a sudden you don’t have to go get approval. You just did it.

James Hunt 34:52

I’m a huge believer in Kaizen, in that small incremental improvements to the day-to-day make a huge difference. And we’re now two years out from the beginning of the pandemic, and those things are starting to pay off in the productivity space. I think that’s what’s happening. I would hesitate to lay the praise at the feet of the Coronavirus.

Lachlan James 35:05

Yeah, I don’t think you’d go as far as saying, ‘Good on you, pandemic, well done!’. But I think that’s probably right, largely those two things: Those big changes in the way in which we work. So one, forcing system related issues to be top-of-mind because of the complete reliance on them. Coupled with, as you said, people actually being able to reclaim a heap of time during the day. So not only are those things top-of-mind, and so you don’t have an excuse for, you know, putting them in a corner and forgetting about them. You’ve actually got a bit of time to address them.

Lachlan James 36:01

So the other part of this report that I also thought was interesting was this: There’s been a big drop in the proportion of organizations reporting service incidents that impact clients. But, the vast majority of survey respondents did report that incidents are becoming more complex, and taking longer to resolve, which I thought was really interesting. So the top three factors that were cited to underpin that were: Lack of unified communications with teammates; Lack of visibility into dependencies and what teams or people are responsible for in terms of code or infrastructure; And thirdly, processes being harder to follow while working remotely. So James, I think the last point here, you know, regarding lots more remote work, goes a long way to explaining why those incidents are taking longer to resolve. I think that kind of makes sense. What I’m really interested in is what do you think is happening in the DevOps world that’s causing service incidents to increase in complexity?

James Hunt 37:03

It’s maturity, right? You’ve now solved the easy problems, right? You picked all the low hanging fruit, right? There’s a theory, it’s more of an explanation, of the difficulty in the oil and gas industry, of extracting more and more fossil fuels from the crust. A long time ago, you used to be able to walk along the countryside and pick up coal. Just pick it up, take it home, pop it in the oven, and cook your food. But now we’ve done all the easy ones. And we’ve gotten all the top surface extractions done. So now it’s costing more and more energy, more and more time to get the deep stuff. And I think that’s where we’re at with service delivery and DevOps. [For example], we’ve solved version control. So that’s not a thing. Everything goes in version control. Whereas, 10 to 15 years ago, only the code went in version control. Everything else was handled somewhere else. You’ve got the CI/CD pipelines, where you’ve got pretty much everything integrated. You can take a prototype from idea to something working in an afternoon. And then as you poke changes at the code, it’s updating in production. The explosion of side projects alone in the tech space is, to my mind, evidence of that. Because it doesn’t take a massive amount of effort to do these things, more of them get done. And the nature of humans is that we will use all of the tools that we have at our disposal, and will combine them in new and interesting ways. And now you have a distributed system. And if you thought systems thinking was hard, distributed systems thinking is a whole nother ballpark, or a whole nother playing field or whatever sports metaphor you want to use. So yeah, I think we’ve solved the easy problems and now what’s left are the hard problems. And because we’re now moving faster and faster and faster, we’re going to hit those hard problems more and more and more.

Lachlan James 39:29

No, I think that makes sense. And just to go back half a step: You didn’t seem entirely sold when I was going through the things that underpin why service incidents are taking longer to resolve. So complexity obviously plays a part. So we’ve talked about the complexity and probably why that’s the case. That kind of makes sense. Okay, so if the pandemic isn’t, I guess, entirely to blame for why things are taking longer in this regard, what other factors do you think are at play here?

James Hunt 40:19

Again, I’m gonna go back to the speed and velocity and the explosion of systems. Way back in the day, if something happened to the e-commerce site, who did you call? The e-commerce team. They handle it. There’s only five of them, you have their phone numbers, you can call them up after hours, they have a well-defined escalation process. You get an engineer on the line and they know the system because it’s a monolith, it’s their system. Now, you’ve got DevOps people in the cloud. You’ve got the active people pushing services in a microservice architecture, you’ve got routing in between all those, you’ve got message queues, you’ve got brokers, you’ve got REST endpoints, you’ve got TLS certificates. The reason it’s taking longer to solve the more complicated problems is because they involve a lot more unknowns in the brain of a single engineer. The number one problem I’ve seen with large Kubernetes microservice architectures is figuring out where to go to find the damn logs. And if you’ve managed to build a centralised logging system, good on you. The problem now is how do I find the logs I want in this mess of all the logs from the system? What’s the search query? Is it this? Is it that? Then eventually, you realize it’s failing because of an API call. And the API call is failing because of an expired cert. And the expired cert is there because we popped a service out a year and a day ago, and didn’t put it in Let’s Encrypt, or we didn’t put it inside of our certificate management. Because, it was temporary, right? There’s nothing so permanent as a temporary solution. And, given that we can push more and more services into production with minimal effort, more and more temporary stuff makes it into prod.
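
The expired-cert scenario described here is the kind of thing a small scheduled check can catch before it turns into a failing API call. A minimal Python sketch, assuming you already keep an inventory of service names and their certificates’ notAfter timestamps (the inventory, the service names and the 30-day threshold are all hypothetical; `ssl.cert_time_to_seconds` is from the standard library):

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in certificates,
    e.g. "Jun 20 12:00:00 2022 GMT". Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiring_soon(inventory, now, warn_days=30):
    """Names of services whose certificate expires within `warn_days`."""
    return sorted(
        name
        for name, not_after in inventory.items()
        if days_until_expiry(not_after, now) < warn_days
    )


if __name__ == "__main__":
    # Hypothetical inventory: service name -> certificate notAfter value.
    inventory = {
        "checkout-api": "Jun 20 12:00:00 2022 GMT",
        "temp-report-service": "May 31 00:00:00 2022 GMT",
        "auth": "Jan 01 00:00:00 2023 GMT",
    }
    now = datetime(2022, 6, 1, tzinfo=timezone.utc)
    for name in expiring_soon(inventory, now):
        print(name, days_until_expiry(inventory[name], now))
```

The check only helps if the “temporary” service makes it into the inventory in the first place, which is the cultural point being made above.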

James Hunt 42:17

But again, complexity. Finding who and what and where has gotten much more complicated. I’d actually argue that the remote work makes this easier, because it’s a lot easier to get a hold of someone and not have to deal with their spinning online time, right? So in pre work-from-home, if you had to call the e-commerce team, somebody had to drive into the office. I remember being on-call for servers in the data center. My on-call duties were: Try not to go anywhere; don’t get too wasted. Make sure you’re around so that – if we call – you can drive in, key card into the data centre, and do what you need to do. But that was still 20 minutes. Whereas now, it takes two minutes; three minutes if you can’t find a laptop, right?

The Ballmer Peak

Lachlan James 43:24

So as a follow-up: Like, I mean, now that you don’t have to drive anywhere, can you do that while you’re drunk?

James Hunt 43:30

I mean, there’s the Ballmer Peak, right? The Ballmer Peak from Windows ME. Lachlan is looking at me like he has no idea what I’m talking about.

Lachlan James 43:45

No, it sounds great, though. Yeah, good. Love it…

James Hunt 43:48

So when they were shipping Windows ME, there was a theory that alcohol lowered the inhibitions [and would therefore help] programmers write the code that needed to be written. So there’s an xkcd comic of the Ballmer Peak:

xkcd comic about the Ballmer Peak theory

As you get closer to the Ballmer Peak, your ability to code goes exponentially up until you hit the Ballmer Peak. Then it drops off exponentially fast as you get too hammered to do anything logical. And right at that peak is when Windows ME got most of its code written.

GitLab 15, unveiled

Lachlan James 44:31

Alright, so next up, and finally, James… GitLab, the pervasive open source DevOps service, just launched GitLab 15. And basically, it’s the next major iteration of what it does. It’s slightly egotistically named as ‘The One DevOps Platform’; anyway.

GE: GitLab 15 Provides Replacement for Do-It-Yourself DevOps with The One DevOps Platform

According to their own press, GitLab 15 helps companies develop and collaborate better around business critical code. And it sort of spreads this out across five main areas. It says Visibility and Observability, Continuous Security and Compliance, Enterprise Agile Planning, Workflow Automation, and support for Data Science Workloads. And so, GitLab claims this new release will allow organizations to discard DIY DevOps toolchains, and bring teams together from planning to product, to ship better code, faster. Sounds nice, sounds nice…

Lachlan James 45:34

So James, there’s a lot of buzzwords going on in GitLab’s announcement, but also potentially a lot of promise as well. In this release, GitLab attempts to connect the resolution of well-known DevOps inefficiencies and immaturity challenges, many of which we’ve touched on today, with this launch. And in fact, GitLab cites a recent Bain and Company report, Taming the Wild West of DevOps, which suggests that 90% of in-house tech teams view DevOps as a top priority, but only around 12% consider their own DevOps practices to be what they’d call mature.

Taming the Wild West of DevOps

So how close do you think GitLab 15 will go towards closing that DevOps maturity gap?

James Hunt 46:10

Well, I will say nothing brings disparate teams of people working inside of companies together like forced standardization on a single tool! They will all unite in defiance of that order. I like GitLab, I don’t have anything bad to say about GitLab technology. They have a really nice, compelling alternative to GitHub, which is their main claim to fame. That said, I don’t think that any one tool is going to do much of anything to pull forward someone’s maturity on DevOps, any more than having the latest in Mac hardware, or Windows PCs, or whatever is going to make you a better computer user or a more savvy digital person. And like I said, I think that the tooling landscape around DevOps is mature enough, [so] that if you want to change your processes, and you want to change your culture, you have plenty of options to choose from. And the options aren’t what’s holding you back.

James Hunt 47:46

What’s holding you back is people not buying into the story. It’s not having a compelling end goal. And I think there’s also a bit of, I wouldn’t call it myopia. But, people look at their own stuff, and they go: ‘This is crap’. And then they look at other people’s blog posts, they say those guys have it going on, and they know what they’re doing. There’s something to be said for that sort of imposter syndrome at a team level. Right? [That mindset that says]: ‘We’re not as good as we could be at DevOps, we’re not doing this well, we’re not doing that well’. There’s always a million things you can improve, right? And I don’t know that a tool platform is going to fix that. I am curious if support for, quote unquote, ‘data science workloads’ means the ability to draft GPUs into your pipelines. It’s a very interesting way of saying ‘we signed a deal with NVIDIA’. But GitLab has always been big on workflow automation. And what I find with most people with workflow automation is the tools rarely hold them back – their imagination holds them back.

White knight or status quo?

Lachlan James 49:06

Yeah, I think that’s probably a pretty fair assessment. I suspect it probably does relate chiefly to that [NVIDIA] partnership that you were talking about there. Moving on, just to another article that was written about the same launch, which I thought was interesting. So ComputerWeekly.com wrote a review as part of their Open Source Insider blog. Columnist Adrian Bridgwater highlighted the issue in DevOps with regards to developers in the real world knowingly shipping dodgy code – for a range of different sorts of reasons.

GitLab 15 drops slops on DIY DevOps flops & muttonchops

Lachlan James 50:18

He says: “Too often, businesses identify vulnerabilities late in the development process, increasing costs, and risking security breaches and business disruptions. This dynamic compounds the new attack vectors introduced by complex modern applications and cloud native deployments.” And Bridgwater goes on to say that, “With its latest release, GitLab aims to strengthen its ability to help enforce compliance across the entire software development lifecycle, and provides built-in security scanning, and compliance auditing features… empowering DevOps teams to deliver value while improving how they work.”

Lachlan James 50:56

So that’s pretty high praise there for a vendor. And I’ve got to say, I’ve worked with a lot of the folks who write for the TechTarget stable [of publications] for a lot of years, and it’s a bit of a rarity for those journos to, I guess, put out praise like that. So do you see a meaningful difference being made here, James, or is knowingly shipping buggy code purely a behavioural issue that comes down to bad culture and mixed priorities?

James Hunt 51:36

I think there’s something to be said for having out-of-the-box security posture built into the platform you’re working on. I did a lot of work in my past lives with Cloud Foundry as a Platform-as-a-Service. And while Cloud Foundry primarily sold itself to developers for deploying apps into production, the other side of that was the concept of the buildpack, which meant that security and compliance with base level things was handled by the platform teams. So if you had 1000 app engineers writing code in Ruby and Go and Java and all these other languages, you maybe had 20 platform people who were doing things like making sure that the thing was spinning, as well as running security audits against the app inventory. And Cloud Foundry made that very easy. That gave platform teams the ability to say: ‘Whoa, there’s a vulnerability in the buildpack. Let’s patch that, upload a new base of how Java apps are built, and then replay all the code that’s been given to us over the last year and a half by the devs.’
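
The platform-team workflow described here (patch the build base, then rebuild every app that was built on the old one) comes down to a query over the app inventory. A minimal Python sketch of that selection step, assuming a hypothetical inventory that records which buildpack version each app was last built with (the `App` record and its fields are illustrative and do not mirror the Cloud Foundry API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """One entry in a hypothetical app inventory."""
    name: str
    buildpack: str
    buildpack_version: tuple  # e.g. (4, 49, 0); tuples compare element by element


def apps_to_rebuild(apps, buildpack, patched_version):
    """Names of apps last built with `buildpack` older than `patched_version`."""
    return [
        app.name
        for app in apps
        if app.buildpack == buildpack and app.buildpack_version < patched_version
    ]


if __name__ == "__main__":
    inventory = [
        App("billing", "java", (4, 49, 0)),
        App("search", "java", (4, 50, 0)),
        App("storefront", "ruby", (1, 8, 0)),
        App("ledger", "java", (4, 32, 1)),
    ]
    # A fix ships in a (hypothetical) java buildpack 4.50.0: which apps need a rebuild?
    print(apps_to_rebuild(inventory, "java", (4, 50, 0)))
```

The app teams never appear in that loop, which is the point: a handful of platform people can patch the base for a thousand developers.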

James Hunt 52:53

That enablement of that practice allowed a lot of big companies to really step up their game in patch management and vulnerability scanning. When Docker implemented the Snyk security scanning with the Docker daemon, that was hugely enabling. While I did say, earlier in this episode, that the scanning for vulnerabilities is really kind of like fighting yesterday’s war… It’s valuable to catch the dumb stuff you’re going to do as a developer. Oh, I didn’t realise that this library had massive vulnerabilities. And, if it’s early enough in the process, I can switch course, right? One of the things that the article mentioned was that if you find the security vulnerability at the tail-end of the deployment or the development process, you’re kind of [between] a rock and a hard place situation, right? You can either push the schedule, which makes the stakeholders unhappy, or you can just go into production and then put a backlog thing that says ‘make sure that thing’s secure’, which will make the stakeholders unhappy later down the line. Having the technology to catch the dumb mistakes is valuable. What I don’t think is going to happen is this magical compliance and security Renaissance. Right? Of now, everything’s secure and everything’s compliant.

James Hunt 54:26

There’s a bunch of people out in the industry talking about SBOM – Software Bill of Materials, saying ‘if only we could recreate the exact concoction of our images, and we could then cross reference that with all the CVEs’. Yes, that is a valuable tool, but by itself isn’t going to solve much. It will give you some visibility, but you have to have the practices. I can’t tell you how many teams I’ve talked to and worked with where their pipeline is: Download the code, right? Do a git clone, then run the unit tests, and then run a yarn audit (or some other software thing), and then ignore the results, and then push it into production. Why are you ignoring the results? Why is this line commented out in your build pipe? Or why do you have an ‘or true’, so that it doesn’t bail out the job? Well, the answer is invariably: ‘We don’t have time to fix those now, we’ll go back when we have time’. That’s a cultural problem, right? You’ve got the tools, you’ve got the opportunity, but you’re stepping around it out of expediency and convenience. And that goes back to that culture of practice. If you don’t value security, if you don’t value compliance, you’re going to work around the system – no matter how great the system is going to be. Or, you’re going to get rid of the system, because it’s not flexible enough.
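
The ‘or true’ trick works because a pipeline step normally fails the job when its command exits with a non-zero status, and appending `|| true` throws that status away. A minimal Python sketch of the difference, where the audit is a stand-in command that always exits with status 1 (the `run_step` helper and its `ignore_failure` flag are hypothetical and not part of any CI system):

```python
import subprocess
import sys


def run_step(command, ignore_failure=False):
    """Run one pipeline step and report whether the pipeline may continue.

    `ignore_failure=True` mimics appending `|| true` to the step:
    the command still runs, but its exit status no longer matters.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        return True
    return ignore_failure


# Stand-in for an audit tool that found problems and exited with status 1.
failing_audit = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print("gated step continues:", run_step(failing_audit))
    print("step with '|| true' continues:", run_step(failing_audit, ignore_failure=True))
```

In both cases the audit runs and finds the same problems. The only thing that changes is whether anyone is forced to look at them.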

Lachlan James 55:55

Because it doesn’t let you go all the way to an end, reveal problems, then allow you to ignore them?

James Hunt 55:58

Because, at that point, someone’s going to see that the system is holding up a delivery to production. Right? And that’s how it’s going to be viewed. That’s how it’s going to be communicated. It’s going to be socialised. And then you get this outcome and response: ‘Oh, we don’t use that tool anymore, because they caused us a delay, they caused this type of an outage’. What happens if you’re trying to get a patch out and the patch to fix a breaking incident in flight gets flagged by the security system? Well, you’re probably going to bypass the security system! Because, you can’t tell the customers: ‘I know you’ve been without data for seven days now, but we’re just trying to figure out how to patch this low level Linux kernel thing that keeps triggering the CI/CD pipeline’. That doesn’t fly.

Until next time

Lachlan James 55:59

No, that’s a fair assessment. Alright. Hey James, it’s been awesome having a chat. I think that’s gonna do us for Episode One of DevOps Digest. So it is – I know you love this word, James – it is the fortnightly podcast brought to you by Vivanti consulting. For those of you listening in who aren’t familiar with a ‘fortnight’, it means you can expect to see one of these every two weeks.

Lachlan James 57:05

So James, thank you very much for your time.

James Hunt 57:09

Anytime, Lachlan.

Lachlan James 57:10

Oh thank you, that warms the cockles of my heart; whatever those are. And for those of you watching, thanks so much for tuning in. To receive regular DevOps news and analysis, you can subscribe to our YouTube channel or go to vivanti.com/contact-us, and sign-up to our mailing list. Alright, it’s been a pleasure to have your company and bye for now.