The Secure Developer | Ep 67

Security Chaos Engineering - What is it and why should you care?

with Aaron Rinehart

About this episode:

In episode 67 of The Secure Developer, Guy Podjarny talks to Aaron Rinehart, CTO and co-Founder at Verica, a continuous verification company that uses chaos engineering to make systems more secure. Aaron has been expanding the possibilities of chaos engineering in its application to other safety-critical portions of the IT domain, notably cybersecurity. In this episode, we learn more about Aaron’s diverse background from developer into chaos engineering.

Chaos engineering is a powerful practice where experiments are run to build confidence that a system operates as expected. While the practice shapes the way that large-scale systems are built, it is underutilized in the security space. Verica is looking to remedy this shortfall, and its co-founder and CTO, Aaron Rinehart, joins us today. Having worked as a developer before making his move into security, he understands systems intricately, giving him unique insights. We then dive into chaos engineering, the proactive approach it takes, and the intentional feedback loop it provides. Aaron believes that these experiments are great learning moments because they do not carry the high cognitive load that comes with unplanned system failures. After that, we turn our attention to how chaos engineering accelerates a system’s path to stability in a controlled and managed way. Along with this, we explore why it’s not necessary to wait for production to test different security controls, what security chaos engineering offers incident response teams, and some fascinating use cases. Be sure to tune in today!

Tags:

Chaos Engineering

DevOps

DevSecOps

Episode Transcript

Guy Podjarny: Hello, everyone, welcome back to The Secure Developer. Thanks for tuning back in. Today, we’re going to dig into security chaos engineering, which is a topic we just scratched the surface of in a previous episode with Kelly [Shortridge]. And to do that, we have really a top expert. I guess, you know, you could say you wrote the book on it, or are writing a book on it: Aaron Rinehart, who is the CTO and co-founder of Verica. Aaron, thanks for coming on the show.

Aaron Rinehart: Thanks for having me.

Guy: Before we dig in, you know, we’re going to go deep here, hopefully, on what security chaos engineering is and why it exists. But before that, tell us a little bit about what you do, maybe what Verica does a bit, and a bit of the journey that got you here.

Aaron: We are the creators of chaos engineering. My co-founder is Casey Rosenthal; he created chaos engineering at Netflix around the teams, around the traffic team. I am most notably known as being the first to ever apply what they were doing at Netflix to cybersecurity. I wrote the first open source tool for that, called ChaoSlingr, when I was at UnitedHealth Group.

At Verica, we believe we’re building a series of tools that evolved off of the work both Casey and I did in chaos engineering. We classify our tools as continuous verification. What we mean by that is that it’s a series of tools that help people understand their own systems. Chaos engineering is one toolset within there, Canaries, [inaudible 0:02:39] tool sets.

Things that help people and usually verify that the system is what you think it is and it’s operating as intended. That’s what we do.

Guy: How did you get to this place of co-founding Verica? How did you get into security in the first place?

Aaron: Let’s start with Verica. I co-founded Verica with Casey about a year and a half ago. We wanted to try to bring a more mature toolset to the space. The space lacks a lot of tooling, and we wanted to evolve both my work from UnitedHealth and his work at Netflix with ChAP, and bring more sophisticated tools that can help people navigate these problems of complexity in modern software.

Prior to that, I was the Chief Security Architect at UnitedHealth Group. I was part of leading DevOps transformation. Did some work again in cloud transformation – feels like every place that we go with cloud transformation. Actually, one of the main drivers of chaos engineering is cloud transformation and we’ll cover that later.

I wrote the first open source tool that applied Netflix’s chaos engineering to cyber security that was ChaoSlingr. We can talk about that and the use cases if you’re interested. Prior to that, I spent time working in the DoD, US Department of Defense, Department of Homeland Security, actually a stint of about four years at NASA, working in safety and reliability engineering so it’s not too far of a stretch from resilience engineering and I’m a former US Marine.

Guy: That’s interesting, you could kind of claim you know, you went from military to safety and kind of reliability to security to finding a merger of the two in security certification, I guess then.

Aaron: In terms of career paths, I started off in network engineering, and then I got into sort of sysadmin, system engineering type of work. Then, for some reason, I couldn’t get jobs doing that, so I was hired on as a PeopleSoft developer and I became a database engineer. I started writing frontend code, and then I started writing web apps, and then I ended up being a software engineer for about 10 or 11 years. At NASA, I had the opportunity to get into security because nobody else wanted to do it.

I remember the first time I ran Nessus, which is a scanning tool. I followed the tutorial on the internet, I put in the exact IP addresses that were in the tutorial, and I ended up scanning the German government from NASA. It was quite an interesting day. That was –

Guy: One way to get exposed to it.

Aaron: That was my intro to security. That’s right, NASA is closer again in security. I had a fairly accelerated career in cybersecurity, given the fact that I had just a wide engineering toolset. I was able to sort of sift through how things work and whether things were secure or not a lot quicker than other people because I was a builder most of my career. And there’s only so many ways to build and so many ways to secure what you built and that accelerated my ability.

Guy: That’s actually a great perspective on the advantage, I guess, of coming from an engineering background, and maybe seeing the wide variety of ways to build things, into security, and how that boosts you up.

Let’s dig in to the substance, you know? We’ve been promising. Before we go into security chaos engineering, can you give us a little bit of a primer to what chaos engineering is in the first place?

Aaron: Chaos engineering is the idea of proactively introducing turbulent conditions into a distributed system to build confidence that the system operates the way you think it does, right? The key thing with it is proactive, is that a lot of times, we don’t really know something’s broken until it’s broke and you had to incur pain in order to know there’s something wrong and go back and fix it.

What chaos engineering does, it allows us to introduce these turbulent conditions, these faults, these failures in the system proactively to try to surface failure, proactively, so we can fix it before it manifests into problems. And the process of doing that, you build confidence that the system works the way you think it does.

Just a little context around chaos engineering is that I had never seen – I’ve asked this question with several other people in the chaos engineering space. I don’t think anyone has ever seen a chaos engineering experiment which has any injections that we inject, succeed the first time. What that usually means is that, you never do a chaos experiment you know is going to fail, right? Because if you know it’s going to fail, just fix it because you already know it’s not going to work. You’re trying to ask a computer the question, “Hey, when X occurs, I designed Y to be resolved and is that happening?” right?

You instrument that. You’re asking the computer a question. Very often, we’re wrong about how our systems work. It’s not any engineer’s fault; it’s the fact that the size, the scale, the complexity and the speed at which we’re building things today make it very difficult for humans to mentally model that behavior.
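The "when X occurs, I designed Y to happen" question can be written down as a small experiment loop. What follows is a minimal sketch, not Verica's or Netflix's tooling: the `Experiment` class, the `run` helper and the two-replica toy system are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: "when X occurs, I designed Y to happen"."""
    name: str
    inject: Callable[[], None]    # X: introduce the turbulent condition
    verify: Callable[[], bool]    # Y: did the designed response occur?
    rollback: Callable[[], None]  # restore the system afterwards


def run(experiment: Experiment) -> bool:
    """Inject the condition, check the hypothesis, and always roll back."""
    try:
        experiment.inject()
        return experiment.verify()
    finally:
        experiment.rollback()


# Toy system (hypothetical): a service backed by two replicas.
replicas = {"a": True, "b": True}


def kill_replica_a() -> None:
    replicas["a"] = False


def service_still_healthy() -> bool:
    # The design claim under test: one surviving replica is enough.
    return any(replicas.values())


def restore_replica_a() -> None:
    replicas["a"] = True


experiment = Experiment(
    name="service survives loss of one replica",
    inject=kill_replica_a,
    verify=service_still_healthy,
    rollback=restore_replica_a,
)
hypothesis_held = run(experiment)
print(f"{experiment.name}: {'held' if hypothesis_held else 'FAILED'}")
```

The rollback sits in a `finally` block so the injected condition is removed whether or not the hypothesis holds, which is one way of keeping the experiment controlled.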

Guy: I guess failures here vary. Is it mostly about virtually unplugging a network cable or sort of taking down a server or does it get fancier than that?

Aaron: It can get a lot fancier than that. Most open source tools out there really focus on sort of the basic failures, similar to what Chaos Monkey did 11, 12 years ago. It all began with Chaos Monkey and Netflix’s story. Actually, something that I like to remind people of is that chaos engineering came about during Netflix’s cloud transformation from DVDs to Amazon streaming. I love it when people are like, “Aaron, we can’t do the chaos engineering, we can’t even do the DevOps, and we’re just starting to begin our cloud transformation.”

That’s exactly what Netflix used it for and what most people use chaos engineering for. In most cloud transformations, people underestimate the time it’s going to take to do it, they complain that they don’t have all the right skills and the right people, and if they hire Amazon to build it all for them, then they’re worried about not having the competency and understanding, and they way underestimate the resources and the cost, right?

That’s why chaos engineering was so valuable: as you’re building out your applications and whatever products or business you’re intending to build out there, you continuously verify it, asking the application and what you built the question, “Hey, do you work the way you’re supposed to?” And you create that feedback loop, right?

It’s a constant feedback loop of, hey, yeah, what we’re building is actually working. It builds that confidence and it reinforces skillsets. When it doesn’t work, you’re able to get context on what exactly wasn’t configured right and what needs to be fixed. Chaos engineering is really about putting well-defined problems in greater context in front of engineers.

It turns out, when you give engineers the greater context and better definition of problem, it’s easier to solve those problems, and that’s really kind of what it’s about.

Guy: I’m a big fan of the chaos engineering concept and how it gives you a feedback loop in an intentional fashion, you know? To sort of see what happens if. It’s great to sort of hear you and fully kind of relate to this is applicable really kind of with any system. That said, when you think about when do you inject that failure, people are a little bit afraid of actually taking down their systems.

So, I’ve injected this failure, right? I pulled out this sort of virtual cable or severed some connections for some disk access and my system will actually be down. How do you see the people run chaos engineering primarily in production? I know Netflix today at least is kind of a very well-known for cutting things off in production.

Or is it something where most of the world is still doing this in kind of pipeline or staging environments?

Aaron: There is a new O’Reilly book that just came out three weeks ago. I think there’s a chaos maturity model in the book. I think it’s written by Nora, I’m not sure if that was Nora Jones who wrote that or not. You don’t have to do chaos engineering in production to get the value from it. I mean, ultimately, that’s where the outage or the security incident and breach is going to happen. We all know there’s a drift between what’s in stage and what’s in prod.

But you can learn a lot about your environment through stage. I’ll tell a brief story. It’s not my story, it’s Casey’s story. One of the first banks to ever do chaos engineering invited Casey out to do a tech talk on chaos engineering to the team, and they were going to do their first game day exercise. A game day exercise is actually a manual exercise where a team gets in a room, they inject the failure, and they observe whether they were right or wrong, whether the system worked the way they thought it did.

In this case, they’re going to bring down a Kafka node. They say, “Casey, you know we’re a bank, we have real money on the line, we can’t do this chaos engineering.” And he said, “All right, it doesn’t have to happen in production.” “Sure, we’ll do it on stage.” They went for it. They got everybody in the room. Casey had done his tech talk, and doing this game day exercise, they go forth and bring down the Kafka node in stage. Can you tell me what happened?

Guy: Did production go down?

Aaron: Production went down. That illustrates the issue, right? Is that resilience is not something a system has, it’s something a system does. It’s the humans that create the resilience in the system. In this case, with all the complexity, like I said, in speed and scale and how we’re doing things, they crack the change of pointers information. It’s a human thing, it happens, right?

We should expect that that kind of stuff is going to happen and have mechanisms to reassure ourselves that we’re prepared when those kind of things happen. Think about this: There was no outage, there was no incident, everybody who could bring Kafka back was in the room.

I grew up on a farm in Missouri so it’s kind of like we do controlled burn of a field. It’s like you know, you don’t just light a match and go to town. You notify the county, you bring out the EMT, you have fire department, everybody’s there in case something happens, right?

That’s when you do chaos engineering. And you’re not freaking out because during this outage, people freak out. Their cognitive load is consumed by, “My god, this could be breach, CEO is on the phone telling me I got to get this thing back up and running, somebody’s going to lose their job over this.” People are kind of just freaking out. The focus is getting it back up and running, not fix the problem, it’s getting back up and running. They’re losing money probably, call centers getting overloaded and that is not a learning environment you want to be learning in.

It is always interesting: you see that the most messed up changes happen during incidents, changes that would never be approved outside of an incident scenario, right? You’re changing that variable? You made me go through six weeks of change review to change that, you don’t even understand what you’re changing, but you can change it during the incident?

What happens is, because we do that, one problem will beget another problem. From war room to war room, to incident to incident and outage to outage.

You know, here’s an interesting story, and then I promise I’ll shut up on this point. I was out at one of the largest financial payment processing companies, talking to their Chief Engineer, and he was telling us about this legacy system that they depend on: “The company’s core application that makes us money runs on this thing. It’s stable, it’s known, the engineers are competent with it, it rarely has an outage, it’s our trusted legacy system.” Because they need to extend it for other capabilities and purposes, they want to move it over to Kubernetes, a very untested sort of platform for them. They don’t have confidence in their skillsets. I was a bystander in this conversation with Casey and this gentleman.

I was thinking about it. “How does a system become stable? Is it stable from day one?” At one point that stable legacy system was probably just like Kubernetes. It was unknown, untested. Our systems become stable through a series of unforeseen events. Unforeseen, because if we could have foreseen it, we would have just fixed it. These unforeseen outages and incidents informed us of the failures in our system that we needed to fix, but we encountered pain through our learning.

Chaos engineering, you could think about it this way. I do a lot of writing on this particular envelope of knowledge, including the security chaos engineering O’Reilly book that’s coming out in the summer. How does your system become stable? Chaos engineering is a way to sort of accelerate that in a controlled and managed way. It is to proactively inject the conditions by which you expect your security to fire or to be triggered.

In short, okay, yeah, it does actually work, okay, it is working, right? It does catch a misconfiguration. It does catch an open port that shouldn’t be open. It does catch a permissive account or role collision. It does catch these things that we expected. Then what you do is you automate those tests, and they become more of a regression kind of thing once you’ve proven that the experiment is actually successful.

Guy: Love it and also love the passion. You know, very clearly. You know, you talk about it, you’re introducing failure, you’re creating that feedback loop of understanding ‘what if’ scenarios for that failure.

Chaos engineering has historically not been known for security application, right? It’s been known as an aspect of resilience, and initially its tie into security has been more in that intersection of availability and security. So, like being out of service or things like that. But I think when you talk about security chaos engineering you are talking about broader applicability, right? So, what is that slice of chaos engineering, or is it a slice or an adjacency, that is security chaos engineering? What is different about that from chaos engineering?

Aaron: Well, Casey likes to say that when I brought the security use cases to chaos engineering, he started saying, “Well, Aaron, you know I hadn’t thought about it until you brought it to us at Netflix, that there are two sides to that coin of system safety: there is security and availability.”

But the use cases are a little bit different. It is really the same technique: failure injection, fault injection. That is also what separates it from maybe red teaming or purple teaming or a BAS (breach and attack simulation) toolset. I can expand on those a little bit. I tried to do that in the O’Reilly book, and I wrote an article on opensource.com called Red Fish, Blue Fish, Purple Fish, Chaos Fish that talks about the differences, just because I get that question all the time. All of those are valuable, by the way, don’t get it twisted. The more objective information we can derive about how our system functions, the better. And with red teaming and purple teaming, instrumentation is key.

But so, the use cases around security chaos engineering I mean really this was all an experiment at UnitedHealth Group, right? My original use case was control validation. So, people would come to me with hard engineering problems, architecturally. They’d bring a data architect and a solutions architect come to me with two different diagrams of the same system.

What the different diagrams represented was their mental models of how they believed the system functioned. But holistically, that is rather what the picture really was. So, I’d provide these recommendations: you need a firewall here, you need this thing here and that thing here. Placement matters, configuration matters, how it’s tuned, all of these things matter. I never got to see that. I didn’t want to be an ivory tower architect who just gives advice that may not actually work.

So, what I wanted to do is sort of skip the line and ask the computer the question and under these conditions, do you fire, do you work how I designed you to work? And that was really the series of use cases we came up with around architecture and control validation.

Mostly what the security chaos engineering use cases are focused on is really misconfiguration. And misconfiguration is a result of not a human making an intentional mistake, per se. It is more like – There’s a speed, scale and complexity. It is easy to be out of sync or to not have an accurate depiction of understanding of what you’re securing, what you are configuring because that can change. It is very stateful and what is state-less. When we approach security typically, a lot of the time, it’s a state and that state changes on a regular basis.

If you look at the majority of, let’s say, malicious code out there, the majority of malicious code is junk code. It really is. There is a small percentage of it, five, six, to 10% at the highest, that is sophisticated nation-state kind of stuff. Good engineers were hired to do that. The majority of it is just junk. It is stuff that would never execute. If you read through it, if you go to the AV sites or the malware sites where they break down the steps, there is usually some kind of low-hanging fruit that has to exist for the rest of it to execute, right?

What security chaos experiments are really focused on is accidents and mistakes that engineers make when they are trying to build things. So we’ll inject the misconfigured port or inject a permissive account. We inject the things by which we expect the security to fire. What we are trying to do is also detect when these kind of conditions happen because we know they’re going to happen because of – like I said the speed, scale and complexity – It is easy to lose track of what the system actually is. It is easy to make those mistakes. What we are doing is trying to proactively discover them before an adversary can.

Guy: So, this makes a lot of sense. But I am trying to clarify something. Is security chaos engineering chaos engineering for security controls? Is it always around saying, “Hey, I have a security control here that is supposed to catch an open port or supposed to catch this misconfiguration,” and what you’re doing with the experiments is to see if the security control triggers? Is there another use case? Because when you think about weaknesses in the system, you know, it could be that an open port might imply some traffic flowing in, right? Or things of that nature. Is it mostly around that form, or is it first and foremost about testing your security controls?

Aaron: Well, there are several use cases, and control validation is just one. The first time you run an experiment, you are not only looking to see what works but also what else might work. For example, take ChaoSlingr and the main use case that we went with for ChaoSlingr. It is an open source tool that we wrote. We picked a use case that a software engineer, a network engineer, a system engineer, and an executive could all understand, which was PortSlingr. PortSlingr basically introduced an unauthorized or misconfigured port.

This still happens all the time, for whatever reason. It could be that somebody followed a ticket wrong, or a software engineer didn’t understand network flow. It could be for lots of reasons. It could be the engineer just made a mistake when they executed the change. So we started injecting misconfigured ports on our AWS.

So, UnitedHealth Group had commercial and non-commercial software. We were very new to AWS as a company, and we started injecting these misconfigured ports, and we expected our firewalls to immediately detect it and block it and for it to be a non-issue. What happened was that worked about 60% of the time. The firewalls we used caught and blocked it about 60% of the time, and we expected that they’d work 100% of the time.

So that is the first thing we learned is that it was actually a configuration drift issue and how we are configuring things between commercial and non-commercial environments.
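A port-injection experiment of this shape can be sketched against an in-memory model. This is a toy under stated assumptions, not ChaoSlingr's code: the `Instance` class, the `firewall_enforced` flag, port 2222 and the five-instance fleet are all hypothetical, and the fleet is deliberately chosen so the numbers echo the roughly 60% figure from the story.

```python
from dataclasses import dataclass, field

ALLOWED_PORTS = {443}


@dataclass
class Instance:
    instance_id: str
    environment: str         # e.g. "commercial" or "non-commercial"
    firewall_enforced: bool  # hypothetical: is the blocking rule really deployed here?
    open_ports: set = field(default_factory=lambda: {443})


def firewall_blocks(instance: Instance, port: int) -> bool:
    """Stand-in for the real control: closes the port if the rule is deployed."""
    if instance.firewall_enforced and port not in ALLOWED_PORTS:
        instance.open_ports.discard(port)
        return True
    return False


def run_port_experiment(fleet, port=2222):
    """Inject an unauthorized port on every instance and record who blocked it."""
    results = {}
    for inst in fleet:
        inst.open_ports.add(port)          # inject the misconfiguration
        results[inst.instance_id] = firewall_blocks(inst, port)
        inst.open_ports.discard(port)      # roll back regardless of outcome
    return results


def block_rate_by_environment(fleet, results):
    """Group outcomes by environment to make configuration drift visible."""
    totals, blocked = {}, {}
    for inst in fleet:
        totals[inst.environment] = totals.get(inst.environment, 0) + 1
        blocked[inst.environment] = (
            blocked.get(inst.environment, 0) + int(results[inst.instance_id])
        )
    return {env: blocked[env] / totals[env] for env in totals}


fleet = [
    Instance("i-001", "commercial", True),
    Instance("i-002", "commercial", True),
    Instance("i-003", "commercial", True),
    Instance("i-004", "non-commercial", False),
    Instance("i-005", "non-commercial", False),
]
results = run_port_experiment(fleet)
overall_rate = sum(results.values()) / len(results)
by_env = block_rate_by_environment(fleet, results)
print(f"blocked overall: {overall_rate:.0%}")
print(f"blocked by environment: {by_env}")
```

Breaking the result down by environment is the point of the sketch: a single overall block rate says the control is unreliable, while the per-environment view points at where the configuration has drifted.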

Guy: So in 60% of the instances the setups would have caught it, while the other 40% didn’t?

Aaron: Exactly. Oh, and this is so random. It wasn’t all of them. But it was interesting that that occurred. Furthermore, what happened was that our cloud native commodity configuration management tool caught and blocked it every time. We were barely paying for it, and it caught and blocked it. So that was the second thing. The third thing we expected to happen was that the SOC would get notified of these events, since both of those tools should provide log data to a SIEM.

For a SIEM, we had our own security-specific logging tool that we developed at the company. That did detect it and called it an event. The event was sent to the SOC, and the SOC got the alert, but what the SOC couldn’t tell was which AWS instance it came from. Remember, we were very new to AWS at the time. As an engineer you can say, “Well, Aaron, you can map back the IP address to figure out which instance it came from and where it came from.”

Well, there is a thing called SNAT, right? SNAT will intentionally hide what the real IP address is, and you can spend 30 minutes to three hours figuring out where it is. If that were an active incident, and let’s say we have a million dollars a minute in downtime on that system, that’s very expensive. But we never incurred that pain. There was no outage, there was no incident, and we learned these things. I mean, all we had to do was add metadata and a pointer to the events, and then it was fixed. And we learned all of these things. It wasn’t just the control that we validated. Well, we did validate it. We did learn. We got the context that we needed in order to fix it. We were able to look at the problem with a wider lens instead of a set of blinders, because there was no outage. We weren’t freaking out. There was no CEO asking for 15-minute updates.
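The fix described here, adding metadata to the events, can be illustrated with a small sketch. The event fields, the `build_event` and `locate_instance` helpers, and the IDs below are hypothetical and only show the idea; they are not the actual event schema used at UnitedHealth Group.

```python
def build_event(instance_id, account, translated_ip, port, enrich=True):
    """Build an alert for the SOC; optionally attach the source instance metadata."""
    event = {
        "type": "unauthorized_port",
        "source_ip": translated_ip,  # post-SNAT address, not the real instance IP
        "port": port,
    }
    if enrich:
        # The pointer back to the source that makes triage immediate.
        event["instance_id"] = instance_id
        event["account"] = account
    return event


def locate_instance(event):
    """Return the originating instance if the event carries it, else None."""
    return event.get("instance_id")


bare = build_event("i-0abc", "acct-42", "10.0.0.1", 2222, enrich=False)
enriched = build_event("i-0abc", "acct-42", "10.0.0.1", 2222)

print("bare event resolves to:", locate_instance(bare))
print("enriched event resolves to:", locate_instance(enriched))
```

With only the translated address, the SOC has to reconstruct the mapping by hand; carrying the identifier in the event itself removes that step.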

Guy: It sounds powerful. You know, you could test the security control itself, and you also test the incident response. So even in the cases where something did trigger, did it go all the way to get all the information to actually make that visible?

Do you see, maybe it was back then or maybe today, and maybe this comes back to breach and attack simulation, but do you see a combination with some proactive activity? Would you simulate a bot while you reduced a security control or something of that nature, to see what happens, or are we veering outside the traditional definition of chaos engineering when we do?

Aaron: I talk to a lot of people about combining breach and attack simulation and chaos engineering. They are just different techniques. We are actually injecting the failure. We are not simulating an attack. So, you have to think about this. We are doing this most of the time on very large-scale distributed systems. I mean hundreds and thousands of microservices, right?

When you start stepping through multiple attack points, you are sending a lot of data to this system, and it is very difficult to sift through that data to tell what broke, what didn’t, what worked, what didn’t. And you have to account for the non-linear nature of a complex system; complex systems also can’t be modelled. They can’t really be diagrammed. The only way to understand a complex system is to interact with it. And if you start doing a bunch of things, it is very difficult to figure out what happened. I guess you understand what happened to the initial target, but what about all the cascading issues? You lose that visibility because of all the noise.

Guy: Then that drives you to make the specific changes as accurate as possible because you don’t want to introduce additional complexity with a very complicated experiment.

Aaron: I have always tried to stay away from complicated multi-step experiments and you can actually learn a lot by one simple experiment even just the port misconfiguration just you get one experiment on all of your instances, you’ll learn a lot of different types. Actually on the chaos experiments for availability, you can learn a lot. I mean just about almost everything you need to learn through manipulating latency.

You can learn a lot of different things about cascading failures, or SLOs and SLIs. I mean, one of the other use cases that I like to preach on for chaos engineering for security is, like you said, incident response. It is great for incident response teams. My boss, when we first showed him ChaoSlingr, was like, “Aaron, I love how this validates what we are doing on the engineering front. But Aaron, I love how this keeps our incident response team sharp. It keeps them well practiced.”

The reason this helps is that a security incident is very subjective in nature. No matter how much you spend, how many people you have, you still don’t know where it is coming from, who is going to build it, why they are doing it. When the event occurs, that may not have been when the actual incident began; that is just when you detected it. But with chaos engineering, we actually kick off and initiate the event. So, we know what started it and when that incident began.

Guy: You would mark as though there was some unauthorized access to some system or something like that, would that be the incident in this case and sort of see what happens next?

Aaron: You could go back to that PortSlingr example, right? We injected that I mean we can see whether the controls worked. We can see whether or not, “Hey shit, we didn’t have enough people available or we didn’t have the right skills or the [inaudible 0:25:39] were wrong or where did we just look at things differently.” One, because there isn’t an outage. We are not trying to evaluate people during a high-cognitive load situation.

We are able to proactively discover these things in a way where the incident response team can learn. You can actually now manage and measure, because you know the point at which it started. Otherwise your measurements could be very subjective if you are assuming that when you caught it was when the event began, because it is probably already too late by the time you caught it. And then you are also assuming it wasn’t a cascading failure that caused that. It typically is not one thing. It is typically multiple things that led to that event.
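Because the experiment itself starts the clock, response times can be measured from the true beginning of the event rather than from the moment of detection. A minimal sketch of that arithmetic follows; the `response_metrics` helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def response_metrics(injected_at, detected_at, resolved_at):
    """Measure response from the known injection time, not from detection."""
    return {
        "time_to_detect": detected_at - injected_at,
        "time_to_resolve": resolved_at - injected_at,
        # What you would report if detection were assumed to be the start.
        "resolve_time_from_detection": resolved_at - detected_at,
    }


# Hypothetical timeline for one experiment.
injected_at = datetime(2020, 1, 1, 9, 0)
detected_at = datetime(2020, 1, 1, 9, 4)
resolved_at = datetime(2020, 1, 1, 9, 35)

metrics = response_metrics(injected_at, detected_at, resolved_at)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

The gap between `time_to_resolve` and `resolve_time_from_detection` is exactly the detection delay, which is invisible in a real incident where the start time is unknown.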

Guy: That is another statement that is true for reliability and for security for itself. You know this has been a great overview of chaos engineering and the purpose of it. You know the notion of you don’t actually have to run this in production to get value of it. That is probably the eventual place you might aim for but it doesn’t have to start there and for security you know you can test the different security controls. You can test the incident response process for it and see how that works.

If somebody wants to dig in some more and learn more, you’ve actually written a couple of books or have participated in one, written another recently. So, for this there again, what were the two names of the books to look up?

Aaron: About three weeks ago we released an official O’Reilly AML book on chaos engineering. There is a chapter about security chaos engineering in there. Actually, on the Verica website, I think it is verica.io/book, you can sign up for the chance to get a free copy of the book. Other than that, Kelly Shortridge, who was on the show just before me, and I have been writing an O’Reilly book on the topic, due at the end of the summer.

We’ve got several companies that have been involved that are doing chaos engineering. A lot of interesting use cases about container security, cloud security, how people are using it, how chaos engineering for security are used to increase better logging, better monitoring, better observability. We expand on the use cases and different companies and give their stories as well. That would be a great place that people can check out.

Guy: That is awesome. We’ll definitely post that. I guess for the first one there is a link that we can post in the show notes. For the other one we will have to wait for the end of summer for it to come out.

One last thing, which you can’t be a guest on this show and not share. Beyond security chaos engineering, which is the obvious tip that we have spoken about throughout the episode, if you had one bit of advice to give a team looking to level up their security prowess, what would that be?

Aaron: I guess if I had to pick one thing, it would be that everyone on the security team must understand how to code. It is extremely valuable when somebody can go from idea to product on their own. Python was originally written for children. Python is like the theme of Blizzard’s games team, which is easy to play but hard to master. There are so many great things you can do. If you understand how to code, you now have a road to empathy, to understanding how software is built and the challenges and complexity that come with it. It also helps you understand what Snyk does with software dependencies. You start to understand how those really are a thing. This is kind of the supply chain problem.

And then you start understanding, “Oh, we don’t have the log events; they are not being written. We were blind to a lot of things.” Understanding software begins, from my perspective, with learning how to code. When you start to understand that, you start to understand how the sauce is made, and that really helps open your eyes to pathways to better security and pathways into the value chain.

So, if you are a security professional and you don’t know where you stand in terms of the value chain of the company, you really need to think hard about that, because if you’re not contributing to the value that the company delivers, how long do you think your job is going to be secure? There is a way to do that.

Guy: That is a great piece of advice, definitely for everybody in the security field. Aaron, this has been a pleasure. Thanks a lot for coming onto the show.

Aaron: Hey, thanks for having me.

Guy: And thanks everybody and I hope you join us for the next one.

[END OF INTERVIEW]

Aaron Rinehart

CTO and co-Founder at Verica

About Aaron Rinehart

Aaron Rinehart has been expanding the possibilities of chaos engineering in its application to other safety-critical portions of the IT domain, notably cybersecurity. He began pioneering the application of security in chaos engineering during his tenure as the Chief Security Architect at the largest private healthcare company in the world, UnitedHealth Group (UHG). While at UHG, Rinehart released ChaoSlingr, one of the first open source software releases focused on using chaos engineering in cybersecurity to build more resilient systems. Rinehart recently founded a chaos engineering startup called Verica with Casey Rosenthal from Netflix and is a frequent author, consultant and speaker in the space.

About The Secure Developer

In early 2016 the team at Snyk founded the Secure Developer Podcast to arm developers and AppSec teams with better ways to upgrade their security posture. Four years on, and the podcast continues to share a wealth of information. Our aim is to grow this resource into a thriving ecosystem of knowledge.

Hosted by Guy Podjarny

Guy is Snyk’s Founder and President, focusing on using open source and staying secure. Guy was previously CTO at Akamai following their acquisition of his startup, Blaze.io, and worked on the first web app firewall & security code analyzer. Guy is a frequent conference speaker & the author of O’Reilly “Securing Open Source Libraries”, “Responsive & Fast” and “High Performance Images”.
