Ep 132. Responding to a Security Incident with Rob Zuber

“ROB ZUBER: So if you think about like Docker, and Kubernetes, and the complete growth of cloud over that timeframe, we’ve changed so many things about software construction and software delivery. Many of those things have been real game changers, like it changed how we even think about updating systems. I mean, it was a world of Puppet and Chef, versus, I’m just throwing that thing out and replacing it, or pods, and autoscaling and stuff like that. One of the things that’s changed in there among many is the realization that machine-to-machine communication authentication is really important.”

[INTRODUCTION]

[0:00:46] ANNOUNCER: Hi. You’re listening to The Secure Developer. It’s part of the DevSecCon community, a platform for developers, operators, and security people to share their views and practices on DevSecOps, dev and sec collaboration, cloud security, and more. Check out devseccon.com to join the community and find other great resources.

[EPISODE]

[0:01:08] SIMON MAPLE: Hello, and welcome to another episode of The Secure Developer. My name is Simon Maple, and I’m going to be hosting this session with my friend here, Rob Zuber. Rob is the CTO at CircleCI. Rob, welcome to the session. First of all, how are you doing?

[0:01:24] ROB ZUBER: I’m doing well. Thanks for having me, Simon. I’m excited to be here.

[0:01:27] SIMON MAPLE: Excellent. Excellent. Well, in this session, we’re going to be obviously talking about a security incident that was announced back on January 4th. We’re going to be really discussing a number of things, including how you learned about the incident, what happened after that both publicly, I guess, somewhat, and some of the response that you needed to do internally. I guess we’ll look at it with a bit of hindsight as well, trying to understand what we learned, what you’d do better next time, what you’d do differently next time. Really, this is the kind of thing that’s great for us all to learn, because it’s something we can all go through. I think it’s definitely an area that isn’t as well-trodden across the majority of organizations.

Tell us a little bit about yourself, Rob. I saw that you’d been a three-times founder and a five-times CTO. I’d love to kind of hear a little bit about your journey and about how you progressed your career.

[0:02:17] ROB ZUBER: Yes. Well, progressed is always a funny word. In hindsight, a little happenstance, a little sort of what am I interested in doing now. I can’t say I really had a good plan, but it’s worked out. Honestly, I think one of the things that has worked out for me is doing a large number of different things. I started my career actually in manufacturing. I studied engineering, and did process engineering in a manufacturing facility. I was trying to do analytics on our manufacturing and thought, I’m pretty sure if I knew more about computers, I could write software that would do this, instead of me trying to do it by hand or in spreadsheets or whatever. That’s actually how I started getting into software. Then that was the late nineties, startups were a thing. I switched out of manufacturing and into software and never looked back.

I think that was – I went into a startup really early and that was the thing that really excited me, was the pursuit of the idea, trying to find product market fit, trying to understand what customers want, trying to iterate, and move quickly, and have just moved through a bunch of different domains, everything from consumer electronics, to mobile, to now, developer tooling, and in a number of different roles, from product to business development, to pure engineering roles. I think a lot of that has helped me, I would say, just build a much better picture for someone operating as an executive. I know a lot of CTOs, everyone does the job differently. Everyone has a different background, and a lot sort of do come through a very traditional engineering progression. But I think as we talk about supporting our customers through something like this, having spent a lot of time with customers in various roles, I think that really helps me think about it from that perspective.

[0:03:54] SIMON MAPLE: Yes, absolutely. I think that’s really important to not just – when these kind of things happen, like you say, it’s important to understand what to expect if you’re in the same position as a consumer. Whereas, very often, we act sometimes differently, I guess as the producer of this data and information, which we can talk about. I’m sure many people, everyone will pretty much know what CircleCI does. But to level the playing field a little bit, tell us a little bit about what you do at CircleCI and how CircleCI integrates with a typical customer environment.

[0:04:21] ROB ZUBER: Yes. CircleCI is a provider of continuous integration, continuous deployment effectively, what we now think of as change validation because that space has expanded beyond what maybe we traditionally thought of as CI. Like two pieces of code coming together and checking on a server that they work properly or whatever. All the way from the desktop and making sure the changes look good there out into production as you’ve deployed, and then are releasing, and knowing, okay, this has been exposed to customers, and it’s still doing what we expect.

I think, yes, I often talk about two parts of a good outcome: building the right thing, and then building that thing right. You know what I mean? The former of those is more of a product management question. Do we understand our customer? Are we building the thing that they want? The latter being more of the technical engineering question. Did we implement the thing we were planning to implement? Is it behaving correctly? Most of our stuff is focused on the latter of those, like making sure that the thing does what you expected it to do. But those fast feedback cycles, enabling developers to get software in front of customers as quickly as possible, also help you with the former.

If I take the smallest possible increment, put it in front of my customers, and they don’t react, they don’t use it, then I know the next increment is not particularly worthwhile, so I can adjust how I’m thinking about delivering software and what it is I’m thinking about delivering. All of that I think is critical to the way most of us think about software delivery now. A lot of it is work on systems that’s not particularly the core competence of our customers, like you at Snyk are trying to build security tools for your customers. You’re not trying to build software delivery tools. I mean, there’s a little bit of intersection we know, but our customers use CircleCI, so they can focus on building the product they’re trying to build for their customers, not be working on the tooling to make that happen.

[0:06:18] SIMON MAPLE: Yes, absolutely. I think, actually, it’s interesting, because you talked about being in manufacturing before. I think it was interesting when you made the switch across. I bet you didn’t expect supply chain issues to follow you into software back then. But it’s interesting that very often, actually, when we do talk about things like supply chain, we actually think sometimes about the artefacts that go into our application, what we deliver, versus actually what we put together in that pipeline. In fact, why don’t we start talking about the incident, actually, because I think it’s incidents like this that actually make us start realizing the importance of security across our infrastructure, across our development platforms, and pipelines.

Yes. The incident was announced on January 4th. Why don’t you tell us a little bit about the background of the incident. I’m sure many people were affected, and some people weren’t affected. But I know at Snyk, it was something that we needed to kind of like jump on fairly quickly as a customer of CircleCI. Tell us a little bit about the incident and how it came about.

[0:07:12] ROB ZUBER: Yes. I mean, I’ll say first, there’s a public report on our blog, and you can find it pretty easily. I think Google will help you. I’ll probably cover high points. But anyone that’s listening that wants more details, absolutely, check that out. We’ve done our best to communicate as much as possible about it. We’ll probably end up talking about the communication over the course of the incident. There’s kind of two halves to that timeline: prior to January 4th, what actually happened and got us to the state, and then post, how we sort of responded and communicated about it, all those things.

The super high level is, we had an employee, an engineer within CircleCI whose laptop was infected with malware. That was used to capture an SSO session. That SSO session had been password authenticated, and 2FA authentic- I don’t know what the verb is around 2FA. We use a word. But that session was created knowing that it was the employee that we expected. But then the session token was captured and used remotely, to basically take advantage of the capabilities of SSO to level up to additional access into some of our production systems.

Then within those production systems, the attacker was able to effectively exfiltrate data that was encrypted, and the encryption keys, and with knowledge of the algorithm, I mean, it’s fairly standard to say, no one tries to keep their crypto algorithms secret, they try to keep their keys secret. But with both the keys and the data, you have access to everything. The particular assets were primarily “secrets.” I mean, we’ll talk later about what our customers did and didn’t understand as we were trying to communicate to them. I think that’s a really interesting area to get into.

In order to use CircleCI, yes, you’re testing, but you’re also maybe deploying and pushing into another environment, maybe one of your production environments. So some of our customers have things like AWS key pairs stored within CircleCI, and those are part of the secrets, things that we make accessible to the build. In order for your build to use them, there’s a point at which they have to be decrypted and available to your build. That capability exists across the systems in our platform. That’s what the attacker was able to access and then exfiltrate. That’s kind of the actual activity prior to the fourth. The fourth was when we announced. It was a little bit before that that we had been effectively notified of some activity that seemed tied to something that was only stored on our system.

[0:09:50] SIMON MAPLE: Yes. I was going to ask about the detection, how you actually identified, how you worked out that that was the case?

[0:09:55] ROB ZUBER: Yes, yes. It was effectively someone coming to one of our customers saying, “There seems to be some activity related to this. Do you have any understanding of how this could have occurred?” We did some investigation initially looking at that particular asset, which was a subset of overall secrets that we have, because they’re stored in not always different places, but different attributes, whatever you want to call it.

We focused on that particular type of token first, which was GitHub OAuth tokens. That was the thing we had a record of being exposed. So started working on rotating those on behalf of customers. But, then also started an investigation to try to understand what the exposure was, pursued a bunch of different avenues. Then, on the 4th, identified broader access. That was when we had a clearer picture of the access. Fairly soon after, I mean, that’s a judgment statement, but fairly soon after notified on the 4th. That day was where we went from really understanding the scope to notifying our customers. I will say, I said, we’d come back to talking about notifying customers and that communication piece, which we probably will. But we made the decision fairly quickly to go broad. “Here’s what we know, it’s not that much. But we know this is a risk for you, our customer, therefore, please do something. We’ll come back, as we understand more.”

[0:11:14] SIMON MAPLE: There’s a tough kind of balance with that as well, when you do try. Like for example, if we identified a vulnerability, some of the larger vulnerabilities that we’ve discovered, we would try and do as much non-publicly, first of all. Because the more that goes into the public domain, the more attackers you’re likely to find. But that is very specifically the case where a vulnerability has been found, not an actual attacker yet. So there isn’t an attacker trying to get in yet. So you’re trying to work out how you can patch in advance of potential attackers identifying the problem. But of course, this is more like a zero-day, where the attackers are already there. Would you have done anything differently in your approach to the broader announcement if you had to do it again?

[0:11:56] ROB ZUBER: I don’t think so in terms of scope. I think you’re absolutely right. Like there’s this responsible disclosure, like how do we manage communicating to our customers, that they need to take action without giving more value, more information to the attacker that they can take advantage of? We did weigh that, but we weighed it on the order of minutes. You know what I mean? We didn’t gather a big council, and have a long conversation, and schedule a meeting for next week to talk more about it. It was like, we need to act, and what do we feel is the best thing.

As I said, from a perspective of scope, I don’t think I would do anything differently. The thing that we learned was, when we broadcast that, “Hey, we think you should rotate all your secrets.” We had a lot of customers who were sort of like, “What’s the secret? What secrets do we have with you?” Because people set up their CI pipelines, CD, whatever, they get them working. And then they leave them, they just work, which is a great feature. But at the same time, maybe the person who set it up doesn’t even work there anymore and someone else is managing it. They’re not sure how it’s configured or what sorts of things might have been put in place.

I think, both in terms of how we think about product long term, which I think we’ll come back to, but in terms of that communication, we probably could have been more explicit. Not because we were hiding something, but because we had assumptions about our customers using the product all day, every day and knowing exactly how everything was configured, and where they could find things. We also learned about areas in our product where it was difficult for larger organizations to identify everything that they had stored with us. We actually built a bunch of tools kind of on the fly right after as we started getting tickets of people saying, “Hey, we understand we need to do this, but we can’t find this information.” We said, “Oh, wow. That’s actually hard. Okay, let’s add this API, change this capability, expose more data so that you can find all these things.”

We built a product targeted at people delivering their software every day, not a product designed for incident response. So there was some learning in that. But I think I’m super proud of how the team jumped in. We had folks who didn’t stop to even say, “Should I build this?” They just went and built stuff, and started pushing it out into the community to help folks find what they needed and understand.

I think scope-wise, to circle back to that whole thing, given the scope of the impact, there was no other way to – maybe someone will email me and tell me a great other thing we could have done. But from my perspective, I would amplify the message that quickly again. I think we learned more about what our customers understand and how we need to communicate. I think overall, we did a pretty solid job of communication, our comms team was going 24/7 for weeks. But understanding truly how our customers think about the product and their level of understanding of some of those pieces, so that we could be more explicit or clear. And again, we also learned about capabilities of the product in terms of exposing that information.
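For readers who want to picture the inventory problem Rob describes (finding every stored secret and deciding which ones to rotate), here is a minimal sketch. It is not CircleCI tooling: the record layout, the project and secret names, and the cutoff date are all assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inventory records: (project, secret name, date the secret was created).
SECRETS = [
    ("web-app", "AWS_ACCESS_KEY_ID", date(2021, 3, 14)),
    ("web-app", "DEPLOY_TOKEN", date(2023, 1, 6)),
    ("billing", "GITHUB_OAUTH_TOKEN", date(2022, 11, 2)),
]


def secrets_to_rotate(records, cutoff):
    """Group secrets created on or before the cutoff date by project."""
    stale = defaultdict(list)
    for project, name, created in records:
        if created <= cutoff:
            stale[project].append(name)
    return dict(stale)


if __name__ == "__main__":
    # Assumed cutoff for illustration: anything that existed on the announcement date.
    for project, names in secrets_to_rotate(SECRETS, date(2023, 1, 4)).items():
        print(project, "->", ", ".join(names))
```

Anything created after the cutoff is left alone; everything else is listed per project so that whoever owns that project knows what to rotate.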

[0:14:46] SIMON MAPLE: Yes. I think actually, I think it was very much applauded by the community in terms of how fast you responded and how open you were actually providing that data. I think, yes, I mean, Snyk were – for an amount of time, we were hard at work, trying to understand how it affected us and what. It’s interesting, because it’s one of those things that when it works, it works. But unless there’s an incident, you don’t actually stand back and look at your environment as critically as perhaps we all should do. We’re not excluding anyone in that. I think everyone could always spend more time with their security hygiene at their systems and things like that. There’s never an end of time of how much you can spend in there.

Yes, it’s working out where all your keys are, identifying keys that, like you say, have been in place for so long, or maybe even haven’t even got a timeout or a lifetime. There’s a lot of, sometimes, skeletons in the closets for a lot of companies. I mean, Snyk is a seven-year-old company, I’m sure there are many that are tens of years old that have problems that are very deep rooted in their environments. So yes, from the customer point of view, how well do you think, after the messaging got out, and the people were trying to identify if they were affected, how much they were affected, how well were they able to rotate keys and secrets? How do you feel we as an industry adapted to this as an incident? Do you think we’re almost like – sometimes it takes incidents to make us better as a community and as an industry. How much better placed do you think we are now, after seeing an incident like this in software, in terms of being prepared for something similar?

[0:16:15] ROB ZUBER: Yes. I think it’s probably in a couple phases, let’s say. I think you’re absolutely right. People had to go learn some things about their own systems. I would say, from the people that I spoke to directly, unsurprisingly, I had a chance to meet with many of our customers and many of their sort of heads of security and things like that. We both got really valuable feedback on how we communicate, and what we can help them with. We spent a lot of time talking about our product, which I’ll say is the second phase. Like how we can all integrate better in ways that are safer and more manageable.

But in the first part of that, I think that my takeaway from the majority of those conversations was, everyone was keen to learn. Okay, this is interesting. We saw this thing, we’re going to go fix that on our side, or here’s what we need from you to help us be better at this, sort of both sides of that. But the conversations were always curious, focused on learning.

To your point, I think security is a really interesting area. Because, I mean, we know about vulnerabilities because people happen to find them. That’s like, sure, we keep our vulnerabilities, our systems up to date on CVEs and stuff like that. But it’s not particularly data-driven. It’s sort of like, no event, big event, no event, big event. It’s hard to write an SLO, for example, about your security, like days since breach is not exciting. You know what I mean? I think there’s a real thirst for understanding, okay, this is a real live example, like all the textbooks are interesting, but this is a really good concrete example of our ability to respond at CircleCI, and our customers’ ability to respond, and how things went there.

Again, I think all the conversations were at least curious, and thoughtful, and “Okay, what can we change? What can you do differently to help us? How do we use your product differently? How was our response internally from our customers, et cetera?” So yes, I think there is a real desire to learn. Also, I would say, an opportunity for all of us to say, “Okay, here’s a concrete example, how do we use that to drive priorities?” I think that one of the challenges of something – again, without being sort of a linear progression of an SLO, or something like that, it’s difficult to prioritize that work against other work sometimes. For many of our customers, I think everyone suffers from that. So having concrete examples, and using that to really inform how we think about our systems, I think it’s an important opportunity.

[0:18:44] SIMON MAPLE: Yes. It’s interesting, actually, when we talk about the kind of like the community aspect as well, and how we can learn from the community, and how we can also share with the community. Obviously, like I said, again, the work that you did, the transparency that you showed in and around the incident was very much appreciated and accepted by the community really, really well. Do you feel like, given that these events happen, and then kind of like go away for another six months until another incident happens, it is extremely hard to kind of like make sure that you’re spending the right amount of time to avoid the next incident, or at least be prepared for the next incident? Do you feel like from the community support and from the knowledge that exists out there in the security community, as well as the engineering communities, did you feel like you as an organization knew the steps to take in like a security incident? Were you helped along the way through advice of others who have been through similar, or best practices in and around that?

I guess my question would be, if someone out there who is about to have their next breach or their first breach in a couple of weeks, what most helped you during this process?

[0:19:44] ROB ZUBER: I would say, during the process, like the very initial response, we didn’t have a lot of outside help. I was going to say we were internally oriented, but we tried to be as customer-oriented as possible. I would say from the fourth to a week after, two weeks after, was almost entirely driven by our values. What do we care about? We care about the customer. We believe in transparency. Obviously, this is a difficult situation, but how do we do the best that we can for our customers? That was always the centre of the conversation. That’s less sort of technical details of how did we respond, but we were equipped to say, “Okay, that’s a problem. That’s a problem. Let’s go.” We did actually contact some of our partners, and providers, and things like that to say, “Is there something we can do here to mitigate this risk? Is there something we can do here to mitigate this?”

For all the people in the world that are in software, it’s a very small community. It’s not hard to get on the phone with someone who can help you from one of your partners, et cetera. We talked to our SSO provider immediately, just to understand how does this happen, what do we do differently. We talked to AWS as an example, we said, “Hey, we have all these keys that have been exposed.” They contacted the owners, like they went through and said, “These ones are still active. Here are the owners. We’ll contact them.” There’s additional communication to those people. We did take advantage of some of those things, I guess. Maybe it’s not purely driven by us. But the incident response, I mean, we respond to technical incidents. So we have a playbook, we have like a model of thinking about prioritization, which is always communicate with the customer, help them understand how they can respond and mitigate the issue for the customer as quickly as possible, then clean up.

That might get applied slightly differently in a security incident, but the sort of framing is there and is tested inside of our organization. So our comms team is used to that, our executive team is used to that. I mean, I think, between initial discovery and everybody on the executive team that needed to know being on a call to make the decision about how we were going to proceed was measured in minutes, right? Everyone just dropped everything they were doing. This is the most important thing. We have that capacity within the organization, that recognition, “Here’s the customer-affecting thing. It’s more important than whatever you’re doing right now, please get over here, and we’re going to go deal with it.” To then feed that back into the organization, say, “This is our stance, go.” I feel really lucky that we were well prepared from that perspective. Then, again, had the access to make calls.

We had partners on the phone on Saturday afternoons and stuff like that to say, “Hey, can you help us with this? Can you help us with this?” Which was more about, again, communicating out to customers. We were doing email blasts, contacted our mail provider and said, “We’re about to send hundreds of thousands of emails.” They were like, “Go for it. We’ll make sure they go through,” sort of thing, versus, “Oh, no. We need you to slow that down,” whatever. I think there’s enough recognition in the industry, which is fantastic, to say, “Look, this is what we’re dealing with.” Honestly, most of our partners in here, many of our partners are our customers, but they also just watch the news or whatever.

When we contacted them, they said, “Yes, we’ll drop what we’re doing, and we’ll get this in the queue, and we’ll make sure that it happens,” sort of thing. Then later, as we look at, “Okay. Now, how do we build better products, or customers, have great use of these” – I keep alluding to how you would use a product. I’ll talk a little bit about it. But I mean, in this case, it was primarily keys, tokens, that sort of thing. The company has been around since 2011. There weren’t other great approaches to managing access tokens, for example, in 2012, when the product launched. Now, there are. We support OIDC, for example, which allows you to do a server-based exchange, and get a 15-minute expiring token, and if a 15-minute expiring token gets stolen, like, it’s not the biggest issue. It’d be one token for one customer versus encryption keys in a database sort of thing. So, we have the opportunity to guide our customers towards that, and sort of orient the product more around, “Oh, if that’s what you’re trying to do, don’t do it this way anymore. Do it this way.”

That has also been our own thinking, but talking to other people, both our customers and other members of the community – people that those folks are trying to integrate with – to say, “How do we make this smooth and easy so that this is the default, this is the way that people build? And the value of this type of breach is massively diminished, because these things just don’t have the sort of superpowers that they have today.”
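To make the point about 15-minute tokens concrete, here is a minimal sketch of a short-lived signed token with an expiry check. It is not OIDC and not CircleCI’s implementation (real OIDC tokens are signed JWTs validated by the party you are authenticating to); the signing key, token format, and function names are assumptions for illustration only.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"demo-only-key"  # hypothetical key, for illustration only
TOKEN_LIFETIME_SECONDS = 15 * 60  # the 15-minute lifetime mentioned above


def _sign(payload):
    """Return a hex HMAC-SHA256 signature over the payload."""
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(subject, now=None):
    """Return 'subject.expiry.signature', valid for TOKEN_LIFETIME_SECONDS."""
    now = time.time() if now is None else now
    payload = f"{subject}.{int(now) + TOKEN_LIFETIME_SECONDS}"
    return f"{payload}.{_sign(payload)}"


def is_valid(token, now=None):
    """Accept the token only if the signature matches and it has not expired."""
    now = time.time() if now is None else now
    try:
        subject, expires_at, signature = token.rsplit(".", 2)
        expiry = int(expires_at)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{subject}.{expires_at}")
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expiry
```

A token stolen from a build is useless once `now` passes its expiry, which is the contrast being drawn with long-lived keys sitting in a database.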

[0:24:19] SIMON MAPLE: Yes. It’s always kind of like those legacy deployments or the long tail of trying to get to the deployments that were there for so much longer, or similar with vulnerability testing. Whereas it’s easy and quick to be able to get a fix out, actually getting everyone to adopt that fix afterwards can take a lot of time. One thing that kind of struck me as we’re talking about how the team rallied around the incident is blame culture or no-blame culture. I know in and around an incident, it’s extremely important to have a no-blame culture. When it’s a security incident that’s public as well, it makes it even more important. I know at CircleCI, it’s a big thing to make sure you do have that no-blame culture.

To be honest, it’s a massive thing from an effectiveness point of view as well at the time an incident happens. How did that play out in terms of – how did you manage to get the no-blame culture on such an incident to make it effective for you?

[0:25:11] ROB ZUBER: Yes. I think, I feel like this is always my advice on culture. It’s like, you can’t build it in a day. That, again, sort of citing other incidents that we have worked on, technical incidents, reliability, those sorts of things over the years. It’s been important to us for a long time, and something that we’ve continued to invest in, building that culture, thinking about how we respond. It takes active investment, I would say. As you bring new people in, as the stakes get higher, all those sorts of things. But I think we were working off existing experience, and I’ve also spoken about this fairly publicly. But I’ve been fairly involved in reliability over the last year or so myself, just turning my attention to that, participating in incidents and that sort of thing.

I think that can go either way. I guess folks can ask other folks that work here, because my opinion is going to be biased, but it can go either way, right? If someone with an executive title is constantly showing up, I think at first, it was a little alarming or intimidating to folks. But you have to show up in the right way and be curious in helping, and offering, like asking good questions. Not the, like, “When am I going to get a status update?” but “Has anyone looked at this?”

One interesting thing about me as an executive is I’ve been at the company for eight-plus years. And so I have a lot of understanding of legacy systems that not necessarily many people will have. I try to make sure my presence is aligned with that. But as a result, having spent some time dealing with that with folks, and trying to help improve that process, I have built relationships with many folks in the organization who respond to these sorts of incidents, who otherwise maybe wouldn’t know who I was, right? I mean, that’s my particular role in that.

But modelling blamelessness is a big step in getting other people to act in a blameless way. I think we often talk about it hierarchically, but it’s just as likely for one engineer to say to another, “That’s dumb and you’re dumb.” That’s not a good situation to be in, but it can happen, and it can be equally intimidating as a CTO or someone showing up. I think we’re working off that foundation. What I think was interesting and a little can be nerve-wracking, whatever in this situation is now, you’re publicly stating, “Hey, this is what happened. Here are all the details.” It’s surely possible that someone’s going to look at that and say, “Well, you made this mistake, and this mistake, and this mistake” sort of thing.

But to the point you made earlier, I think that was well received overall, because so many people in our industry are thinking about blameless culture and learning, and focusing on learning. In this particular case, we experienced something that others have experienced, others are going to experience, and everyone looked at it and said, “Oh, there’s a lot of detail here. How can I learn from this?” That is part of, again, speaking of going and talking to customers. We had a lot of discussion about specifically CI, how do we use the product better, where are you going from a product perspective that’s going to help us.

But then, everything back to the malware. How do we train our employees? How do we protect ourselves from a similar situation? Because clearly, this is not one and done, right? I mean, look at the news over the last 60 days. It’s not just us. So everyone is thinking about how do I learn from this, not just about using CircleCI, but about how I protect my organization, how I think about my systems from endpoints to production, to how we manage internal knowledge around tokens, whatever it might be. I think that primarily, the industry received that again, in the way that we hoped, which was like, “We wish that didn’t happen, but it happened. Let’s learn from it and let’s all be better as a result.”

[0:28:58] SIMON MAPLE: Yes. No, absolutely. It’s interesting, but great to kind of dig a little deeper into that. Because I think when we talk about supply chain and components that we need to secure. And of course, production environments, and making sure what you push to production is extremely important. People have a right to put the levels of effort they do into their process, and their workflows to proactively do the best they can at preventing those types of issues. But it does feel like a lot of the time with other aspects of pipeline security components, security things that don’t go to production. It is easy to not spend as much time on these types of areas of our environment. What kind of learnings did you have in that maybe internally at CircleCI or learning from others? What advice would you give to people in this area?

[0:29:44] ROB ZUBER: Yes. I mean, I think the supply chain point is really critical, and you’re absolutely right. I mean, I started my career thinking a lot about supply chains. There were chips coming from this place, or this place, and this truck is delayed and whatever, which we don’t deal with. But we absolutely are building much more complex software from a bill of materials perspective than we were when I started in the late nineties. Like we maybe took two libraries, and then wrote everything else ourselves. Understanding that each of those components, and we refer to them as components, metaphorically or literally, is its own surface area. Whether it’s from a reliability perspective, and we’ve seen that where people have just decided they’re sick of building a piece of software. So they turn it into something that will take down your system. Like, I get it. I get that people are frustrated, but they can be equally frustrated, and this has happened. Turn over the keys. I don’t want to maintain this anymore.

So they turn over the keys to the first person who asks, and that first person who asked has malicious intent, right? Understanding how your system is composed is really valuable in that regard. I think, yes, we can try to build systems to prevent all these things. The number one piece of advice is like make it simpler, always make it simpler. Which is not, write your stuff yourself. I mean, we talked about crypto or like, please don’t write your own crypto algorithms. That’s not going to end well. But use one version of something across all your systems instead of a different one in each of your systems, sort of thing. Because that multiplication, the combinatorics of that are going to be bad for you in the end. I mean, just from a vulnerability scanning perspective, right?

If I have seven different libraries that do approximately the same thing, now I’m exposed to seven different – just managing the vulnerabilities, let alone actually being exposed to any of them, is seven times more work. I want to be putting my work into building the product for my customers, doing what they want. Not managing all the underlying libraries and tools. I think simplify wherever you can, get your system to a place where you can reason about it easily. Then at least, when something goes wrong, you know exactly what to do and how to fix it.
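
[ILLUSTRATIVE SKETCH]

The “one version of something across all your systems” advice can be checked mechanically. The following is a minimal sketch, assuming you already have a per-service map of pinned dependencies; the `SERVICES` inventory, the service names, the version strings, and the `find_version_drift` helper are all hypothetical and were not described in the episode.

```python
from collections import defaultdict

# Hypothetical inventory: service name -> {package: pinned version}.
# In practice this would be assembled from lockfiles or an SBOM.
SERVICES = {
    "billing": {"requests": "2.31.0", "pyyaml": "6.0.1"},
    "api": {"requests": "2.28.2", "pyyaml": "6.0.1"},
    "worker": {"requests": "2.31.0"},
}


def find_version_drift(services):
    """Return {package: {version: [services]}} for every package
    that is pinned at more than one version across the services."""
    seen = defaultdict(lambda: defaultdict(list))
    for service, deps in services.items():
        for package, version in deps.items():
            seen[package][version].append(service)
    return {
        package: {version: sorted(users) for version, users in versions.items()}
        for package, versions in seen.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    for package, versions in sorted(find_version_drift(SERVICES).items()):
        print(package, versions)
```

A report like this only surfaces the same package at different versions. Spotting several different libraries that do approximately the same thing still takes human review.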

Obviously, I’m a big fan of having an automated pipeline with automated vulnerability scanning. These are great things, have those in place. As soon as something comes up, Log4j comes up a lot recently as a dependency that impacted a lot of people. You mentioned earlier the long tail of updating. If you have all this tooling in place, then you could say, “Cool, let’s update our dependency map and just push everything to production, where it was all validated along the way.” Problem solved. I was exposed for minutes, not for months. I think having the fundamentals in place and keeping your system as simple as possible are great starting points. Because then, you can at least reason about these problems, versus sort of throwing your hands up and saying, “I don’t know. We have 7,000 services in 40 different languages, like I just don’t even know how to think about this anymore.” And then you don’t know how to think about your exposure.

[0:32:46] SIMON MAPLE: Yes, that’s great advice. If we take it one step higher, maybe towards the business side. Obviously, we talked about some of the technical aspects. How did it play out from the business side of things in terms of whether it was impact or whether it was actions that folks needed to take? Were there many waves on that side?

[0:33:02] ROB ZUBER: Yes. I mean, I don’t think there’s anything particularly surprising, like our customer response initially was primarily focused on how am I affected and what do I do? Which is, people prioritize appropriately. But yes, I alluded to talking to many of our customers. That’s important to us to make sure that they understand, to make sure if they have questions based on what we did publish. That feedback was really useful. Like, “You said this, but this doesn’t make sense” or “I don’t understand if I need to do this or that.” Those sorts of things helped us understand our communications, also helped us understand our product. How are you using our product, and can we help you use it in a better way that’s going to reduce the risk for you over time?

I think all of those conversations with our customers happened at different levels. Sometimes it was myself or another executive, sometimes it was through our support channel. I mean, depends on the customer and what they were looking for. But I would say, primarily focused around everyone’s learning right there. Okay. In some cases, the security head who says, “My team uses this, but I don’t necessarily understand how they use this. Now, I understand that I need to understand, if that makes sense. Now, it’s been made apparent to me that this is important. What do I need to know?”

A lot of it was investing a lot more time with our customers in those conversations. I would say, much of that, because our customers are very technical, is folks like myself or head of engineering, technical leaders, talking to technical leaders, versus more of an account management conversation or whatever. But obviously, everyone from CircleCI who’s customer-facing was involved, right? Whether it was direct communication, ensuring that someone from the customer knew and was acting. We did everything in our power to broadcast information and still came up feeling like folks were coming in and saying, “How come I didn’t know about this?”

That was a really interesting takeaway, is just – I feel like we talked about this a lot when communicating to employees. Make sure you’ve said it over, and over, and over. We felt like we were just pushing through every channel we had, and still, we were getting feedback, like, “You need to be louder. You need to talk more about this,” sort of thing. I think, once you’ve decided that this is the thing you’re going to make really loud, like you can’t be loud enough, is sort of a takeaway from there.

Yes. Overall, I would say, a lot of talking to customers, a lot of helping them understand how to navigate the situation. I will say, this happened on January 4th, which is like, most people came back to work on January 3rd. We already had a little bit of an inkling that something was going on, like we were investigating, but we didn’t know the scope until the 4th. But you often come into the new year with big plans, and like what this year is going to be like. We spent the first six weeks probably of the year working through the issue and also talking to customers about it. But I would say, I wouldn’t really trade that for anything. I mean, I wish it was under different circumstances.

[0:36:01] SIMON MAPLE: It’s a good forcing function, I suppose.

[0:36:02] ROB ZUBER: Yes, absolutely. We had an opportunity to meet tons of customers, have really good conversations, understand better where we could be going from a product perspective. Then absolutely, like we’ve done, prioritized work, and changed what we’re delivering from a product perspective, in some cases. Because it became abundantly clear that with this one change, or this one change, people could make really impactful changes to their own use of CircleCI. It was very easy to just put that at the top of the queue and get it done.

[0:36:31] SIMON MAPLE: Yes, I was going to ask, actually. Because you did mention that you were surprised at the way people were using CircleCI at times. I guess, your biggest surprises in – or what did you learn most, and what’s most interesting to you about the way people today were using CircleCI. I’d love to kind of like – if you could go into any detail in terms of some of the – not direction changes, necessarily, but some of the bigger things that you would be looking at going into this year, because of the incident that happened, to make people more secure, make people think about how to use CircleCI better.

[0:36:59] ROB ZUBER: Yes. I don’t know if it’s surprising, per se, but the world has shifted, right? It’s interesting for us, because we’ve done the same thing for a long time, which is trying to give you confidence in the change that you’re making, in your ability to put that out in front of customers. That started out as CI in 2011, 2012, but containers and Docker didn’t really exist when we started CircleCI, or I wasn’t there when it started. But when CircleCI started, we used LXC, because Docker wasn’t even a thing.

If you think about like Docker, and Kubernetes, and the complete growth of cloud over that timeframe, we’ve changed so many things about software construction and software delivery. Many of those things have been real game changers, like it changed how we even think about updating systems. I mean, it was a world of Puppet and Chef, versus, I’m just throwing that thing out and replacing it, or pods, and autoscaling, and stuff like that.

One of the things that’s changed in there among many is the realization that machine-to-machine communication authentication is really important. So OIDC as a tool, as one of the possible tools for that, is something that’s come onto the scene. We implemented it. We implemented support for OIDC. But sort of put it out there as an option instead of saying, “Hey, dear customer, I see that you currently have AWS tokens stored with us. You’d probably be in a better spot if you switched to using this tool,” right? I would say, overall, as we look at these tools, and actually, I was going to say, no offence to someone in security.

But honestly, I think what Snyk has done really well is take this bag of sharp knives that is security tooling around vulnerabilities and make it really easy. It’s actually easier to just take this one thing and all my vulnerability checks and everything else get done really simply. As a developer, I don’t need to become an expert. In many other spaces, it’s like, just go read this 300-page spec on how to do machine-to-machine communication successfully, and then go implement it in Bash, or whatever. People are like, but I’ll just put it in this, I’ll just make it an environment variable, right?

We have an opportunity to really simplify that stuff, and we don’t have to be the OIDC provider. But we can make it really easy to use it for the things that our customers are trying to do. So again, I wouldn’t say it’s a surprise, but a realization that just making it available doesn’t have the whole community saying, “You know what, that would be better. Let’s all stop what we’re doing and switch.” We have the opportunity now to say, “Click this one button.” I don’t know if it’s going to be that easy because you have to do some stuff on your side, but whatever. Like, here’s the two easy steps, do this thing, and now you’re in a much better place. That, not just making it a little more configurable, like there’s some basic stuff that we’ve put out there already, but to make it so that people can check for the things they want to check for in terms of the claim. But making it so it’s obvious, and the default, and how you think about doing integrations with external systems. I think that’s a big area.
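
[ILLUSTRATIVE SKETCH]

Checking “the things they want to check for in terms of the claim” might look roughly like the following. This is a minimal sketch, not CircleCI’s implementation: it assumes the token’s signature has already been verified by a real JWT library and that the decoded payload is available as a dictionary. The `check_claims` helper, the issuer URL, the audience, and the subject values are hypothetical; only the `iss`, `aud`, `exp`, and `sub` claim names come from the JWT and OIDC standards.

```python
import time


def check_claims(claims, *, issuer, audience, allowed_subjects, now=None):
    """Return a list of problems found in an already-verified token
    payload. An empty list means every check passed."""
    now = time.time() if now is None else now
    problems = []

    if claims.get("iss") != issuer:
        problems.append("unexpected issuer")

    # "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        problems.append("unexpected audience")

    # Short-lived tokens: reject anything expired or without an expiry.
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        problems.append("token expired or missing exp")

    if claims.get("sub") not in allowed_subjects:
        problems.append("subject not allowed")

    return problems


if __name__ == "__main__":
    # Hypothetical payload, as it might look after signature verification.
    payload = {
        "iss": "https://oidc.example.com/org/1234",
        "aud": "deploy-role",
        "sub": "project/payments",
        "exp": time.time() + 300,
    }
    print(check_claims(
        payload,
        issuer="https://oidc.example.com/org/1234",
        audience="deploy-role",
        allowed_subjects={"project/payments"},
    ))
```

The appeal over a stored long-lived secret is that a token like this expires within minutes and is only accepted for the audience and subject the receiving system expects.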

Then there are other things that we have been working on, such as OAuth, which came up in this whole thing, versus those types of tokens that basically have long expiry and large scope. So changing those sorts of models, which we’ve been investing in for a while, but it’s being accelerated, I would say.

[0:40:32] SIMON MAPLE: Yes. You had some really interesting points there, actually. It kind of goes back to what you were saying earlier about the levels of communication and communicating loudly and more. I think a lot of the time even, with feature releases, and things like that. Okay. When you release a feature, how many people actually hear about that? It’s constantly pushing that button to say, “Don’t forget about this.” I think, almost like the auto-remediation or auto fix in terms of being able to say, “Oh, by the way, you do have a problem with this configuration, or this would be a better configuration,” for whatever reason, is really valuable for people to hear and educate as well as to why they’re exposed or why it’s not linked to have this there.

[0:41:07] ROB ZUBER: Well, I think there’s an interesting intersection there, to your point of auto-remediation. Which is, we see that you’re doing a thing. We’d love for you to do this other thing. But if we did it on your behalf, we would need a lot more scope than you want us to have. So building that right relationship that says, “Click here, we’ll give you exact instructions. If you click this button, you’ll grant us access for five minutes to make this update or whatever.” Then we’ll take it away, whatever those models might be. I think it has to be that easy. But we’re talking about making it really easy to do a thing.

Again, that would be maybe impacting your primary repository, and you don’t want us to have that access on a long-term basis. So we want to make it as easy as possible, but also give you that confidence that we don’t have the access to go and just do something on your behalf, sort of thing. That’s always been an interesting balance, just in terms of what we do. But if it’s not that easy, then you – maybe I’ll work on that next week, and then next week becomes next month, and next month becomes next year, and you don’t make the changes.

[0:42:05] SIMON MAPLE: Yes, absolutely. Rob, it’s been amazing. This episode has gone very fast. I’d love to kind of like ask you the question, which we always ask our guests, which is, we’re at the start of 2023 now, and casting our eyes forward to the rest of the year. Get a crystal ball, what would your predictions be for 2023? What should people be focusing on or paying attention to for the remainder of this year?

[0:42:26] ROB ZUBER: Yes. Topically, I don’t want this to be my prediction. But watching how we came into this year, like just watching the news far beyond CircleCI. Obviously, paying a lot of attention to security news at the moment. It does feel like there’s a steady increase, or maybe even an acute increase. I don’t know if it’s in reporting or actual activity, but it does feel like this is going to be a year where as a community, we either get more serious about this, about security, and about these types of threats, or we suffer the consequences. One of the things that I would love to see, and I don’t know that I can make this prediction, but I’m signing up to be part of it, is for us to talk about these things more openly. To your point of the report that we put out, it felt like people were surprised by the amount of detail. That’s disappointing in the sense that, if we’re not sharing that detail and learning from each other with this level of activity that seems to be going on, then I think we’re putting ourselves at a disadvantage.

I guess my prediction is continued increase, and my desired prediction is that we take that opportunity to talk more openly to share how we can all be better and learn from each other. As I said, there’s not linear data within your own organization, hopefully. So we can fill in the data by getting data from other organizations. Everybody that experiences something, if we go out and say, “Oh, this is what happened. This is what we understand about the motivation, the IOCs, whatever it is,” and then share that across, we’ll start to build a data set that will actually help us all be better. It doesn’t feel at the moment like that conversation is as open as I would want it to be.

[0:44:14] SIMON MAPLE: Yes. Hopefully, people seeing, and learning and seeing the leadership that you had during the incident. Hopefully, we see more people being as upfront, as transparent, and as fast with information as you were during the early parts of this year. So yes, big thank you for joining us on the podcast episode. Again, thank you for the way you handled the incident, the speed at which you handled it, and providing that level of depth of data. Yes, long may it continue, and hopefully, we see more folks growing in this community with that same level, same intention.

[0:44:46] ROB ZUBER: Yes. Thank you so much for having me. It’s been amazing. I will say, I would love to see that – I mean, on the one hand, I don’t wish this on anyone. On the other hand, if you’re in this situation, honestly, call me. I’ll try to help you. I’m so motivated to see this change that I’ll put that out there. Like, I’d be happy to help anyone else that finds themselves in this situation, because it’s not fun. But the best we can do is learn from it and try to all be better.

[0:45:09] SIMON MAPLE: Amazing. Rob, actually, you could have mentioned at the start, but we didn’t get around to it. But you run a podcast yourself, right? How can some of our listeners hear more from you on your podcast?

[0:45:18] ROB ZUBER: Yes, thanks. I also forgot, so it’s called The Confident Commit. We are just heading into our third season, and we’ve sort of made it a little bit more thematic, and our second season was primarily about learning from failure. So you could see a lot of motivation, sort of taking advantage of my own lessons there. Not really my lessons, but the thing that I love about it, which hopefully you’re having the same experience of, is I get to talk to so many awesome leaders from the industry about how they perceive these things. So The Confident Commit, you can find it wherever you find your podcasts. This season, which we just kicked off, we’re talking about teams, great teams, how they function, how they work. I’m super excited about that. We have a few queued up and we’re still actively recording that one.

[0:46:02] SIMON MAPLE: Amazing. Great show. Feel free to sign up for that, definitely. Thanks again, Rob and thanks to everyone for listening. We look forward to chatting again on a future episode. Thanks all.

[END OF EPISODE]

[0:46:16] ANNOUNCER: Thanks for listening to The Secure Developer. That’s all we have time for today. To find additional episodes and full transcriptions, visit thesecuredeveloper.com. If you’d like to be a guest on the show or get involved in the community, find us on Twitter at @DevSecCon. Don’t forget to leave us a review on iTunes if you enjoyed today’s episode. Bye for now.

[END]

About The Secure Developer

Hosted by Guy Podjarny