In episode 56 of The Secure Developer, Guy Podjarny talks to Seth Vargo. Seth is the author of ‘Learning Chef’ and is passionate about reducing inequality in technology. He is currently a Developer Advocate at Google. This discussion is centred around the talk Seth gave at DevSecCon Seattle – “Secrets In Serverless”.
Seth Vargo is a Developer Advocate at Google. Previously he worked at HashiCorp, Chef Software, CustomInk, and a few Pittsburgh-based startups. He is the author of Learning Chef and is passionate about reducing inequality in technology. When he is not writing, working on open source, teaching, or speaking at conferences, Seth enjoys spending time with his friends and advising non-profits.
On today’s episode, Guy Podjarny, President and cofounder of Snyk, talks to Seth Vargo at DevSecCon Seattle. Seth previously worked at HashiCorp, Chef Software, CustomInk, and a few Pittsburgh-based startups. He is the author of Learning Chef and is passionate about reducing inequality in technology. He is now a developer advocate at Google and is passionate about the human element of security. This discussion is centered around the talk Seth gave at DevSecCon Seattle titled, Secrets in Serverless. In this episode, we flesh out one of the core principles of this talk which is that security is not binary. Here we explore the often-unseen side of security and how developers can prevent or limit attacks by assuming from the get-go that their secrets will be leaked! If you’re looking for practical, in-depth advice, as well as a leading, expert strategy that will shift your view on managing secrets in serverless – then this is the episode for you!
Guy Podjarny: Hello everyone. Welcome back to The Secure Developer. Today, we have another DevSecCon edition – version of it – and we have with us Seth Vargo from Google. Thanks for joining us, Seth.
Seth Vargo: Cool. Thanks, Guy. Thanks for having me on the show. I’m really excited to be here and talk a little bit about secrets.
Guy: Secrets. Talking about secrets is always best. Before we talk about secrets, first secret, what is it that you do?
Seth: Well, if I tell you, I have to kill you. No. So, I work for Google. I work on the developer relations team at Google. My day is spent in a triad of things. I’m out with communities. That’s why I’m here at DevSecCon Seattle, meeting with users, figuring out what security challenges they’re having. Then I’m also coming and kind of taking that feedback, synthesizing it. I like to say I translate customer into engineer. But I take that back and work with our product and engineering teams to make sure that our products are meeting the needs of what the communities are telling me they want out of security products, or just products in general.
Guy: Yeah. I guess that’s sort of the relation part of that sort of advocacy. It goes both ways.
Seth: Yeah. I’m like Google Translate. That’s a service, right?
Guy: That’s a good way to sort of think of it. In general, when you talk about advocacy, like what space of developer relations do you deal with?
Seth: Sure. When I first joined Google, I did a lot of stuff in the DevOps and site reliability engineering space, but security has kind of always been a passion of mine. More recently, especially with my background working at HashiCorp, the company that makes Vault, I have transitioned into this role, where I do a lot of focus on security but I’m more focused on how we can make security everyone’s responsibility and less so like what kind of vulnerabilities exist in like this release of this thing, right? We have advocates who are focused on that. I’m more focused on like the human element of security.
Guy: Yup. Kind of helping people be secure as they evolve. I guess that’s a good segue to your talk. What is your talk going to be about at this event?
Seth: Yeah. There’s obviously this new thing. I don’t know if you’ve heard of it. It’s where you use other people’s servers and you pretend they don’t exist?
Guy: Called clouds, no?
Seth: No, that’s called server. I guess that’s true. It’s like serverless, right? It’s this idea of there’s computers out there somewhere, but here’s my code. Either it’s in a container or it’s in some packaging format. “Please run it for me and scale it infinitely.” Air quotes for those listening at home, right? That movement was started by developers and there wasn’t a lot of focus on security, right? It was all about, I’m going to bypass my operations team. That’s why we’re doing serverless. We don’t have to worry about acquiring compute, networking, and storage, and I’m going to bypass my security team because, well, it’s just serverless, like someone’s doing everything for me.
We see – Well, I have my serverless application. Maybe it’s something that is doing social media monitoring and it needs a Twitter API key and a Facebook API key in order to get metrics or data from those third-party systems. Or maybe it’s sending a text message, so we need like a Twilio API key. How do we get those secrets, or those credentials, into the serverless applications? How do we do TLS? And really shifting the responsibility left? Because developers are like, “Well, we don’t want security in our cycle.” Now, they’re deploying insecure applications. So, it’s fine if you don’t want security in your release cycle or your CI/CD process, but you still need to care about security. We have to shift that left as much as possible.
Guy: Yeah. Basically, you can try and remove the security team from your process, but the security bit, you know, that kind of has to stick around or you might be in bad shape.
Seth: Yeah. My talk is really a journey. It’s more of a fake story, but a story of how we took this application, we deployed it, we put all our secrets in environment variables, and then we got pwned. We showed up on the front page of Hacker News, and everyone was like, “Well, how on earth did we get hacked?” First, it starts out that we didn’t run our application in production mode, and therefore whenever someone caused our application to crash, the entire environment was dumped because that’s what our web framework happens to do.
This is really common, right? Django, Rails. If you’re not running in production mode, whenever that application crashes, it tries to be helpful and it prints out the entire environment. All our secrets were in there. So, we patch that, and we deploy in production mode. We successfully get a generic 500 error. But then it turns out that there’s a malicious dependency in our software supply chain, actually. Even though we’ve mitigated this one particular part of the attack, it turns out that on boot there is this package that otherwise is helpful in our dependency chain that is just running the env process and then posting the output to some random endpoint. An attacker is basically getting a full dump of our environment. As a result, we haven’t really increased our security much, and this is actually very common, especially in the Node.js community. There’s been a lot of attacks on like bitcoin wallets, etc. where there’s an otherwise useful package that has a little bit of code in there that every so often does something very nefarious.
Guy: Yup. It’s been, unfortunately, kind of growing substantially. Well, either the symptom of malicious libraries has been growing, or our ability to detect them has been growing. Either way, there’s definitely kind of more incidence of this.
Seth: Yes. We talk a little bit more about like automated vulnerability scanning and could we detect this. The unfortunate part is like, “Yes, we may be able to detect this, but someone has to report it first,” right? This is one of the challenges with a lot of the vulnerability scanning. It’s like, they’re really great for detecting things that have already happened.
We move into the next logical step, which is like, “Okay. Well, if we’re going to use environment variables, we can’t store our secrets in plaintext in those environment variables, so we have to look to some type of encryption. What if we encrypt them using like a key management service or some third-party encryption software? Then at boot, our application decrypts those environment variables and only stores the plaintext values in memory, right?” If an attacker runs the env command or dumps out the environment, they just see encrypted bits of data. But if an attacker were to, say, trigger a core dump, or if they were doing a targeted attack where they knew that you were using a key management service, they would still be able to get access to these secrets.
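The decrypt-at-boot pattern Seth describes here could be sketched roughly as follows. This is only an illustration under stated assumptions, not a real KMS integration: the `ENC_` prefix convention, the `load_secrets` helper, and the `decrypt` callable are all hypothetical, and the base64 "decryption" below is merely a stand-in for a call to a key management service.

```python
import base64
from typing import Callable, Dict, Mapping

# Hypothetical convention: encrypted values live in variables prefixed "ENC_".
ENCRYPTED_PREFIX = "ENC_"


def load_secrets(environ: Mapping[str, str],
                 decrypt: Callable[[str], str]) -> Dict[str, str]:
    """Decrypt ENC_* variables once at boot; plaintext exists only in the returned dict."""
    plaintext = {}
    for name, ciphertext in environ.items():
        if name.startswith(ENCRYPTED_PREFIX):
            plaintext[name[len(ENCRYPTED_PREFIX):]] = decrypt(ciphertext)
    return plaintext


def fake_kms_decrypt(ciphertext: str) -> str:
    """Stand-in for a KMS decrypt call. Base64 is NOT encryption."""
    return base64.b64decode(ciphertext).decode("utf-8")


# Simulated process environment: dumping it shows only the encoded blob.
env = {
    "ENC_TWITTER_API_KEY": base64.b64encode(b"example-key").decode("ascii"),
    "PORT": "8080",
}
app_secrets = load_secrets(env, fake_kms_decrypt)
```

As Seth notes, this raises the bar rather than closing the hole: the plaintext still sits in process memory, so a core dump or a targeted attack that calls the same decrypt path can still recover it.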
Guy: Yup. But it still raises the bar, like it still makes it harder, or potentially kind of moves you off of the scripted, like the generic ones, and veer more into more sophisticated and more targeted version of attack to get that data.
Seth: Right. This brings up a really interesting point that I’m trying to get people to think about, which is that security is not binary, right? People, especially developers, often think of security as like a light switch, which is like, “Oh, I’m secure now,” or, “I’m insecure.” But it’s really a spectrum and it’s constantly evolving, right? The type of security that an application needs, varies. A payment processing system needs significantly more security than, say, like a static html website that’s running on an Apache Server and has no access to anything else. This is where you have to kind of make a threat model and assess like, “How much effort do we want to put into this to actually consider ourselves secure, and what is our threat model, and are we socializing that with our internal teams as well? If we have stakeholders, or teams that are depending on our service, what is our threat model? What security guarantees do we make? Specifically, which ones do we not make?”
Guy: Yeah, for sure. But you left me kind of in suspense. So, you’re in that spot. You can do the core dump. Is that where the story ends, or is there another chapter?
Seth: No. The last thing we then look at is how do we prevent an off-line attack. If an attacker is able to get access to that running instance, they can trigger a core dump, like that’s a really targeted attack. There’s a number of ways we can protect against that, but those all involve like using things like memory locking, auditing your dependency trees. But ultimately, the way that you best protect against leaking secrets is just to assume that they’re going to be leaked.
The best protection is assuming that you have none. This is where you bring in like a secrets management solution, something like Vault, where instead of hard-coding your credential that sits in an environment variable, maybe you need, say, a password to talk to Reddit, for example. Instead of that being a hard thing that sits in a string and in an environment variable, whether it’s encrypted or not, you instead acquire that credential at runtime and it only lives for the duration of the process.
Especially in serverless, these applications generally aren’t very long-lived. They spin up a microservice, do some processing, listen to a pub/sub event, and then turn down. Instead of having a shared credential, we instead at boot authenticate to some service like Vault. We get a single-use or maybe time-based credential that expires at the end of that function. We’re limiting the scope that an attacker can have access by limiting the time the credential is valid. When the application dies, or when that serverless application terminates, we revoke that credential automatically.
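The lifecycle described above (authenticate at boot, receive a time-limited credential, revoke it when the function ends) might look something like this minimal sketch. The `InMemoryBroker` class is a hypothetical stand-in for a secrets manager such as Vault; it does not reflect Vault’s actual API, and the names `Lease` and `leased_credential` are invented for illustration.

```python
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass
class Lease:
    token: str
    expires_at: float  # monotonic-clock deadline


class InMemoryBroker:
    """Hypothetical stand-in for a secrets manager that issues leased credentials."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def issue(self, ttl_seconds: float) -> Lease:
        lease = Lease(secrets.token_hex(16), time.monotonic() + ttl_seconds)
        self._leases[lease.token] = lease
        return lease

    def revoke(self, token: str) -> None:
        self._leases.pop(token, None)

    def is_valid(self, token: str) -> bool:
        lease = self._leases.get(token)
        return lease is not None and time.monotonic() < lease.expires_at


@contextmanager
def leased_credential(broker: InMemoryBroker, ttl_seconds: float) -> Iterator[str]:
    """Acquire a credential at start and revoke it when the function ends."""
    lease = broker.issue(ttl_seconds)
    try:
        yield lease.token
    finally:
        broker.revoke(lease.token)  # runs even if the handler raises


broker = InMemoryBroker()
with leased_credential(broker, ttl_seconds=60) as token:
    valid_during = broker.is_valid(token)  # usable while the function runs
valid_after = broker.is_valid(token)       # revoked once it exits
```

The TTL acts as a backstop: if the process dies before the revoke step runs, the credential still stops working once the lease expires.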
Guy: This is kind of the panacea, I guess, if you have this set of automated key management systems. How often do you see this in the wild? I mean, you see it from the Google side, people using Google Cloud, or from the experience with Vault, which I guess might have a bit of selection bias given that user base. I find when I talk to people, there is a general appreciation of the KMS side, but key rotation and this notion of actually expiring that key may be not as well adopted. Is that a fair statement?
Seth: I think it depends on the organization. I think this is a really good question. It depends on who is responsible for managing the key management service. If it’s a security team and the developers just know that there’s an API that they call to do encryption for them, the security team is likely rotating those keys frequently. They may be rotating them every 30 days or even every two weeks. To developers, every key management API is like, “Encrypt this piece of data,” and it automatically uses the most recent key. If the developers are controlling those keys on the other hand, especially if they’re not super security conscious, it might not be as regular. There might be like a Jira ticket that’s in the backlog somewhere that’s like, “Go rotate that key.”
This is why, specifically on Google, we actually have automated rotation for our keys, so you can set up basically – I like to consider it like a cron job that sits within KMS that’s like, “Rotate this key every 90 days,” and it will just automatically rotate it. Then that brings up the other side of it, which is you have all this old data that’s encrypted with kind of an older version of the key. How do you upgrade it? This is where the other side of security that we don’t do a lot of talking about comes in, which is auditing and logging and reporting on that data.
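To make the rotation idea concrete, here is a small sketch of a key with versions and a rotation period, plus a way to find data still encrypted under an older version. It assumes a hypothetical `KeyRing` model and `Record` type and performs no real cryptography; it is not how any particular KMS implements rotation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class KeyRing:
    """Hypothetical model of a key with versions and a rotation period."""
    rotation_period: timedelta
    last_rotated: datetime
    current_version: int = 1

    def rotate_if_due(self, now: datetime) -> bool:
        """Create a new primary version once the rotation period has elapsed."""
        if now - self.last_rotated >= self.rotation_period:
            self.current_version += 1
            self.last_rotated = now
            return True
        return False


@dataclass
class Record:
    record_id: str
    key_version: int  # version of the key that encrypted this record


def stale_records(records: List[Record], ring: KeyRing) -> List[Record]:
    """Records still encrypted under an older key version (re-encryption candidates)."""
    return [r for r in records if r.key_version < ring.current_version]


start = datetime(2020, 1, 1)
ring = KeyRing(rotation_period=timedelta(days=90), last_rotated=start)
records = [Record("a", key_version=1), Record("b", key_version=1)]

rotated_early = ring.rotate_if_due(start + timedelta(days=30))    # not due yet
rotated_on_time = ring.rotate_if_due(start + timedelta(days=90))  # due, new version
records.append(Record("c", key_version=ring.current_version))    # new data uses newest key
to_upgrade = [r.record_id for r in stale_records(records, ring)]
```

Tagging each record with the key version that encrypted it is what makes the "how do you upgrade old data" question answerable: the stale set can be reported on and re-encrypted in the background.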
Even with all of the protections we put in place in the talk that I’m giving, there’s still a security vector, right? There’s still a threat model where an attacker could gain access for a very short period of time, and the only way that we can really deal with that moving forward is to detect it and respond to it appropriately. That’s where auditing and logging and more recently, anomaly detection is coming in, which is, especially if you work in a larger company that has a lot of data, feeding that into a system that analyzes when the secrets are being accessed, from which applications and services they’re being accessed.
Whenever something doesn’t look right, firing an alert and saying, “Hey! This doesn’t feel appropriate. Someone should take a look at this.” The same way that a big spike in memory usage might flag someone on the site reliability engineering team, we do the same for the security team.
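A very simplified version of this kind of check might look like the sketch below. It assumes audit events reduced to hypothetical `(service, secret)` pairs, a known-good baseline, and a fixed read threshold per time window; real anomaly detection systems are considerably more sophisticated.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

AccessEvent = Tuple[str, str]  # (service, secret_name), a hypothetical audit-log shape


def find_anomalies(baseline: Set[AccessEvent],
                   events: Iterable[AccessEvent],
                   max_reads_per_window: int) -> List[str]:
    """Flag accesses from unexpected services and unusually high read volumes."""
    alerts = []
    counts = Counter(events)
    for (service, secret), n in sorted(counts.items()):
        if (service, secret) not in baseline:
            alerts.append(f"unexpected access: {service} read {secret}")
        elif n > max_reads_per_window:
            alerts.append(f"unusual volume: {service} read {secret} {n} times")
    return alerts


# Which services are expected to read which secrets.
baseline = {("monitor", "twitter_api_key"), ("notifier", "twilio_api_key")}

# One window of audit-log events.
events = (
    [("monitor", "twitter_api_key")] * 3
    + [("notifier", "twilio_api_key")] * 50
    + [("thumbnailer", "twitter_api_key")]
)
alerts = find_anomalies(baseline, events, max_reads_per_window=10)
```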
Guy: I think, definitely, we need to kind of track them, and I guess back to your point about security not being binary, right? You just have to reduce the attack possibilities, so the attack surface or the likelihood of a compromise. How much of this is serverless? You use serverless as your example, right? You use KMS in this storyline. You should write a book about it. How much of this is serverless-specific, or is different in serverless versus a containerized environment?
Seth: I think in a containerized environment, it’s pretty similar. I’ll rephrase your question, so I can give you a different answer, which is what about like traditional VMs and like on-premise infrastructure? When we think of like traditional applications, especially monoliths, they tend to be like spun up and then they live forever. The key difference is that in a serverless world and even a containerized world, applications tend to be a lot more ephemeral, right? They come and go. They get moved around. They get rescheduled, especially if we’re using something like Kubernetes or Nomad or Mesos or whatever it might be.
In an on-premise world, most of those monolithic applications do not handle change well, whether that’s a configuration change, or whether that’s a secret change. Because of that, we tend to see less rotation and less of a focus on security at the application level. Instead, organizations put that thing in a box, put a firewall around it, use networking policy to try to restrict access as much as possible. Sometimes, there’s a cultural aspect here as well, which is that application was written by a developer who left the company 20 years ago. It’s running on Java 3.5, and like no one is going to touch that thing. We need it to keep running for now, right?
That’s really the key difference: a lot of the paradigms that I’m pushing for don’t work well with those very legacy, very non-cloud-native applications. If you rotate secrets frequently, that means your application needs to handle graceful restarts. That’s a cloud-native property, a twelve-factor app property. That’s not a property that legacy applications may exhibit, and that makes it really difficult to follow some of these best practices.
Guy: Got it, yeah. I guess, in concept, you could do it in a VM, but it just means it wouldn’t be as natural in serverless or in short-lived containerized environment. It’s just elegant in the sense that you get the key, and throughout the life of the invocation the key is just that key.
Guy: How does this relate? In serverless, there is one of the – Oftentimes, the myth is this thought that you get a system spun up for every serverless call. When you use a KMS with those environments, in the serverless environment, does the key stay in memory for the duration of the instance’s existence or the function’s invocation, if that makes sense?
Seth: It stays.
Guy: If an instance stays warm.
Seth: Yeah. It stays in memory for the lifetime of what we call the “cold boot.” The first time you hit a serverless function, there may be no running instance, so you eat what’s called cold boot time, which is spinning up the container or the VM or whatever, depending on the cloud provider, and then the initial init functions that that code runs. In Go, there’s like an init function, right?
In Node.js, anything that’s outside of a main function will execute before the function can respond. That’s cold boot, and then from that point, the instance is available. As long as it’s continuing to get requests, it will stay hot, as we call it, the opposite of cold boot. During that time, your secrets, your keys, whatever, will live in the memory of the instance. But then the instance terminates when it doesn’t get a request for a certain period of time, or if you set some timeout for the maximum lifetime. At that point, the instance is killed.
On Google specifically, Cloud Run, which is our serverless offering, is actually built on top of the Knative open source tool, so it is just a container. The same kind of expectations you would have around starting a container, you have with Cloud Run as well. You can set things like CPU and memory availability, maximum timeouts, concurrency, that type of stuff.
Guy: Got it, yeah. I guess my next question would have been, but you actually kind of answered it, about the performance impact of using a KMS. But at the end of the day, it probably just gets absorbed into that boot time, cold boot startup time.
Seth: Yeah. This gets into like a little bit of an architecture discussion, but you could architect your application so that every HTTP call invokes KMS to decrypt the secret to fulfill the user’s request. That’s going to incur a lot of extra latency in the request cycle and also potentially extra costs, as you’re going to hit the KMS with every user request that you get.
A more scalable approach is, at cold boot, once, you call the KMS service, decrypt the keys or decrypt the secrets, and store them in memory. Then moving forward, every request goes a little bit faster. Each of those has tradeoffs, right? There is no right or wrong answer. It really depends on your threat model, and this goes back to threat modeling being super important when building these applications.
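The two architectures Seth contrasts differ mainly in where the KMS call happens. The sketch below illustrates the decrypt-once approach, assuming a hypothetical `SecretCache` wrapper and a fake fetch function that counts how often the "KMS" is hit. Here the fetch happens lazily on the first request; doing it eagerly in the module’s init code would behave the same for this purpose.

```python
from typing import Callable, Dict, Optional


class SecretCache:
    """Fetch and decrypt secrets once per instance, then serve them from memory."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch
        self._secrets: Optional[Dict[str, str]] = None

    def get(self, name: str) -> str:
        if self._secrets is None:  # cold: first request on this instance
            self._secrets = self._fetch()
        return self._secrets[name]


# Counter standing in for KMS request metrics (and the bill that goes with them).
kms_calls = {"count": 0}


def fake_kms_fetch() -> Dict[str, str]:
    """Stand-in for calling a KMS to decrypt secrets; not a real API."""
    kms_calls["count"] += 1
    return {"TWILIO_API_KEY": "example-value"}


cache = SecretCache(fake_kms_fetch)


def handle_request() -> str:
    # Every request reads from memory; only the first one triggers the fetch.
    return cache.get("TWILIO_API_KEY")


for _ in range(100):
    handle_request()
```

After one hundred requests on a warm instance, the fake KMS has been called exactly once. The tradeoff Seth mentions still applies: the plaintext now lives in memory for the lifetime of the instance, which is precisely what a per-request decrypt would avoid at the cost of latency and KMS charges.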
Guy: Got it. Maybe just one last question. We’re sort of going a little long here. Is there another sort of step in sort of this secrets’ management horizon? Is there something past key rotations and KMS’s that is even better that is sort of around the corner?
Seth: Yeah. I mean, I hinted at it earlier. I think ephemeral, short-lived secrets or just identity and access management in general, is like the right direction. Having a credential that is only valid for the lifetime of my function and can only access these one or two things, like principle of least privilege.
The second thing is like, do you even need a secret in the first place? If you’re using a cloud provider or even multiple cloud providers, can you leverage OIDC and the cloud providers’ identity and access management to do that? There is no exchange of a credential. This function, or this serverless application, is able to talk to this SQL database, because I set up a permission at the cloud provider layer that enabled that to happen.
Then there is no exchange of credential. There actually isn’t a vector for an attacker to try to escalate some privilege, because that lives at a higher level than the application or service that you’re talking to.
Guy: Yeah, got it. It makes sense, and I guess you can open up a different conversation about all your eggs in one basket around that, these access management portals. But I think that’s a topic for its own podcast or 10 of them on those.
Seth, this has been fascinating. I definitely encourage people to go watch the YouTube video and sort of I guess see the whole story unfold. We got the good kind of cliff notes over here. Thanks for coming on the show.
Seth: Yeah. Thank you for having me.