Tech Captains

EP16: Unlocking Incident Management with Chris Evans of incident.io

Ron Danenberg Season 1 Episode 16

We're excited to host Chris Evans, the co-founder and chief product officer of incident.io. This groundbreaking company has raised $30 million to enhance incident response in tech teams.

In this insightful conversation, Chris discusses his journey from Monzo Bank to incident.io's creation, where he leveraged firsthand experience to develop a practical tool. 

His book recommendation offers valuable insights into system complexity and safety, which align with his company's philosophy.

Additionally, he talks about incident.io's office-based policy, explaining how they prioritize office presence for certain roles while accommodating remote work.

Tune in for more on their integrations, culture, and mission to enhance incident management.

➜ Hosted by Ron Danenberg and Gareth Thomas.




Welcome to this new TechCaptains episode. I'm Ron, and with Gareth today, we are lucky to interview Chris Evans, co-founder and chief product officer at incident.io. incident.io is a company that raised $30 million to improve incident response in technology teams and companies. Chris, how are you today?

Very well, thank you. Dialing in from a slightly gray and rainy London, but happy to be here.

Great, great. So Chris, to start, I've seen that you spent several years at Monzo Bank. I think that's also how I've been following incident.io since the beginning, as I was connecting with a number of people at Monzo. And for those who don't know, Monzo is one of the most popular neo-banks in the UK. So my question is, is that where you got the idea for incident.io?

Yes, sort of. So I joined Monzo back in 2017 or 2018, I think, back when it was relatively early. I was responsible for building out the platform engineering part of the organization, which at the time was me plus, I think, three engineers. So relatively small, and we had a few hundred thousand customers. Over my time there, it grew into a pretty big, significant part of the organization, with like 7 million customers. And along that journey, I picked up responsibility for running all of on-call for the bank as well. So, on-call being those engineers who respond to things going wrong, that had a natural segue into incident management, which for banks is actually really quite difficult, because it's not just "how do we get the technology back, fixed and running?" There are all the regulatory implications as well: if customers can't access their money, that kind of thing, that could be pretty painful.
So, yeah, the idea for this originally came from the pain of having to navigate incidents with a bunch of engineers. What I did at the time was write something that was just, I mean, laughably simple. The very first version of this was a Lambda function in AWS that, when someone ran a slash command, created a channel, and everyone was like, oh my God, this is 10x better than what we had before. From that point on, it got more and more capable. And then I was like, cool, it'd be good if I could store some state with incidents. So I introduced a Django app with a database attached to it, so we could start to know things and store things about incidents. That was really where it all came from. The reason I say "sort of" is because at the company I was at before Monzo, a company called Moo, someone had written a similar thing, which just created a channel and a Google Doc whenever there was an incident. And it worked so well there that I guess I could credit that as the real source.

And is Monzo one of your customers yet?

Well, they're not, actually. And I credit that to the good work that we did to build out that tooling. If you look at organizations, there is always an inertia against jumping from tool to tool. And when you've got something that is working, and essentially they built their own custom thing in-house, and it was deeply integrated with the whole data stack and various other things going on there, yeah, they haven't quite got the volition yet to move.

Sorry, you mentioned Lambdas. And one of the things I've noted, certainly at the last two companies, is that it seems to be this really interesting tool that drops into DevOps teams, where they suddenly go, I can...
I can solve some problem by just writing a quick Lambda. I mean, I know there are people building applications in serverless, but I find it interesting that once again, it's somebody saying, yeah, we just wrote a quick Lambda, dropped it in, and it's so useful for things like that, isn't it, in DevOps environments?

Yeah, I think I'm sort of yet to use Lambdas, or any kind of functions-type infrastructure like that, in anger in any kind of situation. Monzo was a big microservice shop, and they had microservices that acted a little bit like Lambdas. But yeah, as an "I need to get something running in production and able to respond to requests", it's a pretty powerful paradigm for delivering things to production.

And I remember, Chris, that at the time Monzo had hundreds and hundreds of microservices. That was an interesting architecture, especially for a new company. I was wondering if that's something you've replicated at incident.io, or have you done it a bit more hybrid?

Yeah. So Monzo, when I joined, again had a few hundred microservices. And I remember joining and being like, oh my gosh, this is wild, they are really doing microservices. There was an ever-updated graph of the number of microservices. Everything was in a big monorepo, and a microservice was a folder within that, an independently deployable unit. By the time I left, there were something like 1,800 microservices. That's independent microservices, not replicas. That'd be like 10,000 or something replicas. So they really went in on it, and it worked phenomenally well because they invested an absolute ton of time and effort in the whole developer experience there. There were tools to create a new service, so everything would derive from the same microservice chassis, and shipping to production could be done quickly.
You know, I could go from "oh, I have an idea for a service" to it running in our production Kubernetes cluster in 10 minutes, something like that. And the focus was a hundred percent on: where do I want to inject my business logic for what this service does? So it was just phenomenally quick. Everyone had default monitoring, they had default observability, all of these kinds of good things. So it worked well, but it took a lot of effort. At incident.io, we started with the three of us working evenings and weekends and really early mornings to get this company off the ground, and we just didn't have the time to go and do that level of work. So for us, it was like, what is the simplest way to get this app into production so that we can get to shipping our business logic? Like, this company doesn't really have customers, it doesn't have anything. So in the early days we were essentially like: Heroku, we all know it, we've all used it. We know it won't last forever, and there'll be a point at which we want to tap out, but for now, Heroku and a Postgres database work just fine. The thing that we did bring from Monzo is that everything's written in Go. But aside from that, it's essentially Heroku and Postgres, and now we are increasingly using aspects of Google Cloud. We use BigQuery for analytics, we use Cloud Storage, we use Pub/Sub. And the direction of travel is that in the next few months, we will be wholesale on Google Cloud. But yeah, for us, it was: keep things simple. The app today is still, I guess, a modular monolith, is I think how it's been described. And it works really well.

And so you're on Google Cloud. Why that choice? Because I think today AWS is more popular, then you have Microsoft and Google, if I'm not wrong. So why did you choose Google?

Yeah, it comes down to the proficiency of the team, and of the team that we knew we would want to hire. So Pete came from GoCardless, which is a company he was at before Monzo, and was familiar with Google Cloud. I think some of the tools they have there are just a little bit more developer-friendly. In fact, BigQuery was the very first thing we had, and we had that because we used it at Monzo. And then Pub/Sub is a really easy thing to integrate with, so when we started needing to do async event processing, it was just phenomenally easy to get things set up and running. And then, honestly, just unifying everything around Google. We used G Suite for email and calendar and things, so it just felt like a really sensible choice for us.

I was also wondering: the company is not even two and a half years old, or it's going to turn two and a half this summer. You've raised 30 million, you are 60-something people. I mean, it's very, very fast growth. So what have been the biggest challenges for you so far? As a founder, as a tech guy, as the product officer, there are many different hats here.

Yeah, I mean, building a company from scratch is just nothing but challenges, I think. Trying to hire a team whilst building a company, whilst selling to customers, all of that is just very, very difficult. So honestly, it has been a scrap to get here, but things are going very well. We're very fortunate, I think, in that we launched out of the gate and had people queuing up wanting to use our software, and so had customers incredibly early on. That's one of the biggest hurdles, I think, because as soon as you have customers, you have feedback, which is just the most valuable way to strengthen things.
And the more feedback you get, the stronger the product gets, and the more you can ship and make it valuable for other customers. But yeah, there's been a ton of things. Product has always felt like the easiest part, honestly, because we're building a product for engineers, for problems that we have lived and breathed. It's not like we're building, you know, accountancy software and have to get our feedback through secondhand information. It's like, would this have been valuable to me in previous roles? And it's like, absolutely, yes, this was a huge pain point. So there's that side of it. And I think that also segues a little bit into the sales side. I do a lot of the sales calls. Still to this day, I will jump on with our account executives. And it just helps to be able to sell authentically, from a position of: listen, I've been there, here are the ways that I tackled that. It's not like reading from a sheet of "this is what I should say here". There is depth behind it, and that makes things easier. But yeah, in terms of challenges, it's hard to pinpoint a very specific thing. Anyone who's built companies from an early stage will know that it's an absolute grind, and you've just got to push through a lot of stuff.

I think it is easier when you can eat your own dog food though, isn't it? I do find it interesting, you saying you understand the product because you're writing it for engineers. Do you think, though, that there's a danger that you could be a little complacent sometimes, in assuming that you know the answers in that situation?

Oh, 100%. And I would really hope that complacency doesn't leak into how we do things here.
And there are actually some things that are a little bit more challenging when it comes to eating your own dog food, or drinking your own champagne, which is the other popular phrase. It's a slightly nicer way to think about it. But honestly, we actively use our product to respond to our own incidents, because it's very rare that incidents are so severe that we can't use the product in some fashion. There are parts of it that we can use day in, day out, and we interact with it many times a day. But we're also a really small organization, and a lot of the problems that we solve really help larger organizations, so it's harder for us to get the benefit from that. A good example would be doing incredibly thorough post-incident analysis and debriefs. That was the kind of thing we used to do at Monzo a lot, right? We'd have a big incident, and we would spend weeks of engineering time on the post-incident learning, understanding, that kind of thing, and documenting it so that we could socialize it across the organization. In a very small organization, that kind of thing just isn't quite as useful, or it's easier to socialize learnings because, frankly, everyone is sat around the same table in the same office. So there are parts where we can't use our own product to the same level. And then around complacency, do we know everything? Definitely not. But the beauty of building this company is that the initial customers we were selling to looked a lot like the companies we have worked in in the past. So we know pretty well how those things work, and increasingly we're selling to much, much bigger organizations. The biggest organization I worked in was Marks and Spencer. So, like, an e-commerce company here.
And that was a huge IT org. So I have some reference points there, but not deep reference points. As we continue to sell to bigger and bigger customers, we get to learn a bit more. The default is not to drop into an engagement with a prospect and be like, listen, we've got this, we're going to tell you how to do it. It's like: how do things work? Where's the pain? How might we sit within that? And there's a huge incentive alignment there, right? By doing that, we get to learn more about different sectors, different sizes and types of organization. But we also get to build a stronger product out of it, because when we go to the next company that looks like that, we are infinitely better placed to sell them something that feels like a good fit. So yeah, it's a good call-out, and a risk I think we're aware of and track implicitly. I think we're on top of it for the time being.

And Chris, I know, and you've actually mentioned it in the past few minutes, that you are highly integrated with Slack, and that's how the idea started at Monzo. So are you today very dependent on Slack? Can you work without Slack? Do you work with other messaging platforms, or are you fully a Slack app, basically?

Good question. So I guess if you zoom out and look at how incidents normally work, right? There is some form of monitoring and alerting system somewhere, monitoring infrastructure and applications and looking for things going wrong. After that fires, it will go to some form of paging solution. PagerDuty is a good example of this. It takes events in, and its job is essentially to find a human to get into the loop and respond to a thing. At that point,
historically there hasn't really been a very good solution for coordinating what happens next. And that is where incident.io naturally fits in. That's where we started this company: making it easier to respond to incidents. Responding to incidents is founded on good communication, so we essentially picked the communication platform that we were familiar with, which was Slack. It also allows us to build very rich app experiences. So the product is deeply integrated into Slack in the response phase. And that's very, very deliberate, because a philosophy and a principle we have is that when you're responding to incidents, we don't want to pull you away from where you're communicating to come and use our platform. We should meet you where you are, sit within that, and make it effortless. So we let you integrate from within Slack into PagerDuty, into other tools, into your status page. All of those things are right there. And then... go on.

No, sorry to interrupt, but you mentioned earlier that you want to target bigger companies, bigger organizations than yours, but Slack is usually used by smaller organizations. The bigger ones would use Microsoft Teams. So how do you deal with that?

Yeah, it's a good question as well. First of all, there are many big organizations who do use Microsoft Teams. There are also many big organizations who use Slack. Slack has, I think, reasonable coverage of, like, the FTSE 500. So at the moment, it's not a major concern. And the key thing here is that whilst our app is built on Slack right now, it's done so deliberately, so that we can focus on one platform and deliver a phenomenal experience on top of that. There is nothing sacred about Slack that means we couldn't hot-swap it out for Teams or whatever else. So 100%, we will build on other communication platforms.
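
To make the "hot-swap" point concrete, here is a minimal Python sketch of the idea, not incident.io's actual code. It assumes the incident logic only ever talks to a small chat interface, so a Slack-backed implementation could in principle be replaced by a Teams-backed one. Every name in it (ChatPlatform, InMemoryChat, IncidentCoordinator, the channel naming scheme) is hypothetical, and the in-memory class stands in for a real integration rather than calling any Slack or Teams API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ChatPlatform(Protocol):
    """Hypothetical interface: anything that can host an incident conversation."""

    def create_channel(self, name: str) -> str: ...

    def post_message(self, channel_id: str, text: str) -> None: ...


class InMemoryChat:
    """Stand-in for a real chat integration; it only records what was asked of it."""

    def __init__(self, platform_name: str) -> None:
        self.platform_name = platform_name
        self.channels: dict[str, list[str]] = {}

    def create_channel(self, name: str) -> str:
        channel_id = f"{self.platform_name}:{name}"
        self.channels[channel_id] = []
        return channel_id

    def post_message(self, channel_id: str, text: str) -> None:
        self.channels[channel_id].append(text)


@dataclass
class Incident:
    number: int
    title: str
    channel_id: str


@dataclass
class IncidentCoordinator:
    """Incident logic that depends on the interface, not on a specific platform."""

    chat: ChatPlatform
    incidents: list[Incident] = field(default_factory=list)

    def declare(self, title: str) -> Incident:
        # Roughly what a "declare incident" slash command would trigger:
        # open a dedicated channel and announce the incident in it.
        number = len(self.incidents) + 1
        slug = "-".join(title.lower().split())
        channel_id = self.chat.create_channel(f"inc-{number}-{slug}")
        self.chat.post_message(channel_id, f"Incident #{number} declared: {title}")
        incident = Incident(number, title, channel_id)
        self.incidents.append(incident)
        return incident


if __name__ == "__main__":
    # Swapping platforms means passing a different ChatPlatform implementation.
    for platform in ("slack", "teams"):
        coordinator = IncidentCoordinator(chat=InMemoryChat(platform))
        print(coordinator.declare("Payments API down").channel_id)
```

The design choice being illustrated is only the separation: the coordinator never names a platform, so supporting another one would mean writing another class with the same two methods.
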
If we're grounded on the philosophy that during the response phase we should be where you're communicating, well, if you're communicating on Microsoft Teams, we will be there. It's more a matter of when, not if. And I think there are other platforms that will be interesting in future too. Zoom, for example, is increasingly allowing developers to build apps into that experience. So if your company's default response is opening a Zoom bridge and jumping on there, well, we don't want to be like, oh, you've got to be in Zoom and somewhere else dealing with us. We'll join you where you are, kind of thing. So yeah, it's a good question, but essentially this is just a case of trying not to spread ourselves too thin too soon.

I think that's a really good philosophy, this idea of joining them where they are. Because like you said, you're potentially bringing lots of people into a discussion about how to respond to an incident, potentially people in different geographic locations and different time zones. So yeah, I think the idea of coming to where they are is a really good one. Now, I'm interested as well. I think you've had CTO roles, right? You've held quite technical roles. And I'm just wondering, is there anything you've learned now in a product role that you would share with someone like yourself who's in a technical role?

Oh, good question. So I've never been a CTO before. My whole career has been very technical, though. So originally, like, soft...

Sorry, my mistake. I thought you were.

No, no, no. It's all good. So I've been a software engineer. I did maths at uni and then did a computer science thing. I've always been messing around with computers, then got into software and, for reasons that are a bit convoluted, into platform engineering, and then have run reasonably large platform engineering type teams. And then when it comes to incident.io, Pete, one of the co-founders, he's the CTO here.
I am, by title, the CPO, and that is a little bit of a... Pete and I work very closely together on that side of things, but it's also like, let's pick some titles and have them scan externally, to be very transparent. But to the point of the question, which is maybe how do product and technology integrate, and have I learned things on either side that help there? I think absolutely yes. A good example of this is that at the moment we don't have any backend engineers, front-end engineers or infrastructure engineers at this company. We have product engineers. And I think this is a bit of a wave that we've seen in the industry as well: product engineers being those engineers who are incredibly business- and customer-oriented, right? They wear a little bit of a product manager hat as well as an engineering hat, and are comfortable being a little bit scrappy to get something out the door, knowing that value in the hands of customers trumps technical excellence, kind of thing. That is something we're leaning into a lot here, which is that we move very, very quickly. That is one of the few benefits we have over the large incumbents with huge engineering organizations: we can move very, very quickly, and we are comfortable trading off excellence. A good example is that we chose Heroku in the first place, right? We knew that wouldn't be a long-term thing for us. It's still working fine, and it meant that we could get value into the hands of customers and have customers on board much, much quicker. So I think, for me, I kind of look at product and engineering
being separated a little bit like, you know, if you rewind the clock, people were like, well, we have developers and we have operations folks. And then everyone was like, well, no, let's not do that, because it makes sense to have this wave of DevOps, or whatever you want to call it. Basically, philosophically, engineers having operational responsibilities is a good thing. And I think the same is true of engineers having product responsibilities. So our engineers will talk to customers frequently. They will write proposals and specs frequently, and they're good at communicating those things as well. We optimize for hiring around that too.

So if I'm a product engineer working for your company, I can design, build and launch a feature by myself, potentially end to end, on your platform?

Yes, yes. Within sensible guardrails, right? There's a good version of that, which is that engineers essentially have the responsibility and trust to go and do that, and they work with a team to make sure that it's delivered, but they have a degree of autonomy. And there is a bad version of that, which is that every engineer is free to go and do their own thing. It's not the latter. It is essentially: we have a team whose mission is to build a phenomenal post-incident process, right? From the moment the incident's resolved to the point where the last action item is completed, that is your remit. Go away and think, talk to customers, gather all the feedback that we've got, go and look at that, shape up a roadmap. And then we will talk about that.
And between the teams, from a bottom-up point of view, and Pete and I, top-down, we will shape that into a roadmap that we think is well aligned with the strategic goals of the company, but also tackles the on-the-ground stuff that people are hearing. So all of these things are a negotiation. I think they work well by getting different perspectives in the room. But yes, fundamentally, to your question, engineers are responsible for designing, building, shipping, getting customer feedback, and iterating on the products they're responsible for.

And Chris, I know you have an office in Shoreditch where you guys like to share your logo, which you can see through the window from Shoreditch High Street or from Old Street roundabout, I'm not sure. So what is your office-based policy? Are people working remote, hybrid, however they want? What are you doing here?

Yeah, so we are, I guess, "office first" is maybe the label to describe this, which is that we have deliberately hired people who want to work in an office, certainly for product and engineering, and for our London office. That typically looks like finding those people who are like, well, I want the balance of being able to do some work from home, but I really draw a lot of my energy from being in the same place as the people I work with, and being able to stand up and whiteboard, things like that. So it is not one of those cultures where it's like, you must come into the office. Most people come in more than three days a week because they want to be here. But yeah, I think at early-stage companies it's just phenomenally hard when there is so much ambiguity in everything you do, and anything you can do to make the communication higher bandwidth is a huge advantage. So I think it's working really well here. Like, people...
People enjoy being in. We have invested in a nice office. As you say, there's a big flame logo that you can see from Old Street roundabout all the way down to Finsbury Square. It looks really cool. It's one of my greatest contributions to this company, I think. And the office feels great. The only caveat to that is that we've also launched in the US, and we've got an office in New York, in Manhattan.

Congrats.

Thank you. That is where we are putting a lot more commercial people. And it turns out that when your job is broadly outward-facing, so you are mostly talking to customers (a salesperson, for example, is going to be mostly on the phone talking to customers), that philosophy of trying to be in the office breaks down a little bit. It actually makes more sense to put people close to the people they need to talk to. So we have salespeople on the West Coast of America, in middle America, and on the East Coast too. So yeah, the company is a little bit different depending on roles.

That makes sense. And Chris, we have a tradition on the podcast that every guest presents a book they really liked. The one you sent us before is called The Field Guide to Understanding Human Error, by Sidney Dekker. So could you tell us more about what this book is about and why you chose it?

Yeah, it was a hard choice, because it's like being asked what your favorite meal is, or your favorite song. But this is a book I've read several times. I can't actually remember who recommended it to me, but it's like a whole manual that frames how things going wrong can be interpreted in different ways. A good example of this is human error. There is an old way of looking at that,
saying, well, humans are these fallible things in these otherwise very stable systems that we try and operate. And when things go wrong, it's like, how do we crush the human error out of the situation and avoid that being the case? And it presents a new view of that, which is that individuals are as much part of the system as the technological parts, and they should be seen as things that create resilience as much as cause issues, kind of thing. So this book basically has a number of different angles for looking at system complexity and things like socio-technical systems. That's the broader view of systems in organizations, which is that it's not just technology and people, there is this fusion of those things. It covers blame and learning. And basically I read it and just found myself nodding through the whole thing, being like, yes, this is a much smarter way to look at things and to think about things. And I think when you look at that book and then draw the dotted lines into what we're doing with this company, incident.io, it's kind of clear that we're trying to embed some of these philosophies and ways of thinking into the software that we're producing. So yeah, I think it's a really, really good book, but it's also practical. That's the beauty of it, as there's a lot of literature that goes around which is incredibly academic, and I'm like, yeah, I get it, but I don't know how to apply it. Whereas this is just... so Sidney Dekker, I think he was formerly an airline pilot, and then went into studying a bunch of human factors and things like that. And so he has imported a bunch of the learning of how safety within the airline industry has increased over time. So a very, very practical book. I would strongly recommend it to anyone.
I've always been really fascinated by complex system failure. It's really bizarre, but I read books about aircraft accidents and stuff like that. It's interesting that it was a pilot who wrote it, because I think they introduced something called crew management in the airline industry, which came about because there was this tendency for people to defer to the most senior person, you know, the captain on the plane, in an incident, and very poor decisions being made. And I think now they're bringing that into the surgical theater as well, where they have a similar issue with really experienced doctors who people have tended to defer to, and who've made major mistakes because no one wanted to say anything to them.

Yes, 100%. Incidentally, if you're after another recommendation, this is not a book but a talk. There's a phenomenal speaker called Nick Means who gave a talk called How to Crash a Plane. It's a catastrophic story, unfortunately, but it is a deep, insightful look at this, positioned against how engineering teams work. So I would strongly recommend it. I can send a link afterwards if it's interesting.

Definitely. If you send the link, we can add it to the description of the podcast.

Nice. Yeah, so I noticed you've got quite a lot of integrations. I think you've got about 24. And you've got things like Zapier, which I'm a big fan of, and I could see that's fantastic for tying all kinds of systems into this response process. Vanta, I thought, was really cool, because you can do the SOC 2 compliance piece, right, the follow-up there. So I was wondering, what is your favorite integration you've done, and why?

Ooh, that is a good question. What is my favorite integration?
I think, purely from the impact it can have on organizations as a whole, the integration we have with status page is probably one of my favorites. To explain what this does: most companies will have a status page that they put up, which is like, you know, status.monzo. for example, and it's the thing that says whether you're up, whether you're down, and how your historical performance has been. And for the vast, vast majority of companies, it's always green and they never update it. Something breaks, you go and look at it, and it says everything's good. And you're like, it's clearly not. So it's one of those things that can lose a bunch of trust. Conversely, I think if you use them really well, and they are actively embedded as part of your response process, they are a really strong way to gain trust with customers and basically come out better as a company. Our integration there allows you, with essentially no effort at all, in literally 30 seconds, to put up a status page, keep your customers in the loop, and keep them front and center when you're dealing with incidents. And I think the effects of that justify it being a really, really solid integration for our product.

And Chris, I think this concludes the podcast today. So thank you very much for taking the time to speak with us today.

Thanks so much. It's been a pleasure.

Thanks, Chris.

Thank you. And for everyone listening, we will be back in two weeks. Bye bye.