Podcast

Moving Slack's development experience to remote environments

Sylvestor George led a project to move Slack's entire development experience to remote environments, which was widely regarded as a “dramatically better experience”. Here he shares the full story of that project, including how they identified the problem, the solution they created, and how they convinced engineers to adopt the new workflow.

Transcript

Abi: You recently published a post on the Slack engineering blog about how Slack has shifted to remote development environments and that’s really what I want to focus on today. I want to just kind of dive in. In that post, you mentioned that for years, engineers at Slack isolated and tested their changes by running aspects of the Slack application on their local computers. So I’m curious, what did this actually look like for a developer?

Sylvestor: For most of my time at Slack I’ve been working on internal tools. So I didn’t share the experience of the actual product engineers who work on web app, which accounts for 60% of changes, or know what engineers within Slack go through on a day-to-day basis. So when we founded this team, the first thing we did was run a lot of surveys, and we realized there was so much friction that kept engineers from doing their work smoothly.

So from what I understood, the scenario at the time was that people would have the dependencies running in Docker containers on their laptop, which includes HHVM, the server used for running Hack, the language developed by Facebook. It would be very resource intensive, and it would literally make your laptop sound like a power generator. So imagine you’re trying to develop: it would be nearly impossible to be on a video call in Zoom, let alone have the Slack application work on your laptop. So it was really getting frustrating for them, and when they’re doing the development, they would still want to share their work with their peers for reviews, for feedback and stuff like that. So what would happen is they have the local setup, they would do their development there, running all these dependencies, and then they would sync their work continuously to a development environment, which they could share across Slack with their peers.

So this also limited them to only being able to work on one change at a time, because as you’re syncing, you need to be linked to this one environment and stuff like that. So it was really restrictive, and if something internally would break, let’s say a Docker version was inconsistent, or they made some mistake in one of the setup or dependency scripts, it would be very difficult to recover. So these were the big pain points that we learned from many such engineers.

That’s fascinating. I mean Slack is an application that I’ve used every day for the past I don’t know, five, six years. However long Slack’s been around, but to think about the process developers have or had to go through. So it sounds like you had to install all these heavy dependencies on your computer just to run it. Were developers running basically Slack itself? So they would make a change and then rebuild Slack on their machine essentially?

Yes. So there are different phases to it. So like I said, Slack is a monolith, right? And it’s a pretty complex application, especially the web app version of it. So downloading that repo itself takes about 30 minutes. So what people would do is they would clone the application, and they would then pull. To make it a little more convenient, what a lot of engineers did in the past was put some scripts together, which would orchestrate this Docker containerization aspect for them. So everybody would run this script. It would bring up these Docker containers with this HHVM server running, and then they would have the IDE, which would use all these features for them to develop.

However, luckily they didn’t have to run the server itself on the local machine. What would happen is they could sync it to an environment and then run the server there, but most of the heavy lifting of developing would still happen on the local machine, which itself was a big pain point. And just to add on, before they moved to the Docker containerization aspect, this used to be done pretty manually, like running bare-bones HHVM on the local laptops, which was even worse because there were regular updates of HHVM and then folks would be out of sync and stuff like that. So containerization made it a little better, but the situation was still very brittle when it came to development.

Yeah, it sounds challenging, and I mean how many engineers were doing this on a daily basis? And you mentioned 60% of the changes at Slack, but how many engineers was this affecting?

Oh, this was easily over 400 engineers continuously developing in web app. So no matter which domain or vertical you work on, you end up touching web app at one time or another. So web app is kind of the primary repository within Slack. Also, for example, when a new engineer joins, they have to do this onboarding exercise, which involves putting up a live PR or making a live change on web app. So the remote development environments made a significant change there: the setup period used to be 1.5 hours and remote development brought it down to five minutes. So it was a great productivity boost for onboarding engineers to get their first change into web app.

Well, that’s incredible and we’ll dive more into kind of the before and after and different measurements around that, but another question I have, and you kind of touched on it already, you mentioned kind of containerizing certain things. As I understand it through your post, before making the real push to switch to remote development environments, there were a number of incremental improvements you tried to make to local development environments to make things easier. I’m curious, what were sort of the highlights or the progression of things you tried and what sort of impact did they have?

Oh, great. That’s a good question. So like I said, Slack has grown significantly in the last five to six years. So has the repository itself, with the new features coming in and stuff like that. So when they were pretty small, let’s say four years ago, I would say the workflow was you have these dependencies like HHVM that you need to run your code or do the development. So what folks would do is just run HHVM on the local machine and have the IDE configured, which is primarily VS Code, and do the development. Now, the biggest problem was, let’s say, Homebrew installs breaking HHVM, or the inconsistencies where developers would be working on different versions of HHVM. So it kind of became a catch-up game every time an HHVM update happened, and folks would see failures because they were on different versions and stuff like that.

So there were a lot of complexities involved, but that was the first phase. As things evolved, the next thing was that a lot of engineers who worked on web app put together scripts, which would containerize all these aspects together so that folks would be on the same version of HHVM. So they would have these regularly built images uploaded with the new version of HHVM that they were on, and the setup scripts would just pull that image and everybody would be on the same version. So it removed a lot of inconsistencies, but again, running stuff on local requires a lot of scripting, building images, and putting a lot of stuff together to differentiate between running on macOS versus Linux. So those inconsistencies still existed. So folks worked really hard to make things better, but things were still brittle and they would break, and debugging them, especially on local, would be really hard.

Right and of course, because it’s local, every individual developer could be facing all kinds of individual local issues and I’m sure that was difficult. So I’m curious then in your article, you say based on the information gathered from user interviews, engineer feedback, you said there’s a clear need to evolve how we write code and that led to the inception of the remote development project. You kind of touched on this at the beginning, but what were you finding in the interviews, or surveys, or what metrics? What were kind of your north star metrics that told you we need to invest in remote development environments?

All right. So Slack does this engineering survey, I’d say it’s yearly or every six months, or maybe I think it’s every quarter. And in most of those, web app development has always been reported as the primary pain point for engineers. That has been the case for the longest time, and it’s one of the things that led to the inception of the infrastructure team known as the Developer Workspaces team, where we wanted to focus on those points and try to make them better. So once the team was in place, where we had dedicated folks to look at this problem, the first thing we initiated was user interviews ranging from senior engineers to principal engineers, front end engineers, backend engineers, everyone, to get their perspectives on how we could resolve those pain points or make them better.

And there was always a thing called remote development as a concept out there, but we never tried it or nobody tried it, or nobody thought it would work for web app, or we could do something like that for web app. So we started kind of putting stuff together and figured out if this would actually solve most of the pain points that are listed by the engineers and that’s where we made the comparison, and we came to a conclusion of this might actually be a good productivity boost for our engineers. So we shared the basic proposal with a lot of engineers who actually work on web app and we got an overwhelming response. People came out to volunteer, to try the prototype and help us with making it better and stuff like that. So that gave us a boost.

Also, there’s always been an internal channel where people working on local development would report their issues. So looking at the traffic there, people would spend so much time debugging their issues, and then other engineers would be helping them out. We saw that folks were wasting a lot of time on debugging issues, which made it clear that we needed to make this better. We needed to smooth out this experience so that they could focus on working on features, working on product, working on bugs and stuff like that instead of debugging stuff.

So it sounds like it was kind of a set of surveys and kind of follow up interviews on that and from that, you looked at kind of the overall frustration level and developers saying this was a top priority as well as getting an understanding of how much time was being maybe wasted by developers needing to debug things. I mean did you create a slide deck proposal for this that said, “Here are the stats, poor satisfaction.”

Oh yeah.

Yeah, I’m curious what were kind of those headline stats then?

Oh yeah, so we went through the proposal phase, where we listed the data that we got from the interviews and the surveys, and we had a solution section where we portrayed how these problems could be resolved. And we shared it across the engineering teams to get their feedback, what they feel, how we can make it better, and that was a really good thing because it went through multiple iterations and we got to know about most of the use cases. So one of the funny things about working with internal tools is that you don’t actually develop so much on the product itself.

It’s hard to cover all the edge cases that folks face. So I think the proposal helped us get more perspective from the wider audience and consider the use cases that we had probably missed. So we were able to get that, went through multiple iterations, and eventually came up with a solution that solved the problems of, if not all, at least 50% of the engineers, and then we kept the remainder for later as we made progress and iterated from there.

That’s awesome. Well, let’s get to the exciting part, at least for some listeners, which is like, “Let’s get into the nitty gritty of how you guys created remote development environments.” So as I understand it, the developer experience of your solution kicks off with a simple CLI command, right? You basically start, create a new remote development environment. So walk us through how this works.

Great. Cool. So we wanted to make it very simple for our developers to get started, especially considering folks who are onboarding and new, who don’t know anything about Slack. It could be overwhelming to install and do setup, especially in a limited amount of time where everybody is doing the same thing in parallel. So we wanted to make it simple, and Slack has this bunch of Slack CLIs, which have been working for the longest time and have been very fruitful. Engineers are already used to using those Slack CLIs. So we wanted to keep it as easy as adding a new utility in the CLI bucket. So now, engineers can just run the Slack remote dev command and pass in the branch they want to work on, and this doesn’t even require cloning the web app repo on your local laptop. With the basic setup, such as all the security setup that you have on a Slack laptop, you can go ahead and start working on a remote development environment.

So you’ll run your command with the branch that you want to create, and if the branch does not exist on the remote, it’ll create one for you. It’ll then find an available environment for you and check out your branch in that environment. Once you have a reserved environment, it runs all the setup that is required for the user to work on the environment. The setup includes the VS Code server setup, GitHub setup, and any other local configuration that the user needs to put on the remote environment, which basically means things like bash profiles, which make the remote environments more familiar to the users. They could literally do that as well.

Once that is done, when the entire setup is done, it just opens up a VS Code IDE instance connected to the remote environment with all the VS Code configuration that is needed for web development, and that’s it. You can start making your changes. You have a live server, the same environment is used as a live server, so you can see your changes right away on the same machine. So yeah, that’s the basic workflow that people can use now for remote development.
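
As a rough illustration of the flow just described, the steps can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not Slack's actual CLI: the names Environment and start_remote_dev, the step labels, and the in-memory lists standing in for the remote and the environment pool are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """A hypothetical remote environment with web app already checked out."""
    name: str
    branch: str = "master"
    setup_done: list = field(default_factory=list)


def start_remote_dev(branch, remote_branches, free_environments):
    """Mirror the described steps for a single CLI invocation (illustrative only)."""
    # 1. Create the branch on the remote if it does not exist yet.
    if branch not in remote_branches:
        remote_branches.add(branch)

    # 2. Find an available environment and reserve it for this user.
    if not free_environments:
        raise RuntimeError("no remote environment available")
    env = free_environments.pop(0)

    # 3. Check out the requested branch in that environment.
    env.branch = branch

    # 4. Run the per-user setup steps (hypothetical labels).
    for step in ("vscode-server", "github", "user-dotfiles"):
        env.setup_done.append(step)

    # 5. A real tool would now open the IDE connected to env; here we return it.
    return env


if __name__ == "__main__":
    branches = {"master"}
    pool = [Environment("env-1"), Environment("env-2")]
    print(start_remote_dev("my-feature", branches, pool))
```

The point of the sketch is the ordering: the branch is resolved first, an environment is reserved second, and per-user setup runs only after the reservation succeeds.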

Well, that’s incredible. So every new sort of feature branch has its own sort of instance of everything running in the cloud that you’re connected to from your local VS Code instance?

Yeah.

Well yeah, that sounds simple, but what were the most difficult parts of building this?

Actually, the infrastructure is the most exciting part of it: how we have this infrastructure ready for users, available instantly on demand, right? A great job was done on the scalability of these environments, and we wanted to keep the time it takes to get an instance to a minimum. So for example, behind the scenes, we have this pool of instances that are always available and up to date with the latest web app repo master, so that folks don’t have the friction of waiting to pull the latest code there. So we use an AWS Auto Scaling group (ASG), and we have N number of instances always running and available. So as soon as there is a new request, we reserve an environment for the user. We take that environment out of the ASG pool, and the ASG automatically spins up a new environment to maintain the size of the pool again.
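
The warm-pool idea can be illustrated with a small in-memory model. This is a hedged sketch, not the AWS API and not Slack's code: WarmPool, reserve, and the instance naming are hypothetical, and the refill happens instantly here, whereas a real scaling group takes time to boot a replacement.

```python
import itertools


class WarmPool:
    """In-memory stand-in for a scaling group that keeps N instances ready."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self._ids = itertools.count(1)
        self.ready = []
        self.reserved = {}
        self._refill()

    def _refill(self):
        # Stand-in for the scaling group launching replacements until the
        # pool is back at its desired size.
        while len(self.ready) < self.desired_size:
            self.ready.append(f"instance-{next(self._ids)}")

    def reserve(self, user):
        # Take a ready instance out of the pool for this user...
        instance = self.ready.pop(0)
        self.reserved[user] = instance
        # ...and let the pool replace it so the next request is also served
        # from an already-running instance.
        self._refill()
        return instance


if __name__ == "__main__":
    pool = WarmPool(desired_size=3)
    print(pool.reserve("alice"), pool.ready)
```

Because the user is handed an instance that was already running, the request itself never waits for a machine to boot; only the background replacement does.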

So this is kind of fascinating to me because it makes sure that the time it takes for a user to get a new environment is less than two minutes. So imagine you come in to work and get yourself a coffee. While you’re making yourself a coffee, you could get a new remote environment for yourself by the time you are back. So that’s one of the trickiest parts, because we didn’t want it to be a long process. We wanted it to be as smooth as switching a branch, pulling the latest master on your local and those kinds of things. However long those tasks take, we wanted to keep this comparable to that, and we worked really hard on hitting that time benchmark of keeping it under two minutes, including the VS Code setup and the GitHub setup. So we did a lot of parallelization and stuff in this whole process when we run the CLI so that it’s up faster.
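
To show why running independent setup steps in parallel shortens the wait, here is a minimal sketch using Python's standard thread pool. The step names and durations are invented for illustration, and sleep stands in for real work; nothing here reflects how Slack's CLI is actually implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(name, seconds):
    """Pretend to run one setup step; sleep stands in for real work."""
    time.sleep(seconds)
    return name


def run_setup_in_parallel(steps):
    """Run independent setup steps concurrently; return names in input order."""
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = [pool.submit(run_step, name, secs) for name, secs in steps]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Hypothetical steps; run one after another they would take about 0.6s.
    steps = [("vscode-setup", 0.2), ("github-setup", 0.2), ("dotfiles", 0.2)]
    start = time.perf_counter()
    done = run_setup_in_parallel(steps)
    print(done, f"{time.perf_counter() - start:.2f}s")
```

With steps that do not depend on each other, the total time approaches the duration of the slowest step rather than the sum of all of them.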

Well, that’s brilliant. I love that strategy of kind of keeping the warm instances ready so that they can be spun up on demand. I’m curious, just the developer in me. So you make a change then in VS Code, and then to actually see that change, like validate it in the web app, do you run another CLI command to restart the server, like the application there? Or does it auto kind of rebuild and restart, how does that work?

Very good question. There are two parts to it, and they can be split into back end versus front end. So for web app itself, there are dedicated teams, web infrastructure and front end teams, who manage all those build aspects. So we already have existing scripts and processes to run, which take care of this. For example, in backend, there is something called the autoload map that has to be rebuilt on every change, and we already had those scripts when folks used to develop on local. So when they used to sync their changes to the dev server, at the end it would rebuild that autoload map so that the changes were reflected. So we could use the same strategy. We have a Watchman trigger on our remote environments where, when a change is detected, it builds the autoload map again so that those changes are reflected.

And as soon as they make any changes, they will show on the server as well. Same with the front end: they had their own build watch script, which would continuously run the builds, and if any changes were detected, it would iterate on those changes and rebuild. So they could run the same thing on the remote environment. So when a front end change is made, the build watch will continuously iterate and do the builds. So one thing we benefited from in the local development workflow, and something we wanted to do, was deviate as little as possible from the previous workflow. That way it’s easier for developers to switch to a remote environment but still have the existing feel to it, so that they don’t have to remember new commands or think of new commands to run. That, I think, actually helped us remove the friction from switching, because making developers switch from a workflow that they have been using for a couple of years is a hard thing to do. So yeah, we wanted to make it less painful for them.
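
The rebuild-on-change pattern described in this answer can be sketched with the standard library alone. This is an assumption-laden illustration: it polls file modification times instead of using Watchman, and snapshot, changed_paths, watch_once, and the rebuild callback are hypothetical names rather than anything from Slack's scripts.

```python
import os


def snapshot(root):
    """Map every file under root to its last-modified time in nanoseconds."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            state[path] = os.stat(path).st_mtime_ns
    return state


def changed_paths(before, after):
    """Return paths that were added, removed, or modified between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


def watch_once(root, previous, rebuild):
    """Take a new snapshot and call rebuild(changes) if anything differs."""
    current = snapshot(root)
    changes = changed_paths(previous, current)
    if changes:
        rebuild(changes)
    return current
```

A caller would invoke watch_once in a loop with a short sleep, passing a rebuild function that regenerates whatever derived artifact depends on the changed files. A dedicated file-watching service avoids the polling cost, which matters for a repository as large as the one discussed here.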

That makes sense and so tell me about adoption. I mean I imagine this was such a drastic improvement. I mean is anyone using the old workflow or did everyone adopt the new remotes pretty quickly?

Oh, this journey has been very exciting for us. So we released a prototype in August 2021 to kind of put it out there, get feedback from the volunteers who agreed to kind of test and stuff like that, and we got a really good response from them, but we also got a lot of edge cases and use cases that we did not consider, but that helped us to smoothen the experience before we went for the GA release. So we took two months to make things better, fix bugs and help cover more use cases, and within two months, we saw 30% of our backend engineers had already switched completely to remote development, which was awesome. We didn’t expect that, but that was pretty awesome and then we decided that we are good to announce its GA release, and we did that in October 2021.

And since then, we did a bunch of surveys and got more feedback to learn how the adoption was going and what friction people were facing. And we got good feedback from users on some use cases too. For example, Emacs users were having issues because they’re not used to VS Code. They wanted support for Emacs, they wanted support for Vim, and we got that kind of feedback too. So it gave us the opportunity to support those as well on remote development environments. So we added those capabilities, and then by January 2022, we saw that 90% of developers had switched to remote environments. So from the general release happening in October, within three months, we saw over 90% of backend engineers completely switched to remote development, which was great for us. It blew us away. We did not anticipate that kind of response, but yeah, it was overwhelming.

Well, congratulations. That’s amazing and I must say, when you mentioned that you guys have Vim and Emacs support, I’m pretty impressed by that. That’s incredible. I’m curious as a whole, was this harder? Did this take longer? Was this kind of over budget against what you estimated as far as getting this to GA? Obviously you took an iterative approach, but was it more difficult or less difficult or exactly as anticipated to kind of get to GA with this solution?

Good question. So personally, from my perspective, when I got into exploring this project and the prototype version of it, we had a good part of the infrastructure already there for us, which we used for just sharing work with the live sync in the previous workflows. So the infrastructure part already existed and it was working great. So our focus was dedicated to how we bring these two together: the workflow of moving your code from local to the infrastructure, and then connecting VS Code and having all the capabilities in VS Code to feel like you’re local. So the tricky part of that was being able to make most of the developers happy. So we tried really hard to cover most of the use cases.

And I think to a great extent we did that, but we also missed many of those use cases, which we picked up as part of the feedback loop and iterated on. So we got it close to that 100% eventually. So from the prototype perspective, I really think we did a good job with doing the prototype earlier, because what happens is if it breaks a couple of times or people don’t like it a couple of times, they’re hesitant to give it another shot. So doing the prototype gave us the opportunity to fix most of the issues before it went to GA, and once it went to GA, I think the experience was much smoother.

People were much happier, and then adoption gradually increased after that because we started getting word of mouth from users. Even this morning, I had this one thread where an engineer responded to one of the features that we’d recently added, which basically persists your command history across remote environments. So when you switch to a new environment, it still maintains your command history, which basically means you don’t lose anything when you switch across different environments, and somebody just appreciated that feature, saying, “This is amazing.” So we’ve been hearing that, getting those kinds of responses, and people have been happy about it. So we’ve been getting a lot of word of mouth, which really helped us.

I’m curious, when you were starting out before building this solution, did you look at vendor solutions, like for example, GitHub Codespaces? I’m curious, how did you ultimately decide to build your own?

Very good question. So Codespaces, that’s actually a very interesting topic, because we looked at Codespaces when we were working on the prototype, and that’s when they also posted a blog about the entire GitHub team itself moving to Codespaces for their engineers. So one of the limitations, and it was the biggest limitation, was that they did not support self-hosted GitHub Enterprise, which is what Slack uses. They only had support for GitHub Cloud. So anyway, we couldn’t have used Codespaces even if we wanted to, but we also evaluated the case where, even if we decided to move to GitHub Cloud, the transition it would take to be able to run web app on Codespaces servers would be huge. For example, with the security aspects and the transitioning, it wouldn’t be a project of a few months.

It would take a little over a year to get it there. However, since we already had the infrastructure in place for remote development, we thought we could make it there much sooner and since this was a critical pain point and a struggle that every engineer was facing, we wanted to get a solution out sooner. So now since our remote development environments was only for web app and with the response, we are looking into supporting this for all repos across Slack. We are considering other alternatives like Codespaces and other options out there that could kind of help us do that. Maybe not for web app, but for other small self serve applications within Slack.

That’s awesome. Well, thanks for sharing that. I think that’s really helpful for context for other organizations at the beginning stages of evaluating this. Well, I’d love to know more about the impact this has had. We talked about adoption and some of the pain points at the beginning, but with 60% or more of code changes at Slack affecting, going through your solution now essentially, I imagine the impact has just been huge. So do you have again, some stats or data, just kind of the before versus after? I mean you mentioned one thing, a new engineer, it took him one and a half hours to get kind of started, and now I imagine that’s within 10 minutes. Do you have other kind of before and after metrics like that, that are compelling?

Yes. So one hard thing about this is that it’s really hard to measure these aspects on remote environments, because it depends on how users use them. If they did not have any metrics in the local setup, they wouldn’t have them in the remote. But for example, when we initially rolled this out, the volunteers who helped us explore the performance ran, let’s say, X tests on local versus X tests on remote, and we saw them to be much faster on remote. So that’s a huge boost, and I shared that in my blog too. There are these front end builds that we run for front end development, and they turned out to be much faster on remote environments compared to your local environment.

So those are a few metrics, but most of the stats are feedback-based, on how it improved the productivity of the developers, and that has been overwhelming. Everyone ranging from an engineer to a senior engineer to even a principal engineer has been reaching out to us, appreciating the product, and getting the feedback on how it has improved productivity has been really overwhelming. But stats-wise, in the limited measurement that we did with different performance variants, we’ve seen remote development to be faster in those aspects.

Were there any sort of secondary benefits of remote? I know you were focused on a lot of these pain points, like time spent debugging, time spent spinning up. Have there been secondary advantages you’ve heard of? Just the ability to maybe test things end to end easier because you have a more full environment? I’m just curious if there are kind of secondary benefits you’ve found from remote development in general.

Yeah, one of the things that I found a lot was it easily made context switching much easier for developers, right? So if you’re working on a big feature or as a product, it goes through its whole life cycle where people are reviewing, giving the feedback and they’re working on it and stuff like that. Initially, you would always have to work on a branch, put it out there, switch to another branch, do the work. Oh, there’s some feedback. Switch back to my branch, do the whole setup thing, run the build again and stuff like that, right? So in this case, engineers wouldn’t have to do that. They would just spin up the branch, do their stuff, keep it as it is and while they’re waiting for reviews or feedback, they switch another branch and work on another feature or bug they’re working on.

And if they want to switch back, they would just open another VS Code instance and get to it. So the context switching kind of reduced a lot. And one interesting thing I learned, which we never considered, was that one engineer posted feedback that they could work on the laptop on battery for two to three hours instead of an hour before, when things were running on local. That took us by surprise. I’m like, “We didn’t even consider that.” And I was like, “But that makes sense. If you’re not running so many intensive resources on your laptop, it’s not consuming that much power.” So they could freely work more when they are mobile or do not have access to a charging point and stuff like that. So those are a few of the benefits that we discovered that were pretty nice to hear.

I love that. Well, those make a lot of sense and that last one’s funny. I was thinking developers could spend more time developing on the beach or something like that.

Especially with the pandemic, I think many people did.

Exactly. Well, this has been such an insightful conversation. I want to kind of conclude with a few questions, just sort of advice for companies out there that might be thinking about remote development environments in general. So, I mean in your view, I think most projects today kind of start off with a local development environment, that’s kind of still the default world. When does it make sense to maybe switch, or should everyone just switch today?

Yeah, so it all depends on how fast paced your organization is, right? For example, Slack is a super fast paced organization. They have a number of deploys on a daily basis. So there are more bugs, more fixes, more features going in. So if your organization is focused on pushing more changes and being really robust, this is a really good investment, and you can always optimize it in the future. You can make it better, for example by leasing those environments for a short period of time and then being able to come back and resume them, if you want to be cost effective. If that’s a factor for an organization of course, but yeah, if you want to increase productivity, this kind of solution reduces a lot of the friction that every engineer goes through, and I would say if that’s your focus, I would highly recommend remote development environments.

That makes sense and lastly, what advice would you have for maybe teams out there that are leaning towards kind of building their own remote development environment solution like you have. Maybe off the shelf solutions don’t quite have, maybe even pricing model, maybe build versus buy there doesn’t make sense for a cost standpoint, or maybe there’s just unique requirements or unique infrastructure kind of things going on that make it easier to do in-house. So what advice would you have for those teams?

Great question. So I would say, from my experience, when we started off with remote development, it was for web app, and we knew exploring outside options would be kind of complex and would take more time, but that doesn’t mean they’re not good. I would highly recommend them, depending on your needs and how big your team is for example, right? For us, being able to support web app has been fine, but supporting all repos across Slack is going to be overwhelming, and I don’t think we’ll be able to do that easily.

So that’s why we are considering more outside solutions like Codespaces or Bunnyshell. We are at a phase where we want to try something that provides the workflow, the orchestration and the maintenance for us. We might still use our own infrastructure, but with something that we don’t have to develop or fix ourselves, because it will be overwhelming if we maintain more IDEs and support more configurations. So it depends on what state your company is in and how many resources you have dedicated towards finding a solution, but if you are starting from square one, it’s definitely worth exploring existing options out there if they work for you with your requirements, and if not, then consider building something in house.

That makes sense. Well, Sylvestor, thanks so much for coming on this show. I really enjoyed this conversation. You got me really excited with a lot of the things you described, including the Emacs and Vim support for your solution. Thanks so much for coming on the show.

Yeah, thank you for having me. Appreciate you inviting me to do this talk. This has been really fun.