Week 13: Fire Fighting

Started the week off resting, still bent on not touching code at all, so I got to watch Superman and Lois - I've waited all this while to watch the second season. It seems quite interesting so far. Mercy and I also got to watch some anime most nights; I bet she likes it now.


I had the opportunity to talk to another user this week, and as usual, it started with doing support - showing him how to use one of the features. I can see why some companies recommend their staff spend some time in support. Speaking to the user made me realize that some things might not be as obvious as you think; you just have to be explicit anyway.

I got to know he's a CodeSandbox user who wished CodeSandbox had PHP support - that's how he found out about PHPSandbox.


I finally resumed work on Wednesday. The early hours were peaceful, until later in the evening when issues started happening on PHPSandbox.

First, it was our 2T EBS (Elastic Block Store) volume that ran out of space. This volume hosts the source code of the notebooks (projects) on PHPSandbox. What we usually do is simply increase the volume size on AWS; that keeps costs down since we only pay for the storage we actually need. We've been growing it incrementally like this for about two years and it worked fine, until Wednesday, when it didn't. growpart complained it couldn't resize our partition beyond 2T because the disk uses an MBR partition table.

I went digging quickly, only to see that this is by design: MBR uses 32-bit sector addresses, so with 512-byte sectors a partition tops out at roughly 2 TiB. The design isn't in our favour this time, and we needed to act fast as the 500 errors had started popping up.

We could try to patch this somehow (as I saw in some random online posts), but those solutions aren't by design, and getting support for an unrecommended setup could be pointless. We could also attach another EBS volume on AWS in less than a minute, but the application code didn't make room for that ... yet.

My colleagues and I eventually settled on deleting our own non-essential notebooks to make room for users that evening (most of our users are in the US, and they come online at night 😭). We cleared our notebooks and reclaimed about 60+ GB. Then we started looking for other ways to solve the problem, since that stopgap might not last beyond the night.

After weighing our options, I could see the best shot we had was making the application code support multiple disk locations on the server. Because honestly, I think the 2T EBS has tried 😆.
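
For the curious, here's a rough sketch of what "support multiple disk locations" could look like. This isn't our actual implementation - the class, paths, and threshold are made up for illustration - but the core idea is to stop assuming a single root: pick whichever mounted volume still has room when creating a notebook, and check each volume when resolving an existing one.

```php
<?php

// Hypothetical sketch of the multi-disk idea. Mount points and the
// free-space threshold are invented for illustration.
final class NotebookStorage
{
    /** @var string[] Mount points of the attached EBS volumes. */
    private array $roots = [
        '/mnt/notebooks-a',
        '/mnt/notebooks-b',
    ];

    /** Minimum free space (bytes) a volume must have to accept new notebooks. */
    private int $minFreeBytes = 50 * 1024 * 1024 * 1024; // 50 GB

    /** Choose the volume with the most free space for a new notebook. */
    public function pickRoot(): string
    {
        $best = null;
        $bestFree = 0.0;

        foreach ($this->roots as $root) {
            $free = @disk_free_space($root);
            if ($free !== false && $free > $this->minFreeBytes && $free > $bestFree) {
                $best = $root;
                $bestFree = $free;
            }
        }

        if ($best === null) {
            throw new RuntimeException('All notebook volumes are full.');
        }

        return $best;
    }

    /** Resolve an existing notebook by checking each volume in turn. */
    public function pathFor(string $notebookId): ?string
    {
        foreach ($this->roots as $root) {
            $candidate = $root . '/' . $notebookId;
            if (is_dir($candidate)) {
                return $candidate;
            }
        }

        return null;
    }
}
```

The real version also has to think about moving notebooks between volumes and recording which volume each notebook lives on, but the main shift is just not hard-coding one path.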


I cooked up a PR that night to clear the way for the multi-EBS solution, and on merging it, PHPSandbox went down. Cloudflare requests were failing across the board. 502s everywhere! Even after rolling back the deployment, the server was still down.

After some investigation, I noticed PHP-FPM was timing out for some reason. Even after more research, I still couldn't figure out the cause. I was desperate to bring PHPSandbox back up that night, and eventually pulled the plug on the server (after almost a year of uptime on this server, I shut it down). Even after bringing it back up, no request to the backend would go through.

This was the time when the region where most of our users live was awake, and it happened just after I had posted a link to PHPSandbox in a Twitter conversation. So it wasn't a good time for PHPSandbox to be down at all.

After chasing this till around 4 AM, knowing I still had to wake up for work in the morning, I decided to give up. I rolled back to the last release I knew was working fine, put the backend in maintenance mode, and turned on Cloudflare's Under Attack Mode (I did have a fleeting thought that this could be an attack, since even after rebooting the server all the PHP-FPM workers were just busy with requests 🤔). I went to bed angry and tired.

Moments before resuming work in the morning, I re-enabled everything I had turned off before going to bed. I was surprised to see it all working as if nothing had happened. I inspected PHP-FPM again and the workers were back to normal. I still don't know what happened. Was it going to resolve itself eventually?
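
One thing this pushed me towards (a sketch only, nothing we've shipped): a small watchdog that polls PHP-FPM's status page, so that next time we'd at least know how saturated the pool is before the 502s start. This assumes pm.status_path = /status is enabled in the pool config; the URL and threshold below are made up.

```php
<?php

// Hypothetical PHP-FPM watchdog. Assumes the pool exposes its status page
// at /status (pm.status_path) and that it's reachable from localhost.
$statusUrl  = 'http://127.0.0.1/status?json';
$alertRatio = 0.9; // warn once 90% of the workers are busy

$raw = @file_get_contents($statusUrl);
if ($raw === false) {
    fwrite(STDERR, "Could not reach the PHP-FPM status page\n");
    exit(1);
}

$status = json_decode($raw, true);
$active = $status['active processes'] ?? 0;
$total  = $status['total processes'] ?? 1;

if ($active / $total >= $alertRatio) {
    // In production this would page someone instead of just printing.
    echo sprintf("PHP-FPM pool nearly saturated: %d/%d workers busy\n", $active, $total);
}
```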


As a founder, it's sometimes painful to start your week calmly and then have to be on your toes for the rest of it. You can go from a functioning app to a dead one pretty fast. I started the week as relaxed as I wanted, but it was short-lived. Looking back, I see this cycle is quite consistent. I assume other founders share the same story.

There's also something to be said for wearing many hats. Being able to administer our server, dig into every part of the code to hunt for issues, and communicate with the users who have questions at times like this all played their part.

I can also see how, over time, so many things about PHPSandbox have gotten better because we ran into one issue or another. The role of real users in a product's journey towards getting things right cannot be overemphasized.


FYI: We still don’t know what caused the outage. But we’ve been adding more bells and whistles to make this less annoying if it happens again.