Weeks 26-29

Some parts of my last notes highlighted the fact that I’ve been working on reducing storage costs on PHPSandbox. I also highlighted how much cheaper it is to store data in AWS S3 compared to keeping it on EBS volumes, which is basically what we’ve been doing all this time. My main work over the past few weeks has been offloading data for notebooks that aren’t in use to S3 and bringing it back to the EBS volume when needed. This is highly motivated by the fact that we are running out of cloud credits. Nothing has scared me more in the last few weeks than imagining what happens if the credits actually run out.


The basic path to implementing this for PHPSandbox depends on first moving the actual data of all notebooks (not block data) to S3, then implementing a garbage collector that recycles storage by offloading actual data to S3, and finally reducing the EBS volume from 2.5TB to 500GB. All of these pieces have to be done in that order, at the very least so that there is no data loss.

Prior to these weeks, I implemented a backup to S3 for the notebooks, and it has been going quite well. The next step is the garbage collector, which will run every hour to find notebooks that are now inactive, shut them down, back up their data to S3, and then remove their local storage to reclaim the space for another active notebook. This should keep EBS storage usage fairly constant and ensure we’re only using what is needed.
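To make the shape of this concrete, here is a minimal sketch of what such an hourly pass could look like as a Laravel console command. The model, service, and column names (`Notebook`, `NotebookBackup`, `NotebookStorage`, `last_active_at`) are hypothetical, not the actual PHPSandbox code:

```php
<?php

// A minimal sketch of the hourly GC pass, with hypothetical model/service names.
// It only illustrates the overall shape: find inactive notebooks, shut them down,
// back them up to S3, then release their local (EBS) storage.

namespace App\Console\Commands;

use App\Models\Notebook;
use App\Services\NotebookBackup;
use App\Services\NotebookStorage;
use Illuminate\Console\Command;

class CollectInactiveNotebooks extends Command
{
    protected $signature = 'notebooks:gc {--inactive-hours=1}';
    protected $description = 'Back up inactive notebooks to S3 and reclaim local storage';

    public function handle(NotebookBackup $backup, NotebookStorage $storage): int
    {
        $cutoff = now()->subHours((int) $this->option('inactive-hours'));

        Notebook::query()
            ->where('last_active_at', '<', $cutoff) // assumes the activity tracking feeds this column
            ->whereNotNull('local_disk_path')       // only notebooks still holding local storage
            ->each(function (Notebook $notebook) use ($backup, $storage) {
                $notebook->shutdown();              // stop the running notebook first
                $backup->uploadToS3($notebook);     // persist its contents to S3
                $storage->release($notebook);       // remove local storage to reclaim EBS space
            });

        return self::SUCCESS;
    }
}
```

Scheduling something like this every hour from the Laravel scheduler is then a one-liner.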

This, however, also requires some sort of activity tracking across the board for all notebooks so we can tell which ones are inactive. Tracking activity through model updates was easy but not sufficient, since a notebook can be active without a model update. An example is when a user accesses the notebook through the preview link. That is activity, but it isn’t tracked through model updates - at least not yet.


During a conversation with one of my colleagues at work, an idea popped into my head on how to do this easily without doing anything in the Database (and can even choose to use the DB later). Implementing the idea was basically quite straightforward and achieved the goal also. Now we can know if a notebook is active for a given amount of time in the past, know all active notebooks in the given amount of time in the past and we can also mark activities easily. In fact, accessing the preview URL for a notebook now marks an activity.
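A rough sketch of one way this could work, assuming a Redis sorted set keyed by notebook id with the last-seen timestamp as the score; this is illustrative only, with made-up names, and not necessarily the exact implementation:

```php
<?php

// Assumed approach: a Redis sorted set where the score is the last-seen timestamp.
// mark() is called from anywhere a notebook shows signs of life (preview hit,
// model update, ...). All names here are hypothetical.

namespace App\Support;

use Illuminate\Support\Facades\Redis;

class NotebookActivity
{
    private const SET = 'notebook-activity';

    public static function mark(string $notebookId): void
    {
        Redis::zadd(self::SET, now()->timestamp, $notebookId);
    }

    // Was this notebook active within the last $minutes minutes?
    public static function isActive(string $notebookId, int $minutes): bool
    {
        $lastSeen = Redis::zscore(self::SET, $notebookId);

        return $lastSeen !== null && $lastSeen !== false
            && (int) $lastSeen >= now()->subMinutes($minutes)->timestamp;
    }

    // All notebooks active within the last $minutes minutes.
    public static function activeIds(int $minutes): array
    {
        return Redis::zrangebyscore(self::SET, now()->subMinutes($minutes)->timestamp, '+inf');
    }
}
```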

At this point, we were good to go with implementing the garbage collector (GC) itself, since all the pieces were in place. The first iteration of the GC broke some notebooks due to a few edge cases, and I had to rapidly push a lot of deploys during this period. I’m quite convinced the users who were on at night noticed this a lot. Before you realize it, there are already lots of broken notebooks around.

One of the issues some of these notebooks had was that inconsistent data (basically empty data) had been uploaded to S3 for them. Thanks to S3 versioning, we have a chance at recovering from the versions that are still valid, but we can only do this manually. Very manual!

We finally had to bring the server down for a while at night to back up all notebooks to S3 before running the GC across all of them, which reclaimed the initial space. The GC now flies and works consistently, and we can see the consistent results in the storage usage reported to us daily on Slack.


The last bit of this is to finally shrink the 2.5TB EBS volume we use to persist data locally. With the GC running every hour, we don’t need the 2.5TB any more. On AWS, increasing an EBS volume’s size is quite straightforward, unlike reducing it. In fact, there is no such thing as reducing it: you technically need to create a new volume, move all the data to the new smaller volume, shut down the EC2 instance (the server), detach the large EBS volume, and then attach the new smaller volume as the root volume.

For me, moving the data from the big volume to the small one was the most difficult part. I remember having to create new volumes about 7 times because of rsync. The GC had already reduced the data on the 2.5TB volume to a bit over 300GB, which should easily fit into a 500GB volume. Yet while copying that ~300GB across, the 500GB volume kept running out of space, which was baffling: how can 300GB of data not fit into a volume with a 500GB capacity? I eventually discovered it comes down to how rsync copies sparse files by default; it has to be told to preserve the holes by adding the --sparse flag. That was after trying a couple of other solutions. The most annoying part was having to do all of this while the server was down, so that we could be guaranteed no new data from users. Many of our users are active at night, which made it even more pressing; the urge to quickly finish and open the site back up was strong.

This was eventually completed and I was finally able to reboot the server. We are now down to a 500GB EBS volume. This should save us $150+ every month and is a good step towards cutting costs.

We can’t go back on the GC at this point, but notebook errors have been flying around in the error logs and users have been raising issues too. Most of these are due to inconsistencies in storage data or corrupted filesystems. I had to spend another significant chunk of time investigating why we have these issues and how to prevent/fix them. For every report of this kind, I need to check the backup on S3 and the notebook content stored in the DB to see whether they are consistent. The common case is that an empty backup gets uploaded to S3 for a notebook, and the next time the notebook pulls the backup down its contents aren’t there any more, so the notebook refuses to start.

After fixing a handful of these cases manually by downloading a consistent version of the backup to my laptop and then uploading it to replace the current inconsistent one, I decided to start writing some code to automate this going forward. First, I need to detect the issue ahead of the user when a notebook is about to be opened, and then dispatch a fix. The detector I implemented compares the total size of the notebook content in the database with the size of the S3 backup; if the size in the DB is greater than the one on S3, we know there is an issue, since the DB data is the ultimate source of truth (in our case).
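A simplified sketch of that check, assuming the backup lives as a single S3 object per notebook, with made-up class names and key layout:

```php
<?php

// Sketch of the consistency check: compare the notebook content size recorded in
// the DB with the size of the current backup object on S3. Names/keys are hypothetical.

namespace App\Support;

use Aws\S3\S3Client;

class BackupConsistencyChecker
{
    public function __construct(
        private S3Client $s3,
        private string $bucket,
    ) {}

    public function isConsistent(string $notebookId, int $dbContentSize): bool
    {
        $head = $this->s3->headObject([
            'Bucket' => $this->bucket,
            'Key'    => "backups/{$notebookId}.tar.gz", // hypothetical key layout
        ]);

        // The DB is the source of truth: if it holds more data than the backup,
        // the backup is considered inconsistent (e.g. an empty archive was uploaded).
        return (int) $head['ContentLength'] >= $dbContentSize;
    }
}
```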

The next step is to restore, for such a notebook, the specific version of the S3 backup whose size is greater than the data in the DB. In S3 terms, this means grabbing all the S3 versions of the backup (except the one that qualifies as authentic) and deleting them so that only the authentic one is left. If there is no authentic one, the algorithm raises an exception, which means we need to take a manual look. It was fun going through the AWS API docs to find a way to implement this. I got a sample working, and all I need to do to fix an affected notebook is run an Artisan command, which worked fine after trying it on a few other affected notebooks.
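A rough sketch of that clean-up with the AWS SDK for PHP, again with made-up names and key layout; it keeps the newest version large enough to be considered authentic and deletes the rest:

```php
<?php

// Sketch of the version clean-up: keep the most recent version that is at least as
// large as the DB data and delete every other version so it becomes current again.

namespace App\Support;

use Aws\S3\S3Client;
use RuntimeException;

class BackupVersionRestorer
{
    public function __construct(
        private S3Client $s3,
        private string $bucket,
    ) {}

    public function restore(string $key, int $dbContentSize): void
    {
        $result   = $this->s3->listObjectVersions([
            'Bucket' => $this->bucket,
            'Prefix' => $key,
        ]);
        $versions = $result['Versions'] ?? [];

        // Versions come back newest first, so the first one big enough is kept as authentic.
        $authentic = null;
        foreach ($versions as $version) {
            if ($version['Key'] === $key && (int) $version['Size'] >= $dbContentSize) {
                $authentic = $version['VersionId'];
                break;
            }
        }

        if ($authentic === null) {
            // No authentic version: this needs a manual look.
            throw new RuntimeException("No authentic backup version found for {$key}");
        }

        // Delete every other version so only the authentic one remains.
        foreach ($versions as $version) {
            if ($version['Key'] === $key && $version['VersionId'] !== $authentic) {
                $this->s3->deleteObject([
                    'Bucket'    => $this->bucket,
                    'Key'       => $key,
                    'VersionId' => $version['VersionId'],
                ]);
            }
        }
    }
}
```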

After some more manual trials of the solution, which was packaged as a dispatchable job, I started applying it in the places where the detector raises these issues. The outcome is that we now see fewer of these issues coming up. At least PHPSandbox now knows how to handle inconsistent storage.


After all this, I had the chance to get back to the Play project after a few weeks of not touching it. My team has been testing it and the final issues are being attended to. It’s gradually getting ready for release, and I can’t wait for PHP devs to start using it. More tools to improve the PHP developer experience.


Speaking of developer experience, I was finally able to get PHPSandbox working on a plain macOS setup without the need for Vagrant/Laravel Homestead. My friends and I have been running on Homestead for a while, and it keeps getting harder to use, especially now that most of us are on Apple silicon, which VirtualBox doesn’t support. The alternative, Parallels Desktop, is paid software ($100+/yr) whose license we can’t share between ourselves. I also personally dislike how much setup it requires and how much memory it consumes.

The final pieces I had to put in place were around the notebook storage system PHPSandbox uses, which isn’t compatible with macOS (because of a different filesystem approach). The bottom line was making the original storage system (based on directories/folders) the default in the macOS environment. The other part was getting Laravel Valet to work across the stack of individual services PHPSandbox depends on, e.g. WebSocket, Expose Tunnel, Vite, etc.

Using macOS as the default dev environment did come with a rough downside: we can’t test the storage system we use in production (block devices) on macOS. I’m still working on ways to make this a non-issue. One approach I’ve put forward is to keep the storage implementations pure and free of business logic. That means an implementation asked to create disk storage will only ever do that, and fail when it needs to; the business of checking whether the storage already exists is done by the caller a level above the implementation. This lets us accurately test the business logic, knowing it will work against any storage implementation.
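A small sketch of that separation, with made-up interface and class names:

```php
<?php

// The storage implementation stays pure (it only creates, and fails if it can't),
// while the caller a level above owns the "does it already exist?" business logic.
// Interface and class names are illustrative only.

namespace App\Storage;

interface NotebookStorage
{
    public function exists(string $notebookId): bool;

    /** Always attempts creation; throws if the storage cannot be created. */
    public function create(string $notebookId): void;
}

// Directory-backed implementation (the macOS default); a block-device implementation
// would satisfy the same contract in production.
class DirectoryStorage implements NotebookStorage
{
    public function __construct(private string $root) {}

    public function exists(string $notebookId): bool
    {
        return is_dir("{$this->root}/{$notebookId}");
    }

    public function create(string $notebookId): void
    {
        if (! mkdir("{$this->root}/{$notebookId}", 0755, true)) {
            throw new \RuntimeException("Could not create storage for {$notebookId}");
        }
    }
}

// The caller owns the business rule, so it can be tested against any implementation.
function ensureStorage(NotebookStorage $storage, string $notebookId): void
{
    if (! $storage->exists($notebookId)) {
        $storage->create($notebookId);
    }
}
```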

I’m happy I was able to achieve this for the rest of the team, as the whole setup is now quite straightforward and doesn’t need so many resources to get things running. It has also saved the team the $200+ that would have gone towards Parallels licenses.


After more trials, I finally got Phpactor to work as a language server. Since it’s written in PHP, I’m hoping we’ll be able to get more out of it in terms of doing some custom things. One challenging part of making this work was getting the StreamWriter and StreamReader to write appropriately in a ReactPHP context. In a Node.js context this is quite straightforward, as there are existing implementations for it. After going through the Phpactor codebase and comparing it with the JS implementations, I finally got a dirty solution working. My next steps are to work out how we want to use Phpactor in PHPSandbox and how to improve the language server ourselves where we want to add custom things.
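This isn’t the actual Phpactor wiring, just a toy sketch of the stream plumbing that makes this fiddly, assuming recent react/stream and react/event-loop versions: reading Content-Length framed messages from a ReactPHP readable stream and writing framed responses back out.

```php
<?php

// Toy illustration only: frame-by-frame reading of "Content-Length: N\r\n\r\n<json>"
// messages over ReactPHP streams. A real server would dispatch each payload to the
// language server's handlers instead of echoing it back.

require 'vendor/autoload.php';

use React\EventLoop\Loop;
use React\Stream\ReadableResourceStream;
use React\Stream\WritableResourceStream;

$stdin  = new ReadableResourceStream(STDIN);
$stdout = new WritableResourceStream(STDOUT);

$buffer = '';

$stdin->on('data', function (string $chunk) use (&$buffer, $stdout) {
    $buffer .= $chunk;

    // Keep extracting complete frames from the buffer.
    while (preg_match('/^Content-Length: (\d+)\r\n\r\n/', $buffer, $m)) {
        $headerLength = strlen($m[0]);
        $bodyLength   = (int) $m[1];

        if (strlen($buffer) < $headerLength + $bodyLength) {
            return; // wait for the rest of the body to arrive
        }

        $payload = substr($buffer, $headerLength, $bodyLength);
        $buffer  = substr($buffer, $headerLength + $bodyLength);

        // Echo the decoded request back, framed the same way.
        $response = json_encode(['received' => json_decode($payload, true)]);
        $stdout->write("Content-Length: " . strlen($response) . "\r\n\r\n" . $response);
    }
});

Loop::run();
```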