On one of our projects, users complained about a performance issue. They were experiencing loading spinners and timeouts when using the application.
When we evaluated the infrastructure, we concluded that we had to change it completely. This was challenging since the application consisted of many moving parts and databases, and the change would take some time to finish.
We wanted to fix the experience quickly, before the infrastructure change was completed. This would keep the users happy and buy us time to implement the change. Afterward, the quick fix could be removed.
Surprise! Batch jobs are resource intensive
The root cause of the performance issue was a batch job that could be triggered at any given time. During the job execution, the users should still be able to use the application. But since the job took up a lot of resources, the application became unresponsive.
All databases used by the application were hosted on one server, which created a bottleneck and was thus responsible for the unresponsiveness. This was also the reason we decided to ditch the current architecture and set up managed databases instead.
Let’s take a closer look at how the batch job system was implemented. At the start, one or more parent jobs were pushed to a queue and processed. These jobs were tasked with dividing up the big job into smaller chunks.
The logic in these parent jobs was pretty simple, and they could be processed almost instantly. The child jobs, on the other hand, could take some time to complete, as they were computationally intensive.
In turn, the child jobs pushed more jobs into the queue, dividing the work even more. At first sight, it looked like the jobs were sufficiently split up, but all of them still hit the same database server, and that bottleneck caused the performance issues.
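The fan-out described above can be sketched with a plain in-memory queue. This is a minimal illustration, not the project's actual job classes: the `SplQueue` stand-in, the array shape of a job, and the chunk sizes are all assumptions made for the example.

```php
<?php
// Hypothetical in-memory stand-in for the real queue.
$queue = new SplQueue();

// A parent job only describes the work; splitting it is cheap.
$queue->enqueue(['type' => 'parent', 'items' => range(1, 10)]);

$processed = [];

while (!$queue->isEmpty()) {
    $job = $queue->dequeue();

    if ($job['type'] === 'parent') {
        // Parent jobs divide the big job into smaller chunks.
        foreach (array_chunk($job['items'], 4) as $chunk) {
            $queue->enqueue(['type' => 'child', 'items' => $chunk]);
        }
        continue;
    }

    if (count($job['items']) > 2) {
        // Child jobs that are still too large push more jobs into the queue.
        foreach (array_chunk($job['items'], 2) as $chunk) {
            $queue->enqueue(['type' => 'child', 'items' => $chunk]);
        }
        continue;
    }

    // Small enough: this is where the expensive work (and the database load) would happen.
    foreach ($job['items'] as $item) {
        $processed[] = $item;
    }
}

sort($processed);
```

Every split is cheap, but each leaf job still ends up on the same database server, which is why splitting alone did not remove the bottleneck.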
Let’s break the system (on purpose)
The first step in figuring out how to patch the problem was reproducing it. We simulated the data from production in a staging environment and started testing. Sure enough, we could reproduce loading spinners and an unresponsive application.
Moreover, this was reproducible every time we triggered a large batch execution. If the batch job was bigger than a certain threshold, it would cause the application to be unresponsive. Now, we had a reliable way to reproduce and trigger the issue. It was time to start testing out potential solutions.
Go slow to go fast
All previous attempts to fix the issue focused on speeding up the processing of the jobs. There were ten workers to handle the message jobs and many API servers to handle the web requests, but none of these completely fixed the problem.
That’s why we slowed the process down, hoping the database server would have time to handle web and job requests simultaneously. Because the system ran its queues on Laravel with Horizon, we could leverage Laravel’s built-in rate limiting for queued jobs.
You can use this functionality by making just two minor changes. In the AppServiceProvider, you need to define the rate limiter for the job:
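    use Illuminate\Cache\RateLimiting\Limit;
    use Illuminate\Support\Facades\RateLimiter;

    // In AppServiceProvider::boot(). The `per_minute` key is illustrative; it sits next to `from_job_size`.
    RateLimiter::for('your_job', function (YourJob $job) {
        return Limit::perMinute(config('queue.rate_limiting.your_job.per_minute'))
            ->by($job->groupingId);
    });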
On the job itself, you also need to configure middleware:
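    use Illuminate\Queue\Middleware\RateLimited;

    public function middleware(): array
    {
        // Small jobs skip the rate limiter entirely.
        if ($this->jobSize < config('queue.rate_limiting.your_job.from_job_size')) {
            return [];
        }

        // Larger jobs are throttled by the 'your_job' limiter defined above.
        return [new RateLimited('your_job')];
    }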
As you can see, this is very easy to implement.
Since the performance issues only occurred if the job was bigger than a given threshold, we added two minor tweaks.
First, using the `by` method lets you define the group of jobs that should be throttled together. If you need to execute a job for a big account, you can key the limit on that account so that only its jobs are limited. Jobs for other accounts are not held back by it.
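To make the grouping concrete, here is a small standalone sketch of a keyed limiter. It is not Laravel's implementation: the `KeyedLimiter` class is hypothetical, it models a single time window only, and the limit of two is a placeholder.

```php
<?php
// Hypothetical limiter that counts attempts per key within one time window.
final class KeyedLimiter
{
    /** @var array<string, int> attempts used so far, per key */
    private array $hits = [];
    private int $maxPerWindow;

    public function __construct(int $maxPerWindow)
    {
        $this->maxPerWindow = $maxPerWindow;
    }

    // Returns true if the job may run now, false if its group is over the limit.
    public function attempt(string $key): bool
    {
        $used = $this->hits[$key] ?? 0;
        if ($used >= $this->maxPerWindow) {
            return false;
        }
        $this->hits[$key] = $used + 1;
        return true;
    }
}

$limiter = new KeyedLimiter(2);
$allowed = [];

// Jobs arrive for two accounts; each account has its own counter.
foreach (['big', 'big', 'big', 'small', 'big', 'small'] as $account) {
    $allowed[] = $limiter->attempt($account);
}
```

The third and fourth jobs for the `big` account are refused, while both `small` jobs go through. In Laravel, a job that exceeds its limit is released back onto the queue to be retried later rather than dropped.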
Secondly, we only introduced rate limiting on the job when the job size was bigger than a given threshold. We also made the values for the job size and rate limit configurable. This allowed us to quickly adjust the settings without a deployment.
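One possible layout for these settings is sketched below. It is an assumption, not the project's actual configuration: the `per_minute` key, the environment variable names, and the default values are placeholders. With this layout, the limiter would read `queue.rate_limiting.your_job.per_minute` and the middleware `queue.rate_limiting.your_job.from_job_size`.

```php
<?php
// config/queue.php (excerpt, hypothetical keys and placeholder defaults)
return [
    // ...existing queue configuration...

    'rate_limiting' => [
        'your_job' => [
            // Maximum number of jobs per minute for each group.
            'per_minute' => (int) env('YOUR_JOB_RATE_LIMIT_PER_MINUTE', 30),
            // Only jobs at least this large are rate limited.
            'from_job_size' => (int) env('YOUR_JOB_RATE_LIMIT_FROM_JOB_SIZE', 1000),
        ],
    ],
];
```

Because queue workers are long-running processes, they typically need a restart (for example with `php artisan horizon:terminate` or `php artisan queue:restart`) before they pick up changed values.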
We tested the fix on staging and found that the loading spinners were gone. Soon after, we deployed it to production, and no further reports came in.
What would Treebeard do (WWTD)
Sometimes it makes sense to think outside the box and look at a problem from different angles. When the only solution seems to be speeding things up, it might be a good idea to try taking it slower instead.
I’ll leave the last word to a character from The Lord of the Rings: “Now, don’t be hasty, Master Meriadoc.”