If you have a really simple PHP application that you deploy to a single server, deploying it basically boils down to transferring the source code to the server, one way or another. Maybe you also clear OPcache, if you have it enabled.

If your application is more complex and is composed of different components, including queue workers, deployment gets more interesting. With the client teams I get to work with, I have touched on the subject of avoiding downtime and progressively rolling out features quite a few times.

To give a simple example, suppose you are trying to use staged migrations where you’d: (i) deploy your database migration first, then (ii) deploy the new code that makes use of the new schema and finally, (iii) perform a database migration that does some type of clean-up.

It’s a bit ironic that one of my deployments experienced a related issue, albeit without downtime. Here’s the story, so you can learn from my mistake.

An investigation into the failed jobs

The squad I worked with was responsible for implementing a faster algorithm for a calculation that was triggered multiple times over the course of a day. Although the feature was thoroughly tested (both through automated test coverage and the QA process in place), we were very careful with the deployment, as this feature was at the core of the product. As such, the new implementation had been deployed and had already been running for a few days, in parallel with the old implementation.

After confirming we were very happy with the new implementation, we flipped the feature flag and from that point on a core product feature started depending on the values computed by the new implementation.

The next day, while casually checking the failed jobs queue to make sure everything was running smoothly, I noticed a few related queued jobs had timed out. I was rather surprised by that, because on average they were running in 1 – 1.5 seconds each. I opened the Laravel Horizon dashboard and noticed some of the jobs took really long to run (17 seconds, 30 seconds, etc.).

This particular calculation takes date ranges as input, but a business rule states they should be in the future. After inspecting the parameters, I noticed some of the date ranges were in the past.

It made no sense to compute those, but apparently the legacy code was issuing events that triggered pretty useless calculations for historic periods. In fact, some of the time ranges spanned a full year, which is when the job ran for 30 seconds or so.

Computing future time ranges (even for a year) is a relatively lightweight operation compared to computing the time range of the previous year, which is usually full of data. On one hand, I was happy to know we could compute such a long time range in a pretty short time, but on the other hand, it was a pure waste of resources, so I decided to fix this.

Adding a fix to the job’s command handler

I created a ticket and immediately started to work on the bug fix. As always, I began by adding failing tests which I would then make pass by adjusting the applicable code. It boiled down to the following:

  • test past time ranges are ignored
  • test current time ranges trim the past part
  • test can override to compute past time ranges

The last test was related to the fact that we have administrative Artisan CLI (Command Line Interface) commands that allow us to issue a computation for a time range that starts in the past and spans a year into the future (this was enabled to generate historic data for customers).

I thought about how best to do this and initially landed on not dispatching these jobs at all. However, I didn’t want to dive into the old part of the codebase (we are intentionally trying to keep new code separate from the old), so I decided to go for a simple solution: keep dispatching the job and decide whether it should actually be computed when it is picked up by the queue worker.

In case you wonder “hey, is this not wasting resources?”: such a no-op job completes in roughly 0.01 seconds, and considering we are breaking the old app into bounded contexts with proper test coverage, one day these jobs won’t be dispatched at all. That is why I decided to KISS (Keep It Simple, Stupid).

I implemented this by adding an extra flag to the job constructor that allows computing time ranges which include the past, and I set it to false by default.

// ComputeFoobarCommand

public function __construct(
    // Promoted public properties so the handler can read (and trim) them.
    public int $groupId,
    public CarbonImmutable $startDateTime,
    public CarbonImmutable $endDateTime,
    public bool $computePast = false,
) {
}
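
For context, the regular code path relies on the default, while the administrative Artisan command opts in explicitly. Roughly (the actual dispatch call sites are paraphrased here):

// Regular code path, triggered by domain events: the default (false)
// keeps past-only ranges from being computed.
dispatch(new ComputeFoobarCommand($groupId, $start, $end));

// Administrative Artisan command generating historic data: opt in explicitly.
dispatch(new ComputeFoobarCommand($groupId, $start, $end, computePast: true));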

Next, in the job handler, I added the following method, which I call before I actually handle it.

// ComputeFoobarCommandHandler

protected function shouldCompute(ComputeFoobarCommand $command): bool
{
    if ($command->computePast) {
        // This is an override for dispatching the command from CLI.
        return true;
    }

    if ($command->endDateTime->isBefore($this->clock->now())) {
        return false;
    }

    if ($command->startDateTime->isBefore($this->clock->now())) {
        // Override startDateTime to avoid computing the past part.
        $command->startDateTime = $this->clock->now();
    }

    return true;
}

With that, we make sure we don’t compute past values unless the job is coming from CLI.
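
For completeness, the handler’s entry point simply short-circuits on this check. A rough sketch (the handle() method name and the elided computation are assumptions on my part):

public function handle(ComputeFoobarCommand $command): void
{
    if (! $this->shouldCompute($command)) {
        // No-op: completes in roughly 0.01 seconds, as mentioned above.
        return;
    }

    // ... perform the actual computation for the (possibly trimmed) range
}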

⚠️ Stop reading here for a minute. Can you see the problem already?

Complex job handling

In a distributed system, you have many components that work together. When you make a deployment, nothing is paused while all of the components get updated: the application components keep running and the update is progressively rolled out.

After going through code review and pull request approval, I deployed the fix. This issue would not have been caught locally, or in Testing or Acceptance, unless those environments were under the kind of load (and had the kind of queue backlog) that Production was experiencing.

Sentry caught the following problem on Production:

Typed property ComputeFoobarCommand::$computePast must not be accessed before initialization

You can read more about PHP’s “typed property must not be accessed before initialization” error if you haven’t come across it before.

I quickly looked at my job constructor: $computePast had a default value. I was confused as to what could be the source of the problem. Then it struck me.

⚠️ The deployment had completed. However, Redis still had the original jobs queued up. When the newly deployed queue handler started picking up old jobs from the queue, it blew up at if ($command->computePast), because that property did not exist on the old jobs deserialized from Redis.
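
If you want to see this failure mode in isolation, here is a rough reproduction outside of any queue (simplified on purpose; Horizon’s actual payload wraps more than the bare serialized job):

use Carbon\CarbonImmutable;

// Simulate a job serialized by the *old* release, i.e. one whose payload has
// no $computePast. unset() reverts a typed property to the uninitialized
// state, and serialize() simply omits uninitialized properties.
$job = new ComputeFoobarCommand(1, CarbonImmutable::now(), CarbonImmutable::now()->addDay());
unset($job->computePast);
$payload = serialize($job);

// A worker running the *new* code deserializes it...
$oldJob = unserialize($payload);

// ...and blows up with "Typed property ComputeFoobarCommand::$computePast
// must not be accessed before initialization".
if ($oldJob->computePast) {
    // never reached
}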

The event count in Sentry was growing, so I had to act fast. There was no downtime, but the calculated values were drifting out of sync with reality. I immediately made a release with a hotfix that added an early return true; statement right before the failing line.

✅ This allowed the queue workers to correctly process all of the old jobs and drain the queue.

After confirming in Horizon there were no jobs left on the queue, I checked Sentry to ensure there were no more reports of the issue. I removed the hotfix early return and made another release followed by another check in Sentry just to be sure.

Avoiding failed jobs when deploying to distributed systems

How can you prevent this issue from happening when you update a queued job’s source code?

❌ Wait for all queues to complete before deploying code.

That’s not an option, because these jobs would have kept being dispatched from the live system, so you could wait forever. This only works if you have a low-traffic application and can put it into maintenance mode (php artisan down, drain the queues, deploy), which is not an option for respectable large-scale applications.

❌ Block the job from being dispatched.

Again, that basically boils down to having a scheduled maintenance window, because blocking dispatches would limit the capability of the application; see my previous point on a maintenance window.

✅ Don’t assume the job passed to your handler immediately matches the newly updated job class.

It is almost like versioning your job, but only for a relatively short period of time (until the queue workers have processed all of the existing jobs on the queues). Albeit a counterintuitive assumption to make (I updated the class constructor, right? RIGHT?!), I could have assumed my ComputeFoobarCommand class would not immediately match the new class definition. Then I would have avoided accessing an uninitialized class property on old serialized jobs waiting in the queue. You can use reflection, create a brand new job class and deprecate the old one, or version your job class, to name a few solutions.
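
To make the reflection option concrete, here is a minimal sketch of a defensive version of the earlier check that tolerates jobs serialized by the previous release (the fallback value is an assumption matching the constructor default):

protected function shouldCompute(ComputeFoobarCommand $command): bool
{
    // Jobs serialized before the new release have no $computePast in their
    // payload, leaving the typed property uninitialized. Fall back to the
    // default instead of letting the property access throw.
    $computePast = (new \ReflectionProperty($command, 'computePast'))->isInitialized($command)
        ? $command->computePast
        : false;

    if ($computePast) {
        // Override for commands dispatched from the CLI.
        return true;
    }

    // ... the date-range checks from earlier remain unchanged
    return true;
}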

Generalizing the problem of backwards compatibility

There is more to learn here. There are many types of changes where you have to be careful while doing progressive rollout to maintain backwards compatibility.

❌ An application with auto scaling behind a load balancer may effectively be in 3 states during a rollout:

  • instance(s) with old code
  • instance(s) where the application is being deployed to (detached from the load balancer and not serving live traffic!)
  • instance(s) with the new code already

As you can see, a consumer of an API exposed by this application cannot assume whether the instance responding will have the new functionality or not. It could be either an old instance or a new instance responding.

❌ How about removing a column from a table?

Let’s assume you have 2 instances of an application serving web traffic, as well as a pull request that contains both a feature and a database migration that removes a column from a table.

If the database migration runs while the first instance is being deployed and a request comes in to the second instance, you have a problem: the old code on the second instance could still depend on the now non-existent column (while the first instance is finishing its deployment and has not yet been attached back to the load balancer).
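
A common way around this is the staged approach from the beginning of this post: first ship code that no longer reads or writes the column, and only once every instance runs that code, ship the migration that drops it. A minimal sketch of that final clean-up migration (the table and column names are made up for illustration):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        // Only safe once no running instance references legacy_total anymore.
        Schema::table('orders', function (Blueprint $table) {
            $table->dropColumn('legacy_total');
        });
    }

    public function down(): void
    {
        Schema::table('orders', function (Blueprint $table) {
            $table->decimal('legacy_total', 10, 2)->nullable();
        });
    }
};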

These are in fact fairly simple examples, but you can see how deploying to a distributed system without downtime (or without intermittent issues limited to the deployment window) is not an easy feat, even in a relatively simple web application.

This is similar to maintaining backward compatibility for a mobile app you may have published that depends on your existing API version. But here it’s on a system component level and pertains to how these components keep interacting while your deployment is ongoing.

When staggering deployments with zero downtime, remember to maintain backward compatibility with other parts of your system that have not been updated yet (or that even actively keep processing, like queue workers, until their queues are drained).

Keep your pull requests focused and don’t assume your code immediately gets deployed to all application servers or containers at once. In fact, it helps if you assume there is a large delay while the (usually auto-scaled) instances get gradually updated, and think about how that affects your system as a whole.