A client project I work on required some processing, scheduled nightly at 2am. I’m a curious creature by nature, so I was eager to learn what kept the queues busy for up to 30 minutes in the middle of the night. It must have been something important!

The business domain

Allow me to introduce you to the context first. The client manages a number of resources. A resource is a set of unique items, which in turn can be assigned to groups. Group utilization reflects how many of the items are in use at a given point in time.

An item can be used by a customer. At any point in time, it is critical for the client’s product to correctly track the utilization of a group, such that no group is ever overutilized. Overutilization of a group would give multiple customers the right to use the same item at the same time. Not a good situation to be in!

Processing utilization jobs at 2am

In the middle of the night, the processing would kick off. It tracked the start and end of each usage as a datetime. Utilization was then calculated with a database query that checked how many items were in use.

Here’s how the original algorithm worked (a code sketch follows the list):

1. For each group, create a Laravel job that:
   2. For each datetime (start and end), creates a Laravel job that:
       3. Queries the number of items in use, and
       4. Updates or creates the group utilization
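
To make the fan-out concrete, here’s a minimal sketch of what the innermost job could have looked like. The class, table, and column names are my assumptions for illustration, not the client’s actual code:

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

// Hypothetical innermost job: one instance is queued per {group, datetime}.
class ComputeUtilizationAtDatetime implements ShouldQueue
{
    use Queueable;

    public function __construct(
        public int $groupId,
        public string $datetime,
    ) {}

    public function handle(): void
    {
        // Step 3: one query counting the items in use at this instant.
        $inUse = DB::table('item_usages')
            ->where('group_id', $this->groupId)
            ->where('start_date', '<=', $this->datetime)
            ->where('end_date', '>', $this->datetime)
            ->count();

        // Step 4: update or create the utilization row for {group, datetime}.
        DB::table('group_utilizations')->updateOrInsert(
            ['group_id' => $this->groupId, 'datetime' => $this->datetime],
            ['value' => $inUse],
        );
    }
}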

I should note here that throughout the day, group utilization values were modified, too. They were incremented or decremented whenever item usage was planned or canceled, as follows:

1. Query the group utilization at a given time (the value)
2. Update or create the group utilization with the value ±1

Did you catch it? Do you already know why the nightly processing from scratch was necessary?

The two-step list above was not run in a transaction; it’s a race condition. The group utilization values could not be trusted. In fact, these values weren’t trusted by the team, as group overutilization was a recurring problem.
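
To make the race concrete, here’s a minimal sketch of those two steps, with hypothetical table and column names. Two requests that interleave here both read the same value, and one increment or decrement is silently lost:

use Illuminate\Support\Facades\DB;

$groupId = 190;
$datetime = '2022-06-18 16:20:00';

// Step 1: read the current value (no lock, no transaction).
$current = DB::table('group_utilizations')
    ->where('group_id', $groupId)
    ->where('datetime', $datetime)
    ->value('value') ?? 0;

// <-- a concurrent request can read the same $current right here

// Step 2: write back the value ±1; the later write wins.
DB::table('group_utilizations')->updateOrInsert(
    ['group_id' => $groupId, 'datetime' => $datetime],
    ['value' => $current + 1], // +1 when planned, -1 when canceled
);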

Rough worst-case estimation of performance

For the sake of argument, let’s imagine there are 500 groups and that the biggest group has 250 items. Note that a group could be much bigger than that. Let’s assume a pretty minimal pattern of only 2 uses of each item per day.

What would be the database query count assuming said values?

Step 3 of the nightly algorithm runs 1 SQL query.
Step 4 of the nightly algorithm runs 2 SQL queries (the update-or-create performs a select followed by an insert or an update).

{group count} x {group items} x {2 uses per day} x {2 start & end per use} x {3 SQL queries}

500 x 250 x 2 x 2 x 3 = 1,500,000 database queries

And how many jobs issue those 1.5M database queries?

{group count} x {date times}

500 x (250 x 2 x 2) = 500,000 jobs

While 500k jobs isn’t that significant a number on a larger-scale project, the number of queries is rather high. Note that there is also some overhead associated with queueing jobs and establishing database connections, so our considerations are not limited solely to query execution time.

In practice, the start and end times of usage are not necessarily unique, so the above numbers could be slightly smaller, but we should always consider the worst-case scenario when talking about performance and load.

There are more problems to uncover here. First of all, there’s a period of time where the group’s utilization data is wiped before the newly computed values are inserted.

As a result, during the processing of group utilization (which was originally not a fast operation), overutilization could take place because wiping the data means no items from the group are reported to be in use!

Besides the daytime race condition mentioned earlier (the incrementing/decrementing of group utilization), the nightly processing also has a race condition of its own, between steps 3 and 4. Even if usage is infrequent during the night, an incoming customer request at the time of processing could trigger it.

(Not) Building on top of buggy algorithms

I lied to you a little at the start of this article. I was definitely very curious about the 2am nightly processing, but in fact I had also been tasked with adding a feature that affected group utilization. You could not only indicate usage of a single item within a group, but also mark a number of items as in use at once (say, 10 of them).

Because this affected group utilization in a new way, I did not feel confident adding the new feature on top of the existing legacy code that was buggy anyway.

I knew I needed a solution that would not only cover the new feature in a performant way but also incorporate the single item usage values. This would allow the computed values to correctly reflect group utilization.

There’s a better way

I’m a big fan of employing my algorithmic experience at work. Whenever there’s an opportunity to do that, I get excited. I had no doubt there was a better solution than the one described above, which felt very brute-force.

In fact, I knew of a possible solution because I had employed a very similar technique on a prior project at madewithlove (for overlapping text highlights). So the fun part of coming up with a new solution to a problem was out the window; all that was left was the implementation.

Coding the new performant algorithm

tl;dr: for each group, we make one sweep through a union of single- and multi-item usages and save the running total at every usage start and end time.

Whenever item usage is planned, updated, or canceled, we dispatch a queueable ComputeGroupUtilization command on the Laravel Bus with the appropriate date time range that needs recomputing.

In other words, whenever there’s a change in the system that affects group utilization, we kick off group utilization computation for the affected group and time range only.
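
Such a dispatch could look roughly like the sketch below. Only the ComputeGroupUtilization name comes from the actual code; the parameters are illustrative:

use Illuminate\Support\Facades\Bus;

// Recompute only the affected group, and only the affected time range.
Bus::dispatch(new ComputeGroupUtilization(
    groupId: $usage->group_id,
    from: $usage->start_date,
    to: $usage->end_date,
));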

For each item, we store a `datetime`, which is either the start or the end time of the usage, and we sort by `datetime` ascending. The `count` represents the utilization difference that the item’s usage causes, and is either +1 or -1 for single-item usage and +n or -n for multi-item usage, as shown below. This data comes from two different tables.

For each group, we make one database query with the following columns: `datetime` and `count`, a union of four subqueries:

// for single item usage
single.end_date as datetime, -1 as count
single.start_date as datetime, 1 as count

// for multi-item usage
multi.end_date as datetime, -multi.number_of_items as count
multi.start_date as datetime, multi.number_of_items as count
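
Assembled with Laravel’s query builder, the union could look something like this sketch (the two table names are assumptions):

use Illuminate\Support\Facades\DB;

$deltas = DB::table('single_item_usages')
    ->where('group_id', $groupId)
    ->selectRaw('start_date as datetime, 1 as count')
    ->unionAll(
        DB::table('single_item_usages')
            ->where('group_id', $groupId)
            ->selectRaw('end_date as datetime, -1 as count')
    )
    ->unionAll(
        DB::table('multi_item_usages')
            ->where('group_id', $groupId)
            ->selectRaw('start_date as datetime, number_of_items as count')
    )
    ->unionAll(
        DB::table('multi_item_usages')
            ->where('group_id', $groupId)
            ->selectRaw('end_date as datetime, -number_of_items as count')
    )
    // the order by applies to the union as a whole
    ->orderBy('datetime');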

How the solution works is actually simple (a code sketch follows the list):

  1. We fetch the initial group utilization value at the start of the range (that’s a separate database query per group),
  2. Using a cursor on said union query, we traverse the results and update a running total with the `count` value of each `datetime`, and
  3. We store the value of the running total as the utilization value for {group, datetime}.
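
In code, the sweep could look something like this minimal sketch, reusing the $deltas union query from above (how the initial value is fetched is also an assumption here):

use Illuminate\Support\Facades\DB;

DB::transaction(function () use ($deltas, $groupId, $from) {
    // 1. The utilization just before the recomputed range
    //    (the separate query per group).
    $runningTotal = DB::table('group_utilizations')
        ->where('group_id', $groupId)
        ->where('datetime', '<', $from)
        ->orderByDesc('datetime')
        ->value('value') ?? 0;

    // 2. Traverse the union results with a cursor, one row at a time.
    foreach ($deltas->cursor() as $row) {
        $runningTotal += $row->count;

        // 3. Store the running total for {group, datetime}.
        DB::table('group_utilizations')->updateOrInsert(
            ['group_id' => $groupId, 'datetime' => $row->datetime],
            ['value' => $runningTotal],
        );
    }
});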

Here’s a sample output, containing a primary key, a datetime, the group id, and the utilization value (the count column). These are the results for group id 190 over a period of 18 minutes:

| id    | datetime            | group | count |
|-------|---------------------|-------|-------|
| 10230 | 2022-06-18 16:20:00 | 190   | 78    |
| 10231 | 2022-06-18 16:25:00 | 190   | 92    |
| 10232 | 2022-06-18 16:30:00 | 190   | 90    |
| 10233 | 2022-06-18 16:38:00 | 190   | 84    |

Rough worst-case estimation, round 2

Let’s use the same values as before: 500 groups, up to 250 items each, 2 uses of each item per day.

What would be the database query count assuming said values?

{initial utilization queries, one per group} + {group count} x {group items} x {2 uses per day} x {2 start & end per use} x {1 SQL query}

500 + 500 x 250 x 2 x 2 x 1 = 500,500 database queries (vs 1.5M previously)

And how many jobs issue those 500K database queries?

{group count}

500 jobs

That is 3 times fewer database queries and 1000 times fewer jobs queued.

And here’s the kicker… we only ever run this initial computation for all groups ONCE!

After the utilization values have been initialized for all groups (for which I created a Laravel Artisan command), only the affected time range is recomputed whenever there’s a change in item usage.

On average, the job runs in 1.5 seconds, occasionally touching the 2-second mark whenever a time range of a full year has to be computed.

And of course, we use a database transaction to update the group utilization values. No more wiping of group utilization values before they’re recomputed. No more queues getting red hot at 2am.

As always, I test-drove the development of this feature. In fact, I wanted to cover several edge cases due to overlapping start/end times, so I would feel completely lost if I were forced at gunpoint to do it without tests. Not because of the gun, but because of feeling miserable.

Here’s the PHP doc block I wrote for the most simplistic test:

/**
 * 4.01    5       6       7       8       9       10 - day
 * |       ################################# - calculated period
 * |
 * |       |M5-----------------------------| (Multi-use of 5 items)
 * |
 * 0       5                               0 - utilization
 */

4.01 means a fixed test date of the 4th of January 2021; the top line is a calendar. The line of # characters marks the period we will recompute. M5 indicates a time range during which 5 items are in use. The last line shows the expected utilization values.
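
For illustration, a test matching that diagram could look roughly like this. The factories, model names, and the command’s parameters are my assumptions for the sketch:

use Carbon\Carbon;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Illuminate\Support\Facades\Bus;
use Tests\TestCase;

class ComputeGroupUtilizationTest extends TestCase
{
    use RefreshDatabase;

    public function test_multi_item_usage_yields_the_expected_totals(): void
    {
        // Freeze time at the fixed test date, 4th of January 2021.
        $this->travelTo(Carbon::parse('2021-01-04'));

        $group = Group::factory()->create();

        // M5: five items in use from the 5th until the end of the period.
        MultiItemUsage::factory()->for($group)->create([
            'start_date' => '2021-01-05 00:00:00',
            'end_date' => '2021-01-10 00:00:00',
            'number_of_items' => 5,
        ]);

        Bus::dispatchSync(new ComputeGroupUtilization(
            groupId: $group->id,
            from: '2021-01-05 00:00:00',
            to: '2021-01-10 00:00:00',
        ));

        // Utilization jumps to 5 at the start and drops back to 0 at the end.
        $this->assertDatabaseHas('group_utilizations', [
            'group_id' => $group->id,
            'datetime' => '2021-01-05 00:00:00',
            'value' => 5,
        ]);
        $this->assertDatabaseHas('group_utilizations', [
            'group_id' => $group->id,
            'datetime' => '2021-01-10 00:00:00',
            'value' => 0,
        ]);
    }
}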

And here’s the doc block I wrote for the last and most involved test:

/**
 * 4.01    5       6       7       8       9       10 - day
 * |       ################################# - calculated period
 * |
 * |M2-------------|
 * |S1---------|
 * |       |M5-----------------------------|
 * |                               |M4--------------|
 * |                               |S1---------|
 * |                               |S1.....INF
 * |
 * 3 <--   8   7   5               11      6 - utilization
 */

S stands for single-item use and M for multi-item use. Note that the initial utilization value (at the bottom left) is 3 before the calculated period, so that’s the initial database query before the actual running-total algorithm kicks in. You get the idea 🙂

Rolling out the performance update

I knew this new code was at the core of our client’s product. In fact, that’s the only type of work we do at madewithlove: at the core of the client’s needs, no boring auxiliary stuff.

Because of that, I made sure we rolled this out very carefully. We deployed the new code behind a feature flag. We then enabled the flag and let the new implementation run in parallel with the old implementation that was still in use.

We could have attempted to manually reconcile the old and new values, but because the old values were off, a difference wouldn’t tell us which implementation was right.

So instead, I wrote another Laravel Artisan command that takes a group_id and a datetime as arguments, computes the utilization value with a direct database query (not the command with the running total), and compares it to the stored pre-computed utilization value. A successful match was signified by a green message printed to the terminal :).
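
Here’s a sketch of what that verification command might look like. The signature and query details are illustrative, simplified to a single usages table:

use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class VerifyGroupUtilization extends Command
{
    // Hypothetical signature: the group id and datetime to check.
    protected $signature = 'utilization:verify {group_id} {datetime}';

    public function handle(): int
    {
        $groupId = (int) $this->argument('group_id');
        $datetime = $this->argument('datetime');

        // Recompute from scratch with a direct count query...
        $expected = DB::table('item_usages')
            ->where('group_id', $groupId)
            ->where('start_date', '<=', $datetime)
            ->where('end_date', '>', $datetime)
            ->count();

        // ...and compare it with the stored pre-computed value.
        $stored = (int) DB::table('group_utilizations')
            ->where('group_id', $groupId)
            ->where('datetime', $datetime)
            ->value('value');

        if ($expected === $stored) {
            $this->info("OK: utilization is {$stored}"); // the green message
            return self::SUCCESS;
        }

        $this->error("Mismatch: stored {$stored}, expected {$expected}");
        return self::FAILURE;
    }
}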

After one week and a couple of checks with the CLI command, I was confident the new implementation would behave exactly as expected. It especially helped my confidence when the old utilization value was known to be incorrect, but the new value was good! TDD FTW.

We flipped another feature flag and the system switched over to using the new group utilization values when the API was queried. After a few days the old code was removed (along with the feature flag) and the old values backed up to S3 and removed from the database.

Parting thoughts

Brute force is fine as a first iteration on a body of work, to prove something is possible (even if it has a terrible running time). But it rarely, if ever, is the right solution to the problem. A different solution doesn’t inherently have to be complex, yet it can yield much better performance.

Race conditions continue to be non-trivial to spot for the untrained eye. However, I am confident that the mentoring we’ve provided to our client’s engineering team puts them in a better position to spot these kinds of issues early and prevent them from reaching production. They might not even get committed in the first place, because we encourage pair programming, too!

Thanks for making it to the end of a pretty long article on what basically constitutes the use of a running total. Still, this is the second time the technique has proven useful to me, so maybe it’s not that obvious after all, and perhaps it will help you on a future project one day!