Background job queues and priorities may be the wrong path
I got this idea after reading the Reflections on GoodJob for Solid Queue from Ben Sheldon, who is the author of the excellent GoodJob.
I shared the same idea that purpose queues (mail, API, stats, …) were bad. I was convinced the best practice was to have job queues that define the priority, such as low, medium, and high. But after reading the following part, I completely changed my mind :
Queue design and multi-queue execution pools. I do think queue design is a place where lots of people do it wrong. I believe queues should be organized by maximum total latency SLO (latency_15s, latency_15m , latency_8h) and not by their purpose or dependencies (mailers, billing, api). Nate Berkopec believes similarly. And I think that informs that execution pools (e.g. thread pools) should be able to work from multiple queues and have independent concurrency configuration (e.g. number of threads), both to ease transition from the latter to the former, but also because it allows sharing resources as optimally as possible (having 3 separate pools that pull from "latency_15s", "latency_15m, latency_15s", and "latency_8h,*" in GoodJob’s syntax). I personally think concepts like priority or ordered-queues lead to bad queue design, so I wouldn’t sweat that. Any ordering regime more complex than first-in-first-out (FIFO) prioritizes capacity (or lack thereof) over latency. This might sound strange coming from me who champions running workloads in the webbrowser on tiny dynos, but it’s different in my mind: I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
Indeed, defining latency queues (latency_15s, latency_15m, latency_8h) is much better. Because if the high priority queue is always busy, others will never start. However, with the idea of SLO latency, we could totally get rid of queues.
A job needs two attributes to define when it should be started: run_at and max_latency.
That means the job worker only needs to order them by run_at + max_latency
, and takes the first.
It seems both flexible and simple.
Here is an example to illustrate :
class ApplicationJob < ActiveJob::Base
max_latency 10.minutes # Arbitrary default value
end
class SendBoringEmailJob < ApplicationJob
max_latency 30.minutes # These jobs will be started in less that 30 minutes by default
end
class SendImportantEmailJob < ApplicationJob
max_latency 1.minute # These jobs will be started in less than a minute by default
end
# Of course, it should be possible to override default values
# Job 1 will start between 10 and 15 minutes because it tolerates 5 minutes of latency
SendBoringEmailJob.set(run_at: 10.minutes.from_now, max_latency: 5.minutes).perform_later
# Job 2 will start in 11 minutes and does not tolerate latency
SendBoringEmailJob.set(run_at: 11.minutes.from_now, max_latency: 0).perform_later
# If workers are quiet, job1 will be run first in 10 minutes
# If workers are busy, job2 will be run first in 11 minutes
# If workers are too busy, both jobs will exceed their max latency.
# The last case is a great indicator for either optimizing jobs or adding workers.
After having this revelation, I strongly believe that background jobs should neither have queues and priorities. Indeed, a maximum latency is enough to respect a Service Level Objective, which is much better. Job queues are from the past. Viva latency SLO!
Post-scriptum
The article received interesting comments, especially from Hacker News. Thanks to all commenters for sharing their experiences and constructive opinions. It shows that this post is incomplete and far from being perfect. I will try improve it for futur readers by talking about overload situation and jobs not using the same resources.
In an overload situation, the choosen strategy (priority or latency) determines which jobs to delay. With a priority system the low priority jobs will be all delayed. Nevertheless, it does not guarantee that all high priority jobs won't be late. For a latency strategy, the delay will be shared among all jobs.
All jobs are not using the same resources. Somme are waiting for IOs, others are CPU intensive. It's not the best way to mix these different kinds of jobs in the same queue. For example, for IOs jobs it's better to have many threads. However for CPU jobs, a thread per process should be enough most of the time. That means whatever the strategy choosen, it's wrong to mix CPU and IO jobs in the same queue.
Finally, it's even possible to mix queues, priorities and latency. For sure that sounds perfect in theory, however it becomes more complex. As always it's a trade-off. We need the simplest solution that is good enough.