JobMaster — Architecture & Performance Tuning Guide
JobMaster is a distributed, highly tunable background task execution engine. This guide provides the architectural blueprints and tuning formulas to help you confidently size your cluster, optimize throughput, isolate workloads, and shield your Master DB from contention.
If you haven't already, read the Architecture Overview first. Understanding how Buckets, Coordinators, and the SavePending flow work will make the tuning decisions in this guide much easier to reason about.
1. Configuring and Tuning Cluster Settings: Sizing Your Configurations
When scaling your cluster, you have seven major parameters to adjust. Here is how to set each one to fit your workload:
Parameter 1: How many Coordinators do I need?
Under normal conditions, 1–2% of the total execution capacity is sufficient for the coordinator pool across the entire cluster. Running at least two coordinators is recommended — this provides active/passive high availability with automatic failover.
Before scaling up coordinator count, consider raising the TransientThreshold first. A higher threshold keeps jobs buffered in the bucket longer, reducing pressure on the master node and often deferring the need for additional coordinators.
- Why: A Coordinator's job is extremely fast. It queries, acquires, and pushes.
- Tuning TransferBatchSize:
  - Default is 1000.
  - For high-throughput scenarios (1000+ jobs/sec), increase this to 2000 or 5000. This allows a single Coordinator sweep to onboard thousands of jobs in a single database round-trip, reducing write lock duration.
  - For low-throughput, heavy-compute scenarios, keep this at 100–500 to prevent over-allocating large chunks of work to single buckets.
→ Configuration reference: TransferBatchSize
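As a rough sketch of the guidance above, a Coordinator worker with a raised batch size could be registered as follows. Every builder call here appears elsewhere in this guide; the cluster ID, the worker name, and the value 2000 are illustrative. Running each of the two recommended Coordinators in its own process is an assumption about a typical deployment, not a stated requirement.

```csharp
// Coordinator instance #1. A second instance for failover would be identical
// except for WorkerName (assumption: each Coordinator gets a distinct name).
builder.Services.AddJobMasterCluster(config =>
{
    config.ClusterId("My-Cluster");
    config.UsePostgresForMaster("...");
    config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");

    config.AddWorker()
        .AgentConnName("Nats-1")
        .WorkerName("Coordinator-01")               // illustrative; "Coordinator-02" on the second instance
        .SetWorkerMode(AgentWorkerMode.Coordinator)
        .TransferBatchSize(2000);                   // raised from the default of 1000 for high throughput
});
```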
Parameter 2: How many Buckets do I need (BucketQtyConfig)?
Buckets are the fundamental unit of concurrency partitioning. They act like logical storage partitions. When a worker owns a bucket, it takes exclusive claim over all jobs inside it, reducing lock contention across workers.
- Low Bucket Quantity (1–2 per priority):
- When to use: Heavy, resource-intensive, long-running jobs (e.g., video encoding, report generation).
- Why: Keeps the worker focused. Prevents fetching too much concurrent work, reducing CPU context switching and memory spikes.
- High Bucket Quantity (5–20+ per priority):
- When to use: High-velocity, lightweight, sub-second jobs (e.g., transactional webhooks, data ingestion, event alerts).
- Why: Maximizes parallel intake. Multiple workers can pull from different buckets concurrently without waiting for each other.
The Sizing Rule of Thumb: To keep every Executor busy, the total number of active buckets in a cluster lane should be equal to or greater than the total number of active Executor nodes (Total Buckets >= Total Executors). If you have 10 workers and only 4 buckets, 6 workers will sit idle.
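The rule of thumb can be checked with a few lines of arithmetic. The helpers below are hypothetical and not part of JobMaster; they assume each bucket is claimed by at most one Executor at a time, so any Executors beyond the bucket count have nothing to claim.

```csharp
using System;

// Hypothetical sizing check, not part of the JobMaster API.
// Assumes one bucket is owned by at most one Executor at a time.
static int IdleExecutors(int totalBuckets, int totalExecutors) =>
    Math.Max(0, totalExecutors - totalBuckets);

static bool SatisfiesRuleOfThumb(int totalBuckets, int totalExecutors) =>
    totalBuckets >= totalExecutors;

// The example from the text: 10 workers sharing 4 buckets.
Console.WriteLine(IdleExecutors(totalBuckets: 4, totalExecutors: 10));         // 6
Console.WriteLine(SatisfiesRuleOfThumb(totalBuckets: 4, totalExecutors: 10));  // False
Console.WriteLine(SatisfiesRuleOfThumb(totalBuckets: 12, totalExecutors: 10)); // True
```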
Priority isolation: dedicating workers to a single priority
Beyond sizing bucket quantities, you can assign workers to serve only specific priorities by setting others to 0 buckets. Since the default per priority is 1 bucket, you must explicitly zero out every priority you want to exclude. This reserves a dedicated executor fleet exclusively for high-urgency work — guaranteeing that Critical jobs are never starved by lower-priority processing, even under peak load.
Use this when a specific priority carries SLA-bound or user-facing work that cannot tolerate any queueing delay caused by lower-priority jobs competing for the same workers.
// GENERAL workers — handle VeryLow through High; Critical is excluded
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("General-Executor")
.SetWorkerMode(AgentWorkerMode.Execution)
.BucketQtyConfig(JobMasterPriority.VeryLow, 1)
.BucketQtyConfig(JobMasterPriority.Low, 2)
.BucketQtyConfig(JobMasterPriority.Medium, 4)
.BucketQtyConfig(JobMasterPriority.High, 6)
.BucketQtyConfig(JobMasterPriority.Critical, 0) // excluded — never picks up Critical jobs
.ParallelismFactor(2.0);
// CRITICAL-ONLY workers — reserved exclusively for urgent jobs
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Critical-Executor")
.SetWorkerMode(AgentWorkerMode.Execution)
.BucketQtyConfig(JobMasterPriority.VeryLow, 0) // excluded
.BucketQtyConfig(JobMasterPriority.Low, 0) // excluded
.BucketQtyConfig(JobMasterPriority.Medium, 0) // excluded
.BucketQtyConfig(JobMasterPriority.High, 0) // excluded
.BucketQtyConfig(JobMasterPriority.Critical, 10)
.ParallelismFactor(6.0);
This pattern composes naturally with WorkerLane — you can combine lane isolation and priority isolation for granular control over which workers serve which workloads.
→ Configuration reference: BucketQtyConfig
Parameter 3: When should I create a separate Worker Lane (WorkerLane)?
Think of Lanes as logically isolated queues or dedicated execution zones. By default, all workers and jobs operate in the Default lane.
- When to isolate: Create a separate lane when you have a mixed workload of Fast/Critical jobs and Slow/Heavy jobs.
- Why: If they share a lane, long-running analytics jobs (taking 10 minutes each) will fill up all available bucket slots, causing high-priority transactional emails (taking 50 ms) to starve in the queue.
- Solution:
  - Dedicate a worker fleet to .WorkerLane("Critical-Emails") with many buckets.
  - Dedicate a separate, low-priority worker fleet on cheap instances to .WorkerLane("Heavy-Analytics") with few buckets.
- Database vs. Message Broker Isolation for Long-Running Tasks:
  - Guidelines: For jobs that take longer than 30 seconds to execute, prefer using a database-backed transport layer (RDBMS like PostgreSQL or SQL Server) rather than an ephemeral message broker (like NATS JetStream).
  - Why: Message brokers are optimized for high-velocity, sub-second streaming. Directing long-running workloads through message broker channels can saturate dispatch/onboarding buffers and trigger consumer acknowledgement timeouts. A database-backed transport is designed to handle sustained, long-duration tasks gracefully and with full durability.
  - Implementation: Create a dedicated database agent connection, and configure a specialized worker pool using a dedicated WorkerLane connected to that database transport.
// Registering Isolated Broker vs. Database Transport Connections & Worker Fleets
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("SMB-Enterprise-Cluster");
config.UsePostgresForMaster("...");
// 1. Register Connections
// Fast Ephemeral Transport for real-time webhooks & emails
config.AddAgentConnectionConfig("Fast-Broker-Connection")
.UseNatsJetStream("nats://localhost:4222");
// Durable Database-backed Transport for heavy/long analytics
config.AddAgentConnectionConfig("Durable-Db-Connection")
.UsePostgresForAgent("Host=localhost;Database=jobmaster_transport;...");
// 2. Define Isolated Worker Fleets
// Fleet A: Fast-Velocity Message Broker Workers (Pulls from NATS JetStream)
config.AddWorker()
.WorkerName("Broker-Executor")
.SetWorkerMode(AgentWorkerMode.Execution)
.AgentConnName("Fast-Broker-Connection") // Uses fast message broker
.WorkerLane("Default") // Processes transactional, sub-second jobs
.BucketQtyConfig(JobMasterPriority.Critical, 20)
.ParallelismFactor(4.0); // Highly concurrent execution
// Fleet B: Slow-Running RDBMS Workers (Pulls from Postgres Database Transport)
config.AddWorker()
.WorkerName("Long-Running-Executor")
.SetWorkerMode(AgentWorkerMode.Execution)
.AgentConnName("Durable-Db-Connection") // Connects to the database transport
.WorkerLane("Slow-Analytics-Lane") // Dedicated lane for long executions (>30s)
.BucketQtyConfig(JobMasterPriority.Medium, 1) // 1 bucket = Strict sequential order
.ParallelismFactor(1.0) // Low concurrency per node to shield CPU
.BucketBufferSize(5); // Prevents pre-fetching heavy tasks into memory
});
→ Configuration reference: Worker Lanes
Parameter 4: Tuning Prefetch Buffers (BucketBufferSize & BucketBufferLeadTime)
To achieve fast dispatching, workers pre-fetch jobs from their assigned buckets into local memory. This is your cluster's read-ahead cache.
- BucketBufferSize (Default: 250):
  - High-Velocity Tuning: If your handlers run in milliseconds, increase this to 500–1000. This keeps the local worker's in-memory queue full, guaranteeing minimal delay between jobs.
  - Heavy-Job Tuning: If your handlers take minutes, lower this to 10–20 to prevent caching work in memory that might sit idle while other workers have free threads.
- BucketBufferLeadTime (Default: 15s):
  - Defines how far ahead in time the worker scans the bucket to prefetch.
  - For high-velocity jobs, keep it short (5s–15s) to ensure high freshness. For slow, predictable schedules, you can increase this to 30s to reduce network polling frequency.
→ Configuration reference: BucketBufferSize · BucketBufferLeadTime
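To make the two tuning directions concrete, here is a minimal sketch of a high-velocity worker next to a heavy-job worker. The builder calls are the ones used elsewhere in this guide; the worker names, lane name, and exact values are illustrative picks from the ranges above, not recommended settings.

```csharp
builder.Services.AddJobMasterCluster(config =>
{
    config.ClusterId("My-Cluster");
    config.UsePostgresForMaster("...");
    config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");

    // High-velocity handlers (milliseconds): large buffer, short lead time.
    config.AddWorker()
        .AgentConnName("Nats-1")
        .WorkerName("Fast-Executor")                       // illustrative name
        .SetWorkerMode(AgentWorkerMode.Execution)
        .BucketBufferSize(1000)                            // keep the in-memory queue full
        .BucketBufferLeadTime(TimeSpan.FromSeconds(5));    // scan only 5s ahead for freshness

    // Heavy handlers (minutes): small buffer, longer lead time.
    config.AddWorker()
        .AgentConnName("Nats-1")
        .WorkerName("Heavy-Executor")                      // illustrative name
        .WorkerLane("Heavy-Compute")
        .SetWorkerMode(AgentWorkerMode.Execution)
        .BucketBufferSize(10)                              // avoid holding idle work in memory
        .BucketBufferLeadTime(TimeSpan.FromSeconds(30));   // poll less often on predictable schedules
});
```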
Parameter 5: Sizing Concurrency and Threads (ParallelismFactor)
The ParallelismFactor (Default: 1.0) is a multiplier that scales the base number of concurrent execution slots assigned to a priority lane. It determines how many threads can execute tasks concurrently on a single worker node.
The framework computes the base running slots (capacity) based on job priority:
- VeryLow: 2 base slots
- Low: 3 base slots
- Medium: 4 base slots
- High: 5 base slots
- Critical: 6 base slots
The actual number of jobs executing in parallel is calculated as:
Run Capacity = Round(Base Slots × ParallelismFactor)
The worker automatically pauses fetching new jobs when this capacity is saturated, preventing memory exhaustion under heavy load.
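The formula can be worked through with a small calculator. This is a sketch rather than JobMaster's internal code: the base slots are copied from the list above, the helper name is hypothetical, and the rounding mode for exact midpoints (such as 2 × 0.25 = 0.5) is an assumption, since the guide does not say how the framework rounds them.

```csharp
using System;
using System.Collections.Generic;

// Base slots per priority, copied from the list above.
var baseSlots = new Dictionary<string, int>
{
    ["VeryLow"] = 2, ["Low"] = 3, ["Medium"] = 4, ["High"] = 5, ["Critical"] = 6,
};

// Run Capacity = Round(Base Slots × ParallelismFactor).
// Assumption: midpoints round away from zero; the framework's actual mode is not documented here.
int RunCapacity(string priority, double parallelismFactor) =>
    (int)Math.Round(baseSlots[priority] * parallelismFactor, MidpointRounding.AwayFromZero);

Console.WriteLine(RunCapacity("Critical", 6.0)); // 36
Console.WriteLine(RunCapacity("High", 2.0));     // 10
Console.WriteLine(RunCapacity("Medium", 1.0));   // 4
Console.WriteLine(RunCapacity("Medium", 0.25));  // 1
```

Factors that land on a midpoint for a given priority are worth avoiding until you have confirmed how your JobMaster version rounds them.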
Sizing Scenarios for ParallelismFactor:
- Compute-Bound Tasks (CPU-Intense):
  - Characteristics: Video processing, image manipulation, cryptographic calculations, heavy PDF parsing.
  - Tuning Strategy: Keep the ParallelismFactor low (typically 0.25 to 1.0). You want the total number of running tasks across all active buckets on the worker to closely align with (and not exceed) the physical CPU core count of the host container/VM.
  - Why: Running more execution threads than CPU cores causes context-switch thrashing, slowing down every task and increasing the overall memory footprint.
- I/O-Bound Tasks (Network & Storage Intense):
  - Characteristics: Sending emails, calling external webhooks/APIs, pulling/pushing file storage (e.g., AWS S3), or executing lightweight database queries.
  - Tuning Strategy: Scale the ParallelismFactor high (typically 2.0 to 5.0+).
  - Why: Because your worker threads spend most of their time waiting for external networks or storage I/O, the CPU remains largely idle. A higher factor allows you to process dozens or hundreds of concurrent requests in parallel on a single lightweight worker.
→ Configuration reference: ParallelismFactor
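The two scenarios might translate into configuration along these lines. The worker and lane names are illustrative, and the slot counts in the comments come from applying the Run Capacity formula above to the documented base slots.

```csharp
builder.Services.AddJobMasterCluster(config =>
{
    config.ClusterId("My-Cluster");
    config.UsePostgresForMaster("...");
    config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");

    // Compute-bound fleet: keep running tasks at or below the core count.
    config.AddWorker()
        .AgentConnName("Nats-1")
        .WorkerName("Cpu-Bound-Executor")                  // illustrative name
        .WorkerLane("Heavy-Compute")
        .SetWorkerMode(AgentWorkerMode.Execution)
        .BucketQtyConfig(JobMasterPriority.Medium, 1)
        .ParallelismFactor(0.5);                           // Medium: Round(4 × 0.5) = 2 slots

    // I/O-bound fleet: threads mostly wait, so allow many concurrent jobs.
    config.AddWorker()
        .AgentConnName("Nats-1")
        .WorkerName("Io-Bound-Executor")                   // illustrative name
        .SetWorkerMode(AgentWorkerMode.Execution)
        .BucketQtyConfig(JobMasterPriority.High, 10)
        .ParallelismFactor(4.0);                           // High: Round(5 × 4.0) = 20 slots
});
```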
Parameter 6: Sizing the Look-Ahead Window (TransientThreshold)
The TransientThreshold (Default: 10 minutes) is a global cluster setting that defines the look-ahead time window for scheduling and fast execution routing.
It governs two critical parts of the JobMaster lifecycle:
- Coordinator Scan Lookahead: The Coordinator queries the Master DB for future scheduled jobs whose execution time falls within this window, reserving them in bulk and assigning them to active worker buckets.
- Immediate Execution decision (SavePending): When a new job is scheduled, the write buffer compares the job's planned start time against this threshold. If it falls within the window, the job uses the immediate YES path execution shortcut (bypassing normal scanning queues to execute instantly in the worker's active bucket). If it is scheduled outside the window, it takes the NO path (saved to the Master DB and scheduled to be acquired later).
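The SavePending routing decision described above reduces to a single time comparison. The helper below is a hypothetical illustration, not JobMaster's internal code; it assumes the window is measured from the moment of scheduling and that a job landing exactly on the boundary counts as inside the window.

```csharp
using System;

// Hypothetical helper: true means the YES path (immediate execution shortcut),
// false means the NO path (saved to the Master DB and acquired later).
static bool TakesYesPath(DateTimeOffset plannedStart, DateTimeOffset now, TimeSpan transientThreshold) =>
    plannedStart - now <= transientThreshold;

var now = new DateTimeOffset(2024, 1, 1, 12, 0, 0, TimeSpan.Zero);
var threshold = TimeSpan.FromMinutes(10); // the documented default

Console.WriteLine(TakesYesPath(now.AddMinutes(3), now, threshold)); // True
Console.WriteLine(TakesYesPath(now.AddHours(2), now, threshold));   // False
```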
Sizing Scenarios for TransientThreshold:
- High-Throughput, Stable Environments:
  - Tuning Strategy: Increase the threshold (e.g., 15 to 30 minutes).
  - Why: A larger look-ahead window allows the Coordinator to prefetch and onboard larger batches of future tasks in fewer database sweeps. This significantly reduces the polling overhead and query load on the Master DB.
- Highly Dynamic or Auto-scaling Environments (e.g., Kubernetes):
  - Tuning Strategy: Decrease the threshold (e.g., 1 to 5 minutes).
  - Why: If worker pods scale down or restart frequently, pre-allocating jobs too far in advance increases the volume of "orphaned" tasks that must undergo self-healing recovery when a worker goes offline. A shorter threshold keeps task distribution highly dynamic and minimizes recovery overhead.
→ Configuration reference: TransientThreshold
Parameter 7: How many Drain workers do I need?
Drain workers are persistent background processes, but significantly lighter than Coordinators or Executors. When a bucket becomes lost, the cluster distributes it across available Drain workers — each one can handle multiple orphaned buckets concurrently, spawning a dedicated set of recovery runners per bucket until the jobs are redirected back to the Master DB.
When running in a decoupled topology, the right count depends on how frequently Executor nodes go offline — whether due to crashes or intentional replacement (e.g., rolling updates in Kubernetes):
- Stable environments (rare turnover, long-lived processes): 1 Drain worker is sufficient.
- Dynamic environments (frequent worker replacement, e.g., Kubernetes rolling updates or aggressive auto-scaling): target roughly 1–10% of your Executor count — more Drain workers distribute the recovery load when many Executors go offline in quick succession.
A shorter TransientThreshold also reduces orphan volume — fewer jobs are pre-allocated into buckets at any given moment, which means less work per recovery event regardless of Drain worker count.
→ Configuration reference: Drain mode
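The Drain worker guidance can be turned into a quick estimate. The helper below is hypothetical and simply applies the percentages stated above, rounding up and never going below one worker; treat the result as a starting range to validate against observed recovery times.

```csharp
using System;

// Hypothetical sizing helper: stable environments need 1 Drain worker;
// dynamic ones target roughly 1–10% of the Executor count (rounded up, minimum 1).
static (int Min, int Max) SuggestedDrainWorkers(int executorCount, bool dynamicEnvironment)
{
    if (!dynamicEnvironment) return (1, 1);
    int min = Math.Max(1, (executorCount + 99) / 100); // ceil(1% of executors)
    int max = Math.Max(1, (executorCount + 9) / 10);   // ceil(10% of executors)
    return (min, max);
}

Console.WriteLine(SuggestedDrainWorkers(200, dynamicEnvironment: true));  // (2, 20)
Console.WriteLine(SuggestedDrainWorkers(8, dynamicEnvironment: true));    // (1, 1)
Console.WriteLine(SuggestedDrainWorkers(200, dynamicEnvironment: false)); // (1, 1)
```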
2. Separating Publishers from Consumers
An application instance can be configured with a JobMaster cluster but no workers registered. This turns it into a pure publisher — it can schedule jobs and benefit from the SavePending short-circuit without running any coordination or execution logic in that process.
This pattern is useful when:
- Your API tier (stateless web pods) needs to schedule jobs quickly without carrying coordinator or executor thread overhead.
- You want to scale publishers and consumers independently.
- Workers live in a dedicated service or container separate from the web API.
The agent connection requirement
Even a publisher-only instance must have an agent connection configured. When a job is scheduled and falls within the TransientThreshold (the YES path), the SavePending mechanism writes directly to the agent ephemeral transport to enable the execution short-circuit. Without an agent connection, the short-circuit cannot engage and all scheduling falls back to the NO path — persisted to the Master DB and picked up later by the Coordinator scan.
// PUBLISHER INSTANCE (e.g., Web API — no workers)
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("My-Cluster");
config.UsePostgresForMaster("Host=db;Database=jobmaster;...");
// Required for the SavePending short-circuit (YES path) to engage
config.AddAgentConnectionConfig("Nats-1")
.UseNatsJetStream("nats://nats:4222");
// No AddWorker() calls — this instance only schedules jobs
});
// CONSUMER INSTANCE (e.g., dedicated worker service)
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("My-Cluster"); // must match the publisher
config.UsePostgresForMaster("Host=db;Database=jobmaster;...");
config.AddAgentConnectionConfig("Nats-1")
.UseNatsJetStream("nats://nats:4222");
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Coordinator-01")
.SetWorkerMode(AgentWorkerMode.Coordinator);
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Executor-01")
.SetWorkerMode(AgentWorkerMode.Execution)
.BucketQtyConfig(JobMasterPriority.High, 10)
.ParallelismFactor(4);
});
Both instances must share the same Cluster ID, the same Master DB, and the same Agent connection. The cluster ID is what binds publishers and consumers together.
3. Self-Healing & Orphan Recovery (The Lost Bucket Rescue)
If an Agent Worker crashes or loses connectivity, JobMaster automatically recovers the orphaned work without losing a single job:
- Heartbeat Failure: The cluster coordinator detects a missing heartbeat and marks the worker's assigned buckets as Lost.
- Adoption: A healthy active worker claims ownership of the Lost bucket, moving its status to Draining.
- Redirection to Master: The adopting worker pulls all unfinished jobs (currently in the active execution queue) and flushes all unsaved jobs (buffered under SavePending status but not yet stored in the orchestration database) out of the Draining bucket and redirects them back to the Master DB (setting their status back to HeldOnMaster).
- Re-Assignment: The jobs are cleanly picked up by active, healthy buckets on other workers during standard Coordinator scans.
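The status transitions in the steps above can be traced with a small in-memory model. This is purely illustrative: the Lost, Draining, SavePending, and HeldOnMaster names come from the text, while the "Active" and "Running" labels, the bucket and job IDs, and the tuple layout are hypothetical stand-ins for whatever JobMaster stores internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative model only: one bucket owned by a worker that is about to crash.
var bucketStatus = new Dictionary<string, string> { ["bucket-7"] = "Active" };
var jobs = new List<(string Id, string Bucket, string Status)>
{
    ("job-1", "bucket-7", "Running"),     // unfinished: was in the active execution queue
    ("job-2", "bucket-7", "SavePending"), // unsaved: buffered, not yet in the Master DB
};

// Step 1: a missed heartbeat marks the bucket as Lost.
bucketStatus["bucket-7"] = "Lost";

// Step 2: a healthy worker adopts the bucket and starts draining it.
bucketStatus["bucket-7"] = "Draining";

// Step 3: every job in the draining bucket is redirected to the Master DB.
for (int i = 0; i < jobs.Count; i++)
{
    if (jobs[i].Bucket == "bucket-7")
        jobs[i] = (jobs[i].Id, "", "HeldOnMaster"); // empty bucket = unassigned (hypothetical convention)
}

// Step 4 (not modelled): a later Coordinator scan assigns HeldOnMaster jobs to healthy buckets.
foreach (var job in jobs)
    Console.WriteLine($"{job.Id}: {job.Status}");
```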
[Diagram: Self-Healing & Orphan Recovery. The active worker adopts the lost bucket and redirects both unfinished and unsaved jobs back to the Master DB.]
4. Tuning Blueprints for Common Scenarios
Here is how to configure your JobMaster cluster for three classic operational patterns:
Scenario A: The High-Throughput Ingestion Cluster (Fast, Lightweight Events)
- Goal: Process 5,000 webhook dispatches per second with minimal dispatch latency.
- Strategy: Decouple completely. Dedicated Coordinator, maximum batching, and highly parallel workers.
// 1. COORDINATOR NODE (Deploys as a single lightweight container)
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("Ingestion-Cluster");
config.UsePostgresForMaster("...");
config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Central-Brain-01")
.SetWorkerMode(AgentWorkerMode.Coordinator)
.TransferBatchSize(5000); // Massive batch onboarding in 1 DB tick
});
// 2. EXECUTOR NODES (Scale horizontally as 10+ container replicas)
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("Ingestion-Cluster");
config.UsePostgresForMaster("...");
config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Event-Muscle")
.SetWorkerMode(AgentWorkerMode.Execution) // Bypasses Master DB completely
.ParallelismFactor(4.0) // High thread utilization per core
.BucketBufferSize(1000) // Large pre-fetch cache to eliminate gaps
.BucketQtyConfig(JobMasterPriority.Critical, 20); // 20 buckets = High parallelism
});
Scenario B: The Heavy Analytics & AI Processing Cluster (Slow, Compute-Intense)
- Goal: Run long-running AI models or heavy PDF reports (10 seconds to 15 minutes per job) without starving other systems or crashing worker memory.
- Strategy: Isolate with a dedicated WorkerLane, minimize buckets, and keep pre-fetching to a minimum.
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("Enterprise-Cluster");
config.UsePostgresForMaster("...");
config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");
// Dedicated Worker for Heavy Compute only
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Compute-Node-01")
.WorkerLane("Heavy-Compute") // Fully isolated pipeline
.SetWorkerMode(AgentWorkerMode.Full)
.ParallelismFactor(1.0) // Base capacity only: Round(4 × 1.0) = 4 slots for Medium
.BucketBufferSize(5) // Prevent pre-fetching heavy tasks into memory
.BucketBufferLeadTime(TimeSpan.FromSeconds(5))
.BucketQtyConfig(JobMasterPriority.Medium, 1); // 1 bucket = Strict sequential order
});
Scenario C: The Standard SMB Hybrid Setup (Balanced, Single-Node)
- Goal: A clean, balanced setup for a standard application running background tasks, cron jobs, and email dispatches in a single web process.
- Strategy: Use the default Full mode with balanced scaling configurations.
builder.Services.AddJobMasterCluster(config =>
{
config.ClusterId("App-Cluster");
config.UsePostgresForMaster("...");
config.AddAgentConnectionConfig("Nats-1").UseNatsJetStream("nats://...");
config.AddWorker()
.AgentConnName("Nats-1")
.WorkerName("Standard-Worker")
.SetWorkerMode(AgentWorkerMode.Full) // Single process does both brains and muscle
.TransferBatchSize(500)
.BucketBufferSize(100)
.ParallelismFactor(2.0)
.BucketQtyConfig(JobMasterPriority.Medium, 3)
.BucketQtyConfig(JobMasterPriority.High, 2);
});
5. Cheat Sheet: Tuning Configuration Reference
Use this quick matrix to diagnose performance bottlenecks and tune JobMaster:
| Symptom / Bottleneck | Root Cause | Primary Tuning Cure |
|---|---|---|
| High Master DB CPU / IOPS Spikes | Executors are hitting the Master DB directly, or Coordinator batch sizes are too small. | 1. Shift executors to AgentWorkerMode.Execution. 2. Increase TransferBatchSize on your Coordinators. 3. Use SavePending to buffer writes via the agent ephemeral transport. |
| Idle Workers (Compute Starvation) | Total active bucket count is smaller than the number of execution workers. | Increase BucketQtyConfig for the active priorities so every worker has buckets to claim. |
| Out of Memory on Executor Nodes | Too many heavy jobs are being pre-fetched into memory concurrently. | 1. Reduce BucketBufferSize to 5–25. 2. Decrease ParallelismFactor. |
| High-Priority Jobs delayed by Slow Jobs | Resource contention in a single lane. | Create a dedicated high-priority WorkerLane to isolate execution. |
| High latency between job end and next start | Executor is waiting for new rounds of polling. | Increase BucketBufferSize to allow aggressive pre-fetching. |