JobMaster — Core Architecture Overview

Welcome to the JobMaster architecture overview.

JobMaster is a distributed background task orchestration engine for .NET. It is built around three priorities: auditing of background executions (which makes debugging and manual or historical re-executions easy), horizontal scaling, and flexible configuration that lets developers tune the system to their specific needs.

1. Core Architectural Mission

The core design goals of JobMaster are:

- Auditing & Troubleshooting: Providing a detailed execution audit trail for every job to facilitate debugging and support manual/historical re-executions.
- Horizontal Scale: Allowing execution workers to scale horizontally without placing a transaction or lock bottleneck on the central orchestration storage.
- Architectural Flexibility: Enabling developers to tune the engine's parameters (coordinators, workers, lanes, and buffers) to fit their specific business needs.

2. Standard Flow: Assigning Jobs to Buckets

The standard execution flow partitions workload queues so that multiple workers can process jobs in parallel, reducing lock contention bottlenecks.

The Assignment Flow Step-by-Step:

1. Durable Jobs: Jobs scheduled in the future reside in the Master DB in a HeldOnMaster status.
2. The Transient Threshold: The Coordinator scans the Master DB for jobs whose next execution falls within the TransientThreshold (e.g., the next 5 minutes).
3. Exclusive Bulk Reservation: The Coordinator pulls these jobs in bulk according to the TransferBatchSize and assigns them to Buckets owned by active workers.
4. Bucket Partitioning: Workers take atomic ownership of buckets. Each worker processes only the jobs inside its owned buckets, reducing cross-worker queue collisions.
5. Execution & Sync-Back: Worker threads execute the handlers and sync the final execution outcome (Succeeded/Failed) back to the Master DB, providing a full audit trail for easy debugging.
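
The assignment steps above can be modelled in a few lines. This is a minimal Python sketch of the idea, not JobMaster's .NET implementation: the `Job` and `Bucket` classes, the in-memory list standing in for the Master DB, and the round-robin distribution are all assumptions made for illustration. Only the status names (HeldOnMaster, InBucket) and the TransientThreshold and TransferBatchSize concepts come from this document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from itertools import cycle


@dataclass
class Job:
    job_id: str
    run_at: datetime
    status: str = "HeldOnMaster"


@dataclass
class Bucket:
    bucket_id: str
    owner: str  # worker that owns this bucket
    jobs: list = field(default_factory=list)


def assign_due_jobs(master_db, buckets, now, transient_threshold, transfer_batch_size):
    """One Coordinator scan: reserve due jobs in bulk and spread them over buckets."""
    if not buckets:
        return []
    horizon = now + transient_threshold
    # Select held jobs whose next execution falls inside the threshold window,
    # earliest first, capped at the batch size.
    due = sorted(
        (j for j in master_db if j.status == "HeldOnMaster" and j.run_at <= horizon),
        key=lambda j: j.run_at,
    )[:transfer_batch_size]
    targets = cycle(buckets)  # round-robin is an assumption, not a documented policy
    for job in due:
        bucket = next(targets)
        job.status = "InBucket"
        bucket.jobs.append(job)
    return due
```

Jobs outside the threshold window, or beyond the batch size, simply stay in HeldOnMaster until a later scan reaches them.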

Standard Flow Diagram

Here is how jobs flow from the Master DB through the Coordinator and into the Worker Buckets:

[Diagram: Assign Jobs to Buckets]

3. High-Speed Intake Flow: The SavePending Buffer & Execution Bypass

To avoid overloading the Master DB during high-volume bursts (e.g., an API receiving millions of rapid scheduling requests), JobMaster writes scheduled tasks directly into the transport layer and defers Master DB interaction to background runners. The caller does not wait for a Master DB write — the API response completes as soon as the ephemeral transport write succeeds.

The SavePending Flow Step-by-Step:

1. Fast Buffer Write: When a client schedules a job, the API bypasses the Master DB entirely and writes the task directly to the Agent Ephemeral Transport. This typically takes milliseconds or less, depending on the transport technology (for example, an RDBMS versus a memory-based message broker such as NATS). The API call returns as soon as this write succeeds.
2. Background Decision Check: A background runner picks up the buffered job and evaluates its planned start time against the TransientThreshold. Because this check runs off the hot path, only background processes write to the Master DB, which keeps API response latency decoupled from Master DB load:
   - YES Path (Immediate Execution Bypass): If the job is due immediately, the background runner writes the job record to the Master DB (ensuring a full audit trail) and routes execution directly to the active worker bucket (status → InBucket), bypassing the normal scanning/polling queues for fast dispatch.
   - NO Path (Asynchronous Sync-Back): If the job is scheduled for a future time, the background runner flushes it into the Master DB in non-blocking batches according to the TransferBatchSize (default: 1000), where it waits as HeldOnMaster until the Coordinator picks it up.
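
The background decision check can be sketched as a small routing function. This is an illustrative Python model, not the library's code: the `PendingJob` class and `route_pending` function are hypothetical names, and treating "due immediately" as "planned start no later than now plus the TransientThreshold" is an assumption drawn from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingJob:
    job_id: str
    run_at: datetime
    status: str = "SavePending"


def route_pending(buffered, now, transient_threshold, transfer_batch_size=1000):
    """Split buffered jobs into the bypass path and batched Master DB flushes."""
    horizon = now + transient_threshold  # assumed cut-off for "due immediately"
    bypass, deferred = [], []
    for job in buffered:
        if job.run_at <= horizon:
            job.status = "InBucket"      # YES path: dispatched straight to a bucket
            bypass.append(job)
        else:
            job.status = "HeldOnMaster"  # NO path: waits for a Coordinator scan
            deferred.append(job)
    # Future jobs are flushed in chunks of at most transfer_batch_size.
    batches = [
        deferred[i:i + transfer_batch_size]
        for i in range(0, len(deferred), transfer_batch_size)
    ]
    return bypass, batches
```

In the real flow both paths end with a Master DB record; the sketch only shows which route a job takes and how the deferred route is chunked.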

SavePending Flow Diagram

Here is how the API producer schedules jobs and how they are routed based on their planned execution time:

[Diagram: SavePending: Decoupled Buffer & Execution Shortcut]

4. Self-Healing & Orphan Recovery (The Lost Bucket Rescue)

If an Agent Worker crashes, stops heartbeating, or loses network connectivity unexpectedly, JobMaster's self-healing loop recovers the orphaned work automatically without losing a single job.

The Recovery Flow Step-by-Step:

1. Heartbeat Failure: A worker crashes. The cluster coordinator detects the missing heartbeat and marks the worker's assigned buckets as Lost.
2. Adoption: A healthy active worker claims ownership of the Lost bucket, moving its status to Draining.
3. Redirection to Master: The adopting worker drains two kinds of jobs from the Draining bucket: unexecuted jobs (queued in the bucket but not yet run) and unsaved jobs (buffered under SavePending status but not yet stored in the Master DB). It redirects both back to the Master DB and sets their status to HeldOnMaster.
4. Re-Assignment: Once redirected, the jobs are picked up during standard Coordinator scans and assigned to healthy buckets on other workers.
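
The recovery steps can be read as a small state machine over bucket status. The Python sketch below illustrates that reading under stated assumptions: the `Job` and `Bucket` classes, the "Active" status name, the `heartbeat_timeout` parameter, and the dictionary standing in for the Master DB are hypothetical. The Lost, Draining, SavePending, InBucket, and HeldOnMaster names come from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    status: str  # "InBucket" (unexecuted) or "SavePending" (unsaved)


@dataclass
class Bucket:
    bucket_id: str
    owner: str
    status: str = "Active"  # assumed name for a healthy bucket
    jobs: list = field(default_factory=list)


def mark_lost(buckets, last_heartbeat, now, heartbeat_timeout):
    """Flag buckets whose owner has no heartbeat within the timeout."""
    for bucket in buckets:
        seen = last_heartbeat.get(bucket.owner)
        if bucket.status == "Active" and (seen is None or now - seen > heartbeat_timeout):
            bucket.status = "Lost"


def drain(bucket, adopter, master_db):
    """Adopt a Lost bucket and redirect its jobs back to the Master DB."""
    if bucket.status != "Lost":
        raise ValueError("only Lost buckets can be adopted")
    bucket.owner = adopter
    bucket.status = "Draining"
    redirected = []
    for job in bucket.jobs:
        job.status = "HeldOnMaster"   # both unexecuted and unsaved jobs go back
        master_db[job.job_id] = job
        redirected.append(job.job_id)
    bucket.jobs = []                  # the bucket is empty once draining is done
    return redirected
```

After `drain` returns, the redirected jobs are ordinary HeldOnMaster records, so the standard Coordinator scan needs no special handling to re-assign them.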

Self-Healing & Orphan Recovery Diagram

Here is how the active worker adopts the lost bucket and redirects both unfinished and unsaved jobs back to the Master DB:

[Diagram: Self-Healing & Orphan Bucket Recovery]

5. Key Architectural Design Choices

- Partitioning via Buckets: Instead of pulling individual rows, workers own entire buckets. This design minimizes locking overhead and reduces queue collisions.
- Decoupled Coordination & Execution: Coordinators handle Master DB queries and onboarding. Executors talk only to the fast Agent transport on the hot path, allowing you to scale compute horizontally with minimal impact on Master DB capacity.
- Workload Isolation (Lanes): Logical isolation lanes (WorkerLane) allow you to separate slow, resource-heavy compute tasks from latency-critical transactional jobs.
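
To make the lane idea concrete, here is a minimal Python sketch of lane-based isolation. It is a conceptual model only: the function names and the lane names "heavy" and "critical" are hypothetical, and the sketch assumes each worker is bound to exactly one lane, which this document does not state.

```python
from collections import defaultdict


def group_by_lane(jobs):
    """Group (job_id, lane) pairs into per-lane queues, preserving arrival order."""
    lanes = defaultdict(list)
    for job_id, lane in jobs:
        lanes[lane].append(job_id)
    return dict(lanes)


def visible_jobs(lanes, worker_lane):
    """A worker bound to one lane only ever sees that lane's queue."""
    return list(lanes.get(worker_lane, []))
```

With this shape, a backlog of slow jobs in one lane cannot delay workers that serve another lane, because the two queues never mix.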

6. Workers

Workers run entirely as background processes and carry three distinct responsibilities: coordinating job onboarding (Coordinator), executing handlers (Executor), and recovering orphaned buckets (Drain). In the default Full mode a single worker handles all three. For higher-scale deployments these roles can be decoupled — each worker process assigned exactly one responsibility — allowing each plane to be sized, tuned, and scaled independently.

Coordinator Mode — The Brains

Scans the Master DB for pending work, acquires jobs in bulk, and distributes them into Agent Buckets. Also assigns orphaned buckets to available Drain workers for recovery. Does not execute handlers.

Resource Profile: CPU-light, network/I/O-dense. Needs strong connectivity to the Master DB.
Benefit: Because execution is fully isolated, Coordinators continue onboarding work on schedule even when executor nodes are at 100% CPU.

[Diagram: Coordinator Mode]

Executor Mode — The Muscle

Pulls jobs from assigned Agent Buckets and runs your IJobHandler logic. Does not scan the Master DB for scheduling — it only writes status updates and enforces execution deadlines.

Resource Profile: CPU/Memory-heavy. Master-Agnostic — these nodes interact purely with the fast Agent Ephemeral Transport on the hot path.
Benefit: Horizontal scaling. You can grow the executor fleet without adding coordination load to the Master DB.

[Diagram: Executor Mode]

Drain Mode — The Rescue

A dedicated recovery mode. When a worker crashes or loses connectivity, a Drain worker claims its orphaned buckets and redirects all unexecuted and unsaved jobs back to the Master DB.

Resource Profile: Lightweight and short-lived. Safe to terminate once draining completes.
Benefit: Clean, loss-free recovery without disrupting the active Coordinator or Executor fleet.

[Diagram: Drain Mode]

Full Mode (Default)

The all-in-one mode that combines Coordinator, Executor, and Drain into a single process. Recommended for most deployments — split into specialized modes only when scale demands it.
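
One way to picture the relationship between the four modes is as a set of flags, where Full is the union of the three specialized roles. The Python sketch below is a conceptual model, not JobMaster's configuration API: `WorkerMode` and `responsibilities` are hypothetical names, and although flags would technically permit other combinations, this document describes only the three single-role modes and Full.

```python
from enum import Flag, auto


class WorkerMode(Flag):
    COORDINATOR = auto()  # scans the Master DB and fills buckets
    EXECUTOR = auto()     # runs handlers from assigned buckets
    DRAIN = auto()        # recovers orphaned buckets
    FULL = COORDINATOR | EXECUTOR | DRAIN


def responsibilities(mode):
    """List the single roles that a given mode includes."""
    roles = (WorkerMode.COORDINATOR, WorkerMode.EXECUTOR, WorkerMode.DRAIN)
    return [role.name for role in roles if role in mode]
```

Read this way, splitting a Full deployment into specialized workers changes where each role runs, not which roles exist in the cluster.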

7. Next Steps

Now that you understand the architectural concepts, you are ready to configure and scale your cluster:

Configuration References:
- Workers & Lanes — worker modes, parallelism, lanes, and buffer sizing
- Agent Connections — transport providers, connection protection, and decommissioning
- Cluster Configuration — cluster-level defaults and thresholds
Performance Tuning Guide: Learn how to size your Coordinators, Executors, and Buckets for any workload.
- See: Performance Tuning
