
\ Almost every backend grows into the same problem. A single job becomes a chain of jobs. Pull the data, clean it, score it, write the result, send the email. Each step depends on the one before it. Some steps can run side by side. Some steps fail and have to be retried. Some steps wait hours for a human to approve them. You start with a cron line and a script, and within a year you have a tangle of scripts that call other scripts, no record of what ran, and a pager that goes off every time one of them dies halfway through. A workflow orchestration engine is the system that takes over this mess. It runs multi-step processes across many machines, runs each step in the right order, retries the steps that fail, keeps a record of everything, and keeps going when a machine crashes. Apache Airflow runs the data pipelines behind a lot of companies. Temporal and Cadence run payments and order flows at companies like Uber and Coinbase. Netflix built its own engine, Conductor, to orchestrate microservices. AWS Step Functions glues together serverless functions. They look different on the surface and share the same skeleton underneath. This post builds that skeleton from the ground up. We will define what the engine has to do, draw the core architecture, model workflows as a graph, walk through the two execution models that every engine picks between, and then work through the hard parts: state, unique IDs, queues, retries, scaling, timers, and the failure modes that show up in production. What a Workflow Engine Has To Do Before drawing boxes, pin down the job. The requirements split into two groups. Functional requirements Run a set of tasks in an order defined by their dependencies. Run independent tasks in parallel and dependent tasks in sequence. Trigger workflows on a schedule (cron), on an event, or by an API call. Retry a task automatically when it fails, with a limit. Pass small results from one task to the next. Pause, resume, and cancel a running workflow. Show the full history and current state of every run. Non functional requirements Durability. Once the engine accepts a workflow, it must not lose it, even if a node dies one second later. At-least-once execution. Every task that is due runs at least one time, even across crashes. Scale. The design should handle millions of tasks per day across many workers. Fault tolerance. A crashed worker must not block other workflows, and its work must be picked up by another worker. Visibility. Every state change is queryable and traceable, which matters as much for debugging as for audits. One requirement deserves a note up front. People ask for exactly once execution. In a distributed system you cannot truly get it, for the same reason a TCP connection cannot promise a message is delivered exactly once. What you can build is at-least-once delivery plus idempotent tasks, which together behave like exactly once from the outside. We come back to this later because it shapes the whole design. The Core Architecture The single most important idea is to separate deciding when a task runs from actually running it . These are two different problems with two different scaling needs, so they become two different components. A clean engine has four parts, plus an API in front. \ Each part has one job. The API or frontend accepts workflow definitions, start requests, signals, and cancellations. It writes to the state store and returns fast. It never runs task code. The scheduler watches the clock and the dependency graph. It asks the store which tasks are now ready to run and pushes them onto the queue. It is the brain. The task queue holds ready tasks and hands them out. It absorbs bursts, so a thousand tasks becoming ready at once does not knock over the workers. The workers are stateless. They pull a task, run its code, and report the result back to the store. You scale throughput by adding workers. The state store is the durable source of truth. It holds the workflow definitions, the history of every run, and the status of every task. Everything else can crash and rebuild from here. Why decouple the queue from the workers at all? Because load is bursty. At midnight a thousand daily workflows may all become due in the same second. If the scheduler called workers directly, that spike would hit them all at once. The queue absorbs the spike and lets workers drain it at a steady rate. It also lets you scale workers up and down without touching the scheduler. This is the same reason queues show up across system design, which I cover in Role of queues in system design . Modeling Workflows as a DAG A workflow is a set of tasks with dependencies between them. The natural way to represent that is a directed acyclic graph , a DAG. Each node is a task. Each edge means “this task must finish before that one starts”. Acyclic means there are no loops, because a loop would be a task that waits for itself, which can never start. \ The DAG gives the engine everything it needs to order the work. Clean runs only after Extract . Score Model and Build Report both depend on Clean , so they can run in parallel . Load Warehouse waits for both of them. Notify runs last. The engine walks the graph in topological order: a task becomes ready the moment all of its parents have succeeded. Two common ways to define a DAG show up in practice. As data. A YAML or JSON file lists tasks and their dependencies. AWS Step Functions uses a JSON dialect called Amazon States Language. Argo Workflows uses Kubernetes YAML. This is easy to validate and visualize, harder to express complex logic in. As code. Airflow writes DAGs in Python. Temporal writes whole workflows as ordinary functions in Go, Java, TypeScript, or Python. Code is more expressive, and in durable execution engines the code itself is the workflow. A minimal definition as data looks like this: workflow: daily_report schedule: "0 2 * * *" # 2 AM every day, standard cron tasks: - id: extract run: jobs.extract - id: clean run: jobs.clean needs: [extract] - id: score run: jobs.score needs: [clean] - id: report run: jobs.report needs: [clean] - id: load run: jobs.load needs: [score, report] - id: notify run: jobs.notify needs: [load] The engine reads this, builds the graph, and from then on the needs lists are all it consults to decide what is ready. Scheduling: Deciding When a Workflow Runs A workflow starts for one of three reasons: a clock (a schedule), an event, or a direct API call. The scheduling part is about the clock. Recurring workflows express their schedule with cron . The engine stores the cron string, computes a concrete next run time from it, and creates a new run when that time arrives. A classic cron expression has five fields. \ minute hour day-of-month month day-of-week 0 2 * * * -> 2 AM every day If you are wiring up a schedule and want to be sure it means what you think, it helps to translate the expression to plain English and preview the next few run times before you deploy it. The Cron Expression Translator does exactly that. The JVM world is a common exception. Engines built on Quartz, and Spring’s @Scheduled annotation, use Quartz cron , which adds a seconds field at the front and an optional year at the end, giving six or seven fields instead of five. The two formats look almost identical and are not interchangeable, and pasting a five field Linux cron string into a Quartz scheduler is a frequent cause of jobs that silently never fire. If your engine runs on the JVM, validate the expression with the Quartz Cron Generator . The scheduler also has to answer a harder question: what happens if it was down for an hour and three runs were missed? This is the misfire problem, and there are two honest policies. Catch up (backfill). Run every missed occurrence in order. Correct for accounting and data pipelines where every interval must be processed. Airflow calls this catchup and it is on by default, which surprises people the first time a paused DAG wakes up and fires a hundred runs. Skip. Run only the next future occurrence and forget the missed ones. Right for jobs like cache refreshes where a stale run adds no value. Make this an explicit setting per workflow. Teams that leave it implicit get surprised after their first outage. There are also three ways the scheduler can find due work, and the choice is a real trade-off. | Strategy | How it works | Cost | |----|----|----| | Polling | Every second, query the store for tasks where next_run_at <= now() | Simple and reliable. A small delay up to the poll interval. | | Timer wheel | Keep upcoming tasks in an in-memory structure sorted by time | Very precise, but must rebuild from the store after a restart. | | Push | An external timer service fires an event when a task is due | Real time, but adds another moving part to depend on. | For most engines, polling with a short interval is the right default. It is boring, and boring is good for the component you are trusting with billing runs. The Two Execution Models This is the fork in the road. Every workflow engine picks one of two models for how a workflow makes progress, and the choice shapes everything else. Model 1: Task Based Scheduling This is the Airflow model, and it is the one most people picture when they hear “scheduler”. Each task is an independent, stateless unit. The engine stores one row per task instance in a metadata database with its status: none , scheduled , queued , running , success , or failed . The scheduler loops over active workflow runs, checks which tasks have all their parents in success , and moves those to scheduled . The executor picks them up and runs them. The worker reports the result back to the store, and the loop continues until the whole DAG is done. \ The strength of this model is its simplicity. State lives in one place, the database, and the scheduler is a loop you can reason about. The cost is that tasks are stateless and a retry starts from scratch . If a task is step four of a long function and it fails, there is no memory of steps one through three. Anything you want to keep between tasks has to be written somewhere external. Airflow passes small values through a mechanism called XComs stored in the metadata database, and large data through a shared store like S3. This is fine for data pipelines, where each task is a chunky batch step and “rerun the step” is the natural unit of recovery. The metadata database is both the strength and the weakness. Every status change is a write, and the scheduler polls it constantly. At scale it becomes the bottleneck, and the usual fixes are connection pooling with PgBouncer, careful indexing, and running multiple schedulers active-active so one slow loop does not stall the rest. Model 2: Durable Execution This is the Temporal and Cadence model, and it is a genuinely different idea. You write the whole workflow as a single function, with loops, conditionals, and variables, in a normal language. It looks like ordinary code: @workflow.defn class SubscriptionWorkflow: @workflow.run async def run(self, user_id: str): await workflow.execute_activity(charge_card, user_id, start_to_close_timeout=timedelta(seconds=30)) await workflow.sleep(timedelta(days=30)) # durable sleep await workflow.execute_activity(send_renewal_notice, user_id, start_to_close_timeout=timedelta(seconds=30)) That workflow.sleep for thirty days is not a thread sleeping. The workflow is removed from the worker’s memory entirely, and thirty days later the engine schedules it onto some worker, which rebuilds its state and continues. The trick that makes this possible is event sourcing . The engine never trusts the worker’s memory. Instead, every step the workflow takes is appended to an immutable event history in the store before it is acknowledged: workflow started, activity scheduled, activity completed with this result, timer started, timer fired, and so on. When a worker crashes mid workflow, the engine hands the workflow to a new worker. The new worker replays the event history from the beginning, running the workflow code again, but instead of actually calling the activities it feeds in the recorded results from history. Within milliseconds the code reaches the exact point where it left off, with the exact same variables, and then it continues for real. Two consequences fall out of this design, and they trip people up. Workflow code must be deterministic. Because the engine replays it, the same history has to produce the same decisions every time. No reading the wall clock directly, no random numbers, no calling an API from inside the workflow function. Anything with a side effect or a non deterministic result is wrapped as an activity , which is recorded once and replayed from history, never re-run. You get durability for free in the application. No scattering retry logic and state tracking across services. The engine remembers, so the code reads top to bottom even though it might run across weeks and survive a dozen deploys. Durable execution costs more to operate and to learn, and it is overkill for a nightly ETL job. It earns its keep for long running, stateful business processes: payments, order fulfillment, user onboarding, anything that touches several services and absolutely must finish. Which One To Pick | Dimension | Task based (Airflow) | Durable execution (Temporal) | |----|----|----| | Unit of work | A task in a DAG | A workflow function plus activities | | State between steps | External (XComs, S3) | In the replayed event history | | Retry granularity | Whole task from the start | Per activity, workflow keeps its place | | Long waits | Sensors that occupy slots | Durable sleep, zero resources while waiting | | Best fit | Scheduled data pipelines, ETL, ML | Stateful business logic, microservice orchestration | | Main cost | Metadata DB becomes the bottleneck | Determinism rules, steeper learning curve | If your work is “run these batch steps on a schedule”, task based is simpler and you should reach for it. If your work is “run this business process reliably for days while services fail around it”, durable execution is worth the cost. The State Store and Identifying Every Run The state store is the source of truth, so its schema matters. A task based engine typically uses two tables: one for the recurring definition, one for individual runs. CREATE TABLE workflows ( id BIGINT PRIMARY KEY, name TEXT NOT NULL, definition JSONB NOT NULL, -- the DAG cron TEXT, -- null for non-recurring next_run_at TIMESTAMPTZ, enabled BOOLEAN DEFAULT TRUE ); CREATE TABLE task_runs ( id BIGINT PRIMARY KEY, -- unique per task instance workflow_run BIGINT NOT NULL, -- which run this belongs to task_id TEXT NOT NULL, -- which node in the DAG status TEXT NOT NULL, -- SCHEDULED, RUNNING, SUCCESS, FAILED, DEAD attempt INT NOT NULL DEFAULT 1, scheduled_at TIMESTAMPTZ, started_at TIMESTAMPTZ, lease_until TIMESTAMPTZ -- worker lease, for crash detection ); CREATE INDEX idx_due ON workflows (next_run_at) WHERE enabled = TRUE; CREATE INDEX idx_ready ON task_runs (workflow_run, status); The partial index on next_run_at is the workhorse behind the scheduler’s polling query. It asks “which workflows are due now” many times per second, and this index keeps that query fast even with millions of rows. Now the part that connects directly to running across many machines: every workflow run and every task instance needs a unique ID . This sounds trivial until you have ten scheduler nodes and many workers all minting IDs at once. An auto increment column forces every insert through one database sequence, which becomes a contention point and ties you to a single database. The distributed answer is a Snowflake style 64-bit ID : a timestamp, a machine ID, and a per-millisecond counter packed into one number. Any node can mint one with no coordination, the IDs are roughly sortable by time (which keeps database index pages happy), and they fit in a BIGINT . 64-bit Snowflake ID +-----------------------------------------------------------+ | 1 bit | 41 bits timestamp | 10 bits machine | 12 bits seq | | unused| ms since epoch | node id | per-ms | +-----------------------------------------------------------+ This is exactly why distributed schedulers reach for Snowflake IDs for run identifiers. If you want to see the structure, you can paste a real ID into the Snowflake ID Decoder and watch the timestamp, machine, and sequence split apart, and How Snowflake IDs work explains the 64-bit layout in full. A durable execution engine takes this further: it stores an ordered event history per workflow, and the position in that history (often called an event ID) has to be unique and monotonic so replay is deterministic. The Task Queue and the Workers Between the scheduler and the workers sits the queue. The scheduler is a producer , the workers are consumers , and the queue decouples their rates. This is the same producer and consumer split that makes Kafka useful, and the broker choice is a real decision covered in Kafka vs RabbitMQ vs SQS . There is an important refinement that durable execution engines lean on, called the matching service and task queues by name. Workers do not get tasks pushed at them. Instead they poll named task queues for work, a pull model. This matters for two reasons. First, a worker only pulls when it has spare capacity, which gives you natural backpressure: a slow worker simply polls less often and the queue holds the rest. Second, you can route specific task types to specific worker pools by giving them their own queue, so a GPU heavy scoring task and a tiny notification task never compete for the same workers. \ When a worker pulls a task it takes a lease : it writes lease_until = now() + 30s on the task row and renews it while the task runs. If the worker crashes, the lease expires, and the scheduler treats the task as failed and reschedules it. The lease is what lets the engine detect a dead worker without the worker telling it anything. It is the same idea as a visibility timeout in a message queue. Reliability: Retries, Backoff, and Idempotency Tasks fail. Networks drop, gateways time out, a dependency is briefly down. The engine’s job is to retry without making things worse. The standard policy is retry with exponential backoff and a cap . First retry after 1 second, then 2, then 4, then 8, up to a ceiling, with a little randomness (jitter) so a thousand tasks that failed together do not all retry in the same instant and create a thundering herd. After a fixed number of attempts the task moves to a terminal dead state and stops, so a permanently broken task does not loop forever and burn workers. def next_backoff(attempt, base=1.0, cap=60.0): delay = min(cap, base * (2 ** (attempt - 1))) return delay * random.uniform(0.5, 1.0) # jitter Retries are where the exactly once question comes home. Because the queue is at-least-once, a task can run twice: a worker finishes the work, crashes before reporting success, the lease expires, and another worker runs it again. If the task charges a card, you just charged twice. The fix is to make every task idempotent , so running it again is safe. Give each task an idempotency key , often derived from the run ID and task ID, and have the downstream service ignore a repeat of the same key. This is the same pattern behind how Stripe prevents double payments . Write results with a unique constraint so a duplicate insert fails harmlessly. Check whether the work is already done before doing it. The general technique is the Idempotent Receiver pattern, and it is not optional. If any task can produce a duplicate effect, you will see duplicates the first time a worker crashes at the wrong moment. For workflows that change state across several services, a plain retry is not enough, because you may need to undo earlier steps when a later one fails for good. That is the Saga pattern: each step has a matching compensating step, and the engine runs the compensations in reverse when the workflow cannot complete. Durable execution engines make sagas natural because the workflow code can simply try the forward steps and run compensations in the except block, with the engine remembering exactly how far it got. Scaling the Engine The three tiers scale on different signals, and you scale them independently. Workers scale on queue depth. When tasks pile up in a queue, add workers. When the queue is empty, scale them down. Because workers are stateless and pull their work, this is the easy tier: it is just horizontal scaling behind a queue. The scheduler scales with care, because there must be one decision maker per workflow. If two schedulers both decide the same task is due, you get duplicate runs. Two approaches solve this. Leader election. One scheduler is the leader and does the deciding, the rest stand by. If the leader dies, the standbys elect a new one and take over in seconds. Election needs a majority agreement so two leaders never coexist, which is the majority quorum idea, usually delegated to a system like ZooKeeper, etcd, or a database lock. Sharding the workflows. Split the set of workflows across scheduler nodes so each node owns a slice and no two nodes touch the same workflow. Airflow 2 took a different route and allows multiple active schedulers that coordinate through row level locks in the metadata database, which is a pragmatic version of the same idea. The state store is usually the first real bottleneck , and the answer is sharding. Distribute workflow state across partitions using a hash of the workflow ID : shard = hash(workflow_id) % number_of_shards Each shard owns a slice of the workflows and can be served by a different node. This is the design Cadence and Temporal use, where they call them history shards , and ownership of shards rebalances automatically as nodes join and leave. Picking the shard by hashing the ID is consistent hashing in action, the same technique that spreads keys across a cache cluster, and it keeps all the events for one workflow together on one shard so replay is local and fast. One more rule that keeps the store healthy: do not put large payloads in it . The store holds control data, the status and history, not the gigabyte of data your task is crunching. Pass a pointer (an S3 key, a row ID) between tasks, not the data itself. Engines that ignore this end up with a metadata database full of blobs and a scheduler that crawls. Timers and Durable Sleep A workflow often needs to wait. Wait five minutes before retrying a payment. Wait two days for a human to approve. Wait until the first of the month. The naive way, sleeping a thread, does not survive a restart and wastes a worker for the whole wait. The engine handles waits as durable timers . When a workflow asks to sleep, the engine records a timer in the store with a fire time and frees the worker. A timer service watches for due timers, the same polling or timer wheel choice as the scheduler, and when the time arrives it wakes the workflow by appending a “timer fired” event and scheduling it onto a worker. The wait costs nothing while it is pending because no worker is tied up. This is what lets a Temporal workflow sleep for a month, and it is why the scheduler and the timer service often share the same machinery. A subtle point: a timer’s fire time has to be durable and exact across restarts, and ordering matters when many timers fire at once. Some engines lean on logical clocks rather than wall clocks for ordering across nodes, which is the world of Lamport clocks and hybrid logical clocks. For a single engine, a monotonic timestamp plus the unique event ID is usually enough to order things correctly. Observability: Knowing What Ran A workflow engine without visibility is a black box that occasionally pages you. Because the engine already records every state change, it is sitting on the data it needs, and three views matter. Per workflow history. The full event log of a run: which task started when, what it returned, which retries happened, where it is now. This is the first thing you open when a run misbehaves. Metrics. Queue depth, task latency, retry rate, worker saturation, and scheduler loop time. Rising queue depth means add workers. A rising retry rate means a dependency is sick. Distributed traces. A workflow spans many services, so a single trace ID threaded through every task lets you follow one run across the whole system. The tooling choices are covered in Distributed tracing: Jaeger vs Tempo vs Zipkin , and wiring it up cleanly is the subject of the OpenTelemetry production setup guide. Failure Modes and Their Fixes The same handful of failures show up in every workflow engine postmortem. | Failure | Symptom | Fix | |----|----|----| | Worker crashes mid task | Task stuck in RUNNING forever | Worker leases with a TTL; expired lease reschedules the task | | Scheduler crashes | No new tasks get scheduled | Leader election with fast failover to a standby scheduler | | Two schedulers both fire a task | Duplicate runs | Single leader, or shard workflows so one node owns each | | Queue redelivers a task | Same task runs twice | Idempotent tasks plus an idempotency key, not “hope it does not happen” | | Non deterministic workflow code | Replay diverges, workflow gets stuck | Keep all side effects in activities; never read the clock or random in workflow code | | Metadata store overloaded | Scheduler loop slows, latency climbs | Index the hot queries, shard by workflow ID, keep payloads out of the store | | Poison task retries forever | Workers burn on one broken task | Cap retries, move to a dead letter state, alert a human | | Backfill storm after an outage | A paused workflow wakes and fires 100 runs | Choose skip vs catch up per workflow on purpose | | Clock skew across nodes | Timers fire early or in the wrong order | NTP, monotonic clocks, logical clocks for cross node ordering | A useful safety valve for the worst case is the circuit breaker. If a downstream dependency starts failing past a threshold, the breaker opens and the engine stops scheduling tasks that need it, holding them instead of throwing them at a service that is already down. Better to pause cleanly than to hammer a sick system. A Minimal Working Reference The whole idea fits in a small sketch. This is a polling scheduler and a worker for a task based engine, with the parts that matter and none of the production hardening. It is meant to make the boxes concrete, not to run your business. import time, random # --- scheduler: find ready tasks, push to the queue --- def schedule_loop(store, queue): while True: # start any workflow whose cron is due for wf in store.due_workflows(now=time.time()): run_id = store.create_run(wf.id) # Snowflake-style unique id store.seed_tasks(run_id, wf.definition) # all tasks -> PENDING # promote tasks whose parents have all succeeded for task in store.pending_tasks(): if store.all_parents_succeeded(task): store.set_status(task.id, "SCHEDULED") queue.push(task.id) time.sleep(1) # poll interval # --- worker: pull a task, run it, report back --- def worker_loop(store, queue, registry): while True: task_id = queue.pull(lease_seconds=30) if task_id is None: continue task = store.load(task_id) store.set_status(task_id, "RUNNING") try: registry[task.task_id](task.input) # run the real code store.set_status(task_id, "SUCCESS") except Exception: if task.attempt >= MAX_ATTEMPTS: store.set_status(task_id, "DEAD") # terminal, alert a human else: delay = min(60, 2 ** (task.attempt - 1)) * random.uniform(0.5, 1.0) store.retry_later(task_id, delay) # backoff with jitter To turn this into something you would actually run you add worker lease renewal, idempotency keys on every task, the partial index behind due_workflows , leader election around the scheduler loop, metrics on queue depth, durable timers for sleeps, and sharding once one store cannot keep up. The shape does not change. Comparison With Real Engines The engines you will actually use are different points in the same design space. Knowing where each one sits tells you which to reach for. | Engine | Model | Workflow defined as | Best fit | |----|----|----|----| | Apache Airflow | Task based scheduler | Python DAG files | Scheduled ETL and data pipelines | | Temporal | Durable execution | Code (Go, Java, TS, Python, .NET) | Long running, stateful business logic | | Cadence | Durable execution | Code (Go, Java) | The predecessor to Temporal, built at Uber | | Netflix Conductor | Task based, JSON defined | JSON workflow definitions | Microservice orchestration with a UI | | AWS Step Functions | Task based, state machine | Amazon States Language (JSON) | Serverless and cross-AWS orchestration | | Argo Workflows | Task based, container native | Kubernetes YAML | Container and CI/CD pipelines on Kubernetes | | Prefect / Dagster | Task based, Python first | Python | Modern data orchestration with better local dev | A few patterns are worth pulling out. Airflow, Conductor, Step Functions, and Argo all keep the task based shape: a DAG of stateless steps with state in a store, retries from the start, and a UI to rerun failed tasks. Temporal and Cadence are the durable execution camp, where the workflow is code and the engine replays history. Cadence came out of Uber, and Temporal is its successor built by the same engineers, which is why they share so much. The cloud and Kubernetes options (Step Functions, Argo) trade some flexibility for not having to operate the engine yourself. The honest advice: use one of these, do not build your own , unless you are learning or you have a genuinely narrow need they make awkward. The value of understanding the internals is that you use them better. When an Airflow task reruns from scratch, you know why. When a Temporal workflow replays, you know what is happening. When the metadata database is your bottleneck, you know which knob to turn. Practical Lessons for Developers A few things become obvious once you have run an engine in production. Separate the decision from the work. The scheduler decides, the workers do. Merge them and a slow task starves the scheduling loop, which is how a single bad job stalls everything. Idempotency is the price of admission. Every task, retry, timer, and signal must be safe to run twice, because at some point it will be. Bake in idempotency keys and unique constraints from day one. Keep workflow logic deterministic. If you use durable execution, the clock, randomness, and any network call belong in activities, never in the workflow function. The store is for control, not cargo. Status, history, and pointers go in the state store. Actual data goes in object storage, with only a reference passed between tasks. Decide backfill behavior on purpose. Catch up versus skip is a one line setting and a huge behavioral difference. Choose it for each workflow before the first outage chooses for you. Plan for the scheduler to die. Leader election with a fast failover turns a scheduler crash from an outage into a blip. Test the failover before you need it. Wrapping Up A workflow orchestration engine looks like a scheduler from the outside and is a tour of every interesting problem in distributed systems from the inside. A graph of dependencies. A decision loop split from a worker pool. A durable store that is the only thing you trust. Unique IDs minted with no coordination. At-least-once delivery turned into exactly once effects by idempotent tasks. Event sourcing and replay. Sharding by a hash of the ID. Leader election so the brain can fail over. The two models are the thing to hold onto. Task based engines like Airflow track each step in a store and retry it from the start, which is simple and perfect for data pipelines. Durable execution engines like Temporal record a history and replay it, which keeps a workflow alive across crashes and fits long running business logic. Pick the one that matches your work, lean on idempotency and unique IDs to stay correct, and shard the store before it becomes the bottleneck. \
View original source — Hacker Noon ↗



