Robust ETL Processes
5 principles that separate a robust ETL process from a fragile one
Get these five right, and problems stop being emergencies.
1️⃣ Deterministic
The same input must always produce the same output.
The classic trap is hidden non-determinism, like a ROW_NUMBER ranking over an ORDER BY that isn't unique: when rows tie, the engine is free to pick a different winner on every run. If a process returns different results across runs, you can't trust the numbers, and you can't debug it either.
2️⃣ Repeatable
Running the same load twice must leave the data unchanged.
It must not create duplicate rows or alter any values, and records that have already been processed must not be treated as new on a second run. This is idempotency, and it's what lets you safely re-run a load without wondering what it might break.
3️⃣ Fault-tolerant
Bad data should never stop the pipeline.
Detect it, track it, and decide upfront how to handle it: ignore, correct with a default, or quarantine for review. The same goes for missing deliveries. When a source sends nothing, the process must not read an empty input as "all records deleted" and wipe what is already there. It should recognize the empty case and leave the existing data untouched.
4️⃣ Restartable
A failed run must be safe to restart.
Restarting it should require no manual cleanup and no fixing of half-processed data by hand. It picks up where it stopped and finishes the job. You can also let it retry automatically, but cap the number of attempts, so a recurring problem doesn't flood the logs with endless restarts.
5️⃣ Backlog-ready
When several deliveries pile up, one run should process all of them.
They must be applied in the order they arrived. The sequence matters, especially for historized data, because applying an older load after a newer one leaves the history out of step with what really happened. You shouldn't have to trigger the same pipeline five times to work through five files that came in overnight.