Hash Functions

Why Are Hash Functions Useful?

  • Hash keys as surrogate keys for composite natural keys significantly simplify joins.
  • Hash keys enable parallel loading of multiple tables, because no key lookups on existing tables are required.
  • Hash diffs make it easier to detect changes, especially when many columns need to be compared.

What Should Be Considered?

  • There is always a theoretical risk of hash collisions. With sufficiently large hash values, this risk is negligible in practice. Nevertheless, ETL processes should ideally be designed to detect potential collisions.
  • When hash values are derived from multiple columns, some databases require these values to be concatenated into a single string. The concatenation must be unambiguous, and suitable separators should be chosen to avoid ambiguities.
  • Hash keys consume more storage than numeric surrogate keys and can increase index sizes.
  • Computing a hash value takes time; the more complex the hash function and the wider its input, the higher the computational cost.
Note: These are common technical trade-offs that typically do not limit the usefulness of hash functions in data warehousing.

Use in Data Warehousing

Possible use cases include:

  • In the Persistent Staging Area (PSA)
    • Hash diffs to efficiently identify changes in source data during Change Data Capture (CDC)
  • In the Data Vault model
    • Hash keys as surrogate keys for business keys (to simplify and parallelize loading)
    • Hash diffs to detect changes in satellites and load only the records that have actually changed
  • In the dimensional model
    • Hash keys as surrogate keys for business keys
    • Hash diffs to identify changes in Slowly Changing Dimensions, especially when attribute sets are wide

Hash Functions

select '''123''' as parameter
     , md5(v)    as md5_value
     , sha1(v)   as sha1_value
     , sha256(v) as sha256_value
  from (select '123' as v)
union all
select 'null'
     , md5(v)
     , sha1(v)
     , sha256(v)
  from (select null as v)
;
parameter|md5_value                       |sha1_value                              |sha256_value                                                    |
---------+--------------------------------+----------------------------------------+----------------------------------------------------------------+
'123'    |202cb962ac59075b964b07152d234b70|40bd001563085fc35165329ea1ff5c5ecbdbbeef|a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3|
null     |                                |                                        |                                                                |