Hash Functions
Why Are Hash Functions Useful?
- Hash keys as surrogate keys for composite natural keys significantly simplify joins.
- Hash keys enable parallel loading of multiple tables, because no key lookups on existing tables are required.
- Hash diffs make it easier to detect changes, especially when many columns need to be compared.
What Should Be Considered?
- There is always a theoretical risk of hash collisions. With sufficiently large hash values, this risk is negligible in practice. Nevertheless, ETL processes should ideally be designed to detect potential collisions.
- When hash values are derived from multiple columns, some databases require these values to be concatenated into a single string. The concatenation must be unambiguous, and suitable separators should be chosen to avoid ambiguities.
- Hash keys consume more storage than numeric surrogate keys and can increase index sizes.
- Computing a hash value takes time; the more complex the hash function and the wider its input, the higher the computational cost.
Note: These are common technical trade-offs that typically do not limit the usefulness of hash functions in data warehousing.
Use in Data Warehousing
Possible use cases include:
- In the Persistent Staging Area (PSA)
- Hash diffs to efficiently identify changes in source data during Change Data Capture (CDC)
- In the Data Vault model
- Hash keys as surrogate keys for business keys (to simplify and parallelize loading)
- Hash diffs to detect changes in satellites and load only the records that have actually changed
- In the dimensional model
- Hash keys as surrogate keys for business keys
- Hash diffs to identify changes in Slowly Changing Dimensions, especially when attribute sets are wide
Hash Functions
select '''123''' as parameter
, md5(v) as md5_value
, sha1(v) as sha1_value
, sha256(v) as sha256_value
from (select '123' as v)
union all
select 'null'
, md5(v)
, sha1(v)
, sha256(v)
from (select null as v)
;
parameter|md5_value |sha1_value |sha256_value | ---------+--------------------------------+----------------------------------------+----------------------------------------------------------------+ '123' |202cb962ac59075b964b07152d234b70|40bd001563085fc35165329ea1ff5c5ecbdbbeef|a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3| null | | | |
Hash Functions
select '''123''' as parameter
, md5('123') as md5_value
, encode(sha256('123'), 'hex') as sha256_value
, encode(sha512('123'), 'hex') as sha512_value
union all
select 'null'
, md5(null)
, encode(sha256(null), 'hex')
, encode(sha512(null), 'hex')
;
parameter|md5_value |sha256_value |sha512_value | ---------+--------------------------------+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ '123' |202cb962ac59075b964b07152d234b70|a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3|3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2| null | | | |
Hash Functions
select '123' as parameter
, md5(123) as md5_value
, sha1(123) as sha1_value
, sha2(123) as sha2_value
, sha2(123, 512) as sha_2_512_value
union all
select '''123'''
, md5('123')
, sha1('123')
, sha2('123')
, sha2('123', 512)
union all
select 'null'
, md5(null)
, sha1(null)
, sha2(null)
, sha2(null, 512)
union all
select 'timestamp ''2025-12-31'''
, md5(timestamp '2025-12-31')
, sha1(timestamp '2025-12-31')
, sha2(timestamp '2025-12-31')
, sha2(timestamp '2025-12-31', 512)
union all
select '''2025-12-31 00:00:00.000'''
, md5('2025-12-31 00:00:00.000')
, sha1('2025-12-31 00:00:00.000')
, sha2('2025-12-31 00:00:00.000')
, sha2('2025-12-31 00:00:00.000', 512)
union all
select 'true'
, md5(true)
, sha1(true)
, sha2(true)
, sha2(true, 512)
union all
select '''true'''
, md5('true')
, sha1('true')
, sha2('true')
, sha2('true', 512)
order by md5_value, parameter
;
PARAMETER |MD5_VALUE |SHA1_VALUE |SHA2_VALUE |SHA_2_512_VALUE | -------------------------+--------------------------------+----------------------------------------+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ '123' |202cb962ac59075b964b07152d234b70|40bd001563085fc35165329ea1ff5c5ecbdbbeef|a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3|3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2| 123 |202cb962ac59075b964b07152d234b70|40bd001563085fc35165329ea1ff5c5ecbdbbeef|a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3|3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2| '2025-12-31 00:00:00.000'|5736d0ca885a2c9442ade32db045d5be|af67befd01be402198eabc000641c9685171ab83|d0b940b223cc0ca639ec6f4089d9ad882f7874e1288355ce2249759a0830cbd7|97f70b23d6e3e567d257a190f71fe48ffbbe2c29d43337938793d349b15533cdf16188251b80e1875fef56131085cc93c67fa831fa81e528aa08138753be3183| timestamp '2025-12-31' |5736d0ca885a2c9442ade32db045d5be|af67befd01be402198eabc000641c9685171ab83|d0b940b223cc0ca639ec6f4089d9ad882f7874e1288355ce2249759a0830cbd7|97f70b23d6e3e567d257a190f71fe48ffbbe2c29d43337938793d349b15533cdf16188251b80e1875fef56131085cc93c67fa831fa81e528aa08138753be3183| 'true' |b326b5062b2f0e69046810717534cb09|5ffe533b830f08a0326348a9160afafc8ada44db|b5bea41b6c623f7c09f1bf24dcae58ebab3c0cdd90ad966bc43a45b44867e12b|9120cd5faef07a08e971ff024a3fcbea1e3a6b44142a6d82ca28c6c42e4f852595bcf53d81d776f10541045abdb7c37950629415d0dc66c8d86c64a5606d32de| true |b326b5062b2f0e69046810717534cb09|5ffe533b830f08a0326348a9160afafc8ada44db|b5bea41b6c623f7c09f1bf24dcae58ebab3c0cdd90ad966bc43a45b44867e12b|9120cd5faef07a08e971ff024a3fcbea1e3a6b44142a6d82ca28c6c42e4f852595bcf53d81d776f10541045abdb7c37950629415d0dc66c8d86c64a5606d32de| null | | | | |