Hashing Strategies for Correlation

Hashing as Privacy Architecture

Tested with: Python 3.12.3, GCC 13.3.0, Pyvorin Edge SDK 1.0.5-edge, Ubuntu 24.04 LTS (x86_64 & ARM64). Run python3 --version and gcc --version to verify your environment.

The hardest problem in edge privacy is not destruction — dropping a reading is trivial — but correlation. You need to know that the temperature spike at 09:14 and the pressure drop at 09:16 came from the same physical device, but you do not want to reveal the device's MAC address, serial number, or location-encoded sensor name to the cloud analytics platform. One-way hashing bridges this gap. It transforms an identifying token into an opaque, fixed-length digest that preserves equality while destroying reversibility.

This article examines the hashing primitives available in the Pyvorin Edge Runtime: raw SHA-256, HMAC with a secret key, and salted variants. We explain when to use each, how salts are managed at the edge, and how deterministic hashing enables longitudinal analytics without exposing raw identifiers. Every technique is accompanied by production-ready Python code that you can drop directly into your agent bootstrap module.

SHA-256: The Default Primitive

The Edge Runtime uses SHA-256 as its default hash function. It is fast, widely implemented, collision-resistant for the input sizes typical in IoT, and produces a 64-character hex digest that is safe to store in JSON, TOML, SQL, and MQTT payloads. Both the simple PrivacyPolicy (in privacy.py) and the advanced PrivacyPolicyEngine (in privacy_firewall/policy.py) invoke hashlib.sha256(...).hexdigest() under the hood.


import hashlib

def pseudonymise(value: str) -> str:
    """Return a deterministic SHA-256 pseudonym."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

raw_mac = "aa:bb:cc:dd:ee:ff"
pseudo_mac = pseudonymise(raw_mac)
# pseudo_mac is a 64-character lowercase hex string.

The critical property here is determinism: the same input always yields the same digest. This means that if you hash a device MAC address once per minute for a year, every reading will carry the same pseudonym. Your cloud dashboard can group by that pseudonym to compute per-device uptime, mean time between failures, or drift trends — all without ever learning the real MAC address.
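
As a rough illustration of that grouping step, the sketch below pseudonymises a few readings on the edge side and then averages them per pseudonym on the cloud side. The reading tuples and the mean_temp name are hypothetical, chosen only for this example; a real dashboard would run the equivalent GROUP BY in its own query layer.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def pseudonymise(value: str) -> str:
    """Return a deterministic SHA-256 pseudonym (same helper as above)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical edge-side readings: (raw MAC, temperature in Celsius).
raw_readings = [
    ("aa:bb:cc:dd:ee:ff", 21.0),
    ("aa:bb:cc:dd:ee:ff", 23.0),
    ("11:22:33:44:55:66", 30.0),
]

# Edge side: replace the MAC with its pseudonym before upload.
uploaded = [(pseudonymise(mac), temp) for mac, temp in raw_readings]

# Cloud side: group by pseudonym; the raw MAC is never needed.
by_device = defaultdict(list)
for pseudo, temp in uploaded:
    by_device[pseudo].append(temp)

mean_temp = {pseudo: mean(temps) for pseudo, temps in by_device.items()}
```

The cloud side only ever handles the 64-character digests, yet both readings from the first device land in the same group.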

HMAC: Keyed Hashing for Stronger Guarantees

Raw SHA-256 is vulnerable to precomputed-table (rainbow-table) and brute-force attacks if the input space is small and predictable. A MAC address has at most 48 bits of entropy, and the first 24 bits are often the Organisationally Unique Identifier (OUI) of the manufacturer. An attacker who knows or guesses the OUI from a list of known manufacturers only has to try the remaining 24 bits, about 16.8 million candidates per OUI, which a laptop can do in minutes.
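
The following sketch shows how little work such an attack needs. It is a plain brute-force loop rather than a true rainbow table, and to keep the run short it assumes the attacker already knows the first four bytes and searches only the last two (65,536 candidates); the recover_mac helper and its parameters are hypothetical names for this example.

```python
import hashlib
from itertools import product
from typing import Optional


def pseudonymise(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover_mac(target_digest: str, known_prefix: str, unknown_bytes: int) -> Optional[str]:
    """Try every value of the unknown trailing bytes until the digest matches."""
    for combo in product(range(256), repeat=unknown_bytes):
        suffix = ":".join(f"{b:02x}" for b in combo)
        candidate = f"{known_prefix}:{suffix}"
        if pseudonymise(candidate) == target_digest:
            return candidate
    return None


# The attacker sees only the digest, but knows (or guesses) the prefix.
leaked = pseudonymise("aa:bb:cc:dd:ee:ff")
recovered = recover_mac(leaked, known_prefix="aa:bb:cc:dd", unknown_bytes=2)
```

Setting unknown_bytes=3 with a three-byte OUI prefix covers the full 24-bit case described above; it takes longer in pure Python but follows exactly the same logic.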

HMAC-SHA-256 solves this by introducing a secret key. The digest becomes HMAC(key, input). Without the key, the attacker cannot compute the digest of any candidate input, so neither a pre-computed table nor an offline brute-force search is possible; any table would have to be generated per key, not per hash function. The Edge Runtime does not currently expose HMAC directly in PrivacyPolicyEngine, but you can compose it easily in custom code.


import hashlib
import hmac
import os

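# Load a 32-byte secret from the environment or a hardware security module.
# Never hard-code this value in source control.
# Fail fast if the key is missing: a random fallback would change every
# pseudonym on each restart and silently break correlation.
SECRET_KEY = os.environb.get(b"PYV_EDGE_HASH_KEY")
if not SECRET_KEY:
    raise RuntimeError("PYV_EDGE_HASH_KEY is not set")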

def hmac_pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed HMAC-SHA-256 pseudonym."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

raw_id = "sensor-lobby-42"
pseudo_id = hmac_pseudonymise(raw_id)
# pseudo_id is deterministic given the same key, but computationally irreversible.

Key management is the operational challenge. The secret must be:

Generated once per device (or per fleet) and never transmitted over the same channel as the hashed data.
Stored in a location that survives reboots but is not world-readable. On Linux, a file under /etc/pyvorin-edge/.hash_key with permissions 0600 is acceptable; a TPM-backed key slot is better.
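Rotated periodically. Rotation changes every pseudonym, which breaks longitudinal correlation. Plan for a rotation window during which pseudonyms under both the old and new keys are available, so that old and new records can be linked. Re-hashing historical local data for continuity is only possible where the raw identifiers are still held on the device.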
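
One possible reading of the storage and rotation points above is sketched here. It assumes a key file on disk, rejects files that are readable by group or others, and, during a rotation window, emits the pseudonym under both keys so the cloud can link them. The function names, the dictionary keys, and the 32-byte minimum are assumptions for illustration, not part of the Edge Runtime.

```python
import hashlib
import hmac
import os
import stat


def load_hash_key(path: str) -> bytes:
    """Read a key file, refusing files accessible by group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} has mode {mode:o}; expected 0600 or stricter")
    with open(path, "rb") as f:
        key = f.read()
    if len(key) < 32:  # Assumed minimum, matching the 32-byte secret above.
        raise ValueError("hash key must be at least 32 bytes")
    return key


def rotation_pseudonyms(value: str, old_key: bytes, new_key: bytes) -> dict:
    """During a rotation window, compute the pseudonym under both keys."""
    data = value.encode("utf-8")
    return {
        "pseudonym": hmac.new(new_key, data, hashlib.sha256).hexdigest(),
        "previous_pseudonym": hmac.new(old_key, data, hashlib.sha256).hexdigest(),
    }
```

Once the cloud has recorded the old-to-new mapping for every active device, the old key can be retired and the previous_pseudonym field dropped from the payload.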

Salting: Per-Value Randomness

A salt is a random value combined with the input before hashing. Salting defeats rainbow-table attacks for the same reason HMAC does: the attacker cannot pre-compute tables for every possible salt. However, a salt must be stored alongside the digest so that the same input can be re-hashed later for comparison. In the edge context, a fresh salt per value is usually undesirable: it increases payload size and, more importantly, gives the same identifier a different digest every time, which destroys the equality that correlation depends on.
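
To make the trade-off concrete, the sketch below hashes the same identifier twice with a fresh random salt each time. The helper names are hypothetical, and the salt is prepended here (the position does not matter for this illustration). The two digests differ, so a cloud-side group-by would treat them as two devices; equality can only be checked by re-hashing a known candidate with the stored salt.

```python
import hashlib
import os


def per_value_salted(value: str) -> tuple[str, str]:
    """Hash with a fresh random salt; returns (salt_hex, digest)."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return salt.hex(), digest


def verify(value: str, salt_hex: str, digest: str) -> bool:
    """Re-hash a candidate value with the stored salt and compare."""
    salt = bytes.fromhex(salt_hex)
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest() == digest


# Same input, two independent salts, two unrelated digests.
salt_a, digest_a = per_value_salted("sensor-lobby-42")
salt_b, digest_b = per_value_salted("sensor-lobby-42")
```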

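For IoT telemetry, we recommend per-device static salting instead of per-value salting. Each device has a fixed salt (e.g., the last 8 bytes of its hardware serial number) that is mixed into every hash. The salt is not secret; it is merely an additional input that breaks global rainbow tables. Two consequences follow: an attacker who learns the salt can still brute-force a small input space, and the same raw value hashed on two different devices produces two different pseudonyms.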


import hashlib

def salted_hash(value: str, salt: str) -> str:
    """SHA-256 with a per-device salt."""
    payload = f"{salt}:{value}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Salt is derived from hardware and baked into the device image.
DEVICE_SALT = "a3f7b2d1"

raw_serial = "SN-2024-8892"
pseudo_serial = salted_hash(raw_serial, DEVICE_SALT)

Deterministic Hashing for Device Fingerprinting

One of the most powerful use cases for edge hashing is device fingerprinting without exposure. Imagine a smart-building platform with five thousand BLE beacons. Each beacon broadcasts a unique UUID. The cloud platform needs to count how many distinct beacons were detected in each hour, and it needs to track whether beacon X moved from floor 3 to floor 5. But the UUID itself is sensitive: it could be used to track individual building occupants if correlated with mobile-app data.

By hashing the UUID at the edge, the building gateway sends only pseudonyms to the cloud. The cloud sees beacon_a3f7... and beacon_8e2c... but never the raw UUIDs. Because the hash is deterministic, the cloud can still answer:

"How many unique beacons were seen between 08:00 and 09:00?" — count distinct pseudonyms.
"Which beacon moved from floor 3 to floor 5?" — join on pseudonym across time windows.
"Has this beacon been seen before?" — check set membership of the pseudonym.


from collections import Counter
from pyv_edge_agent.privacy_firewall.policy import PrivacyPolicyEngine, PrivacyRuleset
from pyv_edge_agent.types import SensorReading

ruleset = PrivacyRuleset(hash_fields=["beacon_uuid"])
engine = PrivacyPolicyEngine(ruleset)

# Simulated batch of beacon detections
readings = [
    SensorReading("ble.gateway.north", 1717000000.0, -42.0, "dbm", {"beacon_uuid": "uuid-abc"}),
    SensorReading("ble.gateway.south", 1717000001.0, -55.0, "dbm", {"beacon_uuid": "uuid-abc"}),
    SensorReading("ble.gateway.north", 1717000002.0, -39.0, "dbm", {"beacon_uuid": "uuid-def"}),
]

safe = engine.apply_batch(readings)

# Cloud-side: count distinct pseudonyms per hour
pseudonyms = [r.metadata["beacon_uuid"] for r in safe]
unique_count = len(set(pseudonyms))  # 2
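
The movement question from the list above can be answered in the same way. The sketch below is a cloud-side illustration using plain tuples rather than SensorReading objects; the floor column and the detections rows are hypothetical, and plain SHA-256 stands in for whatever hasher the gateway applies.

```python
import hashlib
from collections import defaultdict


def pseudonymise(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical cloud-side rows: (timestamp, floor, beacon pseudonym).
detections = [
    (1717000000.0, 3, pseudonymise("uuid-abc")),
    (1717000060.0, 3, pseudonymise("uuid-def")),
    (1717003600.0, 5, pseudonymise("uuid-abc")),
]

# Record the sequence of floors each pseudonym visited, in time order,
# collapsing consecutive detections on the same floor.
floors_seen = defaultdict(list)
for ts, floor, pseudo in sorted(detections):
    if not floors_seen[pseudo] or floors_seen[pseudo][-1] != floor:
        floors_seen[pseudo].append(floor)

# A beacon "moved from floor 3 to floor 5" if 3 is directly followed by 5.
moved_3_to_5 = [
    pseudo
    for pseudo, floors in floors_seen.items()
    if any(a == 3 and b == 5 for a, b in zip(floors, floors[1:]))
]
unique_beacons = len(floors_seen)
```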

Truncated Digests for Payload Efficiency

A full SHA-256 digest is 64 hex characters. For high-frequency sensors (e.g., 1 kHz vibration sampling), appending 64 bytes to every metadata field can bloat your egress bill. The simple PrivacyPolicy in privacy.py truncates to 16 characters:


import hashlib

hashed = hashlib.sha256("motion.lounge".encode()).hexdigest()[:16]
# hashed is the first 16 hex characters (64 bits) of the full digest.

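Truncation reduces collision resistance. With 16 hex characters (64 bits), the birthday bound is approximately 2^32: once roughly four billion distinct identifiers have been hashed, an accidental collision becomes likely, and an attacker can find some colliding pair with about 2^32 hash operations, which is feasible on commodity hardware. For fleets of thousands to millions of devices the chance of an accidental collision is negligible, so for most deployments 16 characters is a reasonable trade-off between size and collision risk. Note that truncation does not make a low-entropy input any harder to brute-force than the full digest does.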
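
For a rough feel of the numbers, the standard birthday approximation P ≈ 1 - exp(-n(n-1) / 2^(b+1)) estimates the chance that at least two of n distinct identifiers share a b-bit digest. The helper below is a small sketch of that formula; the function name and the two sample fleet sizes are chosen only for illustration.

```python
import math


def collision_probability(n_values: int, digest_bits: int) -> float:
    """Birthday approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_values * (n_values - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)


p_million = collision_probability(1_000_000, 64)   # one million identifiers
p_four_billion = collision_probability(2**32, 64)  # about 4.3 billion identifiers
```

With a 64-bit digest, one million identifiers give a probability on the order of 10^-8, while 2^32 identifiers give roughly 0.39.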

Comparing SHA-256, HMAC, and Salted Variants

Strategy	Reversibility	Rainbow-Table Resistance	Key Management	Use Case
Raw SHA-256	Computationally hard	Weak for small input spaces	None	Low-sensitivity IDs, large entropy
HMAC-SHA-256	Computationally hard	Strong	Requires secret key	High-sensitivity IDs, regulatory requirements
Salted SHA-256	Computationally hard	Moderate	Requires salt storage	Medium sensitivity, no HSM available
Truncated SHA-256 (16 chars)	Computationally hard (same brute-force exposure as raw SHA-256)	Weak for small input spaces	None	High-frequency telemetry, size-constrained payloads

Full Production Example: Configurable Hashing Pipeline

Below is a complete bootstrap module that reads a hashing strategy from environment variables, initialises the appropriate function, and injects it into a PrivacyPolicyEngine. You can place this in main.py or import it from a custom plugin.


import hashlib
import hmac
import os
from typing import Callable

from pyv_edge_agent.privacy_firewall.policy import PrivacyPolicyEngine, PrivacyRuleset

HashFn = Callable[[str], str]


def _load_key(path: str) -> bytes:
    p = os.path.expanduser(path)
    if os.path.exists(p):
        with open(p, "rb") as f:
            return f.read()
    raise RuntimeError(f"Hash key not found at {p}")


def make_hash_fn(strategy: str) -> HashFn:
    if strategy == "sha256":
        return lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()

    if strategy == "hmac":
        key = _load_key(os.environ.get("PYV_HASH_KEY_PATH", "/etc/pyvorin-edge/.hash_key"))
        return lambda v: hmac.new(key, v.encode("utf-8"), hashlib.sha256).hexdigest()

    if strategy == "salted":
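        # No shared default: a fleet-wide fallback salt would defeat per-device salting.
        salt = os.environ.get("PYV_DEVICE_SALT")
        if not salt:
            raise RuntimeError("PYV_DEVICE_SALT is not set")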
        return lambda v: hashlib.sha256(f"{salt}:{v}".encode("utf-8")).hexdigest()

    if strategy == "truncated":
        return lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()[:16]

    raise ValueError(f"Unknown hash strategy: {strategy}")


def build_engine() -> PrivacyPolicyEngine:
    strategy = os.environ.get("PYV_HASH_STRATEGY", "sha256")
    hash_fn = make_hash_fn(strategy)

    # The engine's _hash_value is a static method; for demonstration we wrap it.
    ruleset = PrivacyRuleset(
        hash_fields=["device_mac", "serial_number", "beacon_uuid"],
    )
    engine = PrivacyPolicyEngine(ruleset)

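    # Override the static hasher with our custom function via monkey-patch.
    # Because the attribute is set on the class, this affects every
    # PrivacyPolicyEngine instance in the process, not just this one.
    # In production, subclass PrivacyPolicyEngine instead.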
    PrivacyPolicyEngine._hash_value = staticmethod(hash_fn)

    return engine


if __name__ == "__main__":
    engine = build_engine()
    print("Hashing engine ready with strategy:", os.environ.get("PYV_HASH_STRATEGY", "sha256"))

Operational Considerations

Audit hash usage. The PrivacyAudit chain logs every policy change. If you rotate a hash key, log the rotation event with a timestamp so that compliance auditors can reconstruct which key was active during any historical window.
Test determinism. Write a unit test that hashes the same value one thousand times and asserts that all digests are identical, and that differently formatted copies of the same identifier (for example, upper-case and lower-case MAC addresses) produce the same digest. Mismatched pseudonyms are usually caused by inconsistent normalisation or encoding of the input, or by a key or salt that differs between devices or restarts.
Avoid hashing timestamps. A timestamp is easy to guess, so its hash can be reversed by enumerating candidate times; hashing therefore provides little privacy benefit while destroying the temporal ordering that most analytics depend on. If you need to obfuscate time, round to the nearest hour or day instead.
Monitor payload size. If you switch from raw identifiers to full 64-character digests, your JSON payload may grow by 20–40 percent. Use truncation or compression if bandwidth is constrained.
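
The determinism and timestamp points above might look like this in practice. This is a minimal sketch under stated assumptions: normalise_identifier lower-cases and trims its input, which suits MAC addresses but not case-sensitive identifiers, and round_to_hour truncates to the start of the hour rather than rounding to the nearest one. Both helper names are hypothetical.

```python
import hashlib
import unicodedata


def normalise_identifier(value: str) -> str:
    """Canonical form before hashing: NFC, trimmed, lower-case."""
    return unicodedata.normalize("NFC", value).strip().lower()


def pseudonymise(value: str) -> str:
    return hashlib.sha256(normalise_identifier(value).encode("utf-8")).hexdigest()


def round_to_hour(epoch_seconds: float) -> int:
    """Coarsen a Unix timestamp to the start of its hour (UTC)."""
    return int(epoch_seconds // 3600) * 3600


# Determinism check: repeated hashing and differently formatted inputs
# should all collapse to a single digest.
digests = {pseudonymise("AA:BB:CC:DD:EE:FF") for _ in range(1000)}
digests.add(pseudonymise(" aa:bb:cc:dd:ee:ff\n"))

coarse_ts = round_to_hour(1717000000.0)
```

In a test suite, the check becomes a single assertion that the digests set contains exactly one element.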

Summary

Hashing is the bridge between privacy and analytics. The Pyvorin Edge Runtime provides raw SHA-256 out of the box, but the architecture is intentionally extensible so that you can upgrade to HMAC or salted variants as your threat model evolves. Deterministic hashing lets you fingerprint devices, correlate events, and compute unique counts without ever exposing raw identifiers to the cloud. Choose the strategy that matches your sensitivity tier, manage your keys and salts with the same rigour you apply to TLS certificates, and always log policy changes in the tamper-evident audit chain.