Disaster Recovery and Data Integrity

When the Edge Fails

Tested with: Python 3.12.3, GCC 13.3.0, Pyvorin Edge SDK 1.0.5-edge, Ubuntu 24.04 LTS (x86_64 & ARM64). Run python3 --version and gcc --version to verify your environment.

Edge gateways operate in hostile environments: warehouses with unreliable power, factories with electromagnetic interference, and remote sites with no on-site IT staff. A sudden power loss can corrupt an SQLite database, a failed SD card can erase months of audit logs, and a firmware bug can deadlock the sync queue. When disaster strikes, you need a recovery playbook that is tested, documented, and executable by a technician with a laptop and a USB cable.

This article covers SQLite WAL recovery, CloudSyncQueue corruption repair, factory reset procedures, automated backup scripts, and how to test recovery without risking production data.

SQLite WAL Recovery

The SQLiteStore in pyv_edge_agent/local_store/sqlite_store.py runs with PRAGMA journal_mode = WAL. Write-Ahead Logging provides better concurrency and crash safety than the default rollback journal, but it introduces two extra files: <db>-wal and <db>-shm. If these files are lost or truncated during a crash, the database may appear corrupt or contain incomplete transactions.

Understanding WAL Mode

<db>-wal — Contains the actual changes that have not yet been checkpointed into the main database file.
<db>-shm — Shared memory file used by readers to map the WAL contents. It is disposable and rebuilt automatically.
Checkpoint — The process of transferring committed WAL pages back into the main database file and truncating the WAL.

Manual Checkpoint and Integrity Check


# Connect to the edge store and force a checkpoint
sqlite3 /var/lib/pyvorin/edge_store.db << 'EOF'
PRAGMA journal_mode;
PRAGMA wal_checkpoint(TRUNCATE);
PRAGMA integrity_check;
EOF

# If integrity_check returns anything other than "ok", the database is damaged.
# Common causes:
#   - Power loss during a write
#   - SD card wear or bad sectors
#   - Filesystem corruption (e.g., ext4 journal replay failure)

Recovering from WAL Corruption

If the WAL file is corrupt but the main database is intact, you can often recover by discarding the WAL and running an integrity check. Any un-checkpointed transactions will be lost, but the committed data is preserved.


#!/bin/bash
set -euo pipefail

DB_PATH="/var/lib/pyvorin/edge_store.db"
WAL_PATH="${DB_PATH}-wal"
SHM_PATH="${DB_PATH}-shm"
BACKUP_DIR="/var/lib/pyvorin/backups/recovery-$(date +%Y%m%d_%H%M%S)"

mkdir -p "${BACKUP_DIR}"

# 1. Stop the Edge Agent to prevent further writes
systemctl stop pyvorin-edge-agent || true

# 2. Copy everything before touching it
cp "${DB_PATH}" "${BACKUP_DIR}/"
cp "${WAL_PATH}" "${BACKUP_DIR}/" 2>/dev/null || true
cp "${SHM_PATH}" "${BACKUP_DIR}/" 2>/dev/null || true

# 3. Attempt a normal checkpoint
sqlite3 "${DB_PATH}" "PRAGMA wal_checkpoint(TRUNCATE);"

# 4. Run integrity check
RESULT=$(sqlite3 "${DB_PATH}" "PRAGMA integrity_check;")

if [ "${RESULT}" != "ok" ]; then
    echo "Integrity check failed: ${RESULT}"

    # 5. Nuclear option: remove WAL and let SQLite rebuild
    rm -f "${WAL_PATH}" "${SHM_PATH}"

    # 6. Re-run integrity check on the truncated database
    RESULT2=$(sqlite3 "${DB_PATH}" "PRAGMA integrity_check;")
    if [ "${RESULT2}" != "ok" ]; then
        echo "Database is unrecoverable. Restore from backup."
        exit 1
    fi
    echo "Recovered by truncating WAL. Uncommitted transactions lost."
else
    echo "Database integrity verified."
fi

# 7. Restart the agent
systemctl start pyvorin-edge-agent

Python Recovery Helper


import sqlite3
import shutil
from datetime import datetime
from pathlib import Path

DB_PATH = Path("/var/lib/pyvorin/edge_store.db")
BACKUP_ROOT = Path("/var/lib/pyvorin/backups")


def recover_sqlite(db_path: Path) -> dict:
    """Attempt WAL recovery on an SQLite database. Returns status dict."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = BACKUP_ROOT / f"recovery-{timestamp}"
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Backup current state
    shutil.copy2(db_path, backup_dir / db_path.name)
    for extra in [db_path.with_suffix(".db-wal"), db_path.with_suffix(".db-shm")]:
        if extra.exists():
            shutil.copy2(extra, backup_dir / extra.name)

    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
        result = conn.execute("PRAGMA integrity_check;").fetchone()[0]
        return {"status": "ok" if result == "ok" else "failed", "detail": result}
    finally:
        conn.close()

Queue Corruption Repair

The CloudSyncQueue in pyv_edge_agent/cloud_sync/queue.py persists outbound messages in SQLite. Corruption here manifests as failed dequeues, missing messages, or foreign key errors (if your schema uses them). The queue table is intentionally simple to maximise recoverability.

Diagnosing Queue Health


from pyv_edge_agent.cloud_sync.queue import CloudSyncQueue

queue = CloudSyncQueue(db_path="/var/lib/pyvorin/sync_queue.db")

# Basic stats
stats = queue.get_stats()
print(f"Pending items:      {stats['depth']}")
print(f"Oldest item:        {stats['oldest_item_timestamp']}")
print(f"Total retries:      {stats['total_retries']}")
print(f"Retrying items:     {stats['retrying_items']}")

# pending_count is a thin wrapper for total row count
count = queue.pending_count()
print(f"Raw pending count:  {count}")

Repairing a Corrupt Queue

If the queue database is corrupt but the edge store is healthy, the safest repair is to export any salvageable payloads, rebuild the queue database, and re-enqueue valid items.


#!/bin/bash
set -euo pipefail

QUEUE_DB="/var/lib/pyvorin/sync_queue.db"
RECOVERY_DB="/var/lib/pyvorin/sync_queue_recovery.db"
EXPORT_FILE="/tmp/queue_export.jsonl"

# 1. Stop the agent
systemctl stop pyvorin-edge-agent

# 2. Try to export whatever rows are readable
sqlite3 "${QUEUE_DB}" << 'EOF' > "${EXPORT_FILE}"
.mode json
SELECT json_object(
  'id', id,
  'payload', payload,
  'priority', priority,
  'created_at', created_at,
  'ttl_seconds', ttl_seconds,
  'retry_count', retry_count,
  'next_retry_at', next_retry_at
) FROM sync_queue;
EOF

# 3. If export succeeded, rebuild the queue
mv "${QUEUE_DB}" "${QUEUE_DB}.corrupt.$(date +%s)"

# 4. Let the Edge Agent recreate the queue schema on startup
systemctl start pyvorin-edge-agent

# 5. Re-inject exported items via Python (see below)
echo "Queue rebuilt. Re-inject ${EXPORT_FILE} manually."


import json
from pathlib import Path
from pyv_edge_agent.cloud_sync.queue import CloudSyncQueue, Priority

EXPORT_FILE = Path("/tmp/queue_export.jsonl")
queue = CloudSyncQueue(db_path="/var/lib/pyvorin/sync_queue.db")

with open(EXPORT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
            queue.enqueue(
                payload=json.loads(record["payload"]),
                priority=Priority(record["priority"]),
                ttl_seconds=record["ttl_seconds"],
            )
        except Exception as exc:
            print(f"Failed to re-inject record: {exc}")

print("Queue re-injection complete.")

Factory Reset Procedure

A factory reset returns the device to a known-good state while preserving the ability to restore from backup. It is the last resort before physical re-flashing.


#!/bin/bash
set -euo pipefail

AGENT_USER="pyvorin"
DATA_DIR="/var/lib/pyvorin"
CONFIG_DIR="/etc/pyvorin"
BACKUP_DIR="/opt/pyvorin-backups/$(date +%Y%m%d_%H%M%S)"

echo "=== Pyvorin Edge Factory Reset ==="

# 1. Stop the agent
systemctl stop pyvorin-edge-agent

# 2. Create a final backup
mkdir -p "${BACKUP_DIR}"
tar czf "${BACKUP_DIR}/data.tar.gz" -C "${DATA_DIR}" .
tar czf "${BACKUP_DIR}/config.tar.gz" -C "${CONFIG_DIR}" .

# 3. Clear mutable data (preserve certificates and config)
rm -f "${DATA_DIR}"/*.db
rm -f "${DATA_DIR}"/*.db-wal
rm -f "${DATA_DIR}"/*.db-shm
rm -rf "${DATA_DIR}"/backups/*
rm -rf "${DATA_DIR}"/logs/*

# 4. Reset queue and store to empty schemas
sqlite3 "${DATA_DIR}/edge_store.db" "VACUUM;"
sqlite3 "${DATA_DIR}/sync_queue.db" "VACUUM;"

# 5. Regenerate audit chain genesis
python3 -c "
from pyv_edge_agent.privacy_firewall.audit import PrivacyAudit
audit = PrivacyAudit(db_path='${DATA_DIR}/edge_store.db')
audit.log_action('factory_reset', details={'backup_path': '${BACKUP_DIR}'})
print('Audit genesis logged.')
"

# 6. Restart
systemctl start pyvorin-edge-agent
echo "Factory reset complete. Backup at ${BACKUP_DIR}"

Automated Backup and Restore Scripts

Backups must be automatic, encrypted, and test-restored regularly. An untested backup is merely a hope.

Automated Backup Script


#!/bin/bash
set -euo pipefail

DATA_DIR="/var/lib/pyvorin"
BACKUP_DIR="/opt/pyvorin-backups"
RETENTION_DAYS=30
GPG_RECIPIENT="backups@pyvorin.com"

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
TMP_BACKUP="/tmp/pyvorin-backup-${TIMESTAMP}"
mkdir -p "${TMP_BACKUP}"

# 1. Checkpoint WAL before copying
sqlite3 "${DATA_DIR}/edge_store.db" "PRAGMA wal_checkpoint(TRUNCATE);"
sqlite3 "${DATA_DIR}/sync_queue.db" "PRAGMA wal_checkpoint(TRUNCATE);"

# 2. Copy databases
cp "${DATA_DIR}/edge_store.db" "${TMP_BACKUP}/"
cp "${DATA_DIR}/sync_queue.db" "${TMP_BACKUP}/"

# 3. Copy config
cp -r /etc/pyvorin "${TMP_BACKUP}/config"

# 4. Tar and encrypt
tar czf - -C "${TMP_BACKUP}" . | \
  gpg --encrypt --recipient "${GPG_RECIPIENT}" \
      --output "${BACKUP_DIR}/pyvorin-edge-${TIMESTAMP}.tar.gz.gpg"

# 5. Clean up temp files
rm -rf "${TMP_BACKUP}"

# 6. Rotate old backups
find "${BACKUP_DIR}" -name 'pyvorin-edge-*.tar.gz.gpg' -mtime +${RETENTION_DAYS} -delete

echo "Backup complete: ${BACKUP_DIR}/pyvorin-edge-${TIMESTAMP}.tar.gz.gpg"

Restore Script


#!/bin/bash
set -euo pipefail

BACKUP_FILE="$1"
DATA_DIR="/var/lib/pyvorin"
CONFIG_DIR="/etc/pyvorin"

if [ ! -f "${BACKUP_FILE}" ]; then
    echo "Backup file not found: ${BACKUP_FILE}"
    exit 1
fi

echo "=== Pyvorin Edge Restore ==="

# 1. Stop agent
systemctl stop pyvorin-edge-agent

# 2. Decrypt and extract
TMP_DIR=$(mktemp -d)
gpg --decrypt "${BACKUP_FILE}" | tar xzf - -C "${TMP_DIR}"

# 3. Restore databases (keep current as .restored)
for db in edge_store.db sync_queue.db; do
    if [ -f "${DATA_DIR}/${db}" ]; then
        mv "${DATA_DIR}/${db}" "${DATA_DIR}/${db}.restored"
    fi
    cp "${TMP_DIR}/${db}" "${DATA_DIR}/${db}"
    chown pyvorin:pyvorin "${DATA_DIR}/${db}"
    chmod 640 "${DATA_DIR}/${db}"
done

# 4. Restore config selectively (do not overwrite certs unless explicitly requested)
rsync -av --exclude='certs' "${TMP_DIR}/config/" "${CONFIG_DIR}/"

# 5. Cleanup
rm -rf "${TMP_DIR}"

# 6. Verify
sqlite3 "${DATA_DIR}/edge_store.db" "PRAGMA integrity_check;"
sqlite3 "${DATA_DIR}/sync_queue.db" "PRAGMA integrity_check;"

# 7. Restart
systemctl start pyvorin-edge-agent
echo "Restore complete from ${BACKUP_FILE}"

Testing Recovery Procedures

Recovery scripts that have never been executed are Schrodinger's backups: simultaneously valid and useless until observed. Test them on non-production hardware at least quarterly.

Simulated Corruption Test


#!/bin/bash
# test_corruption_recovery.sh — run on a staging Pi 5

DB_PATH="/var/lib/pyvorin/edge_store.db"

# 1. Inject some data
python3 -c "
from pyv_edge_agent.local_store.sqlite_store import SQLiteStore
from pyv_edge_agent.types import SensorReading
store = SQLiteStore(db_path='${DB_PATH}')
for i in range(1000):
    store.store_reading(SensorReading(
        sensor_name='test_temp',
        timestamp=1700000000.0 + i,
        value=22.5,
        unit='°C',
        metadata={'test': True},
    ))
print('Data injected.')
"

# 2. Corrupt the WAL by overwriting random bytes
WAL="${DB_PATH}-wal"
if [ -f "${WAL}" ]; then
    dd if=/dev/urandom of="${WAL}" bs=1 count=512 conv=notrunc
    echo "WAL corrupted."
fi

# 3. Run the recovery script
bash /opt/pyvorin/scripts/recover_sqlite.sh

# 4. Verify data count
COUNT=$(sqlite3 "${DB_PATH}" "SELECT COUNT(*) FROM readings;")
echo "Readings after recovery: ${COUNT}"

# 5. Assert count is reasonable (some loss acceptable)
if [ "${COUNT}" -lt 900 ]; then
    echo "FAIL: too much data lost"
    exit 1
fi

echo "PASS: corruption recovery test succeeded"

Automated Restore Verification


import subprocess
import tempfile
from pathlib import Path

def test_backup_restore_cycle() -> None:
    """Verify that backup and restore scripts are functional."""
    with tempfile.TemporaryDirectory() as tmp:
        # Create a mock data directory
        data_dir = Path(tmp) / "data"
        data_dir.mkdir()
        db = data_dir / "edge_store.db"

        # Create a simple SQLite db
        import sqlite3
        conn = sqlite3.connect(str(db))
        conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT);")
        conn.execute("INSERT INTO test VALUES (1, 'hello');")
        conn.commit()
        conn.close()

        # Run backup script (modified to use tmp paths)
        backup_script = Path("/opt/pyvorin/scripts/backup.sh")
        env = {"DATA_DIR": str(data_dir), "BACKUP_DIR": str(Path(tmp) / "backups")}
        subprocess.run([str(backup_script)], env=env, check=True)

        # Corrupt original
        db.write_bytes(b"CORRUPT")

        # Run restore script
        restore_script = Path("/opt/pyvorin/scripts/restore.sh")
        backup_file = next((Path(tmp) / "backups").glob("*.tar.gz.gpg"))
        subprocess.run([str(restore_script), str(backup_file)], check=True)

        # Verify
        conn = sqlite3.connect(str(db))
        row = conn.execute("SELECT value FROM test WHERE id=1;").fetchone()
        assert row and row[0] == "hello", "Restore verification failed"
        conn.close()
        print("Backup/restore cycle verified.")

Operational Best Practices

Checkpoint before backup. Always run PRAGMA wal_checkpoint(TRUNCATE) to ensure the backup contains a complete, self-contained database file.
Encrypt backups at rest. Use GPG with a recipient key stored off-device. Do not rely on filesystem encryption alone.
Test restore to spare hardware. Keep a staging Pi 5 with identical firmware for quarterly disaster-recovery drills.
Monitor queue depth. If CloudSyncQueue.pending_count() grows monotonically, the queue may be in a retry loop caused by corruption. Investigate immediately.
Keep 30 days of backups locally. Edge devices may be offline for extended periods. Local backup retention ensures you can recover even without cloud access.
Log every recovery event. Use PrivacyAudit.log_action() to record recovery operations with timestamps and technician IDs.

Summary

Disaster recovery for Pyvorin Edge is a combination of SQLite WAL management, queue rebuild procedures, factory reset playbooks, and encrypted automated backups. By checkpointing WAL files before backup, testing restore scripts on staging hardware, and monitoring pending_count() for queue anomalies, you can recover from power-loss corruption, SD card failures, and software bugs without data loss or prolonged downtime. The scripts in this article are designed to be run by a field technician with minimal training — copy them to /opt/pyvorin/scripts/, customise the paths for your fleet, and schedule quarterly drills.