When the Edge Fails
Edge gateways operate in hostile environments: warehouses with unreliable power, factories with electromagnetic interference, and remote sites with no on-site IT staff. A sudden power loss can corrupt an SQLite database, a failed SD card can erase months of audit logs, and a firmware bug can deadlock the sync queue. When disaster strikes, you need a recovery playbook that is tested, documented, and executable by a technician with a laptop and a USB cable.
This article covers SQLite WAL recovery, CloudSyncQueue corruption repair, factory
reset procedures, automated backup scripts, and how to test recovery without risking production
data.
SQLite WAL Recovery
The SQLiteStore in pyv_edge_agent/local_store/sqlite_store.py runs
with PRAGMA journal_mode = WAL. Write-Ahead Logging provides better concurrency
and crash safety than the default rollback journal, but it introduces two extra files:
<db>-wal and <db>-shm. If these files are lost or truncated
during a crash, the database may appear corrupt or contain incomplete transactions.
Understanding WAL Mode
- <db>-wal — Contains the actual changes that have not yet been checkpointed into the main database file.
- <db>-shm — Shared memory file used by readers to map the WAL contents. It is disposable and rebuilt automatically.
- Checkpoint — The process of transferring committed WAL pages back into the main database file and truncating the WAL.
Manual Checkpoint and Integrity Check
# Connect to the edge store and force a checkpoint
sqlite3 /var/lib/pyvorin/edge_store.db << 'EOF'
PRAGMA journal_mode;
PRAGMA wal_checkpoint(TRUNCATE);
PRAGMA integrity_check;
EOF
# If integrity_check returns anything other than "ok", the database is damaged.
# Common causes:
# - Power loss during a write
# - SD card wear or bad sectors
# - Filesystem corruption (e.g., ext4 journal replay failure)
Recovering from WAL Corruption
If the WAL file is corrupt but the main database is intact, you can often recover by discarding the WAL and running an integrity check. Any un-checkpointed transactions will be lost, but the committed data is preserved.
#!/bin/bash
set -euo pipefail
DB_PATH="/var/lib/pyvorin/edge_store.db"
WAL_PATH="${DB_PATH}-wal"
SHM_PATH="${DB_PATH}-shm"
BACKUP_DIR="/var/lib/pyvorin/backups/recovery-$(date +%Y%m%d_%H%M%S)"
mkdir -p "${BACKUP_DIR}"
# 1. Stop the Edge Agent to prevent further writes
systemctl stop pyvorin-edge-agent || true
# 2. Copy everything before touching it
cp "${DB_PATH}" "${BACKUP_DIR}/"
cp "${WAL_PATH}" "${BACKUP_DIR}/" 2>/dev/null || true
cp "${SHM_PATH}" "${BACKUP_DIR}/" 2>/dev/null || true
# 3. Attempt a normal checkpoint
sqlite3 "${DB_PATH}" "PRAGMA wal_checkpoint(TRUNCATE);"
# 4. Run integrity check
RESULT=$(sqlite3 "${DB_PATH}" "PRAGMA integrity_check;")
if [ "${RESULT}" != "ok" ]; then
echo "Integrity check failed: ${RESULT}"
# 5. Nuclear option: remove WAL and let SQLite rebuild
rm -f "${WAL_PATH}" "${SHM_PATH}"
# 6. Re-run integrity check on the truncated database
RESULT2=$(sqlite3 "${DB_PATH}" "PRAGMA integrity_check;")
if [ "${RESULT2}" != "ok" ]; then
echo "Database is unrecoverable. Restore from backup."
exit 1
fi
echo "Recovered by truncating WAL. Uncommitted transactions lost."
else
echo "Database integrity verified."
fi
# 7. Restart the agent
systemctl start pyvorin-edge-agent
Python Recovery Helper
import sqlite3
import shutil
from datetime import datetime
from pathlib import Path
DB_PATH = Path("/var/lib/pyvorin/edge_store.db")
BACKUP_ROOT = Path("/var/lib/pyvorin/backups")
def recover_sqlite(db_path: Path) -> dict:
"""Attempt WAL recovery on an SQLite database. Returns status dict."""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_dir = BACKUP_ROOT / f"recovery-{timestamp}"
backup_dir.mkdir(parents=True, exist_ok=True)
# Backup current state
shutil.copy2(db_path, backup_dir / db_path.name)
for extra in [db_path.with_suffix(".db-wal"), db_path.with_suffix(".db-shm")]:
if extra.exists():
shutil.copy2(extra, backup_dir / extra.name)
conn = sqlite3.connect(str(db_path))
try:
conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
result = conn.execute("PRAGMA integrity_check;").fetchone()[0]
return {"status": "ok" if result == "ok" else "failed", "detail": result}
finally:
conn.close()
Queue Corruption Repair
The CloudSyncQueue in pyv_edge_agent/cloud_sync/queue.py persists
outbound messages in SQLite. Corruption here manifests as failed dequeues, missing messages, or
foreign key errors (if your schema uses them). The queue table is intentionally simple to
maximise recoverability.
Diagnosing Queue Health
from pyv_edge_agent.cloud_sync.queue import CloudSyncQueue
queue = CloudSyncQueue(db_path="/var/lib/pyvorin/sync_queue.db")
# Basic stats
stats = queue.get_stats()
print(f"Pending items: {stats['depth']}")
print(f"Oldest item: {stats['oldest_item_timestamp']}")
print(f"Total retries: {stats['total_retries']}")
print(f"Retrying items: {stats['retrying_items']}")
# pending_count is a thin wrapper for total row count
count = queue.pending_count()
print(f"Raw pending count: {count}")
Repairing a Corrupt Queue
If the queue database is corrupt but the edge store is healthy, the safest repair is to export any salvageable payloads, rebuild the queue database, and re-enqueue valid items.
#!/bin/bash
set -euo pipefail
QUEUE_DB="/var/lib/pyvorin/sync_queue.db"
RECOVERY_DB="/var/lib/pyvorin/sync_queue_recovery.db"
EXPORT_FILE="/tmp/queue_export.jsonl"
# 1. Stop the agent
systemctl stop pyvorin-edge-agent
# 2. Try to export whatever rows are readable
sqlite3 "${QUEUE_DB}" << 'EOF' > "${EXPORT_FILE}"
.mode json
SELECT json_object(
'id', id,
'payload', payload,
'priority', priority,
'created_at', created_at,
'ttl_seconds', ttl_seconds,
'retry_count', retry_count,
'next_retry_at', next_retry_at
) FROM sync_queue;
EOF
# 3. If export succeeded, rebuild the queue
mv "${QUEUE_DB}" "${QUEUE_DB}.corrupt.$(date +%s)"
# 4. Let the Edge Agent recreate the queue schema on startup
systemctl start pyvorin-edge-agent
# 5. Re-inject exported items via Python (see below)
echo "Queue rebuilt. Re-inject ${EXPORT_FILE} manually."
import json
from pathlib import Path
from pyv_edge_agent.cloud_sync.queue import CloudSyncQueue, Priority
EXPORT_FILE = Path("/tmp/queue_export.jsonl")
queue = CloudSyncQueue(db_path="/var/lib/pyvorin/sync_queue.db")
with open(EXPORT_FILE, "r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line:
continue
try:
record = json.loads(line)
queue.enqueue(
payload=json.loads(record["payload"]),
priority=Priority(record["priority"]),
ttl_seconds=record["ttl_seconds"],
)
except Exception as exc:
print(f"Failed to re-inject record: {exc}")
print("Queue re-injection complete.")
Factory Reset Procedure
A factory reset returns the device to a known-good state while preserving the ability to restore from backup. It is the last resort before physical re-flashing.
#!/bin/bash
set -euo pipefail
AGENT_USER="pyvorin"
DATA_DIR="/var/lib/pyvorin"
CONFIG_DIR="/etc/pyvorin"
BACKUP_DIR="/opt/pyvorin-backups/$(date +%Y%m%d_%H%M%S)"
echo "=== Pyvorin Edge Factory Reset ==="
# 1. Stop the agent
systemctl stop pyvorin-edge-agent
# 2. Create a final backup
mkdir -p "${BACKUP_DIR}"
tar czf "${BACKUP_DIR}/data.tar.gz" -C "${DATA_DIR}" .
tar czf "${BACKUP_DIR}/config.tar.gz" -C "${CONFIG_DIR}" .
# 3. Clear mutable data (preserve certificates and config)
rm -f "${DATA_DIR}"/*.db
rm -f "${DATA_DIR}"/*.db-wal
rm -f "${DATA_DIR}"/*.db-shm
rm -rf "${DATA_DIR}"/backups/*
rm -rf "${DATA_DIR}"/logs/*
# 4. Reset queue and store to empty schemas
sqlite3 "${DATA_DIR}/edge_store.db" "VACUUM;"
sqlite3 "${DATA_DIR}/sync_queue.db" "VACUUM;"
# 5. Regenerate audit chain genesis
python3 -c "
from pyv_edge_agent.privacy_firewall.audit import PrivacyAudit
audit = PrivacyAudit(db_path='${DATA_DIR}/edge_store.db')
audit.log_action('factory_reset', details={'backup_path': '${BACKUP_DIR}'})
print('Audit genesis logged.')
"
# 6. Restart
systemctl start pyvorin-edge-agent
echo "Factory reset complete. Backup at ${BACKUP_DIR}"
Automated Backup and Restore Scripts
Backups must be automatic, encrypted, and test-restored regularly. An untested backup is merely a hope.
Automated Backup Script
#!/bin/bash
set -euo pipefail
DATA_DIR="/var/lib/pyvorin"
BACKUP_DIR="/opt/pyvorin-backups"
RETENTION_DAYS=30
GPG_RECIPIENT="backups@pyvorin.com"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
TMP_BACKUP="/tmp/pyvorin-backup-${TIMESTAMP}"
mkdir -p "${TMP_BACKUP}"
# 1. Checkpoint WAL before copying
sqlite3 "${DATA_DIR}/edge_store.db" "PRAGMA wal_checkpoint(TRUNCATE);"
sqlite3 "${DATA_DIR}/sync_queue.db" "PRAGMA wal_checkpoint(TRUNCATE);"
# 2. Copy databases
cp "${DATA_DIR}/edge_store.db" "${TMP_BACKUP}/"
cp "${DATA_DIR}/sync_queue.db" "${TMP_BACKUP}/"
# 3. Copy config
cp -r /etc/pyvorin "${TMP_BACKUP}/config"
# 4. Tar and encrypt
tar czf - -C "${TMP_BACKUP}" . | \
gpg --encrypt --recipient "${GPG_RECIPIENT}" \
--output "${BACKUP_DIR}/pyvorin-edge-${TIMESTAMP}.tar.gz.gpg"
# 5. Clean up temp files
rm -rf "${TMP_BACKUP}"
# 6. Rotate old backups
find "${BACKUP_DIR}" -name 'pyvorin-edge-*.tar.gz.gpg' -mtime +${RETENTION_DAYS} -delete
echo "Backup complete: ${BACKUP_DIR}/pyvorin-edge-${TIMESTAMP}.tar.gz.gpg"
Restore Script
#!/bin/bash
set -euo pipefail
BACKUP_FILE="$1"
DATA_DIR="/var/lib/pyvorin"
CONFIG_DIR="/etc/pyvorin"
if [ ! -f "${BACKUP_FILE}" ]; then
echo "Backup file not found: ${BACKUP_FILE}"
exit 1
fi
echo "=== Pyvorin Edge Restore ==="
# 1. Stop agent
systemctl stop pyvorin-edge-agent
# 2. Decrypt and extract
TMP_DIR=$(mktemp -d)
gpg --decrypt "${BACKUP_FILE}" | tar xzf - -C "${TMP_DIR}"
# 3. Restore databases (keep current as .restored)
for db in edge_store.db sync_queue.db; do
if [ -f "${DATA_DIR}/${db}" ]; then
mv "${DATA_DIR}/${db}" "${DATA_DIR}/${db}.restored"
fi
cp "${TMP_DIR}/${db}" "${DATA_DIR}/${db}"
chown pyvorin:pyvorin "${DATA_DIR}/${db}"
chmod 640 "${DATA_DIR}/${db}"
done
# 4. Restore config selectively (do not overwrite certs unless explicitly requested)
rsync -av --exclude='certs' "${TMP_DIR}/config/" "${CONFIG_DIR}/"
# 5. Cleanup
rm -rf "${TMP_DIR}"
# 6. Verify
sqlite3 "${DATA_DIR}/edge_store.db" "PRAGMA integrity_check;"
sqlite3 "${DATA_DIR}/sync_queue.db" "PRAGMA integrity_check;"
# 7. Restart
systemctl start pyvorin-edge-agent
echo "Restore complete from ${BACKUP_FILE}"
Testing Recovery Procedures
Recovery scripts that have never been executed are Schrodinger's backups: simultaneously valid and useless until observed. Test them on non-production hardware at least quarterly.
Simulated Corruption Test
#!/bin/bash
# test_corruption_recovery.sh — run on a staging Pi 5
DB_PATH="/var/lib/pyvorin/edge_store.db"
# 1. Inject some data
python3 -c "
from pyv_edge_agent.local_store.sqlite_store import SQLiteStore
from pyv_edge_agent.types import SensorReading
store = SQLiteStore(db_path='${DB_PATH}')
for i in range(1000):
store.store_reading(SensorReading(
sensor_name='test_temp',
timestamp=1700000000.0 + i,
value=22.5,
unit='°C',
metadata={'test': True},
))
print('Data injected.')
"
# 2. Corrupt the WAL by overwriting random bytes
WAL="${DB_PATH}-wal"
if [ -f "${WAL}" ]; then
dd if=/dev/urandom of="${WAL}" bs=1 count=512 conv=notrunc
echo "WAL corrupted."
fi
# 3. Run the recovery script
bash /opt/pyvorin/scripts/recover_sqlite.sh
# 4. Verify data count
COUNT=$(sqlite3 "${DB_PATH}" "SELECT COUNT(*) FROM readings;")
echo "Readings after recovery: ${COUNT}"
# 5. Assert count is reasonable (some loss acceptable)
if [ "${COUNT}" -lt 900 ]; then
echo "FAIL: too much data lost"
exit 1
fi
echo "PASS: corruption recovery test succeeded"
Automated Restore Verification
import subprocess
import tempfile
from pathlib import Path
def test_backup_restore_cycle() -> None:
"""Verify that backup and restore scripts are functional."""
with tempfile.TemporaryDirectory() as tmp:
# Create a mock data directory
data_dir = Path(tmp) / "data"
data_dir.mkdir()
db = data_dir / "edge_store.db"
# Create a simple SQLite db
import sqlite3
conn = sqlite3.connect(str(db))
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT);")
conn.execute("INSERT INTO test VALUES (1, 'hello');")
conn.commit()
conn.close()
# Run backup script (modified to use tmp paths)
backup_script = Path("/opt/pyvorin/scripts/backup.sh")
env = {"DATA_DIR": str(data_dir), "BACKUP_DIR": str(Path(tmp) / "backups")}
subprocess.run([str(backup_script)], env=env, check=True)
# Corrupt original
db.write_bytes(b"CORRUPT")
# Run restore script
restore_script = Path("/opt/pyvorin/scripts/restore.sh")
backup_file = next((Path(tmp) / "backups").glob("*.tar.gz.gpg"))
subprocess.run([str(restore_script), str(backup_file)], check=True)
# Verify
conn = sqlite3.connect(str(db))
row = conn.execute("SELECT value FROM test WHERE id=1;").fetchone()
assert row and row[0] == "hello", "Restore verification failed"
conn.close()
print("Backup/restore cycle verified.")
Operational Best Practices
- Checkpoint before backup. Always run
PRAGMA wal_checkpoint(TRUNCATE)to ensure the backup contains a complete, self-contained database file. - Encrypt backups at rest. Use GPG with a recipient key stored off-device. Do not rely on filesystem encryption alone.
- Test restore to spare hardware. Keep a staging Pi 5 with identical firmware for quarterly disaster-recovery drills.
- Monitor queue depth. If
CloudSyncQueue.pending_count()grows monotonically, the queue may be in a retry loop caused by corruption. Investigate immediately. - Keep 30 days of backups locally. Edge devices may be offline for extended periods. Local backup retention ensures you can recover even without cloud access.
- Log every recovery event. Use
PrivacyAudit.log_action()to record recovery operations with timestamps and technician IDs.
Summary
Disaster recovery for Pyvorin Edge is a combination of SQLite WAL management, queue rebuild
procedures, factory reset playbooks, and encrypted automated backups. By checkpointing WAL files
before backup, testing restore scripts on staging hardware, and monitoring
pending_count() for queue anomalies, you can recover from power-loss corruption,
SD card failures, and software bugs without data loss or prolonged downtime. The scripts in this
article are designed to be run by a field technician with minimal training — copy them to
/opt/pyvorin/scripts/, customise the paths for your fleet, and schedule quarterly
drills.