Disaster recovery plan
Last reviewed 2026-05-19
Recovery objectives
| Scope | RTO (Recovery Time) | RPO (Recovery Point) |
|---|---|---|
| Application (Vercel) | 1 h | 0: stateless, git-sourced |
| Firestore data | 4 h target | 24 h target from managed daily backups; finer-grained recovery points available within the 7-day PITR window |
| Cloud Storage (backup/export buckets) | 4 h target | 24 h target for supported export workflows; bucket soft-delete/versioning evidence captured for the primary backup/export bucket |
| Identity (Firebase Auth) | 4 h | 0: rebuild via admin SDK export if needed |
| DNS | 1 h | 0 |
| Email (SendGrid) | 8 h | 0 (queues retry) |
| End-to-end user-facing restore | 4 h | 24 h |
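
The targets above can be encoded so that drill results are checked against them mechanically. The following is a minimal sketch, not an artifact from the repository: the scope keys and the `meets_objectives` helper are hypothetical names chosen for illustration, and the durations are copied from the table.

```python
from datetime import timedelta

# (RTO, RPO) per scope, copied from the recovery-objectives table.
# The dictionary keys are illustrative names, not identifiers from the codebase.
OBJECTIVES = {
    "application": (timedelta(hours=1), timedelta(0)),
    "firestore": (timedelta(hours=4), timedelta(hours=24)),
    "cloud_storage": (timedelta(hours=4), timedelta(hours=24)),
    "identity": (timedelta(hours=4), timedelta(0)),
    "dns": (timedelta(hours=1), timedelta(0)),
    "email": (timedelta(hours=8), timedelta(0)),
    "end_to_end": (timedelta(hours=4), timedelta(hours=24)),
}


def meets_objectives(scope, recovery_time, data_loss):
    """Return True when a measured restore stays within both RTO and RPO."""
    rto, rpo = OBJECTIVES[scope]
    return recovery_time <= rto and data_loss <= rpo


# Example: a restore that took 47 minutes and lost 30 minutes of data,
# checked against the Firestore targets.
firestore_ok = meets_objectives(
    "firestore", timedelta(minutes=47), timedelta(minutes=30)
)
```

A check like this could back the "RTO achieved" and "RPO achieved" columns of a drill record, provided the measured durations are recorded per scope.
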
Backup strategy
Current posture verified on 2026-05-07:
- Google Cloud provides managed platform durability and encryption for Firestore and Cloud Storage.
- Firestore is in a US (North America) multi-region Google Cloud location, with PITR enabled for a 7-day window.
- Firestore managed daily backups are configured with 98-day retention, and ready backup snapshots are present.
- A separate US multi-region backup/export bucket uses Nearline storage, object versioning, public access prevention, uniform bucket-level access, 90-day soft delete, and an unlocked 90-day retention policy.
- A manual local JSON export of Firestore can be produced with a retention-managed export script.
- Live Google Cloud backup-configuration evidence is retained in the internal audit evidence set; restore procedures reference the verified live settings rather than checking backup configuration into the repository.
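
Which recovery source applies depends on how far back the desired recovery point lies. The sketch below shows that decision using the 7-day PITR window and 98-day backup retention stated above. The function name and the returned labels are hypothetical. Note that managed backups are daily snapshots, so inside the backup tier only snapshot times can be restored, not arbitrary points.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=7)        # from the posture list above
BACKUP_RETENTION = timedelta(days=98)  # managed daily backup retention


def recovery_source(recovery_point, now):
    """Pick the tier that can serve a recovery point (illustrative labels)."""
    age = now - recovery_point
    if age < timedelta(0):
        raise ValueError("recovery point lies in the future")
    if age <= PITR_WINDOW:
        return "pitr"
    if age <= BACKUP_RETENTION:
        return "managed-backup"
    # Older than every managed tier: only a retained manual export could help.
    return "manual-export-or-unrecoverable"


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
three_days_ago = recovery_source(now - timedelta(days=3), now)
```
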
Recovery scenarios
Scenario A: Firestore data corruption or accidental deletion
Trigger: an application error, security-rule misconfiguration, or human error that causes data loss in a subset of the data.
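1. Identify the affected paths and snapshot the current state.
2. For recovery points inside the 7-day PITR window, export the affected collections as of a past timestamp with Firestore PITR (gcloud firestore export gs://<bucket> --database='(default)' --collection-ids=... --snapshot-time=<timestamp>, where <bucket> is a placeholder for the export destination and the timestamp falls on a whole minute). For older recovery points, use the latest managed backup or a confirmed export.
3. Restore to a staging project, diff against production, and merge the recovered data back into production.
4. Verify with targeted queries and log the evidence.
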
Target RTO: 1-2 h for targeted restore.
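
The PITR export in step 2 is easy to get wrong under pressure, since the read timestamp has to be in UTC and on a whole minute. The gcloud CLI documents that timestamp as the `--snapshot-time` flag of `gcloud firestore export`. Below is a minimal sketch of a helper that assembles the command as an argument list. The helper name, bucket URI, and collection names are hypothetical, and rounding down to the minute is a choice made here rather than a documented procedure.

```python
from datetime import datetime, timezone


def build_pitr_export_command(bucket_uri, collection_ids, snapshot_time,
                              database="(default)"):
    """Build argv for a Firestore PITR export (does not execute anything)."""
    if snapshot_time.tzinfo is None:
        raise ValueError("snapshot_time must be timezone-aware")
    # Convert to UTC and truncate to the whole minute.
    utc = snapshot_time.astimezone(timezone.utc).replace(second=0, microsecond=0)
    return [
        "gcloud", "firestore", "export", bucket_uri,
        f"--database={database}",
        f"--collection-ids={','.join(collection_ids)}",
        f"--snapshot-time={utc.strftime('%Y-%m-%dT%H:%M:%SZ')}",
    ]


# Hypothetical bucket and collection names, for illustration only.
cmd = build_pitr_export_command(
    "gs://example-export-bucket/pitr-restore",
    ["users", "orders"],
    datetime(2026, 4, 24, 10, 15, 42, tzinfo=timezone.utc),
)
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with the parentheses in `(default)`.
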
Scenario B: Full Firestore loss / region outage
1. Provision a new Firestore instance (in an alternate region if the original is unavailable).
2. Import the most recent verified backup or manual export.
3. Apply the version-controlled Firestore Security Rules and Firestore indexes from git.
4. Point the application at the new project (update the relevant Firebase configuration in the hosting platform).
5. Verify sign-in and core reads/writes.
6. Communicate the data-loss window (up to 24 hours) to any affected users.

Target RTO: 4 h.
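
For the user communication in the last step, the actual data-loss window is the gap between the restored backup and the start of the incident. The sketch below computes that gap and drafts a notice. The helper names and the wording of the notice are assumptions, and the timestamps are invented for illustration, not taken from a real incident.

```python
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(hours=24)  # Firestore RPO target from this plan


def data_loss_window(last_good_backup, incident_start):
    """Time span whose writes are not contained in the restored backup."""
    if incident_start < last_good_backup:
        raise ValueError("backup is newer than the incident start")
    return incident_start - last_good_backup


def user_notice(last_good_backup, incident_start):
    """Draft a plain-language notice for affected users (wording is illustrative)."""
    fmt = "%Y-%m-%d %H:%M UTC"
    start = last_good_backup.astimezone(timezone.utc).strftime(fmt)
    end = incident_start.astimezone(timezone.utc).strftime(fmt)
    return f"Changes made between {start} and {end} may not have been recovered."


# Invented example timestamps.
backup_at = datetime(2026, 5, 6, 2, 0, tzinfo=timezone.utc)
incident_at = datetime(2026, 5, 6, 21, 30, tzinfo=timezone.utc)
window = data_loss_window(backup_at, incident_at)
within_rpo = window <= RPO_TARGET
notice = user_notice(backup_at, incident_at)
```
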
Scenario C: Vercel hosting outage
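1. Detect the outage via external monitoring and the Vercel status page.
2. If the outage is prolonged, build a static bundle (next build && next export; on Next.js 14 and later, where next export has been removed, run next build with output: 'export' set in next.config.js) and deploy it to Firebase Hosting as a read-only site.
3. Re-enable writes once Vercel is restored.
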
Target RTO: 2 h for read-only; depends on Vercel for writes.
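
The plan does not define when an outage counts as "prolonged". One possible rule, sketched here purely as an assumption, is to fail over to the read-only bundle after a fixed number of consecutive failed external probes. The threshold of three and the function name are hypothetical.

```python
def should_fail_over(probe_results, threshold=3):
    """Decide on read-only failover from recent health probes.

    probe_results is ordered oldest-first; True means the probe succeeded.
    Returns True only when the latest `threshold` probes all failed.
    """
    recent = list(probe_results)[-threshold:]
    return len(recent) == threshold and not any(recent)


# One success followed by three consecutive failures triggers failover.
decision = should_fail_over([True, False, False, False])
```

Requiring consecutive failures keeps a single flaky probe from triggering a failover; the right threshold depends on the probe interval and on the 2 h read-only target.
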
Scenario D: Compromised credentials / production breach
See incident response for the full procedure. At minimum:
1. Revoke all sessions (sessions collection purge).
2. Rotate all secrets listed in the internal secrets rotation log (available on request via [email protected]).
3. Force re-auth for all users.
4. Restore from the last known-clean backup if data tampering is suspected.

Scenario E: Primary incident-owner unavailability
- Recovery relies on documented recovery procedures, recovery materials version-controlled in git, and the offline credential-recovery process.
Scenario F: Sub-processor outage
- OpenAI → AI features degraded; failover to an alternate AI provider requires code/config changes plus DPA execution and sub-processor inventory update before production traffic flows.
- SendGrid → Manual DNS + template switch to AWS SES or Resend (pre-staged).
- Stripe → No automatic failover; pause signups with banner; existing subs unaffected.
- Deepgram / ElevenLabs → Google STT/TTS fallback wired at application layer.
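
An application-layer fallback like the speech one above is commonly built as an ordered provider chain. The sketch below illustrates that pattern only; it is not the application's actual implementation. The provider callables are stand-ins defined here and do not represent the real Deepgram or Google SDK calls.

```python
def call_with_fallback(payload, providers):
    """Try each (name, callable) in order; return the first success.

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(payload)
        except Exception as exc:  # any provider error moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration only.
def primary_stt(audio):
    raise TimeoutError("primary provider timed out")


def fallback_stt(audio):
    return f"transcript of {len(audio)} bytes"


used, result = call_with_fallback(
    b"\x00\x01\x02",
    [("primary", primary_stt), ("fallback", fallback_stt)],
)
```

Returning the provider name alongside the result makes it possible to log how often the fallback path is exercised.
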
Drill schedule
| Frequency | Scope |
|---|---|
| Quarterly | Tabletop scenario review |
| Quarterly | Synthetic safe restore drill using npm run backup:restore-drill:safe |
| Weekly / on demand | Automated continuity evidence bundle (BCP/DRP documentation checks, synthetic restore verification, public health probes, and optional read-only Google Cloud backup checks) |
| Semi-annually | Partial restore drill (one collection to staging) |
| Annually | Full restore drill, RTO measured |
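
The cadence above can be turned into due dates so that an overdue drill is flagged automatically. This is a rough sketch under stated assumptions: the day counts approximate each frequency (a quarter as 91 days, half a year as 182), and the function names are hypothetical.

```python
from datetime import date, timedelta

# Approximate interval per frequency label used in the schedule table.
INTERVAL_DAYS = {
    "weekly": 7,
    "quarterly": 91,
    "semi-annually": 182,
    "annually": 365,
}


def next_due(last_run, frequency):
    """Date by which the next drill of this frequency should have run."""
    return last_run + timedelta(days=INTERVAL_DAYS[frequency])


def is_overdue(last_run, frequency, today):
    """True when the next drill date has already passed."""
    return today > next_due(last_run, frequency)


# A semi-annual drill last run on 2026-04-24.
partial_restore_due = next_due(date(2026, 4, 24), "semi-annually")
```
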
Current evidence and gaps
As of 2026-05-07, live GCP evidence confirms Firestore PITR, managed daily Firestore backups with 98-day retention, and a separate US multi-region backup/export bucket with versioning, 90-day soft delete, and a 90-day retention policy. The evidence record is retained in the internal audit evidence set.
As of 2026-05-19, an automated continuity evidence bundle runs the BCP/DRP documentation-review checks, a safe synthetic restore drill, HECVAT caveat checks, and read-only health probes. The bundle complements (does not replace) the human tabletop and full live restore/failover drills.
Last-drill log
| Date | Scenario | Duration | RTO achieved | RPO achieved | Findings | Record |
|---|---|---|---|---|---|---|
| 2026-04-24 | Targeted Firestore PITR: single subcollection to staging | 47 min | ≤ 1 h ✅ | 30 min ✅ | Doc fix: prefer gcloud storage over deprecated gsutil. | Restricted evidence record |
| 2026-05-02 | Documentation-review tabletop (automated) | <1 min | n/a (not a live restore) | n/a | 27/27 checks PASS. Drill records, sections, retention claims, sub-processor list, and repo artifacts all internally consistent. Live restore + RTO measurement deferred to next semi-annual cycle. | Internal drill record |
| 2026-05-07 | Synthetic safe local restore drill | <1 min | n/a (no production restore) | n/a | PASS. Synthetic Firestore-shaped export restored into isolated local scratch target and checksum matched. No production data read or written. | Restricted evidence record |
| 2026-05-14 | Automated continuity evidence bundle | <1 min | n/a (not a live restore) | n/a | PASS: 0 blocking failures. Bundle covered BCP/DRP documentation checks, synthetic restore verification, HECVAT caveat check, and read-only public health probes. | Restricted evidence record |
| 2026-05-19 | Automated continuity evidence bundle | <1 min | n/a (not a live restore) | n/a | PASS: 0 blocking failures. Bundle covered BCP/DRP documentation checks, synthetic restore verification, HECVAT caveat check, and read-only public health probes; no production writes performed. | Restricted evidence record |
Each drill appends a row here. Detailed drill records are retained in the restricted disaster-recovery evidence set.
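
Appending a row by hand invites column drift. The sketch below formats a row in the seven-column layout of the log. The helper name is hypothetical, and the example values are placeholders rather than a real drill record.

```python
from datetime import date


def format_drill_row(drill_date, scenario, duration, rto, rpo, findings, record):
    """Render one drill as a pipe-table row matching the last-drill log columns."""
    cells = [drill_date.isoformat(), scenario, duration, rto, rpo, findings, record]
    # Escape pipes so free-text findings cannot break the table layout.
    escaped = [cell.replace("|", "\\|").strip() for cell in cells]
    return "| " + " | ".join(escaped) + " |"


# Placeholder values for illustration.
row = format_drill_row(
    date(2026, 5, 19),
    "Example scenario",
    "<1 min",
    "n/a",
    "n/a",
    "PASS",
    "Restricted evidence record",
)
```
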