Business continuity plan
Last reviewed 2026-05-19
Objective
Maintain the HiringCoachAI service for users and customers during disruption to people, technology, or third-party services. Keep customer data safe and recoverable throughout.
Business impact
| Function | Tolerable downtime | Tolerable data loss |
|---|---|---|
| Authentication (signin, signup) | 4 h | 0 |
| Customer-facing web app | 4 h | 0 |
| Resume / cover-letter generation (AI) | 8 h | 0 |
| Payment processing (Stripe) | 4 h | 0 |
| Transactional email (SendGrid) | 8 h | 0 |
| Admin tooling | 24 h | 0 |
| Analytics ingestion | 7 d | 7 d |
These values inform the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets in the disaster recovery plan.
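As a minimal sketch of how the table above could be checked mechanically, the Python below encodes each function's tolerable downtime and reports which functions have exceeded it for a given outage duration. The dictionary keys and the `breached_functions` helper are hypothetical names introduced for illustration; only the durations come from the table.

```python
from datetime import timedelta

# Tolerable downtime per function, copied from the business impact table.
# The short keys are illustrative labels, not identifiers used in production.
TOLERABLE_DOWNTIME = {
    "authentication": timedelta(hours=4),
    "web_app": timedelta(hours=4),
    "ai_generation": timedelta(hours=8),
    "payments": timedelta(hours=4),
    "transactional_email": timedelta(hours=8),
    "admin_tooling": timedelta(hours=24),
    "analytics_ingestion": timedelta(days=7),
}


def breached_functions(outage: timedelta) -> list[str]:
    """Return, in alphabetical order, the functions whose tolerable
    downtime is strictly exceeded by an outage of the given length."""
    return sorted(
        name for name, limit in TOLERABLE_DOWNTIME.items() if outage > limit
    )


if __name__ == "__main__":
    # A 5-hour outage exceeds every 4-hour limit but none of the longer ones.
    print(breached_functions(timedelta(hours=5)))
```

The comparison is strict, so an outage of exactly 4 h is treated as still within tolerance; that boundary choice is an assumption, since the plan does not say which side the limit falls on.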
Key dependencies
| Dependency | Role | Failure impact | Alternative |
|---|---|---|---|
| Vercel | Application hosting | App unreachable | Fail over to Firebase Hosting (static) + direct Cloud Functions (documented in DRP) |
| Firebase Auth | Authentication | No signin | No alternative (re-provision in another GCP project in worst case) |
| Firestore | Primary database | No reads/writes | Restore from backup to new Firestore project |
| Cloud Storage | Backup/export buckets and Cloud Functions source buckets; user file uploads are not live in production yet | Backup/export workflows affected; live user workflows continue unless they depend on restore/export operations | Alternate bucket in different region |
| Stripe | Payments | No new subscriptions; existing charges continue | Payments pause with customer notice |
| SendGrid | Transactional email | Magic-link signin & notifications fail; in-app still works | Failover to AWS SES or Resend (requires DNS change + template migration) |
| OpenAI (called both directly and via Vercel AI Gateway) | AI generation | AI features degraded | Vercel AI Gateway provides a routing layer to alternate AI providers; failover would require code/config changes plus DPA execution and sub-processor inventory update for any new provider before production traffic flows |
| Deepgram | Transcription | Voice features disabled | Queue for later or switch to Google STT |
| ElevenLabs | Speech synthesis | Voice features disabled | Google TTS fallback already wired |
| DNS (Namecheap/Cloudflare) | Resolution | Site offline | Alternate registrar record + longer TTL |
Activation triggers
The Security Officer activates the BCP when any of the following occurs:
- Customer-facing downtime exceeds 1 h
- Data loss is confirmed or suspected
- Security incident at Severity 1 or 2 (per incident response)
- Loss of access to production systems
- Natural disaster or personnel incident affecting the Security Officer
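The trigger list above could be expressed as a small decision function. This is a sketch under stated assumptions: the `IncidentState` fields and the `activation_reasons` helper are hypothetical names, and the plan itself leaves the decision with the Security Officer rather than with any automated check.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class IncidentState:
    """Hypothetical snapshot of an incident; field names are illustrative."""
    customer_downtime: timedelta = timedelta(0)
    data_loss_confirmed_or_suspected: bool = False
    security_severity: Optional[int] = None  # None means no security incident
    production_access_lost: bool = False
    security_officer_affected: bool = False  # natural disaster or personnel incident


def activation_reasons(state: IncidentState) -> List[str]:
    """Return every BCP activation trigger that the given state satisfies."""
    reasons = []
    if state.customer_downtime > timedelta(hours=1):
        reasons.append("customer-facing downtime exceeds 1 h")
    if state.data_loss_confirmed_or_suspected:
        reasons.append("data loss confirmed or suspected")
    if state.security_severity in (1, 2):
        reasons.append("security incident at Severity 1 or 2")
    if state.production_access_lost:
        reasons.append("loss of access to production systems")
    if state.security_officer_affected:
        reasons.append("incident affecting the Security Officer")
    return reasons


if __name__ == "__main__":
    state = IncidentState(customer_downtime=timedelta(minutes=90), security_severity=2)
    for reason in activation_reasons(state):
        print("Activate BCP:", reason)
```

Returning the list of matched triggers, rather than a single boolean, keeps the reason for activation available for the incident record.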
Response structure
| Role | Responsibility |
|---|---|
| Incident Commander | Security Officer. Coordinates response, declares severity, authorizes customer comms. |
| Technical Lead | Diagnoses, orchestrates recovery. |
| Communications Lead | Updates status page, sends customer comms. Often the same person as the Incident Commander in small-team mode. |
Contact info and escalation paths are maintained in an internal on-call register; the current Incident Commander is reachable at [email protected].
Communications
- Status page: hiringcoach.ai/status, updated within 30 min of a declared incident.
- Customer email: Sent via backup SES/Resend channel if SendGrid is the failing dependency.
- Regulatory: Per breach notification if personal data is affected.
Continuity of operations
- The source of truth for policies, config, and code lives in git.
- Evidence of Firestore PITR, managed daily Firestore backups with 98-day retention, and the US multi-region backup/export bucket is retained in the restricted continuity evidence set.
- Business-continuity recovery relies on documented continuity procedures, recovery materials version-controlled in git, and the Incident Commander role defined in this plan.
Testing
- Quarterly: Tabletop scenario review (15 min, Security Officer with any additional engineer available).
- Annually: Full restore drill (logged in DRP).
- Post-incident: Plan update within 30 days.
- Weekly / on demand: an automated continuity evidence bundle runs on a scheduled workflow or on demand. It exercises BCP/DRP documentation reviews, a synthetic safe restore drill, public health probes, and optional read-only Google Cloud backup checks when credentials are available; results are archived in the internal drill evidence set. This complements (does not replace) the human tabletop and full live restore/failover drills.
Last-drill log
| Date | Scope | Outcome | Record |
|---|---|---|---|
| 2026-04-24 | Targeted Firestore PITR restore to staging (single collection, DRP-linked continuity restore drill) | PASS: partial-scope RTO 47 min; RPO 30 min | Internal drill record |
| 2026-05-02 | Documentation-review tabletop (automated) | PASS: 23/23 checks | Internal drill record |
| 2026-05-14 | Automated continuity evidence bundle | PASS: 0 blocking failures across BCP and DRP documentation checks, synthetic restore, HECVAT caveat check, and read-only health probes | Internal drill record |
| 2026-05-19 | Automated continuity evidence bundle | PASS: 0 blocking failures across BCP and DRP documentation checks, synthetic restore, HECVAT caveat check, and read-only public health probes; no production writes performed | Internal drill record |
Each drill appends a row here. The automated runner verifies BCP sections, cross-reference resolution, and dependency-list parity with the sub-processors. Targeted restore drills, human tabletop sessions, and full live failover exercises are tracked here too with their own dated records.
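A dependency-list parity check of the kind described above might look like the following sketch. It is not the actual runner: the function names, the sample table, and the assumption that the sub-processor inventory is available as a plain set of vendor names are all hypothetical.

```python
def table_first_column(markdown_table: str) -> set[str]:
    """Collect first-column cells from a pipe-delimited Markdown table,
    skipping the header row and the separator row."""
    names = set()
    rows = markdown_table.strip().splitlines()[2:]
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if cells and cells[0]:
            # Drop parenthetical notes such as "OpenAI (called ...)".
            names.add(cells[0].split(" (")[0])
    return names


def parity_report(bcp_names: set[str], subprocessor_names: set[str]) -> dict[str, list[str]]:
    """List the names that appear on one side but not the other."""
    return {
        "missing_from_subprocessors": sorted(bcp_names - subprocessor_names),
        "missing_from_bcp": sorted(subprocessor_names - bcp_names),
    }


# Illustrative input only: a shortened copy of the dependency table and a
# made-up sub-processor set chosen to show a mismatch in each direction.
SAMPLE_TABLE = """
| Dependency | Role |
|---|---|
| Vercel | Application hosting |
| Stripe | Payments |
| OpenAI (called both directly and via Vercel AI Gateway) | AI generation |
"""

if __name__ == "__main__":
    report = parity_report(table_first_column(SAMPLE_TABLE), {"Vercel", "Stripe", "Deepgram"})
    print(report)
```

Both lists being empty would correspond to a passing parity check; any entry in either list points to a document that needs updating.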
Resumption
An "all clear" is declared by the Incident Commander when:
1. Service is restored.
2. Data integrity is verified.
3. Root cause is understood.
4. Short-term mitigation is in place.
Long-term remediation is tracked in the post-mortem (see incident response).
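The four all-clear conditions can be tracked as a simple checklist. The sketch below is illustrative only: the class and function names are hypothetical, and the declaration itself remains the Incident Commander's call.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AllClearChecklist:
    """One flag per all-clear condition in the Resumption section."""
    service_restored: bool = False
    data_integrity_verified: bool = False
    root_cause_understood: bool = False
    short_term_mitigation_in_place: bool = False


def outstanding_items(checklist: AllClearChecklist) -> list[str]:
    """Return the names of conditions that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def can_declare_all_clear(checklist: AllClearChecklist) -> bool:
    """All clear is possible only when no condition is outstanding."""
    return not outstanding_items(checklist)


if __name__ == "__main__":
    progress = AllClearChecklist(service_restored=True, data_integrity_verified=True)
    print(can_declare_all_clear(progress), outstanding_items(progress))
```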
Related
- disaster recovery plan: technical recovery procedures
- incident response: severity levels, comms, post-mortems
- breach notification: regulatory timelines