1. Purpose

To ensure that all machine learning models developed, deployed, or procured by Reducto, Inc. maintain high levels of availability, reliability, and recoverability under adverse conditions, safeguarding business continuity, data integrity, and stakeholder trust.

2. Scope

This policy applies to every predictive, prescriptive, or generative model in production or pre-production, including:
  • In-house developed models
  • Third-party and open-source models
  • Cloud-hosted and on-premises deployments

3. Definitions

  • Model Resilience: Ability of a model and its supporting infrastructure to withstand, adapt to, and rapidly recover from failures, degradations, or attacks.
  • Failure Modes: Potential breakdown points, including hardware outages, software bugs, data corruption, adversarial inputs, and model drift.
  • Recovery Point Objective (RPO): Maximum tolerable data loss, measured in time.
  • Recovery Time Objective (RTO): Maximum tolerable downtime, measured in time.

4. Governance Structure

  • Model Resilience Committee (MRC): Cross-functional team (Data Science, IT, Security, Compliance) chaired by the Head of AI Governance.
  • Model Owners: Responsible for day-to-day oversight of resilience controls.
  • Incident Response Team (IRT): Activated upon resilience incidents.

5. Resilience Requirements

5.1 Availability Targets

  • Critical models: ≥ 99.9% uptime
  • Non-critical models: ≥ 99.0% uptime
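
For context, these targets correspond to roughly 43 minutes (99.9%) and 7.2 hours (99.0%) of permitted downtime per 30-day month. The sketch below is a minimal illustration of that arithmetic; the 30-day period and the helper name are assumptions of this example, not policy terms.

  # Illustrative only: convert an uptime target into an approximate downtime
  # budget. The 30-day period and helper name are assumptions of this sketch.
  def downtime_budget_minutes(uptime_target: float, period_days: int = 30) -> float:
      """Maximum minutes of downtime per period at the given uptime target."""
      total_minutes = period_days * 24 * 60
      return total_minutes * (1.0 - uptime_target)

  print(downtime_budget_minutes(0.999))  # critical: ~43.2 minutes per 30 days
  print(downtime_budget_minutes(0.990))  # non-critical: ~432 minutes per 30 days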

5.2 RPO / RTO Standards

Critical models:
  • RPO ≤ 15 minutes
  • RTO ≤ 1 hour
Non-critical models:
  • RPO ≤ 1 hour
  • RTO ≤ 4 hours
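
As a minimal sketch of how the RPO side of these standards might be checked automatically, the example below compares the age of the latest backup snapshot against the applicable RPO; the tier names mirror this section, while the source of the snapshot timestamp is left as a placeholder.

  # Illustrative RPO check: flag a model whose newest backup snapshot is older
  # than its RPO. How the snapshot timestamp is obtained is deployment-specific.
  from datetime import datetime, timedelta, timezone

  RPO_BY_TIER = {
      "critical": timedelta(minutes=15),
      "non_critical": timedelta(hours=1),
  }

  def rpo_breached(tier: str, latest_snapshot_time: datetime) -> bool:
      """Return True if the newest snapshot is older than the tier's RPO."""
      age = datetime.now(timezone.utc) - latest_snapshot_time
      return age > RPO_BY_TIER[tier]

  # Example: a critical model whose last snapshot is 20 minutes old breaches RPO.
  last_snapshot = datetime.now(timezone.utc) - timedelta(minutes=20)
  print(rpo_breached("critical", last_snapshot))  # True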

5.3 Redundancy & Failover

  • Dual-region or multi-zone deployments for critical services
  • Automated load-balancing and failover mechanisms
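
A minimal sketch of the failover pattern implied by these controls is shown below: route traffic to the primary endpoint while it passes its health check, otherwise fall back to the secondary. The endpoint URLs and health-check path are hypothetical, not part of any specific deployment.

  # Illustrative failover selection: prefer the primary region, fall back to the
  # secondary when the primary's health check fails. URLs are hypothetical.
  import urllib.request
  from typing import Optional

  ENDPOINTS = [
      "https://primary.example.internal/healthz",
      "https://secondary.example.internal/healthz",
  ]

  def pick_healthy_endpoint(endpoints=ENDPOINTS, timeout_s: float = 2.0) -> Optional[str]:
      """Return the first endpoint whose health check answers HTTP 200, else None."""
      for url in endpoints:
          try:
              with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                  if resp.status == 200:
                      return url
          except OSError:
              continue  # network errors count as a failed health check
      return None  # no healthy endpoint: escalate per Section 7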

5.4 Data Integrity Controls

  • Immutable storage for training datasets
  • Checksums and end-to-end encryption (in transit & at rest)
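
The checksum control can be illustrated with a short sketch such as the one below, which streams a training-data file and verifies its SHA-256 digest; the choice of SHA-256 and the file-based layout are assumptions of this example.

  # Illustrative integrity check: compute and verify a SHA-256 digest for a
  # training dataset file, streaming in chunks so large files need not fit in memory.
  import hashlib
  from pathlib import Path

  def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
      digest = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              digest.update(chunk)
      return digest.hexdigest()

  def verify_dataset(path: Path, expected_checksum: str) -> bool:
      """Return True only if the recorded checksum still matches the file."""
      return sha256_of_file(path) == expected_checksum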

6. Testing & Validation

6.1 Pre-Deployment Stress Testing

  • Simulate hardware failure, network latency spikes, corrupted inputs
  • Validate failover triggers and data consistency post-failover
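
One way to express the corrupted-input part of such a test is sketched below: deliberately malformed inputs are fed to a serving wrapper, which is expected to reject them with a controlled error rather than crash. The predict wrapper and its validation rule are stand-ins, not the actual serving interface.

  # Illustrative corrupted-input test: the serving wrapper must reject bad inputs
  # with a controlled error instead of crashing. `predict` is a stand-in here.
  import math

  def predict(features):
      """Hypothetical serving wrapper with basic input validation."""
      if not features or any(not math.isfinite(x) for x in features):
          raise ValueError("rejected: empty or non-finite feature vector")
      return sum(features) / len(features)  # placeholder for the real model call

  def test_corrupted_inputs_are_rejected():
      for bad in ([float("nan"), 1.0], [float("inf"), 0.5], []):
          try:
              predict(bad)
          except ValueError:
              continue  # controlled rejection is the desired behaviour
          raise AssertionError(f"corrupted input {bad!r} was silently accepted")

  test_corrupted_inputs_are_rejected()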

6.2 Chaos Engineering

  • Quarterly “chaos drills” introducing controlled faults (e.g., instance termination, API throttling, model rollback)
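
Such a drill is typically driven from a small fault catalogue; the sketch below shows one minimal way to select and record a controlled fault, with the fault handlers left as placeholders for the platform's real tooling.

  # Illustrative chaos-drill driver: inject one controlled fault per drill and
  # record which one ran. The handlers below only log; real drills would call
  # the platform's own tooling (instance termination, API throttling, rollback).
  import logging
  import random

  logging.basicConfig(level=logging.INFO)

  FAULT_CATALOGUE = {
      "terminate_instance": lambda: logging.info("would terminate one serving instance"),
      "throttle_api": lambda: logging.info("would throttle the inference API"),
      "rollback_model": lambda: logging.info("would roll back to the prior model version"),
  }

  def run_quarterly_drill(seed=None):
      """Inject one randomly chosen fault and return its name for the drill record."""
      fault_name = random.Random(seed).choice(sorted(FAULT_CATALOGUE))
      FAULT_CATALOGUE[fault_name]()
      return fault_name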

6.3 Drift & Degradation Monitoring

  • Real-time performance monitoring with automated alerts
  • Scheduled retraining triggers based on drift detection
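
Drift detection can be implemented in many ways; the sketch below uses a two-sample Kolmogorov–Smirnov test on a single feature as one minimal example. The 0.01 significance threshold and the trigger_retraining hook are assumptions of this sketch, not policy parameters.

  # Illustrative drift check on one feature: compare recent production values
  # against the training distribution with a two-sample KS test. The threshold
  # and the trigger_retraining() hook are placeholders for real alerting.
  from scipy.stats import ks_2samp

  def trigger_retraining(reason: str) -> None:
      print(f"ALERT: scheduling retraining ({reason})")  # placeholder for real alerting

  def check_feature_drift(training_values, production_values, p_threshold=0.01):
      result = ks_2samp(training_values, production_values)
      if result.pvalue < p_threshold:
          trigger_retraining(f"KS statistic {result.statistic:.3f}, p-value {result.pvalue:.4f}")
      return result.pvalue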

7. Incident Management

7.1 Detection & Alerting

  • Integrated with centralized monitoring (e.g., Prometheus, Datadog) for 24/7 coverage
  • Severity levels (P1–P4) with corresponding SLAs for response & resolution
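
Keeping the severity-to-SLA mapping in a machine-readable form helps monitoring and ticketing stay consistent; the sketch below shows one such structure. The time values are placeholders only; the authoritative response and resolution SLAs are set by the MRC.

  # Illustrative severity/SLA table. The time values are placeholders; the
  # authoritative response and resolution SLAs are defined by the MRC.
  from datetime import timedelta

  SEVERITY_SLAS = {
      "P1": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4)},
      "P2": {"response": timedelta(hours=1), "resolution": timedelta(hours=8)},
      "P3": {"response": timedelta(hours=4), "resolution": timedelta(days=2)},
      "P4": {"response": timedelta(days=1), "resolution": timedelta(days=5)},
  }

  def sla_for(severity: str, phase: str) -> timedelta:
      """Look up the response or resolution SLA for a given severity level."""
      return SEVERITY_SLAS[severity][phase]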

7.2 Response Playbooks

  • Documented steps for each failure mode (e.g., data pipeline break, artifact corruption)
  • Escalation path: Model Owner → IRT → MRC

7.3 Post-Incident Review

  • Root-cause analysis within 48 hours of resolution
  • Track corrective actions and assign ownership

8. Documentation & Change Control

8.1 Resilience Design Records

  • Architecture diagrams showing redundancy layers, failover paths, dependencies
  • Configuration details (backup schedules, snapshot retention)
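
Such design records can be kept in a declarative, machine-readable form; the sketch below shows one possible structure for a single entry, with every value shown as a placeholder rather than an actual configuration.

  # Illustrative resilience design record entry. All values are placeholders;
  # a real record reflects the deployment's actual topology and schedules.
  RESILIENCE_DESIGN_RECORD = {
      "model_id": "example-model",              # hypothetical identifier
      "criticality": "critical",                # maps to the Section 5 targets
      "deployment": {"regions": ["region-a", "region-b"], "failover": "automated"},
      "backups": {"schedule": "every 15 minutes", "snapshot_retention_days": 30},
      "dependencies": ["feature-store", "inference-gateway"],  # hypothetical names
  }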

8.2 Change Management

  • All updates must pass a resilience impact assessment
  • Version control for code, configuration, and training data

9. Continuous Improvement

9.1 Periodic Audits

  • Annual third-party audit of resilience controls and SLA performance

9.2 Policy Review Cycle

  • MRC to review/update this policy at least every 12 months or upon major changes

9.3 Training & Awareness

  • Semi-annual resilience best-practices workshops for Data Science and DevOps teams