Reliability Dashboard

Observability • Chaos Engineering • SLOs

A transparency log of the reliability engineering practices behind my portfolio: resilience is validated continuously with Chaos Mesh, and the results are verified against SLOs via Prometheus.

Service Level Objectives (SLOs)

| Objective | Target | Actual (30d) | Status | Burn Rate |
|---|---|---|---|---|
| Availability (Uptime) | 99.9% | 99.99% | ✅ | 0.1x (Safe) |
| Latency (P95) | < 200ms | 45ms | ✅ | 0x |
| Error Rate (5xx) | < 0.1% | 0.00% | ✅ | 0x |
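The burn-rate column compares how fast the error budget is being consumed against the allowed pace. As a minimal sketch (the helper function is hypothetical, not part of the dashboard):

```python
def burn_rate(slo_target: float, observed_availability: float) -> float:
    """Error-budget burn rate over a window.

    1.0x means the budget is being spent exactly at the allowed pace;
    below 1.0x the SLO is safe, above 1.0x the budget runs out early.
    """
    error_budget = 1.0 - slo_target              # e.g. 99.9% target -> 0.1% budget
    observed_errors = 1.0 - observed_availability
    return observed_errors / error_budget

# The availability row above: 99.9% target, 99.99% observed over 30 days.
print(round(burn_rate(0.999, 0.9999), 2))  # 0.1
```

At 0.1x, only a tenth of the monthly error budget would be consumed if the current error rate held for the full window.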

Chaos Experiments

Pod Kill Experiment

Scenario: Randomly terminate a backend pod every 60s.

Result: Cluster self-healed; 0% downtime observed.

Chaos Mesh PASSED ✅
```
[2025-12-22T14:48:00Z] Experiment: pod-kill-backend
[2025-12-22T14:48:01Z] Action: Pod backend-5f4d3a killed
[2025-12-22T14:48:02Z] Alert: PodRestarted (Severity: Info)
[2025-12-22T14:48:04Z] Recovery: New pod Running
[2025-12-22T14:48:05Z] SLO Check: Availability > 99.9% (TRUE)
```
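A recurring pod-kill like this one can be expressed as a Chaos Mesh `Schedule` wrapping a `PodChaos`. The sketch below is an assumption, not the manifest actually used here; in particular the namespace and the `app: backend` label are hypothetical and would need to match the real deployment:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-backend
  namespace: chaos-mesh
spec:
  schedule: "* * * * *"    # every minute, matching the ~60s cadence above
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one              # kill a single random matching pod
    selector:
      namespaces:
        - default
      labelSelectors:
        app: backend       # assumed label; adjust to the real deployment
```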

Network Latency

Scenario: Inject a 200ms delay into backend traffic.
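A delay injection of this shape maps to a Chaos Mesh `NetworkChaos` resource. This is a sketch under the same assumptions as above (hypothetical namespace, label, and duration):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: all                # apply to every matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: backend         # assumed label
  delay:
    latency: "200ms"       # the injected delay from the scenario above
  duration: "5m"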

Result: Latency SLO alert fired within 30s.

Chaos Mesh VERIFIED ✅
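The latency alert that fired could be defined as a Prometheus rule along these lines. The metric name (`http_request_duration_seconds_bucket`) and labels are assumptions; only the P95 < 200ms threshold comes from the SLO table above:

```yaml
groups:
  - name: slo-latency
    rules:
      - alert: LatencySLOBreach
        # P95 over a 5-minute window, from an assumed request-duration histogram
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
        for: 30s             # consistent with the alert firing within 30s
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 200ms"
```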

Hybrid Architecture

```mermaid
graph TD
    subgraph "AWS Production (Always On)"
        User((User)) -->|HTTPS| CF[CloudFront CDN]
        CF -->|Static Assets| S3[S3 Bucket]
        CF -->|API Calls| APIG[API Gateway]
        APIG -->|JSON| Lambda[Lambda Function]
        Lambda -->|Read/Write| DDB[(DynamoDB)]
    end
    subgraph "Reliability Lab (Ephemeral)"
        SRE((SRE/Admin)) -->|Start Lab| LabScript[lab.sh]
        LabScript -->|Provisions| EC2[EC2 Spot Node]
        EC2 -->|Hosts| K3s[K3s Cluster]
        subgraph "K8s Namespace: Default"
            BackendPod[Backend Service]
            VerifyPod[Reliability Service]
            VerifyPod -->|Synthetic Traffic| BackendPod
        end
        subgraph "K8s Namespace: Monitoring"
            Prom[Prometheus]
            Graf[Grafana]
            Prom -->|Scrape| BackendPod
        end
        subgraph "K8s Namespace: Chaos Mesh"
            Chaos[Chaos Daemon]
            Chaos -->|Pod Kill| BackendPod
        end
    end
    classDef aws fill:#ff990033,stroke:#ff9900,color:#fff
    classDef k8s fill:#326ce533,stroke:#326ce5,color:#fff
    classDef chaos fill:#ef444433,stroke:#ef4444,color:#fff
    class S3,Lambda,DDB,CF,APIG aws
    class K3s,BackendPod,VerifyPod,Prom,Graf k8s
    class Chaos chaos
```