Reliability Dashboard

Observability • Chaos Engineering • SLOs

A transparency log of the reliability engineering practices behind my portfolio: resilience is validated continuously with Chaos Mesh, and the results are verified against SLOs via Prometheus.

Service Level Objectives (SLOs)

| Objective | Target | Actual (30d) | Status | Burn Rate |
|---|---|---|---|---|
| Availability (Uptime) | 99.9% | 99.99% | ✅ | 0.1x (Safe) |
| Latency (P95) | < 200ms | 45ms | ✅ | 0x |
| Error Rate (5xx) | < 0.1% | 0.00% | ✅ | 0x |
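The burn-rate column compares how fast the error budget is being consumed against the allowed pace. As a minimal sketch (the helper function is hypothetical, not part of the dashboard):

```python
def burn_rate(slo_target: float, observed_availability: float) -> float:
    """Error-budget burn rate over a window.

    1.0x means the budget is being spent exactly at the allowed pace;
    below 1.0x the SLO is safe, above 1.0x the budget runs out early.
    """
    error_budget = 1.0 - slo_target              # e.g. 99.9% target -> 0.1% budget
    observed_errors = 1.0 - observed_availability
    return observed_errors / error_budget

# The availability row above: 99.9% target, 99.99% observed over 30 days.
print(round(burn_rate(0.999, 0.9999), 2))  # 0.1
```

At 0.1x, only a tenth of the monthly error budget would be consumed if the current error rate held for the full window.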

Chaos Experiments

Pod Kill Experiment

Scenario: Randomly terminate a backend pod every 60s.

Result: Cluster self-healed; 0% downtime observed.

Chaos Mesh PASSED ✅
```
[2025-12-22T14:48:00Z] Experiment: pod-kill-backend
[2025-12-22T14:48:01Z] Action: Pod backend-5f4d3a killed
[2025-12-22T14:48:02Z] Alert: PodRestarted (Severity: Info)
[2025-12-22T14:48:04Z] Recovery: New pod Running
[2025-12-22T14:48:05Z] SLO Check: Availability > 99.9% (TRUE)
```
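A recurring pod-kill like this one can be expressed as a Chaos Mesh `Schedule` wrapping a `PodChaos`. The sketch below is an assumption, not the manifest actually used here; in particular the namespace and the `app: backend` label are hypothetical and would need to match the real deployment:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-backend
  namespace: chaos-mesh
spec:
  schedule: "* * * * *"    # every minute, matching the ~60s cadence above
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one              # kill a single random matching pod
    selector:
      namespaces:
        - default
      labelSelectors:
        app: backend       # assumed label; adjust to the real deployment
```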

Network Latency

Scenario: Inject a 200ms delay into backend traffic.
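A delay injection of this shape maps to a Chaos Mesh `NetworkChaos` resource. This is a sketch under the same assumptions as above (hypothetical namespace, label, and duration):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: all                # apply to every matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: backend         # assumed label
  delay:
    latency: "200ms"       # the injected delay from the scenario above
  duration: "5m"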

Result: Latency SLO alert fired within 30s.

Chaos Mesh VERIFIED ✅
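The latency alert that fired could be defined as a Prometheus rule along these lines. The metric name (`http_request_duration_seconds_bucket`) and labels are assumptions; only the P95 < 200ms threshold comes from the SLO table above:

```yaml
groups:
  - name: slo-latency
    rules:
      - alert: LatencySLOBreach
        # P95 over a 5-minute window, from an assumed request-duration histogram
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
        for: 30s             # consistent with the alert firing within 30s
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 200ms"
```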

Hybrid Architecture

```mermaid
graph TD
    subgraph "AWS Production (Always On)"
        User((User)) -->|HTTPS| CF[CloudFront CDN]
        CF -->|Static Assets| S3[S3 Bucket]
        CF -->|API Calls| APIG[API Gateway]
        APIG -->|JSON| Lambda[Lambda Function]
        Lambda -->|Read/Write| DDB[(DynamoDB)]
    end
    subgraph "Reliability Lab (Ephemeral)"
        SRE((SRE/Admin)) -->|Start Lab| LabScript[lab.sh]
        LabScript -->|Provisions| EC2[EC2 Spot Node]
        EC2 -->|Hosts| K3s[K3s Cluster]
        subgraph "K8s Namespace: Default"
            BackendPod[Backend Service]
            VerifyPod[Reliability Service]
            VerifyPod -->|Synthetic Traffic| BackendPod
        end
        subgraph "K8s Namespace: Monitoring"
            Prom[Prometheus]
            Graf[Grafana]
            Prom -->|Scrape| BackendPod
        end
        subgraph "K8s Namespace: Chaos Mesh"
            Chaos[Chaos Daemon]
            Chaos -->|Pod Kill| BackendPod
        end
    end
    classDef aws fill:#ff990033,stroke:#ff9900,color:#fff
    classDef k8s fill:#326ce533,stroke:#326ce5,color:#fff
    classDef chaos fill:#ef444433,stroke:#ef4444,color:#fff
    class S3,Lambda,DDB,CF,APIG aws
    class K3s,BackendPod,VerifyPod,Prom,Graf k8s
    class Chaos chaos
```