Gaurav Yadav
Production Reliability Analyst
About Me
Production Reliability Engineer with 3+ years of experience supporting AWS-based microservices in 24x7 environments. Skilled in incident response, observability, and root cause analysis using Datadog, logs, and metrics.
Experienced in improving monitoring quality, reducing MTTR, and collaborating with engineering teams to enhance system reliability.
Technical Skills
Work Experience
Technical Support Analyst III
Cars24 Services Private Limited
- Owned production reliability and incident resolution for distributed microservices running on AWS in a 24x7 environment, managing 200+ production incidents annually while maintaining 99%+ service availability.
- Supported P0/P1 incident triage and performed deep root cause analysis using logs, metrics, and traces, reducing MTTR by 35% through faster issue isolation and coordinated response.
- Built and optimized Datadog dashboards, monitors, and alerts to track application performance, API failures (4xx/5xx), latency spikes, and infrastructure health.
- Investigated system anomalies using Datadog APM, logs, and SQL queries to validate production data and diagnose performance or integration issues.
- Troubleshot API failures, database connectivity issues, and third-party integrations, collaborating with Engineering, DevOps, and Product teams to escalate defects and validate fixes.
- Documented incidents, runbooks, and post-mortems in JIRA and Confluence, identifying recurring issues and recommending monitoring and process improvements to enhance system reliability.
- Participated in post-incident reviews and contributed to improvements in monitoring, alerting, and operational workflows.
Education
BRCM College of Engineering and Technology
2012 - 2016
Bachelor of Technology
Featured Projects
Kubernetes Reliability & Monitoring Lab
- Containerized a FastAPI application with Docker and deployed it on Kubernetes running on AWS EC2.
- Built CI/CD automation using GitHub Actions to build images and deploy updates to the cluster.
- Implemented monitoring with Prometheus and Grafana to track application metrics and container health.
- Improved reliability with Kubernetes readiness and liveness probes, and simulated pod failures to practice incident debugging.
AWS Infrastructure Automation
Automated provisioning of VPC, EC2, RDS, security groups, and load balancers using Terraform modules. Implemented state locking, improving consistency and reducing manual effort.
CI/CD Pipeline Engineering
Built a full Jenkins-based CI/CD pipeline using Git, Docker, and AWS. Automated build, test, and deploy steps; added notifications, rollback logic, and validation hooks to improve deployment reliability.
Get In Touch
Always open to discussing new opportunities, reliability engineering, or just having a chat about cloud tech.
+91-98967 44504
gyadav456@gmail.com
Gurugram, Haryana