Hello, I'm

Gaurav Yadav

Production Reliability Analyst

Production Reliability Engineer with 3+ years of experience supporting AWS-based microservices in 24x7 environments. Skilled in incident response, observability, and root cause analysis using Datadog, logs, and metrics.

Reliability
Automation
Observability

About Me

Gaurav Yadav

Experienced in improving monitoring quality, reducing MTTR, and collaborating with engineering teams to enhance system reliability.

AWS Certified CloudOps Associate
CKA (Certified Kubernetes Administrator)
HashiCorp Terraform Associate

Technical Skills

cloud_platforms.yml
aws_services:
  - "EC2, S3, RDS"
  - "VPC, EKS"
systems:
  - "Linux"
  - "Networking Fundamentals"

orchestration.yml
containers:
  - "Docker"
orchestration:
  - "Kubernetes"
  - "Helm"
  - "Istio"

automation.yml
iac:
  - "Terraform"
  - "Ansible"
scripting:
  - "Python"
  - "Bash"

observability.yml
monitoring:
  - "Datadog (APM, Logs, Traces)"
visualization:
  - "Grafana Cloud"

cicd_security.yml
pipelines:
  - "Jenkins"
  - "GitHub Actions"
gitops:
  - "ArgoCD"

tools.yml
data:
  - "SQL"
collaboration:
  - "JIRA"
  - "Confluence"
  - "Slack"

Work Experience

April 2022 - October 2025

Technical Support Analyst III

Cars24 Services Private Limited

  • Owned production reliability and incident resolution for distributed microservices running on AWS in a 24x7 environment, managing 200+ production incidents annually while maintaining 99%+ service availability.
  • Supported P0/P1 incident triage and performed deep root cause analysis using logs, metrics, and traces, reducing MTTR by 35% through faster issue isolation and coordinated response.
  • Built and optimized Datadog dashboards, monitors, and alerts to track application performance, API failures (4xx/5xx), latency spikes, and infrastructure health.
  • Investigated system anomalies using Datadog APM, logs, and SQL queries to validate production data and diagnose performance or integration issues.
  • Troubleshot API failures, database connectivity issues, and third-party integrations, collaborating with Engineering, DevOps, and Product teams to escalate defects and validate fixes.
  • Documented incidents, runbooks, and post-mortems in JIRA and Confluence, identifying recurring issues and recommending monitoring and process improvements to enhance system reliability.
  • Participated in post-incident reviews and contributed to improvements in monitoring, alerting, and operational workflows.

Education

BRCM College of Engineering and Technology

2012 - 2016

Bachelor of Technology

Featured Projects

Kubernetes Reliability & Monitoring Lab

  • Containerized a FastAPI application with Docker and deployed it on Kubernetes running on AWS EC2.
  • Built CI/CD automation using GitHub Actions to build images and deploy updates to the cluster.
  • Implemented monitoring with Prometheus and Grafana to track application metrics and container health.
  • Improved reliability using Kubernetes readiness and liveness probes and simulated pod failures for incident debugging.
K8s, Prometheus, Grafana, GitHub Actions

AWS Infrastructure Automation

Automated provisioning of VPC, EC2, RDS, security groups, and load balancers using Terraform modules. Implemented state locking, improving consistency and reducing manual effort.

Terraform, AWS, IaC

CI/CD Pipeline Engineering

Built an end-to-end Jenkins-based CI/CD pipeline using Git, Docker, and AWS. Automated build, test, and deploy steps; added notifications, rollback logic, and validation hooks to improve deployment reliability.

Jenkins, Docker, Git

Get In Touch

Always open to discussing new opportunities, reliability engineering, or just having a chat about cloud tech.

+91-98967 44504

gyadav456@gmail.com

Gurugram, Haryana
