Hello, I'm

Gaurav Yadav

Production Support | Site Reliability Engineer

Reliability-driven engineer supporting distributed systems, triaging production incidents, and improving service performance. Focused on automation, observability, and reducing operational toil.

Reliability
Automation
Observability

About Me

Gaurav Yadav

Reliability-driven engineer with 3+ years of experience supporting distributed systems, triaging production incidents, and improving service performance. Skilled in AWS, Kubernetes, Terraform, CI/CD pipelines, and Datadog, with a strong focus on automation, observability, and reducing operational toil.

Proven ability to enhance system reliability, optimize monitoring, drive MTTR reduction, and collaborate with engineering teams to build resilient, scalable platforms.

AWS Certified CloudOps Associate
CKA (Certified Kubernetes Administrator)
HashiCorp Terraform Associate

Technical Skills

cloud_net.yml
aws_stack:
  - "EC2, S3, RDS"
  - "VPC, IAM, EKS"
systems:
  - "Linux"
  - "Networking"

containers.yml
orchestration:
  - "Docker"
  - "Kubernetes"
  - "Helm"
  - "Istio"
cicd_gitops:
  - "Jenkins/Git"
  - "ArgoCD"

iac_auto.yml
provisioning:
  - "Terraform"
  - "Ansible"
scripting:
  - "Python"
  - "Bash"

security.yml
policy_code:
  - "Kyverno"
practices:
  - "Secure CI/CD"
  - "Artifact Scan"
  - "Dependency Scan"

sre.yml
datadog_suite:
  - "APM & Logs"
  - "Monitors"
  - "Dashboards"
metrics:
  - "Prometheus"
  - "Grafana"

ops_mgmt.yml
database:
  - "SQL"
collaboration:
  - "JIRA"
  - "Confluence"
  - "Slack"

Work Experience

April 2022 - October 2025

Technical Support Analyst III

Cars24 Services Private Limited

  • Managed production reliability for microservices using Datadog metrics, logs, traces, and dashboards, improving anomaly detection and significantly reducing MTTR.
  • Led incident response end-to-end: triage, mitigation, coordination, and deep RCA using SQL and log analysis, driving long-term fixes aligned with SLOs.
  • Identified performance bottlenecks and reliability gaps across distributed systems; collaborated with Engineering and DevOps to deploy preventive fixes.
  • Developed automation scripts and diagnostic tools in Python/Bash to eliminate repetitive operational tasks and accelerate root-cause identification.
  • Improved alerting and escalation workflows, reducing noise and enhancing on-call efficiency.
  • Authored runbooks, incident documentation, and reliability best practices in Confluence.

Education

BRCM College of Engineering and Technology

2012 - 2016

Bachelor of Technology

Featured Projects

Kubernetes Deployment & Observability

Containerised and deployed a scalable microservice app on Kubernetes. Implemented HPA-based auto-scaling, Helm-based deployments, and integrated Datadog for logs, metrics, and APM to achieve end-to-end observability.

K8s Helm Datadog

AWS Infrastructure Automation

Automated provisioning of VPC, EC2, RDS, security groups, and load balancers using Terraform modules. Implemented state locking, improving consistency and reducing manual effort.

Terraform AWS IaC

CI/CD Pipeline Engineering

Built a Jenkins-based CI/CD pipeline using Git, Docker, and AWS. Automated the build, test, and deploy stages; added notifications, rollback logic, and validation hooks to improve deployment reliability.

Jenkins Docker Git

Get In Touch

Always open to discussing new opportunities, reliability engineering, or just having a chat about cloud tech.

+91-98967 44504

gyadav456@gmail.com

Gurugram, Haryana

Say Hello