What you'll do:
Documenting incidents, runbooks, and incident response reports
Writing postmortem reports
Ensuring proper logging, monitoring, and alerting
Handling change management and maintaining an incident management system
Driving root cause analysis exercises for issues
Adopting Site Reliability Engineering practices in the group
Tailoring processes to manage time-sensitive issues and bring them to appropriate closure.
Owning end-to-end availability and performance of mission-critical services and building automation to prevent the recurrence of problems.
Key Responsibilities
Work with cross-product teams to ensure high availability of the system.
Experience in at least one of the following languages and willingness to learn new ones: Bash, PHP, Golang
Ability to identify system bottlenecks and recommend solutions to availability issues.
Proven expertise in system-level debugging
Working experience in building massively scalable high-performance services
Strong Linux systems knowledge
Guide new SRE engineers on aspects of system debugging.
Hands-on experience working with AWS systems and components.
Skills: Production Support, L3 Support
Experience: 5-10 years