Company : Hnm Solutions
Address : Chennai, Tamil Nadu
Category : Engineering

Job description

Role : Site Reliability Engineer
Location : Chennai, India
Experience : 12 years
Description :
The candidate should have a strong background in software engineering, monitoring, and operations, with a focus on ensuring the reliability, performance, and scalability of our web applications.
Skills :
- Strong understanding of modern single-page web applications (Angular/React, Node.js, etc.) and mobile applications.
- Deep knowledge of monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, ELK stack, Datadog, AppDynamics, New Relic, etc.)
- Familiarity with Configuration Management tools (Ansible, Puppet, etc.) and shell scripting
- AWS Cloud : VPC, subnets, network access control lists, security groups, EC2 instances, S3 buckets, IAM, Route 53, Lambda.
- Experience with containerization and virtualization tools such as Docker, virtual machines, and Kubernetes.
- Strong knowledge of SRE principles and their application to implementing monitoring.
Responsibilities :
1. Monitoring and Alerting :
- Implement and manage monitoring solutions to track the health and performance of services.
- Proactively monitor application stability.
- Set up alerting and automated responses to minimize downtime.
- Perform root cause analysis and manage incidents for issue resolution.
- Monitor system performance, identify bottlenecks, and collaborate on optimizations.
2. Service Reliability :
- Ensure the reliability and availability of our web applications by setting and meeting Service Level Objectives (SLOs).
- Collaborate with development teams to improve the overall reliability of applications and services.
3. Automation :
- Develop and maintain automation scripts and tools for repetitive operational tasks.
4. Product Continuous Improvement :
- Maintain open communication with the Product Owner for product alignment.
- Ensure SRE tasks align with the product's strategic goals.
- Participate in backlog refinement meetings to prioritize SRE-related work items.
- Identify, document, and communicate defects and improvement opportunities.
5. Capacity Planning :
- Conduct capacity planning to ensure that systems can handle expected loads.
- Analyze data and predict future resource requirements, scaling systems as needed.
6. Incident Response :
- Participate in an on-call rotation to respond to incidents and outages promptly.
- Follow incident management procedures and conduct post-incident reviews.
7. Change Management :
- Assess risks associated with changes to the production environment.
- Coordinate and execute deployments, ensuring rollback plans are in place.
8. Performance Analysis :
- Analyze performance bottlenecks and work on optimizing systems for efficiency and cost-effectiveness.
9. Documentation :
- Maintain comprehensive documentation for systems, processes, and procedures.
10. Collaboration :
- Work closely with cross-functional teams, including development, operations, and security, to achieve common goals.
- Foster a culture of reliability within the organization.
11. Other :
- Execute releases and contribute to the deployment process.
- Provide on-call support.

(ref:hirist.tech)
Refer code: 953288. Hnm Solutions - The previous day - 2024-03-17 07:30
