Site reliability engineer
Pune, INDIA IT development
Job description
The Position
KEY ROLES & RESPONSIBILITIES (required):
Responsibilities:
- Design, implement, and maintain site reliability engineering (SRE) practices that ensure the reliability and performance of our production systems.
  - Design and implement SRE practices that align with the company's overall reliability and performance goals.
  - Develop and maintain automated monitoring and alerting systems to proactively identify and address potential issues.
  - Implement incident response procedures to effectively resolve incidents and minimize downtime.
  - Collaborate with developers and other engineers to define and implement service level agreements (SLAs).
  - Conduct regular reviews of SRE practices to ensure they remain effective and aligned with evolving needs.
- Monitor and troubleshoot production systems to identify and resolve issues before they impact users.
  - Continuously monitor production systems for performance degradation, potential failures, and security vulnerabilities.
  - Thoroughly investigate and troubleshoot incidents to identify the root cause and implement corrective actions.
  - Proactively identify potential issues by analyzing system logs, metrics, and trends.
  - Collaborate with developers and other engineers to implement workarounds and fixes for identified issues.
  - Document incident investigations and corrective actions to prevent recurrence and improve future troubleshooting efforts.
- Develop and implement automated monitoring and alerting systems.
  - Design and implement automated monitoring systems to collect and analyze real-time data from production systems.
  - Configure alerting systems to notify appropriate personnel of potential issues or performance deviations.
  - Continuously evaluate and improve the effectiveness of automated monitoring and alerting systems.
  - Automate repetitive tasks to improve efficiency and reduce manual intervention.
- Collaborate with developers and other engineers to design and implement new features and infrastructure changes.
  - Work closely with developers to understand the impact of new features and code changes on system reliability and performance.
  - Provide guidance and recommendations to developers on SRE best practices and design for reliability.
  - Participate in code reviews to identify potential reliability issues and suggest improvements.
  - Collaborate with infrastructure engineers to ensure that new infrastructure components are designed and deployed with reliability and performance in mind.
  - Stay up-to-date on the latest technologies and trends in SRE and DevOps to contribute to continuous innovation and improvement.
- Prepare and deliver technical presentations and documentation.
  - Prepare and deliver technical presentations to share SRE best practices, incident investigations, and lessons learned.
  - Document SRE practices, procedures, and guidelines to ensure knowledge transfer and consistency.
  - Contribute to internal documentation and knowledge bases to aid troubleshooting and problem-solving.
  - Present findings and recommendations to management and stakeholders to inform decision-making processes.
KNOWLEDGE/SKILLS/ATTRIBUTES (required): The minimum education, knowledge, experience, skills, and attributes required to perform the essential functions of this job.
Required Experience, Skills and Qualifications
- 3 – 6 years of relevant experience
- Proven hands-on software/application support experience, with cloud as the main technology area.
- Strong troubleshooting skills and the ability to delve deeply into technical details and acquire or create the knowledge necessary to effectively troubleshoot and repair applications.
- Knowledge of Splunk, VictorOps, AppDynamics, and web automation tools such as Selenium, and the ability to learn new tools and technologies.
- Collaborative team player with excellent influence and interpersonal skills; inspires confidence.
- Experience with public Cloud providers, including Amazon Web Services architecture, tools, and Cloud methodologies.
- Proven ability to design, implement, and maintain SRE practices that ensure system reliability and performance.
- Experience in monitoring and troubleshooting production systems to identify and resolve incidents.
- Familiarity with automated monitoring and alerting systems, including tool selection, configuration, and maintenance.
- Experience collaborating with developers and other engineers to design, implement, and operate reliable systems.
- Excellent written and verbal communication skills
- Exposure to handling customers from various geographies
- Ability to work with minimum supervision
- Team player who shares ideas and resources
- Flexibility to work in shifts or on weekends as per schedule
Education
- Bachelor's degree in engineering or informatics.
Who we are
At Roche, more than 100,000 people across 100 countries are pushing back the frontiers of healthcare. Working together, we’ve become one of the world’s leading research-focused healthcare groups. Our success is built on innovation, curiosity and diversity.
Roche is an Equal Opportunity Employer.