DevOps, Lead SRE
Bengaluru (Bangalore Urban) IT development
Job description
Introduction
At IBM, work is more than a job - it's a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you've never thought possible. Are you ready to lead in this new era of technology and solve some of the world's most challenging problems? If so, let's talk.
Your Role and Responsibilities
The IBM Technology Lifecycle Services (TLS) group is looking for a technically oriented, talented, innovative and enthusiastic DevOps SRE to join our TLS AI DevOps team. As the SRE Lead, you will work in an agile, collaborative environment to build, deploy, configure, and maintain AI-based solutions for IBM internal and external clients. In this role you will be responsible for ensuring the availability and responsiveness of our applications by setting up and maintaining policies, procedures, and tools. This includes leading the problem resolution process for our users, from analysis and troubleshooting to deploying workarounds or fixes. Working closely with our worldwide teams, you will have a unique opportunity to grow your experience in modern AI technologies, including watsonx.ai technology as part of our AI4Infrastructure initiative. AI4Infrastructure consists of various use cases that will help us increase productivity for our support engineers, reduce time-to-resolution for client cases and increase support deflection rates. This role may require some attention after normal working hours in order to troubleshoot and resolve production issues experienced by clients. This role may also require shifting working hours on an as-needed basis for follow-the-sun coverage.
The undisputed priority for IBM is to scale watsonx as the AI for business platform for enterprise clients and establish IBM as the top-of-mind choice for AI and foundation models.
Our watsonx suite makes it possible for clients to build, train, tune and deploy AI across their business, leveraging critical, trusted data wherever it resides. TLS is aiming to become client zero in capturing the full potential of this platform.
As the SRE Lead, you will:
· Ensure availability and responsiveness of applications by setting up and maintaining the required documentation, methods and tools
· Provide expertise and insights to project engineering teams and advise on the best approaches to avoid infrastructure and security challenges
· Define roadmaps and milestones for DevOps tasks in support of multiple projects handled on a monthly and quarterly basis
· Handle resolution of blockers, escalation to stakeholders, and provisioning of resources
· Meet with stakeholders and internal teams to communicate and agree on plans and manage notifications when issues arise
· Document maintenance plans, schedules and status for the leadership team and stakeholders
· Manage Ansible, Jenkins, Tekton and other CI/CD solutions
· Diagnose environmental issues and introduce/implement technologies to solve them
· Provision and maintain DevOps infrastructure for projects
· Monitor and support platform infrastructure and manage escalations
· Look for enhancements and innovative solutions to help the services scale and improve existing technical support tools, procedures, or processes
· Develop troubleshooting techniques to effectively identify and investigate issues and provide advice and guidance to clients
· Work in a global team, collaborating with IBMers to share recommendations, solutions and ideas
· Participate in a potential on-duty rotation, including weekend and holiday support on an as-needed basis
Required Technical and Professional Expertise
· 8-12 years of relevant industry experience
· Minimum of 2 years of experience in a DevOps Developer or Engineer role or similar
· Minimum of 3 years of SRE experience
· Strong experience in cloud deployment and deployment of monitoring capabilities
· Proficiency in scripting and the Python programming language
· Experience in analytics and interactive visualization platforms like Grafana
· Strong understanding of CI/CD pipelines and tools like Tekton
· Experience with Terraform modules
· Experience with cloud-based services
· Working knowledge of MLOps and maintenance
· Proficiency in containerization technologies (e.g., Docker, Kubernetes, OCP)
· Experience in monitoring and operating Kubernetes clusters
· Understanding of networking and security concepts
· Self-starter who is organised, has self-learning skills, and is able to work independently
· Great communication skills, self-managed and a team player
· Ability to work effectively as part of a worldwide, agile development team
· Fluent English
Preferred Technical and Professional Expertise
· Skills for implementation, operations and maintenance of DevOps environments - MLOps, DevOps, UX Design
· Experience in deploying and maintaining services and pipelines in IBM Cloud containerized environments
· Experience in automating the deployment and scheduling of microservices