Site Reliability Engineer (SRE) - IBM - Dublin

Job description

Introduction
A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions.

Seeking new possibilities and always staying curious, we are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.

We are seeking a skilled back-end developer to join our IBM Software team. As part of our team, you will be responsible for developing and maintaining high-quality software products, working with a variety of technologies and programming languages.

Your Role and Responsibilities
We’re looking for an experienced Site Reliability Engineer to join our team. At IBM, the Software Defined Networking (SDN) business, which includes IBM Hybrid Cloud Mesh, NS1, and other offerings, focuses on software-based networking: an architectural approach that enables networks to be intelligently and centrally controlled using software. The main focus is on automating network functions, allowing for simpler provisioning and management of network resources everywhere from the data center to the campus to the edge.

Ideally, you will have experience supporting a SaaS platform with one or more customers running production workloads. Your experience will include troubleshooting and debugging live production issues, rectifying problems while working to minimize downtime, and pre-emptively making changes to prevent issues from occurring. Your previous experience with cloud computing, observability tools, and SRE best practices, amongst other things, will enable you to carry out this role effectively.

Ideally, you’ll bring the following experience:
· Cloud Computing (Preferably AWS or IBM Cloud)
· Configuration management and infrastructure-as-code experience (Terraform and Ansible preferred)
· Collaborating with product development engineers to identify, implement and report on service level indicators and objectives
· Software development and scripting (Go/Python/Bash)
· Deploying and troubleshooting complex, global production systems
· Multiple hosting models preferred (managed, colo, and AWS/multi-cloud)
· Admin-level Linux skills

Required Technical and Professional Expertise

· Minimum of 4 to 7 years' experience in hands-on global production system deployment, administration and troubleshooting
· Proven experience in systems performance analysis and debugging in a Linux environment
· Experience in software development and scripting: Bash and Python are required (Go preferred)
· Experience in automation is required
· 2+ years' experience with provisioning and configuration management systems (Terraform, Ansible) across multiple cloud providers
· 2+ years' experience with observability and alerting systems such as Splunk, Loki, OpenTelemetry, or similar
· 2+ years' experience working with different cloud providers such as IBM Cloud, AWS, Azure, GCP
· 3+ years' experience operating systems that run on Kubernetes / OpenShift platforms
· Experience with Postgres database administration and Kafka (or similar)
· Collaborating with product development engineers to identify, implement and report on service level indicators and objectives
· Willingness to participate in an on-call rotation.

Preferred Technical and Professional Expertise

Experience with the following would be an asset:
· Working on integration and delivery systems such as Jenkins
· Containerized applications
· Experience with remote bare-metal hardware provisioning (PXE boot, working with remote hands)
