Software Developer - Production Engineer
POLAND IT development
Job description
Introduction
Watson Orders is a Silicon Valley-based technology development group within IBM targeting the development of world-class conversational AI. Our mission is to deliver advanced technology solutions that address real-world, data-driven needs in the customer-facing quick-service restaurant environment. We are focused on using state-of-the-art Machine Learning, AI, and related technologies to completely transform the customer experience.
Your Role and Responsibilities
We are currently looking for a skilled Software Developer – Production Engineering to ensure the performance and reliability of AI- and ML-driven voice agent microservices, edge Kubernetes clusters, network services, and storage layers.
Responsibilities:
· Work closely with other Watson Orders development teams in an embedded SRE model to help define and implement key metrics for the uptime, reliability, and performance of these services, and develop runbooks for incident management.
· Develop deep service telemetry through metric collection, distributed tracing, visualization, and reporting via OpenTelemetry, Prometheus, and related tooling.
· Implement stability and performance optimizations in Python.
· Design, develop, and maintain CI/CD pipelines for integration and edge Kubernetes clusters.
· Participate in the definition and management of SLIs, SLOs, and error budgets for infrastructure and production services.
· Design and implement infrastructure-as-code pipelines.
Required Technical and Professional Expertise
· AWS experience designing, implementing, and supporting cloud-based infrastructure
· Experience architecting, deploying, and supporting Kubernetes in cloud environments
· Experience designing and supporting distributed systems
· Experience writing production code in one or more languages such as Python (preferred), Java, or Go in a microservices environment
· Linux experience, including configuring, supporting, and optimizing Linux systems
Preferred Technical and Professional Expertise
· Familiarity with running distributed ML workloads in cluster-orchestrated environments
· Experience building and supporting telemetry and related infrastructure (OpenTelemetry, Jaeger, Grafana, Prometheus)
· Experience designing and implementing infrastructure-as-code pipelines
· Pub/sub experience (Kafka, SQS, SNS, MQTT)
· Experience designing and implementing traffic routing strategies in edge and microservices environments