Netflix is the world's leading streaming entertainment service with over 209 million paid memberships in over 190 countries enjoying TV series, documentaries, and feature films across a wide variety of genres and languages. Members can watch as much as they want, anytime, anywhere, on any Internet-connected screen. Members can play, pause and resume watching, all without commercials or commitments.
About the Team:
We build and support products such as Titus, our multi-tenant Kubernetes-based container platform and runtime, which enable and enhance fleet-wide agility, efficiency, and reliability while empowering engineers through valuable abstractions and reduced operational burden. We take an active role in driving our compute primitives forward, evolving them to meet our customers’ needs.
The Compute Platform’s products power workloads across our Data Platforms, Stream Processing, Studio and Content, Encoding, Streaming, Content Delivery, Machine Learning, and Engineering Tooling. We provide a worldwide, highly available compute fabric that launches over a million containers every day. As a critical component of our streaming service, the container management platform is a tier-one system. The team not only designs and develops this tier-one system but also operates and supports it 24x7.
About the Role:
We are looking for a Senior Software Engineer to join us in growing our Compute Platform built on our Kubernetes-based container platform. In this role, you will work on our container orchestration system, with a focus on enabling a best-in-class experience for machine learning and big data processing workloads, and batch-enabled platforms running on top of our products.
People who do well in this role are self-motivated engineers experienced at building and supporting distributed systems, who love to delight the customer both by regularly delivering value and by providing stellar support for our products. A proven ability to tackle complex and ambiguous problems and deliver quality results quickly is essential for this role.
We believe talent is equally geographically distributed but opportunities are not. Our US-based team is happy to embrace remote work and our general support hours are 10am - 4pm Pacific Time. We believe safe spaces where everyone can be their authentic selves are the key to a strong team, so we welcome and embrace all identities, cultures, and backgrounds.
What we are building
· A globally available and extensible container runtime and orchestration platform, built on Kubernetes
· Advanced and industry-leading ML-based scheduling across service and batch jobs, including capacity management, bin packing, over-subscription, fault tolerance, and cross-workload optimization
· An operationally resilient and global-scale control plane
· An intelligent and full-featured batch system supporting optimized scheduling and management of all workload types
· Linux and container runtimes providing industry-leading security and multi-tenant isolation with deep integration to AWS EC2 networking and security as well as Netflix platform infrastructure systems
What you will be doing
· Creating clarity within ambiguity to produce and execute designs and plans
· Participating in the creation and curation of a fantastic team culture
· Championing projects, managing and communicating impact, and delivering results
· Collaborating with the team, Product Managers, partners, and stakeholders on our roadmap
· Operating our systems and responding to incidents, issues, and user support requests as part of an on-call rotation
· Evolving the platform to solve novel challenges while handling web-scale load
Skills we are looking for
· Ability to break down abstract problems into concrete solutions
· Demonstrated experience in improving the reliability and operational automation of complex, multi-tier systems
· Experience beyond the usage of container management platforms and/or container runtimes. Specifically, we are looking for engineers who have extended and improved these platforms rather than only operated them.
· Experience with addressing performance issues across the whole stack from applications to operating systems
· Ability to program across the core project languages Java and Golang
What sets you apart
· Experience building a business-critical large-scale distributed system with extreme availability
· Understanding of systemic security challenges within infrastructure-as-a-service offerings
· Demonstrated community advocacy or open source contribution
· Deep experience with batch systems such as Luigi, Airflow, AWS Batch, Chronos, or other big data infrastructure platforms
What might be interesting to you
· Identifying and implementing improvements to systems and architecture with Netflix-wide impact
· Working with stunning colleagues at the top of their field
· A demonstrated commitment (including funding and time) to creating a more inclusive working environment and diverse workforce
· The ability to choose between working remotely or in the office
Does this sound interesting? Or does this sound interesting-but-intimidating? Please don’t self-select out; let’s figure it out together. We’d love to talk to you!
Netflix is a global company, with a diverse member base, which is why the content we produce reflects that: global perspectives, global stories. As we grow globally, we know that we must have the most talented employees with diverse backgrounds, cultures, perspectives, and experiences to support our innovation and creativity. We are an equal opportunity employer and strive to build balanced teams from all walks of life.