As a Site Reliability Engineer on the Meraki Backend Infrastructure Team, you are responsible for everything from our server hardware and operating systems to tools for code deployment and service monitoring. You build software and systems to monitor, scale and deploy our distributed cloud services.
Meraki's Backend Infrastructure Team is responsible for building and scaling the cloud that supports millions of Meraki devices across the world. Meraki's customer base has grown by a factor of 2-3 every year, and the Meraki cloud serves more than 2.3 billion HTTP requests per day across six datacentres. Our customers depend on the Meraki cloud to monitor and manage their critical infrastructure of network switches, security appliances, wireless APs, security cameras, and phones.
In this role, you will be part of a small engineering team based in our UK office in Bishopsgate, London. You will make crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also bring your own perspective to our backend systems and continually develop new ways to improve how we manage the underlying infrastructure.
Example projects of a Meraki Site Reliability Engineer:
- Collecting metrics, crunching data and improving service monitoring to detect problems before they're visible to our customers.
- Building systems to automate our server lifecycle, from configuration management to server bootstrap and decommission.
- Scaling our continuous deployment system to accommodate a rapidly growing team and increasing feature velocity without compromising stability.
- Troubleshooting, performing root cause analysis, and resolving production issues from the network and application layers all the way down to the system level. This might include digging into source code (our own or from open source projects), hunting memory leaks, tracing bottlenecks in upstream networks, or optimizing database queries.
- Advising other development teams as they build new products so that those products are scalable, maintainable, and perform well.
You are an ideal candidate if you:
- Script or code in one or two languages such as Ruby, Scala, Python or Bash. You are comfortable digging into other people's source code in search of the root cause of a problem and you automate all the things.
- Care about the customer experience. You have experience supporting an externally-facing production environment.
- Have experience on a pager rotation where you responded to escalations quickly to minimise customer downtime. This role requires being part of a one-week-in-four on-call rotation during London business hours and some weekends, with occasional off-hour cover.
- Believe in the Unix way. You build large systems out of small components that each do one job and do it well. We run Debian.
- Are familiar with logging and monitoring tools such as Graphite, Grafana, Logstash, Elasticsearch, StatsD, collectd and Flapjack.
- Are willing to travel to Meraki HQ in San Francisco 2-4 times a year for departmental events, team collaboration, and visibility.
Keywords: Site Reliability Engineering, DevOps, System Administration, Software Engineering, Production Engineering
Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis. Cisco will consider for employment, on a case by case basis, qualified applicants with arrest and conviction records.