Site Reliability Engineer – Compute
Cairo, EGYPT IT development
Job description
Introduction
Site Reliability Engineers apply software engineering principles to perform infrastructure
management tasks more efficiently.
We’re seeking skilled, automation-focused Network Engineers to maintain and administer the
Power Virtual Server Cloud Infrastructure-as-a-Service environment and provide reliable and secure
network operations.
The Network Infrastructure Operations Site Reliability Engineer works with clients to ensure
their specific networking requirements are provided, and handles issues reported by
monitoring/automation. Adhering to strict change control, the SRE will make required
configuration changes in the environment and perform various updates/upgrades to the Cisco
ACI-based software-defined networking environment. Constant attention to automating
manual toil is a core focus of this role.
Power Virtual Server is a fast-paced environment. Our engineers provide technical support and resolve
client networking issues within the Power Virtual Server IaaS offering. They identify repetitive tasks,
develop automation to reduce manual toil, and work proactively to avoid client-impacting events.
Your Role and Responsibilities
As a Compute Operations Site Reliability Engineer, you will perform the following tasks:
• Remotely administer Power Server hardware environments across numerous datacenter
locations around the world (currently 20 datacenters and growing).
• Develop automation to reduce manual toil (repetitive, automatable tasks) using shell
scripts (bash, etc.), Python, Ansible, and related tools and languages.
• Perform code stack updates on infrastructure systems (VIOS, firmware, PowerVC, HMC,
Novalink, NIM servers) as well as cloud supporting systems (jump servers, sobox,
network nodes, gateways, TSM servers).
• Upload/maintain stock images.
• Remotely administer AIX and Linux servers.
• Maintain user IDs (add/delete) and passwords.
• Monitor daily/weekly backups to ensure they are working.
• Manage and maintain Nagios monitoring environment, troubleshoot scripts/plug-ins in
case of issues.
• Perform periodic Live Partition Mobility (LPM) migrations, inactive migrations, or remote
restarts of customer VMs to perform system maintenance, balance workloads, or free up
resources.
• Monitor and report on capacity utilization in each datacenter.
• Attend scheduled meetings planned by customers for cutover/maintenance windows.
• Verify capacity requirements when customers report provisioning failures.
• Work with customers to resolve any RSCT issues so that LPM activities can be performed
without impacting customer workloads.
Required Technical and Professional Expertise
• In-depth knowledge of Power Server hardware.
• Significant scripting/coding experience for automating all aspects of IBM Power systems
administration.
• Automation using Python, shell scripting (bash, etc.), Ansible, and related tools and
languages.
• Experience with AIX and Linux administration, commands, and networking.
• Strong experience in one or more of the following: VIO, Novalink, and PowerVC.
Familiarity with at least one more (including installation, configuration, and administration).
• In-depth knowledge of PowerVM including installation/configuration and
administration.
• High-level knowledge of Power Systems supported operating systems (AIX and IBM i).
• In-depth knowledge of how storage is connected and allocated to Power systems via
NPIV connections.
• Good understanding of Power Systems network configuration at the system level.
Preferred Technical and Professional Expertise
• Experience with configuring and tuning PowerVC.
• Experience accessing PowerVS resources using the IBM Cloud Portal, IBM Cloud CLI, APIs,
and Terraform.
• 3+ years’ experience supporting customers using ServiceNow or Salesforce.
• Experience training new personnel on tooling and processes.
• Storage & Power RTS, MVS Network for Cisco, Juniper; general support skills