Senior Cloud SRE Engineer

Job Location

Cairo

Deadline

January 14, 2022

Department

Technology

About the job

Job Overview: 


We are looking for a Cloud SRE Engineer to help us grow and maintain our system that improves the experience of hundreds of thousands of users on a daily basis.

Cloud SRE Engineer is a key role at Capiter; it is more than setting up CI tools or managing servers on the cloud. A Site Reliability Engineer at Capiter writes code (tooling to support the dev team, or even in the core microservices), debugs and identifies production issues at many levels (code/frameworks, microservices/datastores, containers/OS, clustering platform, servers, cloud providers), scales the fast-growing infrastructure, and owns the software development process, making sure it is healthy and that newcomers are aware of it.

If you feel like riding the roller coaster, have a sense of ownership, are willing to share knowledge, learn and grow, and are team-centric and focused on end value rather than self-achievement, this could be the opportunity of a lifetime for you.


Responsibilities and Duties:


●  Spread DevOps/SRE culture and continuously enhance the process of software development.

●  Ensure the required security level for the infrastructure, datastores and the different environments: production/staging/development (ACLs, VPNs, authorization, etc.)

●  Maintain the infrastructure on the cloud (GCP): allocate new resources and set up new platforms/clusters with the proper configurations

●  Develop and deploy solutions to optimize the cost of the infrastructure and external services (e.g., set up caching datastores and change the code to integrate them)

●  Maintain our datastores: monitor the load, design and implement backup and restore plans, and handle scaling and clustering (sharding/replication)

●  Contribute to both infrastructure architecture and microservices design

●  Develop and integrate tools/scripts to automate the process of development/deployment.

●  Implement automation tools and frameworks (CI/CD pipelines).

●  Work hands-on with the code (write and push hotfixes to production in urgent cases)

●  Integrate/configure tools for system and inter-microservices monitoring and alerting (mostly over Kubernetes)

●  Handle critical production issues around the clock and prepare incident reports

●  Perform root cause analysis for production issues

●  Design procedures for system troubleshooting and maintenance

●  Work closely with and support the dev team on infrastructure and architecture decisions, debugging production issues, deploying new services, and setting up and allocating new cloud resources


Qualifications


●  Solid experience in the software development life cycle (having worked with agile teams)

●  Software Engineering background

●  Excellent system design skills

●  Familiarity with different open source web development languages/frameworks, and how they’re deployed (e.g., Java, Python, JavaScript, etc.)

●  Strong experience in cloud providers (mainly GCP)

●  Deep understanding of standard networking protocols and components such as HTTP, DNS, TCP/IP, the OSI Model, networking and load balancing.

●  Solid experience in Unix-like operating systems

●  Experience in Linux containers, container orchestration platforms (Docker, Kubernetes), and related tools and technologies (Helm)

●  Admin experience with databases including MySQL and Elasticsearch.

●  Experience with CI/CD principles, architecture and operations.

●  Experience with instrumentation for monitoring and logging the health and availability of services.

●  Familiarity with deployment and management systems such as Puppet, Ansible, Packer, Terraform, etc.

●  Familiarity with messaging systems such as RabbitMQ and Kafka is a must

●  Sense of ownership

●  Good communication skills