Roles & Responsibilities:
- Ensure high system reliability and uptime.
- Develop and maintain monitoring systems.
- Lead incident response and root cause analysis.
- Automate repetitive tasks for efficiency.
- Perform capacity planning and resource scaling.
- Lead infrastructure-as-code initiatives (e.g., Terraform, Kubernetes).
- Collaborate with development and operations teams.
- Maintain clear documentation and share knowledge.
- Optimize system and application performance.
- Ensure security and compliance standards are met.
- Define, measure, and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to align with business goals.
- Drive continuous process and system improvements.
- Define guidelines, standards, strategies, security policies, and organizational change policies to support the Data Lake.
What we expect of you
Basic Qualifications and Experience:
- Master's degree in a computer science or engineering field and 1 to 3 years of relevant experience, OR
- Bachelor's degree in a computer science or engineering field and 3 to 5 years of relevant experience, OR
- Diploma and a minimum of 8 years of relevant work experience.
Must-Have Skills:
- Proficiency in programming/scripting (Python, Java).
- Experience in Linux/Unix system administration.
- Experience with cloud and data platforms (AWS, Azure, Databricks, Snowflake).
- Proficiency in containerization and orchestration (Docker, Kubernetes).
- Knowledge of Infrastructure as Code (Terraform, Ansible).
- Familiarity with monitoring and logging tools (Prometheus, Grafana).
- Understanding of CI/CD pipelines (Jenkins, GitLab CI/CD).
- Strong networking knowledge and troubleshooting skills.
- Understanding of security principles and compliance.
- Familiarity with database management (SQL and NoSQL).
- Strong troubleshooting and debugging skills.
- Experience in performance optimization.
- Experience with backup and storage solutions.
Good-to-Have Skills:
- Familiarity with the use of AI for development productivity, such as GitHub Copilot, Databricks Assistant, Amazon Q Developer, or equivalent.
- Knowledge of Agile and DevOps practices.
- Skills in disaster recovery planning.
- Familiarity with load testing tools (JMeter, Gatling).
- Basic understanding of AI/ML for monitoring.
- Knowledge of distributed systems and microservices.
- Data visualization skills (Tableau, Power BI).
- Strong communication and leadership skills.
- Understanding of compliance and auditing requirements.
Soft Skills:
- Excellent analytical and problem-solving skills.
- Excellent written and verbal communication skills in English, including the ability to translate technical content into business language for audiences at various levels.
- Ability to work effectively with global, virtual teams.
- High degree of initiative and self-motivation.
- Ability to handle multiple priorities successfully.
- Team-oriented, with a focus on achieving team goals.
- Strong time and task management skills, with the ability to estimate and meet project timelines and to bring consistency and quality assurance across projects.