Key Areas of Responsibility
- Own and support monitoring and SRE operations, ensuring system reliability, availability, and performance.
- Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
- Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines.
- Troubleshoot and resolve complex issues during major incidents, providing clear and timely communication.
- Administer and troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configuration, patching, and maintenance, and determine appropriate monitoring requirements for system changes.
- Analyze logs, investigate issues, and perform fault finding to identify performance exceptions.
- Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability.
- Contribute to automation strategies, deployment processes, and continuous operational improvements.
- Participate in on‑call rotations, including off‑hours and scheduled weekend support.
- Participate in Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
- Continuously research and adopt modern monitoring and SRE tools and practices.
Requirements
- Bachelor's degree in Computer Science or Engineering.
- Minimum of 8 years' experience in IT or investment banking.
- Strong experience with monitoring and observability platforms, including ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, Grafana, and Kibana.
- Hands-on experience building and implementing Prometheus pipelines, including exporters, scrape configurations, relabelling, metric routing, and integrations with long‑term storage (e.g., VictoriaMetrics).
- Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch.
- Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
- Solid understanding of metrics, logging, alerting, dashboards, and observability pipelines.
- Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization.
- Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery (DR) / Business Continuity Planning (BCP) activities.
- Experience with automation (e.g., Bash, Python, Ansible, CI/CD tools) is an advantage.
- Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems.
- Prior experience in Production Support, SRE, Monitoring Engineering, or Shared Services Operations with participation in on‑call rotations, including after-hours and weekend support.
- Strong analytical, problem‑solving, and communication skills, with the ability to work collaboratively under pressure.
- Self-motivated and adaptable, able to prioritize, learn continuously, and manage multiple responsibilities effectively.
- Fluent in English.