
Experience: 7 years+
We are seeking an exceptionally skilled and dedicated ELK Platform Site
Reliability Engineer to join our dynamic infrastructure team, ensuring the robust health,
performance, and scalability of our critical ELK (Elasticsearch, Logstash, Kibana) stack.
This pivotal role involves maintaining cutting-edge observability systems, automating
operational tasks, and collaborating across engineering disciplines to uphold the highest
standards of reliability and efficiency for our data-driven applications. The successful
candidate will be instrumental in managing complex, high-volume logging and analytics
infrastructure, directly contributing to our organization's operational excellence and data
insights capabilities.
Key Responsibilities
Architecting, deploying, managing, and maintaining highly available and fault-
tolerant ELK clusters across diverse environments, encompassing Elasticsearch,
Logstash, Kibana, and Beats agents.
Implementing a Fleet-managed, large-scale deployment of Elastic Agents.
Developing and implementing comprehensive monitoring, alerting, and
dashboarding strategies using Kibana visualizations and integrated alerting
mechanisms to proactively identify and address system anomalies and
performance degradations.
Automating routine operational tasks, deployment pipelines, and cluster
upgrades through sophisticated scripting (e.g., Python, Bash) and infrastructure-
as-code principles utilizing tools like Ansible.
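To give a flavor of the operational tooling this role involves, here is a minimal sketch of an automated health check in Python; the thresholds and the alert tiers ("page", "ticket") are illustrative assumptions, not values from this posting:

```python
import json

# Decide whether a cluster-health payload (shaped like the response of
# Elasticsearch's GET _cluster/health API) warrants an alert.
# Thresholds and tiers below are illustrative, not prescriptive.
def health_alert(payload: dict) -> str:
    status = payload.get("status", "unknown")
    unassigned = payload.get("unassigned_shards", 0)
    if status == "red":
        return "page"    # primary shards unavailable: wake someone up
    if status == "yellow" and unassigned > 0:
        return "ticket"  # replicas missing: fix during business hours
    return "ok"

sample = json.loads('{"status": "yellow", "unassigned_shards": 3}')
print(health_alert(sample))  # ticket
```

In practice a script like this would run from a scheduler or an Ansible-managed cron job, with the payload fetched from the cluster health endpoint.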
Performing in-depth performance tuning and optimization of Elasticsearch
indices, query performance, and underlying hardware/cloud resources to ensure
maximum throughput and minimal latency.
Managing the ingestion pipelines, configuring Logstash filters and outputs, and
ensuring efficient data flow from various sources into the Elasticsearch
datastores.
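A representative Logstash pipeline of the kind this responsibility covers might look like the following; the port, grok pattern, host name, and index pattern are placeholder assumptions:

```conf
# Illustrative Logstash pipeline: Beats in, parsed web logs out.
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => ["https://es01.example.internal:9200"]
    index => "weblogs-%{+YYYY.MM.dd}"
  }
}
```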
Implementing and enforcing robust security measures across the ELK stack,
including access control, encryption (TLS/SSL), and regular vulnerability
assessments.
Troubleshooting complex issues across the entire stack, from data sources and
ingestion agents through to the Elasticsearch cluster and Kibana interface,
employing systematic diagnostic methodologies.
Collaborating closely with development and operations teams to understand
application requirements, optimize data schemas, and facilitate effective log
analysis and troubleshooting.
Designing and executing disaster recovery and business continuity plans
specifically tailored for the ELK platform, ensuring data integrity and service
availability.
Maintaining detailed documentation for system architecture, operational
procedures, troubleshooting guides, and configuration standards.
Critical Qualifications
Demonstrable extensive hands-on experience managing large-scale
Elasticsearch clusters, including deep understanding of index management,
shard allocation, replication strategies, and cluster health monitoring.
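The index-management and shard-allocation work described above rests on simple capacity arithmetic. A back-of-the-envelope sketch, assuming the common (but not universal) rule of thumb of keeping shards in the tens of gigabytes:

```python
import math

# Rough shard planning: the 40 GB target below is a widely cited
# rule of thumb, not a figure from this posting.
def primary_shards(daily_gb: float, retention_days: int,
                   target_shard_gb: float = 40.0) -> int:
    total_gb = daily_gb * retention_days
    return max(1, math.ceil(total_gb / target_shard_gb))

# 50 GB/day retained 30 days -> 1500 GB -> 38 shards at ~40 GB each
print(primary_shards(daily_gb=50, retention_days=30))  # 38
```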
Proven expertise in administering and troubleshooting complex Linux operating
systems (e.g., RHEL, Debian) at an expert level, including performance analysis.
Solid foundational knowledge of web applications, their underlying architectures,
and how they interact with logging and monitoring systems.
A bachelor's degree in Computer Science, Information Technology, Engineering,
or a closely related technical field, or equivalent practical experience.
Possession of relevant industry certifications such as Elastic Certified Engineer,
AWS Certified SysOps Administrator, Red Hat Certified Engineer (RHCE), or
equivalent validation of core competencies.
A minimum of five to seven years of progressive experience in Site Reliability
Engineering, Systems Administration, or DevOps roles with a strong focus on
large-scale distributed systems.
Proficiency with essential infrastructure management tools, including
configuration management systems (Ansible, Chef, Puppet) and orchestration
platforms (OpenShift).
Expertise in scripting languages such as Bash for automation, system
administration tasks, and developing operational tooling.
Thorough understanding of networking concepts, including TCP/IP, HTTP/S
protocols, DNS, load balancing, and firewall configurations relevant to distributed
systems.
Preferred Qualifications
Experience with message queuing technologies like Kafka or RabbitMQ for
buffering and decoupling data ingestion processes.
Hands-on experience with container orchestration systems such as OpenShift,
including deploying and managing Logstash within containerized environments.
Familiarity with various data collection agents beyond Beats, such as Fluentd or
Vector, and their respective configuration nuances.
Knowledge of distributed tracing systems (e.g., Jaeger, Zipkin) and their potential
integration or correlation with ELK data.
Familiarity with CI/CD pipelines and integrating ELK stack deployments and
updates into automated release processes.
A strong grasp of system security best practices, including intrusion detection,
vulnerability management, and security hardening techniques for distributed
systems.
Additional Requirements
Demonstrated technical proficiency in leveraging Elasticsearch Query DSL
(Domain Specific Language) for complex data retrieval and aggregation
operations.
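As a sketch of the kind of Query DSL work this refers to, the body below filters by service and time range, then aggregates error counts per host; the field names (`service.name`, `host.name`, etc.) are hypothetical examples in the Elastic Common Schema style:

```python
import json

# A representative Query DSL request body combining a bool query
# with a terms aggregation. Field names are illustrative.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"service.name": "checkout"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ],
            "must": [{"match": {"log.level": "error"}}],
        }
    },
    "aggs": {
        "errors_per_host": {"terms": {"field": "host.name", "size": 10}}
    },
    "size": 0,  # aggregation-only: skip returning hits
}
print(json.dumps(query, indent=2))
```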
Expertise in designing and implementing robust data retention and lifecycle
management policies within Elasticsearch to optimize storage costs and
performance.
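An index lifecycle management (ILM) policy of the kind this qualification describes might look like the following; the phase ages and sizes are illustrative assumptions:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```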
Deep understanding of JVM tuning parameters and garbage collection algorithms
specifically applied to Elasticsearch performance optimization.
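For context, JVM tuning for Elasticsearch typically happens in `jvm.options`; the heap size below is a placeholder that depends on node RAM, and the G1 settings shown match commonly used defaults:

```conf
# Illustrative jvm.options overrides; sizes depend on node RAM.
# Keep heap at or below ~50% of RAM, and below the ~32 GB
# compressed-oops cutoff. Xms and Xmx should match.
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```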
Understanding of High Availability (HA) and Disaster Recovery (DR) strategies
for Elasticsearch clusters, including snapshot/restore mechanisms and cross-
cluster replication.
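The snapshot/restore mechanism mentioned above is usually automated via snapshot lifecycle management (SLM); a policy body might look like the following, where the repository name and schedule are placeholder assumptions:

```json
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "s3_backups",
  "config": { "indices": ["*"], "include_global_state": true },
  "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 }
}
```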
Proficiency in analyzing and optimizing Logstash pipeline performance, including
understanding codec usage, filter plugin efficiency, and output plugin buffering
strategies.
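Logstash throughput tuning of this kind is largely driven by a handful of `logstash.yml` settings; the values below are illustrative starting points, not recommendations from this posting:

```conf
# Illustrative logstash.yml throughput settings; tune per host.
pipeline.workers: 8        # commonly set to the CPU core count
pipeline.batch.size: 250   # larger batches amortize output overhead
pipeline.batch.delay: 50   # ms to wait while filling a batch
queue.type: persisted      # disk-backed queue absorbs output backpressure
queue.max_bytes: 4gb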
Hands-on experience configuring and managing Kibana security features,
including role-based access control (RBAC), authentication integration (LDAP,
SAML), and space management.
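A Kibana role definition of the kind RBAC work involves (the body of a `PUT /api/security/role/<name>` request) might look like this; the index pattern, features, and space name are hypothetical:

```json
{
  "elasticsearch": {
    "indices": [
      {
        "names": ["weblogs-*"],
        "privileges": ["read", "view_index_metadata"]
      }
    ]
  },
  "kibana": [
    {
      "base": [],
      "feature": { "discover": ["read"], "dashboard": ["read"] },
      "spaces": ["ops"]
    }
  ]
}
```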
Specific technical knowledge and practical application of AI/ML techniques and
algorithms within the ELK ecosystem, such as anomaly detection using
Elasticsearch Machine Learning features, time-series forecasting, or leveraging
ELK data for predictive analytics models.
Ability to script and automate interactions with the Elasticsearch and Kibana APIs
for advanced management and monitoring tasks.
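A minimal sketch of scripting against the Elasticsearch REST API with only the Python standard library; the host and index names are placeholders, and the request is constructed but not sent:

```python
import json
from urllib import request

# Build a PUT request that updates an index's replica count via the
# _settings API. Calling request.urlopen(req) would send it.
def settings_request(base_url: str, index: str,
                     replicas: int) -> request.Request:
    body = json.dumps(
        {"index": {"number_of_replicas": replicas}}
    ).encode()
    return request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = settings_request("https://es01.example.internal:9200",
                       "weblogs-2024.06.01", 2)
print(req.get_method(), req.full_url)
```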
Experience with network troubleshooting tools and techniques (e.g., tcpdump,
Wireshark) for diagnosing connectivity issues impacting data ingestion or cluster
communication.
A thorough understanding of distributed systems concepts, including consensus
algorithms (e.g., the Raft-style protocol behind Elasticsearch cluster
coordination), eventual consistency, and failure modes in clustered
environments.
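The quorum arithmetic behind master election in such clusters is simple majority voting, which is why odd-sized voting configurations (3, 5) are preferred; a one-line illustration:

```python
# A majority of master-eligible nodes must agree for the cluster to
# elect a master; with 4 nodes the quorum is 3, so 4 nodes tolerate
# no more failures than 3 do.
def quorum(master_eligible: int) -> int:
    return master_eligible // 2 + 1

for n in (3, 4, 5):
    print(n, "nodes -> quorum", quorum(n))
```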
Job ID: 145330327