
  • Posted 15 hours ago

Job Description

Experience: 7+ years

We are seeking an exceptionally skilled and dedicated ELK Platform Site Reliability Engineer to join our dynamic infrastructure team, ensuring the robust health, performance, and scalability of our critical ELK (Elasticsearch, Logstash, Kibana) stack. This pivotal role involves maintaining cutting-edge observability systems, automating operational tasks, and collaborating across engineering disciplines to uphold the highest standards of reliability and efficiency for our data-driven applications. The successful candidate will be instrumental in managing complex, high-volume logging and analytics infrastructure, directly contributing to our organization's operational excellence and data insights capabilities.

Key Responsibilities

• Architecting, deploying, managing, and maintaining highly available and fault-tolerant ELK clusters across diverse environments, encompassing Elasticsearch, Logstash, Kibana, and Beats agents.
• Implementing a Fleet-managed, large-scale deployment of Elastic Agents.
• Developing and implementing comprehensive monitoring, alerting, and dashboarding strategies using Kibana visualizations and integrated alerting mechanisms to proactively identify and address system anomalies and performance degradations.
• Automating routine operational tasks, deployment pipelines, and cluster upgrades through sophisticated scripting (e.g., Python, Bash) and infrastructure-as-code principles utilizing tools like Ansible.
• Performing in-depth performance tuning and optimization of Elasticsearch indices, query performance, and underlying hardware/cloud resources to ensure maximum throughput and minimal latency.
• Managing ingestion pipelines, configuring Logstash filters and outputs, and ensuring efficient data flow from various sources into the Elasticsearch datastores.
• Implementing and enforcing robust security measures across the ELK stack, including access control, encryption (TLS/SSL), and regular vulnerability assessments.
• Troubleshooting complex issues across the entire stack, from data sources and ingestion agents through to the Elasticsearch cluster and Kibana interface, employing systematic diagnostic methodologies.
• Collaborating closely with development and operations teams to understand application requirements, optimize data schemas, and facilitate effective log analysis and troubleshooting.
• Designing and executing disaster recovery and business continuity plans specifically tailored for the ELK platform, ensuring data integrity and service availability.
• Maintaining detailed documentation for system architecture, operational procedures, troubleshooting guides, and configuration standards.
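As an illustration of the monitoring-automation work described above, a routine task is triaging the response of Elasticsearch's cluster health API. The `GET _cluster/health` endpoint and its `status`/`unassigned_shards` fields are real; the function name and alert thresholds below are hypothetical examples, not a prescribed standard:

```python
# Sketch: map an Elasticsearch _cluster/health response body to an
# alert level. The thresholds and names here are illustrative only.

def triage_cluster_health(health: dict) -> str:
    """Classify a _cluster/health response as 'page', 'warn', or 'ok'."""
    status = health.get("status", "red")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        return "page"   # primary shards unassigned: data unavailable
    if status == "yellow" or unassigned > 0:
        return "warn"   # replicas unassigned: redundancy degraded
    return "ok"

# Abbreviated example response body, using the API's real field names:
sample = {"status": "yellow", "unassigned_shards": 3,
          "number_of_nodes": 5, "active_primary_shards": 120}
print(triage_cluster_health(sample))  # → warn
```

In practice a script like this would fetch the response from the cluster's REST API and feed the result into the alerting mechanism of choice.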

Critical Qualifications

• Demonstrable, extensive hands-on experience managing large-scale Elasticsearch clusters, including a deep understanding of index management, shard allocation, replication strategies, and cluster health monitoring.
• Expert-level proficiency in administering and troubleshooting complex Linux operating systems (e.g., RHEL, Debian), including performance analysis.
• Solid foundational knowledge of web applications, their underlying architectures, and how they interact with logging and monitoring systems.
• A bachelor's degree in Computer Science, Information Technology, Engineering, or a closely related technical field, or equivalent practical experience.
• Relevant industry certifications such as Elastic Certified Engineer, AWS Certified SysOps Administrator, Red Hat Certified Engineer (RHCE), or equivalent validation of core competencies.
• A minimum of five to seven years of progressive experience in Site Reliability Engineering, Systems Administration, or DevOps roles, with a strong focus on large-scale distributed systems.
• Proficiency with essential infrastructure management tools, including configuration management systems (Ansible, Chef, Puppet) and orchestration platforms (OpenShift).
• Expertise in scripting languages such as Bash for automation, system administration tasks, and developing operational tooling.
• Thorough understanding of networking concepts, including TCP/IP, HTTP/S protocols, DNS, load balancing, and firewall configurations relevant to distributed systems.

Preferred Qualifications

• Experience with message queuing technologies like Kafka or RabbitMQ for buffering and decoupling data ingestion processes.
• Hands-on experience with container orchestration systems such as OpenShift, including deploying and managing Logstash within containerized environments.
• Familiarity with various data collection agents beyond Beats, such as Fluentd or Vector, and their respective configuration nuances.
• Knowledge of distributed tracing systems (e.g., Jaeger, Zipkin) and their potential integration or correlation with ELK data.
• Familiarity with CI/CD pipelines and integrating ELK stack deployments and updates into automated release processes.
• A strong grasp of system security best practices, including intrusion detection, vulnerability management, and security hardening techniques for distributed systems.

Additional Requirements

• Demonstrated technical proficiency in leveraging Elasticsearch Query DSL (Domain-Specific Language) for complex data retrieval and aggregation operations.
• Expertise in designing and implementing robust data retention and lifecycle management policies within Elasticsearch to optimize storage costs and performance.
• Deep understanding of JVM tuning parameters and garbage collection algorithms specifically applied to Elasticsearch performance optimization.
• Understanding of High Availability (HA) and Disaster Recovery (DR) strategies for Elasticsearch clusters, including snapshot/restore mechanisms and cross-cluster replication.
• Proficiency in analyzing and optimizing Logstash pipeline performance, including understanding codec usage, filter plugin efficiency, and output plugin buffering strategies.
• Hands-on experience configuring and managing Kibana security features, including role-based access control (RBAC), authentication integration (LDAP, SAML), and space management.
• Specific technical knowledge and practical application of AI/ML techniques and algorithms within the ELK ecosystem, such as anomaly detection using Elasticsearch Machine Learning features, time-series forecasting, or leveraging ELK data for predictive analytics models.
• Ability to script and automate interactions with the Elasticsearch and Kibana APIs for advanced management and monitoring tasks.
• Experience with network troubleshooting tools and techniques (e.g., tcpdump, Wireshark) for diagnosing connectivity issues impacting data ingestion or cluster communication.
• A thorough understanding of distributed systems concepts, including consensus algorithms (e.g., Raft for Elasticsearch coordination), eventual consistency, and failure modes in clustered environments.
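To give a concrete sense of the Query DSL proficiency listed above, a typical aggregation request is a JSON body sent to the `_search` endpoint. The sketch below builds one in Python; the `bool`/`term`/`terms`/`percentiles` clauses are real Query DSL constructs, while the index pattern and field names (`logs-*`, `level`, `service`, `response_ms`) are made-up examples:

```python
import json

# Illustrative Query DSL body: for error-level log events, bucket by
# service and compute the 95th-percentile latency per bucket.
# Field and index names are hypothetical examples.
query = {
    "size": 0,  # aggregations only; do not return individual hits
    "query": {"bool": {"filter": [{"term": {"level": "error"}}]}},
    "aggs": {
        "per_service": {
            "terms": {"field": "service", "size": 10},
            "aggs": {
                "p95_latency": {
                    "percentiles": {"field": "response_ms", "percents": [95]}
                }
            },
        }
    },
}

# This body would be POSTed to /logs-*/_search on the cluster's REST API.
body = json.dumps(query)
```

Keeping request bodies as plain data structures like this makes them easy to unit-test and to reuse from automation scripts hitting the Elasticsearch API.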

Job ID: 145330327
