

What You'll Work On
• Design and develop a next-generation scalable observability platform for modern cloud-native and hybrid infrastructures that works in tandem with AI agents.
• Create intelligent AI agents to analyze logs, traces, and metrics in real time, delivering automated insights and remediation.
• Build scalable and fault-tolerant AI agent frameworks.
• Engineer and optimize large-scale analytics pipelines to process high-velocity telemetry data.
• Build resilient distributed systems with high reliability, performance, and fault tolerance.
• Implement and fine-tune LLMs for natural language querying and automated troubleshooting.
• Partner with ML engineers to streamline AI model deployment and management.

What We're Looking For
• Strong programming skills in Python and Golang (experience with Rust is a plus)
• Track record of building distributed systems and large-scale analytics pipelines
• Hands-on experience with cloud infrastructure (AWS, GCP, or Azure) and Kubernetes
• Deep understanding of observability technologies (Prometheus, OpenTelemetry, Grafana, Elastic, etc.)
• Knowledge of LLMs, AI agents, and agent frameworks such as LangChain and AutoGen is a plus
• Experience with stream processing and real-time data processing frameworks
• Proficiency in database technologies (SQL & NoSQL, Time-Series DBs)
• 5+ years of relevant experience
• Bachelor's degree in Computer Science, Engineering, or related field (Master's/PhD is a plus)
Job ID: 106218035
Skills:
Memory Management, Ceph, Networking Protocols, System profiling and tracing tools, Distributed storage systems, Linux kernel internals, Lustre, HDFS, Crash dump analysis, GlusterFS, Concurrency and synchronization