
What You'll Work On
• Design and develop a next-generation scalable observability platform for modern cloud-native and hybrid infrastructures that works in tandem with AI agents.
• Create intelligent AI agents to analyze logs, traces, and metrics in real time, delivering automated insights and remediation.
• Build scalable and fault-tolerant AI agent frameworks.
• Engineer and optimize large-scale analytics pipelines to process high-velocity telemetry data.
• Build resilient distributed systems with high reliability, performance, and fault tolerance.
• Implement and fine-tune LLMs for natural language querying and automated troubleshooting.
• Partner with ML engineers to streamline AI model deployment and management.

What We're Looking For
• Strong programming skills in Python and Golang (experience with Rust is a plus)
• Track record of building distributed systems and large-scale analytics pipelines
• Hands-on experience with cloud infrastructure (AWS, GCP, or Azure) and Kubernetes
• Deep understanding of observability technologies (Prometheus, OpenTelemetry, Grafana, Elastic, etc.)
• Knowledge of LLMs, AI agents, and agent frameworks like LangChain and AutoGen is a plus
• Experience with stream processing and real-time data processing frameworks
• Proficiency in database technologies (SQL & NoSQL, Time-Series DBs)
• 5+ years of relevant experience
• Bachelor's degree in Computer Science, Engineering, or related field (Master's/PhD is a plus)
Job ID: 106218035