About the Role
We are looking for a Software Debugging Engineer with deep expertise in diagnosing and resolving complex issues in large-scale software systems. In this role, you will be the go-to expert for uncovering hard-to-find bugs, performance bottlenecks, and production failures across distributed infrastructure.
Your work will directly improve system reliability, performance, and operational stability in production environments.
What You'll Do
- Debug complex issues across large-scale and distributed software systems
- Perform root cause analysis for production incidents and outages
- Diagnose performance bottlenecks, memory leaks, and resource contention
- Analyze logs, traces, and metrics to identify system failures
- Build debugging tools, scripts, and automation to accelerate issue resolution
- Create reproducible test cases from real production failures
- Partner with engineering teams to implement fixes and preventive measures
- Build and maintain observability systems including logging, tracing, and alerting
- Write clear post-mortems and technical documentation
- Improve system reliability through better monitoring, error handling, and diagnostics
What We're Looking For
- 3+ years of software engineering experience with a strong focus on debugging
- Proven experience debugging large-scale or distributed systems
- Strong proficiency in Python for scripting, automation, and analysis
- Deep understanding of Linux internals, system calls, and command-line tooling
- Hands-on experience with debugging tools such as gdb, strace, perf, tcpdump, and Wireshark
- Experience using profiling tools for CPU, memory, and I/O analysis
- Familiarity with observability stacks such as Prometheus, Grafana, ELK, Jaeger, or similar
- Excellent analytical and problem-solving skills
- Strong written communication skills for documentation and post-mortems
Nice to Have
- Experience debugging containerized systems using Docker and Kubernetes
- Background in SRE, reliability engineering, or infrastructure roles
- Knowledge of database internals and query optimization (PostgreSQL, Redis)
- Experience with asynchronous systems and message queues (Kafka, RabbitMQ)
- Familiarity with memory debugging tools such as Valgrind or AddressSanitizer
- Experience participating in production incident response or on-call rotations
Why Join Us
- Be the go-to expert for debugging complex, real-world systems
- Make a direct impact on system reliability and production stability
- Work with modern infrastructure, observability, and performance tooling
- Collaborate with world-class engineers and researchers
- Contribute to systems trusted by leading AI labs and Fortune 500 enterprises