Position Overview
We are seeking an experienced L3 Couchbase DBA to manage, optimize, and ensure the reliability of high-availability Couchbase clusters supporting mission-critical applications. The ideal candidate will be responsible for installation, configuration, performance tuning, troubleshooting, monitoring, and Disaster Recovery (DR) operations across large-scale distributed database environments.
This role demands deep expertise in Couchbase internals, hands-on problem-solving skills, and the ability to drive operational excellence in a fast-paced environment.
Required Skills & Qualifications
- 5+ years of experience as a Couchbase DBA, with strong exposure to large-scale distributed systems.
- Deep understanding of:
  - Couchbase cluster architecture
  - N1QL Query Engine
  - Indexing (GSI/FTS)
  - Bucket design and memory management
  - Couchbase security modules
- Strong hands-on experience with Linux OS, shell scripting, networking basics, and debugging tools.
- Experience integrating monitoring with Prometheus/Grafana/Nagios.
- Proven ability to troubleshoot complex production incidents independently (L3 level).
- Excellent communication and documentation abilities.
Work Timings: Rotational Shifts
Roles & Responsibilities
1. Installation & Configuration
- Deploy and configure Couchbase clusters on Linux-based servers in both bare-metal and VM environments.
- Set up services such as KV, Query, Index, Search, and Eventing, ensuring optimal resource allocation and cluster efficiency.
- Plan and execute Couchbase patching cycles and version upgrades with minimal downtime.
- Validate cluster health post-upgrade using tools like cbstats, cbcollect_info, and internal cluster diagnostics.
2. Monitoring & Alerting
- Implement monitoring and alerting for:
  - Node failures
  - Disk watermarks
  - Rebalance operations
  - Index/storage fragmentation
  - Slow-running N1QL queries
- Integrate Couchbase metrics with Prometheus, Grafana, Nagios, or similar monitoring tools.
3. Troubleshooting & Incident Management
- Diagnose and resolve issues related to:
  - Rebalance failures
  - Index fragmentation
  - XDCR delays/lag
  - Query performance bottlenecks
- Perform in-depth root cause analysis (RCA) and maintain detailed incident reports.
- Create and maintain runbooks for recurring or high-impact issues to streamline operations.
4. Performance Tuning
- Optimize:
  - N1QL queries
  - Index design
  - Bucket settings
  - Memory quotas and node resource allocation
- Monitor and fine-tune:
  - Compaction processes
  - Thread concurrency
  - Disk I/O performance
  - Latency and throughput across Couchbase services
5. Backup & Disaster Recovery
- Automate backup and restore processes using cbbackupmgr, ensuring backup integrity and recoverability.
- Regularly validate backups and maintain compliance with retention policies.
- Design, configure, and maintain XDCR (Cross Data Center Replication) for active-active or active-passive DR setups.
- Conduct routine failover drills and ensure DR readiness.
6. Security & Compliance
- Implement and maintain:
  - RBAC (Role-Based Access Control)
  - TLS/mTLS for encrypted connections
  - Audit logging
  - LDAP/SAML-based authentication/authorization
- Ensure compliance with internal and industry security standards.
7. Documentation & Standards
- Maintain detailed:
  - Runbooks
  - Patching guidelines
  - Operational SOPs
  - Architecture and configuration documentation
- Mentor junior DBAs and promote Couchbase best practices within the team.
- Contribute to continuous improvement of database standards, policies, and operational frameworks.