Oracle

SRE - Database Management/AI

  • Posted a month ago

Job Description

Key Responsibilities
  • Operate and optimize Oracle Database and Exadata environments to meet stringent availability, performance, and scalability targets in 24x7 production.

  • Lead database reliability engineering initiatives including HA design patterns, capacity planning, demand forecasting, and performance analysis/system tuning.

  • Deliver advanced performance tuning (SQL optimization, indexing strategies, configuration and storage tuning) and drive measurable improvements in latency, throughput, and stability.

  • Design and maintain backup, recovery, and disaster recovery strategies; validate restore procedures and ensure readiness for mission-critical environments.

  • Apply SRE best practices including defining SLIs/SLOs, managing error budgets, and improving incident response through post-incident reviews and durable corrective actions.

  • Build automation and tools (Python/Shell/PowerShell) to eliminate toil, reduce MTTR, improve deployment reliability, and prevent recurring incidents.

  • Instrument and enhance observability using monitoring/APM stacks (e.g., Prometheus, Grafana, APM) to improve signal quality and reduce alert noise.

  • Partner with engineering and architecture teams on service and database design, data modeling decisions, and system architecture improvements for distributed systems.

Qualifications & Skills

Mandatory

  • Education: Bachelor's or Master's degree in Computer Science, Engineering, or related field (or equivalent practical experience).

  • Experience: 6+ years in SRE, Cloud Engineering, DevOps, Database Reliability, or similar production-operations engineering roles.

  • Oracle Database expertise: Expert hands-on experience with Oracle Database and Exadata administration, high availability architectures, and production operations.

  • Performance tuning: Demonstrated capability in SQL tuning, indexing strategies, resource utilization analysis, and system tuning for high-scale workloads.

  • Backup/DR: Proven experience designing and operating backup, recovery, and disaster recovery solutions for 24x7 mission-critical systems.

  • Automation/scripting: Strong hands-on proficiency in Python and/or Shell/PowerShell for automation, tooling, and operational workflows.

  • Reliability & distributed systems: Solid understanding of cloud concepts, distributed systems behaviors, and SRE fundamentals (SLIs/SLOs, incident response, RCA).

  • Operational excellence: Strong troubleshooting, analytical thinking, and clear communication skills; comfortable acting as an escalation point during critical incidents.

Good-to-Have

  • Cloud platforms: OCI preferred; AWS/Azure/GCP experience also valuable.

  • IaC & configuration management: Terraform, Ansible, and Infrastructure-as-Code best practices.

  • Containers: Kubernetes and Docker exposure in production environments.

  • Observability depth: Experience with database observability, APM tooling, tracing, and alert quality/noise reduction initiatives.

  • AI familiarity: Exposure to LLMs, RAG, or AI agents (especially in operational tooling/automation contexts).

  • Certifications: Oracle Database/Exadata, OCI (or other cloud architect), SRE/DevOps-related certifications.

Self-Assessment Questions

  1. Have I owned production Oracle Database/Exadata environments and successfully improved availability or performance through concrete tuning or architecture changes?

  2. Can I confidently diagnose performance issues end-to-end (SQL, indexing, configuration, storage, and workload characteristics) and explain tradeoffs to stakeholders?

  3. Have I designed and validated backup/restore and DR processes (including regular testing) for systems that require 24x7 reliability?

  4. Do I routinely build automation in Python/Shell/PowerShell to reduce manual operational work, improve MTTR, or prevent recurring incidents?

  5. Am I comfortable applying SRE practices (SLIs/SLOs, error budgets, incident response, RCA/postmortems) and driving improvements across teams?

Career Level - IC3

About Company

Oracle Corporation is an American multinational computer technology corporation headquartered in Austin, Texas. In 2020, Oracle was the second-largest software company in the world by revenue and market capitalization. The company sells database software and technology (particularly its own brands), cloud engineered systems, and enterprise software products, such as enterprise resource planning (ERP) software, human capital management (HCM) software, customer relationship management (CRM) software (also known as customer experience), enterprise performance management (EPM) software, and supply chain management (SCM) software.

Job ID: 143160251