Provide L2/L3 production support for big data platforms and applications, ensuring high availability and performance.
Monitor, troubleshoot, and resolve issues related to big data jobs, data pipelines, and batch workloads using ESP or similar schedulers.
Manage incidents, problems, and service requests through ServiceNow, adhering to defined SLAs and escalation procedures.
Apply ITIL best practices for incident, problem, change, and release management within the big data environment.
Perform root cause analysis for recurring issues and implement permanent corrective actions to improve platform stability.
Collaborate with data engineering, infrastructure, and application teams to deploy fixes, enhancements, and configuration changes.
Participate in change advisory processes, assess risks, and support planned maintenance and releases on big data systems.
Maintain and update support documentation, runbooks, and knowledge base articles for common issues and procedures.
Proactively identify performance bottlenecks, capacity risks, and operational gaps, and recommend improvements.
Provide on-call support as required, including support during off-hours for critical incidents and scheduled activities.

Minimum Qualifications:
Bachelor's degree in Engineering, Computer Science, or related field (B.Tech or equivalent).
5–8 years of hands-on experience in production support or operations for big data platforms or large-scale data systems.
Strong working knowledge of big data ecosystems and related operational concepts (job monitoring, data pipelines, batch processing).
Proven experience working within ITIL frameworks, including incident, problem, and change management processes.
Practical experience using ServiceNow (or similar ITSM tools) for ticketing, workflow, and SLA management.
Experience supporting and monitoring workloads scheduled through ESP or equivalent enterprise schedulers.
Solid troubleshooting skills with the ability to analyze logs, identify patterns, and resolve production issues under time pressure.

Good to have skills:
Monitoring tools (e.g., Splunk, Dynatrace, AppDynamics), Shell scripting, SQL, Linux/Unix administration, Cloud platforms (AWS/Azure/GCP)
Knowledge of more than one technology
Architecture and design fundamentals
Knowledge of Testing tools
Knowledge of agile methodologies
Understanding of Project life cycle activities on development and maintenance projects
Understanding of one or more estimation methodologies
Knowledge of quality processes
Basics of business domain to understand the business requirements
Analytical abilities, strong technical skills, and good communication skills
Good understanding of the technology and domain
Ability to demonstrate a sound understanding of software quality assurance principles, SOLID design principles and modelling methods
Awareness of latest technologies and trends
Excellent problem solving, analytical and debugging skills