Job Description
Key Result Areas
Production Support Ownership
- Designs and delivers end-to-end AI/ML/GenAI solutions, including model development, deployment, and production-grade implementation.
- Builds scalable backend services and APIs using Flask/FastAPI with cloud-native deployment on Azure (DevOps, AKS, Cognitive Services).
- Leads PoCs and innovation initiatives, experimenting with GenAI, Agentic AI, NLP, CV, OCR, and emerging open-source frameworks.
- Manages AI project lifecycle and teams, overseeing development, resolving model/system issues, and mentoring junior engineers.
- Take end-to-end ownership of production support for the AI-based application, ensuring system stability, uptime, and reliability.
- Lead proactive monitoring of system health, identify recurring issues, and drive permanent fixes in collaboration with technical teams.
- Ensure timely response and resolution of incidents, service requests, and operational issues within agreed SLAs.
Issue Analysis, Troubleshooting & Root Cause Management
- Perform detailed analysis of incidents and problems, identify root causes, and implement preventive actions to avoid recurrence.
- Coordinate with development, infrastructure, and vendor teams to ensure effective resolution of complex technical issues.
- Maintain accurate documentation of incidents, RCA reports, and knowledge articles to support continuous improvement.
Stakeholder Coordination & Communication
- Lead and collaborate closely with business users, product owners, and technology teams to understand issues, communicate updates, and manage expectations.
- Ensure clear, concise, and timely communication regarding incident status, risks, and impact.
- Provide leadership and build strong working relationships with internal teams and external vendors to support smooth issue resolution.
Compliance, Risk & Audit Adherence
- Ensure all production support activities comply with regulatory controls, audit requirements, and internal risk guidelines.
- Lead and maintain operational discipline by following defined PMO, QMS, and architecture standards.
- Identify risks related to data security, application stability, and system availability, and ensure appropriate mitigation steps.
Operational Excellence & Continuous Improvement
- Identify opportunities to streamline support processes, reduce manual effort, and improve turnaround times.
- Lead the team in contributing to automation initiatives, monitoring enhancements, and process improvements.
- Track platform performance, incident trends, and recurring issues, and propose corrective actions to enhance system performance.
Requirements Validation & Support for Enhancements
- Support requirement gathering for fixes, enhancements, and regulatory changes by providing production insights and validating feasibility.
- Provide consultation and leadership, and assist project teams by verifying functional correctness, solution completeness, and alignment with business needs before deployment.
- Participate in UAT, release validation, and change readiness activities to ensure smooth production deployment.
Technology Best Practices & Quality Assurance
- Adhere to Agile and modern engineering practices while supporting continuous delivery and deployment cycles.
- Maintain high-quality documentation, including SOPs, support guides, workflow diagrams, and configuration details.
- Ensure adherence to best practices in monitoring, alerting, logging, and operational reliability.
Vendor & External Partner Collaboration
- Work with vendor teams to ensure timely delivery of fixes, updates, and support escalations.
- Review vendor performance and ensure value delivery aligned with SLA and platform requirements.
Operating Environment, Framework and Boundaries, Working Relationships
Operating Environment
- Provide leadership for L1/L2 production support for the EDMS application, ensuring high availability, system stability, and timely issue resolution.
- Work in a complex technical environment involving Python and AI agent technologies such as OpenAI, Ollama, and Gemini, along with Java and Unix/Linux.
- Perform root cause analysis, incident recovery, and ongoing monitoring, including support during off business hours when required.
- Manage and oversee AI-based solutions in production, including process start/stop, execution, monitoring, and exception handling.
Technical Framework & Skill Requirements
- 4 to 7 years of hands-on experience in production support roles, preferably as a support lead or senior engineer.
- Strong working knowledge of:
  - Python scripting
  - Flask/FastAPI with cloud-native deployment on Azure (DevOps, AKS, Cognitive Services)
  - GenAI, Agentic AI, NLP, CV, OCR, and emerging open-source frameworks
- Proficiency in Microsoft Excel, PowerPoint, and related tools for reporting, presentations, and status tracking.
Reporting, Governance & Documentation
- Prepare and deliver production support reports, project updates, and performance dashboards in a clear and timely manner.
- Provide periodic and ad hoc reports as requested by management or stakeholders.
- Maintain high accuracy and clarity in documentation, status reporting, and communication.
Working Relationships & Cross-Functional Collaboration
- Liaise with business users for issue clarification, data requirements, and production process coordination.
- Collaborate with technology teams such as Corporate Tech, Retail Tech, and Operations Tech for troubleshooting, information exchange, and workflow alignment.
- Coordinate with vendor teams for incident resolution, escalations, and enhancement support.
- Manage communication channels (phone calls, emails, and incident tickets), ensuring timely alerting to relevant process teams.
Behavioral Expectations
- Display excellent attitude, responsiveness, and professionalism in dealing with users and internal teams.
- Demonstrate strong English communication skills, both written and verbal, to ensure clarity and transparency in all interactions.
- Work independently with a high sense of responsibility, ownership, and accountability.
Problem Solving
- Demonstrates strong analytical and diagnostic skills to troubleshoot complex production issues across AI technologies.
- Performs detailed root cause analysis (RCA) to identify failure points, document findings, and implement long-term preventive measures to avoid recurrence.
- Manages crisis situations effectively by coordinating with relevant teams, restoring services quickly, and driving timely recovery during high-severity incidents.
- Communicates clearly and proactively with users, business stakeholders, and technology teams during major incidents, ensuring transparency on status, impact, and recovery steps.
- Applies structured problem-solving techniques such as impact analysis, log investigation, trend analysis, and incident pattern recognition to reduce downtime and improve stability.
- Prioritizes production issues based on severity, business impact, and downstream dependencies to ensure accurate and timely resolution.
- Collaborates with cross-functional teams to investigate recurring issues, propose corrective actions, and drive continuous improvement initiatives.
- Proactively monitors system alerts, logs, and performance indicators to detect potential risks early and prevent system outages.
- Ensures thorough documentation of incidents, RCA outcomes, and corrective actions to build knowledge repositories and reduce future turnaround times.
Decision Making Authority & Responsibility
- Recommend functional and technical solutions that are aligned with business requirements, system constraints, and production support best practices.
- Exercise sound judgment to balance delivery timelines, operational risks, and solution quality while supporting production stability.
- Make decisions related to production changes, ensuring zero compromise on quality and strict adherence to release and change management policies.
- Escalate risks, bottlenecks, or potential service impacts in a timely manner to ensure proper visibility and mitigation.
- Assess multiple solution options and use effective judgment to determine the most feasible approach within required timeframes and operational boundaries.
- Identify and highlight any concerns that may affect deliverables, system stability, or compliance, ensuring early intervention.
- Ensure 100% compliance with all change, release, and deployment governance guidelines when recommending or approving fixes and enhancements.
- Take responsibility for validating functional solutions and ensuring their readiness for production deployment.
- Make informed decisions during incident handling, prioritizing recovery actions based on business impact and urgency.
Knowledge, Skills and Experience
- 4 to 7 years of hands-on experience in Application Production Support for applications developed in AI-related technologies.
- Exposure to BAU (Business-As-Usual) support activities with strong focus on system stability, issue resolution, and stakeholder coordination.
- Experience working in shift-based or on-call support models to ensure SLA adherence.
- Good understanding of Python and AI-based solutions and workflows.
- Knowledge of identifying and raising alerts on application risks, failures, and performance anomalies.
- Ability to coordinate and support patching cycles, vulnerability remediation, and environment maintenance.
- Capable of performing volumetric analysis to assess application performance, capacity utilization, and system scalability.
- Familiarity with incident, problem, and change management processes within production environments.
- Reasonable proficiency in English communication, both verbal and written, to coordinate effectively with business users and technical teams.
- Ability to engage with business users during BAU activities, clarifying issues, gathering inputs, and supporting daily operations.
- Basic team coordination skills to guide junior members or colleagues during support cycles.
- Strong sense of responsibility, ownership, and responsiveness in handling production support duties.
- Ability to raise timely alerts and provide clear communication on risks, performance gaps, or potential disruptions.