
Company Description
CodeChavo is a trusted leader in IT Staffing and Services, catering to the staffing and solutions needs of renowned brands across India and the US. The company is committed to delivering high-quality technology solutions and connecting businesses with top-tier IT professionals. Known for its innovative approach and expertise, CodeChavo thrives on fostering successful partnerships with its clients and employees.
Role Description
Location: Mumbai, India
Work-mode: Hybrid
Experience: 4-8 years
Important Requirement (Please Read Before Applying)
This role requires:
● Willingness to work in rotational shifts (IST & EST time zones)
● Availability to work on weekends (mandatory as per shift schedule)
Please apply only if you are comfortable with the above requirements.
About the Role:
We are looking for an Application Support Engineer (L1/L2) to ensure the stability, reliability,
and smooth functioning of our production systems. This role acts as the first line of defense for
system monitoring and incident response, ensuring that issues are identified early, resolved
quickly, and escalated appropriately. The ideal candidate should be comfortable working in a
high-availability, fast-paced environment, handling alerts, monitoring data pipelines, and
ensuring seamless platform operations.
Key Responsibilities:
Monitoring & System Health
● Monitor production systems using tools such as Datadog, CloudWatch, and internal
dashboards
● Track system health across APIs, data pipelines, databases, and third-party integrations
● Identify anomalies and validate alerts to reduce false positives
Incident Management & Response
● Respond to system alerts in real time (failures, latency spikes, downtime)
● Perform initial incident triage and identify impacted components
● Execute predefined runbooks and recovery actions (job restarts, retries, etc.)
● Escalate issues to engineering teams when required
Data Pipeline Monitoring
● Monitor scheduled jobs and workflows (e.g., Dagster, SageMaker, batch pipelines)
● Identify missing, delayed, or failed data processes
● Trigger re-runs or escalate issues to relevant teams
Third-Party & Vendor Monitoring
● Monitor failures in external APIs, proxies, and vendor systems
● Coordinate with internal teams for resolution
● Track and highlight recurring vendor-related issues
Database Monitoring
● Perform basic database health checks including:
○ Connection issues
○ Slow queries
○ Replication lag
○ Storage utilization
● Raise alerts for any anomalies
Runbook Execution & Documentation
● Follow standard operating procedures and runbooks for known issues
● Maintain clear logs of actions taken during incidents
● Ensure proper closure and documentation of incidents
Reporting & Shift Handover
● Maintain incident logs and reports
● Provide structured shift handovers to ensure continuity
● Highlight recurring issues and patterns for further analysis
What You Will NOT Be Responsible For (To set the right expectations clearly)
● No deep debugging or code-level fixes
● No infrastructure changes
● No ownership of alert configurations (handled by SRE/Engineering teams)
Job ID: 145830275