Job description:
Key Responsibilities:
1. AWS Cloud Management
- Manage, monitor, and optimize workloads using EC2, RDS, S3, VPC, CloudWatch, IAM, Lambda, and EKS.
- Implement and maintain Infrastructure as Code (IaC) using Terraform, CloudFormation, or AWS CDK.
- Enable auto-scaling, load balancing, and fault-tolerant architectures.
- Configure and maintain AWS Systems Manager (SSM) for patch automation and fleet management.
- Work with AI/ML services (e.g., Amazon Bedrock, SageMaker, Lookout for Metrics) for predictive insights or operational intelligence.
2. Linux Server Administration
- Manage and secure Red Hat / CentOS / Ubuntu servers (installation, hardening, patching).
- Implement user management, shell scripting, crontab automation, SELinux, and auditing policies.
- Configure and troubleshoot web servers (Apache/Nginx), databases (MySQL/PostgreSQL), and application services.
- Monitor performance and automate log analytics integration with CloudWatch Logs or ELK Stack.
3. Automation & AI Ops
- Develop scripts in Python / Bash / PowerShell for repetitive task automation (e.g., patching, backups, monitoring).
- Integrate AI-based alert correlation and predictive analytics using AWS CloudWatch Anomaly Detection, Amazon DevOps Guru, or third-party AIOps tools.
- Automate operational workflows using AWS Lambda, EventBridge, Step Functions, and SNS.
- Participate in developing self-healing infrastructure via automation triggers and remediation scripts.
4. Security, Compliance & Governance
- Implement IAM least privilege, MFA, GuardDuty, Config Rules, and Security Hub compliance checks.
- Support security posture for DPDP, CERT-In, ISO 27001, and the AWS Well-Architected Framework.
- Ensure patch compliance and vulnerability closure in collaboration with the Security Team.
- Participate in VAPT remediation and audit reporting.
5. Monitoring, Observability & Incident Management
- Use CloudWatch, Grafana, Prometheus, or Datadog for real-time performance insights.
- Utilize AI-based observability tools to reduce false positives and enhance incident triage.
- Handle L2 incident escalation, perform root cause analysis (RCA), and coordinate L3-level resolution.
- Prepare health reports, SLA adherence metrics, and cost optimization dashboards.
Required Skills:
- Strong hands-on experience in AWS EC2, S3, RDS, VPC, IAM, Lambda, CloudWatch, and EKS.
- Proficient in Linux administration (RHEL, CentOS, Ubuntu) and Bash/Python scripting.
- Working knowledge of Terraform / CloudFormation / Ansible / Jenkins.
- Familiarity with AI-powered monitoring or AIOps tools (e.g., Amazon DevOps Guru, Datadog AI, OpsRamp, Splunk AI).
- Knowledge of Docker and Kubernetes for containerized workloads.
- Understanding of networking (DNS, VPN, routing, firewalls) and security best practices.
- Excellent analytical and problem-solving skills with a proactive mindset.