Requirement Details
Primary Location: Noida
Position Overview (Job Summary):
A highly technical Data Engineering role focused on designing, building, and operationalizing data systems that support AI/ML production pipelines. The position centers on ingesting unstructured and structured data, building high-scale automated pipelines, implementing feature stores, supporting vector databases, and enabling MLOps workflows for reproducible AI model development.
Primary Skills:
- Python (expert-level)
- SQL (advanced, tuning)
- Apache Spark / PySpark
- Apache Kafka (streaming)
- ETL/ELT pipeline development
- Feature Stores (Tecton, Feast)
- Vector Databases (Pinecone, Milvus)
- Cloud services: AWS Glue, Azure Data Factory, Google Vertex AI
- Handling unstructured/semi-structured data (Parquet, JSON, Avro, text)
- Data pipeline orchestration (Airflow)
- Delta Lake / Lakehouse architectures
Secondary Skills:
- Hugging Face Datasets
- PyTorch / TensorFlow Data Loaders
- dbt (Data Build Tool)
- NoSQL (MongoDB, Cassandra)
- Distributed computing frameworks (Flink)
- Data quality automation & unit testing
- MLOps integration: data versioning, lineage
- AI/ML pipeline collaboration with data scientists
Experience:
- 6 to 12+ years in Data Engineering
- Minimum 2 years supporting production-grade AI/ML pipelines
- Band: 3.1 to 4.2
Role and Responsibilities:
A. Key Responsibilities
- Build robust, automated ETL/ELT pipelines for AI-ready datasets.
- Perform feature engineering: cleaning, normalizing, and structuring complex data.
- Develop and maintain Feature Stores to support both training and real-time inference.
- Manage distributed, large-scale (petabyte-level) data processing using Spark/Flink.
- Populate, index, and optimize vector databases for Generative AI/RAG workloads.
- Implement data quality checks, unit tests, and bias detection mechanisms.
- Support MLOps workflows: data versioning, lineage, reproducibility.
- Collaborate closely with ML Engineers and Data Scientists for model development.
B. Additional Responsibilities
- Work cross-functionally within Digital Foundation teams.
- Ensure pipeline scalability, performance optimization, and automation maturity.
- Prevent training-serving skew through structured data management practices.
- Provide infrastructure support enabling rapid model training and deployment.
- Contribute to best practices in AI data engineering and cloud-native architectures.
Educational Qualification:
- Bachelor's or Master's degree in:
  - Computer Science
  - Information Systems
  - Engineering
  - Or a related technical field
Certifications:
(Not mandatory, but beneficial)
- Cloud certifications (AWS/Azure/GCP)
- Databricks/Spark certifications
- MLOps / ML engineering certifications
- Kafka, Airflow, or dbt certifications