Roles & Responsibilities:
- Design, develop, implement, test, and maintain scalable, efficient data pipelines for large-scale structured and unstructured datasets, including document, image, and event data used in GenAI and ML use cases.
- Collaborate closely with data scientists, AI/ML engineers, MLOps engineers, and Product Owners to understand data requirements and ensure data availability and quality.
- Build and optimize data architectures for both batch and real-time processing.
- Develop and maintain data warehouses and data lakes to store and manage large volumes of structured and unstructured data.
- Implement data validation and monitoring processes to ensure data integrity.
- Implement and manage vector databases (e.g., pgVector, Pinecone, FAISS) and embedding pipelines to support retrieval-augmented generation (RAG) architectures.
- Support data sourcing and ingestion strategies, including APIs, data lakes, and message queues.
- Enforce data quality, lineage, observability, and governance standards for AI workloads.
- Work with cross-functional IT and business teams in an Agile environment to deliver successful data solutions.
- Help foster a data-driven culture through information sharing, scalable design, and operational efficiency.
- Stay updated with the latest trends and best practices in data engineering and big data technologies.
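
To illustrate the embedding-pipeline responsibility above, a minimal sketch of similarity-based retrieval over stored embeddings. This is a hypothetical example: the document names (`doc1`–`doc3`) and vectors are invented, and a production system would use a vector database such as pgVector or FAISS and real model-generated embeddings rather than this brute-force search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical document embeddings (in practice produced by an embedding
# model and stored in a vector database, not an in-memory dict).
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 1.0, 0.2],
    "doc3": [0.7, 0.3, 0.1],
}

def top_k(query, k=2):
    """Return the k document ids most similar to the query embedding."""
    ranked = sorted(store, key=lambda d: cosine(query, store[d]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0, 0.0]))  # → ['doc1', 'doc3']
```

Dedicated vector stores replace this linear scan with approximate nearest-neighbor indexes, which is what makes retrieval tractable at large scale.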