Data Engineer / ML Engineer Job Description
Location - Gurugram (Onsite)
Salary Budget - Up to 18 LPA
Key Responsibilities
- Design, build, and maintain scalable data pipelines (batch and streaming) using Spark, Hadoop, and other Apache ecosystem tools; a minimal PySpark sketch follows this list.
- Develop robust ETL workflows for large-scale data ingestion, transformation, and validation.
- Work with Cassandra, Data Lakes, and distributed storage systems to handle large-volume datasets.
- Write clean, optimized, and modular Python code for data processing, automation, and machine learning tasks.
- Utilize Linux environments for scripting, performance tuning, and data workflow orchestration.
- Build and manage web scraping pipelines to extract structured and unstructured data from diverse sources.
- Collaborate with ML/AI teams to prepare training datasets, manage feature stores, and support the model lifecycle.
- Implement and experiment with LLMs, LangChain, RAG pipelines, and vector database integrations; see the retrieval sketch after this list.
- Assist in fine-tuning models, evaluating their performance, and deploying them to production.
- Optimize data workflows for performance, scalability, and fault tolerance.
- Document data flows, transformation logic, and machine learning processes.
- Work cross-functionally with engineering, product, and DevOps teams to ensure reliable, production-grade data systems.
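As a rough illustration of the pipeline and ETL work above, here is a minimal PySpark batch sketch; the paths, schema, and column names are hypothetical placeholders, not project code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Ingest: read raw JSON events from a hypothetical data-lake landing zone.
raw = spark.read.json("s3a://datalake/landing/events/")

# Transform: parse timestamps, derive a partition column, drop bad rows.
clean = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("user_id").isNotNull())
       .dropDuplicates(["event_id"])
)

# Load: write partitioned Parquet to the curated zone.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://datalake/curated/events/"
)
```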
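On the LLM side, the retrieval step of a RAG pipeline centers on a vector index. Below is a minimal FAISS sketch, using random vectors as stand-ins for embeddings a real pipeline would produce with an embedding model:

```python
import numpy as np
import faiss

dim = 384  # embedding dimensionality (hypothetical)
rng = np.random.default_rng(0)

# Random stand-ins for document and query embeddings.
doc_vecs = rng.random((1000, dim), dtype=np.float32)
query_vec = rng.random((1, dim), dtype=np.float32)

index = faiss.IndexFlatL2(dim)  # exact L2 nearest-neighbour index
index.add(doc_vecs)             # index the corpus

# Retrieve the 5 nearest documents to the query.
distances, ids = index.search(query_vec, 5)
print(ids[0])  # row indices of the retrieved documents
```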
Requirements
- 3-6 years of experience as a Data Engineer, ML Engineer, or similar role.
- Strong expertise in advanced Python (data structures, multiprocessing, async, clean architecture); a minimal asyncio sketch follows this list.
- Solid experience with:
  - Apache Spark / PySpark
  - Hadoop ecosystem (HDFS, Hive, YARN, HBase, etc.)
  - Cassandra or similar distributed databases
  - Linux (CLI tools, shell scripting, environment management)
- Proven ability to design and implement ETL pipelines and scalable data processing systems.
- Hands-on experience with data lakes, large-scale storage, and distributed systems.
- Experience with web scraping frameworks (BeautifulSoup, Scrapy, Playwright, etc.); a scraping sketch follows this list.
- Familiarity with LangChain, LLMs, RAG, vector stores (FAISS, Pinecone, Milvus), and ML workflow tools.
- Understanding of model training, fine-tuning, and evaluation workflows.
- Strong problem-solving skills, with the ability to dig into complex data issues and write production-ready code.
- Experience with cloud environments (AWS/GCP/Azure) is a plus.
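To give a concrete flavor of the advanced-Python requirement, a minimal asyncio sketch of fanning out concurrent I/O-bound work; fetch() here is a hypothetical stand-in for a real network or database call:

```python
import asyncio

async def fetch(item: int) -> int:
    # Hypothetical stand-in for an I/O-bound call (HTTP, database, etc.).
    await asyncio.sleep(0.1)
    return item * 2

async def main() -> None:
    # Fan out all fetches concurrently and gather the results in order.
    results = await asyncio.gather(*(fetch(i) for i in range(10)))
    print(results)

asyncio.run(main())
```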
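And for the scraping requirement, a minimal requests + BeautifulSoup sketch; the URL is a placeholder, and a production scraper would add rate limiting, retries, and robots.txt compliance:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)  # placeholder URL
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Pull out each link's anchor text and target.
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), "->", a["href"])
```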