Data/ML Engineer

  • Posted 5 hours ago
  • Be among the first 10 applicants

Job Description

About the Role

We are looking for a Data Engineer / Machine Learning Engineer to build and scale the data and intelligence layer powering our multi-tenant SaaS platform. You will own data pipelines that ingest, transform, and serve billions of events across analytics, billing, and product surfaces — and increasingly power ML-driven features such as scoring, recommendations, and intelligent automation.

This is a hands-on role with deep ownership: you will design schemas, build pipelines end-to-end, optimize query performance on ClickHouse, and ship ML workflows into production. You will work closely with backend, product, and founding engineering leadership.

Core Responsibilities

  • Design, build, and maintain scalable batch and streaming data pipelines using Apache Airflow and Python.
  • Model and optimize analytical workloads on ClickHouse — including partition strategy, sort keys, materialized views, and ReplacingMergeTree / AggregatingMergeTree patterns.
  • Build and maintain ETL/ELT workflows for ingestion from operational stores (MySQL, PostgreSQL, MongoDB) into the analytics warehouse.
  • Develop, deploy, and monitor machine learning models — from feature engineering to training, evaluation, and production serving.
  • Define and enforce data contracts, schema evolution, and data quality checks across services.
  • Partner with backend teams to instrument event tracking and ensure data correctness across multi-tenant boundaries.
  • Optimize query performance and cost; investigate and resolve slow queries, full partition scans, and skew issues.
  • Contribute to the MLOps stack: model versioning, experiment tracking, monitoring, and retraining pipelines.
  • Write clean, tested, well-documented code. Participate in code reviews and design discussions.
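
The idempotency and failure-recovery expectations above can be sketched in plain Python. The helper names and the overwrite-style load below are illustrative assumptions, not the platform's actual pipeline code: the key idea is that a backfill splits into per-day windows, and each run fully recomputes and overwrites its target partition so retries never double-count.

```python
from datetime import date, timedelta

def daily_windows(start: date, end: date):
    """Yield one (window_start, window_end) pair per day for a
    backfill over the inclusive range [start, end]."""
    day = start
    while day <= end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)

def run_partition(store: dict, events: list, day: date) -> None:
    """Idempotent load: recompute the day's aggregate from source
    events and overwrite the target partition. Re-running the same
    window produces the same result, so retries are safe."""
    rows = [e for e in events if e["day"] == day]
    store[day] = sum(e["amount"] for e in rows)  # full overwrite, never +=
```

Running `run_partition` twice for the same day leaves the store unchanged, which is exactly the property that makes Airflow retries and backfills safe.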

Mandatory Skills

  • Python Programming — strong proficiency, including pandas, NumPy, and production-grade code (typing, packaging, testing).
  • Data Pipelines — solid experience designing batch and/or streaming pipelines, with awareness of idempotency, backfills, and failure recovery.
  • Apache Airflow — authoring DAGs, custom operators, sensors, and managing dependencies in production.
  • ClickHouse — hands-on experience with table engines (MergeTree family), partitioning, sort keys, and materialized views.
  • SQL — advanced proficiency: window functions, CTEs, query plans, and performance tuning on large datasets.
  • Relational and NoSQL databases — working knowledge of PostgreSQL and MongoDB (schemas, indexing, CDC patterns).
  • Distributed data processing — practical experience with PySpark, Dask, or equivalent for large-scale transforms.
  • Message brokers & streaming — hands-on experience with RabbitMQ and Apache Kafka; understanding of producers/consumers, partitioning, consumer groups, delivery guarantees, and dead-letter handling.
  • Machine Learning fundamentals — supervised/unsupervised techniques, model evaluation, and at least one framework (scikit-learn, PyTorch, or TensorFlow).
  • Version control — Git and collaborative workflows (PRs, code reviews).
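
The streaming items above hinge on two ideas worth making concrete: keyed messages hash to a stable partition (preserving per-key ordering), and a consumer group splits partitions so each has exactly one owner. The sketch below is a simplified illustration, not Kafka's actual implementation — Kafka's default partitioner uses murmur2, while CRC32 here just stands in for any stable hash:

```python
import zlib

def partition_for(key: bytes, num_partitions: int = 6) -> int:
    """Route a keyed message to a partition via a stable hash, so all
    events for one key (e.g. one tenant) stay in order on one partition."""
    return zlib.crc32(key) % num_partitions

def assign_partitions(partitions: range, consumers: list) -> dict:
    """Round-robin group assignment: each partition gets exactly one
    consumer, so the group scales out without double-consuming."""
    assignment = {c: [] for c in consumers}
    for p in partitions:
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

Because assignment is per-partition rather than per-message, adding consumers beyond the partition count yields idle consumers — a common sizing consideration in both Kafka and RabbitMQ stream setups.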

Preferred Skills

  • Experience in agile development environments.
  • Familiarity with DevOps tools and CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins).
  • Knowledge of containerization tools like Docker and orchestration platforms like Kubernetes.
  • Exposure to cloud platforms like AWS or GCP (BigQuery, Cloud Composer, GKE, Pub/Sub, Dataflow) is a plus.
  • Familiarity with CDC tools (Debezium) and stream processing frameworks (Kafka Streams, Flink).
  • Exposure to MLOps tooling — MLflow, Weights & Biases, SageMaker, Vertex AI, or equivalent.
  • Experience with LLMs and Generative AI — embeddings, RAG, vector databases (pgvector, Pinecone, Weaviate), and prompt orchestration frameworks.
  • Familiarity with observability tools — Grafana, Prometheus, or Datadog — for data pipelines.
  • Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or related field.
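
The embeddings/RAG item above reduces to one operation: rank stored vectors by cosine similarity to a query vector. The toy example below is a brute-force sketch of that idea — vector databases like pgvector or Pinecone do the same ranking at scale with approximate-nearest-neighbor indexes; the document names and vectors here are invented for illustration:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list, docs: dict, k: int = 2) -> list:
    """Exhaustive vector search: score every stored embedding against
    the query and return the k closest document names."""
    scored = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]
```

In a RAG flow, the returned names would map to text chunks that get stuffed into the LLM prompt as retrieved context.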

Job ID: 147485511