AI/LLM Data Intelligence Engineer

Clearstate

Gurugram, Gurugram, India

Fresher

Posted 7 hours ago

Job Description

About this Role

Clearstate is building an AI-powered data intelligence engine for the global MedTech market. This role sits at the heart of that initiative: you will design and own the LLM-based ingestion and analysis pipeline that transforms large volumes of unstructured proprietary data into high-accuracy structured intelligence.

This is not a generic data engineering position. You will be working closely with research analysts and operations leadership to understand the domain, shape the architecture, and prove what is possible - then build it into production.

About Clearstate

Clearstate is a healthcare-focused market intelligence company specializing in MedTech and In Vitro Diagnostics (IVD). We track hundreds of product lines across 50+ global markets, delivering market size, share, and forecasting data to medical device manufacturers worldwide.

Our data combines proprietary primary research (KOL interviews, hospital surveys, distributor interviews) with large-scale secondary data processing (import/trade data, tender datasets, reimbursement claims). We are now investing in the infrastructure and AI capability to automate and scale that analytical engine.

The Opportunity

We are at an inflection point. Our data assets are rich and proprietary, but much of our processing is still manual and script-based. We are building:

- A centralised cloud data warehouse (BigQuery / Snowflake) as the single source of truth.
- ETL/ELT pipelines to ingest structured and unstructured data at scale.
- An LLM-based ingestion and analysis layer to automate first-cut analysis of interview transcripts, import data mappings, and market sizing.
- An agentic workflow environment where AI produces validated outputs and human analysts review exceptions.

This role leads the AI/LLM engineering workstream. The right person will be the technical architect and builder of our LLM pipeline - the equivalent of a secret sauce capability that strengthens our defensive moat as a business.

The Role

Phase 1 - Discovery and Architecture (Short Term)

- Immerse yourself in Clearstate's data:
  - Understand the structure of interview transcripts, import datasets, HS code mappings, and market sizing models.
  - Audit existing Python scripts, Google Apps Scripts, and ML experiments to understand the current state.
- Define the pre-processing pipeline:
  - Convert raw PDFs, Word documents and structured files into clean, normalised inputs suitable for LLM ingestion.
- Evaluate and recommend open-source LLMs (e.g., Llama, Mistral, Qwen) vs. API-based models for the core inference workload, with total cost of ownership in mind.
- Produce an architecture proposal and a proof-of-concept demonstrating achievable accuracy on a representative data sample.

Phase 2 - Build and Validate (Medium Term)

- Build the LLM ingestion pipeline end-to-end:
  - Pre-processing.
  - Prompt engineering.
  - Structured output extraction.
  - Confidence scoring.
  - Exception flagging.
- Implement domain-specific validation rules to catch hallucinations and data errors before they reach analysts.
- Run parallel trials against the manual process to quantify accuracy and time savings.
- Integrate with the data warehouse layer (BigQuery / Snowflake) so outputs land in analytics-ready tables.
- Instrument the pipeline:
  - Logging.
  - Monitoring.
  - Data quality checks.
  - Exception queues for human review.

Phase 3 - Productionize and Expand (Ongoing)

- Harden the pipeline for production use and hand over maintainable code to the broader engineering team.
- Explore agentic workflow patterns (multi-step reasoning, tool-calling) to further automate the analytical chain.
- Identify adjacent automation opportunities across the research and data operations workflow.
- Contribute to IP documentation and architecture decisions that underpin Clearstate's long-term defensibility.

Skills and Experience

Essential - Must Have

- Strong Python engineering. You write clean, testable, production-quality code.
- Deep practical knowledge of large language models, including prompt engineering, temperature / sampling parameters, context management, and hallucination mitigation.
- Experience pre-processing unstructured data (PDFs, Word, HTML, plain text) into structured formats suitable for LLM input.
- Familiarity with open-source LLM frameworks (Hugging Face, Ollama, vLLM, LangChain, or similar).
- Understanding of structured output extraction from LLM responses (JSON schema, function calling, constrained generation).
- Solid grounding in data engineering, including ETL/ELT concepts, SQL, and data pipeline design.
- Ability to design and run rigorous accuracy evaluations and build confidence in model outputs.
- Strong scientific rigour. You validate before you ship and you quantify improvement.

Highly Desirable

- Experience hosting and fine-tuning open-source LLMs in a cloud environment (AWS, GCP, or Azure) to manage cost at scale.
- Exposure to RAG (Retrieval-Augmented Generation) architectures for grounding LLM outputs in proprietary data.
- Experience with agentic frameworks (LangGraph, AutoGen, CrewAI, or similar).
- Hands-on exposure to AI coding agents or agentic development tools such as OpenAI Codex, Claude Code, Cursor, GitHub Copilot Workspace or similar.
- Data warehouse experience: BigQuery, Snowflake, or Databricks.
- ML pipeline experience, including feature engineering, model training, validation, and deployment (not just LLM - classical ML for mapping/classification tasks).
- Familiarity with distributed computing or batch processing for large-scale datasets.
- Experience building exception and human-in-the-loop review workflows.
- Background in a data-intensive scientific or analytical domain (physics, genomics, finance, life sciences, or similar).

What Great Looks Like

- You can move from messy real-world documents to structured, validated outputs, not just demos.
- You are comfortable comparing model and tool choices pragmatically across quality, cost, latency, maintainability and data-governance trade-offs.
- You use AI development tools effectively, but still own the architecture, tests, review process and final code quality.
- You can explain technical decisions clearly to non-engineering stakeholders and turn analyst feedback into measurable system improvements.

Domain Knowledge

- No prior MedTech knowledge is required.
- We are buying technical depth. Domain knowledge can be learned. We will invest time in bringing you up to speed on the industry, our data and our clients.
- What we do ask is intellectual curiosity. A genuine interest in understanding the data you are working with and in building something that makes the research team faster and more accurate.

Reporting Lines

You will work closely with:

- The Head of Global Operations, setting the strategic direction, business context and priorities.
- The Tech and Product Manager. Your primary day-to-day collaborator on architecture and implementation.
- Research analysts and content managers. The domain experts who understand the data and will validate your outputs.

You will have significant autonomy in the early stages. We expect you to define the art of the possible, not just execute a pre-written spec. Regular check-ins with leadership will ensure alignment.