
Clearstate

AI/LLM Data Intelligence Engineer

  • Posted 7 hours ago

Job Description

About this Role

Clearstate is building an AI-powered data intelligence engine for the global MedTech market. This role sits at the heart of that initiative: you will design and own the LLM-based ingestion and analysis pipeline that transforms large volumes of unstructured proprietary data into high-accuracy structured intelligence.


This is not a generic data engineering position. You will be working closely with research analysts and operations leadership to understand the domain, shape the architecture, and prove what is possible - then build it into production.

About Clearstate

Clearstate is a healthcare-focused market intelligence company specializing in MedTech and In Vitro Diagnostics (IVD). We track hundreds of product lines across 50+ global markets, delivering market size, share, and forecasting data to medical device manufacturers worldwide.

Our data combines proprietary primary research (KOL interviews, hospital surveys, distributor interviews) with large-scale secondary data processing (import/trade data, tender datasets, reimbursement claims). We are now investing in the infrastructure and AI capability to automate and scale that analytical engine.

The Opportunity

We are at an inflection point. Our data assets are rich and proprietary, but much of our processing is still manual and script-based. We are building:

  • A centralised cloud data warehouse (BigQuery / Snowflake) as the single source of truth.
  • ETL/ELT pipelines to ingest structured and unstructured data at scale.
  • An LLM-based ingestion and analysis layer to automate first-cut analysis of interview transcripts, import data mappings, and market sizing.
  • An agentic workflow environment where AI produces validated outputs and human analysts review exceptions.

This role leads the AI/LLM engineering workstream. The right person will be the technical architect and builder of our LLM pipeline - the equivalent of a secret sauce capability that strengthens our defensive moat as a business.

The Role

Phase 1 - Discovery and Architecture (Short Term)

  • Immerse yourself in Clearstate's data:
      • Understand the structure of interview transcripts, import datasets, HS code mappings, and market sizing models.
      • Audit existing Python scripts, Google Apps Scripts, and ML experiments to understand the current state.
  • Define the pre-processing pipeline:
      • Convert raw PDFs, Word documents and structured files into clean, normalised inputs suitable for LLM ingestion.
  • Evaluate and recommend open-source LLMs (e.g., Llama, Mistral, Qwen) vs. API-based models for the core inference workload, with total cost of ownership in mind.
  • Produce an architecture proposal and a proof-of-concept demonstrating achievable accuracy on a representative data sample.

Phase 2 - Build and Validate (Medium Term)

  • Build the LLM ingestion pipeline end-to-end:
      • Pre-processing.
      • Prompt engineering.
      • Structured output extraction.
      • Confidence scoring.
      • Exception flagging.
  • Implement domain-specific validation rules to catch hallucinations and data errors before they reach analysts.
  • Run parallel trials against the manual process to quantify accuracy and time savings.
  • Integrate with the data warehouse layer (BigQuery / Snowflake) so outputs land in analytics-ready tables.
  • Instrument the pipeline:
      • Logging.
      • Monitoring.
      • Data quality checks.
      • Exception queues for human review.

Phase 3 - Productionize and Expand (Ongoing)

  • Harden the pipeline for production use and hand over maintainable code to the broader engineering team.
  • Explore agentic workflow patterns (multi-step reasoning, tool-calling) to further automate the analytical chain.
  • Identify adjacent automation opportunities across the research and data operations workflow.
  • Contribute to IP documentation and architecture decisions that underpin Clearstate's long-term defensibility.

Skills and Experience

Essential - Must Have

  • Strong Python engineering. You write clean, testable, production-quality code.
  • Deep practical knowledge of large language models, including prompt engineering, temperature / sampling parameters, context management, hallucination mitigation.
  • Experience pre-processing unstructured data (PDFs, Word, HTML, plain text) into structured formats suitable for LLM input.
  • Familiarity with open-source LLM frameworks (Hugging Face, Ollama, vLLM, LangChain, or similar).
  • Understanding of structured output extraction from LLM responses (JSON schema, function calling, constrained generation).
  • Solid grounding in data engineering, including ETL/ELT concepts, SQL, data pipeline design.
  • Ability to design and run rigorous accuracy evaluations and build confidence in model outputs.
  • Strong scientific rigour. You validate before you ship and you quantify improvement.

Highly Desirable

  • Experience hosting and fine-tuning open-source LLMs in a cloud environment (AWS, GCP, or Azure) to manage cost at scale.
  • Exposure to RAG (Retrieval-Augmented Generation) architectures for grounding LLM outputs in proprietary data.
  • Experience with agentic frameworks (LangGraph, AutoGen, CrewAI, or similar).
  • Hands-on exposure to AI coding agents or agentic development tools such as OpenAI Codex, Claude Code, Cursor, GitHub Copilot Workspace or similar.
  • Data warehouse experience: BigQuery, Snowflake, or Databricks.
  • ML pipeline experience, including feature engineering, model training, validation, and deployment (not just LLM - classical ML for mapping/classification tasks).
  • Familiarity with distributed computing or batch processing for large-scale datasets.
  • Experience building exception and human-in-the-loop review workflows.
  • Background in a data-intensive scientific or analytical domain (physics, genomics, finance, life sciences, or similar).

What Great Looks Like

  • You can move from messy real-world documents to structured, validated outputs, not just demos.
  • You are comfortable comparing model and tool choices pragmatically across quality, cost, latency, maintainability and data-governance trade-offs.
  • You use AI development tools effectively, but still own the architecture, tests, review process and final code quality.
  • You can explain technical decisions clearly to non-engineering stakeholders and turn analyst feedback into measurable system improvements.

Domain Knowledge

  • No prior MedTech knowledge is required.
  • We are buying technical depth. Domain knowledge can be learned. We will invest time in bringing you up to speed on the industry, our data and our clients.
  • What we do ask is intellectual curiosity. A genuine interest in understanding the data you are working with and in building something that makes the research team faster and more accurate.

Reporting Lines

You will work closely with:

  • The Head of Global Operations, setting the strategic direction, business context and priorities.
  • The Tech and Product Manager. Your primary day-to-day collaborator on architecture and implementation.
  • Research analysts and content managers. The domain experts who understand the data and will validate your outputs.

You will have significant autonomy in the early stages. We expect you to define the art of the possible, not just execute a pre-written spec. Regular check-ins with leadership will ensure alignment.


Job ID: 149079485