Data Scientist Intern
Duration: 3-month internship with performance-based full-time conversion
Stipend: Starting from 30,000, paid at the end of the 3-month internship period. The final amount will be confirmed in the offer letter.
Location: Remote / Hybrid (Delhi / Bhubaneswar)
Start: Immediate
Role Overview
We are looking for a technically strong Data Scientist Intern with deep foundations in data engineering, ingestion systems, and applied machine learning for intelligence workflows.
This role combines:
- High-throughput data ingestion
- Intelligence-grade pipeline engineering
- Feature extraction & representation
- Similarity modeling & clustering
- Pattern analysis & explainability
This is not an academic research role. It is not dashboard-oriented.
It is execution-heavy, system-level, and designed for real-world intelligence data operating under noise, ambiguity, and adversarial conditions.
You will work on systems that:
- Ingest large volumes of heterogeneous data
- Enforce strict data integrity & provenance
- Transform raw inputs into structured intelligence
- Generate embeddings & similarity clusters
- Detect patterns, anomalies, and narrative shifts
- Produce explainable outputs for decision support
This role has a clear path to full-time conversion based on technical ownership, modeling rigor, analytical clarity, and reliability under production conditions.
Core Responsibilities
A. Data Engineering & Ingestion Systems
- Design, build, and maintain data ingestion pipelines for heterogeneous sources
- Implement web scraping systems across:
- Surface web
- Deep web
- Dark web (Tor-based or restricted sources where applicable, ethical & compliant use only)
- Build robust scrapers with failure recovery and retry logic
- Implement proxy-aware scraping (rotation, failure handling, rate-limits)
- Normalize raw inputs into a unified intelligence data envelope/schema
- Enforce immutability, audit logs, deduplication, and provenance tracking
- Implement ingestion health metrics and monitoring
- Handle multilingual and multi-format data (text, media, audio, documents, PDFs, videos, and other unstructured formats)
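The failure-recovery and retry behaviour described above can be sketched as a small decorator. This is an illustrative sketch, not the project's actual scraper code: the backoff parameters and the `fetch_page` target are assumptions for the example.

```python
import functools
import time

def with_retries(max_attempts=3, base_delay=0.1):
    """Retry a flaky callable with exponential backoff.

    Illustrative defaults; a production scraper would also cap the total
    delay, add jitter, and distinguish retryable from fatal errors.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted: surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

# Hypothetical source that fails twice before succeeding.
calls = {"n": 0}

@with_retries(max_attempts=3, base_delay=0.01)
def fetch_page():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"
```

The same wrapper composes naturally with proxy rotation: rotate to the next proxy inside the `except` branch before sleeping.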
B. Data Transformation & Feature Engineering
- Clean and normalize unstructured text data
- Extract entities (locations, organizations, dates)
- Engineer structured signals from noisy inputs
- Implement geo-scoring and relevance filtering logic
- Build keyword expansion and semantic filtering layers
- Handle encoding issues, multilingual normalization, and tokenization challenges
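A minimal sketch of the multilingual normalization step above (NFKC folding plus control-character stripping), assuming only the standard library; the exact normalization policy in practice would depend on the downstream models.

```python
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize unicode text for downstream feature extraction.

    NFKC folds compatibility forms (e.g. full-width characters into
    ASCII); control characters are stripped; whitespace is collapsed.
    """
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters (unicode category "C*"),
    # keeping tabs and newlines so they collapse into single spaces.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    return " ".join(text.split())
```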
C. Embeddings, Similarity & Clustering
- Generate semantic embeddings from text content
- Evaluate embedding quality and drift
- Implement similarity search and clustering (e.g., DBSCAN, HDBSCAN, Agglomerative, K-Means where justified)
- Detect emerging patterns across clusters
- Distinguish signal from noise within large datasets
- Handle high-dimensional vector data efficiently
- Implement cluster validation metrics and distribution checks
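As a minimal illustration of the similarity work above: a pure-Python cosine similarity and a naive threshold-based grouping. This is a toy stand-in, not the production approach; a real pipeline would use an ANN index and a density-based algorithm such as HDBSCAN.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose seed is similar enough.

    O(n * k) and no noise handling, unlike DBSCAN/HDBSCAN; shown only to
    make the similarity-threshold idea concrete.
    """
    seeds, labels = [], []
    for v in vectors:
        for i, seed in enumerate(seeds):
            if cosine_similarity(v, seed) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(v)           # start a new cluster seeded by v
            labels.append(len(seeds) - 1)
    return labels
```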
D. Pattern Detection & Explainability
- Perform cluster-level pattern analysis
- Identify dominant themes and narrative direction
- Detect escalatory language or coordinated signals
- Assign risk categorization (low / medium / high) with justification
- Build explainable outputs instead of black-box results
- Provide confidence scoring mechanisms
- Document reasoning behind each pattern conclusion
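One way to make the risk categorization above auditable is to attach every triggered signal to the output alongside the score. The signal names, weights, and thresholds below are illustrative assumptions, not a real scoring policy.

```python
def categorize_risk(signals):
    """Rule-based risk categorization with an explanation trail.

    `signals` maps hypothetical signal names to booleans. The returned
    justification lists exactly which signals drove the score, so the
    conclusion is explainable rather than a black-box label.
    """
    weights = {
        "escalatory_language": 2,   # placeholder weight
        "coordinated_posting": 2,   # placeholder weight
        "geo_proximity": 1,         # placeholder weight
    }
    triggered = [name for name, active in signals.items()
                 if active and name in weights]
    score = sum(weights[name] for name in triggered)
    if score >= 4:
        level = "high"
    elif score >= 2:
        level = "medium"
    else:
        level = "low"
    return {"level": level, "score": score, "justification": triggered}
```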
E. System Thinking & Model Discipline
- Maintain reproducibility of experiments
- Version datasets and pipelines
- Implement deterministic model pipelines where required
- Handle data drift and schema drift
- Design systems assuming adversarial or misleading inputs
- Balance performance, correctness, and interpretability
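Dataset versioning of the kind mentioned above can be as simple as a content hash over canonicalized records. A sketch, assuming JSON-serializable records:

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Deterministic fingerprint of an ordered list of records.

    Canonical JSON (sorted keys, fixed separators) makes the hash stable
    across dict key ordering and runs, so an experiment can pin the
    exact data version it was run against.
    """
    h = hashlib.sha256()
    for record in records:
        canonical = json.dumps(record, sort_keys=True,
                               separators=(",", ":"), ensure_ascii=True)
        h.update(canonical.encode("utf-8"))
        h.update(b"\n")  # record delimiter
    return h.hexdigest()
```

Storing this fingerprint alongside pipeline code versions gives a cheap reproducibility check: if the fingerprint changed, the data changed.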
Required Technical Skills (Must-Have)
Programming & Engineering
- Strong Python (production-quality, modular, testable code)
- Experience with web scraping frameworks:
- Requests / HTTPX
- BeautifulSoup / lxml
- Playwright / Selenium (headless browsing)
- Experience scraping dynamic and non-trivial sources:
- JS-rendered pages
- Auth-gated or rate-limited sites
- Multi-language content
Data Systems
- Strong understanding of ETL / ELT concepts
- Experience building batch processing systems
- Failure recovery and idempotent design
- Familiarity with PostgreSQL (or similar relational DBs)
- Experience working with structured + unstructured data
- Experience designing append-only / immutable data systems
Proxy & Network Awareness
- Experience with proxy infrastructure (residential / rotating)
- Basic understanding of CAPTCHA handling and anti-bot systems
- Awareness of fingerprinting and rate-limiting behavior
Machine Learning & Similarity
- Experience generating embeddings (open-source or API-based)
- Understanding of cosine similarity and vector search
- Experience implementing clustering algorithms
- Understanding of noise handling in unsupervised learning
- Ability to evaluate clustering quality
Preferred Technical Skills (Good to Have)
- Experience with Tor / onion services (ethical & compliant usage only)
- Experience scraping research portals, tenders, or institutional datasets
- Exposure to FastAPI or backend API systems
- Experience with vector databases
- Exposure to ANN search (HNSW / FAISS)
- Basic NLP experience (entity extraction, language detection, normalization)
- Experience with multilingual embeddings
- Familiarity with model explainability concepts
- Understanding of adversarial ML risks
What Makes You a Strong Fit
- You think in systems, not scripts
- You prefer correctness over hacks
- You understand that noisy data is more dangerous than missing data
- You can explain your model outputs clearly
- You question assumptions instead of blindly optimizing metrics
- You design for scale, failure, and auditability
Conversion Criteria for Full-Time Role at Ortus (After 3 Months)
- Accountability and execution discipline
- Stability and robustness of pipelines built
- Quality of embedding + clustering logic
- Analytical depth in pattern detection
- Ability to independently own data sources end-to-end
- Clean, auditable, intelligence-grade outputs
- Reliability under scale and adversarial conditions
- Clear documentation and reproducibility
About Ortus
Ortus Innovation is a deep-tech AI company building indigenous intelligence systems for India's defence and internal security ecosystem.
We transform complex, multi-source information into structured, explainable intelligence that enables faster, better decisions in high-risk environments.
Ortus means new beginnings. We represent a shift away from siloed, imported, black-box systems toward secure-by-design, auditable, adaptive systems built specifically for Indian operational realities.
If you are looking for a comfortable analytics internship, this role is not for you.
If you want to build real intelligence systems from ingestion to pattern detection, this is where you start.