Data Scientist Intern
Duration: 3-month internship with performance-based full-time conversion
Stipend: Starting from 30,000, paid at the end of the 3-month internship period. The final amount will be confirmed in the offer letter.
Location: Remote / Hybrid (Delhi / Bhubaneswar)
Start: Immediate
Role Overview
We are looking for a technically strong Data Scientist Intern with deep foundations in data engineering, ingestion systems, and applied machine learning for intelligence workflows.
This role combines:
- High-throughput data ingestion
- Intelligence-grade pipeline engineering
- Feature extraction & representation
- Similarity modeling & clustering
- Pattern analysis & explainability
This is not an academic research role. It is not dashboard-oriented.
It is execution-heavy, system-level, and designed for real-world intelligence data operating under noise, ambiguity, and adversarial conditions.
You will work on systems that:
- Ingest large volumes of heterogeneous data
- Enforce strict data integrity & provenance
- Transform raw inputs into structured intelligence
- Generate embeddings & similarity clusters
- Detect patterns, anomalies, and narrative shifts
- Produce explainable outputs for decision support
This role has a clear path to full-time conversion based on technical ownership, modeling rigor, analytical clarity, and reliability under production conditions.
Core Responsibilities
A. Data Engineering & Ingestion Systems
- Design, build, and maintain data ingestion pipelines for heterogeneous sources
- Implement web scraping systems across:
- Surface web
- Deep web
- Dark web (Tor-based or restricted sources where applicable, ethical & compliant use only)
- Build robust scrapers with failure recovery and retry logic
- Implement proxy-aware scraping (rotation, failure handling, rate-limits)
- Normalize raw inputs into a unified intelligence data envelope/schema
- Enforce immutability, audit logs, deduplication, and provenance tracking
- Implement ingestion health metrics and monitoring
- Handle multilingual and multi-format data (text, media, audio, documents, PDFs, videos, and other unstructured formats)
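The failure-recovery and retry behaviour described above can be sketched as a small decorator. This is an illustrative sketch, not the project's actual scraper code: the backoff parameters and the `fetch_page` target are assumptions for the example.

```python
import functools
import time

def with_retries(max_attempts=3, base_delay=0.1):
    """Retry a flaky callable with exponential backoff.

    Illustrative defaults; a production scraper would also cap the total
    delay, add jitter, and distinguish retryable from fatal errors.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted: surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

# Hypothetical source that fails twice before succeeding.
calls = {"n": 0}

@with_retries(max_attempts=3, base_delay=0.01)
def fetch_page():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"
```

The same wrapper composes naturally with proxy rotation: rotate to the next proxy inside the `except` branch before sleeping.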
B. Data Transformation & Feature Engineering
- Clean and normalize unstructured text data
- Extract entities (locations, organizations, dates)
- Engineer structured signals from noisy inputs
- Implement geo-scoring and relevance filtering logic
- Build keyword expansion and semantic filtering layers
- Handle encoding issues, multilingual normalization, and tokenization challenges
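A minimal sketch of the multilingual normalization step above (NFKC folding plus control-character stripping), assuming only the standard library; the exact normalization policy in practice would depend on the downstream models.

```python
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize unicode text for downstream feature extraction.

    NFKC folds compatibility forms (e.g. full-width characters into
    ASCII); control characters are stripped; whitespace is collapsed.
    """
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters (unicode category "C*"),
    # keeping tabs and newlines so they collapse into single spaces.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    return " ".join(text.split())
```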
C. Embeddings, Similarity & Clustering
- Generate semantic embeddings from text content
- Evaluate embedding quality and drift
- Implement similarity search and clustering (e.g., DBSCAN, HDBSCAN, Agglomerative, K-Means where justified)
- Detect emerging patterns across clusters
- Distinguish signal from noise within large datasets
- Handle high-dimensional vector data efficiently
- Implement cluster validation metrics and distribution checks
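As a minimal illustration of the similarity work above: a pure-Python cosine similarity and a naive threshold-based grouping. This is a toy stand-in, not the production approach; a real pipeline would use an ANN index and a density-based algorithm such as HDBSCAN.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose seed is similar enough.

    O(n * k) and no noise handling, unlike DBSCAN/HDBSCAN; shown only to
    make the similarity-threshold idea concrete.
    """
    seeds, labels = [], []
    for v in vectors:
        for i, seed in enumerate(seeds):
            if cosine_similarity(v, seed) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(v)           # start a new cluster seeded by v
            labels.append(len(seeds) - 1)
    return labels
```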
D. Pattern Detection & Explainability
- Perform cluster-level pattern analysis
- Identify dominant themes and narrative direction
- Detect escalatory language or coordinated signals
- Assign risk categorization (low / medium / high) with justification
- Build explainable outputs instead of black-box results
- Provide confidence scoring mechanisms
- Document reasoning behind each pattern conclusion
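One way to make the risk categorization above auditable is to attach every triggered signal to the output alongside the score. The signal names, weights, and thresholds below are illustrative assumptions, not a real scoring policy.

```python
def categorize_risk(signals):
    """Rule-based risk categorization with an explanation trail.

    `signals` maps hypothetical signal names to booleans. The returned
    justification lists exactly which signals drove the score, so the
    conclusion is explainable rather than a black-box label.
    """
    weights = {
        "escalatory_language": 2,   # placeholder weight
        "coordinated_posting": 2,   # placeholder weight
        "geo_proximity": 1,         # placeholder weight
    }
    triggered = [name for name, active in signals.items()
                 if active and name in weights]
    score = sum(weights[name] for name in triggered)
    if score >= 4:
        level = "high"
    elif score >= 2:
        level = "medium"
    else:
        level = "low"
    return {"level": level, "score": score, "justification": triggered}
```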
E. System Thinking & Model Discipline
- Maintain reproducibility of experiments
- Version datasets and pipelines
- Implement deterministic model pipelines where required
- Handle data drift and schema drift
- Design systems assuming adversarial or misleading inputs
- Balance performance, correctness, and interpretability
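Dataset versioning of the kind mentioned above can be as simple as a content hash over canonicalized records. A sketch, assuming JSON-serializable records:

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Deterministic fingerprint of an ordered list of records.

    Canonical JSON (sorted keys, fixed separators) makes the hash stable
    across dict key ordering and runs, so an experiment can pin the
    exact data version it was run against.
    """
    h = hashlib.sha256()
    for record in records:
        canonical = json.dumps(record, sort_keys=True,
                               separators=(",", ":"), ensure_ascii=True)
        h.update(canonical.encode("utf-8"))
        h.update(b"\n")  # record delimiter
    return h.hexdigest()
```

Storing this fingerprint alongside pipeline code versions gives a cheap reproducibility check: if the fingerprint changed, the data changed.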
Required Technical Skills (Must-Have)
Programming & Engineering
- Strong Python (production-quality, modular, testable code)
- Experience with web scraping frameworks:
- Requests / HTTPX
- BeautifulSoup / lxml
- Playwright / Selenium (headless browsing)
- Experience scraping dynamic and non-trivial sources:
- JS-rendered pages
- Auth-gated or rate-limited sites
- Multi-language content
Data Systems
- Strong understanding of ETL / ELT concepts
- Experience building batch processing systems
- Failure recovery and idempotent design
- Familiarity with PostgreSQL (or similar relational DBs)
- Experience working with structured + unstructured data
- Experience designing append-only / immutable data systems
Proxy & Network Awareness
- Experience with proxy infrastructure (residential / rotating)
- Basic understanding of CAPTCHA handling and anti-bot systems
- Awareness of fingerprinting and rate-limiting behavior
Machine Learning & Similarity
- Experience generating embeddings (open-source or API-based)
- Understanding of cosine similarity and vector search
- Experience implementing clustering algorithms
- Understanding of noise handling in unsupervised learning
- Ability to evaluate clustering quality
Preferred Technical Skills (Good to Have)
- Experience with Tor / onion services (ethical & compliant usage only)
- Experience scraping research portals, tenders, or institutional datasets
- Exposure to FastAPI or backend API systems
- Experience with vector databases
- Exposure to ANN search (HNSW / FAISS)
- Basic NLP experience (entity extraction, language detection, normalization)
- Experience with multilingual embeddings
- Familiarity with model explainability concepts
- Understanding of adversarial ML risks
What Makes You a Strong Fit
- You think in systems, not scripts
- You prefer correctness over hacks
- You understand that noisy data is more dangerous than missing data
- You can explain your model outputs clearly
- You question assumptions instead of blindly optimizing metrics
- You design for scale, failure, and auditability
Conversion Criteria for Full-Time Role at Ortus (After 3 Months)
- Accountability and execution discipline
- Stability and robustness of pipelines built
- Quality of embedding + clustering logic
- Analytical depth in pattern detection
- Ability to independently own data sources end-to-end
- Clean, auditable, intelligence-grade outputs
- Reliability under scale and adversarial conditions
- Clear documentation and reproducibility
About Ortus
Ortus Innovation is a deep-tech AI company building indigenous intelligence systems for India's defence and internal security ecosystem.
We transform complex, multi-source information into structured, explainable intelligence that enables faster, better decisions in high-risk environments.
Ortus means new beginnings. We represent a shift away from siloed, imported, black-box systems toward secure-by-design, auditable, adaptive systems built specifically for Indian operational realities.
If you are looking for a comfortable analytics internship, this role is not for you.
If you want to build real intelligence systems from ingestion to pattern detection, this is where you start.