
Ortus Innovation

Data Science Intern

Fresher
  • Posted 9 days ago
  • Over 50 applicants

Job Description

Data Scientist Intern

Duration: 3-month internship with full-time conversion (performance-based)

Stipend: Starting from 30,000, paid at the end of the 3-month internship period. The final amount will be confirmed in the offer letter.

Location: Remote / Hybrid (Delhi / Bhubaneswar)

Start: Immediate

Role Overview

We are looking for a technically strong Data Scientist Intern with deep foundations in data engineering, ingestion systems, and applied machine learning for intelligence workflows.

This role combines:

  • High-throughput data ingestion
  • Intelligence-grade pipeline engineering
  • Feature extraction & representation
  • Similarity modeling & clustering
  • Pattern analysis & explainability

This is not an academic research role. It is not dashboard-oriented.

It is execution-heavy, system-level, and designed for real-world intelligence data operating under noise, ambiguity, and adversarial conditions.

You will work on systems that:

  • Ingest large volumes of heterogeneous data
  • Enforce strict data integrity & provenance
  • Transform raw inputs into structured intelligence
  • Generate embeddings & similarity clusters
  • Detect patterns, anomalies, and narrative shifts
  • Produce explainable outputs for decision support

This role has a clear path to full-time conversion based on technical ownership, modeling rigor, analytical clarity, and reliability under production conditions.

Core Responsibilities

A. Data Engineering & Ingestion Systems

  • Design, build, and maintain data ingestion pipelines for heterogeneous sources
  • Implement web scraping systems across:
      • Surface web
      • Deep web
      • Dark web (Tor-based or restricted sources where applicable; ethical and compliant use only)
  • Build robust scrapers with failure recovery and retry logic
  • Implement proxy-aware scraping (rotation, failure handling, rate-limits)
  • Normalize raw inputs into a unified intelligence data envelope/schema
  • Enforce immutability, audit logs, deduplication, and provenance tracking
  • Implement ingestion health metrics and monitoring
  • Handle multilingual and multi-format data (text, media, audio, documents, PDFs, video, and other unstructured forms of data)
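For candidates gauging the expected level, the failure-recovery and retry logic above can be sketched as follows. This is an illustrative example only; the function name, backoff constants, and the injected `fetch` callable are assumptions, not Ortus code.

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff plus jitter.

    `fetch` is any callable that raises on failure (e.g. a wrapper
    around an HTTP client that raises for non-2xx responses).
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) with up to 25% jitter
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * (1 + random.random() * 0.25))
```

Jitter keeps a fleet of scrapers from retrying in lockstep against a rate-limited source.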

B. Data Transformation & Feature Engineering

  • Clean and normalize unstructured text data
  • Extract entities (locations, organizations, dates)
  • Engineer structured signals from noisy inputs
  • Implement geo-scoring and relevance filtering logic
  • Build keyword expansion and semantic filtering layers
  • Handle encoding issues, multilingual normalization, and tokenization challenges
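As a minimal sketch of the encoding and multilingual normalization work described above (illustrative only; a production pipeline would layer language detection and tokenization on top):

```python
import re
import unicodedata

def normalize_text(raw):
    """Normalize noisy multilingual text for downstream feature extraction.

    NFKC folds compatibility forms (ligatures, full-width characters,
    non-breaking spaces); control characters become spaces; runs of
    whitespace collapse to a single space.
    """
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()
```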

C. Embeddings, Similarity & Clustering

  • Generate semantic embeddings from text content
  • Evaluate embedding quality and drift
  • Implement similarity search and clustering (e.g., DBSCAN, HDBSCAN, Agglomerative, K-Means where justified)
  • Detect emerging patterns across clusters
  • Identify noise vs signal within large datasets
  • Handle high-dimensional vector data efficiently
  • Implement cluster validation metrics and distribution checks
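The similarity-search side of this section reduces to a few lines for small corpora. A hedged sketch (brute-force search is assumed here; at scale this would be replaced by an ANN index such as HNSW or FAISS, as noted later in this posting):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two dense embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(query, corpus, k=3):
    """Brute-force nearest neighbours over a list of vectors.

    Fine for small corpora; swap in an ANN index for production scale.
    """
    scored = [(i, cosine_sim(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda t: -t[1])[:k]
```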

D. Pattern Detection & Explainability

  • Perform cluster-level pattern analysis
  • Identify dominant themes and narrative direction
  • Detect escalatory language or coordinated signals
  • Assign risk categorization (low / medium / high) with justification
  • Build explainable outputs instead of black-box results
  • Provide confidence scoring mechanisms
  • Document reasoning behind each pattern conclusion
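The low/medium/high categorization with justification can be as simple as a thresholded band that carries its own rationale. A sketch with illustrative thresholds (the 0.40 and 0.70 cut-offs are placeholders, not operational values):

```python
def categorize_risk(score):
    """Map a 0-1 risk score to a band plus a human-readable justification.

    Thresholds are illustrative placeholders, not operational values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= 0.7:
        return "high", f"score {score:.2f} >= 0.70 threshold"
    if score >= 0.4:
        return "medium", f"score {score:.2f} in [0.40, 0.70)"
    return "low", f"score {score:.2f} < 0.40 threshold"
```

Returning the justification alongside the label is what makes the output explainable rather than black-box.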

E. System Thinking & Model Discipline

  • Maintain reproducibility of experiments
  • Version datasets and pipelines
  • Implement deterministic model pipelines where required
  • Handle data drift and schema drift
  • Design systems assuming adversarial or misleading inputs
  • Balance performance, correctness, and interpretability
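One common building block for dataset versioning and reproducibility is a content fingerprint. A minimal sketch (the canonical-JSON approach is one of several valid schemes):

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Deterministic SHA-256 over a canonical JSON serialization.

    The same logical content always hashes identically (key order is
    normalized by sort_keys), so the digest can version a dataset
    snapshot and detect silent drift between pipeline runs.
    """
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```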

Required Technical Skills (Must-Have)

Programming & Engineering

  • Strong Python (production-quality, modular, testable code)
  • Experience with web scraping frameworks:
      • Requests / HTTPX
      • BeautifulSoup / lxml
      • Playwright / Selenium (headless browsing)
  • Experience scraping dynamic and non-trivial sources:
      • JS-rendered pages
      • Auth-gated or rate-limited sites
      • Multi-language content
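For static (non-JS-rendered) pages, the parsing layer the frameworks above provide can be approximated in the standard library alone. A dependency-free sketch using `html.parser` instead of BeautifulSoup/lxml, for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

JS-rendered pages need a real browser driver (Playwright/Selenium); this only covers server-rendered HTML.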

Data Systems

  • Strong understanding of ETL / ELT concepts
  • Experience building batch processing systems
  • Failure recovery and idempotent design
  • Familiarity with PostgreSQL (or similar relational DBs)
  • Experience working with structured + unstructured data
  • Experience designing append-only / immutable data systems
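The append-only / idempotent design called for above can be demonstrated in a few lines. A toy in-memory sketch (a real system would back this with durable storage and content-hash record ids):

```python
class AppendOnlyStore:
    """Minimal append-only store: the log is never mutated in place,
    and re-ingesting the same record id is an idempotent no-op."""

    def __init__(self):
        self._log = []
        self._seen = set()

    def ingest(self, record_id, payload):
        """Return True if appended, False if a duplicate was skipped."""
        if record_id in self._seen:
            return False
        self._seen.add(record_id)
        self._log.append({"id": record_id, "payload": payload})
        return True

    def history(self):
        return list(self._log)  # defensive copy preserves immutability
```

Idempotent ingestion means a crashed batch can simply be replayed from the start without duplicating records.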

Proxy & Network Awareness

  • Experience with proxy infrastructure (residential / rotating)
  • Basic understanding of CAPTCHA handling and anti-bot systems
  • Awareness of fingerprinting and rate-limiting behavior
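Proxy rotation, at its simplest, is a round-robin over a pool. A sketch (failure handling, health checks, and per-proxy rate budgets are omitted):

```python
import itertools

def make_proxy_rotator(proxies):
    """Return a zero-argument callable that cycles through the pool.

    Callers advance to the next proxy after a failure or a
    rate-limit response; real systems also evict dead proxies.
    """
    if not proxies:
        raise ValueError("proxy pool is empty")
    pool = itertools.cycle(proxies)
    return lambda: next(pool)
```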

Machine Learning & Similarity

  • Experience generating embeddings (open-source or API-based)
  • Understanding of cosine similarity and vector search
  • Experience implementing clustering algorithms
  • Understanding of noise handling in unsupervised learning
  • Ability to evaluate clustering quality
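"Evaluating clustering quality" can be probed even without scikit-learn. A crude sketch comparing intra- to inter-cluster distances (a silhouette score is the more standard metric; this ratio is just illustrative):

```python
import numpy as np

def intra_inter_ratio(X, labels):
    """Mean intra-cluster distance divided by mean inter-cluster
    distance over all point pairs. Lower means tighter, better-
    separated clusters. O(n^2): only suitable for small samples."""
    X = np.asarray(X, dtype=float)
    labels = list(labels)
    intra, inter = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = float(np.linalg.norm(X[i] - X[j]))
            (intra if labels[i] == labels[j] else inter).append(d)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))
```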

Required Technical Skills (Good to Have)

  • Experience with Tor / onion services (ethical & compliant usage only)
  • Experience scraping research portals, tenders, or institutional datasets
  • Exposure to FastAPI or backend API systems
  • Experience with vector databases
  • Exposure to ANN search (HNSW / FAISS)
  • Basic NLP experience (entity extraction, language detection, normalization)
  • Experience with multilingual embeddings
  • Familiarity with model explainability concepts
  • Understanding of adversarial ML risks

What Makes You a Strong Fit

  • You think in systems, not scripts
  • You prefer correctness over hacks
  • You understand that noisy data is more dangerous than missing data
  • You can explain your model outputs clearly
  • You question assumptions instead of blindly optimizing metrics
  • You design for scale, failure, and auditability

Conversion Criteria for Full-Time Role at Ortus (After 3 Months)

  • Accountability and execution discipline
  • Stability and robustness of pipelines built
  • Quality of embedding + clustering logic
  • Analytical depth in pattern detection
  • Ability to independently own data sources end-to-end
  • Clean, auditable, intelligence-grade outputs
  • Reliability under scale and adversarial conditions
  • Clear documentation and reproducibility

About Ortus

Ortus Innovation is a deep-tech AI company building indigenous intelligence systems for India's defence and internal security ecosystem.

We transform complex, multi-source information into structured, explainable intelligence that enables faster, better decisions in high-risk environments.

Ortus means new beginnings. We represent a shift away from siloed, imported, black-box systems toward secure-by-design, auditable, adaptive systems built specifically for Indian operational realities.

If you are looking for a comfortable analytics internship, this role is not for you.

If you want to build real intelligence systems from ingestion to pattern detection, this is where you start.

More Info

Job ID: 142599581