
GyanDhan

Software Developer Intern – Web Scraping & Data Extraction

  • Posted 6 days ago
  • Over 100 applicants

Job Description

About The Role

We're looking for a motivated and detail-oriented university student to join us as a Software Developer Intern. In this role, you'll design, build, and deliver a production-quality web scraper that discovers and extracts structured course data from a university's official website.

This is a hands-on engineering role — not a copy-paste exercise. You'll be responsible for the full lifecycle: crawling strategy, data extraction, schema mapping, error handling, and delivering clean, structured output ready for downstream use.

What You'll Do

  • Build a web scraper from scratch that programmatically discovers all available course URLs from the target university's official website.
  • Extract structured data from each course page according to a predefined schema (30+ fields including tuition fees, entry requirements, English proficiency scores, intake dates, and more).
  • Handle real-world edge cases — missing fields, inconsistent page layouts, pagination, and duplicate URLs.
  • Deliver clean, structured output in JSON or CSV format with one record per course.
  • Write clean, modular, and well-documented code with a README covering setup steps, dependencies, and run instructions.
  • Adhere strictly to ethical scraping practices — all data must be sourced exclusively from official university webpages. No third-party aggregators, pre-built datasets, or manual data entry.
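The URL-discovery step described above could be sketched roughly as follows, using only the Python standard library. The `/courses/` path pattern and the `base_url` are hypothetical placeholders for whatever structure the target university's site actually uses; a real scraper would also need crawling of listing pages, politeness delays, and robots.txt checks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CourseLinkParser(HTMLParser):
    """Collects hrefs that look like course detail pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # "/courses/" is an assumed URL pattern, purely illustrative
        if href and "/courses/" in href:
            self.links.append(urljoin(self.base_url, href))


def discover_course_urls(html, base_url):
    """Return absolute course URLs found in one page, duplicates removed."""
    parser = CourseLinkParser(base_url)
    parser.feed(html)
    # de-duplicate while preserving discovery order
    return list(dict.fromkeys(parser.links))
```

Deduplicating at discovery time (rather than after extraction) keeps the crawler from fetching the same course page twice.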

What We're Looking For

  • Currently pursuing a degree in Computer Science, Software Engineering, IT, or a related field.
  • Proficiency in Python (or another scripting language suitable for web scraping).
  • Familiarity with HTML/CSS structure and how to navigate the DOM to locate data.
  • Understanding of HTTP requests, response codes, and basic web protocols.
  • Ability to produce structured data formats (JSON, CSV).
  • Comfort with Git for version control.
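Producing structured JSON and CSV against a fixed schema might look like the sketch below. The four field names are an invented subset of the real 30+-field schema, shown only to illustrate the pattern of mapping every record onto the same columns and leaving gaps explicit rather than dropping them.

```python
import csv
import io
import json

# Illustrative subset of the (hypothetical) full course schema
SCHEMA = ["course_name", "tuition_fee", "ielts_score", "intake"]


def normalize(record):
    """Map a raw scraped dict onto the fixed schema, filling gaps with None."""
    return {field: record.get(field) for field in SCHEMA}


def to_json(records):
    """Serialize one JSON array with one object per course."""
    return json.dumps([normalize(r) for r in records], indent=2)


def to_csv(records):
    """Serialize a CSV with a header row and one row per course."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SCHEMA)
    writer.writeheader()
    for r in records:
        writer.writerow(normalize(r))
    return buf.getvalue()
```

Normalizing through one schema function means every output record has identical keys, which keeps downstream consumers from special-casing missing fields.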

Deliverables

By the end of the assignment, you are expected to submit:

Scraper Source Code — Full working codebase that is clean, modular, and runnable out of the box. Include all supporting files, utilities, and dependency lists.

Extracted Data File — A structured JSON or CSV file with one record per course, mapped to the required data schema.

README / Setup Guide — A short document with setup steps, dependencies, instructions to run, and expected output format.

How You'll Be Evaluated

  • Completeness: all discoverable course URLs are captured.
  • Data Accuracy: extracted fields correctly reflect the source page.
  • Source Integrity: data comes only from the official university website.
  • Edge-Case Handling: missing values, duplicates, and inconsistencies are handled gracefully.
  • Code Quality: readable, modular code with a sensible structure.
  • Documentation: clear setup and run instructions that anyone can follow.
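In practice, the edge-case-handling criterion often comes down to tolerant field parsers. A hypothetical example for a tuition-fee field, which real pages might render as "£9,250 per year", "Fees: 9250 GBP", or "TBC":

```python
import re


def parse_fee(text):
    """Extract a numeric fee from inconsistently formatted strings.

    Returns None when the input is empty or contains no number,
    so missing values stay explicit in the output.
    """
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", ""))
```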

Note: This is a paid internship.

Skills: Git & version control, web scraping & data extraction, Python programming, data structuring (JSON, CSV), error handling & debugging, data cleaning & validation, HTML/CSS & DOM parsing


Job ID: 145782709
