About The Role
We're looking for a motivated and detail-oriented university student to join us as a Software Developer Intern. In this role, you'll design, build, and deliver a production-quality web scraper that discovers and extracts structured course data from a university's official website.
This is a hands-on engineering role — not a copy-paste exercise. You'll be responsible for the full lifecycle: crawling strategy, data extraction, schema mapping, error handling, and delivering clean, structured output ready for downstream use.
What You'll Do
- Build a web scraper from scratch that programmatically discovers all available course URLs from the target university's official website.
- Extract structured data from each course page according to a predefined schema of 30+ fields (tuition fees, entry requirements, English proficiency scores, intake dates, etc.).
- Handle real-world edge cases — missing fields, inconsistent page layouts, pagination, and duplicate URLs.
- Deliver clean, structured output in JSON or CSV format with one record per course.
- Write clean, modular, and well-documented code with a README covering setup steps, dependencies, and run instructions.
- Adhere strictly to ethical scraping practices — all data must be sourced exclusively from official university webpages. No third-party aggregators, pre-built datasets, or manual data entry.
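To make the discovery-and-deduplication step above concrete, here is a minimal, dependency-free sketch run against inline HTML snippets standing in for fetched listing pages. The `course-link` class, the example URLs, and the helper names are illustrative assumptions, not details of any real university site; a real scraper would download the pages it feeds in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CourseLinkParser(HTMLParser):
    """Collects href values from <a class="course-link"> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "course-link" in attrs.get("class", "").split():
            self.hrefs.append(attrs["href"])

def discover_course_urls(pages, base_url):
    """Resolve relative links against base_url; a set drops duplicate URLs."""
    seen = set()
    for html in pages:
        parser = CourseLinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            seen.add(urljoin(base_url, href))
    return sorted(seen)

# Inline stand-ins for two fetched listing pages (note the duplicate link).
pages = [
    '<a class="course-link" href="/courses/msc-data-science">MSc Data Science</a>',
    '<a class="course-link" href="/courses/msc-data-science">MSc Data Science</a>'
    '<a class="course-link" href="/courses/ba-history">BA History</a>',
]
print(discover_course_urls(pages, "https://www.example.edu"))
```

The same pattern extends to pagination: keep fetching `?page=N` listing pages and feeding them in until a page yields no new links.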
What We're Looking For
- Currently pursuing a degree in Computer Science, Software Engineering, IT, or a related field.
- Proficiency in Python (or another scripting language suitable for web scraping).
- Familiarity with HTML/CSS structure and how to navigate the DOM to locate data.
- Understanding of HTTP requests, response codes, and basic web protocols.
- Ability to produce structured data formats (JSON, CSV).
- Comfort with Git for version control.
Deliverables
By the end of the assignment, you are expected to submit:
1. Scraper Source Code — Full working codebase that is clean, modular, and runnable out of the box. Include all supporting files, utilities, and dependency lists.
2. Extracted Data File — A structured JSON or CSV file with one record per course, mapped to the required data schema.
3. README / Setup Guide — A short document with setup steps, dependencies, instructions to run, and expected output format.
How You'll Be Evaluated
- Completeness: All discoverable course URLs are captured
- Data Accuracy: Extracted fields correctly reflect the source page
- Source Integrity: Data comes only from the official university website
- Edge-Case Handling: Missing values, duplicates, and inconsistencies are handled gracefully
- Code Quality: Readable, modular code with sensible structure
- Documentation: Clear setup and run instructions anyone can follow
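Graceful handling of missing values, as the criteria above describe, often comes down to mapping raw scraped values onto a fixed schema with explicit defaults rather than letting absent fields raise errors. A small sketch, where the field names are an illustrative subset and not the real 30+-field schema:

```python
import json

# Illustrative subset of the course schema (assumed field names).
SCHEMA = ["course_name", "tuition_fee", "ielts_score", "intake_dates"]

def to_record(raw):
    """Map raw scraped values onto the schema; missing fields become None, not KeyErrors."""
    record = {field: raw.get(field) for field in SCHEMA}
    # Normalize: strip stray whitespace from strings, pass other types through.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

raw_pages = [
    {"course_name": " MSc Data Science ", "tuition_fee": "24500"},  # no IELTS or intakes listed
    {"course_name": "BA History", "ielts_score": "6.5", "intake_dates": ["Sep"]},
]
records = [to_record(r) for r in raw_pages]
print(json.dumps(records, indent=2))
```

Every record then has the same keys, which is what makes the output safe to load into a CSV or a downstream database without per-row special cases.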
Note: This is a paid internship.
Skills: Git & version control, web scraping & data extraction, Python programming, data structuring (JSON, CSV), error handling & debugging, data cleaning & validation, HTML/CSS & DOM parsing