About The Role
We're looking for a motivated and detail-oriented university student to join us as a Software Developer Intern. In this role, you'll design, build, and deliver a production-quality web scraper that discovers and extracts structured course data from a university's official website.
This is a hands-on engineering role — not a copy-paste exercise. You'll be responsible for the full lifecycle: crawling strategy, data extraction, schema mapping, error handling, and delivering clean, structured output ready for downstream use.
What You'll Do
- Build a web scraper from scratch that programmatically discovers all available course URLs from the target university's official website.
- Extract structured data from each course page according to a predefined schema of 30+ fields (tuition fees, entry requirements, English proficiency scores, intake dates, etc.).
- Handle real-world edge cases — missing fields, inconsistent page layouts, pagination, and duplicate URLs.
- Deliver clean, structured output in JSON or CSV format with one record per course.
- Write clean, modular, and well-documented code with a README covering setup steps, dependencies, and run instructions.
- Adhere strictly to ethical scraping practices — all data must be sourced exclusively from official university webpages. No third-party aggregators, pre-built datasets, or manual data entry.
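To make the discovery-and-deduplication step above concrete, here is a minimal, dependency-free sketch run against inline HTML snippets standing in for fetched listing pages. The `course-link` class, the example URLs, and the helper names are illustrative assumptions, not details of any real university site; a real scraper would download the pages it feeds in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CourseLinkParser(HTMLParser):
    """Collects href values from <a class="course-link"> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "course-link" in attrs.get("class", "").split():
            self.hrefs.append(attrs["href"])

def discover_course_urls(pages, base_url):
    """Resolve relative links against base_url; a set drops duplicate URLs."""
    seen = set()
    for html in pages:
        parser = CourseLinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            seen.add(urljoin(base_url, href))
    return sorted(seen)

# Inline stand-ins for two fetched listing pages (note the duplicate link).
pages = [
    '<a class="course-link" href="/courses/msc-data-science">MSc Data Science</a>',
    '<a class="course-link" href="/courses/msc-data-science">MSc Data Science</a>'
    '<a class="course-link" href="/courses/ba-history">BA History</a>',
]
print(discover_course_urls(pages, "https://www.example.edu"))
```

The same pattern extends to pagination: keep fetching `?page=N` listing pages and feeding them in until a page yields no new links.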
What We're Looking For
- Currently pursuing a degree in Computer Science, Software Engineering, IT, or a related field.
- Proficiency in Python (or another scripting language suitable for web scraping).
- Familiarity with HTML/CSS structure and how to navigate the DOM to locate data.
- Understanding of HTTP requests, response codes, and basic web protocols.
- Ability to produce structured data formats (JSON, CSV).
- Comfort with Git for version control.
Deliverables
By the end of the assignment, you are expected to submit:
1. Scraper Source Code — Full working codebase that is clean, modular, and runnable out of the box. Include all supporting files, utilities, and dependency lists.
2. Extracted Data File — A structured JSON or CSV file with one record per course, mapped to the required data schema.
3. README / Setup Guide — A short document with setup steps, dependencies, instructions to run, and expected output format.
How You'll Be Evaluated
- Completeness: All discoverable course URLs are captured
- Data Accuracy: Extracted fields correctly reflect the source page
- Source Integrity: Data comes only from the official university website
- Edge-Case Handling: Missing values, duplicates, and inconsistencies are handled gracefully
- Code Quality: Readable, modular code with sensible structure
- Documentation: Clear setup and run instructions anyone can follow
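Graceful handling of missing values, as the criteria above describe, often comes down to mapping raw scraped values onto a fixed schema with explicit defaults rather than letting absent fields raise errors. A small sketch, where the field names are an illustrative subset and not the real 30+-field schema:

```python
import json

# Illustrative subset of the course schema (assumed field names).
SCHEMA = ["course_name", "tuition_fee", "ielts_score", "intake_dates"]

def to_record(raw):
    """Map raw scraped values onto the schema; missing fields become None, not KeyErrors."""
    record = {field: raw.get(field) for field in SCHEMA}
    # Normalize: strip stray whitespace from strings, pass other types through.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

raw_pages = [
    {"course_name": " MSc Data Science ", "tuition_fee": "24500"},  # no IELTS or intakes listed
    {"course_name": "BA History", "ielts_score": "6.5", "intake_dates": ["Sep"]},
]
records = [to_record(r) for r in raw_pages]
print(json.dumps(records, indent=2))
```

Every record then has the same keys, which is what makes the output safe to load into a CSV or a downstream database without per-row special cases.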
Note: This is a paid internship.
Skills: Git & version control, web scraping & data extraction, Python programming, data structuring (JSON, CSV), error handling & debugging, data cleaning & validation, HTML/CSS & DOM parsing