17–24 Jan 2023

Web Scraping with Python for the Humanities

Eamonn Bell from the University of Durham will be hosting a course on web scraping as a data collection technique for researchers.

Online webinar
17 Jan 2023 10 a.m. — 1 p.m.
24 Jan 2023 10 a.m. — 1 p.m.

Please note, session 2 of this workshop has been moved from Tuesday 24th January 2023 to Tuesday 7th February 2023.

Apply for a place at this event. If you are offered a place for the first session, you are automatically enrolled in the second session. Both are required for the full syllabus of this course.

Session 1: Tuesday 17th January 2023

Duration: 3 hours (online)

Eligibility: postgraduate researchers (and above), enrolled at/working in N8 institutions

Pre-requisites: a basic familiarity with Python and with using the command line on Windows, macOS, or GNU/Linux; access to a Jupyter Notebook environment in which you can install modules from PyPI (please contact your host institution if you do not have this already).

This workshop will introduce participants to intermediate topics in the use of web scraping as a data collection technique for researchers in the humanities. Web scraping is a technique for using computers to systematically identify and retrieve information hosted on web servers. It can be a valuable source of research data for humanities researchers, given the diversity of historically, culturally, and sociologically significant material available online. Topics covered will include:

  • An introduction to web technologies
  • Ethical, legal, and social issues of web scraping
  • (Politely) requesting remote resources using Python
  • Parsing HTML documents with BeautifulSoup (`bs4`)
  • Saving the results of web scraping for later analysis
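As a rough illustration of how these topics fit together, the sketch below checks a robots.txt policy before requesting anything, pulls headings out of an HTML page, and writes them out as CSV. It is not taken from the workshop materials: the robots.txt text, the sample page, the user agent name, and the example.org URLs are all made up, and everything is inlined so that no network access is needed. It also uses only the Python standard library (`html.parser`) in place of the BeautifulSoup parser named above, so that it runs without installing anything.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

SAMPLE_HTML = """\
<html><body>
<h2 class="title">Letters, 1840-1850</h2>
<h2 class="title">Diaries, 1851-1860</h2>
<h2>Unrelated heading</h2>
</body></html>
"""

USER_AGENT = "example-research-bot"  # hypothetical identifier for the scraper


def load_rules(robots_txt):
    """Parse robots.txt text so the site's wishes can be checked first."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class TitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and dict(attrs).get("class") == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def titles_to_csv(titles):
    """Serialise the titles as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.org/archive/")
blocked = rules.can_fetch(USER_AGENT, "https://example.org/private/notes")
delay = rules.crawl_delay(USER_AGENT)  # seconds to wait between requests

titles = extract_titles(SAMPLE_HTML)
csv_text = titles_to_csv(titles)
```

In a real scraper the robots.txt file and the page would be fetched from the server, with a pause of at least `delay` seconds between requests, and the CSV text would be written to a file rather than kept in memory.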

Session 2: Tuesday 7th February 2023

Duration: 3 hours (online)

Eligibility: postgraduate researchers (and above), enrolled at/working in N8 institutions

Pre-requisites: a basic familiarity with Python and with using the command line on Windows, macOS, or GNU/Linux; an understanding of the principles of web scraping (see Session 1); access to a Jupyter Notebook environment in which you can install modules from PyPI (please contact your host institution if you do not have this already).

Topics covered will include:

  • Intermediate topics in parsing HTML documents with BeautifulSoup (`bs4`)
  • Programmatically expanding the scope of scraping by extracting URLs from HTML
  • Scheduling and managing web scraping tasks
  • Approaches to scraping dynamic web properties
  • Approaches to dealing with archived web properties
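To give a flavour of the second topic, the sketch below shows one way to expand the scope of a scrape by collecting the links on a page and turning them into a list of further URLs to visit. It is an illustration under stated assumptions, not the workshop's own code: the page content and the example.org addresses are invented, the function and class names are hypothetical, and the standard library's `html.parser` stands in for BeautifulSoup so that the example runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical page and address, inlined so the sketch runs offline.
BASE_URL = "https://example.org/collection/index.html"
SAMPLE_HTML = """\
<html><body>
<a href="item1.html">Item 1</a>
<a href="/collection/item2.html#notes">Item 2</a>
<a href="item1.html">Item 1 again</a>
<a href="https://elsewhere.example.com/page">External site</a>
<a href="mailto:someone@example.org">Contact</a>
</body></html>
"""


class LinkParser(HTMLParser):
    """Collect the raw href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute, de-duplicated http(s) links on the same host as base_url."""
    parser = LinkParser()
    parser.feed(html)
    base_host = urlparse(base_url).netloc

    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc != base_host:
            continue  # stay on the site being studied
        if absolute not in links:
            links.append(absolute)
    return links


next_urls = extract_links(SAMPLE_HTML, BASE_URL)
```

Restricting the results to the starting host is a design choice that keeps a crawl from wandering across the web; a project that needs to follow external links would relax that check deliberately.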

More resources can be found at https://www.eamonnbell.com/
