Python for Web Scraping

On Tuesday 15 and Wednesday 22 July 2020 a small team from Durham University ran a pair of workshops outlining how to use Python for scraping data from Twitter.


Social media platforms such as Twitter have recently become sources of the most up-to-date information and commentary on current and significant events in people’s lives and during natural disasters.

They can be seen as collectors of real-time information that public health institutions could use as an additional source of early warnings, helping them to mitigate public health threats.

An Introduction to Python

Zhongtian Sun

In the first part of the workshop Zhongtian introduced the Python programming language and explained some of its key features.

Python is an interpreted, interactive and object-oriented programming language. It is well suited to handling big data and complex mathematics. It can also connect to database systems, read and modify files, and run on a server to create web applications.

Zhongtian went on to explain the use of Google Colab before outlining a variety of features and functions of Python including:

  • Variables
  • Data Types
  • List Methods
  • Functions
  • Modules
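
As a rough illustration of the features listed above (not taken from the workshop materials), the short sketch below touches each one in turn. The variable names and the numbers in it are made up for the example.

```python
import statistics  # Modules: reuse code from Python's standard library

# Variables and data types
topic = "public health"          # str
tweet_lengths = [120, 87, 140]   # list of int (made-up values)
is_retweet = False               # bool

# List methods
tweet_lengths.append(64)   # add an item to the end of the list
tweet_lengths.sort()       # sort the list in place, smallest first

# Functions
def describe(lengths):
    """Return the shortest, longest and mean value in a list of numbers."""
    return min(lengths), max(lengths), statistics.mean(lengths)

shortest, longest, average = describe(tweet_lengths)
print(f"{topic}: shortest={shortest}, longest={longest}, mean={average}")
```

The same code can be pasted into a Google Colab cell and run without installing anything.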

Downloads, Links and Resources

Twitter Data Collection and Processing

Tahir Aduragba

Tahir began his talk by explaining how Twitter's basic application programming interface (API) works and how much data you can reasonably expect to capture, before going on to explain how to create the relevant accounts, create an app within Twitter's Developer platform and utilise API credentials.
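
The write-up does not say which client library was used in the workshop, so the following is only a sketch of how API credentials might be loaded and used, assuming the third-party 'tweepy' package. The environment variable names and the two helper functions are hypothetical, and Twitter's API access terms have changed since 2020, so the details should be checked against the current developer documentation.

```python
import os

# Hypothetical environment variable names for the four credentials
# issued when an app is created on the developer platform.
CREDENTIAL_NAMES = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)

def load_credentials(env=os.environ):
    """Read the credentials from the environment so they never appear in code."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}

def make_client(creds):
    """Build an authenticated client (assumes tweepy is installed)."""
    import tweepy  # third-party package: pip install tweepy

    # In newer tweepy releases this class is also available as OAuth1UserHandler.
    auth = tweepy.OAuthHandler(creds["TWITTER_API_KEY"], creds["TWITTER_API_SECRET"])
    auth.set_access_token(creds["TWITTER_ACCESS_TOKEN"], creds["TWITTER_ACCESS_SECRET"])
    # wait_on_rate_limit makes the client pause, rather than fail,
    # when the API's rate limit is reached.
    return tweepy.API(auth, wait_on_rate_limit=True)
```

Keeping the credentials in environment variables rather than in the notebook itself means the notebook can be shared without exposing them.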

He went on to explain some of the steps required to pre-process the data, including installing the 'nltk' package for Python, removing commonly used words (stop words) and tokenising the data to make it easier to analyse. He also explained stemming and lemmatization, additional steps that reduce words to a common base form so that the text is easier to work with.
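
The pre-processing steps can be sketched with the standard library alone. This is a simplified stand-in, not the workshop's code: the stop-word list is a tiny made-up sample and 'crude_stem' is a deliberately naive suffix stripper. In practice the 'nltk' package offers fuller versions of each step, such as 'nltk.word_tokenize', 'nltk.corpus.stopwords', 'nltk.stem.PorterStemmer' and 'nltk.stem.WordNetLemmatizer'.

```python
import re

# Tiny illustrative stop-word list; nltk's English list is far more complete.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "to", "for", "on", "i", "my"}

def tokenise(text):
    """Lower-case a tweet, drop URLs and @mentions, and split it into word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def remove_stop_words(tokens):
    """Discard very common words that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenise, remove stop words, then stem each remaining token."""
    return [crude_stem(token) for token in remove_stop_words(tokenise(text))]

tweet = "Feeling feverish and coughing for days @nhs https://example.com #flu"
print(preprocess(tweet))
```

A real stemmer or lemmatizer handles many cases this sketch gets wrong (irregular plurals, for instance), which is one reason to use a library for anything beyond a demonstration.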

Downloads, Links and Resources

Introduction to LDA Topic Modelling

Jialin Yu

The final part of the workshop was delivered by Jialin Yu, who introduced LDA (Latent Dirichlet Allocation) Topic Modelling.

He began by explaining that human beings understand the world around them through prior knowledge and abstraction, whereas machines view everything as ones and zeroes. To help an AI understand the data generated by web scraping, it is necessary to teach the AI knowledge about the world and give it the right level of abstraction. This is done by identifying relevant research topics through Wikipedia and gensim, and by using the right level of text data.

After training your AI model and identifying suitable topics you can then use the model to understand the content of either historic or live Tweets. This can enable you to:

  • Monitor a disease outbreak
  • Live stream medical data classification
  • Topic analysis
  • Trend prediction

The Slides

The slides on the download link below relate to Digital Humanities. However, the concepts shown can easily be adapted to other research areas, such as Digital Health, by identifying different keywords when scraping data and training the AI with different topics.

  Python for Web Scraping - Digital Humanities

The Presentations

Introduction to Python, Zhongtian Sun

Scraping Data from Twitter, Tahir Aduragba

LDA Topic Modelling and gensim, Jialin Yu

The Speakers

Zhongtian Sun - Durham University

I graduated from the University of Nottingham (Bachelor's) and Warwick Business School (Master's), and I am a first-year PhD student in the Department of Computer Science at Durham University. I am interested in knowledge representation learning, graph neural networks and machine learning.

Tahir Aduragba - Durham University

I'm a PhD student at the Department of Computer Science, Durham University. My research interests are in deep learning, natural language processing and data science. Specifically, I'm interested in the prediction of infectious disease spread on social media. I have a bachelor's degree in Computer Science from Brunel University London and a master's degree in Information Systems from the University of Manchester.

Jialin Yu - Durham University

Jialin graduated from the University of Nottingham and UCL with his BEng and MSc respectively, and is now a second-year PhD student in the Department of Computer Science at Durham University. His research is around probabilistic modelling and machine learning, with a focus on text data. He was a demonstrator for the second-year module "Theory of Computation" at Durham University from 2019 to 2020.
