Web Scraping with Python

On 15 and 22 July 2020 a small team from Durham University ran a pair of workshops outlining how to use Python for scraping data from Twitter.

Overview

Social media platforms such as Twitter have become sources of up-to-the-minute information and commentary on significant events, from everyday life to natural disasters.

They can be treated as collectors of real-time information that public health institutions could use as an additional source of early warnings, helping them to mitigate public health threats.


An Introduction to Python

Zhongtian Sun

In the first part of the workshop Zhongtian introduced the Python programming language and explained some of its key features.

Python is an interpreted, interactive and object-oriented programming language. It is well suited to handling big data and complex mathematics; it can also connect to database systems to read and modify data files, or run on a server to create web applications.

Zhongtian went on to explain the use of Google Colab before outlining a variety of features and functions of Python (illustrated in the short sketch after this list), including:

  • Variables
  • Data Types
  • List Methods
  • Functions
  • Modules
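
As a purely illustrative sketch of those features (none of the names below are taken from the slides), a few lines of Python covering each one might look like this:

    import math  # modules: reuse existing code via import

    greeting = "Hello, workshop!"  # variables hold values of different data types
    count = 3                      # an integer
    radius = 2.5                   # a float
    area = math.pi * radius ** 2   # using a value from the math module

    tweets = ["first tweet", "second tweet"]  # a list
    tweets.append("third tweet")              # list methods change the list in place
    tweets.sort()

    def word_count(text):
        # Functions package reusable logic; this one counts words in a string.
        return len(text.split())

    for tweet in tweets:
        print(tweet, "->", word_count(tweet), "words")
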
Downloads, Links and Resources


Twitter Data Collection and Processing

Tahir Aduragba

Tahir began his talk by explaining how Twitter's basic application programming interface (API) works and how much data you can reasonably expect to capture. He then explained how to create the relevant accounts, create an app within Twitter's Developer platform, and use the API credentials.
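
As a rough illustration of that workflow, a minimal collection script using the Tweepy library (as available in mid-2020; in Tweepy 4.x the search call was renamed to api.search_tweets) might look like the sketch below. The keyword and the credential placeholders are assumptions for the example, not values from the talk:

    import tweepy

    # Credentials from an app on Twitter's Developer platform (placeholders)
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # The standard search API only returns a sample of tweets from
    # roughly the last seven days, which caps how much you can capture.
    for tweet in tweepy.Cursor(api.search, q="flood", lang="en").items(100):
        print(tweet.created_at, tweet.text)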

He went on to explain some of the pre-processing steps required for the data, including removing commonly used words, installing the 'nltk' package for Python, and tokenising the data to make it easier to analyse. He also explained stemming and lemmatisation, additional steps that make the text easier to work with.
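
A minimal sketch of those pre-processing steps with nltk (the sample sentence is invented for the example) might run as follows:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # One-off downloads of the resources nltk needs
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    text = "Researchers were tracking outbreaks reported on Twitter"

    tokens = word_tokenize(text.lower())  # tokenising
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stops]  # drop common words

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print([stemmer.stem(t) for t in tokens])          # stemming: crude suffix stripping
    print([lemmatizer.lemmatize(t) for t in tokens])  # lemmatisation: dictionary forms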

Downloads, Links and Resources


Introduction to LDA Topic Modelling

Jialin Yu

The final part of the workshop was delivered by Jialin Yu who introduced LDA Topic Modelling.

He began by explaining that human beings are able to understand the world around them through prior knowledge and abstraction, whereas machines view everything as ones and zeroes. To help an AI understand the data generated by web scraping, it is necessary to teach the AI knowledge about the world and to give it the right level of abstraction. This is done by identifying relevant research topics through Wikipedia and gensim, and by working with text data at an appropriate level of detail.
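
As a rough sketch of what training such a model with gensim involves (the toy documents below stand in for pre-processed tweet tokens and are invented for the example):

    from gensim import corpora, models

    documents = [
        ["flood", "river", "rain", "warning"],
        ["fever", "outbreak", "hospital", "cases"],
        ["rain", "storm", "flood", "damage"],
    ]

    dictionary = corpora.Dictionary(documents)               # map tokens to ids
    corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words counts

    # Fit an LDA model that explains the corpus as a mixture of two topics
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)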

After training your AI model and identifying suitable topics you can then use the model to understand the content of either historic or live Tweets (a short example follows the list below). This can enable you to:

  • Monitor disease outbreaks
  • Classify live-streamed medical data
  • Perform topic analysis
  • Predict trends
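
As a rough illustration of that last step, the model trained in the earlier sketch could score an unseen tweet against the learned topics (the tokens are again invented):

    new_tweet = ["heavy", "rain", "flood", "roads"]
    bow = dictionary.doc2bow(new_tweet)
    print(lda.get_document_topics(bow))  # e.g. [(0, 0.91), (1, 0.09)]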


The Presentations

Introduction to Python - Zhongtian Sun, Durham University
Scraping data from Twitter - Tahir Aduragba, Durham University
Topic modelling and Gensim - Jialin Yu, Durham University

Slides from the presentations

These slides relate to Digital Humanities. However, the concepts shown can easily be adapted to other research areas, such as Digital Health, by identifying different keywords when scraping data and training the AI with different topics.

Python for web-scraping: Digital Humanities


The Speakers

Zhongtian Sun - Durham University

I completed my Bachelor's degree at the University of Nottingham and my Master's at Warwick Business School, and I am now a first-year PhD student in the Department of Computer Science at Durham University. I am interested in knowledge representation learning, graph neural networks and machine learning.

Tahir Aduragba - Durham University

I'm a PhD student at the Department of Computer Science, Durham University. My research interest is in deep learning, natural language processing and data science. Specifically, I'm interested in the prediction of infectious disease spread on social media. I have a bachelor's degree in Computer Science from Brunel University London and a master's degree in Information Systems from the University of Manchester.

Jialin Yu - Durham University

Jialin received his BEng from the University of Nottingham and his MSc from UCL, and is now a second-year PhD student in the Department of Computer Science at Durham University. His research centres on probabilistic modelling and machine learning with a focus on text data. He was a demonstrator for the second-year module "Theory of Computation" at Durham University from 2019 to 2020.

