Social media platforms such as Twitter have become sources of up-to-the-minute information and commentary on significant events in people's lives, including natural disasters.
They can be seen as collectors of real-time information that public health institutions could use as an additional source of early warnings, thereby helping them to mitigate public health threats.
An Introduction to Python
In the first part of the workshop Zhongtian introduced the Python programming language and explained some of its key features.
Python is an interpreted, interactive, object-oriented programming language. It is well suited to handling big data and complex mathematics; it can also connect to database systems to read and modify data files, or run on a server to create web applications.
Zhongtian went on to explain the use of Google Colab before outlining a variety of features and functions of Python including:
- Data Types
- List Methods
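A short snippet can illustrate the kind of features covered. The values below are invented for illustration, showing a few of Python's built-in data types and some common list methods:

```python
# Some of Python's built-in data types
count = 42              # int
ratio = 0.75            # float
name = "workshop"       # str
tags = ["python", "web", "scraping"]  # list
point = (54.77, -1.57)  # tuple
info = {"speaker": "Zhongtian", "topic": "Python"}  # dict

# A few common list methods
tags.append("data")        # add an item to the end
tags.sort()                # sort in place, alphabetically
print(tags)                # ['data', 'python', 'scraping', 'web']
print(tags.index("web"))   # position of an item: 3
tags.remove("data")        # delete an item by value
```

Google Colab lets you run snippets like this in the browser without installing anything locally.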
Downloads, Links and Resources
Twitter Data Collection and Processing
Tahir began his talk by explaining how Twitter's basic application programming interface (API) works and how much data you can reasonably expect to capture, before going on to explain how to create the relevant accounts, create an app within Twitter's Developer platform and utilise API credentials.
He went on to explain some of the pre-processing steps required before the data can be analysed, including removing commonly used words (stop words), installing the 'nltk' package for Python and tokenising the text to make it easier to analyse. He also explained stemming and lemmatisation, additional steps that make the text easier to work with.
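The 'nltk' package provides these tools ready-made; the toy sketch below re-implements the ideas in plain Python just to show what each step does. The stop-word list and suffix rules are deliberately simplified illustrations, not nltk's actual ones:

```python
import re

# A tiny, illustrative stop-word list (nltk ships a much fuller one)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop commonly used words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Naive suffix-stripping stemmer (nltk's PorterStemmer is far more thorough)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tweet = "The flu is spreading in the northern towns"
tokens = remove_stop_words(tokenise(tweet))
print([stem(t) for t in tokens])  # → ['flu', 'spread', 'northern', 'town']
```

Lemmatisation goes a step further than stemming, mapping each word to a proper dictionary form (e.g. "better" to "good") rather than just chopping off suffixes.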
Downloads, Links and Resources
Introduction to LDA Topic Modelling
The final part of the workshop was delivered by Jialin Yu who introduced LDA Topic Modelling.
He began by explaining that human beings understand the world around them through prior knowledge and abstraction, whereas machines view everything as ones and zeroes. To help an AI understand the data generated by web scraping, it is necessary to give it knowledge about the world at the right level of abstraction. This is done by identifying relevant research topics, for example through Wikipedia and the gensim library, and by using the right level of text data.
After training your AI model and identifying suitable topics, you can then use the model to understand the content of either historic or live tweets. This can enable you to:
- Monitor a disease outbreak
- Classify medical data from a live stream
- Analyse topics
- Predict trends
The slides at the download link below relate to Digital Humanities. However, the concepts shown can easily be adapted to other research areas, such as Digital Health, by identifying different keywords when scraping data and training the AI on different topics.
Python for Web Scraping - Digital Humanities
Zhongtian Sun, Introduction to Python
Tahir Aduragba, Scraping Data from Twitter
Jialin Yu, LDA Topic Modelling and Gensim
Alternatively you can find out more using these links:
Zhongtian Sun - Durham University
I graduated from the University of Nottingham (Bachelor's) and Warwick Business School (Master's), and I am a first-year PhD student in the Department of Computer Science at Durham University. I am interested in knowledge representation learning, graph neural networks and machine learning.
Tahir Aduragba - Durham University
I'm a PhD student at the Department of Computer Science, Durham University. My research interests are in deep learning, natural language processing and data science. Specifically, I'm interested in predicting the spread of infectious disease from social media. I have a bachelor's degree in Computer Science from Brunel University London and a master's degree in Information Systems from the University of Manchester.
Jialin Yu - Durham University
Jialin received his BEng from the University of Nottingham and his MSc from UCL, and is now a second-year PhD student in the Department of Computer Science at Durham University. His research centres on probabilistic modelling and machine learning with a focus on text data. He was a demonstrator for the second-year module "Theory of Computation" at Durham University from 2019 to 2020.