Text Analysis in R workshop at U. Vienna

As part of my Paul Lazarsfeld Guest Professorship, I will teach a workshop on text analysis in R at the University of Vienna from 8 to 12 April.

For participants: Please bring your own laptop and make sure you have R and RStudio installed.

Introduction: The explosion of digital communication and increasing efforts to digitize existing material have produced a deluge of text, such as digitized historical news archives, policy and legal documents, political debates, and millions of social media messages by politicians, journalists, and citizens. This makes it possible to put theoretical predictions about the societal roles played by information, and about the development and effects of communication, to rigorous quantitative tests that were impossible before. Besides providing an opportunity, the analysis of such “big data” sources also poses methodological challenges. Traditional manual content analysis does not scale to very large data sets due to its high cost and complexity. For this reason, many researchers turn to automatic text analysis using techniques such as dictionary analysis, automatic clustering and scaling of latent traits, and machine learning.

Course aims and structure: Using such techniques properly, however, requires a specific skill set. This course aims to give interested PhD (and advanced Master’s) students an introduction to text analysis. R will be used as the platform and language of instruction, but the basic principles and methods generalize readily to other languages and tools such as Python. Participants will be given handouts with examples based on pre-existing data to follow along, but are encouraged to work on their own data and problems using the techniques offered.

Evaluation criteria: Evaluation will be based on two assignments:

  1. (30%) midweek data exercise
    1. Deadline: Wednesday (soft)
    2. Instructions
    3. Data
    4. Submission link
  2. (70%) final assignment on a topic of your choice
    1. Deadline: Friday 19 April
    2. Instructions
    3. Submission link

There’s also an optional, formative quiz to test your tidyverse skills.

Material: The course mostly uses the handouts linked below for each session. The source code of the handouts is available on GitHub. Also see the RStudio cheat sheets and the excellent book R for Data Science.

Course outline per day (morning and afternoon sessions):

  1. Monday: Introduction to R
    1. (9:00-11:00)
      1. R Basics: data and functions (practise template)
      2. Fun with Text
    2. (14:00-16:00)
      1. Tidyverse: Transforming data
      2. Reading and importing data (external tutorial)
  2. Tuesday: R for data analysis
    1. (9:00-11:15)
      1. Grouping and summarizing data
      2. Merging (joining) data sets
    2. (13:30-16:00)
      1. Visualizing data with ggplot
      2. Reshaping data: wide, long, and tidy
  3. Wednesday: Quantitative text analysis in R
    1. (9:00-13:00)
      1. Basic string handling in R [session log – warning, might be messy!]
      2. Reading, cleaning, and analysing text with quanteda and readtext [messy session log]
  4. Thursday: Topic Modeling and Preprocessing
    1. (9:00-12:00)
      1. Topic Modeling [slides] [handout]
        Optional handouts: [graphical interpretation] [perplexity code]
      2. NLP Preprocessing [slides] [handout]
    2. (14:00-16:00)
      1. Understanding topic modeling (slides)
        Optional links: [Gibbs sampling in R] [understanding alpha]
      2. Structural Topic Model [slides] [handout] [vignette]
  5. Friday: Supervised machine learning
    1. (9:00-12:00) Supervised text classification [slides] [handout]
    2. (14:00-16:00) Work on assignment

Course Literature:

Welbers, K., van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245-265. doi:10.1080/19312458.2017.1387238

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc.

Background literature:
– van Atteveldt, W., & Peng, T.-Q. (2018). When communication meets computation: Opportunities, challenges, and pitfalls in computational communication science. Communication Methods and Measures, 12(2-3), 81-92.
– Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).
– Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: why it matters, when it misleads, and what to do about it. Political Analysis, 26(2), 168-189.
– Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.
– Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder‐Luis, J., Gadarian, S. K., … & Rand, D. G. (2014). Structural Topic Models for Open‐Ended Survey Responses. American Journal of Political Science, 58(4), 1064-1082.
– Young, L., & Soroka, S. (2012). Affective news: The automated coding of sentiment in political texts. Political Communication, 29(2), 205-231.

– Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1), 1-309. If you search for “neural network methods for natural language processing pdf”, you might be able to find the evaluation sample from the publisher.