VU-HPC: Text analysis in R

On Monday 5 November I will teach a course on Text Analysis in R at the VU.

To prepare, please install R and RStudio on your laptop. If you are interested, you can read our recent article on text analysis in R and/or some introductory materials on learning R.

Location: WN-C203
Time: 9:00 – 13:00 (approx)

Data: [github] [meta.rds] [articles.rds] [tokens.rds]


  • Session I: Introduction [slides]
    • DTMs in R
    • Dictionary analysis with AmCAT and/or quanteda
  • Session II: Corpus Analysis [slides]
    • Simple NLP
    • Corpus analysis and visualization
    • Topic modeling

To install all packages used in the example code, you can run the following commands in R:

install.packages(c("devtools", "corpustools", "quanteda", "topicmodels", "ggplot2", "LDAvis", "slam"))
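
If some of these packages are already on your machine, you may prefer to install only the missing ones. The following is a minimal sketch using base R only; the helper names `missing_packages` and `install_missing` are made up for this example and are not part of any package.

```r
pkgs <- c("devtools", "corpustools", "quanteda", "topicmodels",
          "ggplot2", "LDAvis", "slam")

# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the packages that are still missing
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Show what is still missing without installing anything
print(missing_packages(pkgs))

# To actually install the missing packages, run:
# install_missing(pkgs)
```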

Link to the visualization presentation

ICA 2017: Crowd sourcing for sentiment analysis

(Wouter van Atteveldt, Mariken van der Velden, Antske Fokkens)

Download slides

Due to the need for context-specific sentiment analysis tools and the rich language used for expressing sentiment in political text, automatic sentiment analysis suffers heavily from the scarcity of annotated sentiment data. This is especially true for directional sentiment, i.e. annotations stating that a holder has a sentiment about a specific target.

In this paper we use crowdsourcing to overcome this data scarcity problem and develop a tool for classifying sentiment expressed in a text about a specific target. Crowdsourcing is especially useful for sentiment analysis because sentiment coding is a simple but essentially subjective judgment, and the low cost of crowdsourcing makes it possible to code items multiple times, showing the spread of sentiment as well as the point estimate.

We show that crowdsourcing can yield directed sentiment codes with reasonable accuracy with as few as 2-3 coders per unit, with accuracy increasing up to 10 coders. By selecting sentences on which coders agree, a very high-precision subset of codes can be compiled. It is essential to make the task as simple as possible and to have good ‘gold questions’ for quality control.
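
To make the idea of combining multiple codings per unit concrete, here is a minimal sketch in base R. The data are invented for illustration (three sentences, three coders, sentiment coded as -1, 0 or 1), and majority voting with an agreement threshold is an assumption of this example, not necessarily the aggregation used in the paper.

```r
# Hypothetical crowd codes: one row per (sentence, coder) judgment
codes <- data.frame(
  sentence  = rep(c("s1", "s2", "s3"), each = 3),
  coder     = rep(c("a", "b", "c"), times = 3),
  sentiment = c(1, 1, 1,   -1, 0, -1,   1, -1, 0)
)

# Summarise all judgments for a single sentence
summarise_codes <- function(x) {
  counts <- table(x)
  data.frame(
    # most frequent code; ties are broken in favour of the lowest code
    majority  = as.numeric(names(counts)[which.max(counts)]),
    agreement = max(counts) / length(x),  # share of coders giving the majority code
    mean      = mean(x),                  # point estimate
    sd        = sd(x)                     # spread of the judgments
  )
}

by_sentence  <- split(codes$sentiment, codes$sentence)
per_sentence <- do.call(rbind, lapply(by_sentence, summarise_codes))
per_sentence$sentence <- names(by_sentence)

# Keep only the sentences on which all coders agree
min_agreement  <- 1
high_precision <- per_sentence[per_sentence$agreement >= min_agreement, ]
print(high_precision)
```

Lowering `min_agreement` keeps more sentences at the cost of including units on which coders disagreed, which is the trade-off described above.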

Our future plans are to gather data on sentiment about specific political parties from Dutch and English tweets and political news. These data will be used to compare crowdsourcing to manual expert coding. Moreover, these data will be used to enhance an existing sentiment dictionary and to train a machine learning model. By comparing the outcome of these various approaches, we can show the most cost-effective way to conduct accurate targeted sentiment analysis.

ICA 2017: What are topics?

[Part of the Applications of Topic Modeling panel in the Computational Methods interest group. Download slides or an earlier version of this paper]

LDA topic modeling is a popular technique for unsupervised document clustering. However, the utility of LDA for analysing political communication depends on being able to interpret the topics in theoretical terms. This paper explores the relation between LDA topics and content variables traditionally used in political communication. We generate an LDA model on a full collection of front-page articles of Dutch newspapers and compare the resulting LDA topics to a manual coding of the political issues, frames, and sentiment.

In general, we find that a large number of topics are closely related to a specific issue, and that the different topics that comprise an issue can be interpreted as subissues, events, and specific journalistic framing of the issue. Linear combinations of topics are moderately accurate predictors of hand-coded issues, and at the aggregate level they correlate highly with those issues. These results validate the use of LDA topics as proxies for political issues, and pave the way for a more empirical understanding of the substantive interpretation of LDA topics.
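
As a rough illustration of what predicting a hand-coded issue from a linear combination of topics can look like, here is a sketch in base R. Everything in it is an assumption made for the example: the data are simulated rather than taken from the paper, the three topics and the weekly grouping are invented, and logistic regression is simply one convenient way to fit a linear combination.

```r
# Simulated toy data only; not the newspaper data used in the paper
set.seed(1)
n_docs <- 200

# Hypothetical document-topic proportions for three topics (rows sum to 1)
raw   <- matrix(rexp(n_docs * 3), ncol = 3)
theta <- raw / rowSums(raw)
colnames(theta) <- c("topic1", "topic2", "topic3")

# Hypothetical hand-coded issue (0/1), simulated to depend on topics 1 and 2
p     <- plogis(-2 + 4 * theta[, "topic1"] + 2 * theta[, "topic2"])
issue <- rbinom(n_docs, 1, p)

d <- data.frame(theta, issue = issue, week = rep(1:20, each = 10))

# Document level: fit a linear combination of topics
# (topic3 is left out because the proportions sum to one)
m <- glm(issue ~ topic1 + topic2, data = d, family = binomial)
d$predicted <- predict(m, type = "response")
accuracy <- mean((d$predicted > 0.5) == (d$issue == 1))

# Aggregate level: compare weekly means of coded and predicted issue attention
weekly     <- aggregate(cbind(issue, predicted) ~ week, data = d, FUN = mean)
weekly_cor <- cor(weekly$issue, weekly$predicted)

print(accuracy)
print(weekly_cor)
```

Because the data are random, the numbers printed here say nothing about the actual findings; the sketch only shows the two levels of comparison (per document and per aggregate period).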

(Wouter van Atteveldt, Kasper Welbers)

Political Communication @Kobe University

Today, Nel and I will be presenting in a workshop on political communication at Kobe University:

The Netherlands’ 15 minutes of (in)fame:
political coverage and populism in the Dutch 2017 elections
dr. Nel Ruigrok
[download slides]

On the 15th of March the international press was focused on the Netherlands where, after Brexit and the election of Trump, a third success of populism was expected. The extreme-right Geert Wilders had led the polls during the preceding months, and his party, the PVV, was expected to become the biggest party. However, it was the liberal party of the current prime minister that won the election, followed by the PVV. Besides this turn to more right-wing parties, progressive parties also won numerous seats, making the political landscape more fragmented than ever. In this talk we show how media coverage differed during the campaign and discuss possible effects on the voting behavior of different groups of voters.

Clause analysis:
using syntactic information for automatic analysis of conflict coverage
dr. Wouter van Atteveldt
[download slides]

This paper shows how syntactic information can be used to automatically extract clauses from text, consisting of a subject, predicate, and optional source. Since the output of this analysis can be seen as an enriched token list or bag of words, normal frequency based or corpus linguistic analyses can be used on this output. Taking the 2008–2009 Gaza war as an example, we show how corpus comparison, topic modelling, and semantic network analysis can be used to explore the differences between US and Chinese coverage of this war.

Monday 20 Feb: Research talk @cityu

Don’t you like it? Using CrowdSourcing for Sentiment Analysis of Dutch and English (political) text  

Wouter van Atteveldt, Antske Fokkens, Isa Maks, Kevin van Veenen, and Mariken van der Velden

[Download slides]

Sentiment Analysis is an important technique for many aspects of communication research, with applications from social media analysis and online reviews to negativity in political communication. The subjective and context-specific nature of evaluative language, however, makes it particularly challenging to develop and validate good sentiment analysis tools.

We use crowdsourcing to develop a tool for classifying sentiment expressed in a text about a specific target. Crowdsourcing is especially useful for sentiment analysis because of the subjective nature of the judgment, and the low cost makes it possible to code items multiple times. By comparing crowdsourcing with dictionary analysis and expert coding, we can show the most cost-effective way to conduct accurate targeted sentiment analysis.


CfP: CMM Special Issue on Computational Methods

CT&M’s journal Communication Methods and Measures invites submissions for a special issue on computational methods. Here is the full call for papers:

For this special issue, we invite submissions that further the understanding, development and application of computational methods in communication research. Computational methods include (but are not limited to) methods such as text analysis, topic modeling, social/semantic network analysis, online experiments, machine learning, and agent-based modeling and simulations. Computational methods can be used to build theory about, quantify, analyze, and visualize communication structures and processes. Computational methods can be applied to “big data” and social media data, but can also be used to analyse historical archives (e.g. newspaper archives, proceedings) or to provide a more sophisticated understanding of “small data”.

In particular, we welcome submissions on:

  • Innovative ways to use computational methods for communication research;
  • Evaluation and validation of computational approaches to studying communication research;
  • Application of computational methods to answer substantive communication research questions;
  • Reflections on the role of computational methods in communication research and their link with theory.

The special issue may also include a “teacher’s corner” article with brief descriptions of useful software packages and tools for studying communication. Authors interested in this format are encouraged to contact special issue co-editor Wouter van Atteveldt prior to submission.

The deadline for submission for consideration is July 1, 2017. Submitters should include a statement in the cover letter that the manuscript is being submitted for the special issue on Computational Methods. Articles will be peer reviewed and a decision rendered within 60 days, with a target publication date of March 2018. Instructions for authors and a description of the online submission process can be found on the journal’s home page at

Questions about this special issue can be directed to Wouter van Atteveldt or Winson Peng, Guest Editors, at and


Visiting Hong Kong

As you may have guessed from my new header image, I am currently in Hong Kong as a visiting assistant professor at City University of Hong Kong. Besides hiking, I am looking forward to continuing my research and especially to experimenting with some Chinese (and Cantonese!) text processing.

I will also give a research talk here on the 20th of February on Sentiment Analysis. Stay tuned!

Text Analysis in R @Glasgow

I will be giving a workshop on Text Analysis in R at Glasgow University on 17 November, 2016.

Data: [all data (zip)][tokens.rds][meta.rds][lexicon.rds][reviews.rds]

Data as csv: [tokens_full.csv][tokens.csv][lexicon.csv]

[Source for all slides (contains the R code)]


10:30 – 12:00 [slides][session log]
– Recap: Frequency Based Analysis and the DTM
– Dictionary Analysis with AmCAT and R

13:30 – 15:00 [slides]
– Simple Natural Language Processing
– Corpus Analysis and Visualization
– Topic Modeling and Visualization

15:15 – 17:00 [slides]
– Sentiment Analysis with dictionaries
– Sentiment Analysis with proximity
– [Handout: Obtaining sentiment resources with R]
– If time permits: [machine learning sentiment analysis handout]


Current Issues in communication science: Big Data and Social Analytics

I will give a guest lecture today in the ‘Current Issues in Communication Science’ course of our MSc in Communication Science. The leading question of the lecture is how big data will impact the social sciences: what are the opportunities and pitfalls? Using the famous ‘Facebook studies’ as an example, I will show how ‘big data’ can be used to answer theoretically relevant questions that would otherwise be impossible to answer, but also stress the problems and dangers of relying on such data. [Download slides]

Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415), 295-298.

Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788-8790.

Clause Analysis accepted for Political Analysis

I’m delighted that my paper on clause analysis has been accepted for publication in Political Analysis:

Clause analysis: Using syntactic information to automatically extract source, subject, and predicate from texts with an application to the 2008-2009 Gaza War
Wouter van Atteveldt, Tamir Sheafer, Shaul R. Shenhav, and Yair Fogel-Dror

Abstract: This article presents a new method and open source R package that uses syntactic information to automatically extract source–subject–predicate clauses. This improves on frequency based text analysis methods by dividing text into predicates with an identified subject and optional source, extracting the statements and actions of (political) actors as mentioned in the text. The content of these predicates can be analyzed using existing frequency based methods, allowing for the analysis of actions, issue positions and framing by different actors within a single text. We show that a small set of syntactic patterns can extract clauses and identify quotes with good accuracy, significantly outperforming a baseline system based on word order. Taking the 2008–2009 Gaza war as an example, we further show how corpus comparison and semantic network analysis applied to the results of the clause analysis can show differences in citation and framing patterns between U.S. and English-language Chinese coverage of this war.

You can download the [presentation] I gave based on the paper at NetGlow.