Tools for the automated extraction of information

I will be giving a workshop/tutorial on using R for automatic information extraction at the EUI. This is part of the Innovations in Quantitative Content Analysis workshop organized by Hanspeter Kriesi, Swen Hutter, and Jasmine Lorenzini.

Date: Thursday, February 26th, 12:30 – 17:30
Location: EUI Florence, Emeroteca, Badia Fiesolana
Materials: Slides, Learning R, handouts on Corpus Analysis and Clause Analysis

At the bottom of this post is an overview of the programme/contents of my tutorial. As the tutorial is interactive and will use R for all of the analyses, please make sure to have the newest version of R and RStudio installed on your computer. Also please install the devtools and RTextTools packages and use devtools to install the amcat/amcat-r and kasperwelbers/corpustools packages by running the following code in RStudio:
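
A sketch of those installation steps, assuming the package and repository names given above are still available (what is on CRAN and GitHub may have changed since this post was written):

```r
# CRAN packages named in the post
install.packages(c("devtools", "RTextTools"))

# GitHub packages named in the post, installed through devtools
devtools::install_github("amcat/amcat-r")
devtools::install_github("kasperwelbers/corpustools")
```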


If you have any trouble installing R, RStudio, or these packages, please email me beforehand so we don’t waste workshop time hunting down installation problems.

Preliminary program (time, session, goals/topics):

11:30 – 12:30 Introduction to R
– Make sure R/RStudio/amcatr/RTextTools/corpus-tools are installed and running
– Basics of R: variables, vectors, data frames
– Selecting and transforming data

12:30 – 13:30 Lunch

13:30 – 15:00 Corpus analysis and the document-term matrix
– Create and play with DTMs
– Understand tokenizing, stemming, lemmatizing etc.
– Using corpus-tools:
  – Word frequency analysis and filtering
  – Comparing corpora
  – Topic modeling
– Using amcat+amcatr for preprocessing

15:00 – 15:15 Coffee break

15:15 – 17:30 Clause analysis
– Understanding the link between syntax and clauses
– Using amcat+amcatr to perform source+clause analysis
– Combining clause analysis with keyword analysis
– Combining clause analysis with corpus analysis/topic modeling
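
To give a flavour of the corpus analysis session, here is a minimal sketch of a document-term matrix built with base R only. The three toy documents and all variable names are made up for illustration. The workshop itself uses RTextTools and corpus-tools, whose functions are not shown here.

```r
# Toy corpus: three hypothetical documents (not from the workshop materials)
docs <- c(d1 = "The parliament debates the budget",
          d2 = "The budget debate continues in parliament",
          d3 = "Storm damages the harbour")

# Tokenize: lower-case, then split on anything that is not a letter
tokens <- strsplit(tolower(unname(docs)), "[^a-z]+")
names(tokens) <- names(docs)

# Long format: one row per (document, term) occurrence
long <- data.frame(doc  = rep(names(tokens), lengths(tokens)),
                   term = unlist(tokens, use.names = FALSE))

# Cross-tabulate into a document-term matrix of raw counts
dtm <- unclass(table(long$doc, long$term))

# Term frequencies over the whole corpus, most frequent first
term_freq <- sort(colSums(dtm), decreasing = TRUE)

print(dtm)
print(head(term_freq, 3))
```

Rows are documents and columns are terms. Note that without stemming or lemmatizing, "debates" and "debate" end up in separate columns, which is one reason the preprocessing steps listed above matter.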

Leo Kim (Treum) @VU: Media Monitoring and Network Analysis

I am pleased to announce that Leo Kim will present at the VU communication science research colloquium on 16 February at 15:30, in Metropolitan Z009 at the VU University Amsterdam.

Leo Kim is currently finalizing his Ph.D. at the University of Sussex and has done extensive research on media monitoring and social and semantic network analysis. He is currently CEO of Treum, a Korean company specializing in (social) media monitoring with customers such as Samsung and Coca-Cola.


The challenges of semantic network analysis and its practical applications
Leo Kim

The methodology of semantic network analysis has inspired intellectuals in the social sciences with its semiotic implications, calculability, and powerful visual graphics.
However, the variable and complex nature of data processing and thresholding, and the need to present results to an uninformed public, have made it harder to convince others of its usefulness.
To make the methodology more stable and communicable, the company Ars Praxia (formerly Treum) has worked on methodological improvements over several years. In this presentation, the presenter shares the trajectory of these improvements and the challenges currently faced, along with cases of practical application that had lasting social impact.

Social and Semantic Networks in Communication Research

DEADLINE EXTENDED: 14 February 2015

We have decided to extend the deadline for the Social and Semantic networks preconference by one month to the 14th of February. By popular request, we have also decided to accept submissions in (extended) abstract form.  Submit here and spread the word!


Bringing together Social and Semantic Networks in Communication Research

Wouter van Atteveldt, Christian Baden, Jana Diesner (alphabetic)

ICA 2015 Preconference, Puerto Rico, 21 May 2015

Co-Sponsored by ICA’s Communication & Technology, Mass Communication & Political Communication Divisions

While the analysis of social networks and semantic networks has quickly advanced over the past years, this development has so far been only weakly received in the communication sciences. Network researchers have developed a whole bouquet of powerful and scalable tools and methods for the analysis of discourse texts and communicative interactions, and first inroads are being made toward the joint analysis of social and semantic network data. However, these methods’ communication-theoretic foundations, as well as their applications for addressing pressing questions in the field, are still underdeveloped. Moreover, social and semantic network analytic approaches are most commonly considered separately. Yet communication processes inevitably include patterns of both social relations and semantic contents, which can often be fruitfully conceptualized as networks. Building upon last year’s preconference on this theme, this event aims to connect network analytic methodology with important developments in the field of communication research, such as:

  • the rising attention to the semantic substance and meaning of messages and the configuration of different communication content exchanged in public discourses
  • the theoretically grounded integration of text and social network data in communication analysis (e.g., in social media communication)
  • the rising importance of networked organizations and forms of organizing and communicating with flat hierarchies, a dedifferentiation of communicator roles, and self-organizing publics
  • the reconceptualization of existing communication patterns, social structures, institutions, and other entities in society in terms of interaction networks

The preconference is co-sponsored by the ICA Communication & Technology Division, the ICA Mass Communication Division and the ICA Political Communication Division, but it touches upon the fields of many more ICA divisions and interest groups. The preconference aims to bring together researchers from different backgrounds, including theoretically, methodologically, and practically oriented researchers in diverse fields of application, both inside and outside academia. It thereby aims to foster mutual learning and the exchange of innovative ideas and challenges for the further development of network analysis in communication research.

We invite contributions that make use of social, semantic, or both types of network analysis to address relevant questions in communication research, to advance network analytic methodology for the study of communication, or to advance communication theory to integrate with network analytic methodology. Specifically, we welcome any contributions that consider how semantic and social relations and processes might be linked or can affect one another (e.g., semantic networks related to social groups or interactions, social networks related to semantic contents or ideas, socio-semantic networks). We are also looking for technological advances in the form of new computational solutions and tool demonstrations.

CONFERENCE FORMAT & SUBMISSIONS (Paper, Data Presentations, Tool Presentations)

Contributions can come from a wide variety of disciplinary backgrounds, but should relate to both network analytic methodology and communication science research questions and/or theory. Submissions will be evaluated according to their innovative potential, methodological quality, and contribution to communication science research.

In addition to more classic research presentations, we explicitly invite the sharing of network-analytic tools and data, which can be presented in a specially dedicated high-density demonstration session. These demonstrations serve to introduce new software tools (open access tools preferred) for applying network analysis in communication science research, and open access data sets available to the research community (e.g., “big data” with network-analytic potential).

Submissions for a regular presentation should be original papers of approximately 4000 to 8000 words that have not been published elsewhere. In an accompanying abstract of 150 words, authors should emphasize the specific contribution of their paper to advancing network analytic research and theory in communication.

Submissions for the high-density demonstration session should provide extended abstracts (1000 to 1500 words) that introduce the data or tool presented. As far as applicable, these abstracts should also state the conditions of use of the presented tool or data for other researchers.

All submissions must be uploaded by January 11, 2015, with all identifying information removed from the manuscript or abstract. All contributions will be blindly peer-reviewed, and acceptance notifications will be sent out before the end of February 2015.

Registration for the preconference is open to both presenters and non-presenters and opens on January 15, 2015. Registration fees are 60 USD for students (graduate, doctoral) and 100 USD for both faculty (PhD holders) and practitioners outside academia. The preconference will take place on Thursday, May 21, 2015, at one of the two conference hotels of the 65th ICA Annual Conference in San Juan, Puerto Rico.

For any direct inquiries regarding this preconference, please contact any of the following:

Wouter van Atteveldt, VU Amsterdam

Christian Baden, Hebrew U Jerusalem

Jana Diesner, UIUC


09:00 – 09:15 Welcome & introduction

09:15 – 10:30 Paper session 1 (3 papers, 25 min each: 15-20 min talk, 10-5 min Q&A)

10:30 – 11:00 Break

11:00 – 12:15 High-density session (6-7 tool-/data-demonstrations, 10 min each: 5 min talk, 5 min Q&A)

12:15 – 13:15 Lunch (off site)

13:15 – 14:30 Paper session 2 (3 papers, 25 min each: 15-20 min talk, 10-5 min Q&A)

14:30 – 15:00 Coffee break

15:00 – 16:15 Paper session 3 (3 papers, 25 min each: 15-20 min talk, 10-5 min Q&A)

16:15 – 17:00 Roundtable Discussion: Challenges and Future Directions


2014 Conference recap

LDA models topics… But what are ‘topics’?

(Big data in the Social Sciences workshop, University of Glasgow, 23 June 2014)

LDA topic modeling is a popular technique for unsupervised document clustering. However, the utility of LDA for analysing political communication depends on being able to interpret the topics in theoretical terms. This paper explores the relation between LDA topics and content variables traditionally used in political communication. We generate an LDA model on a full collection of front-page articles of Dutch newspapers and compare the resulting LDA topics to a manual coding of the political issues, frames, and sentiment.

In general, we find that a large number of topics are closely related to a specific issue, and that the different topics that comprise an issue can be interpreted as subissues, events, and specific journalistic framing of the issue. The relation between frames and topics is less direct, with a large number of topics associated with each of the investigated frames while no topics were identified that really encoded just a specific frame. Finally, hardly any topic had a clear sentiment associated with it, the only exception being topics whose sentiment is contained in the represented issue, such as disasters. These results validate the use of LDA topics as proxies for political issues, and pave the way for a more empirical understanding of the substantive interpretation of LDA topics.

Quotes as Data: Extracting Political Statements from Dutch Newspapers by applying Transformation Rules to Syntax Graphs [presentation]

(MPSA 2014, Chicago)

To understand the relation between media and politics, it is necessary to study the content of politicians’ statements in the news. By using syntactic analysis and topic models, this paper looks at how often politicians are quoted, and whether their media statements are similar to their statements in parliament. While media attention simply follows political power, this is quite different for media statements. The frequency of statements is a matter of journalistic demand (e.g. high during scandals) and political supply (e.g. low during closed-door negotiations). Media statements are most similar to political discourse during the campaign, and for limited-issue parties. Some interesting results were found, with the anti-immigration PVV being relatively dissimilar during the campaign, and possible coalition partners being relatively dissimilar during the coalition talks. This paper is a promising first step into the relatively understudied area of mediated politics.

Semantic Network Analysis of Frame Building during war: Mediated Public Diplomacy in Gaza, Georgia, and Iraq [presentation]

(Presented at ISA 2014, Toronto)

This paper is a work in progress describing an ongoing effort to automatically analyze the framing of conflict by media in third countries using Semantic Network Analysis. We study three conflicts: the 2003–2011 war in Iraq, the 2008 South Ossetian conflict, and the 2008–2009 Gaza War. For each conflict, we have manually analysed (public or private) messages of at least one of the belligerent parties to determine that party’s preferred framing of the conflict. By analysing these frames from a semantic network perspective, we show that there is a recurrent set of framing functions that are used by the parties in all three conflicts. Using transformation rules on the syntactic structure of sentences, these framing functions can then be automatically identified in newspaper coverage. Once these rules are finalized and properly evaluated, they will allow us to study frame building in international conflict in an automatic and transparent way, while retaining the rich semantics required by framing analysis.