NetGloW 2016: Networks in a Global World

Paper: Using syntactic clauses for social and semantic network analysis

Abstract: This article presents a new method and open source R package that uses syntactic information to automatically extract source–subject–predicate clauses. This improves on frequency-based text analysis methods by dividing text into predicates with an identified subject and optional source. The content of the identified predicates can be analysed by existing frequency-based methods, showing what different actors are described as doing and saying. We show that a small set of syntactic patterns can extract clauses and identify quotes with good accuracy, significantly outperforming a baseline system based on word order. Taking the 2008–2009 Gaza war as an example, we further show how corpus comparison and semantic network analysis can be applied to the results of the clause analysis to analyse the difference in citation and framing patterns between U.S. and English-language Chinese coverage of this war. [paper under review, mail me if interested] [presentation]

Workshop: Text and Network Analysis with R (hosted on github):

This is the material for the Text and Network Analysis with R course as part of the Networks in a Global World (NetGloW) conference. I will use this page to publish slides, hand-outs, data sets etc. As the title indicates, the workshop will be taught almost completely using R. If you don’t use R yet, please make sure that you install R and RStudio on your laptop.

This repository hosts the slides (html and source code). The source code for all handouts is published on my learningR page. You may also want to check out a 4 day Text Analysis with R course I taught at City U, Hong Kong.

Session 1: Managing data and Accessing APIs from R

In this introductory session you will learn how to use R to organize and transform your data and how to obtain data by accessing public APIs such as those of Twitter and Facebook.

Session 2: Corpus and Network Analysis

This session is the main content of the workshop: analysing text and networks from R. We will look at simple corpus analysis, comparing corpora, topic modeling, analysing (social) networks, and semantic network analysis.
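As a rough illustration of what comparing corpora can look like, here is a minimal base R sketch. The two toy corpora and all variable names are invented for this example, and the add-one smoothing is just one simple choice; it is not necessarily the method used in the workshop materials.

```r
# Hypothetical toy corpora, each given as a vector of tokens
corpus_a = c("war", "war", "rocket", "ceasefire", "talks")
corpus_b = c("talks", "talks", "ceasefire", "aid", "aid", "aid")

# Shared vocabulary, so both frequency vectors line up term by term
vocab = sort(unique(c(corpus_a, corpus_b)))
freq_a = as.vector(table(factor(corpus_a, levels = vocab)))
freq_b = as.vector(table(factor(corpus_b, levels = vocab)))

# Relative frequencies with add-one smoothing (avoids division by zero)
rel_a = (freq_a + 1) / (sum(freq_a) + length(vocab))
rel_b = (freq_b + 1) / (sum(freq_b) + length(vocab))

# Over-representation: values above 1 mean the term is more typical of corpus A
comparison = data.frame(term = vocab, freq_a, freq_b, over = rel_a / rel_b)
comparison = comparison[order(-comparison$over), ]
print(comparison)
```

Terms at the top of the table are relatively typical of the first corpus, terms at the bottom of the second.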

Hong Kong Summer School: Advanced Text Analysis with R

I’m very excited to be teaching the course on Advanced Text Analysis with R as part of the City University of Hong Kong Summer School in Social Science Research. I will use this page to publish lecture slides, hand-outs, data sets etc.

As the title indicates, the course will be taught almost completely using R. If you don’t use R yet, please make sure that you install R and RStudio on your laptop. Also, please go through the code on the first two handouts published on my learningR page:

  1. R as a calculator
  2. Playing with data in R 

In general, all slides (including source code) are available from GitHub at vanatteveldt/hk2016, and all handouts are available from vanatteveldt/learningr.

If you have any questions, please don’t hesitate to email me. Thanks, and see you all in Hong Kong!

June 2nd (morning): Organizing and Transforming data in R

In this introductory session you will learn how to use R to organize and transform your data: calculating columns, subsetting, transforming and merging data, and computing aggregate statistics. If time permits, we will also cover basic modelling and/or programming in R as desired.
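The operations listed above can be sketched in a few lines of base R. The data frames and column names below are made up for illustration and are not taken from the course handouts.

```r
# Hypothetical toy data: one row per article
articles = data.frame(
  id = 1:6,
  medium = c("A", "A", "B", "B", "B", "C"),
  words = c(100, 250, 80, 120, 400, 300),
  hits = c(2, 5, 0, 3, 8, 6)
)
# Hypothetical metadata: one row per medium
media = data.frame(medium = c("A", "B", "C"),
                   type = c("quality", "popular", "quality"))

# Calculate a new column: hits per 100 words
articles$rate = articles$hits / articles$words * 100

# Subset: keep only articles longer than 90 words
long = articles[articles$words > 90, ]

# Merge with the per-medium metadata (joins on the shared column "medium")
long = merge(long, media)

# Aggregate: mean rate per type of medium
per_type = aggregate(rate ~ type, data = long, FUN = mean)
print(per_type)
```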

June 2nd (afternoon): Visualizing and using APIs from R: Twitter, Facebook, NY Times
In this session we will look briefly at visualizing data in R. The main focus of the session is on using APIs from R. We will be looking at the Twitter, Facebook, and NY Times API, and also see how to access arbitrary web resources from R.
You will also start working on your mini-projects by selecting a topic and gathering data.

June 3rd (morning): Querying text with AmCAT and R
This is the first session that directly deals with text analysis.
The goal of this session is to learn how to use AmCAT as a document management tool, upload data, and perform queries from R.
You will continue working on your topic by uploading your data and conducting exploratory analyses.

June 3rd (afternoon): Corpus Analysis and Text (pre)processing
In this session the focus is on the Document Term Matrix: word clouds, comparison of different corpora, and topic models.
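To make the idea of a Document Term Matrix concrete, here is a minimal base R sketch that builds one from three invented sentences. In practice a dedicated package would create a sparse matrix; the whitespace tokenizer here is a deliberate simplification.

```r
# Hypothetical toy documents
docs = c(d1 = "the war in gaza",
         d2 = "the ceasefire in gaza holds",
         d3 = "the war ends")

# Tokenize: lowercase and split on whitespace
tokens = strsplit(tolower(docs), "\\s+")

# Long format: one row per token, labelled with its document
long = data.frame(doc = rep(names(docs), lengths(tokens)),
                  term = unlist(tokens, use.names = FALSE))

# Cross-tabulate documents by terms: this is the document term matrix
dtm = table(long$doc, long$term)
print(dtm)

# Total frequency per term, the usual input for a word cloud
freq = sort(colSums(dtm), decreasing = TRUE)
print(freq)
```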

June 4th: Advanced text analysis: Machine learning and sentiment analysis
In this session we will do sentiment analysis using both a dictionary approach and with machine learning. These techniques can also be applied to other forms of automatic content analysis such as determining topic or frame analysis.
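The dictionary approach can be sketched as follows in base R. The four-word dictionary and the example sentences are invented; a real analysis would use a validated sentiment lexicon and proper tokenization.

```r
# Hypothetical mini-dictionary: word -> sentiment value
dictionary = c(good = 1, great = 1, bad = -1, terrible = -1)

# Hypothetical example documents
docs = c("a good and great result", "a terrible outcome", "nothing to report")

# Score one document: sum the values of all dictionary words it contains
score_doc = function(text, dictionary) {
  words = strsplit(tolower(text), "[^a-z]+")[[1]]
  # Words missing from the dictionary yield NA and are dropped by na.rm
  sum(dictionary[words], na.rm = TRUE)
}

scores = sapply(docs, score_doc, dictionary = dictionary, USE.NAMES = FALSE)
print(scores)
```

A machine learning approach replaces the fixed dictionary with weights learned from manually coded examples.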

June 5th: Advanced text analysis: Semantic Network Analysis and Visualization
In the last session we will look at semantic network analysis with word-window approaches and more advanced visualization techniques using ggplot2, igraph, and gephi.

Simple sentiment analysis with AmCAT and semnet

This document shows how to use the semnet windowed co-occurrence function to do ‘sentiment analysis’ with AmCAT data, i.e. search for specific sentiment terms in a window around specific target terms.
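To show what a windowed co-occurrence means before turning to the packages, here is a minimal base R sketch on an invented token list. The function `count_window` is hypothetical and is not how semnet implements it; it only demonstrates the idea of pairing target and sentiment terms within a fixed distance.

```r
# Hypothetical token list with article id, position and term
tokens = data.frame(
  aid = c(1, 1, 1, 1, 1, 2, 2, 2),
  position = c(0, 1, 2, 3, 4, 0, 1, 2),
  term = c("afrika", "is", "solide", "en", "gul", "gul", "weer", "afrika")
)

# Hypothetical helper: target/sentiment pairs at most `window` positions apart
count_window = function(tokens, target, sentiment, window = 2) {
  targets = tokens[tokens$term == target, ]
  sentiments = tokens[tokens$term %in% sentiment, ]
  # All target/sentiment combinations within the same article
  pairs = merge(targets, sentiments, by = "aid",
                suffixes = c(".target", ".sentiment"))
  # Keep only the pairs that fall inside the window
  pairs[abs(pairs$position.target - pairs$position.sentiment) <= window, ]
}

pairs = count_window(tokens, "afrika", c("solide", "gul"), window = 2)
n_per_article = table(pairs$aid)
print(n_per_article)
```

Widening the window to 4 would also pick up the more distant 'gul' in the first article.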

This requires the amcatr and semnet packages.


As an example, we use set 27938 in project 2. We will use ‘afrika’ as the target term and a very short list of positive terms; feel free to make this as complex as needed:

project = 2
articleset = 27938
afrika = "afrika*"
positief = "aangena* aardig* gul*  solide* wijs* zorgza*"

Step 1: Getting the tokens:

The amcat.gettokens function allows you to get tokens (a list of words) from AmCAT. The module elastic always works; you can also use language-specific lemmatizers if needed.

conn = amcat.connect("")
tokens = amcat.gettokens(conn, project, articleset, module = "elastic", page_size = 1000)
##   position       term end_offset start_offset      aid
## 1        0 tientallen         10            0 30855253
## 2        1     eieren         17           11 30855253
## 3        2         in         20           18 30855253
## 4        3          s         23           22 30855253
## 5        4    werelds         31           24 30855253
## 6        5     oudste         38           32 30855253

Step 2: Running the queries

The next step is to search for the target and sentiment terms by converting them to regular expressions and using grepl to define a new variable concept on the token list:

afrika = lucene_to_re(afrika)
positief = lucene_to_re(positief)

tokens$concept = NULL
tokens$concept[grepl(afrika, tokens$term, ignore.case = T)] = "afrika"
tokens$concept[grepl(positief, tokens$term, ignore.case = T)] = "positief"
table(tokens$concept, useNA="always")
##   afrika positief     <NA> 
##     2526      223   745209
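The internals of lucene_to_re are not shown here. As a rough idea of what such a conversion involves, the following hypothetical stand-in (`wildcards_to_re` is not part of amcatr) handles only the simplest case: space-separated terms made of letters, optionally ending in a `*` wildcard.

```r
# Hypothetical simplified converter; assumes terms contain only letters and '*'
wildcards_to_re = function(query) {
  terms = strsplit(trimws(query), "\\s+")[[1]]
  # Turn the Lucene-style wildcard '*' into the regex '.*'
  terms = gsub("*", ".*", terms, fixed = TRUE)
  # Anchor the alternatives so that each pattern must match a whole token
  paste0("^(", paste(terms, collapse = "|"), ")$")
}

re = wildcards_to_re("aangena* aardig* gul*  solide* wijs* zorgza*")
matches = grepl(re, c("aardige", "Gulle", "afrika", "regulier"), ignore.case = TRUE)
print(matches)
```

Because of the anchoring, 'regulier' is not matched even though it contains 'gul'.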

Step 3: Running semnet

Now we can run semnet. To only get total counts of co-occurring terms:

g = windowedCoOccurenceNetwork(location = tokens$position, context = tokens$aid, term = tokens$concept)
igraph::get.data.frame(g, "edges")

To get the counts per article and join to the metadata:

coocs = windowedCoOccurenceNetwork(location=tokens$position,  context = tokens$aid, term = tokens$concept, output.per.context = T)
coocs = coocs[coocs$x == "afrika", ]
colnames(coocs) = c("concept" ,"sentiment", "id", "n")
meta = amcat.getarticlemeta(conn, project, articleset)
## Got 952 rows (total: 952 / 952)
coocs = merge(coocs, meta)
head(coocs, 2)
##         id concept sentiment n       date          medium
## 1 10945880  afrika  positief 1 2011-10-01 NRC Handelsblad
## 2  1448693  afrika  positief 1 2008-09-21    De Telegraaf

Text Visualization workshop @ LSE & Imperial

I’ll be presenting tomorrow (Thu 24 March) at the Text Visualization workshop organized by LSE and Imperial. I’m curious what kind of visualizations people will come up with for the hackathon challenge!

Visualize your corpus with R: Why word clouds aren’t always stupid

Word clouds usually convey only relative word frequency, but by using the other dimensions (colour, x, y) we can convey a lot more information. Using the corpustools and semnet packages we can make word clouds that are both pretty and informative.

Slides [html] [source code]

Teaser: Words in and between 3 topics in the states of the union 2000-2016


“R klasje” for CW Master

Today from 11:00 – 13:00 I will be teaching the first part of the informal R course for the CW master in 2A-59. Everyone is welcome!

Links: [slides] | [handouts] | [data] | [income_topdecile]

Possible content of this and following sessions:

  1. Getting started: Your data in R
  2. Merging and transforming data
  3. Classical statistics and visualization
  4. Advanced statistics and/or programming
  5. Analysing texts and networks


1) Bring a laptop
2) Make sure you have R and RStudio installed
3) Work through the first two handouts in advance (just see how far you get)
4) Think about what you expect to use R for and which techniques you would like to learn.

Who checks the fact-checker?

The media have a watchdog function, but that always raises the question of who should then check the media. Our claim about ANP influence at, among others, the Volkskrant is now being checked by a Volkskrant editor, who concludes (without any methodological justification) that things are not so bad at the Volkskrant, which is then published in the Volkskrant. When we want to respond with a letter to the editor, it is refused without any reason being given. Instead, the Volkskrant ombudswoman takes another look at the matter and concludes that the Volkskrant is really quite objective after all, because a point of criticism (irrelevant to our research) is also discussed. That is again published in the Volkskrant. For us, a very clear picture emerges: “we at the Volkskrant recommend the Volkskrant”, and critical voices are not welcome. That leaves us with the question of who should actually check the fact-checkers.