Simple sentiment analysis with AmCAT and semnet

This document shows how to use the semnet windowed co-occurrence function to do ‘sentiment analysis’ with AmCAT data, i.e. search for specific sentiment terms in a window around specific target terms.

This requires the amcatr and semnet packages:

library(amcatr)
library(semnet)

As an example, we use article set 27938 in project 2, with ‘afrika’ as the target term and a very short list of positive terms. Feel free to make the term lists as complex as needed:

project = 2
articleset = 27938
afrika = "afrika*"
positief = "aangena* aardig* gul*  solide* wijs* zorgza*"

Step 1: Getting the tokens

The amcat.gettokens function retrieves tokens (a list of words) from AmCAT. The elastic module always works; you can also use language-specific lemmatizers if needed.

conn = amcat.connect("https://amcat.nl")
tokens = amcat.gettokens(conn, project, articleset, module = "elastic", page_size = 1000)
head(tokens)
##   position       term end_offset start_offset      aid
## 1        0 tientallen         10            0 30855253
## 2        1     eieren         17           11 30855253
## 3        2         in         20           18 30855253
## 4        3          s         23           22 30855253
## 5        4    werelds         31           24 30855253
## 6        5     oudste         38           32 30855253

Step 2: Running the queries

The next step is to search for the target and sentiment terms by converting the queries to regular expressions and using grepl to define a new variable, concept, on the token list:

afrika = lucene_to_re(afrika)
positief = lucene_to_re(positief)

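# Start with no concept assigned; a token matching both queries ends up as "positief"
tokens$concept = NA_character_
tokens$concept[grepl(afrika, tokens$term, ignore.case = TRUE)] = "afrika"
tokens$concept[grepl(positief, tokens$term, ignore.case = TRUE)] = "positief"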
table(tokens$concept, useNA="always")
## 
##   afrika positief     <NA> 
##     2526      223   745209
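
To make the query conversion in this step more concrete, here is a minimal base R sketch of how a space-separated list of wildcard terms could be turned into one regular expression. The helper wildcards_to_regex is hypothetical and written only for illustration; it is not the semnet implementation of lucene_to_re, which may produce a different pattern and support more query syntax.

```r
# Hypothetical helper (an assumption, not part of semnet): turn a
# space-separated list of wildcard terms into a single anchored regex.
wildcards_to_regex = function(query) {
  terms = strsplit(trimws(query), "\\s+")[[1]]   # split on any run of whitespace
  terms = gsub("*", ".*", terms, fixed = TRUE)   # '*' matches any run of characters
  paste0("^(", paste(terms, collapse = "|"), ")$")
}

re = wildcards_to_regex("aangena* gul*  wijs*")
re
# "^(aangena.*|gul.*|wijs.*)$"

# Anchoring means "onwijs" does not match "wijs*"
hits = grepl(re, c("Aangenaam", "gulle", "onwijs", "wijsheid"), ignore.case = TRUE)
hits
# TRUE TRUE FALSE TRUE
```

The sketch does not escape other regex metacharacters, so it is only suitable for plain word stems like the ones in this example.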

Step 3: Running semnet

Now we can run semnet. To get only the total counts of co-occurring terms:

g = windowedCoOccurenceNetwork(location=tokens$position,  context = tokens$aid, term = tokens$concept)
get.data.frame(g, "edges")
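
As a rough illustration of what a windowed co-occurrence count means, the following base R sketch counts target/sentiment token pairs that appear in the same article within a given distance of each other. The toy data, the helper count_window_pairs and its window parameter are all assumptions made for this example; semnet may define and count windows differently, so the numbers are not expected to match its output.

```r
# Toy token list: two articles, concepts already tagged (NA = no concept)
toy = data.frame(
  aid      = c(1, 1, 1, 1, 1, 2, 2, 2),
  position = c(0, 1, 2, 3, 40, 0, 1, 2),
  concept  = c("afrika", NA, "positief", NA, "positief", "positief", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count (target, sentiment) token pairs that occur in
# the same article at most `window` positions apart.
count_window_pairs = function(tokens, target, sentiment, window = 10) {
  x = tokens[!is.na(tokens$concept) & tokens$concept == target, ]
  y = tokens[!is.na(tokens$concept) & tokens$concept == sentiment, ]
  pairs = merge(x, y, by = "aid", suffixes = c(".x", ".y"))  # all pairs within an article
  sum(abs(pairs$position.x - pairs$position.y) <= window)
}

n_pairs = count_window_pairs(toy, "afrika", "positief", window = 10)
n_pairs
# 1: only the "positief" token at position 2 is close enough to "afrika" at position 0
```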

To get the counts per article and join them to the metadata:

coocs = windowedCoOccurenceNetwork(location=tokens$position,  context = tokens$aid, term = tokens$concept, output.per.context = T)
coocs = coocs[coocs$x == "afrika", ]
colnames(coocs) = c("concept" ,"sentiment", "id", "n")
meta = amcat.getarticlemeta(conn, project, articleset)
## https://amcat.nl/api/v4/projects/2/articlesets/27938/meta?page_size=10000&format=rda&columns=date%2Cmedium
## Got 952 rows (total: 952 / 952)
coocs = merge(coocs, meta)
head(coocs, 2)
##         id concept sentiment n       date          medium
## 1 10945880  afrika  positief 1 2011-10-01 NRC Handelsblad
## 2  1448693  afrika  positief 1 2008-09-21    De Telegraaf
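
Once the counts are joined to the metadata, one possible follow-up is to aggregate them per medium and year. The sketch below uses a small made-up data frame, toy_coocs, with the same columns as the merged result above; its values are invented for illustration and are not taken from the article set. It also assumes the date column can be formatted with format().

```r
# Toy stand-in for the merged `coocs` data frame (values made up for illustration)
toy_coocs = data.frame(
  id        = c(1, 2, 3, 4),
  concept   = "afrika",
  sentiment = "positief",
  n         = c(1, 2, 1, 3),
  date      = as.Date(c("2008-09-21", "2008-11-02", "2011-10-01", "2011-12-24")),
  medium    = c("De Telegraaf", "De Telegraaf", "NRC Handelsblad", "NRC Handelsblad"),
  stringsAsFactors = FALSE
)

# Derive the year and sum the co-occurrence counts per medium and year
toy_coocs$year = format(toy_coocs$date, "%Y")
per_medium_year = aggregate(n ~ medium + year, data = toy_coocs, FUN = sum)
per_medium_year
```

The same two lines at the end should work on the real coocs data frame, provided its date column is a Date or date-time value.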