You can find all handouts on vanatteveldt.com/learningr
This document shows how to use the semnet windowed co-occurrence function to do ‘sentiment analysis’ with AmCAT data, i.e. search for specific sentiment terms in a window around specific target terms.
As an example, we use set 27938 in project 2. We will use ‘afrika’ as the target term and a very short list of positive terms; feel free to make this as complex as needed:
project = 2
articleset = 27938
afrika = "afrika*"
positief = "aangena* aardig* gul* solide* wijs* zorgza*"
The amcat.gettokens function retrieves tokens (a list of words) from AmCAT. The elastic module always works; you can also use language-specific lemmatizers if needed.
conn = amcat.connect("https://amcat.nl")
tokens = amcat.gettokens(conn, project, articleset, module = "elastic", page_size = 1000)
head(tokens)
##   position       term end_offset start_offset      aid
## 1        0 tientallen         10            0 30855253
## 2        1     eieren         17           11 30855253
## 3        2         in         20           18 30855253
## 4        3          s         23           22 30855253
## 5        4    werelds         31           24 30855253
## 6        5     oudste         38           32 30855253
The next step is to search for the target and sentiment terms by converting them to regular expressions and using grepl to define a new variable, concept, on the token list:
# Convert the wildcard queries to regular expressions
afrika = lucene_to_re(afrika)
positief = lucene_to_re(positief)
# Tag matching tokens; all other tokens keep concept = NA
tokens$concept = NULL
tokens$concept[grepl(afrika, tokens$term, ignore.case = TRUE)] = "afrika"
tokens$concept[grepl(positief, tokens$term, ignore.case = TRUE)] = "positief"
table(tokens$concept, useNA = "always")
##
##   afrika positief     <NA>
##     2526      223   745209
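To make the conversion step more concrete, here is a minimal base R sketch of how a wildcard query could be turned into a regular expression. The helper wildcards_to_regex is hypothetical and is not the lucene_to_re function used above; it assumes the terms contain only letters and `*`, and it anchors the pattern so that a term must match the whole token.

```r
# Hypothetical helper (NOT the real lucene_to_re): turn a space-separated
# list of wildcard terms into a single anchored regular expression.
# Assumes terms consist of letters and '*' only.
wildcards_to_regex = function(query) {
  terms = strsplit(trimws(query), " +")[[1]]
  # '*' means "any number of characters"
  terms = gsub("*", ".*", terms, fixed = TRUE)
  # Anchor with ^ and $ so each alternative has to match the full token
  paste0("^(", paste(terms, collapse = "|"), ")$")
}

positief_re = wildcards_to_regex("aardig* gul*")

# Made-up tokens to try the pattern on
toy_terms = c("aardige", "Gul", "regulier", "afrika")
hits = grepl(positief_re, toy_terms, ignore.case = TRUE)
```

Because of the anchoring, "regulier" is not matched even though it contains "gul". Whether the real function anchors its patterns in the same way is not something this sketch claims.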
Now we can run semnet. To only get total counts of co-occurring terms:
# Build the co-occurrence graph over all articles and list its edges
g = windowedCoOccurenceNetwork(location = tokens$position, context = tokens$aid, term = tokens$concept)
get.data.frame(g, "edges")
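For readers who want to see what a windowed co-occurrence count amounts to, the following base R sketch counts target/sentiment pairs per article on made-up data. It is not semnet's implementation: the helper count_window_cooc, the window width of 10 positions, and the pair-counting rule are all assumptions chosen for illustration.

```r
# Made-up token list with the three columns the sketch relies on
toy = data.frame(
  aid      = c(1, 1, 1, 1, 1, 2, 2, 2),
  position = c(0, 1, 2, 3, 30, 0, 1, 2),
  concept  = c("afrika", NA, "positief", NA, "positief", "positief", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per article, count (target, sentiment) token pairs
# whose positions differ by at most `window`
count_window_cooc = function(tokens, target, sentiment, window = 10) {
  is_target = !is.na(tokens$concept) & tokens$concept == target
  is_sentiment = !is.na(tokens$concept) & tokens$concept == sentiment
  x = tokens[is_target, c("aid", "position")]
  y = tokens[is_sentiment, c("aid", "position")]
  # All target/sentiment combinations within the same article
  pairs = merge(x, y, by = "aid", suffixes = c(".x", ".y"))
  # Keep only the pairs that fall inside the window
  pairs = pairs[abs(pairs$position.x - pairs$position.y) <= window, ]
  aids = unique(pairs$aid)
  data.frame(aid = aids,
             n = vapply(aids, function(a) sum(pairs$aid == a), integer(1)))
}

res_narrow = count_window_cooc(toy, "afrika", "positief", window = 10)
res_wide = count_window_cooc(toy, "afrika", "positief", window = 30)
```

In the toy data, article 1 has one positive term close to ‘afrika’ and one far away, so widening the window changes the count; article 2 contains no target term and therefore does not appear in the result at all.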
To get the counts per article and join them to the metadata:
# Same call, but with one row per article instead of a single graph
coocs = windowedCoOccurenceNetwork(location = tokens$position, context = tokens$aid, term = tokens$concept, output.per.context = TRUE)
# Keep only the rows where 'afrika' is the first term
coocs = coocs[coocs$x == "afrika", ]
colnames(coocs) = c("concept", "sentiment", "id", "n")
meta = amcat.getarticlemeta(conn, project, articleset)
## https://amcat.nl/api/v4/projects/2/articlesets/27938/meta?page_size=10000&format=rda&columns=date%2Cmedium
## Got 952 rows (total: 952 / 952)
coocs = merge(coocs, meta)
head(coocs, 2)
##         id concept sentiment n       date          medium
## 1 10945880  afrika  positief 1 2011-10-01 NRC Handelsblad
## 2  1448693  afrika  positief 1 2008-09-21    De Telegraaf
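Once the counts are joined to the metadata, a natural follow-up is to aggregate them, for instance per medium and year. The sketch below shows one way to do that in base R. The data frames coocs_toy and meta_toy are made up and only mimic the column layout of the output above.

```r
# Made-up per-article counts and metadata, mimicking the columns above
coocs_toy = data.frame(id = c(1, 2, 3), concept = "afrika",
                       sentiment = "positief", n = c(1, 2, 1),
                       stringsAsFactors = FALSE)
meta_toy = data.frame(
  id = c(1, 2, 3, 4),
  date = as.Date(c("2008-09-21", "2008-11-02", "2011-10-01", "2011-12-24")),
  medium = c("De Telegraaf", "NRC Handelsblad", "NRC Handelsblad", "De Telegraaf"),
  stringsAsFactors = FALSE
)

# merge() joins on the shared column 'id'; this default inner join drops
# article 4, which has no co-occurrences
merged = merge(coocs_toy, meta_toy)

# Sum the co-occurrence counts per medium and year
merged$year = format(merged$date, "%Y")
per_year = aggregate(n ~ medium + year, data = merged, FUN = sum)
```

Note that articles without any co-occurrence disappear in the inner join. If you want rates per article rather than raw counts, merge(meta_toy, coocs_toy, all.x = TRUE) keeps them, with NA in the n column.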
I’ll be presenting tomorrow (Thu 24 March) at the Text Visualization workshop organized by LSE and Imperial. I’m curious what kind of visualizations people will come up with for the hackathon challenge!
Visualize your corpus with R: Why word clouds aren’t always stupid
Word clouds usually convey only relative word frequency, but by using the other dimensions (colour, x, y) we can convey a lot more information. Using the corpustools and semnet packages, we can make word clouds that are both pretty and informative.
Teaser: Words in and between 3 topics in the states of the union 2000-2016
Today from 11:00 – 13:00 I will be teaching the first part of the informal R course for the CW master in 2A-59. Everyone is welcome!
Possible content of this and subsequent sessions:
The media have a watchdog function, but that always raises the question of who should then watch the media. Our claim about ANP influence at, among others, de Volkskrant is now being checked by a Volkskrant editor, who concludes (without any methodological justification) that things are not so bad at de Volkskrant, which is then published in de Volkskrant. When we want to respond with a letter to the editor, it is refused without any reason given. Instead, the Volkskrant ombudswoman looks at the matter once more and concludes that de Volkskrant is actually quite objective, because a point of criticism (irrelevant to our research) is also discussed. That is again published in de Volkskrant. For us, a very clear picture emerges from this: “we at de Volkskrant recommend de Volkskrant”, and critical voices are not welcome. That leaves us with the question of who should actually check the fact-checkers.
We just got the exciting news that our article on Dutch media coverage of youth crime was accepted for publication by Journalism!
Nel Ruigrok, Wouter van Atteveldt, Sarah Gagestein, Carina Jacobi
Abstract: Between 2007 and 2011, the number of registered juvenile suspects declined by 44% but the Dutch public did not feel any safer. In this research we study media coverage of youth crime and interview journalists and their sources, in order to investigate the relationship between journalists, their sources and the possible effects on the public with respect to fear of crime. We find an overrepresentation of youth crime in news coverage, especially in the popular press, and a stronger episodic focus over time. All media focus increasingly on powerful sources that focus on repressive framing, but this is especially found in the elite press. We conclude that news coverage in all media groups, although in different ways, does contribute to the fear of crime in society and the idea that repressive measures are needed. The fact that this fear of crime is also caused by news coverage is acknowledged, but neither journalists nor politicians are able or willing to change this.
I’m happy to report that the paper I co-authored with Kasper Welbers and others has been accepted for ICA:
A gatekeeper among gatekeepers: The impact of a single news agency on political news in print and online newspapers in the Netherlands.
Kasper Welbers, Wouter van Atteveldt, Jan Kleinnijenhuis, Nel Ruigrok
Abstract: This paper investigates the influence of news agency ANP on the coverage and diversity of political news in Dutch national newspapers, using computational text analysis. We analyzed the influence on print newspapers across three years (1996, 2008 and 2013) and compared influence on print and online newspapers in 2013. Results indicate that the influence of ANP on print newspapers only increased slightly. Online newspapers, however, depend heavily on ANP and are highly similar as a result of it. We draw conclusions pertaining to the gatekeeping role of news agencies in the digital age in general, and in the context of the Netherlands in particular. Additionally, we demonstrate that techniques from the field of information retrieval can be used to perform these analyses on a large scale. Our scripts and instructions are provided online to stimulate the use of these techniques in communication studies.
After almost 10 years I’m giving a talk at CLIN (Computational Linguistics in the Netherlands) again. I completely rewrote the clause code from Python to R, which is quite exciting as it will make it much easier to tweak and add rules “client-side”; see github.com/vanatteveldt/rsyntax. I also did a new validation, comparing the results to a new gold standard of manually coded aggressive actions in the 2009 Gaza war. In addition, I compare the results to a “word order co-occurrence” baseline that assumes that the leftmost actor is the agent (subject). Results show convincingly that word order is indeed a very fragile cue in conflict situations.
I also re-evaluated the source extraction, comparing it to a baseline that uses the same speech verbs and assumes that an actor to the left of the speech verb is the source and that the text to the right of the speech verb is the quote. The evaluation shows that recall is the same for both methods (both miss more ‘subtle’ ways of expressing quotes), but precision is extremely good for the syntactic method while being mediocre for the baseline.
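As a rough illustration of the word-order baseline for source extraction described above (not the code used in the evaluation), the following base R sketch takes the nearest actor to the left of the first speech verb as the source and everything to the right of that verb as the quote. The sentence, the role column, and the helper baseline_source are all made up, and the sketch assumes the tokens are sorted by position.

```r
# Made-up sentence; 'role' marks actors and speech verbs, as an earlier
# tagging step might have done
sent = data.frame(
  position = 1:7,
  term = c("Hamas", "said", "that", "Israel", "attacked", "Gaza", "first"),
  role = c("actor", "speech", NA, "actor", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the word-order baseline.
# Assumes rows are sorted by position.
baseline_source = function(tokens) {
  verb = which(!is.na(tokens$role) & tokens$role == "speech")
  if (length(verb) == 0) return(NULL)   # no speech verb, so no quote
  verb = verb[1]
  is_actor = !is.na(tokens$role) & tokens$role == "actor"
  left_actors = which(is_actor & tokens$position < tokens$position[verb])
  if (length(left_actors) == 0) return(NULL)   # no candidate source
  is_right = tokens$position > tokens$position[verb]
  list(source = tokens$term[max(left_actors)],
       quote = paste(tokens$term[is_right], collapse = " "))
}

res = baseline_source(sent)
```

A rule like this ignores syntax entirely, so any actor that merely happens to stand left of a speech verb is treated as a source, which is consistent with the lower precision reported for the baseline.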
In my presentation I will show these results as well as a number of substantive results on the differing biases of Chinese and American newspaper coverage of the 2009 Gaza war. Results show that Chinese newspapers quote Hamas much more frequently and also portray Hamas less as an aggressor.
More visually, the following shows side by side the actions of Israel according to the US and Chinese media. You can clearly see that the US coverage focuses on aggression towards Hamas and emphasises the reasons for the attack (goal discourse), while the Chinese coverage focuses on the more civilian Gaza and emphasises the attacks themselves (means discourse).
(Israeli actions, Left: US newspapers; right: Chinese newspapers. Network shows co-occurrence based semantic network of all words in predicates with Israel as subject that are overrepresented in the respective country)