Big data workshop @ Glasgow

This Monday I will be presenting a paper at the Big Data workshop organized by Philip Habel and Sarah Birch at the University of Glasgow.

Links: [Paper | slides | visualization | topics | r-toolkit]

LDA models topics… But what are ‘topics’? 
The relation between LDA topics and traditional measures of issue, frame, and valence
Wouter van Atteveldt, Kasper Welbers, Carina Jacobi, Rens Vliegenthart

Abstract

LDA topic modeling is a popular technique for unsupervised document clustering. However, the utility of LDA for analysing political communication depends on being able to interpret the topics in theoretical terms. This paper explores the relation between LDA topics and content variables traditionally used in political communication. We generate an LDA model on a full collection of front-page articles of Dutch newspapers and compare the resulting LDA topics to a manual coding of the political issues, frames, and sentiment.

In general, we find that a large number of topics are closely related to a specific issue, and that the different topics that comprise an issue can be interpreted as subissues, events, and specific journalistic framing of the issue. The relation between frames and topics is less direct, with a large number of topics associated with each of the investigated frames, while no topics were identified that really encoded just a specific frame. Finally, hardly any topic had a clear sentiment associated with it; the only exceptions were topics whose sentiment is contained in the represented issue, such as disasters. These results validate the use of LDA topics as proxies for political issues, and pave the way for a more empirical understanding of the substantive interpretation of LDA topics.

Posted in papers, Uncategorized

AmCAT 3.3.1 preview release

We’ve updated preview.amcat.nl to a new version that we hope to be able to release shortly.

Major changes are:
- uploading articles now gives a progress bar (like the export coding job)
- the tables have been updated in a major way:
  – You no longer scroll to see more results; instead, you either click next page or select a larger number of rows per page
  – The export to csv/excel has been moved to a button on the top of the table
  – You can now select rows using control+shift click. We expect to use this in a lot of places, e.g. for adding/removing selected articles to a set, deleting or archiving multiple article sets, etc. Currently, its main use is in exporting coding jobs: you select a bunch of jobs in the coding job overview, and then click export jobs to export those jobs. This replaces the much too small coding job selection window.
  – You can now export articles from the article list view in the query screen by selecting export as -> html. If ‘text’ is selected as one of the columns it will also be exported.
  – Keyword in context has been cleaned up
As always, all bug reports and feature requests are highly appreciated.

Posted in amcat, Uncategorized

Hands-on exercises posted to GitHub

I just reorganized and posted all the hands-on exercises used in the R course to a new GitHub repository:

vanatteveldt.com/r

Feel free to share the link!

Posted in Uncategorized, workshops

R Course @VU Amsterdam

I will be teaching an R course at the VU next week. This page will contain the slides and additional information for the hands-on session.

Day 1: R basics and data manipulation

Links: [slides | data | exercises ]

The aim of the first day is to get acquainted with R and the R environment; get started on reading, manipulating, and analysing data; and get started with descriptive analyses.

Topics:

  • Getting started
    • Principles of R
    • RStudio, knitr, scripts, projects
    • How to get help?
  • Your data in R
    • Data types: data frames, vectors, lists, …
    • Reading and writing data: CSV, SPSS and other sources
    • Simple descriptives: summary, table, …
    • Selecting, sorting, and calculating data

During the instruction period, practical assignments will be given to attain proficiency with the studied techniques and commands. After the instruction period, participants who take the course for credit should work on the research projects and complete assignment 1 (see below).

Day 2: Data transformations in R

Links: [slides | data |  exercises ]

Combining and transforming data

  • Combining data: cbind, rbind, merge and match
  • Transforming data: aggregate and reshape

Day 3: Statistics and visualizations in R

Links: [slides | data |  exercises ]

The goal of the second and third days is to become comfortable with getting the data into the shape you want; to do traditional inferential statistics on these data; and to get started with making plots and charts in R.

Topics:

  • Simple statistics in R
    • Tabulating data
    • Correlation, t-tests, ANOVA, regression
  • Visualizing data:
    • plot, barplot, hist
    • Customizing plots one line at a time
    • Multiple plots

As on day 1, practical assignments will be given during the instruction period to attain proficiency with the studied techniques and commands. After the instruction period, participants who take the course for credit should work on the research projects and complete assignment 2 (see below).

Day 4: Advanced topics in R

Links: [slides |  handsons ]

The goal of the last day is to give a sampling of some more advanced techniques that are possible with various packages in R.

Topics (depending on time, some topics may be dropped in favour of more in-depth discussion of the other topics):

  • Advanced statistics
    • Multilevel modeling
    • VAR and ARIMA models
  • Text and Network Analysis in R

As on the other days, practical assignments will be given during the instruction period to attain proficiency with the studied techniques and commands. After the instruction period, participants who take the course for credit should work on the research projects to work towards their final paper and presentation (see assignment 3, below).

Posted in Uncategorized, workshops

CCCT Seminar talk

I will be giving a talk at 16:00 today at the monthly seminar of the Center for Creation, Content and Technology at the UvA (Science Park 904).

My presentation (download) will be partly based on my paper at the 2013 Text as Data conference and on my 2014 MPSA paper.

Wouter van Atteveldt (VU Amsterdam)
Using grammatical analysis and LDA topic models to study politicians’ media statements.

To understand the relation between media and politics, it is necessary to study the content of politicians’ statements in the news. I use transformation rules on the syntactic structure of text to extract quotes, and build a topic model on the content of these quotes and statements in Parliament. This allows me to determine how often politicians are quoted and whether their media statements are similar to their statements in parliament. While media attention mostly follows political power, this is quite different for media statements. The frequency of statements is a matter of journalistic demand (e.g. high during scandals) and political supply (e.g. low during closed-door negotiations). Media statements are most similar to political discourse during the campaign, and for limited-issue parties. Two results stand out: the anti-immigration PVV was relatively dissimilar during the campaign, and possible coalition partners were relatively dissimilar during the coalition talks.

Posted in papers, Uncategorized

Pre-ICA AmCAT workshop: 21 May, UW, Seattle, 1-5pm

Update: download slides | hands-on exercises

Many thanks to Patricia Moy for helping me organize an AmCAT workshop to be held at the University of Washington in Seattle on the 21st of May, right before the start of the ICA conference in the same city.

The workshop is open to all and free of charge, but please mail me at wouter@vanatteveldt.com if you plan to attend.

The program will run from approximately 1-5 PM and include:

  • Beginning AmCAT (approx. 1-3 PM)
    • What is AmCAT? Can I use AmCAT?
    • Getting started: project management, adding your texts to AmCAT, users
    • Automatic coding: using keyword queries, improving queries, estimating validity, using codebooks.
    • Manual coding: designing codebooks and coding schemas, coding, extracting results
  • Intermediate AmCAT (approx. 3-5 PM)
    • The AmCAT API: how to access your data from Python/R
    • Scraping / uploading articles from other data sources
    • Topic modeling and machine learning on AmCAT data

The workshop is open to all who are interested. No prior knowledge of AmCAT is required, and the intermediate workshop should be understandable after the beginner’s workshop. As the intermediate workshop is about accessing AmCAT from Python and R, it will be interesting mainly to those with some (statistical) programming experience.

Participants who are not interested in the ‘intermediate’ part are invited to use that time to practice using AmCAT with the hands-on exercises that will be distributed with the workshop, or of course with their own data. We will be around to answer questions and provide advice.

Please bring a laptop to the workshop in order to take part in the interactive sessions. 

We will presumably head to a place that serves drinks and/or dinner after five; all are invited.

Please also see http://vanatteveldt.com/index.php/amcat-api-howto/ for documentation relevant to using AmCAT from R or Python.

Posted in amcat, Uncategorized, workshops

AmCAT API howto’s for Python and R (and an extra workshop)

Inspired by the interest generated at the workshops over the last few weeks, I’ve written a number of howto documents for working with the AmCAT API:

  • Python scraping: A demo for a scraper using the AmCAT API that scrapes the (Creative Commons licensed) Wikinews site (thanks to Paul Huygen)
  • Python analysis: A simple demo script that downloads the (scraped) articles and counts all words.
  • R querying: A howto for using R to query AmCAT and retrieve metadata
  • R vocabulary and topic modeling: A howto for downloading term-document matrices, comparing them (to find typical vocabulary or collocates), and doing topic modeling
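
To give a rough idea of the word-counting step in the Python analysis howto, here is a minimal sketch. It is not the actual demo script: the function name count_words and the two in-memory example texts are made up for illustration, and the step that downloads articles through the AmCAT API is left out entirely.

```python
import re
from collections import Counter


def count_words(texts):
    """Count lowercased word tokens over an iterable of article texts."""
    counts = Counter()
    for text in texts:
        # \w+ is a crude tokenizer: runs of letters, digits and underscores
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical stand-in for article texts retrieved from an AmCAT article set
articles = [
    "The minister answered questions in parliament.",
    "Parliament debated the budget; the minister defended it.",
]

word_counts = count_words(articles)
for word, n in word_counts.most_common(5):
    print(word, n)
```

A real analysis would probably also drop stop words and use a proper tokenizer, but the counting itself does not get more complicated than this.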

There will also be an extra workshop on using grammatical analysis in AmCAT that will be held on 30th of April. Please let me know if you plan to attend (and haven’t already told me).

Posted in amcat, Uncategorized, workshops

AmCAT workshops@VU: 9 April (beginner), 16 April (advanced)

I will be giving two workshops for AmCAT users in April at the VU Amsterdam. The workshops are open to anyone interested, but please mail me if you want to attend either or both workshops.

Using AmCAT: April 9, 13:30 – 16:30, Location Metropolitan Z-009

On the 9th of April, I will help everyone who wants to get started with using AmCAT3. This workshop is aimed at scientists from the social sciences or humanities who want to get started with using digital text analysis methods. Existing users who have used AmCAT2 will also be interested to learn what is new and changed in this version.

Download Slides | Download hands-on exercises

Topics will depend on audience demand, but include:

  • What is AmCAT? Can I use AmCAT?
  • Getting started: project management, adding your texts to AmCAT, users
  • Automatic coding: using keyword queries, improving queries, estimating validity, using codebooks.
  • Manual coding: designing codebooks and coding schemas, coding, extracting results
  • Hands-on session. Please bring a laptop if you want to participate in the hands-on part of the workshop!

Advanced AmCAT: April 16, 13:30 – 16:30, Location Metropolitan Z-007

Download slides

This workshop is intended for people who are interested in the more technologically advanced capabilities of AmCAT. People who are interested in using AmCAT in conjunction with R or Python will also be interested. No specific technical knowledge is required to understand the workshop, but some experience with R will help a lot. Please bring a laptop if you plan to attend this session (I can provide one if needed).

Topics will depend on audience demand, but include:

  • The AmCAT API: how to access your data from Python/R
  • Scraping / uploading articles from other data sources
  • Topic modeling and machine learning on AmCAT data
  • Extracting quotes and statements: Grammatical analysis and graph transformations

Posted in amcat, Uncategorized, workshops

ISA and MPSA

I’m now in Toronto to attend ISA and will go to MPSA afterwards. Interestingly, both conferences have ‘big data’ panels that I will be presenting in:

ISA: Semantic Network Analysis of Frame Building during war: Mediated Public Diplomacy in Gaza, Georgia, and Iraq (presentation)

This paper is a work-in-progress describing an ongoing effort to automatically analyze the framing of conflict by media in third countries using Semantic Network Analysis. We study three conflicts: the 2003–2011 war in Iraq, the 2008 South Ossetian conflict, and the 2008–2009 Gaza War. For each conflict, we have manually analysed (public or private) messages of at least one of the belligerent parties to determine that party’s preferred framing of the conflict. By analysing these frames from a semantic network perspective, we show that there is a recurrent set of framing functions that are used by the parties in all three conflicts. Using transformation rules on the syntactic structure of sentences, these framing functions can then be automatically identified in newspaper coverage. Once these rules are finalized and evaluated properly, they will allow us to study frame building in international conflict in an automatic and transparent way, while retaining the rich semantics required by framing analysis.

MPSA: Quotes as Data:  Extracting Political Statements from Dutch Newspapers by applying Transformation Rules to Syntax Graphs (presentation)

To understand the relation between media and politics, it is necessary to study the content of politicians’ statements in the news. By using syntactic analysis and topic models, this paper looks at how often politicians are quoted, and whether their media statements are similar to their statements in parliament. While media attention simply follows political power, this is quite different for media statements. The frequency of statements is a matter of journalistic demand (e.g. high during scandals) and political supply (e.g. low during closed-door negotiations). Media statements are most similar to political discourse during the campaign, and for limited-issue parties. Two results stand out: the anti-immigration PVV was relatively dissimilar during the campaign, and possible coalition partners were relatively dissimilar during the coalition talks. This paper is a promising first step into the relatively understudied area of mediated politics.

Posted in papers, Uncategorized

AmCAT 3.3 released

Yesterday evening we released AmCAT 3.3. We are quite excited about this as we think that it is an important step towards making AmCAT more usable and more stable. Below you can find a summary of the improvements, the most obvious of which is the completely restructured UI, which we believe gives a cleaner and more modern look.
Although we have tested this version extensively while developing, this upgrade adds quite a number of features and has a completely refactored system for generating the website (reflected in the new UI), so it is quite possible that there are still some bugs. Please report any bugs or feature requests on the new issue tracker at https://github.com/amcat/amcat/issues.

The ‘amcatbook’ has also been updated and is available from https://www.dropbox.com/s/nnkhzlcuhza9f69/amcatbook.pdf (in Dutch). The manual can be accessed as before at http://amcat.vu.nl/news/index.php/for-users/; it has not been updated to the newest version, but most things should be quite similar.

For an up-to-date list of immediate bugs that we are aware of, please see the issue tracker and especially the page for milestone 3.3.01: https://github.com/amcat/amcat/issues?milestone=2&page=1&state=open. Feel free to browse the other milestones as well: 3.3.1 contains the first planned improvements to the new version, while milestone 3.4 contains the more long-term plans, first and foremost the query screen (https://github.com/amcat/amcat/issues/18 and https://github.com/amcat/amcat/issues/17).

Thanks for using AmCAT and thanks as always for reporting bugs and suggestions for improvement!

“The AmCAT team”

Improvements in 3.3:

  • Completely overhauled navigator UI. The new UI is a lot more standardized across pages and has less clutter, making it easier to use especially for new users.
  • Complete rewrite of the annotator which greatly improved performance when using large and/or many codebooks. Keyboard shortcuts and supported browsers have also changed, so please consult the “help” link in the annotator before coding. You can also indicate which part of a sentence you are coding if ‘subsentence’ is selected in the coding schema.
  • Queries and some actions are now run in the ‘background’, and the website displays a progress dialog. This makes the server less likely to become too busy and gives the user an indication of what is happening.
  • If you click on an article after querying, the matches for the query will be highlighted in the article text.
  • Authentication (rights and permissions) is now handled better than before. It is still quite possible that users are allowed to do things they shouldn’t, but most cases should be handled now. If you don’t see a button or you get a permission denied error, please check whether you have sufficient access to the project you are working in. If there are any permissions-related problems, please open an issue as normal.
  • Codebook handling. You can now export codebooks and all labels to Excel and various other formats, and import them from CSV. You can also update an existing codebook with new labels or structure from a CSV file.
  • Improved export. All tables now have Excel and SPSS export. Aggregations and Associations now have the correct field type for SPSS.
  • Performance improvements. Performance of complicated queries has improved a lot (this was backported to production around Christmas). Summary now makes a single call to get both total #hits and the top 10 hits.
  • Minor usability improvements, such as opening articles in a new page from the query screen, association interval and other options have been revamped, plain text uploader has a “text” field to bypass uploading a text, scripts and uploaders have better help text, …
  • API improvements, especially token based authentication and full support for query search and aggregate. See also https://github.com/amcat/amcat-r and http://amcat.nl/R/amcatr.pdf

Plans for 3.4

  • The biggest priority is the query screen. It is confusing that some options are only accessible once you have performed a specific query (https://github.com/amcat/amcat/issues/18). It is also annoying that changing e.g. association settings requires first requesting a new summary. We are also eager to add new functionality to the query screen, such as word clouds, new visualizations, links with coding jobs, etc.
  • Storing more state, especially storing queries and offering a list of recent queries, recent projects, etc. (https://github.com/amcat/amcat/issues/16 and https://github.com/amcat/amcat/issues/17)
  • Improving the API. There is a new ‘hierarchical’ API in place (i.e. where an articleset is located under its project at api/v4/projects/X/articlesets/Y) which also allows creation/modification. In 3.4 this new API will replace the old API, meaning that all old resources should be present in the new system and that security should be checked thoroughly, especially as the API also allows anonymous access. (https://github.com/amcat/amcat/issues/15)
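
To illustrate what ‘hierarchical’ means for client code, here is a small sketch that builds the address of an article set from its project and article set ids. Only the path pattern api/v4/projects/X/articlesets/Y comes from the description above; the helper name articleset_url and the host amcat.example.org are made up for the example, and no request is actually sent.

```python
from urllib.parse import urljoin


def articleset_url(host, project_id, articleset_id):
    """Build the hierarchical API address of an article set within a project."""
    # Path pattern taken from the text: api/v4/projects/X/articlesets/Y
    path = f"api/v4/projects/{project_id}/articlesets/{articleset_id}"
    # Normalise the host so it ends in exactly one slash before joining
    return urljoin(host.rstrip("/") + "/", path)


# Hypothetical host and ids, for illustration only
url = articleset_url("https://amcat.example.org", 1, 2)
print(url)
```

The point of the nesting is that the project is always part of the address, so permissions can be checked per project before anything below it is returned.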

If you have any suggestions for 3.4, please mail us or create an issue!

Posted in amcat, Uncategorized