Computational Social Science on the cheap using Twitter

This is a follow-up to my post Computational lexicography on the cheap using Twitter, but more especially a response to Using off-the-shelf software for basic Twitter analysis.

The latter article shows how to use database software (MySQL and its implementation of the SQL language) to do basic Twitter analysis. The ‘basic analysis’ includes counts by hashtag, timelines, and word clouds. They analyse about 475k tweets.

But here’s the thing: all their analyses can be done more simply with plain text files and pipes of Unix commands (as most eloquently demonstrated in Unix for Poets, by Ken Church). In fact, a handful of simple commands, commands I use every day, are powerful enough to do the kind of analyses they discuss.

Getting the data

(You can skip over this if you have data already!)

Interestingly, they do not show how to get the tweets in the first place. My previous post discusses this, but it might be useful to show a simple Ruby program that collects tweet data, especially since the method has changed slightly since that post. The biggest hurdle is setting up authentication to access Twitter’s data (discussed in full here): the crucial thing is that you have to register as a Twitter developer, register a Twitter application, and get special tokens. You create an application at the Twitter apps page; from that same location you generate the special tokens.

Here’s the Ruby script (also listed here).

require 'rubygems'
require 'tweetstream'
require 'date'

TweetStream.configure do |config|
  config.consumer_key = ''
  config.consumer_secret = ''
  config.oauth_token = ''
  config.oauth_token_secret = ''
  config.auth_method = :oauth
  config.parser   = :json_gem
end

# Change the words you want to track
TweetStream::Client.new.track('football', 'baseball', 'soccer', 'cricket') do |status|
  begin
    # The Tweet id
    id = status.id
    # The text of the tweet, with new lines (returns) replaced by spaces
    txt = status.text.gsub(/\n/, " ")
    # The date and time of the tweet, printed out in a slightly more useful
    # form for our purposes (date and time as separate, tab-separated fields)
    d = DateTime.parse(status.created_at).strftime("%Y-%m-%d\t%H:%M:%S")
    puts [id, txt, d].join("\t")
  rescue Exception => e
    puts "!!! Error: #{e.to_s}"
  end
end

With the proper keys and secrets, this gist will allow you to track keywords over time and print out, in a tab-separated format, the tweet id, the text of the tweet, the date, and the time it was published (in UTC, or Greenwich, time). You could add additional columns, as described (by example) in the Twitter API.

The example here tracks mentions of football, baseball, soccer, and cricket, but obviously these could be other keywords. Running it with this command:

ruby track_tweets.rb | tee nsports.tsv

will place tweets in the file ‘nsports.tsv’.
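
To sanity-check the output format, you can count the tab-separated fields on the first few lines with awk; it should print 4 for each line (assuming no stray tabs inside the tweet text):

$ head -3 nsports.tsv | awk -F'\t' '{print NF}'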

Basic statistics

Counting the number of football, baseball, etc. mentions is easy:

$ grep -i football nsports.tsv | wc -l
$ grep -i baseball nsports.tsv | wc -l
$ grep -i soccer nsports.tsv | wc -l
$ grep -i cricket nsports.tsv | wc -l

We can also get the total number of lines in the file:

$ cat nsports.tsv | wc -l
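
If you want all four counts at once, a small shell loop works too; a minimal sketch, assuming a bash-style shell:

$ for sport in football baseball soccer cricket; do echo -n "$sport "; grep -ic $sport nsports.tsv; done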

The second analysis was to count who is retweeted the most, done by counting the username after the standard Twitter “RT ” marker (e.g. “RT @willf good stuff!”). The following pipeline of commands accomplishes this simply enough:

egrep -io "rt +@\w+" nsports.tsv | perl -pe "s/ +/ /g" | cut -f2 -d' ' | sort | uniq -c | sort -rn | head

(This may be easier to copy from here.) Each of these is a separate command, and the pipe symbol (|) indicates that the output from one command is passed on to the next. Here’s what these commands do:

  1. egrep -io "rt +@\w+" nsports.tsv — searches through the tweets for the pattern RT, then one or more spaces, then @ and a name of one or more ‘word’ characters. It prints only the matching parts (-o) and ignores differences in case (-i).
  2. perl -pe “s/ +/ /g” — I noticed that from time to time, there is more than one space after the ‘RT’, so this substitutes one or more spaces with exactly one space.
  3. cut -f2 -d' ' — each line now looks like “RT @name”, and this command ‘cuts’ the second field out of each line, using a space as the delimiter. This results in each line looking like ‘@name’.
  4. sort | uniq -c | sort -rn — this is three commands, but I type them so frequently that they seem like one to me. It sorts the lines so they can be counted with the uniq command, which produces two columns: the count and the name; then we reverse sort (-r) numerically (-n) so the most-retweeted names come first.
  5. head — this shows the first ten lines of the output, i.e. the ten most retweeted users.

This command pipeline should have no problem handling 475k lines.
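
The same pattern covers the counts by hashtag mentioned at the start; a sketch, assuming hashtags are simple runs of ‘word’ characters, with tr folding everything to lower case so that #Football and #football are counted together:

egrep -o "#\w+" nsports.tsv | tr "A-Z" "a-z" | sort | uniq -c | sort -rn | head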

The third analysis was to put the data into a format that Excel can use to create a graph of counts by day. Because we have printed the date and time in separate columns, with the date in column 3, we can simply do the cut, sort, uniq series:

cat nsports.tsv | cut -f3 | sort | uniq -c > for_excel.tsv

This will put the data into a format that Excel can read.
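
If you want a separate timeline per keyword (say, one series per sport), the same series works on a grep-filtered file; a sketch, with a placeholder output file name:

grep -i football nsports.tsv | cut -f3 | sort | uniq -c > football_by_day.tsv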

Finally, the authors show how to create Wordle word clouds, both overall and for each category. I’m not a big fan of these as a data exploration tool, but note that you can use cut -f2 to extract the text to paste into Wordle.
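
For example, to pull out just the tweet text, overall or for a single category (the output file names here are only placeholders):

$ cut -f2 nsports.tsv > all_text.txt
$ grep -i cricket nsports.tsv | cut -f2 > cricket_text.txt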

So, this is computational social science on the cheap using Twitter, using some basic Unix commands (cat, cut, sort, uniq, grep), with one tiny, tiny call to Perl. You can do this too, and it’s easier to learn than MySQL and SQL! Plus, you can easily read the text files that are created. All of this was done on a standard Mac, but any Unix machine, or a Windows machine with the Cygwin tools installed, can do this as well.
