Computational lexicography on the cheap using Twitter

Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log), and you want to collect, oh, 10,000 examples or so and do some concordance work, for example:

What iss the most popular question then? Tweet   the answer and hopefully u may only get asked 500 times?
what is there to                         tweet   about this morning?
What is your biggest food weakness?      Tweet   @Thintervention for motivation! #thinterventionG

This is simple to do with a bash command line, perl, Ruby, the Tweetstream gem, and a spreadsheet program (or just plain old grep).

To download 10,000 tweets containing “tweet,” “tweets”, or “tweeting” and save them in a file called “tweet.tweets”:

    require 'rubygems'
    require 'tweetstream'

    @client ='user', 'pass')"tweet.tweets", "w+") do |f|
      n = 0
      @client.track('tweet', 'tweets', 'tweeting') do |s|
        n += 1
        @client.stop if n >= 10000
        f.puts s.text
      end
    end

When these are finished downloading, you can tab separate the contexts using perl, and sort on the right context:

> cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' | sort -f -t$'\t' -k2,3 > tweets.txt
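If you would rather stay in Ruby than shell out to perl and sort, the same step can be sketched as a small helper. This is just one way to do it, not part of the original pipeline; it marks only the first keyword occurrence per line (a simplification of the perl `/g` flag) and sorts case-insensitively on the keyword and right context, like `sort -f -k2,3`:

```ruby
# Tab-separate the keyword and sort on the fields to its right,
# case-folded -- a Ruby stand-in for the perl | sort pipeline.
def concordance_sort(lines)
  lines.map { |l| l.chomp.sub(/\b(tweet|tweets|tweeting)\b/i, "\t\\1\t") }
       .grep(/\t/)                                   # drop lines with no match
       .sort_by { |l| l.split("\t", 2).last.downcase }
end

# Usage:
# sorted = concordance_sort(File.readlines('tweet.tweets'))
# File.write('tweets.txt', sorted.join("\n"))
```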

You can then import this file into your spreadsheet program and slice and dice to your heart’s content.

Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!
