Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log), and you want to collect, oh, 10,000 examples or so and do some concordance work, for example:
What iss the most popular question then? *Tweet* the answer and hopefully u may only get asked 500 times?
what is there to *tweet* about this morning?
What is your biggest food weakness? *Tweet* @Thintervention for motivation! #thintervention
This is simple to do with the bash command line, Perl, Ruby, the TweetStream gem, and a spreadsheet program (or just plain old grep).
To download 10,000 tweets containing “tweet,” “tweets,” or “tweeting” and save them in a file called “tweet.tweets”:
@client = TweetStream::Client.new('user', 'pass')

File.open("tweet.tweets", "w+") do |f|
  n = 0
  @client.track('tweet', 'tweets', 'tweeting') do |s|
    n += 1
    @client.stop if n >= 10000
    f.puts s.text
  end
end
When these are finished downloading, you can tab-separate the contexts using Perl, and sort on the right context:
cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' | sort -f -k2,3 -t$'\t' > tweets.txt
You can then import this file into your spreadsheet program and slice and dice to your heart’s content.
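If you’d rather skip the spreadsheet, one typical “slice and dice” step can be sketched in a few lines of Ruby: tally the word immediately to the right of the keyword. This assumes each line of tweets.txt has the tab-separated shape produced above (left context, keyword, right context); the method name and structure here are illustrative, not from the original pipeline.

```ruby
# Count the most frequent word immediately following the keyword.
# Each input line is assumed to be: left-context \t keyword \t right-context
def top_right_collocates(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    _left, _keyword, right = line.chomp.split("\t", 3)
    next unless right
    word = right.strip.split(/\s+/).first
    counts[word.downcase] += 1 if word
  end
  counts.sort_by { |_, n| -n }.first(limit)
end

# Example on two concordance lines (in practice: File.readlines("tweets.txt")):
sample = [
  "What is there to\ttweet\tabout this morning?",
  "hopefully u may only get asked 500 times? what is there to\ttweet\tabout"
]
top_right_collocates(sample).each { |w, n| puts "#{w}\t#{n}" }
# prints: about	2
```

Swap the tallying line to look at the last word of the left context instead and you get left-hand collocates the same way.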
Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!