Category Archives: Search technology

Using Ngram data for segmentation

I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan (formerly of Powerset, now at Twitter) about hashtags, in case you want to come up to speed on what they are. It’s a fun read in any case, including the origin story of hashtags.

Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the most likely segmentation. Here are some hashtags in my timeline from today, and their segmentations:

#  > segment("bpcares")
#  => ["bp", "cares"]
#  > segment("Twitter")
#  => ["Twitter"]
#  > segment("writers")
#  => ["writers"]
#  > segment("iamwriting")
#  => ["i", "am", "writing"]
#  > segment("backchannel")
#  => ["back", "channel"]
#  > segment("tcot")
#  => ["tcot"]
#  > segment("vacationfallout")
#  => ["vacation", "fall", "out"]

The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) using the Web-based service for the unigram and bigram data, and (2) a small optimization (perhaps): only considering splits whose first part reaches a certain probability threshold ([“vacationf”, “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely).
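If you just want the shape of the algorithm without opening the repo, here is a minimal sketch. It scores with unigrams only (the real code uses the bigram data as well), uses a hand-rolled memo hash where the real code uses the memoize gem, and the threshold constant and toy probabilities are made up for illustration:

MIN_FIRST_LOGP = -7.0  # made-up threshold: skip splits whose first word is very unlikely

def unigram_logp(word)
  # Toy log10 probabilities; the real code queries the Bing unigram model.
  { 'vacation' => -4.0, 'fall' => -3.5, 'out' => -2.5 }.fetch(word, -10.0)
end

def splits(text, max_word_len = 12)
  # Every (first word, remainder) pair, capping the first word's length.
  (1..[text.length, max_word_len].min).map { |i| [text[0...i], text[i..-1]] }
end

def segment(text)
  return [] if text.empty?
  @memo ||= {}  # each suffix is segmented at most once
  @memo[text] ||= begin
    candidates = splits(text).filter_map do |first, rest|
      # the "(perhaps)" optimization: prune unlikely first words early
      next if unigram_logp(first) < MIN_FIRST_LOGP
      [first] + segment(rest)
    end
    # keep the candidate with the highest total log probability
    candidates.max_by { |words| words.sum { |w| unigram_logp(w) } } || [text]
  end
end

p segment('vacationfallout')  # => ["vacation", "fall", "out"]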

The code would be better if it batched the calls for probabilities rather than requesting them one by one. Norvig’s code also has the advantage of running offline; on the other hand, the Bing data is more recent.

Anyway, enjoy the code: it can be found in my GitHub account:

http://github.com/willf/microsoft_ngram/blob/master/examples/segment.rb

It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.

Ruby project to access Microsoft Ngram data

I am pleased to announce the availability of a Ruby library to access the Microsoft Ngram data. This data currently includes 1- through 4-gram data for anchor text and page titles, and 1- through 3-gram data for body text and queries, all collected in June 2009. See the Bing/MSR Ngram data page for general information. Although I am a Microsoft employee, this software is provided by me, not Microsoft.

Microsoft provides a SOAP API, and a Python REST-based library, but this is (I think) the first Ruby library. You can get a copy at Github.com: http://github.com/willf/microsoft_ngram.

I hope, in the days to come, to write some example programs that show the power of this data resource for research. But, to whet the appetite: should we parse “Boston cream pie” as [Boston [cream pie]] or [[Boston cream] pie]? That is, a cream pie made in the Boston style, or a pie made with Boston cream? If the former, “cream pie” should be more frequent than “Boston cream”; if the latter, the opposite.

> MicrosoftNgram.new(:model => 'bing-body/jun09/2').jps(['boston cream', 'cream pie'])
=> [["boston cream", -7.231685], ["cream pie", -6.027882]]

These are log probabilities. This model suggests [Boston [cream pie]] is the correct bracketing.
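Wrapped up as a helper (the bracket function below is hypothetical, not part of the library), the decision rule is just a comparison:

# Hypothetical helper: choose a bracketing for a three-word compound by
# comparing the joint log probabilities of the two possible inner pairs.
require 'microsoft_ngram'

def bracket(w1, w2, w3, model)
  left, right = model.jps(["#{w1} #{w2}", "#{w2} #{w3}"]).map(&:last)
  left > right ? "[[#{w1} #{w2}] #{w3}]" : "[#{w1} [#{w2} #{w3}]]"
end

model = MicrosoftNgram.new(:model => 'bing-body/jun09/2')
puts bracket('boston', 'cream', 'pie', model)  # => [boston [cream pie]]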

Alumin(i)um

I enjoyed reading the Wikipedia page about its “lamest edit wars.” One of these edit wars was over whether the article on what Americans call “aluminum” and Brits call “aluminium” should have “Aluminum” or “Aluminium” as its canonical title. And one of the arguments presented in favo(u)r of “Aluminum” was that Google returns more hits for the US spelling than for the UK spelling. But “Ghits” are notoriously unreliable (as are Bing hits and the other engines’ counts), since the number of search results reported is subject to lots of factors, none of which is tied directly to the actual number of matching documents.

However, Bing (my employer) has recently provided programmatic access to its ngram data (frequency statistics for word sequences) gathered from web pages, query logs, and anchor text (the text inside links). And I can safely say that the US spelling is much more frequently used. Here is the actual data, based on the June 2009 data release:

Source        P(Aluminum)   P(Aluminium)   Ratio (US:UK)
Body text     0.00852       0.00487        1.76
Anchor text   0.00727       0.00426        1.70
Query text    0.00974       0.00483        2.01

So, as a data point: “aluminum” is around twice as frequent as “aluminium” on the Web.
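For the curious, the lookups behind the table are short; here is a sketch. I am guessing at the anchor and query model names from the naming pattern (‘bing-body/jun09/1’ is the body-text unigram model), and jps returns log10 probabilities as in the earlier example:

require 'microsoft_ngram'

# Anchor and query model names below are my guess at the naming pattern.
%w[bing-body/jun09/1 bing-anchor/jun09/1 bing-query/jun09/1].each do |name|
  model = MicrosoftNgram.new(:model => name)
  us, uk = model.jps(%w[aluminum aluminium]).map(&:last)
  # jps returns log10 probabilities, so the US:UK ratio is 10 ** difference
  printf("%-20s ratio %.2f\n", name, 10 ** (us - uk))
end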

“buy a house, sell a home?” redux

A while ago, I wrote about whether there was evidence that people tend to “buy a home and sell a house,” and used Google’s n-gram data (based on web documents) to suggest this wasn’t the case. I happened to come across that post today, and wondered whether some of the query streams I now have access to might say something about this. I looked at how frequently “sell” or “buy” co-occur with “house” or “home” in a stream of about 36 million queries.
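The counting itself is nothing fancy; here is a sketch of the idea, treating each query as a bag of words (the query-log file name is made up):

# Count queries containing both a verb and a noun of interest.
counts = Hash.new(0)
File.foreach('queries.txt') do |line|  # hypothetical log, one query per line
  words = line.downcase.split
  %w[buy sell].each do |verb|
    %w[house home].each do |noun|
      counts[[verb, noun]] += 1 if words.include?(verb) && words.include?(noun)
    end
  end
end
counts.each { |(verb, noun), n| puts "#{verb} & #{noun}: #{n}" }

On the query stream described above, the counts and ratios come out as follows: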

             home   house   home:house
buy           393     525         0.74
sell          396     420         0.94
buy + sell    789     945         0.83

Unlike in the document n-gram data, people are more likely to use “house” in queries that include “buy” or “sell” (taken separately or together). This may indicate that people searching for information on real estate tend to use “house,” while people advertising real estate tend to use “home” (“sell a home” was over 7.5x more likely than “sell a house” in the document-based ngram data). As far as searchers go, they tend to “buy a house” and “sell a house.”

Google’s Super Bowl ad

I really enjoyed Google’s Super Bowl ad, in which a love story is told as a series of search queries:

  • study abroad paris france
  • cafes near the louve (sic)
  • translate tu es très mignon
  • impress a french girl
  • chocolate shops paris france
  • what are truffles
  • who is truffaut
  • long distance relationship advice
  • jobs in paris
  • AA120
  • churches in paris
  • how to assemble a crib

I was curious how Google would do “in real life” on these queries, as well as how Bing (my employer) would do. Not surprisingly, Google does well on all of them. I am pleased to report that Bing does well, too, although I have to admit that the results for the translate and “AA120” (a flight search) queries are not quite as succinctly presented (yet!) as Google’s. But all of the general and “local” queries (even the one with “Louvre” misspelled) are every bit as good as Google’s, and sometimes better presented. “Churches in paris,” I think, is nicer on Bing, showing images of churches before the “local” results.

At this point, I’d claim that Bing really is as good as or better than Google for general search. That’s not based on this ad, of course, but on my daily use of both.

I also notice that the link clicked in the video for “how to impress a french girl” is now welcoming people who saw the Google ad.

Thoughts on the Microsoft acquisition

(The usual disclaimers: my opinion only, not my current or future employers’)

When Powerset began a couple of years ago, a lot of commentators called us (and still do call us) a would-be Google killer. This, despite repeated comments by senior staff that this wasn’t what we were about. As a company, Google is hard to beat. Our goal was audacious, but not that audacious. Our goal was to build a better search experience: to use natural language technology to provide better search results, through a better understanding of both web documents and user queries.

But natural language technology has always only been part of the mix. We have, from the beginning, seen ourselves as doing “keywords plus”; that is, we have always planned to do what the other search engines do (keyword search, link analysis, blah, blah, blah), but add on top of this signals coming from parsing and semantic understanding. For example, we’d like to do as good a job as Google (say) on queries like ‘powerset microsoft’, but do even better on queries such as ‘Who acquired Powerset?’ and ‘Which company did Microsoft just buy?’ and everything in between.

What I didn’t realize when I joined the company is how some of the same technology would create innovations in the user interface, too. Powerset’s ‘Factz’ are a nice addition to the standard search page, and our ‘snippets’ are the best in the business. When I first typed in ‘stars of BSG‘ in the Powerset search box, I was floored by the beauty of the results.

So I think we met our audacious goal: a better search experience. Microsoft seems to think so; after all, they bought the company.

And here’s the thing: we were bought by Microsoft. Microsoft’s market cap is still 90 billion dollars greater than Google’s. If anyone is able to capitalize a little ol’ startup like Powerset to make us a big player in search, it’s Microsoft. In fact, it’s clear (to me at least) that we have a new mission, which is just the old mission the pundits wrongly labeled us with at the start: as a search company, our mission is now to beat Google.

Interesting times ahead.

You ain’t seen nothing yet

[Figure: drawings from Nathan Stubblefield’s patent application for a wireless telephone]

Who invented the wireless telephone?

It’s been 100 years since “American inventor and Kentucky melon farmer” Nathan B. Stubblefield received the patent for the first wireless telephone (UK Telegraph article, via Mirabilis).

Today, the company I work for, Powerset, launched its search product for general public use at powerset.com. It’s a really cool search and browsing engine for Wikipedia, with lots of information gleaned from the Freebase project as well.

Reading about Mr. Stubblefield made me want to know about other inventors who were melon farmers. Searching for “inventors who raise melons” does, in fact, return Powerset’s republished Wikipedia page about Mr. Stubblefield, with its first sentence helpfully highlighted. And, as it turns out, it returns lots of other inventors who raised melons, including (of course) melon researchers, with nice results about watermelon, muskmelon, and galias.

Most people, even Peter Norvig (Google’s head of research), agree that search is in its early days. Google, Yahoo!, Live, Ask, and the other search engine companies have led the way, and Powerset is adding a new set of signals, based on principles from natural language understanding and knowledge representation, to the mix. (And, note, these are additional signals; no one from Powerset has ever claimed that current search signals, such as the presence of keywords or page rank, are unimportant.) These are relatively early days for Powerset, and early days for search. Early, and exciting, days.

And that makes me wonder: Who said, “You ain’t seen nothing yet?”