[Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing]
Matt Cutts has a thoughtful follow-up article on the controversy over whether Bing is copying Google’s search results. There has been a little too much heat and not enough light in the discussion, and I appreciate reflective thoughts, and his shout out to us Bing engineers. Someday, we’ll have forgotten all about this controversy. I predict about two and a half weeks.
I’ve been thinking some more about this, too, and I’d like to write a bit more about the clickstream data. I won’t describe specific ways that the data is collected, although Bing has made it clear that it comes though the use of opt-in data provided by users. And I won’t describe all the content or format of what the clickstream contains. But let’s abstract what is obviously there: information about the search engine used, the query, the presented URLs (and their order), and the URLs that are clicked by the user. So, you can imagine that there is this stream of click data coming from a browser (I suggest only search engines here, Google, Bing and SearchCo, which will be a strawman for any search engine not as good as Google or Bing):
<Bing, “fish sticks”, http://fishsticks.com @1, clicked>
<Bing “fish sticks”, http://fishlips.com @2, unclicked>
<Google, “fish sticks”, http://fishsticks.com @1, clicked>
<Google, “fish sticks”, http://fishplanks.com @2, clicked>
<SearchCo, “fish sticks”, http://fishlips,com @1, unclicked>
<SearchCo, “fish sticks”, http://firesticks.com @2, unclicked>
and thousands and millions more of these (I have no idea of the number, actually, but it’s “large”).
A first question is to ask how these events are captured; given the variety of ways search requests are encoded, something, some process will have to do the decoding. Matt Cutts seems to think this is a priori bad reverse engineering (I guess), but it is done billions of times a day by all the search engine companies; the term of art is scraping, but it’s just using a practice of converting data from one form (a form outside the original intent of the provider) into a usable clickstream form. We all do this, all the time.
A second thing to note, and I think this is very important, is that the “goodness” of Google with respect to the clickstream is not evidenced because there is necessarily a choice to favor Google a priori. Rather, it is the users’ clicking on the links which provides the real evidence. The fake clickstreams above indicate users preferred Google’s particular search results (as evidenced by both results being clicked) to Bing’s (only one URL clicked) or to SearchCo (no URL clicked). The point is this: it is not that Bing is targeting Google, but that the evidence given by the users signal the good results provided by Google. As a bare fact, it is not the triple <Search Engine, Query, URL> triple that matters, but the triple <Query, URL, User Evaluation> that matters.
In the natural course of things, Bing’s search engine results gets better because users say so; when people like what Google (or Bing, or even SearchCo) gives them, that eventually shows up as selection and ranking preferences at Bing. In the end, when Bing’s search results look similar to Google’s search results, it’s because Google does a good job, and Bing’s learning algorithms do a good job; the learning algorithm doesn’t even have to know where the result came from to begin with. Bing won’t look as much like SearchCo just because SearchCo (mythical company, of course) doesn’t do as well. Also, of course, Google has had a huge market share, and so the preponderance of data comes from Google (although I have no real idea of the market share for the opt-in data Bing receives).
In yet other words, it’s all in the clicks. Or, at least, mostly in the clicks.
Time for a reminder: I’m a Bing developer, but I don’t work on ranking or selection.
I personally agree with Matt that software companies in general should do a better job of being transparent about what opting in means. I really don’t know what this should look like, though. For example, I use Chrome for a lot for personal web use—it’s so fast—and I know this helps, at some level, competitors to the people who pay my salary. But I have very little knowledge of the specific ways this happens—and perhaps then it doesn’t really matter. Just as I am glad that there is now a choice in search engines, so also I am glad that there are three or more good choices for browsers—which incidentally improve the aligned search engines. It just doesn’t seem that important to say anything more than anonymous data will be used to improve products, with a way to find out more details of the matter.
Finally, I want to point out that, even though the pages that Matt presents do have the same unique query/URL pairs for Bing and Google (and how that could happen is described above, my previous post, and in others’ posts on the web), the content on the pages is not the same. The titles differ in at least one case (alas for Bing, I think Google’s is better), and the captions are different (in general, I think Bing’s are better—but then I’m biased, because that is a large part of the work that goes on around me). Bing suggests alternative segmentation/spelling for some of the synthetic queries, and “related searches.” So, whatever the merit of Matt’s case as to what counts as “copying,” it’s important to note that much of the content differs.