[Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing]
Matt Cutts has a thoughtful follow-up article on the controversy over whether Bing is copying Google’s search results. There has been a little too much heat and not enough light in the discussion, and I appreciate his reflective thoughts and his shout-out to us Bing engineers. Someday, we’ll have forgotten all about this controversy. I predict about two and a half weeks.
I’ve been thinking some more about this, too, and I’d like to write a bit more about the clickstream data. I won’t describe specific ways that the data is collected, although Bing has made it clear that it comes through the use of opt-in data provided by users. And I won’t describe all the content or format of what the clickstream contains. But let’s abstract what is obviously there: information about the search engine used, the query, the presented URLs (and their order), and the URLs that are clicked by the user. So, you can imagine a stream of click data coming from a browser (I’ll use only three search engines here: Google, Bing, and SearchCo, which will be a strawman for any search engine not as good as Google or Bing):
<Bing, “fish sticks”, http://fishsticks.com @1, clicked>
<Bing, “fish sticks”, http://fishlips.com @2, unclicked>
<Google, “fish sticks”, http://fishsticks.com @1, clicked>
<Google, “fish sticks”, http://fishplanks.com @2, clicked>
<SearchCo, “fish sticks”, http://fishlips.com @1, unclicked>
<SearchCo, “fish sticks”, http://firesticks.com @2, unclicked>
and millions more of these (I have no idea of the actual number, but it’s “large”).
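To make the abstraction concrete, here is how one might model these events as records. This is a toy encoding of my own, purely for illustration; it is not Bing’s actual format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClickEvent:
    """One abstracted clickstream event (illustrative only)."""
    engine: str    # "Bing", "Google", or "SearchCo"
    query: str     # the query text
    url: str       # one presented URL
    position: int  # its rank on the results page (1-based)
    clicked: bool  # did the user click it?

events = [
    ClickEvent("Bing",     "fish sticks", "http://fishsticks.com", 1, True),
    ClickEvent("Bing",     "fish sticks", "http://fishlips.com",   2, False),
    ClickEvent("Google",   "fish sticks", "http://fishsticks.com", 1, True),
    ClickEvent("Google",   "fish sticks", "http://fishplanks.com", 2, True),
    ClickEvent("SearchCo", "fish sticks", "http://fishlips.com",   1, False),
    ClickEvent("SearchCo", "fish sticks", "http://firesticks.com", 2, False),
]
```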
A first question is how these events are captured; given the variety of ways search requests are encoded, some process has to do the decoding. Matt Cutts seems to think this is a priori bad reverse engineering (I guess), but it is done billions of times a day by all the search engine companies; the term of art is scraping, and it’s just the practice of converting data from one form (a form outside the original intent of the provider) into a usable clickstream form. We all do this, all the time.
A second thing to note, and I think this is very important, is that the “goodness” of Google with respect to the clickstream is not established by any a priori choice to favor Google. Rather, it is the users’ clicking on the links that provides the real evidence. The fake clickstreams above indicate users preferred Google’s particular search results (as evidenced by both results being clicked) to Bing’s (only one URL clicked) or to SearchCo’s (no URL clicked). The point is this: it is not that Bing is targeting Google, but that the evidence given by the users signals the good results provided by Google. As a bare fact, it is not the <Search Engine, Query, URL> triple that matters, but the <Query, URL, User Evaluation> triple.
In the natural course of things, Bing’s search engine results get better because users say so; when people like what Google (or Bing, or even SearchCo) gives them, that eventually shows up as selection and ranking preferences at Bing. In the end, when Bing’s search results look similar to Google’s search results, it’s because Google does a good job, and Bing’s learning algorithms do a good job; the learning algorithm doesn’t even have to know where the result came from to begin with. Bing won’t look as much like SearchCo, simply because SearchCo (mythical company, of course) doesn’t do as well. Also, of course, Google has had a huge market share, and so the preponderance of data comes from Google (although I have no real idea of the market share for the opt-in data Bing receives).
In yet other words, it’s all in the clicks. Or, at least, mostly in the clicks.
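That “all in the clicks” point can be sketched in code. Here is a toy aggregation (my own illustration, nothing like Bing’s real pipeline) showing that the learning step can rank URLs purely from user clicks, with the search-engine field discarded:

```python
from collections import Counter

# (engine, query, url, clicked) events; the engine field is present
# in the raw data but deliberately ignored below.
events = [
    ("Bing",     "fish sticks", "http://fishsticks.com", True),
    ("Google",   "fish sticks", "http://fishsticks.com", True),
    ("Google",   "fish sticks", "http://fishplanks.com", True),
    ("SearchCo", "fish sticks", "http://fishlips.com",   False),
]

# Count user endorsements per (query, url) pair -- no engine label kept.
votes = Counter((q, u) for _, q, u, clicked in events if clicked)

def ranked(query):
    """URLs for a query, ordered by how often users clicked them anywhere."""
    hits = [(u, n) for (q, u), n in votes.items() if q == query]
    return [u for u, n in sorted(hits, key=lambda p: -p[1])]

print(ranked("fish sticks"))
# -> ['http://fishsticks.com', 'http://fishplanks.com']
```

The clicked-twice URL rises to the top and the unclicked one never appears, without the system ever asking which engine showed it.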
Time for a reminder: I’m a Bing developer, but I don’t work on ranking or selection.
I personally agree with Matt that software companies in general should do a better job of being transparent about what opting in means. I really don’t know what this should look like, though. For example, I use Chrome a lot for personal web use—it’s so fast—and I know this helps, at some level, competitors to the people who pay my salary. But I have very little knowledge of the specific ways this happens—and perhaps then it doesn’t really matter. Just as I am glad that there is now a choice in search engines, so also I am glad that there are three or more good choices for browsers—which incidentally improve the aligned search engines. It just doesn’t seem that important to say anything more than anonymous data will be used to improve products, with a way to find out more details of the matter.
Finally, I want to point out that, even though the pages that Matt presents do have the same unique query/URL pairs for Bing and Google (and how that could happen is described above, in my previous post, and in others’ posts on the web), the content on the pages is not the same. The titles differ in at least one case (alas for Bing, I think Google’s is better), and the captions are different (in general, I think Bing’s are better—but then I’m biased, because that is a large part of the work that goes on around me). Bing suggests alternative segmentation/spelling for some of the synthetic queries, and “related searches.” So, whatever the merit of Matt’s case as to what counts as “copying,” it’s important to note that much of the content differs.
What amount and percentage of the search clicks you collect come from Google? You say you don’t know, but I think it matters a lot, as clicks are strongly correlated with ranking. And I am sure that, as a senior software engineer at Bing, you can find out how many clicks on Google searches you collected.
Anon, I’m sure I could find out, but it wouldn’t be information I’d likely be able to share in this forum.
Then I encourage you to do so; at least that will give you a realistic idea of how much of the “click stream” comes from Google. My bet is at least 70-80% of it, considering worldwide market shares.
Then you could ask someone on the ranking team how much this affects their results, and maybe write another blog post about it.
I think we can estimate the scale of the amount of clicks from Google.
IE 8 has 34% of the market [1]. Google serves 3 billion searches a day [2].
Ballpark, let’s figure that on the order of 1 billion searches per day come from IE8 on Google. Let’s assume that 90% of those users are opted in to Suggested Sites, since it’s checked on by default.
That’s 900M searches a day. Let’s guess that 66% of the time, a search yields a result click. Now we’re at about 600M query/click pairs a day.
Even if this estimate is off by an order of magnitude, that’s a lot of juicy data for machine learning algorithms to pick up. I think the 9 junk queries are a distraction in comparison.
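Spelled out, that back-of-envelope arithmetic is just (the inputs are the same cited figures and guesses as above):

```python
ie8_share = 0.34        # IE8's browser market share [1]
google_searches = 3e9   # Google searches per day [2]
optin_rate = 0.90       # guess: Suggested Sites opt-in rate (default on)
click_rate = 0.66       # guess: fraction of searches yielding a result click

searches_from_ie8 = google_searches * ie8_share    # ~1 billion/day
opted_in = searches_from_ie8 * optin_rate          # ~900 million/day
query_click_pairs = opted_in * click_rate          # ~600 million/day

print(f"~{query_click_pairs / 1e6:.0f}M query/click pairs per day")
```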
Will,
You mention that the clickstream data collected by Bing includes the list of “presented URLs (and their order)”, as you put it.
My understanding, just verified on Wikipedia, is that “A Clickstream is the recording of what a computer user clicks on while Web browsing…”. That shouldn’t include other page information, such as the links that were not clicked on.
Your description, Will, indicates that Bing collects clickstreams plus context, which is more than most people were aware of. You say this is obvious, but I don’t think it was obvious to many of Bing’s defenders, myself included.
Curiouser and curiouser. I think I’ll stop complaining about Google’s attacks on Bing, and let Microsoft defend itself.
Bob, but how would a system know which links are clicked on unless the “context” (i.e. the SERP) were collected?
Will,
Well, the links that were clicked are all in the clickstream itself (as that term is usually understood).
As for the context, I guess it depends on what context you’re talking about. If you want to capture the user’s entire experience, you’ll have to scrape the SERP.
If you simply want the context that says “the user chose this link after doing that search”, it’s all in the clickstream, as the combination of two clicks: the search itself, and the selection of a single search result. The information in these two clicks is all that would be needed for Bing to produce the Bing SERPs that Google displayed. In fact, if Bing captures Referrer information, it doesn’t need to capture that first click.
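As a sketch of that last point, assuming a typical results-page URL that carries the query in a q parameter (the actual logging details are unknown to me), a (query, URL) pair can be recovered from a single click plus its Referrer:

```python
from urllib.parse import urlparse, parse_qs

def pair_from_click(referer, clicked_url):
    """Recover a (query, url) pair from one result click, using the
    search page's URL carried in the HTTP Referer header."""
    qs = parse_qs(urlparse(referer).query)
    query = qs.get("q", [None])[0]  # 'q' is the usual query parameter
    return (query, clicked_url)

pair = pair_from_click(
    "http://www.google.com/search?q=tarsorrhaphy",
    "http://en.wikipedia.org/wiki/Tarsorrhaphy",
)
print(pair)  # ('tarsorrhaphy', 'http://en.wikipedia.org/wiki/Tarsorrhaphy')
```

Nothing about the unclicked links on the page is needed to produce that pair.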
It seems quite legitimate for Bing to capture the actual clickstream, i.e. the GET requests which arise directly from the user’s action. That information is generated by the user and should be theirs to share. But if Bing also captures the other links on the page, now it is taking information that was generated elsewhere and sent to the user; this raises new questions.
Will, your discussions on this subject are thoughtful, and I’d like to present you with my perspective. I’m a search quality engineer at Google, but, like you, I don’t speak for my employer in any capacity, and this is my personal opinion.
First, two personal notes: 1. I had a great internship at Microsoft search (then Live Search) in 2007 in Redmond building 88, and I think they’re good people. 2. I had absolutely nothing to do with the Google “sting” operation.
I work on synonyms. As discussed in this Google Blog post, we’re very proud of the awesome synonyms we produce, like knowing whether “gm” means “General Motors” or “genetically modified” or “George Mason” or “gamemaster” or 20 other meanings: http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html
From bolding in Bing’s search results it looks like Bing does similar things, and they should be proud of them too.
Now consider this hypothetical scenario: I spend six months developing, testing, and launching a new synonyms algorithm that brings up some great new results for a whole bunch of queries. After all that hard work, my project launches, and Google users start clicking on the new results because they’re good. Yay! Then, three weeks later, my great new results start appearing as the top Bing results for those queries. This happens automatically, even though Bing has no idea that, say, “gm” meant “General Motors” in one case and “genetically modified” in another. So, wait — what did I do all that work for? How are my manager and I going to justify working on similar improvements in the future? Software engineers ain’t cheap.
Sounds far-fetched? That’s basically what happened to Google’s spelling team with their correction from “torsorophy” to “tarsorrhaphy.” And since Google’s announcement last week, nothing from Bing has convinced me that this scenario is implausible. In fact, they seem adamant that they will continue this practice.
So that’s my concern. If competitors can’t differentiate their products on quality, it removes the incentive to improve quality. Isn’t that bad for users in the long run?
Jeremy,
Thanks for your thoughtful comments.
I have certainly been involved in internal projects at Bing where an improvement our group was working on was wiped out by another group’s work, either directly or indirectly. That can be discouraging, and so our group’s internal focus shifted to other kinds of improvements where there seemed to be better value for the effort.
I think that both Bing and Google do best when they focus on improvements of the customer’s experience. Let’s assume that Bing learned the alteration of ‘torsorophy’ to ‘tarsorrhaphy’ from Google based on the clickstream data (and many similar query alterations). In the end, customers in general get a better search experience, and Google maintains its competitive lead by N months, where N is the length of time it takes the Bing systems to learn the alterations.
For what it’s worth, I don’t see this as particularly different from the ways Bing and Google already use searchlog data and webscraped data. Our systems learn from the work (both easy and hard) of many, many people. In fact, I think this is the core issue over which Bing and Google have disagreed.
Still, it seems to me that Bing and Google, now aware of each other’s knowledge, will have to think hard about what to do next; it seems unlikely our two companies will continue to do things in the same way, adamantly or not. Cool, more work & more interesting problems to solve!
So, I just happened to come across this Wikipedia page today: Homeoteleuton, which contains the opening phrase:
Homeoteleuton, also spelled as homoeoteleuton and homoioteleuton, …
This seems like just the sort of alternative spelling information one could pick up from general web scraping (a classic “Hearst pattern”).
We (that is, Bing & Google) wouldn’t consider it cheating to learn this from webscraping–would you agree? If so, then why should we consider it cheating when it happens from a clickstream?
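For example, a crude pattern in the Hearst style could pull those variants straight out of that opening sentence (a toy sketch, not anyone’s production extractor):

```python
import re

text = ("Homeoteleuton, also spelled as homoeoteleuton "
        "and homoioteleuton, is the repetition of endings in words.")

# Pattern: "<term>, also spelled as <variant> and <variant>"
m = re.search(r"(\w+), also spelled as (\w+) and (\w+)", text)
if m:
    term, *variants = m.groups()
    print(term, "->", variants)
# prints: Homeoteleuton -> ['homoeoteleuton', 'homoioteleuton']
```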
Will, it might be helpful if you provided a specific definition of “clickstream”.
Bob, I think I did, in the body of my post, and the previous note. Abstractly, think of it as (timestamped) tuples.
Thanks, Will.
You did say that the clickstream obviously contains information about “the search engine used, the query, the presented URLs (and their order), and the URLs that are clicked by the user”.
I was just pressing for a more formal definition, because the part about the presented URLs and their order doesn’t fit into most definitions of clickstream. And it looks like you overlooked my earlier question about that.
I guess the point of this blog post was to say that it is not necessary to collect which search engine is used. But your question is a good one (is the SERP collected?) and the answer is: I don’t know. I think I was confusing two issues, and I appreciate your followup question.
So, if I were to rewrite this post, I’d only include positive information in the clickstream; that is, timestamped (query, URL) pairs. The point being: timestamped (search-engine, query, URL) triples would be unnecessary; the information comes from the customers, and only indirectly from Google.
I’m not trying to dodge your question; I just don’t know the answer. I’ve tried to keep my blog posts based on the information made public by Bing. Since I don’t/can’t speak for them, I guess it’s just as well that I don’t know: I’d hate to have to say “I know the answer, but I can’t tell you.”
Thanks, Will. I appreciate you telling us what you know (and can tell us) as well as what you don’t know.
Your point, that it isn’t necessary to know which search engine the user was interacting with when he shared his clickstream with Bing, is quite true. In fact, collections of (query term, clicked URL) pairs were entirely sufficient to create the infamous Google / Bing screen shots.
It’s just my opinion, but I’m certain that the Google folks conducted additional tests on bogus SERPs with more than one result. If the unclicked links had then shown up in the Bing SERPs, we know that Google would be telling us all about it.
Also, regarding what you wrote about Chrome: “For example, I use Chrome for a lot for personal web use—it’s so fast—and I know this helps, at some level, competitors to the people who pay my salary. But I have very little knowledge of the specific ways this happens—and perhaps then it doesn’t really matter. Just as I am glad that there is now a choice in search engines, so also I am glad that there are three or more good choices for browsers—which incidentally improve the aligned search engines.”
To put your fears to rest, you might want to read the exchange between a Microsoft and Google employee here: http://glinden.blogspot.com/2011/02/google-bing-and-web-browsing-data.html “Update: In the comments, Peter Kasting, a Google engineer working on Chrome, denies that Chrome sends clickstream data (the URLs you visit) back to Google. A check of the Google Chrome privacy policy ([1] [2]) appears to confirm that. I apologize for getting that wrong and have corrected the post above.”