O(n) beats O(lg n)

Here’s a problem I faced at work: determine the effective top-level domain of a URI, using the data from the Public Suffix List, but without reading any data structures into memory (i.e., no dynamically created hash table or even a static array). For example, the effective top-level domain of “http://blog.entish.org” is “entish.org”, because that is a name one could register.
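The key-splitting step described below is easy to picture. Here is a minimal sketch (in Go, since the post doesn’t name its language; `suffixKeys` is a hypothetical helper) that turns a host name into its series of candidate keys:

```go
package main

import (
	"fmt"
	"strings"
)

// suffixKeys returns every dot-suffix of a host name, longest first,
// e.g. "blog.entish.org" -> ["blog.entish.org", "entish.org", "org"].
func suffixKeys(host string) []string {
	var keys []string
	for {
		keys = append(keys, host)
		i := strings.IndexByte(host, '.')
		if i < 0 {
			return keys
		}
		host = host[i+1:]
	}
}

func main() {
	fmt.Println(suffixKeys("blog.entish.org"))
	// → [blog.entish.org entish.org org]
}
```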

So, there are about 4000 “positive” rules and about 100 “negative” rules. One way to approach this is to convert a host name into a series of keys (blog.entish.org, entish.org, org) and find the longest matching positive and negative rules. It’s simple enough to write a fast hash function (e.g., FNV-1a) and then write code that generates a function which does one of the following two things:

(1) switch (fnva(key)) {case h1: case h2: … case hn: return true; default: return false}

(2) var h = fnva(key); if (h < hmid) { … } else { … }

where (1) is a giant switch statement, and (2) is a function that does a binary search by comparing hash values. (1) is O(n), of course (because you might have to look through every value); (2) is O(lg n), as any good binary search is.
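As a concrete sketch of the two generated shapes (again in Go, with a handful of made-up rule hashes standing in for the ~4000 real ones; real generated code would embed the hash values as literal constants rather than computing them at startup):

```go
package main

import (
	"fmt"
	"sort"
)

// fnv1a is the 64-bit FNV-1a hash (the post's "fnva").
func fnv1a(s string) uint64 {
	h := uint64(14695981039346656037) // FNV offset basis
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211 // FNV prime
	}
	return h
}

// Shape (1): a generated switch over the rule hashes. Worst case it tests
// every value -- O(n) -- but it compiles to branch-friendly, straight-line code.
func matchSwitch(key string) bool {
	switch fnv1a(key) {
	case fnv1a("org"), fnv1a("com"), fnv1a("entish.org"): // generated cases
		return true
	default:
		return false
	}
}

// Shape (2): a binary search over the same hashes, sorted -- O(lg n) comparisons.
var sortedHashes = func() []uint64 {
	hs := []uint64{fnv1a("org"), fnv1a("com"), fnv1a("entish.org")}
	sort.Slice(hs, func(i, j int) bool { return hs[i] < hs[j] })
	return hs
}()

func matchBinary(key string) bool {
	h := fnv1a(key)
	lo, hi := 0, len(sortedHashes)
	for lo < hi {
		mid := (lo + hi) / 2
		switch {
		case sortedHashes[mid] < h:
			lo = mid + 1
		case sortedHashes[mid] > h:
			hi = mid
		default:
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchSwitch("entish.org"), matchBinary("entish.org"))   // true true
	fmt.Println(matchSwitch("example.net"), matchBinary("example.net")) // false false
}
```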

I was surprised to see that, on my test set of about 9000 hosts, the switch version was 4 times faster than the binary search version.

It’s fun to write code that writes code. It’s good to benchmark.


8 thoughts on “O(n) beats O(lg n)”

  1. Nice! I’ve always(*) heard that there were cases like this, but it’s neat that you’ve encountered one in real life. Good to remember.

    (*) For some values of always which include “after having read the C/L/R/S algorithms book, and it might’ve been noted by Jon Bentley or Kernighan & Pike too”.

  2. Hmm. Isn’t this just an example of why complexity classes / O-notations are too abstract in some cases? I think every CS student brings up that issue in the first class that discusses them:

    lg n + G, where G is Graham’s number, is still O(lg n), and epsilon * n is still O(n), even for a freakishly small epsilon (which admittedly isn’t how my math teacher would have phrased it). In practical problem spaces, an algorithm with a runtime like the latter will outperform one like the former.

    This additional cost becomes harder to see the more we move away from theoretical constructs and into practical implementations of these, which usually are hidden behind layers of libraries, frameworks, operating systems and specialized hardware.

    1. I don’t think so, although your point is well taken in general. In this case, n is the same, but the switch statement, with n+1 branches, is being compiled into something the hardware is much faster at executing than the lg n comparisons.

  3. It seems not unlikely that the switch is so fast because the compiler converted it to a binary search or a jump table; at least that is what C compilers like gcc do.

  4. It’s as Falk said. This switch is actually converted into a hash table at compile time. The same strategy is used in Java for strings in switch. So really, “O(1) beats O(lg n)”…?
