Using lazy vals in a state object

I was writing some Scala code which had a long series of somewhat complicated and expensive calculations to make, and in many cases I wanted to log the intermediate calculations. The first version of this code was quite complicated looking, because the calculations of the intermediate values and the logging statements were all intermixed.

But then I realized that I could create a separate object which maintained the state, and which performed the calculations, too. But sometimes the final calculation was not even required (and, as a result, neither were some of the intermediate calculations). And I realized this was a good use of lazy values: values that don’t get calculated until they are required, and whose values are then cached.

Here’s an example: the Jaccard similarity between two sets is the ratio of the size of their intersection to the size of their union (defined as 0 if both sets are empty). The following code shows how you might use a state object to define as-needed values for this:

object LazyExample extends Logging {

  class SetComparison[T](left: Set[T], right: Set[T]) {
    lazy val intersection = left intersect right
    lazy val intersectionSize = intersection.size
    lazy val union = left union right
    lazy val unionSize = union.size
    lazy val jaccardSimilarity =
      if (unionSize == 0) 0.0
      else intersectionSize.toDouble / unionSize.toDouble
  }

  def similarity[T](left: Set[T], right: Set[T]) = {
    val comparison = new SetComparison(left, right)
    logger.info(s"intersection: ${comparison.intersectionSize} " +
      s"union: ${comparison.unionSize} " +
      s"similarity: ${comparison.jaccardSimilarity}")
    comparison.jaccardSimilarity
  }
}

Note that (ignoring the logging statement) the intersection isn’t even calculated if the union is empty.
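To see the laziness in action, here is a sketch of the same class, instrumented with a flag (my addition, not part of the original) so we can observe whether the intersection was actually forced:

```scala
object LazyDemo {
  class SetComparison[T](left: Set[T], right: Set[T]) {
    var intersectionComputed = false // demo-only instrumentation
    lazy val intersection = { intersectionComputed = true; left intersect right }
    lazy val intersectionSize = intersection.size
    lazy val union = left union right
    lazy val unionSize = union.size
    lazy val jaccardSimilarity: Double =
      if (unionSize == 0) 0.0
      else intersectionSize.toDouble / unionSize.toDouble
  }

  def main(args: Array[String]): Unit = {
    // Empty sets: jaccardSimilarity short-circuits on unionSize == 0,
    // so the intersection is never computed.
    val empty = new SetComparison(Set.empty[Int], Set.empty[Int])
    println(empty.jaccardSimilarity)    // 0.0
    println(empty.intersectionComputed) // false: never forced

    // Non-empty sets: the intersection is forced exactly once.
    val some = new SetComparison(Set(1, 2, 3), Set(2, 3, 4))
    println(some.jaccardSimilarity)     // 2.0 / 4.0 = 0.5
    println(some.intersectionComputed)  // true
  }
}
```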

“You can’t franchise the kingdom of God”: A review of Slow Church: Cultivating Community in the Patient Way of Jesus

by C. Christopher Smith and John Pattison, InterVarsity Press, 2014. Reviewed by Will Fitzgerald, May 26, 2014.

It’s been a remarkable weekend of eating for me. Searching for a place to grab a quick bite with other out-of-town friends while we were all in Chicago, we fired up an app on our iPhones to seek a nearby place. On a whim, we called a restaurant named Podhalanka to see if they were open. “Yes! Come! I will take care of you!” said the man who answered. Podhalanka, named after a region in Poland, turned out to be the perfect hole in the wall. No loud music, just the soccer match on the television, and we were the only customers there. The very inviting waiter, Greg, took care of us indeed. Out came a strawberry compote, bread and five bowls of three different soups, including a dill pickle soup (“zupa ogorkowa”), potato pancakes, amazing cheese blintzes, sausage, stuffed cabbage (“gołąbki”, which we call gawumkies at home), pierogi (both cheese and meat-filled), and a potato dish. All of it perfectly well done and gregariously served. We weren’t looking for a feast; we didn’t need a feast; but a feast was thrust upon us, and we gratefully accepted. Rooted in one of the Polish neighborhoods of Chicago, Podhalanka has been taking care of people for over thirty years.

This morning, my wife, Bess, and I joined her sister and husband for a traditional Memorial Day breakfast at Milham Park. That is, it’s traditional for Bess’s family to gather there and eat blueberry coffeecake and quiche. Anne and Sandy made two amazing quiches, one with morel mushrooms (the state sport of Michigan, we joked, was not telling people where we picked our morels, a rare treat found for only a short time in the Spring), and the other with goat cheese and dried tomatoes. They’re good cooks. We played a game of Pooh sticks, as we always do (I won, but in all modesty, I have to say that it’s all in how you twist your wrist).

It has been a delight to enjoy this “slow food,” food cooked and served without hurry, with a sense of terroir, a taste of that particular Polish neighborhood and this particular family tradition. In Chris Smith and John Pattison’s new book, Slow Church: Cultivating Community in the Patient Way of Jesus, they apply the image of slow food to the church, and argue for a slow approach to church, in contrast to franchise and modernist church growth approaches, which they call, after George Ritzer, “McDonaldization.” “Slow Church,” they say, “happens when people live, work, worship, go to school, eat, grow, learn, heal and play in proximity to each other, often outside the walls of the sanctuary.”

With a foreword by Jonathan Wilson-Hartgrove, whose book The Wisdom of Stability is a strong influence, Smith and Pattison make their way through three “courses” of ethics, ecology, and economy, and chapters devoted to terroir, stability, patience, wholeness, work, Sabbath, abundance, gratitude, and hospitality. Along the way, they bring many theologians, cultural critics, and culture creators into the conversation. Wendell Berry makes more than several appearances, but so does Tina Fey on improvisation, Walter Brueggemann on abundance and scarcity, NT Wright on the history of creation, Christine Pohl on hospitality and gratitude and the practices of community. In fact, if you want to know who is influencing people who are trying to think through what the postmodern (white, American, Protestant) church should look like, this book is an invaluable resource.

Each of the chapters has a number of questions at its end, also making it an excellent resource for a discussion group, especially for a group anchored in a particular congregation looking for ways to strengthen or reinvigorate their life together. It brings out many thought-provoking ideas, real-life stories, and practical suggestions, and I strongly recommend it.

Still, the book, or perhaps the very idea of a “Slow Church Movement,” is not above criticism. In adapting the notion of the Slow Food Movement, it has a certain tang of faddishness. I fear that grouping certain ideas and practices under “slow church” will mean that some of those ideas will be discounted when the fad passes. It also has a certain flavor of elitism to it; just as the slow food movement has its foodies, I fear that a “slow church movement” might become more about aesthetic choices than Smith and Pattison intend.

Smith and Pattison also discount the “franchise” model too quickly, I think. There are reasons that McDonalds are more popular than places like Podhalanka, and not all of them are bad. For example, our meal cost, with tip, about $125; if we had gone to McDonald’s and just had a snack, it would have cost under $10. And if we had gone into McDonald’s, we would have known exactly what we were getting — Podhalanka was amazing, but it could have been terrible. Similarly, there are faithful franchise models of church which they barely discuss. For example, the parish system of the Catholic church is briefly mentioned, yet they cannot quite bring themselves to recommend that Protestant churches (and here, I think, they mean mostly the white, American free churches) join or create parishes. “Protestant churches could learn a lot from Catholic parishes” is as far as they are willing to go. They don’t discuss Anglican/Episcopal parish systems, or even the Mormon stake system. Nor do they discuss how parachurch organizations, which often have elements of a franchise to them, can aid churches in a local area to join to accomplish the aims of the church overall. For example, there is a slight irony that InterVarsity Christian Fellowship, whose press published Slow Church, is not mentioned.

In the end, though, this is a remarkably generous book. It describes attitudes and practices that many churches and church leaders will find helpful for building up a faithful, attentive congregation, rooted in the realities and delights of communities. Taste and see — especially if these ideas are new to you — perhaps a feast awaits you, too.

I am grateful to InterVarsity Press for providing a pre-publication version of Slow Church to review. Some of the first section of this essay was previously published as a review of Podhalanka, which is located at 1549 W Division Street in Chicago, Illinois.


Metrics and Hiring, some comments on diversity

Lukas Biewald, the CEO of CrowdFlower, posted an essay called Metrics and Hiring. In it, he does something I haven’t seen anyone else do, which is to dive into his own data looking for indicators of success of people he has hired. He found that referral quality and school quality were the two most predictive indicators. Lukas does not break down his data by gender or age or other demographic classes. I have to say, I very much appreciate Lukas’s humility in recognizing that he’d made disastrous hiring decisions. 

I responded briefly (of course) on Twitter, as did Tim Converse. Here’s our conversation — three white guys shooting the breeze. I think Tim and I both were trying to bring in a warning about the cost of limiting diversity, which was what Lukas seemed to be taking from his data (that is, he did better when he hired from his old boy networks). Both Tim and I have written essays about the cost of limiting diversity (Tim’s essay, my essay).

The lesson I take away from Lukas’s essay is the importance of really understanding how people get referred into an organization. How do you make the move from hiring just from within your own networks and hiring more broadly while still hiring strong candidates? One answer might be some kind of moneyball approach, based on what candidates have actually delivered in the past. Another answer is active and strong mentoring programs to both find diverse candidates and build culture together (for example, what Etsy did with their hacker school).

I suspect that diversity has to be built into a startup from its earliest days. If it is part of the culture to find and retain excellent people who are not like themselves, then the advantages of a diverse workplace will scale as the company grows. I’m not suggesting that this is easy, but I do believe it is both necessary and feasible.

A comment on “Computational linguistics and literary scholarship”

There is a controversy over “Computational linguistics and literary scholarship.” A paper published at the Association for Computational Linguistics (David Bamman, Brendan O’Connor, & Noah A. Smith, “Learning Latent Personas of Film Characters“) has come under criticism by Hannah Alpert-Abrams with Dan Garrette in “Some thoughts on the relationship between computational linguistics and literary scholarship“. The primary criticism has been that the ACL paper fails to cite any literary scholarship past the middle of the last century.

Brendan O’Connor (a friend of mine, I should say) responded by saying the authors of the ACL paper were not trying to do literary theory, but “developing a computational linguistic research method of analyzing characters in stories.” O’Connor states that this is a paper about method, not literary content; and compares it to the early days of quantitative models in social science and economics (with which he is more familiar).

There is a slight irony in this controversy because a common trope on the Language Log blog (where Alpert-Abrams and Garrette published their comments) is about physicists expounding on linguistics without consulting any linguists (for example, “Word String frequency distributions” and Joshua Goodman’s wonderful rant, “Language Trees and Zipping.”) This idea of poaching on other people’s grounds – or, better, trying to do other people’s work for them without a proper understanding of just what that work is – is something linguists, computational or otherwise, might have a sensitivity to.

However, I think this overlooks the real criticism of the latent personas work. To quote from Alpert-Abrams and Garrette:

When we look at Wikipedia entries about film, for example, we would not expect to find universal, latent character personas. This is because Wikipedia entries are not simply transcripts of films: they are written by a community that talks about film in a specific way. The authors are typically male, young, white, and educated; their descriptive language is informed by their cultural context. In fact, to generalize from the language of this community is to make the same mistake as Campbell and Jung by treating the worldview of an empowered elite as representative of the world at large.

I am no literary scholar, but it’s common knowledge that much of post-modern literary scholarship concerns itself with power relationships. One might even say that much of what it means to be post-modern is to study those relationships, especially to disillusion elites and recover disempowered voices. Alpert-Abrams and Garrette’s criticism is, I believe, not so much about poaching or about ignoring recent advances in the field of literary criticism, but ignoring the realities of power in the Wikipedia data.

This affects the work of Bamman, O’Connor and Smith in at least two ways. First, it suggests that the data upon which they base their conclusions is skewed in a way they did not recognize. It is reasonable for O’Connor to say that they were more interested in method, and not criticism. But it is very important for any machine learning programme that one understands the data that models are being run on. To ignore the quality of the data will likely produce erroneous results (in fact, this is one of the points that Mark Liberman makes in his friendly criticism of the word string frequency paper mentioned above). Second, and this is related to the first issue, given the skewed nature of Wikipedia contributors towards young educated males, the results based on this data will likely perpetuate models reflecting their views, and further disempower women, elders, and non-educated people. The Wikipedia project states that the top quartile of Wikipedia participants starts at 30 years old, and half are under 22. It is easy to imagine how the comparative lack of data about films from people in their 30s, 40s, 50s, etc, can skew the results, and the same is easily said for women (12% of contributors) and educational level, which co-varies with the young age of the average participant. Actually, the Wikipedia study did not provide data on race/ethnicity (only nationality), although it seems prima facie likely that contributors tend to be white, heterosexual, and cis-gendered.

As a single example of this, Bamman, O’Connor and Smith list the topics in a 50-topic model in their table 3. Two of the fifty topics are WOMAN (woman, friend, wife, sister, husband) and WITCH (witch, villager, kid, boy, mom) – the label name is, as is typical for this research, the word with the highest PMI (pointwise mutual information) for the topic. There is no topic for MAN (because it is totally unmarked?) or for ethnicity, except for monsters – FAIRY, WEREWOLF, ALIEN. MAYA is a topic, but the other high PMI words in this topic are (monster, monk, goon, dragon). Granted that movies skew towards the interests of young males, these topics are not surprising, but it is likely that the makeup of contributors even further skews these results.

I am grateful for Alpert-Abrams and Garrette’s note. Wikipedia is increasingly being seen as a source of ground truth for any number of machine-learned models and systems. Its breadth and depth make it natural to use Wikipedia titles as a kind of list of things of interest (see, for example, the Wikifier project, which does entity extraction based on Wikipedia titles and content). It is a good reminder that the makeup of the contributors to Wikipedia will bias results based on Wikipedia in certain systematic (and political) ways; understanding these is, as Alpert-Abrams and Garrette note, an important project in its own right.

A Fibonacci A Day: Using Streams and a Fast Memoization Method

Well, I didn’t think I’d write another Fibonacci A Day post — but there’s a great fibonacci function that’s defined right in the Scala documents, in their discussion of Streams.

A Stream is a standard Scala data structure which is very list-like. Recall that a Scala list has two parts, a head and a tail; the head is a value, and the tail is either the special empty list (Nil) or another list. For example, the list created by List("cat", "mouse", "cheese") is a list whose head is "cat", and whose tail is a list whose head is "mouse" and whose tail is a list whose head is "cheese" and whose tail is Nil (the cheese does not quite stand alone).

Elements in a stream (the values in their heads), on the other hand, are only evaluated when they are needed. For example, getting the nth value in a stream causes the stream to evaluate its first element, its second element, and so on, up to n. Once evaluated, they are remembered (that is, memoized) so that the next time the element is requested, the value has already been computed (the lookup is still O(n), but no computation is done).
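The memoization is easy to observe with a side-effecting counter (a throwaway sketch of my own; note that Stream was deprecated in Scala 2.13 in favor of LazyList, so this uses the older API to match the post):

```scala
object StreamMemoDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    // map over a Stream is lazy past the head; each cell's value is
    // computed on first access and then cached.
    val squares: Stream[Int] = Stream.from(1).map { n => evaluations += 1; n * n }
    println(squares(4))  // forces elements 0..4: prints 25
    println(evaluations) // 5
    println(squares(4))  // memoized: no recomputation
    println(evaluations) // still 5
  }
}
```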

Here’s a definition of a fibonacci stream, pretty much as given in the documents:

lazy val _fibs: Stream[BigInt] = 
  0 #:: 1 #:: (_fibs zip _fibs.tail map { case (x, y) ⇒ x + y })

This defines an “infinite” list of fibonacci numbers — infinite in the sense that the list can be as long as we have memory for. Note that Scala allows us to have recursive data structures as well as recursive functions. Here the tail of the _fibs stream (after the 2nd element, anyway) is defined in terms of itself — we zip the stream with the tail of the stream, and then each element is the sum of the two previous elements — just as we had defined.
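As a quick check, we can force and print the first few elements. This sketch writes the two seed values as explicit BigInts (the original relies on Scala’s implicit Int-to-BigInt conversion):

```scala
object FibStreamDemo {
  // the self-referential stream, as in the post
  lazy val fibs: Stream[BigInt] =
    BigInt(0) #:: BigInt(1) #:: (fibs zip fibs.tail map { case (x, y) => x + y })

  def main(args: Array[String]): Unit = {
    // take(10) forces only the first ten cells
    println(fibs.take(10).toList.mkString(", ")) // 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
  }
}
```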

We can access this list in a similar way to our previous functions to get positive and negative fibonacci numbers:

def stream_fibonacci(n: Int) = n match {
    case _ if (n >= 0)     ⇒ _fibs(n) // natural numbers
    case _ if (n % 2 == 0) ⇒ -_fibs(-n) // even, negative numbers
    case _                 ⇒ _fibs(-n) // odd, negative numbers
  }

How does this compare in speed with our previous functions? Calling stream_fibonacci(1000) the first time results in zipping through the stream and calculating the sums. Somewhat surprisingly, this is not more expensive than our fastest version. Afterwards, though, we just “look up” the 1001st element, which requires traversing the stream’s first 1001 elements. But this is very fast, about 100 times faster on my machine (see the updated benchmarks).

Of course, this comes at the expense of memory storage for the 1001 fibonacci numbers in the stream. Depending on the application, this could be just fine. Of course, fibonacci numbers get big fast, so maintaining a large stream of them might not be good (the other versions only require two or so BigInts to be held in memory during the computation).

A Fibonacci A Day: Fibonacci(n) in O(lg(n)) steps

I am taking Martin Odersky’s Functional Programming in Scala class, and discovered (or rediscovered) that the Fibonacci function is discussed in SICP (Structure and Interpretation of Computer Programs, the book that Odersky seems to be basing much of his Functional Programming in Scala course on).

Even more interesting is that a Fibonacci function which runs in O(lg(n)) steps is presented — that is, it takes about lg(n) steps to compute the nth Fibonacci number, instead of n steps. It’s given as an exercise (Exercise 1.19). Bill the Lizard has been working through the SICP exercises, and his explanation of Exercise 1.19 is the clearest I’ve found. The code below might not make much sense without looking at the SICP chapter or his explanation, but here is the code in Scala, put in the form that I’ve been using — this computes fibonacci(n) for any integer (including negative integers). Running the benchmark indicates that this version is a lot faster than the straight tail recursive version.
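For readers without SICP at hand, the key identity behind the halving step can be summarized as follows (a compressed restatement of SICP’s Exercise 1.19, worth checking against the original):

```latex
% Each Fibonacci step is the transformation T : (a, b) -> (a + b, a),
% which is the p = 0, q = 1 case of the family
%   T_{pq} : (a, b) -> (bq + aq + ap, bp + aq).
% Applying T_{pq} twice yields another member of the same family:
\begin{aligned}
T_{pq}\bigl(T_{pq}(a, b)\bigr) &= T_{p'q'}(a, b),\\
\text{where}\quad p' &= p^2 + q^2,\\
q' &= 2pq + q^2.
\end{aligned}
% So when count is even, the code squares the transformation (the
% p*p + qq and 2*p*q + qq branch) and halves count; when count is odd,
% it applies T_{pq} once and decrements count. Hence O(lg n) steps.
```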

  def fibonacci(n: Int): BigInt = {
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(a: BigInt, b: BigInt, p: BigInt, q: BigInt, count: Int): BigInt =
        if (count == 0) b
        else if (count % 2 == 0) {
          val qq = q * q
          recurse(a, b, (p * p + qq), (2 * p * q) + qq, count / 2)
        } else {
          val aq = a * q
          recurse(b * q + aq + a * p, b * p + aq, p, q, count - 1)
        }
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 0, 1, m)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }


A Fibonacci A Day

The goal of this series was to come up with a general, efficient fibonacci function that goes beyond the typical computer science introduction. The version we developed is fast, efficient with respect to memory, and general. I’d be glad to hear comments.


The Series:

  1. A Fibonacci A Day — Introduction
  2. A Fibonacci A Day — Recursive Version
  3. A Fibonacci A Day — Special Pi Day Post
  4. A Fibonacci A Day — Iterative Version
  5. A Fibonacci A Day — Out of Bounds
  6. A Fibonacci A Day — Speedy Recursion
  7. A Fibonacci A Day — Golden Ratio Version
  8. A Fibonacci A Day — Memoization
  9. A Fibonacci A Day — Benchmarking
  10. A Fibonacci A Day — Extending to Negative Numbers
  11. A Fibonacci A Day — Fibonacci in O(lg(n)) Steps
  12. A Fibonacci A Day — Using Streams and a Fast Memoization Method


A Fibonacci A Day — Extending to Negative Numbers

We are now confident that the fast recursive version is the best of the algorithms we have explored — it is fast, and can handle large values of the Fibonacci function.

I haven’t been happy, though, with returning -1 when the function is called with a negative number — this feels like the kind of silent error we were trying to avoid in going to BigInts. Fortunately, it turns out that there is a natural extension of the Fibonacci sequence to negative numbers. If the nth Fibonacci number is defined by F(n) = F(n-2) + F(n-1), then we can rearrange this to extend the sequence leftwards, giving F(-n) = (-1)^(n+1) F(n).
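The identity is easy to verify with a tiny reference implementation that runs the recurrence in both directions (a throwaway sketch of my own, not the fast version):

```scala
object NegativeFibCheck {
  // naive reference definition, valid for all integers:
  // rightwards via F(n) = F(n-1) + F(n-2), leftwards via F(n) = F(n+2) - F(n+1)
  def f(n: Int): BigInt =
    if (n == 0) BigInt(0)
    else if (n == 1) BigInt(1)
    else if (n > 1) f(n - 1) + f(n - 2)
    else f(n + 2) - f(n + 1) // extend to negative indices

  def main(args: Array[String]): Unit = {
    // F(-4)..F(4): -3, 2, -1, 1, 0, 1, 1, 2, 3
    println((-4 to 4).map(f).mkString(", "))
    // check F(-n) = (-1)^(n+1) F(n) on small cases
    for (n <- 1 to 8) {
      val expected = if (n % 2 == 0) -f(n) else f(n)
      assert(f(-n) == expected)
    }
  }
}
```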

So, this is the final version of the Fibonacci function, which supports any integer, up to the memory size of the computer, computed quite quickly. Much better than our simple recursive or iterative versions, which were ripe for error and inefficiency.

 def fibonacci(n: Int): BigInt = {
    val limit = if (n < 0) -n else n
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(i: Int, u: BigInt, v: BigInt): BigInt =
        if (i == limit) v
        else recurse(i + 1, v, u + v)
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 1)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }

I hope you’ve enjoyed this series.


A Fibonacci A Day — Benchmarking

The following table describes the limits of the various Fibonacci algorithms I’ve created during this series. I ran them through a very simple benchmarking routine, and empirically tested them for the maximum size they can handle. Each benchmark was run 1000 times, except the first algorithm, which was run just once — we already know it runs too slowly. These were run on a MacBook Air (Late 2010).

Algorithm Milliseconds per call for F(46) Max size
Simple Recursive 21,360.00 40 or so
Classic Iterative (Int) 0.0040 46
Classic Iterative (Long) 0.0040 164
Classic Iterative (Double) 0.0030 76
Classic Iterative (BigInt) 0.0080 unlimited (except memory)
Tail Recursive (BigInt) 0.0060 unlimited (except memory)
Log Recursive (BigInt) 0.0082 unlimited (except memory)
Using Phi (BigInt) 0.0030 70 or so
Memoization (BigInt) 0.0020 unpredictable
Memoization Monad (BigInt) 0.0940 unpredictable

Here is a table comparing the versions which allow really big fibonacci numbers, calculating F(20000) 100 times each.

Algorithm Milliseconds per call for F(20000) Max size
Classic Iterative (BigInt) 24.23 unlimited (except memory)
Tail Recursive (BigInt) 23.50 unlimited (except memory)
Log Recursive (BigInt) 1.00 unlimited (except memory)
Stream (BigInt), first time 3.19 unlimited (except memory)
Stream (BigInt), subsequent calls 0.22 unlimited (except memory)

F(20000) is a really big integer, and a lot of this time is spent creating and destroying the memory associated with creating the integers. The thing to notice is that the iterative and the tail recursive versions take approximately the same amount of time — the recursive version actually runs a little more quickly than the iterative version. The log recursive version is two orders of magnitude faster, and the stream version is yet another order faster, though it may use too much memory. Note that even the stream version’s first use is faster than the iterative and tail recursive versions!

Still, there’s really no reason not to use the BigInt, log recursive version of this function — it’s both fast and accurate.

(April 2, 2013: Updated to include log recursive version.)

(April 13, 2013: Updated to include stream version; rerun benchmarks for F(20000); and update conclusions.)