Category Archives: Science and Tech

Of Truth

Francis Bacon, translated by Will Fitzgerald. August 1, 2015.

“What is truth?” asked jesting Pilate, and would not stay for an answer. Certainly there have been those who delight in giddiness, and count it bondage to settle on a belief, who affect free will in thinking as well as in acting. And though the schools of philosophers of that kind are gone, there remain certain chattering wits of the same vein—though not with as much blood in them as there was in the ancients.

It is not just that finding out the truth is difficult and laborious, nor that the truth, once found, imposes itself on our thoughts, that brings lies into favour, but a natural (though corrupt) love of the lie itself. One of the later schools of the Greeks examined the matter and was puzzled that people love lies even when they are not made for pleasure (as with poets) or to gain an advantage (as in business), but solely for the lie’s sake. But I cannot tell: truth is a naked and open daylight that does not show the masks and acts and triumphs of the world half so stately and elegantly as candlelight does. Truth might eventually be as prized as a pearl, which is best viewed in the day, but it will not rise to the price of a diamond or garnet, which are best viewed in varied lights. A mixture of lies even adds some pleasure. Does anyone doubt that, if we removed silly opinions, flattering hopes, false values, and willful imaginations, it would leave the lives of any number of people poor shrunken things, full of melancholy and indisposition, unpleasing to themselves? St. Augustine called poetry the “wine of devils” because it fills the imagination, and yet it is but the shadow of a lie. It is not the lie that passes through the mind, but the lie that sinks into and settles in the mind, that does the most hurt, as we have spoken of before.

However these things are in people’s depraved judgements and feelings, truth, which judges only itself, teaches that the supreme good of human nature is the search for truth (courting and wooing it), the knowledge of truth (the presence of it), and the belief of truth (the enjoying of it).

God’s first creature, in the work of the seven days of creation, was the light of the sense. The last was the light of reason, and God’s Sabbath work, ever since, is the illumination of the Spirit. First, God breathed light upon the face of the matter or chaos, then God breathed light into the face of humanity, and God still breathes and inspires light into the face of God’s chosen. The poet Titus Lucretius (who was the beauty of the otherwise inferior Epicureans) said it so well; to paraphrase: “It is a pleasure to stand on the shore, and see ships tossed upon the sea; a pleasure to stand in the windows of a castle, and see a battle and all of its adventures below. But no pleasure is comparable to standing on the vantage ground of Truth (the highest hill of all, where the air is always clear and serene) and seeing the errors, wanderings, mists, and tempests in the valley below.” We take this view with pity, and not with swelling or pride. Certainly, it is heaven on earth, to have one’s mind move in love, rest in providence, and turn upon the poles of truth.

Passing from theological and philosophical truth to the truth of civil business, it must be acknowledged that clear and honest dealing is the honor of human nature, and that mixing in falsehood is like an alloy in gold or silver coin; it may make the metal easier to work with, but it debases its value. For these winding and crooked courses are the goings of the serpent, which goes basely upon its belly and not upon its feet. There is no vice that so covers a person in shame as to be found false and deceitful. Therefore, as Montaigne said so well when he asked why being called a liar should be such a disgraceful and odious charge, “If you weigh it out, to say that someone is a liar is the same as saying that they are brave towards God and a coward toward others.” For a lie faces God, and shrinks from humanity. Surely the wickedness of falsehood and breach of faith cannot be better expressed than this: it is foretold that when the last peal sounds, to call the judgements of God upon the generations of humanity, and Christ comes, “he shall not find faith upon the earth.”

Elasticsearch 1.6 upserts and partial updates

To partially update an Elasticsearch document in 1.6, you’ll need to create a script for the partial update, and potentially use an upsert.

For example, suppose you want to add tags to a blog post.

Here’s an example of a document created directly:

POST /website/blog/1/_update
{
  "doc" : {
    "tags" : [ "testing" ],
    "views" : 0
  }
}

But if you want to add a tag without changing the views, you need to create a script in config/scripts.

For example, config/scripts/add_tag.groovy:

ctx._source.tags += tag

Then, to upsert the document (so that a tag is appended if the document already exists, but a document of the same shape as shown in the first example above is created if it does not):

POST /website/blog/3/_update
{
  "script_file" : "add_tag",
  "params" : { "tag" : "search" },
  "upsert" : { "tags" : [ "search" ], "views" : 0 }
}

Similarly, to add to the number of views, create another script:

config/scripts/add_view.groovy

ctx._source.views += n

POST /website/blog/3/_update
{
  "script_file" : "add_view",
  "params" : { "n" : 1 },
  "upsert" : { "views" : 1 }
}

How to do named entity recognition: machine learning, oversimplified

What is Named Entity Recognition?

Named Entity Recognition (or NER) is the task of finding all the names of things in a text. For example, in the sentence “Jimi Hendrix played ‘The Star-Spangled Banner’ at Woodstock,” NER, correctly done, would find the references to “Jimi Hendrix,” “The Star-Spangled Banner,” and “Woodstock.” Note that NER should also extract names it has never previously seen; for example, in the sentence “Johannes von Smetly is the protagonist in Willard Smith’s new novel”, an NER system should extract “Johannes von Smetly” as a name, even if it doesn’t appear in any standard name list.

NER systems typically also provide a classification for the names they extract; for example, that “Jimi Hendrix” is the name of a person, “Woodstock” is a location, etc. The most standard classification scheme I have used is PERSON/LOCATION/ORGANIZATION/MISC, but of course the right scheme depends on the needs of the application.

General approaches

NER is typically approached as a sequence tagging task, not unlike part of speech tagging. The basic idea is to tag non-name tokens as being “outside” any name, while tokens within a name are tagged in some way so that one can tell where a name begins or ends. The BIO scheme, for example, tags tokens as beginning, inside, or outside a name:

Jimi/B Hendrix/I played/O at/O Woodstock/B ./O

The BILOU scheme tags tokens as beginning, inside, last, outside, or unit, where “unit” means a name that is a single token long. For example:

Jimi/B Hendrix/L played/O at/O Woodstock/U ./O

With such a scheme, it is easy to then extract the names by filtering based on the tags. Finally, it should be noted that the name classification is often used as part of the token classification; for example:

Jimi/B-PER Hendrix/I-PER played/O at/O Woodstock/B-LOC ./O

So, with BILOU tagging and four name types, there are 17 possible classifications for each token (4 name types times 4 name tags, plus “outside”). Somewhat surprisingly, in my experience, BILOU tagging outperforms BIO tagging, and putting name types on the tokens works better than a separate classification after tagging is done.
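To make the extraction step concrete, here is a minimal sketch in Scala (the language used elsewhere in this archive) of pulling names out of a BILOU-tagged token sequence. The Token case class and the extractNames function are made-up names for illustration, not part of any particular NER library.

case class Token(text: String, tag: String) // tag is e.g. "B-PER", "I-PER", "L-PER", "U-LOC", or "O"

def extractNames(tokens: Seq[Token]): List[(String, String)] = {
  val names   = scala.collection.mutable.ListBuffer[(String, String)]()
  val current = scala.collection.mutable.ListBuffer[String]()
  for (t <- tokens) {
    val parts = t.tag.split("-", 2)
    val pos   = parts(0)
    val kind  = if (parts.length > 1) parts(1) else ""
    pos match {
      case "U" => names += ((t.text, kind))          // a single-token name
      case "B" => current.clear(); current += t.text // a name begins
      case "I" => current += t.text                  // a name continues
      case "L" =>                                    // a name ends
        current += t.text
        names += ((current.mkString(" "), kind))
        current.clear()
      case _   => ()                                 // "O": outside any name
    }
  }
  names.toList
}

// extractNames(Seq(Token("Jimi", "B-PER"), Token("Hendrix", "L-PER"), Token("played", "O"),
//   Token("at", "O"), Token("Woodstock", "U-LOC"), Token(".", "O")))
// => List(("Jimi Hendrix", "PER"), ("Woodstock", "LOC"))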

Metrics

Accuracy at the token level tends to not be especially interesting. The baseline of assigning every token an “outside” tag can have pretty high accuracy rates, depending on the text. Typically, precision and recall of recognized NEs tend to be more useful, as well as metrics based on these (F1-score, ROC-curves, etc.)
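For concreteness, here is a minimal sketch of entity-level precision, recall, and F1, assuming the predicted and gold entities have already been reduced to some comparable representation (plain strings here; the function name and representation are illustrative assumptions):

def precisionRecallF1(predicted: Set[String], gold: Set[String]): (Double, Double, Double) = {
  val correct   = (predicted intersect gold).size.toDouble
  val precision = if (predicted.isEmpty) 0.0 else correct / predicted.size
  val recall    = if (gold.isEmpty) 0.0 else correct / gold.size
  val f1        = if (precision + recall == 0.0) 0.0 else 2 * precision * recall / (precision + recall)
  (precision, recall, f1)
}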

Methods and feature engineering

As with part of speech tagging, there are a variety of methods for sequence tagging in NER. The simplest use hidden Markov models or maximum entropy Markov models (MEMMs), which are very efficient to run at prediction time. The Stanford NER system uses conditional random fields, which provide some improvement over MEMMs.

As with part of speech tagging, however, the real gains are due to feature engineering. “The features are more important than the model.” The Stanford NER system, for example, uses bag-of-words features, word ngram features, and word context features; general orthographic features; prefixes and suffixes; and feature conjunctions. Additionally, distributional features (that is, word clusters, or word vectors) are important as well. (Again, we hope to address word vectors in another essay.) In my work, the use of gazetteers (that is, specialized word lists) is also important: if you know that “Obama” is a common personal (last) name token in a particular corpus, then that can be helpful.
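As an illustration of the flavor of such features (a sketch only; the real feature set of a system like Stanford NER is much richer, and the surname gazetteer here is a made-up stand-in):

def tokenFeatures(token: String, surnameGazetteer: Set[String]): Seq[String] = {
  // A crude orthographic "shape": uppercase -> X, lowercase -> x, digit -> d.
  val shape = token.map {
    case c if c.isUpper => 'X'
    case c if c.isLower => 'x'
    case c if c.isDigit => 'd'
    case _              => '-'
  }
  Seq(
    s"word=${token.toLowerCase}",
    s"shape=$shape",                                 // e.g. "Obama" -> "Xxxxx"
    s"prefix3=${token.take(3).toLowerCase}",
    s"suffix3=${token.takeRight(3).toLowerCase}",
    s"in-surname-gazetteer=${surnameGazetteer.contains(token)}"
  )
}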

Feature engineering suggests that the training system should run as quickly as possible, in order to explore different spaces of parameters, which is another reason to prefer simpler models over others. It’s quicker to train a max-ent model than an SVM, for example.

How to do part of speech tagging: machine learning, oversimplified

What is part of speech tagging?

Part of speech tagging (POS tagging) is a classification problem: assigning each word or token in a text to a “part of speech,” that is, a lexical/grammatical class for each token. Different theories, or (perhaps more importantly) different uses will suggest different sets of grammatical labels, but they typically include nouns, verbs, adjectives, etc. For example, the sentence “John thinks it’s all good.” might be tagged as:

NNP/John  VBZ/thinks  PRP/it VBZ/'s  DT/all  JJ/good ./. 

where the tags are prepended to each word or token. In this set, the most typical set used, based on some of the earliest work in POS tagging, NNP means “singular proper noun,” VBZ means “third person singular present verb,” PRP means “personal pronoun,” DT means “determiner,” JJ means “adjective,” and “.” means “sentence-final punctuation.” As might be inferred from this small sample, this particular set is very English-specific: distinguishing the third-person singular present from other present forms, for example; the tagset for standard Spanish would need to be different.

POS tagging depends on a tokenization scheme; that is, some way to break text into words and punctuation and the like. This is much trickier than it first appears. In standard English, one can go a long way with simple regular-expression based tokenizers. But in any kind of informal language, it can get more difficult. Twitter text in particular has required new approaches to tokenization.
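As a toy illustration of the regular-expression approach (a sketch only; a real tokenizer, and especially a Twitter tokenizer, needs many more rules than this), something like the following splits off clitics and punctuation in the way the example sentence above assumes:

// Words, clitics beginning with an apostrophe (like 's), and single punctuation marks.
val tokenPattern = """\w+|['’]\w+|[^\w\s]""".r

def tokenize(text: String): Seq[String] = tokenPattern.findAllIn(text).toSeq

// tokenize("John thinks it's all good.")
// => Seq("John", "thinks", "it", "'s", "all", "good", ".")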

It should also be noted that POS tagging requires software that determines the boundaries of sentences. A typical pipeline looks something like:

 cat text | sentence_breaker | tokenizer | pos_tagger | ...

POS tagging is not usually used on its own, but as part of a larger pipeline, as hinted at above. For example, if one wanted to do something as simple as highlight all the adverbs in a text (because they were taught that too many adverbs are “bad”), a POS analysis is necessary. But more typically, POS tags become features for downstream analysis, for example extracting names found in text, or “named entity recognition,” which we hope to discuss in the next essay in this series.

Machine learning approaches to tagging

POS tagging is a multi-label classification problem. There is one very important difference from the kind of multi-label classification problem we described in the previous essay on sentiment analysis.

A very important feature, or set of features, is the tag of the previous or following tokens. For example, if the previous tag is a determiner (the tag for words such as “the”), the next tag is much more likely to be an adjective or noun than a verb. But, of course, we don’t know the tags of the previous or next token until the analysis is complete. So what is required are techniques that find the most likely sequence of tags as a whole, given the evidence. Hidden Markov Models were used in the first successful POS taggers for standard English text, and they remain a reasonable technique. Really good results, as provided by the Stanford POS Tagger, use a “cyclic dependency network,” in which previous and subsequent tags are considered at the same time.
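To make the sequence idea concrete, here is a minimal Viterbi-style sketch in Scala for choosing the best tag sequence under a simple bigram model. The function shape and the two scoring functions are illustrative assumptions, not the actual model inside any of the taggers mentioned here.

// Finds the highest-scoring tag sequence for a (non-empty) sentence under a bigram model:
// the score of a sequence is the sum of emission(tag, word) and transition(tag, previousTag)
// at each position, where both functions are assumed to return log probabilities (or any additive scores).
def viterbi(
    words: Seq[String],
    tags: Seq[String],
    emission: (String, String) => Double,
    transition: (String, String) => Double): List[String] = {
  // best(i)(t) = (score of the best path ending in tag t at position i, previous tag on that path)
  val best = Array.ofDim[Map[String, (Double, String)]](words.length)
  best(0) = tags.map(t => (t, (emission(t, words(0)), ""))).toMap
  for (i <- 1 until words.length) {
    best(i) = tags.map { t =>
      val (score, prev) = tags
        .map(p => (best(i - 1)(p)._1 + transition(t, p) + emission(t, words(i)), p))
        .maxBy(_._1)
      (t, (score, prev))
    }.toMap
  }
  // Start from the best final tag and follow the back-pointers.
  var result = List(best.last.maxBy(_._2._1)._1)
  for (i <- words.length - 1 until 0 by -1)
    result = best(i)(result.head)._2 :: result
  result
}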

On Twitter text, the best results have come from additional features, including the use of word clusters (another topic for a future essay), rather than from the sequence-modeling technique itself.

Quality of state-of-the-art POS taggers

Standard English text can be tagged at more than 97% accuracy. Even a fast, averaged-perceptron-based model gets nearly 97% accuracy. This is very good, of course, but it suggests that in a paragraph of 100 words, about three words will be incorrectly tagged. These errors often involve “out-of-vocabulary” tokens; that is, words or tokens the system had not been trained on. It should be noted that just picking the most frequent tag for known tokens, and the most frequent overall tag otherwise, gets about 89% accuracy.
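That baseline is simple enough to sketch (the data structures here are illustrative assumptions, not a description of any particular tagger’s internals):

// Tag each known token with the tag it most often received in training,
// and any unknown token with the single most frequent tag overall (e.g. "NN").
def baselineTagger(
    trainingCounts: Map[String, Map[String, Int]], // token -> (tag -> count)
    defaultTag: String): String => String = { token =>
  trainingCounts.get(token) match {
    case Some(tagCounts) => tagCounts.maxBy(_._2)._1
    case None            => defaultTag
  }
}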

For Twitter text, accuracy is about 93%, which is roughly the inter-rater agreement for humans tagging Twitter text.

List of Penn Treebank Part of Speech tags

Here is at least one version of the Penn Treebank POS tagset. Punctuation tags vary a lot. Since I often try to find a “complete” list, perhaps I’ll find it again more easily if I put it in my own blog…

  1. CC Coordinating Conjunction (and, or, both)
  2. CD Cardinal Number (371, 1, one, two)
  3. DT Determiner (all an another any both each either every many much neither no some such that the them these this those)
  4. EX Existential There
  5. FW Foreign Word (ich jeux habeas jour salutaris oui corporis)
  6. IN Preposition/subordinating Conjunction (among upon in into below atop until over under towards to whether despite if)
  7. JJ Adjective (third ill-mannered regrettable calamitous clean nice)
  8. JJR Adjective, Comparative (cleaner nicer)
  9. JJS Adjective, Superlative (cleanest nicest)
  10. LS List Item Marker
  11. MD Modal (can could may might must need ought shall cannot can’t shouldn’t)
  12. NN Noun, singular or mass (machine computer air wind)
  13. NNS Noun plural (machines computers)
  14. NNP Proper Noun, Singular (Philadelphia Delaware Eagles)
  15. NNPS Proper Noun, plural (Americas)
  16. PDT Predeterminer (all both half)
  17. POS Possessive ending (‘s)
  18. PRP Personal pronoun (him himself we)
  19. PP$ Possessive pronoun (her our ours)
  20. RB Adverb (quickly swiftly always)
  21. RBR Adverb, Comparative (further greater more)
  22. RBS Adverb, Superlative (further best hardest most)
  23. RP Particle (across up)
  24. SYM Symbol, mathematical or scientific (= + &)
  25. TO to
  26. UH Interjection (goodbye, shucks, heck, oops)
  27. VB Verb, base form (hit assign run)
  28. VBD Verb, past tense (hit assigned ran)
  29. VBG Verb, gerund/present participle (hitting)
  30. VBN Verb, past participle (assigned)
  31. VBP Verb, non-3rd person singular, present (displease)
  32. VBZ Verb, 3rd person singular, present (displeases)
  33. WDT wh-determiner (that which whichever what)
  34. WP wh-pronoun (that which what whom)
  35. WP$ Possessive wh-pronoun (whose)
  36. WRB Wh-adverb (how however wherein why)
  37. # Pound sign
  38. $ Dollar sign
  39. . Sentence-final punctuation
  40. , Comma
  41. : Colon or semi-colon
  42. ( Opening parenthesis
  43. ) Closing parenthesis
  44. “ Opening quotation mark
  45. ” Closing quotation mark
  46. ‘ Apostrophe

A Fibonacci A Day: Using Streams and a Fast Memoization Method

Well, I didn’t think I’d write another Fibonacci A Day post — but there’s a great fibonacci function defined right in the Scala documentation, in its discussion of Streams.

A Stream is a standard Scala data structure which is very list-like. Recall that a Scala list has two parts, a head and a tail; the head is a value, and the tail is either the special empty list (Nil), or another list. For example, the list created by List(“cat”, “mouse”, “cheese”) is a list whose head is “cat”, and whose tail is a list whose head is “mouse” and whose tail is a list whose head is “cheese” and whose tail is Nil (the cheese does not quite stand alone).

Elements in a stream (the values in their heads), on the other hand, are only evaluated when they are needed. For example, getting the nth value of a stream causes the stream to evaluate its first element, its second element, and so on, up to n. Once evaluated, they are remembered (that is, memoized), so that the next time an element is requested, the value has already been computed (the lookup is still O(n), but no computation is done).

Here’s a definition of a fibonacci stream, pretty much as given in the documentation:

lazy val _fibs: Stream[BigInt] = 
  0 #:: 1 #:: (_fibs zip _fibs.tail map { case (x, y) ⇒ x + y })

This defines an “infinite” list of fibonacci numbers — infinite in the sense that the list can be as long as we have memory for. Note that Scala allows us to have recursive data structures as well as recursive functions. Here the tail of the _fibs stream (after the 2nd element, anyway) is defined in terms of itself — we zip the stream with the tail of the stream, and then each element is the sum of the two previous elements — just as we had defined.

We can access this list in a similar way to our previous functions to get positive and negative fibonacci numbers:

def stream_fibonacci(n: Int) = n match {
    case _ if (n >= 0)     ⇒ _fibs(n) // natural numbers 
    case _ if (n % 2 == 0) ⇒ -_fibs(-n) // even, negative numbers
    case _                 ⇒ _fibs(-n) // odd, negative numbers 
  }
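A quick usage sketch (the expected values are just the standard Fibonacci numbers):

_fibs.take(8).toList  // List(0, 1, 1, 2, 3, 5, 8, 13)
stream_fibonacci(10)  // 55
stream_fibonacci(-4)  // -3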

How does this compare in speed with our previous functions? Calling stream_fibonacci(1000) the first time results in zipping through the stream and calculating the sums. Somewhat surprisingly, this is not more expensive than our fastest version. Afterwards, though, we just “look up” the 1001st element, which requires traversing the stream’s first 1001 elements. But this is very fast, about 100 times faster on my machine (see the updated benchmarks).

Of course, this comes at the expense of memory storage for the 1001 fibonacci numbers in the stream. Depending on the application, this could be just fine. Of course, fibonacci numbers get big fast, so maintaining a large stream of them might not be good (the other versions only require two or so BigInts to be held in memory during the computation).

A Fibonacci A Day: Fibonacci(n) in O(lg(n)) steps

I am taking Martin Odersky’s Functional Programming in Scala class, and discovered (or rediscovered) that the Fibonacci function is discussed in SICP (Structure and Interpretation of Computer Programs, the book that Odersky seems to be basing much of his Functional Programming in Scala course on).

Even more interesting is that a Fibonacci function which runs in O(lg(n)) steps is presented — that is, it takes about lg(n) steps to compute the nth Fibonacci number, instead of n steps. It’s given as an exercise (Exercise 1.19). Bill the Lizard has been working through the SICP exercises, and his explanation of Exercise 1.19 is the clearest I’ve found. The code below might not make much sense without looking at the SICP chapter or his explanation, but here is the code in Scala, put in the form that I’ve been using — this computes fibonacci(n) for any integer (including negative integers). Running the benchmark indicates that this version is a lot faster than the straight tail recursive version.
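For those without SICP at hand, the key idea of the exercise (summarized here from Exercise 1.19, not my own result) is that the Fibonacci step is a special case of the transformation T(p, q): a ← bq + aq + ap, b ← bp + aq, with Fibonacci itself being p = 0, q = 1. Applying T(p, q) twice is the same as applying T(p', q') once, where p' = p^2 + q^2 and q' = 2pq + q^2. That squaring step is what lets recurse below halve count whenever it is even.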

  import scala.annotation.{switch, tailrec}

  def fibonacci(n: Int): BigInt = {
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(a: BigInt, b: BigInt, p: BigInt, q: BigInt, count: Int): BigInt =
        if (count == 0) b
        else if (count % 2 == 0) {
          val qq = q * q
          recurse(a, b, (p * p + qq), (2 * p * q) + qq, count / 2)
        } else {
          val aq = a * q
          recurse(b * q + aq + a * p, b * p + aq, p, q, count - 1)
        }
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 0, 1, m)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers 
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }

 

A Fibonacci A Day

The goal of this series was to come up with a general, efficient fibonacci function that goes beyond the typical computer science introduction. The version we developed is fast, efficient with respect to memory, and general. I’d be glad to hear comments.


The Series:

  1. A Fibonacci A Day — Introduction
  2. A Fibonacci A Day — Recursive Version
  3. A Fibonacci A Day — Special Pi Day Post
  4. A Fibonacci A Day — Iterative Version
  5. A Fibonacci A Day — Out of Bounds
  6. A Fibonacci A Day — Speedy Recursion
  7. A Fibonacci A Day — Golden Ratio Version
  8. A Fibonacci A Day — Memoization
  9. A Fibonacci A Day — Benchmarking
  10. A Fibonacci A Day — Extending to Negative Numbers
  11. A Fibonacci A Day — Fibonacci in O(lg(n)) Steps
  12. A Fibonacci A Day — Using Streams and a Fast Memoization Method

 

A Fibonacci A Day — Extending to Negative Numbers

We are now confident that the fast recursive version is the best of the algorithms we have explored — it is fast, and can handle large values of the Fibonacci function.

I haven’t been happy, though, with returning -1 when the function is called with a negative number — this feels like the kind of silent error we were trying to avoid in going to BigInts. Fortunately, it turns out that there is a natural extension of the Fibonacci sequence to negative numbers. If the nth Fibonacci number is defined as F(n) = F(n-2) + F(n-1), then we can rearrange this to F(-n) = (-1)^(n+1) F(n). For example, F(-4) = -F(4) = -3, while F(-3) = F(3) = 2.

So, this is the final version of the Fibonacci function: it supports any integer, up to the memory size of the computer, and is computed quite quickly. It is much better than our simple recursive and iterative versions, which were ripe for error and inefficiency.

 import scala.annotation.{switch, tailrec}

 def fibonacci(n: Int): BigInt = {
    val limit = if (n < 0) -n else n
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(i: Int, u: BigInt, v: BigInt): BigInt =
        if (i == limit) v
        else recurse(i + 1, v, u + v)
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 1)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers 
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }

I hope you’ve enjoyed this series.

 

A Fibonacci A Day — Benchmarking

The following table describes the speed and limits of the various Fibonacci algorithms I’ve created during this series. I ran them through a very simple benchmarking routine, and empirically tested them for the maximum size they can handle. Each benchmark was run 1000 times, except the first algorithm, which was run just once — we already know it runs too slowly. These were run on a MacBook Air (Late 2010).

Algorithm                    Milliseconds per call for F(46)   Max size
Simple Recursive             21,360.00                         40 or so
Classic Iterative (Int)      0.0040                            46
Classic Iterative (Long)     0.0040                            164
Classic Iterative (Double)   0.0030                            76
Classic Iterative (BigInt)   0.0080                            unlimited (except memory)
Tail Recursive (BigInt)      0.0060                            unlimited (except memory)
Log Recursive (BigInt)       0.0082                            unlimited (except memory)
Using Phi (BigInt)           0.0030                            70 or so
Memoization (BigInt)         0.0020                            unpredictable
Memoization Monad (BigInt)   0.0940                            unpredictable

Here is a table comparing the versions which allow really big fibonacci numbers, calculating F(20000) 100 times each.

Algorithm                           Milliseconds per call for F(20000)   Max size
Classic Iterative (BigInt)          24.23                                 unlimited (except memory)
Tail Recursive (BigInt)             23.50                                 unlimited (except memory)
Log Recursive (BigInt)              1.00                                  unlimited (except memory)
Stream (BigInt), first time         3.19                                  unlimited (except memory)
Stream (BigInt), subsequent calls   0.22                                  unlimited (except memory)

F(20000) is a really big integer, and a lot of this time is spent creating and destroying the memory associated with the integers. The thing to notice is that the iterative and the tail recursive versions take approximately the same amount of time — the recursive version actually runs a little more quickly than the iterative version. The log recursive version is more than twenty times faster than either, and the stream version is faster still on subsequent calls, though it may use too much memory. Note that even the first call of the stream version is faster than the iterative and tail recursive versions!

Still, there’s really no reason not to use the BigInt, log recursive version of this function — it’s both fast and accurate.
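For reference, a timing routine along these lines is easy to sketch (this is an illustrative stand-in, not the actual benchmarking code used for the tables above):

// Returns the average wall-clock milliseconds per call over `runs` calls.
def millisPerCall(runs: Int)(body: => Unit): Double = {
  val start = System.nanoTime()
  for (_ <- 1 to runs) body
  (System.nanoTime() - start) / 1e6 / runs
}

// e.g. millisPerCall(100) { fibonacci(20000) }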

(April 2, 2013: Updated to include the log recursive version.)

(April 13, 2013: Updated to include the stream version (http://willwhim.wpengine.com/2013/04/13/a-fibonacci-a-day-using-streams-and-a-fast-memoization-method/); rerun benchmarks for F(20000); and update conclusions.)

A Fibonacci A Day — Memoization


Although the tail recursive (iterative-process) version is stateless, it looks different from the standard mathematical definition. The simple recursive version is a very close mapping, but it’s extremely inefficient. One way to overcome the inefficiencies is to dynamically record (or memoize) previous calculations of F(n). The idea is simple — if we already know the values of F(n – 1) and F(n – 2), we can simply add them without having to redo the work of computing them.

Here is a version that does just that —

  val memo = scala.collection.mutable.HashMap[Int, BigInt]()
  def fibonacci_8(n: Int): BigInt = {
    def fib(i: Int): BigInt =
      memo.getOrElseUpdate(i, fibonacci_8(i))
    n match {
      case 0 ⇒ BigInt(0)
      case 1 ⇒ BigInt(1)
      case m ⇒ fib(m - 1) + fib(m - 2)
    }
  }

Note that the memoization table is defined outside the scope of the function, and an internal function, fib, looks for a memoized value in the table. If it finds one, it simply returns it; if it doesn’t, it makes a recursive call to the outer fibonacci function. Consider F(2). It looks up F(1), which is not yet in the table, so it calls the outer fibonacci function with 1. That returns BigInt(1), which is stored in the table and returned. It does a similar thing with F(0). It now knows enough to calculate the result, which it returns.

The main body of this function looks very similar to the mathematical definition. But there are two disadvantages to this version. One is that we still have mutable state (the memoization table). A clever, if somewhat convoluted, way to overcome this is to use a State monad — Tony Morris describes how to do this in his article, Memoization With State Using Scala. Another problem is that this version is somewhat less predictable — calling F(2000) out of the gate is likely to fail (with a stack overflow) unless most of the intermediate values have already been calculated. That is, fibonacci_8(2000) will probably fail, but for (i ← 1 to 2000) fibonacci_8(i) might not. I have more problems with the second issue than the first.

Side note: English already had a word for remembering things by rote, memorization, and a technical term, caching, as well. Why did memoization seem necessary?