Tuesday, February 08, 2011

An Error on the Internet!

So Google ran this experiment where it counted how often words appear in books and websites.  There are well over a trillion words in the data set.  A subset of that data is used to power an online N-gram viewer.  I was comparing the frequency of TCP and Internet, and I saw something funny.  So then I looked at just Internet...

Frequency of the word Internet by year in published books

Notice the bump around 1900?  If we look a little closer...

Frequency of the word Internet by year in published books (closer view of the bump around 1900)

And if we look into some of the books that were searched, it turns out that a common abbreviation for international, "internat.", was erroneously identified as Internet.  The result is what looks like someone writing about a concept that would not be invented for another 70 years.

This is another example where processing these large amounts of data may benefit from crowdsourcing.  If the algorithm had been told that the Internet didn't exist in 1900, and that the word is therefore unlikely to occur in literature from that time, then it might have picked "internat." instead.
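Here is a minimal sketch of that idea in Python. It is not how the N-gram data was actually built. It assumes the recognizer produces several candidate readings with confidence scores, and that we have a hypothetical table of "earliest plausible year" per word (the kind of thing a crowd could supply). The names `EARLIEST_USE`, `pick_candidate`, and `penalty`, the scores, and the 1970 cutoff (1900 plus the 70 years mentioned above) are all made up for illustration.

```python
# Hypothetical table: earliest year a word could plausibly appear in print.
# 1970 is only the rough figure from this post, not a researched date.
EARLIEST_USE = {"internet": 1970}


def pick_candidate(candidates, year, penalty=0.01):
    """Pick the most plausible reading for a word in a book from `year`.

    candidates: dict mapping each candidate reading to a confidence score.
    Readings that would be anachronistic for `year` have their score
    multiplied by `penalty` before the best one is chosen.
    """
    def adjusted(item):
        word, score = item
        first_year = EARLIEST_USE.get(word.lower().rstrip("."))
        if first_year is not None and year < first_year:
            return score * penalty  # anachronism: heavily discount
        return score

    return max(candidates.items(), key=adjusted)[0]


# Made-up scores: the recognizer slightly prefers "Internet".
guesses = {"Internet": 0.6, "internat.": 0.4}
print(pick_candidate(guesses, 1900))
print(pick_candidate(guesses, 2005))
```

With these made-up numbers, the 1900 book gets "internat." and the 2005 book still gets "Internet", because the penalty only kicks in when the publication year is before the word's earliest plausible year.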

I almost forgot the obligatory xkcd cartoon ("Duty Calls", #386).