When the Google Books Ngram project was made public, I got quite excited about having such a significant data set of human knowledge, one that, even if in a simplistic form, represents centuries of writing. It comprises hundreds of millions of n-grams extracted from books dating from 1500 to 2008. Naturally, the availability of the data created an itch for more insight into its content. The 5-Gram Experiment is a temporary project that helps scratch that itch.
Technologically, this experiment mashes up the Go language, MongoDB, and Ubuntu running on Google Compute Engine, using data processed from the Google Books Ngram project. The cluster of machines running in the backend handles about 350 million 5-grams, searching them for suffixes that follow the last 1-4 tokens (words or punctuation) entered. The result is the suffix most often observed in book pages according to the data set.
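To make the mechanics concrete, here is a minimal sketch of how such a lookup might be done in Go against MongoDB with the mgo driver. The database and collection names, the field layout, and the naive whitespace tokenization are assumptions for illustration only, not the experiment's actual schema or code.

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

// FiveGram is a hypothetical document layout: the first tokens of a 5-gram,
// the token(s) that complete it, and how often it was observed in books.
type FiveGram struct {
	Prefix string
	Suffix string
	Count  int
}

// bestSuffix returns the suffix most observed after the last 1-4 tokens typed.
// A real tokenizer would also split punctuation off words; strings.Fields is
// enough for this sketch. Note the lookup is case-sensitive, as in the experiment.
func bestSuffix(c *mgo.Collection, input string) (string, error) {
	tokens := strings.Fields(input)
	if len(tokens) > 4 {
		tokens = tokens[len(tokens)-4:] // only the last 4 tokens are considered
	}
	var result FiveGram
	// Pick the matching 5-gram with the highest observation count.
	err := c.Find(bson.M{"prefix": strings.Join(tokens, " ")}).Sort("-count").One(&result)
	if err != nil {
		return "", err
	}
	return result.Suffix, nil
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	c := session.DB("ngrams").C("fivegrams")
	suffix, err := bestSuffix(c, "the quick brown fox")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("most observed continuation:", suffix)
}
```

For a query like this to be fast over hundreds of millions of documents, the prefix field would presumably need an index, and the data would be sharded across the machines in the cluster.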
Note that if you keep typing, the full sentence may not make sense, because only the last 4 tokens (words or punctuation) are taken into account when looking for the best-matching 5-gram. Casing matters, too.
It's also worth mentioning that this is a temporary experiment put in place thanks to a grant of resources during the limited preview of Google Compute Engine. It won't last long.
Gustavo Niemeyer <firstname.lastname@example.org>