
Once a company reaches a certain level of processing power and storage, it can tackle projects that were formerly impossible.

One of Google’s latest efforts to provide access to any kind of information involves time traveling through old newspaper issues.

The image in this post is taken from the edition of the St. Petersburg Times from the day I was born. And of course, this is just one of the newspapers they have scanned so far.

I didn’t do a lot of testing, but I found issues from as far back as 1908 (and no, I’m not that old).

These are scanned images, but they obviously ran OCR on the full text, as it is entirely searchable. Just do the math for a single newspaper averaging 12 pages (a really thin one):
12 pages x 365 issues a year (ignoring leap years) x 100 years (just to round up) x 500 words per page (quite a conservative average) = 219 million words. And then multiply this by the number of newspapers that agree to be added to the database (a number increasing by the minute).
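
For anyone who wants to play with the numbers, here is the same back-of-the-envelope estimate as a small Python sketch. The constants are the rough guesses from the paragraph above, and the 1,000-newspaper count in the last line is a made-up figure for illustration only, not anything Google has published.

```python
# Rough guesses taken from the estimate above; none of these are measured values.
PAGES_PER_ISSUE = 12    # a really thin newspaper
ISSUES_PER_YEAR = 365   # one issue a day, ignoring leap years
YEARS = 100             # rounded up
WORDS_PER_PAGE = 500    # a conservative average


def words_in_archive(newspapers: int = 1) -> int:
    """Estimate the total number of words for a given number of newspapers."""
    return PAGES_PER_ISSUE * ISSUES_PER_YEAR * YEARS * WORDS_PER_PAGE * newspapers


if __name__ == "__main__":
    # A single newspaper: 219,000,000 words.
    print(f"{words_in_archive():,}")
    # Hypothetical archive of 1,000 newspapers (an assumed count).
    print(f"{words_in_archive(1000):,}")
```

Change any of the constants to see how quickly the total grows; even with these thin-newspaper assumptions, a few hundred titles already puts the archive in the tens of billions of words.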

Now just imagine the number of experiments you can run over such a massive full-text database: comparing certain types of news, advertisements, etc., across different papers on the same dates, or finding correlations between papers in different languages, and more. Using these brute-force techniques over their web archive, they built one of the best free machine translation services available to date, so go figure what’s next.