Well, it happened. I didn’t think it would come out so soon, but here it is: Google Patent Search. I expected it, since it’s a natural application of Google’s technology and fits Google’s overall direction. The search interface looks lighter and easier to use than the one on the USPTO site. You can fill in the search form with your search words, but essentially the search string is a combination of keyword:value pairs.
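To make the keyword:value idea concrete, here is a minimal Python sketch of how such a search string could be assembled. The field names used below (`inventor`, `number`) are placeholders I made up for illustration, not a list of operators Google actually supports, and `build_query` is a hypothetical helper.

```python
def build_query(terms, fields):
    """Combine free-text terms with keyword:value pairs into one search string.

    The field names passed in are hypothetical; check the search form
    for the keywords the service really accepts.
    """
    parts = list(terms)
    for key, value in fields.items():
        value = str(value)
        # Quote values containing spaces so they stay attached to their keyword.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{key}:{value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["engine"], {"inventor": "John Smith", "number": 123}))
```

The point is only that the form and the raw string are two views of the same thing: whatever you type into the form fields ends up as pairs in a single query.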
What is significantly different in Google Patent Search is that you can now search the full text of U.S. patents going back to the 1790s. That’s approximately 7 million issued patents. Google did this using the same technology that powers Google Book Search: it took the entire image database of U.S. patents and extracted the plain text from all relevant sections of the patents. That has made full-text search possible for patents that previously existed only as page images at the USPTO.
The service is in beta, and like many other Google services it may stay that way permanently. Obviously some features are missing, like the ability to print or save patents. And there are little things that need to be ironed out as it matures. One is mistakes in character recognition: sometimes a P is recognized as an F. It would be funny if it didn’t make finding relevant patents really difficult. Even without looking hard, I’ve also seen B in place of E. Another is that words divided at the end of one line and continued on the next cannot be found; it appears they’re not recognized as full words at all.
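Both problems can be worked around or fixed with fairly simple text processing. The Python sketch below is my own illustration, not anything Google is known to do: `ocr_variants` generates alternative spellings of a search word using only the two confusions I noticed (P read as F, E read as B), so you could search for the misread forms too, and `join_hyphenated` shows how a word split across a line break could be glued back together before indexing. Both function names and the confusion table are assumptions.

```python
import re
from itertools import product

# Only the two misreadings observed above; a real table would be much larger.
CONFUSIONS = {"P": "F", "E": "B"}


def ocr_variants(word, max_variants=50):
    """Return the word plus the spellings OCR might have produced for it."""
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in word]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break  # cap the list, since it doubles with every confusable letter
    return variants


def join_hyphenated(text):
    """Rejoin words split by a hyphen at the end of a line.

    Caveat: this also merges genuine compounds that happen to break
    at the hyphen, such as "well-" / "known".
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


if __name__ == "__main__":
    print(ocr_variants("Pump"))
    print(join_hyphenated("char-\nacter recog-\n  nition"))
```

Searching for every variant by hand is tedious, which is why this kind of cleanup really belongs on the indexing side.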
It’s interesting that Google presents patents the same way it presents books. You always get to see images of all pages of a patent. You also get the bibliographic data and text of a patent when the USPTO has them available. When you do a full-text search, the page images you see have the search words highlighted. For old patents you don’t get the full text, because the USPTO does not have it. It’s obvious that Google does have the full text, searches it, and then generates the images with the highlights, yet does not give the full text out. That is because the service is built on the Google Book Search platform, where Google does not want to give away the text, to avoid possible issues with publishers and rights owners. I’m not convinced it should be the same way with patents, though.
Does anyone know what technology Google uses for character recognition (OCR)?