The new Google Desktop has privacy watchdogs barking. Enough complaining -- what's your solution? I offer a couple of information condoms.
These are pretty simple-minded ideas, yet they each have their merits. They are based on how a search index is an abstract of a document's contents. Somebody smarter than me should be able to hatch something better and put this issue to bed.
A page of words can say a lot, until you randomize the words. For purposes of search, it is enough to know what page or document contains my search query. So create an index that treats word order very loosely. I won't get readable snippets of text in search results, but I wouldn't mind. How about a thumbshot, instead?
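To make the loose-word-order idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names build_index and search are made up, and the tokenizing rule is deliberately naive. The index records which documents contain each word and nothing else, so neither positions nor counts survive.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")  # naive tokenizer, an assumption for this sketch


def build_index(docs):
    """Map each lowercased word to the set of document names containing it.

    No positions or counts are stored, so word order cannot be recovered
    from the index.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in set(WORD.findall(text.lower())):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

A query like search(index, "brown fox") tells you which documents to open, but the index alone cannot give you a readable snippet, which is the trade-off described above.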
This script of mine randomizes the text on web pages, to give you an idea of how effective this obfuscation is. It chunks words using block-level tags.
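My script is not reproduced here, but the same idea can be sketched in a few lines of Python. Treat this as a rough stand-in, not the original: the BLOCK_TAGS list, the BlockShuffler class and the shuffle_blocks function are names I made up for the example, and it returns shuffled text chunks instead of rewriting the page in place.

```python
import random
from html.parser import HTMLParser

# Assumed list of block-level tags; extend it to taste.
BLOCK_TAGS = {"p", "div", "li", "td", "th", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockShuffler(HTMLParser):
    """Collect words per block-level chunk and shuffle each chunk."""

    def __init__(self, rng=None):
        super().__init__()
        self.rng = rng if rng is not None else random.Random()
        self.words = []   # words seen since the last block boundary
        self.blocks = []  # finished, shuffled chunks

    def flush(self):
        if self.words:
            self.rng.shuffle(self.words)
            self.blocks.append(" ".join(self.words))
            self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Note: text inside <script> and <style> is not filtered out here.
        self.words.extend(data.split())


def shuffle_blocks(html, seed=None):
    """Return one shuffled string of words per block-level chunk."""
    parser = BlockShuffler(random.Random(seed))
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text outside a block tag
    return parser.blocks
```

Every word of a paragraph is still there for an indexer to find, but the sentences are gone. Inline tags such as b or a do not start a new chunk, so their words get mixed in with the rest of the enclosing block.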
If that's not enough, then consider running each word through a one-way hash before entering it into the index. Be sure to stem the words first. When you go to search this index, stem and hash your query the same way. Salt your hash or get as fancy as you want. This way the server hosting your index really has no idea what you're storing.
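Here is one way the stem-then-hash step might look in Python. It is a sketch under loud assumptions: crude_stem is a toy suffix stripper standing in for a real stemmer such as Porter's, the function names are invented for this example, and I use a keyed hash (HMAC-SHA256) as the "salted" hash, with the salt kept on your own machine.

```python
import hashlib
import hmac
import re


def crude_stem(word):
    """Toy stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def token(word, salt):
    """Stem a word, then hash it with a secret salt (HMAC-SHA256)."""
    stem = crude_stem(word.lower())
    return hmac.new(salt, stem.encode("utf-8"), hashlib.sha256).hexdigest()


def tokens(text, salt):
    """Return the set of hashed tokens for a document or a query."""
    return {token(word, salt) for word in re.findall(r"[A-Za-z0-9]+", text)}


def matches(query, doc_tokens, salt):
    """True if every hashed query token appears among the document's tokens."""
    return tokens(query, salt) <= doc_tokens
```

The server would store only the output of tokens(), and you would send it tokens() of your query. The salt matters: with an unsalted hash, anybody could hash a dictionary of common words and read your index back, so the salt has to stay secret and off the server.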
FWIW, that's my $0.02 on how to solve the remote privacy problem. Shoot these ideas down, invent your own, but please let's talk about a solution to this issue. "Don't use it" isn't good enough. I want darknet/p2p search!
Sid Steward is a programmer, writer and entrepreneur. He maintains the PDF Toolkit and wrote PDF Hacks.
oreillynet.com Copyright © 2006 O'Reilly Media, Inc.