Efficient Peer-to-Peer Keyword Searching
Patrick Reynolds
Reynolds’ paper stars out by discussing the increased scalability and reliability of peer-to-peer networks with distributed has tables (DHTs). He points out however that there is no support for keyword searching. Because it is a distributed system a method of partitioning had to be used. Vertical partitioning was reasoned to be the best choice with peer-to-peer networks because it minimized the amount of network bandwidth. With vertical partitioning an index assigns each keyword to a single node, as opposed to dividing the list of documents for a keyword across many nodes. Another issue was the ranking of results. Word position and heuristics are currently the best solution given the P2P network structure. For the issue of updating, push model where new or modified content notifies the servers directly.
The major issue dealing with keyword searches is multiple keyword searches. What is need is the document intersection of the keywords. The basic (and inefficient approach) is for a client, who is doing a two keyword search, (keywords a and b) to send keyword a to Server A (which is the node for keyword a). Server A then sends it’s list of documents for keyword a to Server B, which then calculates the intersection with keyword b and returns the result to the client. Instead of sending the entire set of matching document IDS from server A to Server B the paper proposes three techniques to limit the bandwidth: Bloom filters, caches, and incremental results.
A Bloom filter uses hashing to summarize membership in a set. A Bloom Filter base on list A for keyword a ( F(A) ) would be sent from Server A, instead of the entire list. Server B would then calculate an approximate intersection that would return some false positives along with proper hits, but no false negatives. This list from Server B could then either be sent back to Server A to get rid of the false positives, or the list can be sent to the client, leaving the client with some incorrect results.
Cache essentially the same as the Bloom filter system described above, except the filters (such as F(A) ) could be cached on the other servers. The benefit of this system is obvious, as the filter would not need to be sent over the network all the time and thus bandwidth would be saved. However problems occur with the filter being not being up to date.
Incremental results is based on the concept of searching and returning only a specified, constant number of results. This system also uses Bloom filters, however the set A (the set of documents for given keyword a) is divided into chunks and a Bloom filter is sent for each chunk. The chunk size is set adaptively based on how many results are desired.