 Published on OpenP2P.com (http://www.openp2p.com/)


The Clip2 Report

Gnutella and the Transient Web

03/22/2001

The Web was born in a research environment of always-on, permanently connected computers that could just as easily support Web publishing as they could facilitate Web browsing. During the subsequent popularization of the Web, consumer PCs and other devices that are transiently connected to the Internet became prevalent. These transient devices, which came and went unpredictably and held dynamically assigned Internet addresses, broke the Web's symmetry, because they were only effective at browsing.

Clay Shirky aptly characterized them as the "dark matter of the Internet" - apparent only because of their tug on servers and never luminous as servers themselves. To the lament of Tim Berners-Lee and other Web pioneers, the Web became more of a one-to-many medium than the many-to-many communication system originally envisioned.

The Gnutella protocol restores the Web's original symmetry, enabling even transient computers to effectively participate as servers. It's far from a complete solution, and alternative systems may eclipse it. Nonetheless, this simple and idiosyncratic protocol is currently in the vanguard of the emergence of the transient Web. The transient Web has the potential to be every bit as disruptive as the conventional "permanent" Web, and possibly more so.

In this article, on the occasion of the first anniversary of Gnutella's release, we'll take a fresh look at this peer-to-peer network as the harbinger of something bigger than itself: a Web that includes transient devices as servers. We'll conclude by reviewing some recent notable developments on the transient Web, including the exposure of personal data, the Mandragore P2P virus and the BearShare Defender.

Gnutella's Relation to the Web

What do Gnutella and the Web have to do with each other? Isn't Gnutella just one of many P2P file-sharing systems?

Yes, Gnutella enables P2P file sharing, but take a closer look. With Gnutella, file transfer is accomplished via HTTP, the same protocol Web browsers and servers use to transfer Web pages and other data. Under the hood, each Gnutella application contains a no-frills Web-server component for serving files and a primitive browser-like element for retrieving them.

The relation between Gnutella and the Web is therefore quite simple: Gnutella hosts are Web sites, albeit transient ones, and downloading a file from a Gnutella host is technically equivalent to fetching a file from a Web site. Most Gnutella applications combine server and client functionality into a package known as a "servent." Even users who share no files, and who instead simply use Gnutella to search and download, are running (empty) Web sites while they run servents. This would be equivalent to running a Web server each time you started your browser.
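To make the "servent as Web site" idea concrete, here is a minimal sketch in Python of the two halves described above: a no-frills HTTP server exposing one shared folder, and a primitive client that retrieves a file with a plain GET. The function names (`start_share_server`, `fetch`) and the flat URL layout are assumptions for illustration only; they are not taken from any Gnutella application, and a real servent uses its own request paths.

```python
import http.client
import http.server
import tempfile
import threading
from functools import partial
from pathlib import Path


def start_share_server(shared_dir, port=0):
    """Serve shared_dir over HTTP on localhost; port=0 picks a free port."""
    handler = partial(http.server.SimpleHTTPRequestHandler,
                      directory=str(shared_dir))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(host, port, name):
    """Retrieve one shared file with a plain HTTP GET, as a browser would."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/" + name)
        resp = conn.getresponse()
        if resp.status != 200:
            raise OSError(f"HTTP {resp.status} for {name}")
        return resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        Path(shared, "notes.txt").write_text("hello from a transient site")
        server = start_share_server(shared)
        port = server.server_address[1]
        print(fetch("127.0.0.1", port, "notes.txt").decode())
        server.shutdown()
        server.server_close()
```

A user who shares nothing would simply be running the server half over an empty folder.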

The transient nature of the sites is, of course, terribly important. As a protocol, Gnutella's specification of HTTP for file transfer is rather mundane. After all, users have always been able to run Web server programs on their PCs, but the problem is that these sites and their contents were impossible to discover by other users due to the transience of the PCs. What's novel is that the Gnutella protocol addresses the problems of how to discover and search transient Web sites.

Why these problems deserve addressing is an entirely different story. For purposes of this article, the key point is that Gnutella can be viewed as augmenting HTTP with an additional layer of intersite plumbing that enables transient Web sites to be found and searched. This plumbing supports:

  1. the broadcast transmission of queries across transient sites and the routing back of responses;
  2. the broadcast transmission of "is anybody out there?" pings and the routing back of "I'm here" responses.

For more detail on how Gnutella works, see Andy Oram's early article.
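The ping side of this plumbing can be sketched as a flood over the graph of connected hosts. The Python below is a simplified model, not the wire protocol: `links` and `discover_hosts` are hypothetical names, the network is an in-memory dictionary, and the hop limits a real network needs are left out.

```python
from collections import deque


def discover_hosts(links, origin):
    """Flood a ping from origin and list the hosts that would answer.

    links maps each host name to the set of hosts it is connected to.
    """
    seen = {origin}              # hosts that have already handled this ping
    queue = deque([origin])
    pongs = []
    while queue:
        host = queue.popleft()
        for neighbor in sorted(links.get(host, ())):
            if neighbor in seen:     # duplicate ping: dropped, not re-sent
                continue
            seen.add(neighbor)
            pongs.append(neighbor)   # neighbor answers "I'm here"
            queue.append(neighbor)   # and passes the ping along
    return pongs


if __name__ == "__main__":
    links = {
        "A": {"B", "C"},
        "B": {"A", "D"},
        "C": {"A", "D"},
        "D": {"B", "C"},
        "E": {"F"},              # an isolated pair, never reached from A
        "F": {"E"},
    }
    print(discover_hosts(links, "A"))
```

Hosts that are not connected to the origin's part of the network, like the isolated pair in the example, never hear the ping and so are never discovered.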

The Gnutella network is itself similar to the Web in many ways. Open and decentralized, there is no single responsible company, no central server and no single point of failure. Gnutella is a protocol for which many developers have created compatible code, and a Gnutella network exists only to the extent these programs are running and communicating with one another. There is a general public network, and private networks can co-exist in isolation or attached to the public one. Because of this, Gnutella application developers feel themselves akin to makers of Web browser, server, proxy and other applications. They are builders of interoperable software for a network larger than the sum of its parts. In fact, they are makers of transient-Web applications. The Gnutella network is simply one form of the transient Web.

Contrasting the Transient and Permanent Webs

To better understand the transient Web, let's contrast it with the conventional permanent Web on some key points.

The Web is Web-like because of hyperlinks. Hyperlinks work under the assumptions that content remains accessible at a fixed URL and that a server specified by the URL is available to serve the content. Unfortunately, both assumptions fail on the transient Web. The machine at a given IP address may not be there tomorrow, or in one hour or five minutes or one second. For this reason, with a few exceptions, you cannot browse your way from the permanent Web to the transient Web, nor will you find transient Web sites indexed by conventional search engines.

In fact, the transient Web is presently mostly devoid of the sense of "place" that dominates the vocabulary of the permanent Web. Normally, we "visit" sites specified by a "location" or "address." When we can't enter a fixed address or follow a static hyperlink, this sense of place vanishes. An alternative sense of "medium" fills its shoes.

Instead of laboring to locate a particular site carrying a sought-after piece of content, we currently turn to the transient Web primarily as a medium. A search goes out into the ether and answers come back. The simplicity of this is so desirable that search engines arose on the permanent Web to provide precisely the same experience, although the execution under the hood is rather different. On the transient Web instantiated by Gnutella, a search engine is built into the infrastructure. Without it, no one would find anything.

Another way to find transient sites and their content would be to maintain a resource registry - essentially a dynamic DNS that provides a basis for location-independent URLs instead of location-dependent URLs (see FirstPeer and XDegrees for more on this approach). A single registry, however, implies centralization and an external dependency, both anathema to Gnutella, which opts for a more decentralized approach with minimal reliance on outside systems.

The transient Web's "sense of medium" has a profound effect on marketing and distribution. As many dot-coms discovered on the conventional Web, just because you build it doesn't mean that they'll come - and if by good fortune they do come, you may not be able to handle the load. Promotional expense is required to expose your address and bring people to your site, and a key promotional tactic is to be listed as prominently as possible in search engines. In a system dominated by a sense of place, you must distribute as many signposts as possible, and reach doesn't come cheap.

Gnutella's search scheme makes the query stream a publicly accessible resource. By tuning in to the query signal on the ether, anyone can hear a torrent of broadcast searches and route back responses with results advertising their own content. In short, on the transient Web enabled by Gnutella, reach is nearly free. Moreover, since content is more important than its source, users are willing to obtain it from almost any site. LimeWire's "Smart Downloader" feature even automates this process, retrying multiple sources of the same content item until a complete download succeeds. Users who download content can easily become re-distributors, leading to the phenomenon known as "superdistribution." The upshot: On the transient Web, distribution is almost free.
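The retry-until-complete behavior can be illustrated with a short Python sketch. This is not LimeWire's implementation; `download_from_any`, its `fetch` callback and the use of an expected size as the completeness check are all assumptions made for the example.

```python
def download_from_any(sources, fetch, expected_size):
    """Try each source in turn until one yields a complete file.

    fetch(source) is a caller-supplied function that returns bytes, or
    raises OSError if the source has gone away.
    """
    for source in sources:
        try:
            data = fetch(source)
        except OSError:
            continue                 # transient site vanished; try the next
        if len(data) == expected_size:
            return source, data      # complete download
        # Truncated transfer: fall through and try another source.
    return None, None                # every source failed or was incomplete
```

Because any source holding the same content item will do, the caller never needs to care which transient site finally delivered the file.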

Gnutella's search capability is not perfect, of course. From the point of view of the searcher, there is no guarantee your query will reach the sites holding what you seek, and the results that you do receive will arrive in a jumble. From the point of view of the content provider, there is no guarantee you will hear every query you're interested in hearing; maximum possible reach might still take some effort.

A number of projects are addressing these shortcomings. For example, search engine Gnutella.it attempts to comprehensively track content on the transient Web. A search issued at Gnutella.it first hits the engine's local database of transient site content listings, and as a more time-consuming fallback, it simultaneously broadcasts the query to the Gnutella network. Because Gnutella.it holds relatively fresh addresses of transient sites, it is one exception to the rule that the transient Web cannot be reached from the permanent Web. Addressing the other side of the coin, at Clip2 we have continuously studied the distribution of query traffic across the network to inform strategies for connecting in ways that hear as many queries as possible.

While Gnutella's search functionality makes queries public, it keeps them anonymous to a degree. Each query is assigned a unique ID at its source. As queries are handed from site to site across the network, each site keeps a temporary record in memory of which neighboring site handed it which query, but no record is passed of who originated the query. In this way, query responses have to route back through the chain, and only your immediate neighbors can correlate your IP address with your queries. The privacy of your queries is therefore dependent upon the hosts to which you are connected, which are likely to be operated by random users such as yourself. By contrast, on the permanent Web the privacy of your queries is dependent upon the policies of the search engine you use.
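The bookkeeping described here is small enough to model directly. In the Python sketch below, each host remembers only which neighbor handed it a given query ID, and responses retrace that chain one hop at a time. The `Host` class, its method names and the in-memory "network" are assumptions for illustration, not the protocol's actual message format.

```python
class Host:
    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.files = set()
        self.came_from = {}   # query ID -> neighbor that handed us the query
        self.results = []     # responses to queries this host originated

    def query(self, query_id, term, sender=None):
        if query_id in self.came_from:
            return                          # already seen; drop the duplicate
        self.came_from[query_id] = sender   # None marks "I originated this"
        if term in self.files:
            self.respond(query_id, self.name)
        for neighbor in self.neighbors:
            if neighbor is not sender:
                neighbor.query(query_id, term, sender=self)

    def respond(self, query_id, holder):
        back = self.came_from[query_id]
        if back is None:
            self.results.append(holder)     # reached the originator
        else:
            back.respond(query_id, holder)  # one hop back along the chain


def link(a, b):
    """Connect two hosts as neighbors of each other."""
    a.neighbors.append(b)
    b.neighbors.append(a)


if __name__ == "__main__":
    a, b, c = Host("A"), Host("B"), Host("C")
    link(a, b)
    link(b, c)
    c.files.add("song")
    a.query("q1", "song")
    print(a.results)                 # hosts that answered A's query
    print(c.came_from["q1"].name)    # all C knows is that B handed it q1
```

In the example, C learns only that the query arrived from B. Only B, the originator's immediate neighbor, can associate A with the query.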

The modest measure of anonymity afforded to queries does not extend to downloading. Some Gnutella applications, e.g. BearShare, keep logs in standard Web server format of all file requests and downloads. This strengthens the concept of a running instance of BearShare being effectively equivalent to a transient Web site, and it gives the BearShare operator as much visibility into his site's traffic as any Web site operator has: time-stamped records of IP addresses and the files they requested and downloaded. Anonymity in downloading is strengthened to the degree transient Web site operators do not keep or do not review logs, just as with conventional Web sites.
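To show how much such a log reveals, the Python sketch below parses one entry, assuming that "standard Web server format" means the Common Log Format. That assumption, the `parse_log_line` helper and the sample line (which uses a documentation-only IP address) are illustrative and are not taken from BearShare.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict describing one request, or None if the line is malformed.

    Request paths containing spaces are not handled by this simple pattern.
    """
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    # Invented sample entry, for illustration only.
    sample = ('192.0.2.17 - - [22/Mar/2001:10:15:42 -0800] '
              '"GET /song.mp3 HTTP/1.0" 200 3481600')
    record = parse_log_line(sample)
    print(record["ip"], record["time"].isoformat(), record["path"])
```

Each parsed record pairs a time stamp and a requesting IP address with the file transferred, which is exactly the visibility the operator of a conventional Web site has.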

"Sharing" files by hosting them on a transient Web site is only somewhat more anonymous than doing so on a permanent Web site. It is not difficult to detect and track new sites on the public network. Clip2, LimeWire and other endeavors run automated systems that continuously discover host addresses. GnuFrog, a Web-based gateway to Gnutella, discovers transient sites and conveniently rank-orders them by the number of files each has available. In fact, simply connecting a Gnutella application to the network will result in passive discovery of host addresses. BearShare, LimeWire and other servents support browsing of the content available on a given site, making it easy to see the entirety of a user's shared files. In the end, the anonymity enjoyed by the operator of a transient Web site is no stronger than the ISP's record-keeping and policies for tracking which customers were assigned which IP addresses, and when.

A scarcity of hyperlinks, a lack of sense of place, a built-in search engine, negligible marketing costs, negligible distribution costs, semi-anonymous broadcast querying, and downloading and sharing anonymity that depends on other users and ISPs: even from this partial survey, the transient Web as realized through Gnutella clearly introduces some new wrinkles relative to the Web we're used to. The fun is only beginning, however.

Recent Tales from the Transient Web

The exciting untold adventures of your cookies and old e-mails

The fact that transient Web servers run on end-user PCs puts end-user data at risk of unauthorized or unintentionally authorized exposure. Many users have a limited understanding of what they are doing by running a Gnutella application, and they may wind up exposing much more content to the network than they intended. CNET reported in early February that browser cookie files were not hard to find on the Gnutella network.

A search for Netscape's cookies.txt usually turns up a number of hits; searching for the name of a popular Web site may return corresponding Internet Explorer cookie files. Presumably, these "private" files are available on the public network because users unintentionally shared the folders containing them. Worse yet, once the files are made available, other users may download and redistribute them. Once the cat is out of the bag, it's nearly impossible to get it back in. Less frequently, Clip2 has found instances of Microsoft Outlook data files on transient sites. A single outlook.pst file can be extremely compromising in that it contains a combination of e-mail messages, calendar data, contact records, notes and other personal data. Users are well advised to exercise caution when configuring what they "share."
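One practical way to exercise that caution is to scan a folder before sharing it. The Python sketch below walks a directory tree and flags file names matching a short list of patterns. The list covers only the two examples named above (Netscape's cookies.txt and Outlook .pst files); `RISKY_PATTERNS` and `find_risky_files` are hypothetical names, and a real check would need a much longer list.

```python
import fnmatch
import os

# Illustrative patterns only, all lowercase: the two file types named above.
RISKY_PATTERNS = ["cookies.txt", "*.pst"]


def find_risky_files(shared_dir, patterns=RISKY_PATTERNS):
    """Return sorted paths under shared_dir whose names match a risky pattern."""
    hits = []
    for root, _dirs, files in os.walk(shared_dir):
        for name in files:
            lowered = name.lower()      # compare case-insensitively
            if any(fnmatch.fnmatchcase(lowered, p) for p in patterns):
                hits.append(os.path.join(root, name))
    return sorted(hits)
```

Running this over a candidate folder and finding any hits is a strong hint that the folder, or one of its parents, should not be shared as-is.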

Is it possible that even after setting the "shared" folders with care, data can be compromised unexpectedly? Yes, Gnutella applications could have security vulnerabilities - cookie and other files may be exposed due to bugs in servents - but no such faults have been identified thus far. This concern is a good argument for installing well-supported, well-tested Gnutella servents instead of "unknown" apps. But what if an unknown app installed itself?

The Tale of Mandragore

P2P and Gnutella aficionado Ben Houston sounded the alarm in late February that Gnutella searches were turning up odd results: Windows-executable files consistently 8,192 bytes in size and with names that precisely matched any search a user initiated. When a user downloads and runs one of these files, nothing appears to happen. Behind the scenes, however, a program known as Mandragore installs itself on the PC and hides.

There may be variants, but the version Clip2 studied on Windows 9x/ME comes out of hiding when a Gnutella servent listening on port 6346 is started. At that point, Mandragore initiates a connection to the servent - it's readily visible as an "incoming" connection in the servent's connection screen, with an address identical to that of the local host. From this point on, the Gnutella servent thinks Mandragore is another Gnutella host, and follows the usual procedure of forwarding all incoming queries to it from the network. Mandragore returns the aforementioned misleading response to each query and the unwitting servent passes these responses back out to the network. In this way, Mandragore advertises itself so that it will be downloaded again and thus propagate to new hosts. Although this is the extent of the known actions of Mandragore, and the program is therefore mostly harmless, it points the way toward substantially more malicious developments.

The first line of defense against programs like Mandragore is to heed the old refrain to not run any program that you do not have a basis to trust. The last line of defense is a good PC firewall. In Clip2's testing, Zone Labs' ZoneAlarm product detected Mandragore's initial attempt to establish communications and offered to block it.
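Between those two lines of defense, the signature described earlier lends itself to a simple screening heuristic on search results. The Python sketch below flags a hit whose size is exactly 8,192 bytes and whose name is the query itself plus an executable extension. The `.exe` extension and the function name `looks_like_mandragore` are assumptions, and since variants may exist, a negative result proves nothing.

```python
MANDRAGORE_SIZE = 8192   # size reported for the studied version, in bytes


def looks_like_mandragore(query, hit_name, hit_size):
    """Heuristic: an 8,192-byte executable named exactly after the query."""
    stem, dot, ext = hit_name.rpartition(".")
    return (
        dot == "."
        and ext.lower() == "exe"              # assumed Windows executable
        and hit_size == MANDRAGORE_SIZE
        and stem.lower() == query.strip().lower()
    )
```

A servent or a cautious user could apply such a check to each result before offering it for download, treating a match as a reason to skip the file.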

Fortunately, not all Gnutella development in recent weeks has been nefarious.

The Reflective Defender

The scalability of Gnutella's approach to enabling the transient Web is a topic of much interest and frequent discussion. Clearly, a system involving broadcasting messages has the potential to easily become so thick with traffic as to overload the individual nodes on the network, and Gnutella has a history of such trouble.

One solution is to tier the network so that the more capable nodes carry more traffic and less capable ones carry less. To that end, Clip2 introduced the "super peer" concept of a Gnutella proxy-and-index server, embodied in the Clip2 Reflector program. Since the Reflector's introduction, a new wave of servent developers has discussed endowing their servents with Reflector-like capabilities. Recently, in a significant new development that might positively signal the arrival of a third generation of Gnutella applications and negatively signal the proprietary segmentation of the network, one developer has gone and done it.

BearShare 3.0.0, as of this writing available as an alpha build at BearShare.net, includes a new set of features collectively termed Defender. This version of BearShare can operate in three different modes: peer, client or server. In peer mode, it operates like any other Gnutella servent; you can think of this as BearShare Classic mode. The client and server modes are, however, rather different.

Figure 1. BearShare lets you choose between Client, Peer, and Server modes.

In client mode, BearShare broadcasts queries to discover BearShare hosts running in server mode, which might be called Defenders. The BearShare UI contains a Servers tab that lists currently known Defenders as discovered via the querying process. BearShare does not rely on a centralized listing of currently running Defenders. Instead, it relies on the same sort of decentralized ping-pong process normal Gnutella nodes use to discover each other.

These pings and pongs are encoded as special queries and query responses so that they will be relayed by other types of Gnutella applications and earlier versions of BearShare, although there is no guarantee that future versions of other applications will relay this BearShare-specific traffic. While in client mode, BearShare can make one outgoing connection to a Defender, and the client takes no incoming connections from other hosts. When it needs functionality not addressed by the Gnutella protocol, BearShare utilizes a closed, proprietary protocol for client-server communication. This enables clients connected to a given Defender to chat with each other, see how many other clients are currently connected, and so on.

In server mode, a BearShare user can define a server name and customize other characteristics of his Defender. In particular, he can configure it as a "public access" or "private access" Defender. In public-access mode, the Defender operator can specify a language for his chat room and a content subject for his server; these are relayed in the Defender's advertisement of itself to clients. Private-access servers can be secured with a password.

Figure 2. Setting up a Defender.

Just as with the Reflector, the operator can choose to have his Defender make outgoing connections to the public network, in which case it serves as a proxy for connected clients. Alternatively, he can run the Defender as the hub of an isolated client-server network. A key difference between the Defender and the Reflector is that the former is compatible only with BearShare clients, thereby maximizing client-server features, while the latter strives for open interoperability.

To stimulate the use of the client and server modes, the current alpha version defaults users with low-speed connections to client mode and users with high-speed connections to server mode. In other words, a BearShare user might run a Defender without necessarily realizing it. The objective is for there to be a large number of end-user-operated, transient Defenders running at any one time with many incoming connection slots available for BearShare users in client mode.

In November, Clip2 wrote that Gnutella network scalability would be aided by the regular use of Reflectors and adoption of improved connection logic. As I described more recently in Gnutella: Alive, Well, and Changing Fast, the second generation of Gnutella applications implemented connection logic that appears to have positively impacted network performance.

With the third generation of applications, we may see a meaningful number of distributed and transient Reflector-style nodes likewise produce additional positive impact on network performance. Through the self-organized tiering that results from connection logic and the distributed brokerization of Gnutella's originally fully decentralized P2P model, we see how a decentralized system can evolve to accommodate greater load.


Kelly Truelove is an independent research analyst who, via Truelove Research, covers peer-to-peer technology with a focus on P2P content search, storage, and distribution networks. He is regarded as a leading expert on consumer file-sharing systems, which he covers with a data-driven approach.


Copyright © 2009 O'Reilly Media, Inc.