Gene Kan & Mike Clary on Sun's Infrasearch Buy

by Richard Koman

Sun Microsystems announced March 6 that it would acquire Infrasearch, a real-time P2P searching service invented by Gnutella lead developer Gene Kan. With big-time investors such as Marc Andreessen and former Netscape exec Mike Homer, Infrasearch might once have been the next big crazed Internet IPO. But times have changed, and the prospect of making gazillions in the stock market seems remote at best right now. So being part of Sun and Bill Joy's Project Juxtapose no doubt looks good to Kan. To get a closer look at the acquisition, we interviewed Kan and Sun VP Mike Clary, who heads Project Juxtapose, or JXTA.

To start with, Gene, can you describe the Infrasearch technology?

Gene Kan: OK, Infrasearch is a distributed-search engine that kind of leverages a lot of the assets behind the peer-to-peer computing style. In my mind, in a peer-to-peer environment there are three critical resources: network capacity, processing power and hard disk space (essentially storage capacity). Infrasearch leverages two of those axes: processing power and storage capacity.

The basic idea is that Infrasearch is able to effectively turn all of the computers on a network into a collective brain, if you will, in disseminating the information that is available on each of those computers. And that's something that's really unique when compared to the World Wide Web. On the Web, the hosts of information are in fact treated as second-class citizens when it comes to answering requests based upon the information that is located on each Web host. And by that I mean that the information that is residing on each host must first be interpreted by a crawler and so on before any kind of questions can be answered about that information.

That doesn't work in a peer-to-peer world, for at least two reasons. The first is that peer networks are extremely transient and the information available on those networks is constantly changing, not only because the computers are appearing and disappearing all of the time, but because the information itself is changing at a much more rapid rate. And the second thing is that on a peer network it's important to treat every host as a first-class information provider, because the key idea behind peer computing is that each node in the network has the possibility to make a very important contribution to the network as a whole.

What's the impact on distributed searching when every node is a first-class citizen?

Kan: The key advantage is the ability for much finer-grain information to be found. When you ask a question, it's answered by a very localized database.

In Infrasearch's real-time searching, you can search peers not only for static data, but for a machine to calculate an equation, for example, or process some set of data. Is that right?

Kan: Right. The query is live, in a sense. It's not simply compared against a list of words. Your query is actually distributed in real time, which means that each provider of information has the capability to interpret that query and act upon it. And apart from the qualitative issues of peer searching, there's really a key quantitative issue, which is in a peer-to-peer computing world, where the footprint of the host can be extremely small, it's critical to have a scalable solution, not just for the architecture of the peer network itself but for all of the services that come along with that network, and one of those services, of course, is search.

So at the O'Reilly Conference, Bill [Joy] talked a lot about a billion-device future as did just about everyone else at the conference, and really that's something that we need to be looking at because current Web search engines have a problem keeping up with a billion pages, much less a billion hosts that are constantly popping in and out of the network. So clearly this is a problem that needs to be addressed, and we think that Infrasearch takes several key steps toward answering that problem.

It isn't really a question of scale, right? It's a question of whether all of your technologies scale up with the number of participants in your network, and for a peer network to be successful that has to be answered positively. On the World Wide Web, so far we've been able to get away, in many instances, without answering that positively.
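
The "live query" idea Kan describes can be illustrated with a small sketch. This is not Infrasearch's code, which was proprietary; the handler type, the two example peers and the `distribute` function are all hypothetical names chosen for illustration. The point it tries to show is that the query text is handed to each peer unchanged, and each peer decides for itself how to interpret it.

```python
from typing import Callable, Dict, List

# Hypothetical handler type: a peer receives the raw query text and
# returns zero or more answers computed at query time.
Handler = Callable[[str], List[str]]


def file_peer(files: List[str]) -> Handler:
    """Build a peer that matches the query against its local file names."""
    def handle(query: str) -> List[str]:
        return [name for name in files if query.lower() in name.lower()]
    return handle


def calculator_peer(query: str) -> List[str]:
    """A peer that interprets queries of the form 'a + b' and computes the sum."""
    parts = query.split("+")
    if len(parts) != 2:
        return []
    try:
        return [str(float(parts[0]) + float(parts[1]))]
    except ValueError:
        return []


def distribute(query: str, peers: Dict[str, Handler]) -> Dict[str, List[str]]:
    """Hand the query to every peer currently present and keep non-empty answers."""
    results: Dict[str, List[str]] = {}
    for name, handler in peers.items():
        answers = handler(query)
        if answers:
            results[name] = answers
    return results


# Two illustrative peers: one shares files, one does arithmetic.
peers: Dict[str, Handler] = {
    "alice": file_peer(["gnutella-notes.txt", "budget.xls"]),
    "calc": calculator_peer,
}

print(distribute("2 + 3", peers))
print(distribute("gnutella", peers))
```

In this toy version the same query mechanism reaches both a static-data peer and a computing peer, and only the peer that understands the query answers. A real network would also need routing, timeouts and result ranking, none of which is modeled here.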

Is Infrasearch based on your earlier work with Gnutella?

Kan: The prototype was, yes. And we based the prototype on Gnutella because we really wanted to demonstrate unequivocally that Gnutella had a much broader appeal than just file sharing, which it was kind of consistently associated with. And we thought that a clear way to show that Gnutella is a kind of peer information interchange protocol, or technology, was by demonstrating a peer information discovery type tool using nothing but Gnutella. Since then the technology has become a proprietary thing. Rather than using Gnutella we're using network protocols and a communications architecture that is more uniquely suited to the problem of searching.

So it's a proprietary framework right now?

Kan: That's correct.

And do you intend to release parts of it through an open-source license?

Mike Clary: We had our ideas of where we would open some aspects of the search and several of those ideas have been discussed with the JXTA team. As we move forward, we'll clarify and kind of blend Infrasearch into the JXTA effort, so I think that those kind of questions will be answered as we move toward a tighter integration.

Mike, can you talk a little about your conversations with Gene and why Infrasearch seemed like such a natural fit with the JXTA project?

Clary: I think if you take a look at what we're doing with JXTA and its research set of lenses, you'll find that what we're talking about is distributed computing and making all these nodes that Gene's talking about part of a large collective, or a large collective computer, if you will. And I think the primitives that we're going to try and get established or adopted in JXTA (you know, the notion of distribution and pipes and grouping and monitoring and everything else that we're trying to do), that's really about distributing all the capability, all the functionality, or processing power or storage, as Gene was talking about, across a lot of different nodes.

And so one of the first things that we concluded after we started thinking about the primitives was search; how do you search in the context of a distributed space that is largely transitory, where things are there some days and some days not? We knew it was different than the conventional Web crawler approach, where you run a spider across a bunch of static data, you roll it into a big index, and then you pose a query against that index and you just walk across the directory.

So Gene's technology gave us the ability to say two things: One is how do we satisfy searching in this very large distributed space that has ad hoc or transitory characteristics? That was issue No. 1, so Infrasearch looked very attractive from that standpoint.

And the second thing is I think what Gene's technology does: It exposes what some people refer to as the deep Web, the stuff that's behind the interfaces on all of those nodes or those Web sites or those computers on the network. So how do you go out there and tickle those interfaces and find real-time or close-to-real-time data from those nodes that come and go. So I think it's a combination of those two things, and we were looking at this infrastructure layer, thinking about some of the first services that were going to be required in order to build interesting applications, and search definitely fell into that category, and so Gene's technology popped up and it looked good from our perspective.
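
The contrast Clary draws between a crawler-built index and querying transitory nodes directly can be sketched as follows. The host names, their contents and the three function names are hypothetical; this is a toy model of the general idea, not a description of any search engine's implementation.

```python
from typing import Dict, List, Set


def build_index(hosts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Crawl every host once and build a word -> host-names inverted index."""
    index: Dict[str, Set[str]] = {}
    for host, text in hosts.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(host)
    return index


def query_index(index: Dict[str, Set[str]], word: str) -> List[str]:
    """Answer from the snapshot taken at crawl time."""
    return sorted(index.get(word.lower(), set()))


def query_live(hosts: Dict[str, str], word: str) -> List[str]:
    """Answer by asking every host that is present right now."""
    return sorted(
        host for host, text in hosts.items()
        if word.lower() in text.lower().split()
    )


# State of the (hypothetical) network at crawl time.
hosts = {"a": "stock price 10", "b": "weather sunny"}
index = build_index(hosts)

# After the crawl, the network changes: b leaves, c joins, a updates its data.
del hosts["b"]
hosts["c"] = "weather rain"
hosts["a"] = "stock price 12"

print(query_index(index, "weather"))  # snapshot still points at the departed host
print(query_live(hosts, "weather"))   # live query sees only hosts present now
```

The index keeps answering with the departed host and the old price until the next crawl, while the live query reflects the current membership and data, at the cost of contacting hosts at query time.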

Will Infrasearch be migrated into the infrastructure layer, or will it remain an independent service?

Clary: I think we'd like to maintain, if you will, a pretty bright line between what infrastructure is (the JXTA stuff) and what are services on top of it. And so we always think of it as a three-layer cake: the infrastructure, peer-to-peer or network services, and then interesting applications that use those services.

I think we're going to maintain a strong demarcation between those things that are infrastructure (fundamental, if you will, plumbing) versus those things that may reach users from a service level or an application level. So we're going to try to keep those things separate. It's not a case where we're going to say that Gene's technology is the only searching mechanism that will ever be out there in existence. We think it's going to be a very popular and compelling, powerful searching capability, but I'm sure there'll be other technologies that we're not going to prohibit from sitting on top of the infrastructure. We just think it's a good first foray, if you will, into "How do we actually search in this distributed space?" And we're going to continue to invest in it and advance the technology as time goes by.

Kan: There are really as many different kinds of searching as there are different kinds of information out there. And so in specific applications it's going to make a lot of sense to pursue application-specific searching techniques, and part of JXTA is that it's a really open platform. That's one of the core ideas of the project, and its ability to adapt to every third-party solution out there is really a central idea.
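
The "three-layer cake" and the idea that other search mechanisms can sit on the same infrastructure can be sketched in a few lines. Every class and function name below is hypothetical and invented for illustration; none of it is the JXTA API. The sketch only assumes that applications depend on a service interface, and that services depend on a thin infrastructure layer.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Infrastructure:
    """Bottom layer (hypothetical): only tracks which peers exist and what they hold."""

    def __init__(self) -> None:
        self._peers: Dict[str, List[str]] = {}

    def join(self, peer: str, items: List[str]) -> None:
        self._peers[peer] = list(items)

    def leave(self, peer: str) -> None:
        self._peers.pop(peer, None)

    def peers(self) -> Dict[str, List[str]]:
        return {peer: list(items) for peer, items in self._peers.items()}


class SearchService(ABC):
    """Middle layer: any search technique that runs on top of the infrastructure."""

    def __init__(self, infra: Infrastructure) -> None:
        self.infra = infra

    @abstractmethod
    def search(self, query: str) -> List[str]:
        ...


class SubstringSearch(SearchService):
    """One possible service: match the query anywhere in an item name."""

    def search(self, query: str) -> List[str]:
        return sorted(
            item
            for items in self.infra.peers().values()
            for item in items
            if query in item
        )


class ExactSearch(SearchService):
    """A second, interchangeable service: match whole item names only."""

    def search(self, query: str) -> List[str]:
        return sorted(
            item
            for items in self.infra.peers().values()
            for item in items
            if item == query
        )


def application(service: SearchService, query: str) -> str:
    """Top layer: an application that depends only on the SearchService interface."""
    hits = service.search(query)
    if not hits:
        return "no results"
    return f"{len(hits)} result(s): {', '.join(hits)}"


infra = Infrastructure()
infra.join("a", ["jazz.mp3", "jazz-live.mp3"])
infra.join("b", ["notes.txt"])

print(application(SubstringSearch(infra), "jazz"))
print(application(ExactSearch(infra), "jazz"))
```

Because the application sees only `SearchService`, swapping one search technique for another does not touch the infrastructure or the application, which is the kind of separation the "bright line" between layers is meant to preserve.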

Are there pieces of Infrasearch that you'll bring down to the infrastructure layer?

Clary: I think it might be too early to say. We haven't really done the mind meld yet, in terms of the technology and the pieces of the technology. I do fundamentally believe that what Gene has is something that really should live and breathe in the context of the network service, that either applications or end-users get direct access to. There may be some networking stuff, but it's too early to tell.

So my understanding of JXTA was that it was totally focused on the infrastructure layer, and with the acquisition it seems like you're also interested in providing some services that may or may not rely on that infrastructure layer. Is that right?

Clary: Well, I think it will rely on the infrastructure layer. You know, Gene built some things underneath in order to satisfy what he had to do from an infrastructure standpoint, and so we intend to bring the Infrasearch open-search technology on top of JXTA and release it in that manner, as well. So it's not a case that these are two disparate efforts. I think these are pretty well lined up.

Are you concerned that the peer-to-peer community will think that Sun is trying to sort of buy up the P2P space?

Clary: No, I look at it a little bit differently. We're not trying to lock out anybody else, or as Gene was saying, it's an open platform, so any other searching mechanism can go on top of it. What we're really trying to do is catalyze this market so that people have some facilities or services so they can build compelling applications. So it's not a case that we're trying to lock other people out or slow the progress down. We'd just like to catalyze it and bring that level of coherence forward so people can do those interesting applications. The first problem everybody's going to be facing is searching; we'd like to bring a coherent solution to market so people can go off and build those applications that may bring back proprietary value for them. That's just our motivation. There's an awful lot of activity going on in that space. I think Gene knows better than I do, but I think there's a lot of things that are going on that no one could ever get their arms around or lock up or buy up the entire set. It's kind of hard to believe.

What other areas besides searching seem like important things to help be a catalyst for?

Clary: I think that's about where we're going to stop for the time being. We're going to have a hard enough time integrating this stuff and getting it to market and everything else. I think as far as we can see right now is we think that uniform infrastructure is important, followed by a searching mechanism. I think we're going to hold right about there for the time being.

We just put up a new piece by Clay Shirky called "Interoperability, Not Standards." What's your take on the P2P working group's movements toward a standards body?

Clary: I think Gene was touching on this. What we'd really like to do is promote a mechanism or approach that is not a standards-body approach. I think, if anything, the way or the manner we're approaching this right now is to say, "Let's get some technology out there. Let's see if we can do it on an open-source basis so that we can enlist the widest possible audience for influencing what this platform can be."

I think the day that a single organization comes out and unilaterally declares that "Here's the platform and this is what you need to build to" is over, and I think the notion that "standards first, applications second" is quickly coming to a close as well.

I think we need to let the market run for a while, and that's really our approach that we're going to take here. Let's throw some interesting services out there and let's let the market determine which ones are going to succeed over time. So I don't really hold out a lot of faith in this notion that either a big huge .NET specification and implementation is going to be the way to go, or a sanctioned standards body declaring that this is the way to go in this new space. I think it's way too early. So we're going to approach this from a different way, and I think that's reflected in our license and our approach to the open-source community in general.

So services and applications first, standards later.

Kan: Yeah, really. The pace of change is so rapid on peer-to-peer, it's a book that's still being written. Well, O'Reilly published one ... but as a technology area, it's in a constant state of change, and to drive that process through a standards body now doesn't really make sense. Standards really should formalize practice, and the practices are still being formulated.

So I take it you won't be taking part in the peer-to-peer working group in the near future.

Clary: I think we'll keep an eye on what's going on, but I think if you look closely at it with the submissions that are in there, it's a tall stack of technology that they're proposing as standards that everybody adopt. I think that's why we're trying to keep our technology very thin, very small, but bring back some value for higher level applications, and I think similar to Gene's technology we're offering the fundamental facility, not an integrated, tightly integrated stack of technology of, "Here's the only APIs that you can use."

Kan: Right. I think the difference really between the Sun view and the Intel working group view is that many participants in the Intel working group feel that it's necessary to drive the development of peer-to-peer through standards whereas the idea at Sun is to drive the technology through example first.

Richard Koman is a freelance writer and editor based in Sonoma County, California. He works on SiliconValleyWatcher, ZDNet blogs, and is a regular contributor to the O'Reilly Network.

Copyright © 2009 O'Reilly Media, Inc.