Distributed Computing Economics and the Semantic Web

William Grosso

Sep. 22, 2003 07:13 PM


URL: http://research.microsoft.com/~Gray/...

I went to see Jim Gray speak the other night. He was the first speaker in this fall's Distinguished Speaker series at SDForum. I liked the talk a lot. In particular, I very much enjoyed the part of his talk dealing with Distributed Computing Economics.

The argument itself is basic economic analysis, and can be boiled down to the notion that since everything costs money, you should account for the cost of everything when building applications. In particular, Gray focuses on the cost of CPU time (small, and dropping all the time) and the cost of network bandwidth (not so small, and decreasing at a slower rate). By putting actual dollar values on things, Gray draws some startling conclusions about when it makes sense to use grid-computing techniques and when it makes sense to use a LAN-based system or a single machine (as opposed to distributing the computation over a WAN, or using "on-demand" computing).

In particular, he says the following: the break-even point is 10,000 instructions per byte of network traffic, or about a minute of computation per MB of network traffic. That is, unless the CPU time at the other end of the pipe is free, and the task performs at least a minute of computation for every MB of data you send to it, you're better off doing the computation locally.
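The break-even arithmetic is easy to check for yourself. Here's a back-of-envelope sketch in Python; the dollar figures are illustrative assumptions I've picked to reproduce the 10,000-instructions-per-byte ratio from the talk, not quotes from Gray's paper:

```python
# Back-of-envelope version of Gray's break-even rule.
# The dollar figures are illustrative assumptions picked to match the
# 10,000-instructions-per-byte ratio quoted in the talk.

DOLLARS_PER_GB_WAN = 1.0            # assumed cost to ship 1 GB over a WAN
DOLLARS_PER_TERA_INSTRUCTION = 0.1  # assumed cost of 1e12 CPU instructions

def instructions_per_byte_breakeven() -> float:
    """How many instructions of remote computation one byte of network
    traffic must 'buy' before shipping the data beats computing locally."""
    instructions_per_dollar = 1e12 / DOLLARS_PER_TERA_INSTRUCTION
    bytes_per_dollar = 1e9 / DOLLARS_PER_GB_WAN
    return instructions_per_dollar / bytes_per_dollar

def better_to_compute_remotely(task_instructions_per_byte: float) -> bool:
    """A compute-heavy task (many instructions per byte moved) is worth
    shipping to a remote CPU; a data-heavy one is not."""
    return task_instructions_per_byte >= instructions_per_byte_breakeven()

print(round(instructions_per_byte_breakeven()))  # 10000
print(better_to_compute_remotely(1_000_000))     # True: compute-bound task
print(better_to_compute_remotely(100))           # False: data-bound task
```

The point of the sketch is that the decision depends only on the ratio of the two prices, which is why the conclusion survives even as both prices fall.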

There's an interesting flip side to this. If you're serving a database over the network, you'd much rather someone ask you to perform a computation than ask you to send a large amount of data in response to a query (the economics apply when you're sending an answer as well).
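To make that concrete, here's a toy comparison of vending the raw data versus vending an aggregate answer; the table size, answer size, and $1/GB WAN price are all made-up round numbers for illustration:

```python
# Toy illustration: for a data provider, shipping an aggregate answer is
# orders of magnitude cheaper than shipping the raw data it came from.
# All sizes and prices below are made-up round numbers.

DOLLARS_PER_GB_WAN = 1.0             # assumed WAN price
TABLE_SIZE_BYTES = 10 * 1024**3      # a hypothetical 10 GB table
ANSWER_SIZE_BYTES = 1024             # a 1 KB aggregate answer

def shipping_cost_dollars(n_bytes: float) -> float:
    """Cost of moving n_bytes over the WAN at the assumed price."""
    return n_bytes / 1024**3 * DOLLARS_PER_GB_WAN

print(shipping_cost_dollars(TABLE_SIZE_BYTES))   # 10.0 to vend the data
print(shipping_cost_dollars(ANSWER_SIZE_BYTES))  # well under a cent to vend the answer
```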

Two things struck me while Gray was speaking. The first is that the analysis isn't very different from that in Gray's classic papers on the five minute rule. But despite the fact that a Turing Award winner repeatedly uses this style of argument, I don't see it being applied very often in other areas.

The second is that I think it very much applies to the semantic web. If you'll recall, the idea of the semantic web is to create a giant distributed knowledge-base, with lots of information encoded in RDF triples so that the machines, as well as the humans, can process the data.

Now along comes Gray, making an argument that, when you think about it, implies that the semantic web, as currently conceived, might just be all wrong. His basic point is that it's far cheaper to vend high-level APIs than to give access to the data (because the cost of shipping large amounts of data around is prohibitive). Since the semantic web is basically a data web, one wonders: why doesn't Gray's argument apply?

Here are three possible counterarguments:

  1. The idea of the semantic web is that there are literally hundreds of thousands of data sources. In such a universe, the only feasible programming model is to gather data into a central location and then perform the computation (coordinating a distributed application on such a scale is simply not feasible).

  2. The point of the semantic web is that it concerns data which is inherently impossible to gather in one location. Gray's economic argument doesn't apply because it assumes that it is possible to put the application on a LAN (or use high-level APIs) instead of fetching data over a WAN. Clearly that's not currently the case for web applications like Google, and the proposed semantic web applications (what are they, anyway?) are more Google-like than not.

  3. Gray's argument assumes infinite divisibility of computing resources. While it may be true that, once you've bought a computer, the cost of computation is cheap, you can't buy a single unit of computation; you have to buy the entire computer and then amortize it. So, depending on cash flow considerations and the amount of computing power you really need, some applications might still make sense in an on-demand model.

My point? In everything I've read about the semantic web, nobody's addressed Gray's implicit question. Have I missed a large section of papers? Is it obvious that one of the above three arguments is the "killer rejoinder" to "vend high-level APIs, not data"? Is the semantic web really about APIs (and I just missed it)? Or is there a crucial hole in the roadmap to the semantic web?

William Grosso is a coauthor of Java Enterprise Best Practices.