O'Reilly Book Excerpts: Spidering Hacks

More Spidering Hacks

by Kevin Hemenway and Tara Calishain

Related Reading

Spidering Hacks
100 Industrial-Strength Tips & Tools
By Kevin Hemenway, Tara Calishain

Editor's note: In last week's sample hacks, excerpted from Spidering Hacks, we showed you two workarounds that will save you time and extra trips to your favorite web sites. This week we offer two more hacks on grabbing--or scraping--the information you need, whether it's the link count for a particular Yahoo! category, or the quick answer for the word that's just on the tip of your tongue. Enjoy.

Hack #49: Yahoo! Directory Mindshare in Google

How does link popularity compare in Yahoo!'s searchable subject index versus Google's full-text index? Find out by calculating mindshare!

Yahoo! and Google are two very different animals. Yahoo! indexes only a site's main URL, title, and description, while Google builds full-text indexes of entire sites. Surely there's some interesting cross-pollination when you combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the Yahoo! directory. It then takes each URL and gets its link count from Google. Each link count provides a nice snapshot of how a particular Yahoo! category and its listed sites stack up on the popularity scale.

TIP: What's a link count? It's simply the total number of pages in Google's index that link to a specific URL.

There are a couple of ways you can use your knowledge of a subcategory's link count. If you find a subcategory whose URLs have only a few links each in Google, you may have found a subcategory that isn't getting a lot of attention from Yahoo!'s editors. Consider going elsewhere for your research. If you're a webmaster and you're considering paying to have Yahoo! add you to their directory, run this hack on the category in which you want to be listed. Are most of the links really popular? If they are, are you sure your site will stand out and get clicks? Maybe you should choose a different category.

We got this idea from a similar experiment Jon Udell (http://weblog.infoworld.com/udell/) did in 2001. He used AltaVista instead of Google; see mindshare-script.txt. We appreciate the inspiration, Jon!

The Code

You will need a Google API account (http://api.google.com/), as well as the SOAP::Lite and HTML::LinkExtor Perl modules to run the following code:

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML__".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir)
    or die "Couldn't download http://dir.yahoo.com$yahoo_dir\n";

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);
sub mindshare { # for each link we find...

    my ($tag, %attr) = @_;

    # continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /srd.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;

    # and process each URL through Google.
    my $results = $google_search->doGoogleSearch(
                        $google_key, "link:$url", 0, 1,
                        "true", "", "false", "", "", ""
                  ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }
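
The sort-and-display step can be tried on its own, without a Google key. The following is only a minimal sketch: the helper name sort_by_count is hypothetical, the tie-break on the URL and the treatment of a missing count as zero are additions that the script above does not make, and the three counts are copied from the Procrastination sample run in this hack.

```perl
#!/usr/bin/perl -w
use strict;

# Sort URLs by link count, highest first. Ties fall back to
# alphabetical order so the output is the same on every run.
# A missing count (say, a failed lookup) is treated as zero.
sub sort_by_count {
    my ($counts) = @_;
    return sort {
        ($counts->{$b} || 0) <=> ($counts->{$a} || 0)
            or $a cmp $b
    } keys %$counts;
}

# Three counts taken from the sample run in this hack.
my %urls = (
    "http://www.india.com/"            => 81,
    "http://www.p45.net/"              => 340,
    "http://www.ishouldbeworking.com/" => 246,
);

# Same "count: url" layout the script prints.
foreach my $url (sort_by_count(\%urls)) {
    print "$urls{$url}: $url\n";
}
```

Treating a missing count as zero keeps a single failed lookup from triggering "uninitialized value" warnings during the sort under -w.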

Running the Hack

The hack's only configuration, the Yahoo! directory you're interested in, is passed as a single argument (in quotes) on the command line. If you don't pass one of your own, a default directory is used instead.

% perl mindshare.pl "/Entertainment/Humor/Procrastination/"

Your results show the URLs in that directory, sorted by total Google links:

340: http://www.p45.net/
246: http://www.ishouldbeworking.com/
81: http://www.india.com/
33: http://www.jlc.net/~useless/
23: http://www.geocities.com/SouthBeach/1915/
18: http://www.eskimo.com/~spban/creed.html
13: http://www.black-schaffer.org/scp/
3: http://www.angelfire.com/mi/psociety
2: http://www.geocities.com/wastingstatetime/
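
One thing to watch when passing your own directory: the script builds its download address by gluing the argument straight onto http://dir.yahoo.com, so a path typed without its leading slash produces a malformed address. Here is a minimal sketch of a guard you could add, assuming a hypothetical helper named normalize_dir. The trailing slash is added only to match the examples in this hack; whether the server insists on it is an assumption.

```perl
#!/usr/bin/perl -w
use strict;

# Make sure a directory path starts and ends with a slash, so that
# it can be appended safely to "http://dir.yahoo.com".
sub normalize_dir {
    my ($dir) = @_;
    $dir = "/$dir" unless $dir =~ m{^/};    # leading slash is required
    $dir = "$dir/" unless $dir =~ m{/\z};   # trailing slash for consistency
    return $dir;
}

# Fall back to a default when no argument is given, as the hack does.
my $default   = "/Entertainment/Humor/Procrastination/";
my $yahoo_dir = normalize_dir(shift(@ARGV) || $default);

print "http://dir.yahoo.com$yahoo_dir\n";
```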

Hacking the Hack

Yahoo! isn't the only searchable subject index out there, of course; there's also the Open Directory Project (DMOZ, http://www.dmoz.org/), which is the product of thousands of volunteers busily cataloging and categorizing sites on the Web — the web community's Yahoo!, if you will. This hack works just as well on DMOZ as it does on Yahoo!; they're very similar in structure.

Replace the default Yahoo! directory with its DMOZ equivalent:

my $dmoz_dir = shift || "/Reference/Libraries/Library_and_Information_".
               "Science/Technical_Services/Cataloguing/Metadata/RDF/".
               "Applications/RSS/News_Readers/";

You'll also need to change the download instructions:

# download the Dmoz.org directory.
my $data = get("http://dmoz.org" . $dmoz_dir)
    or die "Couldn't download http://dmoz.org$dmoz_dir\n";

Next, replace the lines that check whether a URL should be measured for mindshare. When we were scraping Yahoo! in our original script, every directory entry's link was prefixed with http://srd.yahoo.com/, with the real URL following an asterisk. Thus, to ensure we received a proper URL, we skipped the link unless it matched both criteria:

return unless $attr{href} =~ /srd.yahoo/;
return unless $attr{href} =~ /\*http/;
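
To see how these checks and the extraction step from the original script behave together, here is a minimal standalone sketch. The helper name real_url_from_yahoo and the sample links are made up for illustration, and the exact shape of Yahoo!'s redirect path is an assumption. The sketch also escapes the dot in srd\.yahoo, which is slightly stricter than the original pattern.

```perl
#!/usr/bin/perl -w
use strict;

# Return the real destination hidden inside a Yahoo! redirect link,
# or undef if the link doesn't look like a redirect.
sub real_url_from_yahoo {
    my ($href) = @_;
    return unless defined $href;
    return unless $href =~ /srd\.yahoo/;    # must go through the redirector
    return unless $href =~ /\*(http.*)/;    # real URL follows the asterisk
    return $1;
}

# Made-up sample links, for illustration only.
my @hrefs = (
    "http://srd.yahoo.com/drst/12345/*http://www.example.com/",
    "http://dir.yahoo.com/Entertainment/Humor/",
    "/Computers_and_Internet/",
);

foreach my $href (@hrefs) {
    my $url = real_url_from_yahoo($href);
    print defined $url ? "keep: $url\n" : "skip: $href\n";
}
```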

Since DMOZ is an entirely different site, our checks for validity have to change. DMOZ doesn't modify the outgoing URL, so our previous Yahoo! checks have no relevance here. Instead, we'll make sure it's a full-blooded location (i.e., it starts with http://) and it doesn't match any of DMOZ's internal page links. Likewise, we'll ignore searches on other engines:

return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;
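
The same two tests can be exercised outside the callback. This is only a sketch: the helper name is_external_listing and the sample links are hypothetical, not taken from a real DMOZ page.

```perl
#!/usr/bin/perl -w
use strict;

# True if a link looks like an outside site listed in the directory,
# false for relative links, DMOZ's own pages, and search-engine links.
sub is_external_listing {
    my ($href) = @_;
    return 0 unless defined $href;
    return 0 unless $href =~ /^http/;
    return 0 if $href =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;
    return 1;
}

# Made-up sample links, for illustration only.
my @hrefs = (
    "http://www.example.org/rss-reader/",
    "/Reference/Libraries/",
    "http://dmoz.org/about.html",
    "http://search.yahoo.com/search?p=rss",
);

foreach my $href (@hrefs) {
    print is_external_listing($href) ? "keep: $href\n" : "skip: $href\n";
}
```

Note that the second pattern matches anywhere in the URL, so it also drops legitimate listings whose address happens to contain one of those names. That trade-off is acceptable for a quick survey, but worth remembering if a site you expect is missing from the results.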

Our last change is to modify the bit of code that gets the real URL from Yahoo!'s modified version. Instead of "finding the URL within the URL":

# now get our real URL.
$attr{href} =~ /\*(http.*)/; my $url = $1;

we simply assign the URL that HTML::LinkExtor has found:

# now get our real URL.
my $url = $attr{href};

Can you go even further with this? Sure! You might want to search a more specialized directory, such as the FishHoo! fishing search engine (http://www.fishhoo.com/).

You might want to return only the most linked-to URL from the directory, which is quite easy, by piping the results [Hack #28] to another common Unix utility:

% perl mindshare.pl | head -n 1

Alternatively, you might want to go ahead and grab the top 10 Google matches for the URL that has the most mindshare. To do so, add the following code to the bottom of the script:

print "\nMost popular URLs for the strongest mindshare:\n";
my $most_popular = shift @sorted_urls;
my $results = $google_search->doGoogleSearch(
                    $google_key, "$most_popular", 0, 10,
                    "true", "", "false", "", "", "" );

foreach my $element (@{$results->{resultElements}}) {
   next if $element->{URL} eq $most_popular;
   print " * $element->{URL}\n";
   print "   \"$element->{title}\"\n\n";
}

Then, run the script as usual (the output here uses the default hardcoded directory):

% perl mindshare.pl
27800: http://radio.userland.com/
6670: http://www.oreillynet.com/meerkat/
5460: http://www.newsisfree.com/
3280: http://ranchero.com/software/netnewswire/
1840: http://www.disobey.com/amphetadesk/
847: http://www.feedreader.com/
797: http://www.serence.com/site.php?page=prod_klipfolio
674: http://bitworking.org/Aggie.html
492: http://www.newzcrawler.com/
387: http://www.sharpreader.net/
112: http://www.awasu.com/
102: http://www.bloglines.com/
67: http://www.blueelephantsoftware.com/
57: http://www.blogtrack.com/
50: http://www.proggle.com/novobot/

Most popular URLs for the strongest mindshare:
 * http://groups.yahoo.com/group/radio-userland/
   "Yahoo! Groups : radio-userland"

 * http://groups.yahoo.com/group/radio-userland-francophone/message/76
   "Yahoo! Groupes : radio-userland-francophone Messages : Message 76 ... "

 * http://www.fuzzygroup.com/writing/radiouserland_faq.htm
   "Fuzzygroup :: Radio UserLand FAQ"
...
