Web Crawling: You Have Options!

If you have ever wanted to build a custom search engine, this is the post for you.

Web crawlers/scrapers/spiders might seem, at first glance, to be rather complicated. Not only must your crawler navigate the sprawling landscape of the internet, it must also make sense of that landscape and parse it into readable text. Once you delve into the problem, though, you realize it's actually quite simple, because all the tools you need to build a web crawler have already been written for you.

One powerful, open-source, out-of-the-box web crawler I recommend is Apache Nutch. Not only can you start running it pretty much right off the bat, but it is readily compatible with other Apache tools – for instance, Apache Solr, an open-source search platform. Combine Nutch with Solr, and you can have a fully functional search engine up and running in under half an hour.

Of course, sometimes you don’t want or need all the extra features that come with a crawler like Nutch. Sometimes you just want a simple, flexible little web crawler whose source code you can easily modify to do whatever you want. In that case, I would say: “Write it yourself!” (But don’t write everything yourself. Do a Google search to see if you can find someone who’s done most – or all – of it for you already.)

Personally, I am partial to Perl for all things I/O-related. Perl has a couple of modules that can help you out with fetching and parsing HTML, like WWW::Mechanize, but I like Mojo. It keeps things simple while staying easy to modify, and it lets you select elements with CSS selectors and pull out their text, links, and/or images. Neat!
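
To give you a taste, here is a minimal sketch of the kind of crawler I mean, using Mojo::UserAgent and Mojo::DOM's CSS selectors. Treat it as a starting point, not a finished tool: the start URL and page limit are placeholders, and a polite crawler would also respect robots.txt and throttle its requests.

    #!/usr/bin/env perl
    use Mojo::Base -strict;   # strict, warnings, and say() in one line
    use Mojo::UserAgent;
    use Mojo::URL;

    # Placeholder values -- point these at whatever site you want to crawl
    my $start     = Mojo::URL->new('https://example.com/');
    my $max_pages = 20;

    my $ua    = Mojo::UserAgent->new(max_redirects => 3);
    my @queue = ($start);
    my %seen  = ("$start" => 1);

    while (@queue && $max_pages-- > 0) {
        my $url = shift @queue;

        # result() dies on connection errors, so wrap it in eval
        my $res = eval { $ua->get($url)->result };
        next unless $res && $res->is_success;
        next unless ($res->headers->content_type // '') =~ m{text/html};

        my $dom = $res->dom;

        # Pull out whatever you like with CSS selectors -- here, just the title
        my $title = $dom->at('title') ? $dom->at('title')->text : '(no title)';
        say "$url => $title";

        # Resolve each link against the current URL and queue unseen
        # pages on the same host
        for my $href ($dom->find('a[href]')->map(attr => 'href')->each) {
            my $next = Mojo::URL->new($href)->to_abs($url)->fragment(undef);
            next unless $next->host && $next->host eq $start->host;
            push @queue, $next unless $seen{"$next"}++;
        }
    }

A breadth-first queue plus a %seen hash is about the simplest crawl strategy there is. If you need speed, Mojo::UserAgent also has a non-blocking interface, but the blocking version above is plenty for small crawls.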

Here is a guide to help you write a web crawler in Perl using the Mojo module.
