Nutch and Solr are two solid tools from the great folks at Apache: Nutch crawls the web and Solr indexes the data you have crawled. There are far more uses for these tools than indexing random websites, and I won’t go into those in this post. But since I struggled to find documentation on all of this when I started using them, I thought I’d put together a quick starter’s guide to crawling the web with Nutch and using Solr to index and search the crawled data.
There are many reasons you might need to do this kind of thing, such as creating your own search engine, automatically importing public data into your database, or just showing off how sweet it is being a geek. In this starter’s guide I won’t go into a huge amount of detail on how these things work. Instead I will tell you exactly what you need to do to run a crawl and then index that data. I will assume you know how to set up your server, or that it is already done for you.
- Make sure you have Java installed correctly and the JAVA_HOME and CLASSPATH variables are set up correctly (if you’re not sure about this then ask Google).
- Download and unpack Nutch and Solr into separate folders.
- Configure Nutch:
- Open NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name to be the name of your crawler. You can then fill in any other info about your crawler that you wish, but it is not necessary.
- Create folder NUTCH_ROOT/urls
- Create file NUTCH_ROOT/urls/nutch and into it type all the URLs you wish to crawl (one per line). Make sure to include ‘http://’ and the trailing slash.
- Open NUTCH_ROOT/conf/crawl-urlfilter.txt. Beneath the line ‘# accept hosts in MY.DOMAIN.NAME’, replace MY.DOMAIN.NAME with the first of the URLs you wish to crawl, and then make a new line for each of the other URLs (formatted in the same way as the first one).
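As a rough sketch of the two file-editing steps above, the following shell snippet writes a seed list and derives one filter line per seed. The example.com and example.org hosts are placeholders, the fallback to a scratch directory when NUTCH_ROOT is unset is only there so the snippet can be tried safely, and the `+^http://([a-z0-9]*\.)*HOST/` pattern is assumed to match the default line shipped in crawl-urlfilter.txt, so check your own copy before pasting the output in.

```shell
#!/bin/sh
# Assumption: NUTCH_ROOT points at your unpacked Nutch folder.
# If it is unset, use a scratch directory so nothing real is touched.
NUTCH_ROOT="${NUTCH_ROOT:-$(mktemp -d)}"

# Seed list: one URL per line, with 'http://' and the trailing slash.
mkdir -p "$NUTCH_ROOT/urls"
cat > "$NUTCH_ROOT/urls/nutch" <<'EOF'
http://example.com/
http://example.org/
EOF

# Turn each seed URL into a filter line of the assumed default shape.
FILTER_LINES=$(sed -E 's#^http://([^/]+)/.*$#+^http://([a-z0-9]*\\.)*\1/#' "$NUTCH_ROOT/urls/nutch")

# Print the lines so they can be copied into crawl-urlfilter.txt by hand.
printf '%s\n' "$FILTER_LINES"
```

The snippet only prints the filter lines rather than editing crawl-urlfilter.txt itself, so you can review them first.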
- Configure Solr:
- Copy all the files from the
SOLR_ROOT/example/solr/conf (overwrite any files it asks you to).
- Open SOLR_ROOT/example/solr/conf/schema.xml and in line 71 change the
- Open SOLR_ROOT/example/solr/conf/solrconfig.xml and add the following above the first
<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf"> content^0.5 anchor^1.0 title^1.2 </str>
    <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
    <str name="fl"> url </str>
    <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
- Start Solr:
$ cd SOLR_ROOT/example
$ java -jar start.jar
- Start the crawl:
$ cd NUTCH_ROOT
- The crawl command has the following options:
-dir names the directory to put the crawled data into
-threads determines the number of threads that will fetch in parallel (optional)
-depth indicates the link depth from the root page that should be crawled
-topN determines the maximum number of URLs to be retrieved at each level up to the depth
- You can set these numbers to whatever you like, but the general rule is that the higher the numbers, the more data you will crawl and the longer your crawl will take. How far you go depends on the setup of your server and what you want to do with your crawl. For example, this is a crawl command that will take a couple of days to complete:
$ bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
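To get a feel for how -depth and -topN scale a crawl, here is a small sketch that works out the upper bound on fetched URLs (depth multiplied by topN, following the option descriptions above) and prints the matching command instead of running it. The function names are made up for this example, and the smaller numbers are illustrative values for a first test run, not a recommendation from the Nutch documentation.

```shell
#!/bin/sh
# Upper bound on URLs fetched: at most topN URLs at each of depth levels.
max_urls() {
  depth=$1
  topn=$2
  echo $((depth * topn))
}

# Print (not run) the crawl command for the given depth and topN.
crawl_command() {
  printf 'bin/nutch crawl urls -dir crawl -depth %s -topN %s\n' "$1" "$2"
}

BIG=$(max_urls 20 2000000)   # the multi-day crawl from the text
SMALL=$(max_urls 3 50)       # hypothetical quick test crawl

echo "depth 20, topN 2000000: up to $BIG URLs"
echo "depth 3, topN 50: up to $SMALL URLs"
crawl_command 3 50
```

The real number of pages fetched is usually lower than the bound, since a level only fills up to topN if enough new links were discovered.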
- Index the crawl results:
$ bin/nutch solrindex http://HOST_ADDRESS:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
(The port number here and in the next point will differ depending on your server setup. Check the Solr wiki for more info about that.)
- Go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel to search the index. You can also fetch the results XML directly by requesting the right URL; you will see this URL in the address bar when you get to the results.
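If you would rather query from the command line than from the admin panel, something along these lines should work. It builds a URL against Solr's standard /select handler using the common q, start, rows and indent parameters. Here localhost and the search term nutch are placeholders, the function name is made up for this example, and port 8983 is taken from the commands above. The curl call is left commented out because it needs a running Solr instance.

```shell
#!/bin/sh
# Build a search URL for Solr's /select handler.
# Assumes the query is a single word, so no URL encoding is needed.
solr_select_url() {
  host=$1
  query=$2
  rows=${3:-10}   # number of results to return, 10 if not given
  printf 'http://%s:8983/solr/select?q=%s&start=0&rows=%s&indent=on\n' "$host" "$query" "$rows"
}

URL=$(solr_select_url localhost nutch)
echo "$URL"

# With Solr running, fetch the results XML:
# curl -s "$URL"
```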
Once you’ve followed all these steps you will have your very own mini search engine. At the moment it will only search the URLs that you specify, but this can be changed and I encourage you to learn more. Once you’ve done a bit of reading you will realise the power of Nutch and Solr and the amazing things you can do with them. Unfortunately there isn’t a huge amount of documentation on how to use these tools that is written in a user-friendly manner, but here are some links that should help get you started: