Using Nutch and Solr to crawl and index the web

Nutch and Solr are two solid tools created by the great folks at Apache that you can use to crawl the web (Nutch) and index your crawled data (Solr). There are obviously far more uses for these tools than just indexing random websites, and I won’t go into those in this post. But since I struggled to find documentation on all of this when I started using them, I thought I’d put together a quick starter’s guide to crawling the web with Nutch and using Solr to index and search the data that you have crawled.

There could be many reasons for needing to do this kind of thing, such as creating your own search engine, automatically importing public data into your database, or just trying to show off how sweet it is being a geek. In this starter’s guide I won’t go into a huge amount of detail on how these things work. Instead I will tell you exactly what you need to do to run a crawl and then index that data. I will assume you know how to set up your server or that it is already done for you.

  1. Make sure you have Java installed correctly and the JAVA_HOME and CLASSPATH variables are set up correctly (if you’re not sure about this then ask Google).
  2. Download and unpack Nutch and Solr into separate folders.
  3. Configure Nutch:
    • Edit NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name to be the name of your crawler. You can then fill in any other info about your crawler that you wish, but it is not necessary.
    • Create folder NUTCH_ROOT/crawl
    • Create file NUTCH_ROOT/urls/nutch and into it type all the URLs you wish to crawl (one per line) - make sure to include ‘http://’ and the trailing slash.
    • Edit NUTCH_ROOT/conf/crawl-urlfilter.txt – beneath the line ‘# accept hosts in MY.DOMAIN.NAME’ replace MY.DOMAIN.NAME with the domain of the first of the URLs you wish to crawl and then add a new line for each of the other URLs (formatted in the same way as the first one).
  4. Configure Solr:
    • Copy all the files from the NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf (overwrite any files it asks you to).
    • Edit SOLR_ROOT/example/solr/conf/schema.xml and in line 71 change the stored attribute from false to true.
    • Edit SOLR_ROOT/example/solr/conf/solrconfig.xml and add the following above the first requestHandler tag:
      <requestHandler name="/nutch" class="solr.SearchHandler">
        <lst name="defaults">
          <str name="defType">dismax</str>
          <str name="echoParams">explicit</str>
          <float name="tie">0.01</float>
          <str name="qf">
            content^0.5 anchor^1.0 title^1.2
          </str>
          <str name="pf">
            content^0.5 anchor^1.5 title^1.2 site^1.5
          </str>
          <str name="fl">
            url
          </str>
          <str name="mm">
            2&lt;-1 5&lt;-2 6&lt;90%
          </str>
          <int name="ps">100</int>
          <str name="q.alt">*:*</str>
          <str name="hl.fl">title url content</str>
          <str name="f.title.hl.fragsize">0</str>
          <str name="f.title.hl.alternateField">title</str>
          <str name="f.url.hl.fragsize">0</str>
          <str name="f.url.hl.alternateField">url</str>
          <str name="f.content.hl.fragmenter">regex</str>
        </lst>
      </requestHandler>
  5. Start Solr:
    • $ cd SOLR_ROOT/example
    • $ java -jar start.jar
  6. Start the crawl:
    • $ cd NUTCH_ROOT
    • The crawl command has the following options:
      • -dir names the directory to put the crawled data into
      • -threads determines the number of threads that will fetch in parallel (optional)
      • -depth indicates the link depth from the root page that should be crawled
      • -topN determines the maximum number of URLs to be retrieved at each level up to the depth
    • You can set these numbers to whatever you like, but the general rule is that the higher the numbers, the more data you will crawl and the longer your crawl will take. How high you can go depends on the setup of your server and what you want to do with your crawl. For example, this is a crawl command that will take a couple of days to complete:
      $ bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
  7. Index the crawl results:
    • $ bin/nutch solrindex http://HOST_ADDRESS:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* (the port number here and in the next point will differ depending on your server set up – check the Solr wiki for more info about that).
    • Go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel to search the index. You can also fetch the results XML directly by requesting the search URL itself – you will see this URL in the address bar when you get to the results.
  8. Profit.
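
If you would rather script steps 3, 6 and 7 than type them by hand, something along these lines could serve as a starting point. It is only a rough sketch under a few assumptions: the example URLs, the small -depth and -topN values, the DRY_RUN switch and the helper names (write_seeds, filter_line, run) are all made up for illustration, and NUTCH_ROOT falls back to a scratch folder so nothing real gets touched. The filter lines it prints follow the pattern that sits under ‘# accept hosts in MY.DOMAIN.NAME’ and still need to be pasted into conf/crawl-urlfilter.txt yourself.

```shell
#!/bin/sh
# Sketch of steps 3, 6 and 7. Paths, URLs and crawl sizes are placeholders.
set -eu

NUTCH_ROOT="${NUTCH_ROOT:-$(mktemp -d)}"   # scratch folder by default (assumption)
SOLR_URL="${SOLR_URL:-http://HOST_ADDRESS:8983/solr/}"
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the Nutch commands

# write_seeds URL...: create the crawl folder and write one URL per line
# into NUTCH_ROOT/urls/nutch.
write_seeds() {
  mkdir -p "$NUTCH_ROOT/urls" "$NUTCH_ROOT/crawl"
  printf '%s\n' "$@" > "$NUTCH_ROOT/urls/nutch"
}

# filter_line URL: print an accept rule for the URL's domain.
filter_line() {
  host="${1#*://}"      # drop the scheme
  host="${host%%/*}"    # drop the path
  host="${host#www.}"   # the subdomain prefix in the rule already covers www
  printf '+^http://([a-z0-9]*\\.)*%s/\n' "$host"
}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

write_seeds "http://www.example.com/" "http://example.org/"

echo "Lines to add to conf/crawl-urlfilter.txt:"
while read -r url; do filter_line "$url"; done < "$NUTCH_ROOT/urls/nutch"

cd "$NUTCH_ROOT"
run bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb crawl/segments/*
```

With DRY_RUN left at 1 the two bin/nutch lines are only echoed, which is a cheap way to check the paths before committing to a long crawl. Setting NUTCH_ROOT to your real install and DRY_RUN=0 would run them for real.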

Once you’ve followed all these steps you will have your very own mini search engine. At the moment it will only search the URLs that you specify, but this can be changed and I encourage you to learn more. Once you’ve done a bit of reading you will realise the power of Nutch and Solr and the amazing things you can do with them. Unfortunately there isn’t a huge amount of documentation on how to use these tools that is written in a user-friendly manner, but here are some links that should help get you started:

16 Comments

  1. Andries
    18 March, 2011 at 10:28 am #

    Awww sweet. Definitely going to try this out and see how it works.

    • Hugh Lashbrooke
      18 March, 2011 at 10:36 am #

      Good man – you definitely won’t regret it. It can get pretty complex, but they’re fun tools to use.

  2. Arvin
    11 April, 2011 at 8:12 pm #

    Dude! I’ve been looking for a tutorial for ages!!! I hope this works on my first try!!! I’m going to be doing it on an Ubuntu in a virtual machine…..

    Thanks!

  3. Martin
    18 April, 2011 at 4:36 pm #

    Oops,

    up to step 6 everything works fine; I’ve got a new crawldb + linkdb + segments (plus subfolders and files)

    step 7 failed with an IOException

    $ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
    SolrIndexer: starting at 2011-04-18 16:17:05
    java.io.IOException: Job failed!

    Would be nice to know the Solr and Nutch versions.

    Cheers Martin

    • Hugh Lashbrooke
      20 April, 2011 at 9:25 am #

      Hi Martin

      As far as versions go, I’m using the latest versions of Nutch & Solr (1.2 and 3.1 respectively, but this also worked with v1.4.1 of Solr).

      If you’re failing at that point I would say you should try using the server IP instead of ‘localhost’ (I found that to be a problem in some installations – not too sure why). Also, and I’m sure you’ve done this already, make sure your Solr port is correct. You can view a list of ports that are currently in use by running ‘nmap localhost’.

      Aside from that I don’t know why you would be getting that particular error – an IOException is a rather generic error, so it could have a number of different causes.

    • vijay
      26 May, 2011 at 3:13 pm #

      This is a version issue in particular: check the solrj.jar in Nutch and Solr. I also got the same problem and this solution worked for me,
      best luck

  4. vijay
    26 May, 2011 at 3:22 pm #

    Your Stuff is best. Keep It Up.
    It helps a lot to developers like me.

    • Hugh Lashbrooke
      26 May, 2011 at 4:24 pm #

      Thanks! Always good to know devs find my stuff helpful. If there’s anything in particular you would find useful to know about then let me know and I’ll do a post on it.

  5. Sam
    6 June, 2011 at 9:17 am #

    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments

    it gives me errors like this:
    input path does not exist: file:/ c:/wamp/nutch-1.2/crawl/segments/crawl_fetch
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_parse
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_data
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_text

    in the segments folder, folders like 20110606102332, 20110606102455, 20110606102814 are created, and all of the above folders are inside them

    I also tried giving the path to a single segment [like:
    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110606102332]

    then it gives me an error like this:
    java.io.IOException: Job failed!

    • Hugh Lashbrooke
      6 June, 2011 at 9:41 am #

      Your initial solrindex command needs to have “/*” at the end so that it actually gets all the segments – the subfolders it’s looking for are in the segment folders themselves, not the parent folder. That being said, I’ve never tried it with specifying only one segment instead of all of them (as you did here), so I’m not sure if that would work out. Also, I’ve never tried running this on Windows – that could be part of the issue you’re having I guess. I find these kinds of things are far easier to manage in a Linux environment.
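
To see why the trailing “/*” matters, here is a small sketch. It builds a throwaway crawl/segments layout in a scratch folder (the folder is an assumption for the demo; the segment names are borrowed from the comment above) and prints what the shell would hand to solrindex with and without the wildcard.

```shell
#!/bin/sh
# Demo only: fake segment folders in a scratch directory, no Nutch required.
set -eu

work="$(mktemp -d)"
for seg in 20110606102332 20110606102455 20110606102814; do
  mkdir -p "$work/crawl/segments/$seg/crawl_fetch"
done
cd "$work"

# Without the wildcard a single path is passed: the parent folder,
# which has no crawl_fetch subfolder of its own.
set -- crawl/segments
echo "without /*: $# path(s): $*"

# With the wildcard the shell expands to one path per segment folder.
set -- crawl/segments/*
echo "with /*: $# path(s): $*"
```

The first line reports one path and the second reports three, one for each segment folder, which is where crawl_fetch and friends actually live.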

      • sam
        6 June, 2011 at 9:46 am #

        yes i have tried
        bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
        but the same error i have
        java.io.IOException: Job failed!

        Vijay said it’s a version issue, but nothing changed when I switched to solr-solrj-1.3.0
        Isn’t there another way to push data to Solr?

      • sam
        6 June, 2011 at 10:31 am #

        its working gr8 bro
        I had some version and solr.xml issues. It works great on Windows Server 2003 now.

        • Hugh Lashbrooke
          8 June, 2011 at 1:39 pm #

          Awesome man – glad to hear it. What are you using them for exactly? While Nutch and Solr are indexing and search tools, they have a wide range of implementations and can be invaluable in a lot of different projects. Would be interesting to hear what kind of thing you’re using them for.

  6. Pervanee
    30 September, 2011 at 11:52 am #

    Hi dude,
    Nice posting and pretty straightforward. I am new to Nutch, Hadoop and Solr. I am trying to make a search engine with three Ubuntu 10.04 LTS boxes. I formatted the Hadoop nodes successfully and named them hadoop@master, hadoop@slave, hadoop@slave1. hadoop@master is the master node of the HDFS system. I installed the Nutch 1.1-rc package to run Nutch and Hadoop. When I tried to run the commands
    hadoop@master:~#bin/nutch inject crawl/crawldb urls
    hadoop@master:~#bin/nutch generate crawl/crawldb crawl/segments
    hadoop@master:~#export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
    hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing

    Until this point I didn’t have any problem, but when I tried to launch the following command to launch the crawler I got these errors…

    hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing

    Fetcher: segment: crawl/segments
    Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)

    By the way, my Hadoop cluster is working well. I checked the status using the jps command and found all the nodes are up.
    Besides, I created a directory urlsdir inside the Nutch installation folder and put the URL seeds inside a txt file in that folder. Then I put it into the HDFS filesystem. I checked with bin/hadoop dfs -ls and found that all the folders exist, and inside crawl I found the segments directory as well, with one segment file. Right now I am quite worried about what the problem could actually be. Please give me some clue. I will be very glad if you can figure out my problem.
    Thanks in advance

  7. Avinash
    15 February, 2012 at 5:48 am #

    Hugh,

    Impressed a lot. I followed the steps as mentioned, and it worked perfect. Thanks a lot.
