Using Nutch and Solr to crawl and index the web

Nutch and Solr are two solid tools created by the great folks at Apache that you can use to crawl the web (Nutch) and index your crawled data (Solr). There are obviously far more uses for these tools than just indexing random websites, and I won’t go into those in this post, but seeing as I struggled to find documentation on all of this when I started using them, I thought I’d put together a quick starter’s guide to crawling the web with Nutch and using Solr to index and search the data you have crawled.

There could be many reasons for needing to do this kind of thing, such as creating your own search engine, automatically importing public data into your database, or just trying to show off how sweet it is being a geek. In this starter’s guide I won’t go into a huge amount of detail on how these things work – instead I will tell you exactly what you need to do to run a crawl and then index that data. I will assume you know how to set up your server or that it has already been done for you.

  1. Make sure Java is installed and that the JAVA_HOME and CLASSPATH environment variables are set correctly (if you’re not sure about this then ask Google).
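    For example, on a typical Linux box you can sanity-check this with something like the following (the JDK path is only an illustration and will differ on your system):
      $ java -version
      $ echo $JAVA_HOME
      $ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
      $ export PATH=$JAVA_HOME/bin:$PATH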
  2. Download and unpack Nutch and Solr into separate folders.
  3. Configure Nutch:
    • Edit NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name to be the name of your crawler. You can then fill in any other info about your crawler that you wish, but it is not necessary.
    • Create folder NUTCH_ROOT/crawl
    • Create the file NUTCH_ROOT/urls/nutch and add all the URLs you wish to crawl, one per line – make sure to include ‘http://’ and the trailing slash.
    • Edit NUTCH_ROOT/conf/crawl-urlfilter.txt – beneath the line ‘# accept hosts in MY.DOMAIN.NAME’, replace the MY.DOMAIN.NAME placeholder with the domain of the first URL you wish to crawl, and then add a new line for each of the other URLs (formatted in the same way as the first one). An example is shown just below.
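    As an illustration (the agent name and domains below are placeholders – use your own), the relevant nutch-default.xml property and crawl-urlfilter.txt lines would end up looking something like this:
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>
      </property>
      # accept hosts in MY.DOMAIN.NAME
      +^http://([a-z0-9]*\.)*example.com/
      +^http://([a-z0-9]*\.)*example.org/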
  4. Configure Solr:
    • Copy all the files from NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf (overwriting any existing files when prompted).
    • Edit SOLR_ROOT/example/solr/conf/schema.xml and on line 71 change the stored attribute from false to true (there’s an example of the resulting field just after the request handler listing below).
    • Edit SOLR_ROOT/example/solr/conf/solrconfig.xml and add the following above the first requestHandler tag:
      <requestHandler name="/nutch" class="solr.SearchHandler" >
        <lst name="defaults">
          <str name="defType">dismax</str>
          <str name="echoParams">explicit</str>
          <float name="tie">0.01</float>
          <str name="qf">
            content^0.5 anchor^1.0 title^1.2
          </str>
          <str name="pf">
            content^0.5 anchor^1.5 title^1.2 site^1.5
          </str>
          <str name="fl">
            url
          </str>
          <str name="mm">
            2&lt;-1 5&lt;-2 6&lt;90%
          </str>
          <int name="ps">100</int>
          <str name="q.alt">*:*</str>
          <str name="hl.fl">title url content</str>
          <str name="f.title.hl.fragsize">0</str>
          <str name="f.title.hl.alternateField">title</str>
          <str name="f.url.hl.fragsize">0</str>
          <str name="f.url.hl.alternateField">url</str>
          <str name="f.content.hl.fragmenter">regex</str>
        </lst>
      </requestHandler>
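      For reference, after the schema.xml edit from earlier the field on that line should look something like this (this assumes, as in the version I used, that line 71 is the content field – it may sit on a different line in other versions):
      <field name="content" type="text" stored="true" indexed="true"/>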
  5. Start Solr:
    • $ cd SOLR_ROOT/example
    • $ java -jar start.jar
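    • To confirm Solr is actually up before you start crawling, you can hit the ping handler that the example config exposes (adjust the host and port to match your setup):
      $ curl http://localhost:8983/solr/admin/ping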
  6. Start the crawl:
    • $ cd NUTCH_ROOT
    • The crawl command has the following options:
      • -dir names the directory to put the crawled data into
      • -threads determines the number of fetcher threads that will run in parallel (optional)
      • -depth indicates the link depth from the root page that should be crawled
      • -topN determines the maximum number of URLs to be retrieved at each level up to the depth
    • You can set these numbers to whatever you like, but the general rule is that the higher the numbers, the more data you will crawl and the longer your crawl will take. It all depends on the setup of your server and what you want to do with your crawl. For example, this is a crawl command that will take a couple of days to complete:
      $ bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
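      If you just want to check that everything is wired up before committing to a long crawl, a much smaller run is a good idea first – the numbers below are only a suggestion:
      $ bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 50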
  7. Index the crawl results:
    • $ bin/nutch solrindex http://HOST_ADDRESS:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* (the port number here and in the next point will differ depending on your server setup – check the Solr wiki for more info about that).
    • Go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel to search the index. You can also retrieve the results XML directly by requesting the right URL – you will see this URL in the address bar when you get to the results (there’s an example query URL just after this list).
  8. Profit.
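
As a quick example of querying the index directly (step 7), the /nutch request handler defined earlier can be reached with a URL along these lines – the search term is just a placeholder and the host and port will match your own setup:

  http://HOST_ADDRESS:8983/solr/nutch?q=your+search+term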

Once you’ve followed all these steps you will have your very own mini search engine. At the moment it will only search the URLs that you specify, but this can be changed and I encourage you to learn more. Once you’ve done a bit of reading you will realise the power of Nutch and Solr and the amazing things you can do with them. Unfortunately there isn’t a huge amount of documentation on how to use these tools that is written in a user-friendly manner, but here are some links that should help get you started:


51 Responses to “Using Nutch and Solr to crawl and index the web”

  1. Andries March 18, 2011 at 10:28 am #

    Awww sweet. Definitely going to try this out and see how it works.

    • Hugh Lashbrooke March 18, 2011 at 10:36 am #

      Good man – you definitely won’t regret it. It can get pretty complex, but they’re fun tools to use.

  2. Arvin April 11, 2011 at 8:12 pm #

    Dude! I’ve been looking for a tutorial for ages!!! I hope this works on my first try!!! I’m going to be doing it on an Ubuntu in a virtual machine…..

    Thanks!

    • Hugh Lashbrooke April 12, 2011 at 8:11 am #

      Cool man – let me know how it goes…

      • Arjun April 2, 2013 at 5:27 am #

        Hey, I am unable to crawl any website other than nutch.apache.org even though I have my urls folder in the right location and am filling it with my own urls. i have also modified the regex-urlfilter.txt to accept any extension(+.). Why does it keep crawling nutch.apache.org in that case? Your help is highly appreciated!

  3. Martin April 18, 2011 at 4:36 pm #

    Ups,

    until step 6 everything works fine, I’ve got a new crawldb + linkdb + segments (plus subfolders and files)

    step 7 fails with an IOException

    $ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
    SolrIndexer: starting at 2011-04-18 16:17:05
    java.io.IOException: Job failed!

    Would be nice to know solr and nutch Version.

    Cheers Martin

    • Hugh Lashbrooke April 20, 2011 at 9:25 am #

      Hi Martin

      As far as versions go, I’m using the latest versions of Nutch & Solr (1.2 and 3.1 respectively, but this also worked with v1.4.1 of Solr).

      If you’re failing at that point I would say you should try using the server IP instead of ‘localhost’ (I found that to be a problem in some installations – not too sure why). Also, and I’m sure you’ve done this already, make sure your Solr port is correct. You can view a list of ports that are currently in use by running ‘nmap localhost’.

      Aside from that I don’t know why you would be getting that particular error – an IOException is a rather generic error, so it could have a number of different causes.

    • vijay May 26, 2011 at 3:13 pm #

      This is specifically a version issue – check the solrj.jar in Nutch and Solr. I also got the same problem and this solution worked for me.
      Best of luck

    • lokesh April 16, 2014 at 6:57 pm #

      check in solr-config.xml for “_version field “

  4. vijay May 26, 2011 at 3:22 pm #

    Your Stuff is best. Keep It Up.
    It helps a lot to developers like me.

    • Hugh Lashbrooke May 26, 2011 at 4:24 pm #

      Thanks! Always good to know devs find my stuff helpful. If there’s anything in particular you would find useful to know about then let me know and I’ll do a post on it.

  5. Sam June 6, 2011 at 9:17 am #

    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments

    it give me errors like that
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_fetch
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_parse
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_data
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_text

    in segments folder, folders like 20110606102332, 20110606102455, 20110606102814 are created in and all the above file are in it

    i also provide the side link [like:
    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110606102332]

    then it will give me the an error like this:
    java.io.IOException: Job failed!

    • Hugh Lashbrooke June 6, 2011 at 9:41 am #

      Your initial solrindex command needs to have “/*” at the end so that it actually gets all the segments – the subfolders it’s looking for are in the segment folders themselves, not the parent folder. That being said, I’ve never tried it with specifying only one segment instead of all of them (as you did here), so I’m not sure if that would work out. Also, I’ve never tried running this on Windows – that could be part of the issue you’re having I guess. I find these kinds of things are far easier to manage in a Linux environment.

      • sam June 6, 2011 at 9:46 am #

        yes i have tried
        bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
        but the same error i have
        java.io.IOException: Job failed!

        as Vijay said it’s a version issue, but nothing happened when i changed it to solr-solrj-1.3.0
        isn’t there another way to push data to solr?

      • sam June 6, 2011 at 10:31 am #

        its working gr8 bro
        i have some version and solr.xml issues. it works great on windows server 2003 Now.

        • Hugh Lashbrooke June 8, 2011 at 1:39 pm #

          Awesome man – glad to hear it. What are you using them for exactly? While Nutch and Solr are indexing and search tools, they have a wide range of implementations and can be invaluable in a lot of different projects. Would be interesting to hear what kind of thing you’re using them for.

        • azhar April 12, 2012 at 2:16 pm #

          Hi,
          I am a new memeber here and i have the same problem (java.io.IOException: Job failed!). what do you mean when you say (i have some version and solr.xml issues) could you please explain how you solve it???

  6. Pervanee September 30, 2011 at 11:52 am #

    HI dude,
    Nice posting and pretty straight forward.I am new to Nutch , Hadoop and Solr. I am trying to make a search engine with three ubuntu 10.4 lts box. I formatted the hadoop nodes successfully and named them as hadoop@master, hadoop@slave, hadoop@slave1. hadoop@master is the master node of the hdfs system. I installed Nutch 1.1-rc package to run nutch and hadoop. When I tried to run the commands
    hadoop@master:~#bin/nutch inject crawl/crawldb urls
    hadoop@master:~#bin/nutch generate crawl/crawldb crawl/segments
    hadoop@master:~#export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
    hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing

    Until this point i didnt have nay problem but when I tried to launch the following command to launch the crawler I got these errors…

    hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing

    Fetcher: segment: crawl/segments
    Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)

    By the way my hadoop cluster is working well. I checked the status using the jps cpmmand and found all the node are up.
    Besides I created a directory urlsdir inside nutch installation folder and put the urls seed inside a txt file in that folder. Then I put it to the hdfs filesystem. I checked with bin/hadoop dfs -ls and i found all the folder are exists and inside crawl i found the segments directory as well with one segment file. Right now I am quite worried that what could be the problem actually. Please give me some clue. I will be very glad if u can figure out my problem.
    Thanks in advance

  7. Avinash February 15, 2012 at 5:48 am #

    Hugh,

    Impressed a lot. I followed the steps as mentioned, and it worked perfect. Thanks a lot.

  8. Mike Blackstock February 24, 2012 at 10:57 pm #

    Great man! I was pulling my hair out over the IOException error as well, but putting in my actual domain instead solved everything!

    Cheers,
    Mike

  9. Geetha February 29, 2012 at 12:59 pm #

    Hi Hugh,

    I followed the steps given and it worked perfectly. I have query now, i crawled the local solr admin site (localhost:8080/solr/admin) and indexed the same in Solr. but the problem here is the crawler crawls the whole page including the Menu items,and other unwanted things and my XML response looks like this :

    1.2018504
    Solr admin page Solr Admin (nutch) CHNMCT81643D.ad.infosys.com:8080 cwd=D:eclipse on chnmct113863d SolrHome=D:SOLrsolr-3.5.0examplesolr. HTTP caching is OFF Solr [ Schema ] [ Config ] [ Analysis ] [ Schema Browser ] [ Statistics ] [ Info ] [ Distribution ] [ Ping ] [ Logging ] App server: [ Java Properties ] [ Thread Dump ] Make a Query [ Full Interface ] Query String: *:* Assistance [ Documentation ] [ Issue Tracker ] [ Send Email ] [ Solr Query Syntax ] Current Time: Mon Feb 13 15:31:05 IST 2012 Server Start At: Mon Feb 13 15:26:36 IST 2012
    4aa984ea3ffe85f33da56da69db3cbd9
    localhost
    http://localhost:8080/solr_linux/admin/
    20120213153103
    localhost
    Solr admin page
    2012-02-13T10:01:05.59Z
    http://localhost:8080/solr_linux/admin/

    If you can see the content field you can check all the contents(menu items, hrefs, etc) getting displayed. please help me if there is any way to restrict the content to be crawled.

    Nutch version 1.4
    Solr Version 3.5.0

  10. Shameema April 26, 2012 at 2:52 pm #

    Hi Hugh,

    This tutorial is very helpful. But I am stuck with 2 problems:
    1. my java -jar start.jar run stops here – 2012-04-26 18:08:40.158:INFO::Started SocketConnector@0.0.0.0:8983
    No progress from that line onwards(FYI: I m new to nutch)
    2. when i click the search button on http://localhost:8080/solr/admin/, I am taken to a
    HTTP Status 400 – Missing solr core name in path

    Please help me solve these.

    Thanks
    Shameema
     

    • Syed Aqueel Haider Rizvi May 8, 2012 at 4:38 pm #

       ASA Shameema!

      I am also new to Nutch. ” 2012-04-26 18:08:40.158:INFO::Started SocketConnector@0.0.0.0:8983″ this starts a server and keeps running. It will only echo something when you try to search etc.

      • Shameema May 14, 2012 at 7:34 am #

         thanks. My solr admin port number was wrong which took three days for me to recognize it, as i m new to this.
        Now my problem is can we crawl and fetch only pages that are relevant to a particular set of keywords?

        • Syed Aqueel Haider Rizvi May 28, 2012 at 8:35 pm #

          lets assume that a page is relevant to a keyword if that keyword occurs in that page.

          now can you find the keyword’s  occurrence in a page without crawling the page?  

    • PhoenixPyDev June 1, 2012 at 12:21 pm #

      1- That means your Solr server is running, and will now accept requests. If you want your Solr server to run in the background, try this command(linux):
       `nohup java -jar start.jar >logfile 2>&1 &`

      2- Looks like you need to select a core name on the admin UI before select search. Have a look in the solr.xml file in your solr home directory, what cores are defined? Solr is expecting the core name in the URL: http://localhost:8080/solr//admin/

  11. Dimas Koro May 27, 2012 at 10:29 am #

     can  solr showing  images from nutch result? not only text result but images too. if you have tutorial or suggestion can you share too. thanks 

    • Hugh Lashbrooke May 28, 2012 at 10:23 am #

      Nutch returns the full HTML of each page it crawls, so if you use some regex to extract the src attributes from the img tags then you should be able to find all the images on the page. Just search Google for some regex tips in that regard.
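
      As a very rough sketch (assuming you’ve dumped a crawled page’s HTML to a file called page.html – the filename is just for illustration), something like this would pull the src attributes out of the img tags on a Linux box:
        $ grep -oiE '<img[^>]*src="[^"]*"' page.html | grep -oiE 'src="[^"]*"'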

  12. Hugh Lashbrooke July 4, 2012 at 10:41 am #

    Blog post by Coolpanda that links back here about building a search interface for Solr: http://coolpandaca.wordpress.com/2012/07/03/search-interface-for-nutch-solr-server/

    • akhil November 23, 2012 at 2:32 pm #

      @ Hugh Lashbrooke ….. I m trying to setup the configuration as you described but I m not able to find solr 3.1 on the solr official website (oldest one is 3.6). How can i get the ver 3.1…..

      • Hugh Lashbrooke November 23, 2012 at 2:47 pm #

        Hi Akhil,

        I’m afraid I haven’t used Solr since v3.1, so I can’t say whether this tutorial will still work for later versions. I would suggest Googling for the version you need and seeing if there’s an older repo that contains it somewhere.

        Sorry I can’t be of more help with this!

        Hugh

  13. Jon September 21, 2012 at 6:59 pm #

    I am using Solr 1.5.1, not seeing crawl-urlfilter.txt in the NUTCH_ROOT/Conf directory, just automaton-urlfilter.txt. Any idea what I should do?

    • den September 26, 2012 at 4:51 pm #

      you need regex-urlfilter.txt file

  14. David December 20, 2012 at 8:55 am #

    Hi Lashbooke,

    Do you have any idea on how to deal with deadlinks in nutch? say for example I have crawled a blog site today and all the documents are indexed to solr. Tomorrow one of the blog in the above blog site is deleted which mean that one of the URL indexed yesterday is no more working today! So how do I update solr indexes such that this particular blog doesn’t come in search results? Recrawling the site didn’t delete this record in solr is what I observed. I am using nutch 1.5.1 binary version.

    Thanks
    David

  15. Jayant January 3, 2013 at 11:42 pm #

    I am not able to find the file NUTCH_ROOT/conf/crawl-urlfilter.txt
    What can be the problem?

    • Hugh Lashbrooke January 4, 2013 at 7:31 am #

      That will probably be because you are using a more recent version of Nutch that I am not familiar with, so I can’t really help you out there.

  16. Md Aarif Equbal February 5, 2013 at 1:56 pm #

    Thanks for such a great article. It works smoothly.

  17. Xie March 22, 2013 at 6:56 pm #

    Hi
    How do i set the -depth for crawling the whole website, because i dont know how much depth the website has.

    Thank you and looking forward ur reply.

    • Hugh Lashbrooke March 25, 2013 at 9:21 am #

      As far as I know there’s no way to do that dynamically, so your best option would probably be to set the depth to a really high number that is unlikely to be reached by any crawl of the site.

  18. Thomas April 2, 2013 at 7:44 pm #

    Hugh, I am new to Nutch and SOLR. Is the /nutch requestHandler above somehow interacting with Nutch itself or is just a request handler that gives boosts to some of the “nutch fields” already indexed in SOLR?

    • Hugh Lashbrooke April 2, 2013 at 10:33 pm #

      It’s been quite a while since I’ve worked with either Nutch or Solr, so I don’t fully remember how it all interacts. That being said, if my memory serves me correctly then the new request handler allows Solr to correctly index the data points (fields) provided by Nutch. Without it, Solr wouldn’t know where to put the Nutch data.

  19. Megan May 20, 2013 at 6:42 pm #

    Hello! Do you know if they make any plugins to protect against hackers?

    I’m kinda paranoid about losing everything I’ve worked hard on.

    Any recommendations?

  20. Ramakrishna June 28, 2013 at 9:30 am #

    Hi pal…
    I’m really fed up with the nutch,solr and eclipse integration. just tel me step by step from the starting(new->javapro), how to create a simple project in eclipse(with solr,nutch)…. i reffered wiki also.. but i dint get that after few steps… plz don’t give any other links… if possible plz send me step by step screenshots or text document to ramakrishna756@gmail.com.

    • Hugh Lashbrooke June 28, 2013 at 9:37 am #

      I’m not involved with the development of Nutch or Solr so I’m not a channel for support for all your Nutch/Solr needs. I have also said in previous comments that I no longer work with either of these tools, so I can’t be of much further help. Even if I could help, however, I wouldn’t be emailing you a detailed guide with screenshots – to be honest that’s a pretty ridiculous thing to request as a favour. I suggest searching Google some more and seeing what you can find because I can’t help you any further. I hope you find a solution in the end.

  21. john July 19, 2013 at 7:44 am #

    I got this as well. using nutch 1.2, solr 3.1. java 1.7.0_21.

    Tried various nutch 1.x and solr 3.x versions unsuccessfully.

    SolrIndexer: starting at ….
    java.io.IOException: Job failed!

  22. Evan Donovan August 9, 2013 at 6:41 pm #

    To get this to work for Solr 4.4, I had to do the following additions/modifications to these steps:

    1) Skip the modifications to solrconfig.xml. They will cause XML parsing errors in that version of Solr. Instead, follow the advice in http://stackoverflow.com/questions/17649567/nutch-message-no-indexwriters-activated-while-loading-to-solr.
    2) Remove from schema.xml the reference to the EnglishPorterStemmerFilter.
    3) Add in a _version_ field to schema.xml as described here: http://solrhelp.blogspot.com/2013/02/how-to-migrate-for-solr-3x-to-cloud.html. Don’t add the long data type, however.
    4) Run nutch as follows: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 20, where depth and topN are what you want to use.

    Note that the modifications to schema.xml, et al. are to be done under $SOLR_ROOT/example/solr/collection1/conf in Solr 4.4.

    • Dan August 21, 2013 at 9:56 pm #

      Thanks :)

  23. Ceyhun September 16, 2013 at 9:44 am #

    Hi,

    Thanks for ur tutorial which helped my graduation project. Now i want to use my project in real life but i couldn’t redirect it to my domain. If it is possible could u help me?

    Greetings from Turkiye

  24. Rahul July 5, 2014 at 6:00 pm #

    Hi,
    I am rails developer and I have to use solr and nutch with my rails application.
    I have setup nutch and solr at my local machine(ubuntu 12.04).
    I followed all the instructions provided by you but still I am unable to figure out how to use solr with rails.
    How to fetch data from my local database.
    Please help me regarding solr and nutch.

    Thanks in advance!

Trackbacks/Pingbacks

  1. Search interface for Nutch, Solr Server « Coolpanda's Space - July 3, 2012

    [...] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ http://www.hughlashbrooke.com/using-nutch-and-solr-to-crawl-and-index-the-web/ [...]

