Nutch and Solr are two solid tools created by the great folks at Apache that you can use to crawl the web (Nutch) and index your crawled data (Solr). There are obviously far more uses for these tools than just indexing random websites, and I won’t go into those in this post. But since I struggled to find documentation on all of this when I started using them, I thought I’d put together a quick starter’s guide to crawling the web with Nutch and using Solr to index and search the data that you have crawled.
There could be many reasons for needing to do this kind of thing, such as creating your own search engine, automatically importing public data into your database, or just trying to show off how sweet it is being a geek. In this starter’s guide I won’t go into a huge amount of detail on how these things work – instead I will tell you exactly what you need to do to run a crawl and then index that data. I will assume you know how to set up your server or that it is already done for you.
- Make sure you have Java installed correctly and the JAVA_HOME and CLASSPATH variables are set up correctly (if you’re not sure about this then ask Google).
- Download and unpack Nutch and Solr into separate folders.
- Configure Nutch:
  - Edit NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name to be the name of your crawler. You can then fill in any other info about your crawler that you wish, but it is not necessary.
  - Create folder NUTCH_ROOT/crawl
  - Create file NUTCH_ROOT/urls/nutch and into it type all the URLs you wish to crawl (one per line) - make sure to include ‘http://’ and the trailing slash.
  - Edit NUTCH_ROOT/conf/crawl-urlfilter.txt – beneath the line ‘# accept hosts in MY.DOMAIN.NAME’ replace MY.DOMAIN.NAME with the first of the URLs you wish to crawl and then make a new line for each of the other URLs (formatted in the same way as the first one).
- Configure Solr:
  - Copy all the files from NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf (overwrite any files it asks you to).
  - Edit SOLR_ROOT/example/solr/conf/schema.xml and in line 71 change the stored attribute from false to true.
  - Edit SOLR_ROOT/example/solr/conf/solrconfig.xml and add the following above the first requestHandler tag:

    <requestHandler name="/nutch" class="solr.SearchHandler" >
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">
          content^0.5 anchor^1.0 title^1.2
        </str>
        <str name="pf">
          content^0.5 anchor^1.5 title^1.2 site^1.5
        </str>
        <str name="fl">
          url
        </str>
        <str name="mm">
          2&lt;-1 5&lt;-2 6&lt;90%
        </str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
        <str name="hl.fl">title url content</str>
        <str name="f.title.hl.fragsize">0</str>
        <str name="f.title.hl.alternateField">title</str>
        <str name="f.url.hl.fragsize">0</str>
        <str name="f.url.hl.alternateField">url</str>
        <str name="f.content.hl.fragmenter">regex</str>
      </lst>
    </requestHandler>
- Start Solr:
  $ cd SOLR_ROOT/example
  $ java -jar start.jar
- Start the crawl:
  $ cd NUTCH_ROOT
  - The crawl command has the following options:
    - -dir names the directory to put the crawled data into
    - -threads determines the number of threads that will fetch in parallel (optional)
    - -depth indicates the link depth from the root page that should be crawled
    - -topN determines the maximum number of URLs to be retrieved at each level up to the depth
  - You can set these numbers to whatever you like, but the general rule is that the higher the numbers, the more data you will crawl and the longer your crawl will take. This all depends on the setup of your server and what you want to do with your crawl. For example, this is a crawl command that will take a couple of days to complete:
    $ bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
- Index the crawl results:
  $ bin/nutch solrindex http://HOST_ADDRESS:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
  (the port number here and in the next point will differ depending on your server setup – check the Solr wiki for more info about that).
  - Go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel to search the index. You can also get the results XML directly by hitting the right URL – you will see this URL in the address bar when you get to the results.
- Profit.
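For what it’s worth, the Nutch side of the configuration step can be scripted. This is only a minimal sketch under a few assumptions: NUTCH_ROOT defaults to a throwaway temp folder so you can try it without touching a real install, the two example URLs are placeholders for your own, and the filter lines use the host name (minus any leading www.) in the same format as the MY.DOMAIN.NAME line that ships in crawl-urlfilter.txt. It writes the seed file and prints the filter lines for you to paste in by hand.

```shell
#!/bin/sh
# Sketch only: NUTCH_ROOT defaults to a temp folder so nothing real is touched.
NUTCH_ROOT="${NUTCH_ROOT:-$(mktemp -d)}"

# Placeholder seed URLs (assumptions): one per line, with http:// and a trailing slash.
SEEDS="http://www.example.com/
http://www.example.org/"

# Create the crawl output folder and the seed file.
mkdir -p "$NUTCH_ROOT/crawl" "$NUTCH_ROOT/urls"
printf '%s\n' "$SEEDS" > "$NUTCH_ROOT/urls/nutch"

# Turn each seed URL into a filter line: strip the scheme, the path and a
# leading "www.", then wrap the host in the pattern used by the stock file.
FILTER_LINES=$(sed -e 's|^http://||' -e 's|/.*$||' -e 's|^www\.||' \
    -e 's|^|+^http://([a-z0-9]*\\.)*|' -e 's|$|/|' \
    "$NUTCH_ROOT/urls/nutch")

# Paste these beneath the "# accept hosts in MY.DOMAIN.NAME" line.
echo "$FILTER_LINES"
```

Run it with NUTCH_ROOT set to your real Nutch folder once you are happy with what it prints.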
Once you’ve followed all these steps you will have your very own mini search engine. At the moment it will only search the URLs that you specify, but this can be changed and I encourage you to learn more. Once you’ve done a bit of reading you will realise the power of Nutch and Solr and the amazing things you can do with them. Unfortunately there isn’t a huge amount of documentation on how to use these tools that is written in a user-friendly manner, but here are some links that should help get you started:







Awww sweet. Definitely going to try this out and see how it works.
Good man – you definitely won’t regret it. It can get pretty complex, but they’re fun tools to use.
Dude! I’ve been looking for a tutorial for ages!!! I hope this works on my first try!!! I’m going to be doing it on an Ubuntu in a virtual machine…..
Thanks!
Cool man – let me know how it goes…
Oops,
everything works fine up to step 6: I’ve got a new crawldb, linkdb and segments (plus subfolders and files).
Step 7 fails with an IOException:
$ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-04-18 16:17:05
java.io.IOException: Job failed!
It would be nice to know which Solr and Nutch versions you are using.
Cheers Martin
Hi Martin
As far as versions go, I’m using the latest versions of Nutch & Solr (1.2 and 3.1 respectively, but this also worked with v1.4.1 of Solr).
If you’re failing at that point I would say you should try using the server IP instead of ‘localhost’ (I found that to be a problem in some installations – not too sure why). Also, and I’m sure you’ve done this already, make sure your Solr port is correct. You can view a list of ports that are currently in use by running ‘nmap localhost’.
Aside from that I don’t know why you would be getting that particular error – an IOException is a rather generic error, so it could have a number of different causes.
This is most likely a version issue. Check that the solrj jar in Nutch matches the one in Solr. I had the same problem and this solution worked for me.
Best of luck
Your stuff is the best. Keep it up.
It helps developers like me a lot.
Thanks! Always good to know devs find my stuff helpful. If there’s anything in particular you would find useful to know about then let me know and I’ll do a post on it.
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments
It gives me errors like this:
input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_fetch
input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_parse
input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_data
input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_text
In the segments folder, folders like 20110606102332, 20110606102455 and 20110606102814 have been created, and all of the folders above are inside them.
I also tried giving the path to one of them, like this:
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110606102332
but then it gives me an error like this:
java.io.IOException: Job failed!
Your initial solrindex command needs to have “/*” at the end so that it actually gets all the segments – the subfolders it’s looking for are in the segment folders themselves, not the parent folder. That being said, I’ve never tried it with specifying only one segment instead of all of them (as you did here), so I’m not sure if that would work out. Also, I’ve never tried running this on Windows – that could be part of the issue you’re having I guess. I find these kinds of things are far easier to manage in a Linux environment.
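To see what the shell actually hands over when you add “/*”, here is a small sketch. It builds a fake crawl/segments layout in a temp folder (the timestamps are the ones from your comment, and the four subfolder names are the standard Nutch 1.x ones the indexer reads), deliberately leaves one segment incomplete, then loops over crawl/segments/* and reports any segment that is missing a subfolder. It doesn’t talk to Nutch or Solr at all, so treat it as an illustration of how the glob expands rather than a fix.

```shell
#!/bin/sh
# Sketch only: build a fake segments layout in a temp folder.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Subfolders the indexer reads from each segment (standard Nutch 1.x names).
NEEDED="crawl_fetch crawl_parse parse_data parse_text"

for seg in 20110606102332 20110606102455 20110606102814; do
    for sub in $NEEDED; do
        mkdir -p "crawl/segments/$seg/$sub"
    done
done
# Simulate a segment that is missing its parse output.
rm -r crawl/segments/20110606102814/parse_data

# The shell expands the glob before the command runs; these are the paths passed on.
SEGMENTS=$(echo crawl/segments/*)
echo "expanded to: $SEGMENTS"

# Report every segment that lacks one of the needed subfolders.
INCOMPLETE=""
for seg in crawl/segments/*; do
    for sub in $NEEDED; do
        if [ ! -d "$seg/$sub" ]; then
            INCOMPLETE="$INCOMPLETE $seg"
            echo "incomplete: $seg (no $sub)"
            break
        fi
    done
done
```

Pointing the same loop at a real crawl/segments folder is a cheap check to run before solrindex.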
Yes, I have tried
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
but I get the same error:
java.io.IOException: Job failed!
As Vijay said, it may be a version issue, but nothing changed when I switched to solr-solrj-1.3.0.
Isn’t there another way to push data to Solr?
It’s working great, bro.
I had some version and solr.xml issues. It works great on Windows Server 2003 now.
Awesome man – glad to hear it. What are you using them for exactly? While Nutch and Solr are indexing and search tools, they have a wide range of implementations and can be invaluable in a lot of different projects. Would be interesting to hear what kind of thing you’re using them for.
Hi dude,
Nice post and pretty straightforward. I am new to Nutch, Hadoop and Solr. I am trying to make a search engine with three Ubuntu 10.04 LTS boxes. I formatted the Hadoop nodes successfully and named them hadoop@master, hadoop@slave and hadoop@slave1. hadoop@master is the master node of the HDFS system. I installed the Nutch 1.1-rc package to run Nutch and Hadoop. I ran these commands:
hadoop@master:~#bin/nutch inject crawl/crawldb urls
hadoop@master:~#bin/nutch generate crawl/crawldb crawl/segments
hadoop@master:~#export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing
Up to this point I didn’t have any problems, but when I ran the following command to launch the crawler I got these errors…
hadoop@master:~#bin/nutch fetch $SEGMENT -noParsing
Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
By the way, my Hadoop cluster is working well. I checked the status using the jps command and found that all the nodes are up.
Besides, I created a directory urlsdir inside the Nutch installation folder and put the seed URLs in a txt file in that folder. Then I put it into the HDFS filesystem. I checked with bin/hadoop dfs -ls and found that all the folders exist, and inside crawl I found the segments directory as well, with one segment file. Right now I am quite worried about what the problem could actually be. Please give me some clue. I would be very glad if you could figure out my problem.
Thanks in advance
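One thing stands out in the output above, offered as a guess rather than a diagnosis: the Fetcher line says “segment: crawl/segments” with no timestamp after it, which is what you would see if $SEGMENT ended up without a segment name. The export line uses plain ls, which reads the local disk, while on a Hadoop cluster the segments live in HDFS. The sketch below reproduces that in an empty temp folder (standing in for a master node with no local crawl/segments) and adds a guard before fetching. On the cluster the segment name would have to come from an HDFS listing instead, for instance from the bin/hadoop dfs -ls command mentioned in the comment, but that part is left out because it can’t run outside a cluster.

```shell
#!/bin/sh
# Sketch only: an empty temp folder stands in for a master node where
# crawl/segments exists in HDFS but not on the local disk.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Same pattern as the export line above; ls finds nothing locally,
# so no segment name is appended.
SEGMENT=crawl/segments/$(ls -tr crawl/segments 2>/dev/null | tail -1)
echo "SEGMENT is: $SEGMENT"

# Guard: refuse to continue when no segment name was picked up.
case "$SEGMENT" in
    */) STATUS="no segment found" ;;
    *)  STATUS="ok" ;;
esac
echo "$STATUS"
```

Echoing $SEGMENT before every fetch is a quick way to confirm or rule this out.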
Hugh,
I’m very impressed. I followed the steps as described, and it worked perfectly. Thanks a lot.