(Article) Embedding open source search engine
A common feature of websites is a built-in search facility that lets users retrieve the data they are interested in. Developers generally incorporate into their websites the customized search APIs of popular search providers such as Google, Yahoo!, MSLive and Amazon. These companies crawl the sites concerned and provide search over the documents of those sites as well as over the wider World Wide Web. The embedded search box also acts as an advertisement for the provider. As a matter of pride, many organizations would prefer to have their own search engine embedded in the website.
A decade ago, search engines such as AltaVista, Lycos, Yahoo and Ask Jeeves (now ask.com) were popular. Later, Google, with its sophisticated ranking strategy, delivered acceptable results for many different types of user queries. But Google's customized search service is not free, and many sites may not be able to pay for it. Other search engines such as cuil, guruji, khoj and terrior then came up with their own ranking strategies, supporting multiple languages. Alongside these developments, Open Source search engines also emerged.
Nutch
Nutch is an Open Source search engine written in Java on top of Lucene, which is itself a free, Open Source information retrieval library. Nutch can be deployed in Internet or intranet environments and can be customized to build small or large scale information retrieval systems that support multiple languages.
Prerequisites
1. Java and the JRE should be installed, and the environment variables JAVA_HOME and JRE_HOME should be set.
2. Add the current Ant build to the PATH, if not done already. Apache Ant is a Java-based build tool that builds a project from XML-based configuration files. Its current version (1.7.1) can be downloaded from
www.apache.org/dist/ant/binaries/apache-ant-1.7.1-bin.tar.gz
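The two prerequisites above can be verified before going any further. The following is only a minimal sketch in Python (the article itself does not use Python); the function name check_prerequisites and its parameters are hypothetical helpers introduced for illustration. It checks that JAVA_HOME and JRE_HOME are set and point to existing folders, and that an ant executable can be found on the PATH.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed.

    env and which are injectable only to make the function easy to try out;
    by default the real environment and the real PATH lookup are used.
    """
    env = os.environ if env is None else env
    problems = []

    # Prerequisite 1: JAVA_HOME and JRE_HOME must be set to existing folders.
    for var in ("JAVA_HOME", "JRE_HOME"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{var} points to a missing directory: {value}")

    # Prerequisite 2: the ant launcher must be reachable through PATH.
    if which("ant") is None:
        problems.append("ant was not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("\n".join(issues) if issues else "All prerequisites look fine.")
```

Running the script prints one line per missing prerequisite, so it is easy to see which variable still needs to be set.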
Installing and configuring Nutch
The latest version of Nutch (version 0.9) can be downloaded from http://www.apache.org/dist/lucene/nutc. Assume that the login is pcquest and the home folder is /home/pcquest. Create a folder named, say, mySearch, download the file nutch-0.9.tar.gz (size 68 MB) into it and extract the contents there. Then go to the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of Nutch.
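The extraction step can also be scripted. The sketch below is an illustration only, using Python's standard tarfile module; the function name extract_nutch is hypothetical, and it assumes that nutch-0.9.tar.gz has already been downloaded and that it unpacks into a top-level nutch-0.9 folder, as the path above suggests.

```python
import tarfile
from pathlib import Path


def extract_nutch(archive, work_dir):
    """Extract the Nutch archive into work_dir and return the Nutch root folder."""
    work_dir = Path(work_dir)
    # Create the working folder (for example /home/pcquest/mySearch) if needed.
    work_dir.mkdir(parents=True, exist_ok=True)

    # Unpack the gzip-compressed tarball into the working folder.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(work_dir)

    # Assumption: the archive contains a single top-level nutch-0.9 folder.
    root = work_dir / "nutch-0.9"
    if not root.is_dir():
        raise FileNotFoundError(f"expected {root} after extraction")
    return root
```

With the folder layout assumed in this article, calling extract_nutch("/home/pcquest/mySearch/nutch-0.9.tar.gz", "/home/pcquest/mySearch") would return the path /home/pcquest/mySearch/nutch-0.9.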
Courtesy:- ciol.com
