I am interested in search engines.
My search engine is based on Nutch with ICTCLAS integrated. Both the word segmentation quality and the speed are good.
The following is a script for running Nutch directly under Windows NT,
so you do not need Cygwin to emulate Linux.
You can modify it so that it runs automatically.
If you are interested, feel free to use it; it makes working with Nutch much more convenient.
nutch.bat
@cmd /V:ON /C %~dp0nutch1.bat %*
nutch1.bat
@echo on
rem *****************************************************************
rem * A script to launch nutch on a Windows 2000/XP system.
rem *
rem * Written by babatu
rem * babatu@gmail.com blog: blog.babatu.com
rem *
rem * Because delayed environment variable expansion is used,
rem * CMD /V:ON should be used to run this script.
rem *****************************************************************
if "%OS%"=="Windows_NT" @setlocal
if "%OS%"=="WINNT" @setlocal
if "%1"=="" goto msg
goto begin
:msg
echo "Usage: nutch COMMAND"
echo "where COMMAND is one of:"
echo "  crawl             one-step crawler for intranets"
echo "  readdb            read / dump crawl db"
echo "  readlinkdb        read / dump link db"
echo "  inject            inject new urls into the database"
echo "  generate          generate new segments to fetch"
echo "  fetch             fetch a segment's pages"
echo "  parse             parse a segment's pages"
echo "  segread           read / dump segment data"
echo "  updatedb          update crawl db from segments after fetching"
echo "  invertlinks       create a linkdb from parsed segments"
echo "  index             run the indexer on parsed segments and linkdb"
echo "  merge             merge several segment indexes"
echo "  dedup             remove duplicates from a set of segment indexes"
echo "  plugin            load a plugin and run one of its classes main()"
echo "  server            run a search server"
echo " or"
echo "  CLASSNAME         run the class named CLASSNAME"
echo "Most commands print help when invoked w/o parameters."
pause
goto end
:begin
rem %~dp0 expands to the drive and path of the current script (NT only)
set default_nutch_home=%~dp0..
rem set default_nutch_home=..
if "%nutch_home%"=="" set nutch_home=%default_nutch_home%
rem clear default_nutch_home, which is no longer needed
set default_nutch_home=
echo %nutch_home%
rem set _use_classpath=yes
if "%classpath%"=="" (set classpath=%java_home%/lib/tools.jar) else set classpath=%classpath%;%java_home%/lib/tools.jar
set classpath=%classpath%;%nutch_home%/conf;
echo %classpath%
echo before other
rem for developers, add plugins, job and test code to classpath
if exist %nutch_home%/build/plugins set classpath=%classpath%;%nutch_home%/build
for /R %nutch_home%/build %%i in (nutch*.job) do set classpath=!classpath!;%%i
if exist %nutch_home%/build/test/classes set classpath=%classpath%;%nutch_home%/build/test/classes
rem for releases, add nutch job to classpath
for /R %nutch_home% %%i in (nutch*.job) do set classpath=!classpath!;%%i
rem add plugins to classpath
if exist %nutch_home%/plugins set classpath=%classpath%;%nutch_home%
rem add libs to classpath
for /R %nutch_home%/lib %%f in (*.jar) do set classpath=!classpath!;%%f
echo %classpath%
rem translate command
if "%1"=="crawl" set class=org.apache.nutch.crawl.Crawl
if "%1"=="inject" set class=org.apache.nutch.crawl.Injector
if "%1"=="generate" set class=org.apache.nutch.crawl.Generator
if "%1"=="fetch" set class=org.apache.nutch.fetcher.Fetcher
if "%1"=="parse" set class=org.apache.nutch.parse.ParseSegment
if "%1"=="readdb" set class=org.apache.nutch.crawl.CrawlDbReader
if "%1"=="readlinkdb" set class=org.apache.nutch.crawl.LinkDbReader
if "%1"=="segread" set class=org.apache.nutch.segment.SegmentReader
if "%1"=="updatedb" set class=org.apache.nutch.crawl.CrawlDb
if "%1"=="invertlinks" set class=org.apache.nutch.crawl.LinkDb
if "%1"=="index" set class=org.apache.nutch.indexer.Indexer
if "%1"=="dedup" set class=org.apache.nutch.indexer.DeleteDuplicates
if "%1"=="merge" set class=org.apache.nutch.indexer.IndexMerger
if "%1"=="plugin" set class=org.apache.nutch.plugin.PluginRepository
rem no quotes are needed around the class name: $ is not special in cmd
if "%1"=="server" set class=org.apache.nutch.searcher.DistributedSearch$Server
if "%class%"=="" set class=%1
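rem %* is not affected by shift, so collect the arguments that follow the command name
set args=
:collect
shift
if "%~1"=="" goto run
set args=%args% %1
goto collect
:run
"%java_home%\bin\java" -cp "%classpath%" %class% %args%
if "%OS%"=="Windows_NT" @endlocal
if "%OS%"=="WINNT" @endlocal
:end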
Search itself is not the goal. Implementing search is very easy: you can simply download search source code written in ASP or something else. There are four key points:
1. First of all, once the data reaches the 10 million mark, how fast is your search?
2. Can your search engine discover the latest popular words?
3. How do you classify web pages, and how do you determine, or at least approximate, the user's intent?
4. How do you balance breadth of content against specialization? I am only speaking casually.
Nutch is an open-source Apache project, and its performance is good.
What I am doing now is:
1. New word discovery:
To discover the latest popular words, the users' search terms are examined. A word that is not in the dictionary gets split into single characters by the segmenter. So the single characters that frequently appear together within a period of time are combined; once their frequency passes a threshold, the combination is recognized as a word and added to a temporary dictionary. If it keeps being recognized over a longer period, it is added to the main dictionary.
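As a rough illustration of the new word discovery idea above, here is a minimal Python sketch. It is not my actual Nutch code: the names `candidate_runs` and `NewWordTracker`, the `threshold` and `promote_after` parameters, and the idea of counting per "period" by calling `end_period()` are all assumptions made for the example. It also assumes the segmenter has already turned each query into a list of tokens, with unknown words coming out as single characters.

```python
from collections import Counter


def candidate_runs(tokens):
    """Join every run of 2+ consecutive single-character tokens into one candidate."""
    candidates, run = [], []
    for tok in list(tokens) + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                candidates.append("".join(run))
            run = []
    return candidates


class NewWordTracker:
    """Hypothetical helper: temporary dictionary first, main dictionary later."""

    def __init__(self, dictionary, threshold=3, promote_after=2):
        self.dictionary = set(dictionary)   # main dictionary
        self.temporary = {}                 # candidate -> periods in which it passed the threshold
        self.threshold = threshold          # minimum count within one period
        self.promote_after = promote_after  # periods needed before promotion
        self.counts = Counter()             # counts for the current period

    def observe(self, tokens):
        """Count the candidates found in one segmented query."""
        self.counts.update(candidate_runs(tokens))

    def end_period(self):
        """Close the current period: update the temporary dictionary, then promote."""
        for word, n in self.counts.items():
            if n >= self.threshold:
                self.temporary[word] = self.temporary.get(word, 0) + 1
        for word in [w for w, p in self.temporary.items() if p >= self.promote_after]:
            self.dictionary.add(word)
            del self.temporary[word]
        self.counts.clear()
```

A candidate that stops appearing simply stays in the temporary dictionary here; a real system would probably also expire old candidates.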
2. How to classify web pages, and how to determine, or at least approximate, the user's intent.
Nutch is a second-generation search engine based on full-text indexing. Like Google and Baidu, it does not appear to classify web pages. If you want to extend it in this direction, a better approach seems to be a training-set method: take a certain number of keywords (the words with the largest weights) from each web page and compare them with samples that were manually classified in advance. This can yield a high recognition rate.
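The training-set idea can be sketched roughly as follows in Python. This is only an illustration under assumptions, not something Nutch provides: the function names are made up, a plain term count stands in for the keyword weight, and Jaccard overlap of the keyword sets stands in for the comparison with the manually classified samples.

```python
from collections import Counter


def top_keywords(tokens, k=5):
    """Return the k most frequent tokens as a set (term count stands in for the weight)."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def jaccard(a, b):
    """Overlap of two keyword sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(page_tokens, labeled_samples, k=5):
    """Return the category of the most similar labeled sample, or None if nothing overlaps.

    labeled_samples is a list of (category, tokens) pairs classified by hand in advance.
    """
    keywords = top_keywords(page_tokens, k)
    best_category, best_score = None, 0.0
    for category, sample_tokens in labeled_samples:
        score = jaccard(keywords, top_keywords(sample_tokens, k))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


samples = [
    ("sports", ["football", "goal", "match", "football", "team"]),
    ("finance", ["stock", "market", "bank", "stock", "price"]),
]
print(classify(["stock", "price", "bank", "rise", "stock"], samples))
```

In practice the weights would more likely come from something like TF-IDF, and a page would be compared against many samples per category rather than one.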
3. How to balance breadth of content against specialization?
Breadth, it seems, has to be ensured by having Nutch continuously crawl more web pages. With an index that is close to the actual current web, breadth can be guaranteed, but only given effective queries: as with Google, if your search terms are poor, the desired results cannot be returned. Specialization depends first on good word segmentation and indexing, and second on an extensive index. In fact, what specialization needs most (as you said) is the classification of web pages. Once pages are classified, specialization is much easier to guarantee.
Without classification, the biggest technical difficulty is segmenting words as correctly as possible, both after a web page is crawled and when users search. Because of the special nature of Chinese, this is very hard. That is why I integrated ICTCLAS, from the Chinese Academy of Sciences, into Nutch.
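To show why segmentation depends so heavily on the dictionary, here is a minimal Python sketch of forward maximum matching, a simple dictionary-based technique. This is not how ICTCLAS works (as far as I know it relies on a statistical model), and the function name and the `max_len` parameter are assumptions for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Anything that is not in the dictionary falls back to single characters.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


words = {"搜索", "引擎", "搜索引擎"}
print(forward_max_match("搜索引擎给力", words))
```

The unknown word at the end comes out as two single characters, which is exactly the situation that new word discovery has to deal with.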
The above is just my personal experience. Please advise.