Summary of the Secondary Development of Nutch (1)

Because a series of operations happens offline (from the querying user's point of view), the Nutch retrieval system itself is fairly simple. During secondary development, you mainly need to adjust Nutch's pages and the data they display.

1. Summary extraction

1.1 Summary extraction source code analysis

The code analyzed here is Lucene's Highlighter.getBestTextFragments, the low-level method that selects and formats the most relevant fragments of a document:
/**
 * Low-level API to get the most relevant (formatted) sections of the document.
 * This method has been made public to allow visibility of score information held in TextFragment objects.
 * Thanks to Jason Calabrese for help in redefining the interface.
 * @param tokenStream
 * @param text
 * @param maxNumFragments
 * @param mergeContiguousFragments
 * @throws IOException
 */
public final TextFragment[] getBestTextFragments(
    TokenStream tokenStream,
    String text,
    boolean mergeContiguousFragments,
    int maxNumFragments)
    throws IOException
{
    ArrayList docFrags = new ArrayList();
    StringBuffer newText = new StringBuffer();

    TextFragment currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
    fragmentScorer.startFragment(currentFrag);
    docFrags.add(currentFrag);

    FragmentQueue fragQueue = new FragmentQueue(maxNumFragments);

    try
    {
        org.apache.lucene.analysis.Token token;
        String tokenText;
        int startOffset;
        int endOffset;
        int lastEndOffset = 0;
        textFragmenter.start(text);

        TokenGroup tokenGroup = new TokenGroup();
        token = tokenStream.next();
        while ((token != null) && (token.startOffset() < maxDocBytesToAnalyze))
        {
            if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct(token)))
            {
                // the current token is distinct from previous tokens -
                // markup the cached token group info
                startOffset = tokenGroup.matchStartOffset;
                endOffset = tokenGroup.matchEndOffset;
                tokenText = text.substring(startOffset, endOffset);
                String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
                // store any whitespace etc. from between this and the last group
                if (startOffset > lastEndOffset)
                    newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
                newText.append(markedUpText);
                lastEndOffset = Math.max(endOffset, lastEndOffset);
                tokenGroup.clear();

                // check if the current token marks the start of a new fragment
                if (textFragmenter.isNewFragment(token))
                {
                    currentFrag.setScore(fragmentScorer.getFragmentScore());
                    // record stats for a new fragment
                    currentFrag.textEndPos = newText.length();
                    currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
                    fragmentScorer.startFragment(currentFrag);
                    docFrags.add(currentFrag);
                }
            }

            tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));

            // if (lastEndOffset > maxDocBytesToAnalyze)
            // {
            //     break;
            // }
            token = tokenStream.next();
        }
        currentFrag.setScore(fragmentScorer.getFragmentScore());

        if (tokenGroup.numTokens > 0)
        {
            // flush the accumulated text (same code as in the above loop)
            startOffset = tokenGroup.matchStartOffset;
            endOffset = tokenGroup.matchEndOffset;
            tokenText = text.substring(startOffset, endOffset);
            String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
            // store any whitespace etc. from between this and the last group
            if (startOffset > lastEndOffset)
                newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
            newText.append(markedUpText);
            lastEndOffset = Math.max(lastEndOffset, endOffset);
        }

        // test what remains of the original text beyond the point where we stopped analyzing
        if (
            // if there is text beyond the last token considered...
            (lastEndOffset < text.length())
            &&
            // and that text is not too large...
            (text.length() < maxDocBytesToAnalyze)
           )
        {
            // append it to the last fragment
            newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }

        currentFrag.textEndPos = newText.length();

        // sort the most relevant sections of the text
        for (Iterator i = docFrags.iterator(); i.hasNext();)
        {
            currentFrag = (TextFragment) i.next();

            // if you are running with a version of Lucene before 11th Sept 03
            // you do not have PriorityQueue.insert() - so uncomment the code below
            /*
            if (currentFrag.getScore() >= minScore)
            {
                fragQueue.put(currentFrag);
                if (fragQueue.size() > maxNumFragments)
                { // if hit queue overfull
                    fragQueue.pop(); // remove lowest in hit queue
                    minScore = ((TextFragment) fragQueue.top()).getScore(); // reset minScore
                }
            }
            */
            // the above code caused a problem as a result of Christoph Goller's 11th Sept 03
            // fix to PriorityQueue; the correct method to use here is the new "insert" method
            // USE ABOVE CODE IF THIS DOES NOT COMPILE!
            fragQueue.insert(currentFrag);
        }

        // return the most relevant fragments
        TextFragment frag[] = new TextFragment[fragQueue.size()];
        for (int i = frag.length - 1; i >= 0; i--)
        {
            frag[i] = (TextFragment) fragQueue.pop();
        }

        // merge any contiguous fragments to improve readability
        if (mergeContiguousFragments)
        {
            mergeContiguousFragments(frag);
            ArrayList fragTexts = new ArrayList();
            for (int i = 0; i < frag.length; i++)
            {
                if ((frag[i] != null) && (frag[i].getScore() > 0))
                {
                    fragTexts.add(frag[i]);
                }
            }
            frag = (TextFragment[]) fragTexts.toArray(new TextFragment[0]);
        }

        return frag;
    }
    finally
    {
        if (tokenStream != null)
        {
            try
            {
                tokenStream.close();
            }
            catch (Exception e)
            {
            }
        }
    }
}
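
To see how this method is driven in practice, here is a minimal sketch of a caller using the Lucene Highlighter API of that era (the field name "content", the analyzer choice, and the fragment sizes are illustrative assumptions, not taken from Nutch itself):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.TextFragment;

public class HighlightSketch
{
    // returns the top-scoring highlighted fragments for one document's text
    public static TextFragment[] bestFragments(Query query, String text) throws Exception
    {
        // QueryScorer rates each token by how well it matches the query
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // break the text into fragments of roughly 50 characters (illustrative size)
        highlighter.setTextFragmenter(new SimpleFragmenter(50));
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
        // false = do not merge contiguous fragments; 3 = return at most three fragments
        return highlighter.getBestTextFragments(tokens, text, false, 3);
    }
}

QueryScorer decides each token's score and SimpleFragmenter decides where fragments break; swapping either one changes which fragments getBestTextFragments ranks as most relevant.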

1.2 Changing the summary length

The length of the summary shown in Nutch's query results can be changed through configuration. The configuration file is nutch-site.xml:

<configuration>
...
  <property>
    <name>searcher.summary.length</name>
    <!-- the default value is 20 -->
    <value>50</value>
    <description>
      The total number of terms to display in a hit summary.
    </description>
  </property>
...
</configuration>

Nutch's default configuration lives in nutch-default.xml; to override any of it, simply add the corresponding property to nutch-site.xml.
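
At runtime this property is read through Nutch's Hadoop-backed Configuration object. A minimal sketch, assuming Nutch 0.8+, where NutchConfiguration.create() loads nutch-default.xml and then nutch-site.xml, so site values override the defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class SummaryLengthCheck
{
    public static void main(String[] args)
    {
        // loads nutch-default.xml first, then applies nutch-site.xml overrides
        Configuration conf = NutchConfiguration.create();
        // falls back to 20 when the property is absent
        int summaryLength = conf.getInt("searcher.summary.length", 20);
        System.out.println("summary length = " + summaryLength);
    }
}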

2. Web snapshots

A web snapshot is a copy of a page stored on the search engine's server. When you search with a keyword, the related information for each hit, such as the title, URL, and content, is returned, and the URL links to the original page. A web snapshot is simply the page content fetched by the crawler, so when a user clicks a snapshot link, the original page content can be looked up from the index by document ID. In cached.jsp of the query service, the source code is as follows:

Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
                  request.getParameter("id"));
HitDetails details = bean.getDetails(hit);
...
String content = new String(bean.getContent(details));

This also raises the issue of Chinese characters in web snapshots: the Chinese text displays correctly once the content is decoded as UTF-8. Modify cached.jsp, changing

else
    content = new String(bean.getContent(details));

to

    content = new String(bean.getContent(details), "UTF-8");

If you need to change how the snapshot content is displayed, this page is also the place to do it.
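
Putting the pieces together, the relevant scriptlet in cached.jsp ends up looking roughly like the sketch below. This is illustrative only: the real page wraps the decoding in a content-type test that is simplified here, and plain Java needs the checked-exception handling that a JSP's generated service method would otherwise absorb:

// retrieve the hit the user clicked and look up its indexed details
Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
                  request.getParameter("id"));
HitDetails details = bean.getDetails(hit);

// fetch the raw crawled bytes for this document from the segment data
byte[] bytes = bean.getContent(details);

String content;
try
{
    // decode as UTF-8 so Chinese (and other non-ASCII) text displays correctly
    content = new String(bytes, "UTF-8");
}
catch (java.io.UnsupportedEncodingException e)
{
    content = new String(bytes); // fall back to the platform default charset
}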

For the second part of this article, see: http://hi.baidu.com/zhumulangma/blog/item/2c0f05f4b55e38e77709d7ce.html
