Summary of the Secondary Development of Nutch (1)

Because a series of operations happens offline (from the querying user's point of view), the Nutch retrieval system itself is fairly simple. During secondary development, you mainly need to adjust Nutch's pages and the data they display.

1. Summary extraction

1.1 Summary extraction source code analysis

The code analyzed here is Lucene's Highlighter.getBestTextFragments, the low-level method that selects and formats the most relevant fragments of a document:
/**
 * Low-level API to get the most relevant (formatted) sections of the document.
 * This method has been made public to allow visibility of score information held in TextFragment objects.
 * Thanks to Jason Calabrese for help in redefining the interface.
 * @param tokenStream
 * @param text
 * @param maxNumFragments
 * @param mergeContiguousFragments
 * @throws IOException
 */
public final TextFragment[] getBestTextFragments(
    TokenStream tokenStream,
    String text,
    boolean mergeContiguousFragments,
    int maxNumFragments)
    throws IOException
{
    ArrayList docFrags = new ArrayList();
    StringBuffer newText = new StringBuffer();

    TextFragment currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
    fragmentScorer.startFragment(currentFrag);
    docFrags.add(currentFrag);

    FragmentQueue fragQueue = new FragmentQueue(maxNumFragments);

    try
    {
        org.apache.lucene.analysis.Token token;
        String tokenText;
        int startOffset;
        int endOffset;
        int lastEndOffset = 0;
        textFragmenter.start(text);

        TokenGroup tokenGroup = new TokenGroup();
        token = tokenStream.next();
        while ((token != null) && (token.startOffset() < maxDocBytesToAnalyze))
        {
            if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct(token)))
            {
                // the current token is distinct from previous tokens -
                // markup the cached token group info
                startOffset = tokenGroup.matchStartOffset;
                endOffset = tokenGroup.matchEndOffset;
                tokenText = text.substring(startOffset, endOffset);
                String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
                // store any whitespace etc. from between this and the last group
                if (startOffset > lastEndOffset)
                    newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
                newText.append(markedUpText);
                lastEndOffset = Math.max(endOffset, lastEndOffset);
                tokenGroup.clear();

                // check if the current token marks the start of a new fragment
                if (textFragmenter.isNewFragment(token))
                {
                    currentFrag.setScore(fragmentScorer.getFragmentScore());
                    // record stats for a new fragment
                    currentFrag.textEndPos = newText.length();
                    currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
                    fragmentScorer.startFragment(currentFrag);
                    docFrags.add(currentFrag);
                }
            }

            tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));

            // if (lastEndOffset > maxDocBytesToAnalyze)
            // {
            //     break;
            // }
            token = tokenStream.next();
        }
        currentFrag.setScore(fragmentScorer.getFragmentScore());

        if (tokenGroup.numTokens > 0)
        {
            // flush the accumulated text (same code as in the above loop)
            startOffset = tokenGroup.matchStartOffset;
            endOffset = tokenGroup.matchEndOffset;
            tokenText = text.substring(startOffset, endOffset);
            String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
            // store any whitespace etc. from between this and the last group
            if (startOffset > lastEndOffset)
                newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
            newText.append(markedUpText);
            lastEndOffset = Math.max(lastEndOffset, endOffset);
        }

        // test what remains of the original text beyond the point where we stopped analyzing
        if (
            // if there is text beyond the last token considered...
            (lastEndOffset < text.length())
            &&
            // and that text is not too large...
            (text.length() < maxDocBytesToAnalyze)
           )
        {
            // append it to the last fragment
            newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }

        currentFrag.textEndPos = newText.length();

        // sort the most relevant sections of the text
        for (Iterator i = docFrags.iterator(); i.hasNext();)
        {
            currentFrag = (TextFragment) i.next();

            // if you are running with a version of Lucene before 11th Sept 03
            // you do not have PriorityQueue.insert() - so uncomment the code below
            /*
            if (currentFrag.getScore() >= minScore)
            {
                fragQueue.put(currentFrag);
                if (fragQueue.size() > maxNumFragments)
                { // if hit queue overfull
                    fragQueue.pop(); // remove lowest in hit queue
                    minScore = ((TextFragment) fragQueue.top()).getScore(); // reset minScore
                }
            }
            */
            // the above code caused a problem as a result of Christoph Goller's 11th Sept 03
            // fix to PriorityQueue; the correct method to use here is the new "insert" method
            // USE ABOVE CODE IF THIS DOES NOT COMPILE!
            fragQueue.insert(currentFrag);
        }

        // return the most relevant fragments
        TextFragment frag[] = new TextFragment[fragQueue.size()];
        for (int i = frag.length - 1; i >= 0; i--)
        {
            frag[i] = (TextFragment) fragQueue.pop();
        }

        // merge any contiguous fragments to improve readability
        if (mergeContiguousFragments)
        {
            mergeContiguousFragments(frag);
            ArrayList fragTexts = new ArrayList();
            for (int i = 0; i < frag.length; i++)
            {
                if ((frag[i] != null) && (frag[i].getScore() > 0))
                {
                    fragTexts.add(frag[i]);
                }
            }
            frag = (TextFragment[]) fragTexts.toArray(new TextFragment[0]);
        }

        return frag;
    }
    finally
    {
        if (tokenStream != null)
        {
            try
            {
                tokenStream.close();
            }
            catch (Exception e)
            {
            }
        }
    }
}
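
To see how this method is driven in practice, here is a minimal sketch of a caller using the Lucene Highlighter API of that era (the field name "content", the analyzer choice, and the fragment sizes are illustrative assumptions, not taken from Nutch itself):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.TextFragment;

public class HighlightSketch
{
    // returns the top-scoring highlighted fragments for one document's text
    public static TextFragment[] bestFragments(Query query, String text) throws Exception
    {
        // QueryScorer rates each token by how well it matches the query
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // break the text into fragments of roughly 50 characters (illustrative size)
        highlighter.setTextFragmenter(new SimpleFragmenter(50));
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
        // false = do not merge contiguous fragments; 3 = return at most three fragments
        return highlighter.getBestTextFragments(tokens, text, false, 3);
    }
}

QueryScorer decides each token's score and SimpleFragmenter decides where fragments break; swapping either one changes which fragments getBestTextFragments ranks as most relevant.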

1.2 Changing the summary length

The length of the summary shown in Nutch's query results can be changed through configuration. The configuration file is nutch-site.xml:

<configuration>
...
  <property>
    <name>searcher.summary.length</name>
    <!-- the default value is 20 -->
    <value>50</value>
    <description>
      The total number of terms to display in a hit summary.
    </description>
  </property>
...
</configuration>

Nutch's default configuration lives in nutch-default.xml; to override any of it, simply add the corresponding property to nutch-site.xml.
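
At runtime this property is read through Nutch's Hadoop-backed Configuration object. A minimal sketch, assuming Nutch 0.8+, where NutchConfiguration.create() loads nutch-default.xml and then nutch-site.xml, so site values override the defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class SummaryLengthCheck
{
    public static void main(String[] args)
    {
        // loads nutch-default.xml first, then applies nutch-site.xml overrides
        Configuration conf = NutchConfiguration.create();
        // falls back to 20 when the property is absent
        int summaryLength = conf.getInt("searcher.summary.length", 20);
        System.out.println("summary length = " + summaryLength);
    }
}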

2. Web snapshots

A web snapshot is a copy of a page stored on the search engine's server. When you search with a keyword, the related information for each hit, such as the title, URL, and content, is returned, and the URL links to the original page. A web snapshot is simply the page content fetched by the crawler, so when a user clicks a snapshot link, the original page content can be looked up from the index by document ID. In cached.jsp of the query service, the source code is as follows:

Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
                  request.getParameter("id"));
HitDetails details = bean.getDetails(hit);
...
String content = new String(bean.getContent(details));

This also raises the issue of Chinese characters in web snapshots: the Chinese text displays correctly once the content is decoded as UTF-8. Modify cached.jsp, changing

else
    content = new String(bean.getContent(details));

to

    content = new String(bean.getContent(details), "UTF-8");

If you need to change how the snapshot content is displayed, this page is also the place to do it.
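
Putting the pieces together, the relevant scriptlet in cached.jsp ends up looking roughly like the sketch below. This is illustrative only: the real page wraps the decoding in a content-type test that is simplified here, and plain Java needs the checked-exception handling that a JSP's generated service method would otherwise absorb:

// retrieve the hit the user clicked and look up its indexed details
Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
                  request.getParameter("id"));
HitDetails details = bean.getDetails(hit);

// fetch the raw crawled bytes for this document from the segment data
byte[] bytes = bean.getContent(details);

String content;
try
{
    // decode as UTF-8 so Chinese (and other non-ASCII) text displays correctly
    content = new String(bytes, "UTF-8");
}
catch (java.io.UnsupportedEncodingException e)
{
    content = new String(bytes); // fall back to the platform default charset
}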

For the second part of this article, see: http://hi.baidu.com/zhumulangma/blog/item/2c0f05f4b55e38e77709d7ce.html
