Scenario Description:
In the past, when we extracted the title of a web page, we took the content of the <title> tag directly. In practice that content is often padded. For example, for the JavaEye article http://www.iteye.com/news/21643 the extracted title is "10 years of software development taught me the most important 10 things - non-technical - ITeye news", whereas the title we actually want is "10 years of software development taught me the most important 10 things". A lot of irrelevant keywords are piled up after the real title (presumably for SEO), so we want to filter them out. The following approaches can be considered:
1. Look for tags such as h1. (After analyzing some Sina News pages, I feel this is not feasible; there is too much interference.)
2. Match the title against the full text. Split the title on the separators (_ | -) into segments A1, A2, A3, A4, then take the longest segment, say A3, and search for it in the full text. If the lookup succeeds, extend to the left by adding A2, then A1, until the lookup fails. After the left side fails, extend to the right in the same way. (This is the method I use here.)
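The first step of method 2 (splitting the title and picking the longest segment) can be sketched as follows. This is only an illustration: the function name splitTitle and the sample title are assumptions and do not appear in the original class, and the separator set is limited to the three characters named above.

```php
<?php
// Hypothetical helper, not part of the original class.
// Splits a <title> string on "_", "|" and "-" and reports the longest segment.
function splitTitle(string $title): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces produced by leading/trailing separators.
    $segments = preg_split('/[_|\-]+/', $title, -1, PREG_SPLIT_NO_EMPTY);

    // Trim whitespace around each segment and drop segments that become empty.
    $segments = array_values(array_filter(
        array_map('trim', $segments),
        function ($s) { return $s !== ''; }
    ));

    // Index of the longest segment (byte length, as strlen is used in the class below).
    $longest = 0;
    foreach ($segments as $i => $segment) {
        if (strlen($segment) > strlen($segments[$longest])) {
            $longest = $i;
        }
    }

    return ['segments' => $segments, 'longest' => $longest];
}

$r = splitTitle("College English test opens this Saturday_Sina Education_Sina Net");
echo $r['segments'][$r['longest']], "\n";
```

Note that splitting on "-" also cuts hyphenated words inside the real title, which is why the method then grows the longest segment to the left and right instead of simply returning it.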
PHP code
/**
 * @author PQCC
 * @date: 2011-06-18
 * Description: given the content of a page, extract the page title. The extracted title does not include SEO keywords.
 * E.g.: for a Sina article, extracting <title> directly gives "College English Test Band 4/6 opens this Saturday, 9.09 million candidates_Sina Education_Sina Net",
 * but the result we want is "College English Test Band 4/6 opens this Saturday, 9.09 million candidates".
 * Scope: title extraction for final article pages, not including special-topic pages.
 */
class titlepurify {

    // Character class of the separators that may surround the real title.
    private $matches_preg = "[-_\s|]";

    function GetTitle($contents) {/*{{{*/
        $preg = "/<title[^>]*>([\w\W]*?)<\/title>/i";
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimtitle($title, $contents);
    }/*}}}*/
    function Trimmeta($contents) {/*{{{*/
        // First remove the <title> content and the <meta> tags.
        $preg = "/<title[^>]*>([\w\W]*?)<\/title>/i";
        $contents = preg_replace($preg, '', $contents);
        $preg = "/<meta[^>]*>/i";
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/
    // Gets the index of the longest item.
    function Getmaxindex($titles) {/*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/
    function Trim($title, $titles, $contents, $maxItemIndex) {/*{{{*/
        // @todo: optimize $contents here
        // If the lookup succeeds, $result = $tempTitle.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);
        // Iterate from the current index to the left (until the first item is reached or a match fails).
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // $tempTitle plus the separator to its left; preg_quote keeps
            // title characters from being read as regex syntax.
            $quoted = preg_quote($tempTitle, "/");
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                // Save the current candidate in $temp so it can be rolled back if the match fails.
                $temp = $tempTitle;
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Try to find the extended $tempTitle in the page content.
                preg_match("/" . preg_quote($tempTitle, "/") . "/i", $contents, $matches);
                // If the lookup fails, roll back and stop.
                if (count($matches) < 1) {
                    $tempTitle = $temp;
                    break;
                } else {
                    $result = $tempTitle;
                }
            } else { // Normally, this condition does not occur.
                break;
            }
            $leftIndex--;
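
As a standalone illustration of the expand-left-then-right idea from method 2, here is a minimal sketch. It is not the original class: the function name purifyTitle is hypothetical, it uses a plain case-insensitive substring search (stripos) instead of regular-expression matching, and it assumes the separators are "-", "_" and "|".

```php
<?php
// Hypothetical standalone sketch, not the original method.
// Returns the title without trailing/leading SEO segments, or null if no <title> is found.
function purifyTitle(string $html): ?string
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    $title = trim($m[1]);

    // Remove <title> and <meta> so the title can only be confirmed by the page body.
    $body = preg_replace(['/<title[^>]*>.*?<\/title>/is', '/<meta[^>]*>/i'], '', $html);

    // Keep the separators: even indexes are segments, odd indexes are separators.
    $parts = preg_split('/(\s*[-_|]+\s*)/', $title, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Start from the longest segment.
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }

    // Re-join parts[$from..$to] exactly as they appeared in the title.
    $join = function (int $from, int $to) use ($parts): string {
        return implode('', array_slice($parts, $from, $to - $from + 1));
    };
    $found = function (string $candidate) use ($body): bool {
        return stripos($body, $candidate) !== false;
    };

    // If even the longest segment is not in the body, fall back to the raw title.
    if ($parts[$best] === '' || !$found($parts[$best])) {
        return $title;
    }

    $lo = $best;
    $hi = $best;
    // Extend to the left until the first segment is reached or the lookup fails.
    while ($lo - 2 >= 0 && $found($join($lo - 2, $hi))) {
        $lo -= 2;
    }
    // Then extend to the right in the same way.
    while ($hi + 2 < count($parts) && $found($join($lo, $hi + 2))) {
        $hi += 2;
    }

    return $join($lo, $hi);
}
```

For a page whose <title> is "Non-stop coding tips_Example Blog_Example Net" and whose body contains the heading "Non-stop coding tips", the sketch starts from "stop coding tips", grows left across the hyphen, and stops growing right because "Non-stop coding tips_Example Blog" does not occur in the body.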