Scenario description:
In the past, when we extracted a webpage's title, we simply took the content between the <title> tags. In practice, that content often carries extra text. For example, for the JavaEye article at http://www.iteye.com/news/21643, the <title> content is "10 years of software development has taught me the most important 10 things-non-technical-ITeye News", but the title we actually want is "10 years of software development has taught me the most important 10 things". In other words, a lot of irrelevant keywords are piled up after the title (presumably for SEO), and we want to filter them out. The following methods can be considered:
1. Look for tags such as h1. (After analyzing some websites, such as Sina News, I do not think this is feasible; there is too much interference.)
2. Remove the <title> from the full text, then split the title on separators (_, | and -) into segments a1, a2, a3, and a4. Starting from the longest segment, say a3, search the full text for it. If it is found, extend it to the left with a2 and then a1, searching the full text again each time, until a search fails. Once the left side fails, continue extending to the right in the same way. (This is the method used here.)
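The second method can be sketched as a single function. This is only a minimal illustration, not the class listed below: the function name purifyTitle is hypothetical, the separators are assumed to be _, | and - only, and $body is assumed to be the page content with <title> and <meta> already removed (otherwise the whole title would always be found).

```php
<?php
// Hypothetical helper for illustration; not part of the TitlePurify class.
function purifyTitle(string $title, string $body): string
{
    // Even indexes hold segments, odd indexes hold the separators between them.
    $parts = preg_split('/([_|-]+)/', $title, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Pick the longest segment; it is assumed to belong to the real title.
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    $result = $parts[$best];
    if (stripos($body, $result) === false) {
        return trim($title); // nothing to anchor on, fall back to the raw title
    }

    // Grow to the left while the longer candidate still occurs in the body.
    for ($i = $best - 2; $i >= 0; $i -= 2) {
        $candidate = $parts[$i] . $parts[$i + 1] . $result;
        if (stripos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    // Then grow to the right in the same way.
    for ($i = $best + 2; $i < count($parts); $i += 2) {
        $candidate = $result . $parts[$i - 1] . $parts[$i];
        if (stripos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    return trim($result);
}
```

Plain substring search (stripos) is used here instead of regular expressions, so characters such as | in the title need no escaping. Because the separators are kept alongside the segments, a hyphen that belongs to the real title is restored when the extended candidate is found in the body.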
PHP code
<?php
/**
 * @author pqcc <struts.ec@mgail.com>
 * @date 2011-06-18
 * Description: extracts the title of a webpage based on the page content. The extracted title does not include SEO keywords.
 * E.g.: the title of a news article taken directly from <title> is: "9.09 million students in CET4 and CET6 this Saturday",
 * but the result we want is: "9.09 million students of CET4 and CET6 will be admitted this Saturday".
 * Applicability: extracting the title of final article pages, excluding topic pages.
 */
class TitlePurify {
    // Character class of the separators used to split the title.
    private $matches_preg = "[-_\s|-]";
    function getTitle($contents) {/*{{{*/
        $preg = "/<title[^>]*>([\w|\W]*?)<\/title>/i";
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }/*}}}*/
    function trimMeta($contents) {/*{{{*/
        // First remove the <title> content and the <meta> tags.
        $preg = "/<title[^>]*>([\w|\W]*?)<\/title>/i";
        $contents = preg_replace($preg, "", $contents);
        $preg = "/<meta[^>]*>/i";
        $contents = preg_replace($preg, "", $contents);
        return $contents;
    }/*}}}*/
    // Return the index of the longest item.
    function getMaxIndex($titles) {/*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/
    function trim($title, $titles, $contents, $maxItemIndex) {/*{{{*/
        // @todo: contents can be optimized here.
        // If the search succeeds, result = tempTitle.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);
        // Iterate from the current index to the left (stop when the first index is reached or a match fails).
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Match tempTitle together with the separator(s) on its left; preg_quote() keeps tempTitle literal.
            preg_match("/({$this->matches_preg}+" . preg_quote($tempTitle, "/") . ")/i", $title, $matches);
            if (count($matches) > 1) {
                // temp is used to roll back after the matching fails.
                $temp = $tempTitle;
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Search the page content for the extended tempTitle.
                preg_match("/" . preg_quote($tempTitle, "/") . "/i", $contents, $matches);
                // If the search fails, roll back and stop.
                if (count($matches) < 1) {
                    $tempTitle = $temp;
                    break;
                } else {
                    $result = $tempTitle;
                }
            } else { // Normally, this will not happen.
                break;
            }
            $leftIndex--;