PHP: extracting the webpage title and removing irrelevant SEO keywords
Scenario description:
In the past, when we extracted a webpage title, we simply took the content between the <title> and </title> tags. In practice this is not enough. Take the JavaEye article http://www.iteye.com/news/21643: the content of its <title> tag is "10 most important things that software development taught me in 10 years - Non-technical - ITeye information", but the title we actually want to cite is "10 most important things that software development taught me in 10 years". A number of irrelevant keywords are piled up after the real title (presumably for SEO), so we want to filter them out. Possible methods:
1. Look for heading tags such as h1. (After analyzing some websites, such as Sina News, I consider this unfeasible: there is too much interference.)
2. Remove the title tag from the full text, then cut the title (on _, | and -) into segments such as a1, a2, a3, and a4, and search the full text for the longest segment, say a3. If it is found, extend the phrase to the left with a2 and then a1, searching again each time, until a search fails. After the left side fails, continue extending to the right in the same way. (This is the method used here.)
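Method 2 can be illustrated with a compact sketch. This is not the author's class: `purifyTitleSketch`, the sample title, and the sample body are hypothetical, and the body is assumed to have had its title and meta tags removed already. The sketch keeps the separator runs when splitting, so that a candidate phrase can be rebuilt exactly as it appears in the title.

```php
<?php
// Hypothetical helper illustrating method 2; not the author's implementation.
function purifyTitleSketch(string $title, string $body): string
{
    // Split on runs of separators and keep them in the result:
    // even indexes hold segments, odd indexes hold separator runs.
    $parts = preg_split('/([-_|\s]+)/', trim($title), -1, PREG_SPLIT_DELIM_CAPTURE);

    // Find the longest segment (the "a3" of the description).
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    // If the longest segment is not in the body, keep the raw title.
    if ($parts[$best] === '' || stripos($body, $parts[$best]) === false) {
        return $title;
    }

    // Rebuild the phrase covering parts[$from..$to], separators included.
    $join = function (int $from, int $to) use ($parts): string {
        return implode('', array_slice($parts, $from, $to - $from + 1));
    };

    $lo = $best;
    $hi = $best;
    // Grow to the left while the longer phrase still occurs in the body.
    while ($lo - 2 >= 0 && stripos($body, $join($lo - 2, $hi)) !== false) {
        $lo -= 2;
    }
    // Then grow to the right in the same way.
    while ($hi + 2 < count($parts) && stripos($body, $join($lo, $hi + 2)) !== false) {
        $hi += 2;
    }
    return $join($lo, $hi);
}

// Made-up sample data for illustration only.
$title = 'Understanding PHP regular expressions - Tutorials - Example News';
$body  = '<h1>Understanding PHP regular expressions</h1><p>Article text goes here.</p>';
echo purifyTitleSketch($title, $body), "\n";
// prints: Understanding PHP regular expressions
```

Because whitespace is one of the separators, an English title is cut into single words and then reassembled word by word. The approach depends on the longest segment belonging to the real title; when it does not occur in the body, the sketch falls back to the raw title.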
<?php
/**
 * @Date: 2011-06-18
 * Description: extract the title of a webpage based on the content of the page.
 *   The extracted title does not include the SEO keywords.
 * e.g.: for one news article, the direct extraction result is
 *   "CET4 and CET6 this Saturday: 9.09 million students will take the exam_Sina_Sina Education_Sina Net".
 *   The expected result is:
 *   "CET4 and CET6 this Saturday: 9.09 million students will take the exam".
 * Applicability: final article pages only, not topic pages.
 */
class TitlePurify
{
    // Character class of the separators used to cut the title.
    private $matches_preg = '[-_\s|-]';

    function getTitle($contents)
    {
        $preg = '/<title[^>]*>(.*?)<\/title>/is';
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }

    function trimMeta($contents)
    {
        // First remove the <title> content and the <meta> tags,
        // so that the title can only be found in the page body.
        $preg = '/<title[^>]*>.*?<\/title>/is';
        $contents = preg_replace($preg, '', $contents);
        $preg = '/<meta[^>]*>/i';
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }

    // Case-insensitive literal search. This replaces the original
    // preg_match("/$tempTitle/i", ...), which broke on titles that
    // contain regex metacharacters.
    private function found($needle, $contents)
    {
        return $needle !== '' && stripos($contents, $needle) !== false;
    }

    // Get the index of the longest item.
    function getMaxIndex($titles)
    {
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }

    function trim($title, $titles, $contents, $maxItemIndex)
    {
        // TODO: the handling of $contents can be optimized here.
        // The longest item was already found, so it is the initial result.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);

        // Iterate from the current index to the left; stop when the
        // first index is reached or a search fails.
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Capture the separators in front of tempTitle.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) <= 1) {
                // Under normal circumstances this will not happen.
                break;
            }
            // One item on the left + separators + tempTitle.
            $candidate = $titles[$leftIndex] . $matches[1];
            if (!$this->found($candidate, $contents)) {
                // Search failed: keep the previous tempTitle.
                break;
            }
            $tempTitle = $candidate;
            $result = $tempTitle;
            $leftIndex--;
        }

        // After the left side fails, continue to the right.
        // (The original looped while $rightIndex <= $count, which reads
        // one item past the end of $titles.)
        $rightIndex = $maxItemIndex + 1;
        while ($rightIndex < $count) {
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$quoted}{$this->matches_preg}+)/i", $title, $matches);
            if (count($matches) <= 1) {
                // Under normal circumstances this will not happen.
                break;
            }
            // tempTitle + separators + one item on the right.
            $candidate = $matches[1] . $titles[$rightIndex];
            if (!$this->found($candidate, $contents)) {
                break;
            }
            $tempTitle = $candidate;
            $result = $tempTitle;
            $rightIndex++;
        }
        return $result;
    }

    function trimTitle($title, $contents)
    {
        $contents = $this->trimMeta($contents);
        // Cut the title on runs of separators; dropping empty items keeps
        // the indexes aligned with the separator runs matched in trim().
        $titles = preg_split("/{$this->matches_preg}+/", $title, -1, PREG_SPLIT_NO_EMPTY);
        if (count($titles) === 0) {
            return $title;
        }
        // Search the full text for the longest item.
        $maxItemIndex = $this->getMaxIndex($titles);
        $tempTitle = $titles[$maxItemIndex];
        // If the search fails, return the title unchanged.
        if (!$this->found($tempTitle, $contents)) {
            return $title;
        }
        return $this->trim($title, $titles, $contents, $maxItemIndex);
    }
}

// --------------- test code ---------------
function convertEncoding($contents)
{
    preg_match('/charset=([\w-]+);?/i', $contents, $match);
    $charset = isset($match[1]) ? $match[1] : 'utf-8';
    $contents = mb_convert_encoding($contents, 'utf-8', $charset);
    return $contents;
}

$url = 'http://china.nba.com/news/4/2011/0617/61383331/10451.html';
$contents = file_get_contents($url);
$contents = convertEncoding($contents);
// microtime(true) returns a float, so the two values can be subtracted.
$startTime = microtime(true);
$purify = new TitlePurify();
$title = $purify->getTitle($contents);
$endTime = microtime(true);
echo "title: $title";
echo "cost: " . ($endTime - $startTime);
?>