PHP: extract web page titles and remove irrelevant SEO keywords

Source: Internet
Author: User
Scenario description:

In the past, when extracting a web page title, we would simply take the content between the <title> and </title> tags. In practice this is not enough. Take the JavaEye article http://www.iteye.com/news/21643: the content of its <title> tag is "10 most important things that software development taught me in 10 years - non-technical - ITeye information", whereas the title we actually want to cite is "10 most important things that software development taught me in 10 years". A lot of irrelevant keywords are piled up after the real title (presumably for SEO), so we want to filter them out. The following methods can be considered:
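To make the problem concrete, here is a minimal sketch of the naive extraction described above. The HTML string is a made-up stand-in for the ITeye page, and `naiveTitle()` is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical page, loosely modelled on the ITeye example above.
$html = '<html><head><title>10 most important things that software development '
      . 'taught me in 10 years - non-technical - ITeye information</title></head>'
      . '<body><h1>10 most important things that software development '
      . 'taught me in 10 years</h1></body></html>';

// Naive extraction: take everything between <title> and </title>.
function naiveTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null; // no <title> element found
}

echo naiveTitle($html), "\n";
// prints: 10 most important things that software development taught me in 10 years - non-technical - ITeye information
```

The result still carries the " - non-technical - ITeye information" suffix, which is exactly what the rest of this article tries to strip.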


1. Search for heading tags such as h1. (After analyzing some sites such as Sina News, I do not think this is feasible; there is too much interference.)

2. Remove the title element from the full text, then split the title on separators (_, | and -) into segments a1, a2, a3, a4. Search the full text for the longest segment, say a3. If it is found, extend the candidate to the left with a2 and then a1, searching the full text each time, until a search fails. After the left side fails, continue extending to the right in the same way. (This is the method used here.)
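The steps of method 2 can be sketched compactly as a standalone function. This is only an illustration under stated assumptions: `purifyTitle()` and its parameters are hypothetical names, `$body` is assumed to be the page text with the title element already removed, separators are limited to `_`, `|` and `-` with optional surrounding whitespace, and the search uses the byte-oriented `stripos()` (multibyte text would call for `mb_stripos()`).

```php
<?php
// Sketch of method 2; purifyTitle() is a hypothetical name.
function purifyTitle(string $title, string $body): string
{
    // Each entry is [segment text, byte offset of the segment in $title].
    $parts = preg_split('/\s*[-_|]+\s*/', $title, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    if (!$parts) {
        return $title;
    }

    // Find the index of the longest segment.
    $best = 0;
    foreach ($parts as $i => $part) {
        if (strlen($part[0]) > strlen($parts[$best][0])) {
            $best = $i;
        }
    }
    if (stripos($body, $parts[$best][0]) === false) {
        return $title; // nothing to anchor on, keep the raw title
    }

    // Text of $title from the start of segment $from to the end of segment $to.
    $span = function (int $from, int $to) use ($title, $parts): string {
        $start = $parts[$from][1];
        $end = $parts[$to][1] + strlen($parts[$to][0]);
        return substr($title, $start, $end - $start);
    };

    // Grow the accepted span left, then right, while the body still contains it.
    $lo = $hi = $best;
    while ($lo > 0 && stripos($body, $span($lo - 1, $hi)) !== false) {
        $lo--;
    }
    while ($hi < count($parts) - 1 && stripos($body, $span($lo, $hi + 1)) !== false) {
        $hi++;
    }
    return $span($lo, $hi);
}

echo purifyTitle(
    'CET4 and CET6 this Saturday_Sina Education_Sina Net',
    '<h1>CET4 and CET6 this Saturday</h1><p>Reported by Sina Education.</p>'
), "\n";
// prints: CET4 and CET6 this Saturday

echo purifyTitle(
    'Part 1 - The long road home | Example Site',
    '<h1>Part 1 - The long road home</h1>'
), "\n";
// prints: Part 1 - The long road home
```

The second call shows why the expansion step matters: a real title may itself contain a separator, and growing outward from the longest segment keeps "Part 1 - " attached while still dropping the site name.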


<?php
/**
 * @Date: 2011-06-18
 * Description: extract the title of a web page based on the page content.
 *   The extracted title does not include SEO keywords.
 * e.g. for one news article, direct extraction yields
 *   "CET4 and CET6 this Saturday, 9.09 million students to sit the exam_Sina_Sina Education_Sina Net".
 * The expected result is:
 *   "CET4 and CET6 this Saturday, 9.09 million students to sit the exam".
 * Applicability: final article pages only; topic pages are not covered.
 */
class TitlePurify
{
    // Character class of the separators used to cut the title into segments.
    private $matches_preg = '[-_\s|]';

    function getTitle($contents)
    {
        $preg = '/<title[^>]*>([\w\W]*?)<\/title>/i';
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }

    function trimMeta($contents)
    {
        // First remove the <title> element and the <meta> tags, so that the
        // title cannot be matched against its own copy in the page head.
        $preg = '/<title[^>]*>([\w\W]*?)<\/title>/i';
        $contents = preg_replace($preg, '', $contents);
        $preg = '/<meta[^>]*>/i';
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }

    // Return the index of the longest item.
    function getMaxIndex($titles)
    {
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }

    // True if $needle occurs in $contents (case-insensitive, regex-safe).
    private function found($needle, $contents)
    {
        return preg_match('/' . preg_quote($needle, '/') . '/i', $contents) === 1;
    }

    function trim($title, $titles, $contents, $maxItemIndex)
    {
        // @todo: contents could be optimized here.
        // The longest item was already found in the contents, so it is the initial result.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);

        // Iterate leftward from the current index; stop at the first index
        // or as soon as a match fails.
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Capture the separators immediately to the left of tempTitle.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Search the contents for the extended candidate.
                if (!$this->found($tempTitle, $contents)) {
                    $tempTitle = $result; // roll back to the last accepted value
                    break;
                }
                $result = $tempTitle;
            } else {
                // Under normal circumstances this will not happen.
                break;
            }
            $leftIndex--;
        }

        // After the left side fails, continue from the right side.
        $rightIndex = $maxItemIndex + 1;
        while ($rightIndex < $count) {
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$quoted}{$this->matches_preg}+)/i", $title, $matches);
            if (count($matches) > 1) {
                $tempTitle = $matches[1] . $titles[$rightIndex];
                if (!$this->found($tempTitle, $contents)) {
                    $tempTitle = $result; // roll back to the last accepted value
                    break;
                }
                $result = $tempTitle;
            } else {
                // Under normal circumstances this will not happen.
                break;
            }
            $rightIndex++;
        }
        return $result;
    }

    function trimTitle($title, $contents)
    {
        $contents = $this->trimMeta($contents);
        // Cut the title on runs of separators, dropping empty segments.
        $titles = preg_split("/{$this->matches_preg}+/i", $title, -1, PREG_SPLIT_NO_EMPTY);
        if (count($titles) === 0) {
            return $title;
        }
        // Search the full text for the longest item.
        $maxItemIndex = $this->getMaxIndex($titles);
        if (!$this->found($titles[$maxItemIndex], $contents)) {
            // If the search fails, fall back to the raw title.
            return $title;
        }
        return $this->trim($title, $titles, $contents, $maxItemIndex);
    }
}

// --------------- test code ---------------
function convertEncoding($contents)
{
    preg_match('/charset=([\w|\-]+);?/i', $contents, $match);
    $charset = isset($match[1]) ? $match[1] : 'utf-8';
    $contents = mb_convert_encoding($contents, 'utf-8', $charset);
    return $contents;
}

$url = 'http://china.nba.com/news/4/2011/0617/61383331/10451.html';
$contents = file_get_contents($url);
$contents = convertEncoding($contents);
$startTime = microtime(true); // float seconds, so the difference below is meaningful
$purify = new TitlePurify();
$title = $purify->getTitle($contents);
$endTime = microtime(true);
echo "title: $title\n";
echo "cost: " . ($endTime - $startTime) . "\n";
?>


