PHP extracts web page titles and excludes irrelevant SEO keywords

Source: Internet
Author: User
Tags: preg
Scenario Description:

In the past, when we extracted the title of a web page, we took the content of the <title> tag directly. In practice this is often not what we want. For example, the title of the JavaEye article http://www.iteye.com/news/21643 is "10 years of software development taught me the 10 most important things - Non-technical - ITeye News", whereas the title we actually expect is "10 years of software development taught me the 10 most important things". A lot of irrelevant keywords are piled up after the real title (presumably for SEO), so we want to filter them out. The following approaches can be considered:


1. Look for tags such as h1. (After analyzing some Sina News pages, I feel this is not feasible; there is too much interference.)

2. Take the title and cut it at the separators (_ | -) into segments A1, A2, A3, A4, then search the full text for the longest segment, say A3. If the search succeeds, iterate to the left: prepend A2, then A1, searching the full text each time, until a search fails. After the left side fails, iterate to the right in the same way. (This is the method I use here.)
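As a rough illustration of method 2, here is a minimal sketch of the expansion step. It is not the author's class: the function name purifyTitleSketch is hypothetical, it uses plain substring search (strpos) instead of regular expressions, and it assumes the title and the page body are in the same encoding.

```php
<?php
// Hypothetical helper (not from the original article): grow outward from the
// longest title segment while the growing candidate still appears in the body.
function purifyTitleSketch(string $title, string $body): string
{
    // Split on runs of separators and keep them: even indexes hold segments,
    // odd indexes hold the separator text between them.
    $parts = preg_split('/([-_|\s]+)/', $title, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Find the longest segment.
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    if ($parts[$best] === '' || strpos($body, $parts[$best]) === false) {
        return $title; // nothing to anchor on: fall back to the raw title
    }

    $result = $parts[$best];

    // Grow to the left: previous segment + separator + current result.
    for ($i = $best - 2; $i >= 0; $i -= 2) {
        $candidate = $parts[$i] . $parts[$i + 1] . $result;
        if ($parts[$i] === '' || strpos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    // Then grow to the right: current result + separator + next segment.
    for ($i = $best + 2; $i < count($parts); $i += 2) {
        $candidate = $result . $parts[$i - 1] . $parts[$i];
        if ($parts[$i] === '' || strpos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    return $result;
}

$title = '10 years of software development|ITeye';
$body  = '<h1>10 years of software development</h1><p>Notes on development.</p>';
echo purifyTitleSketch($title, $body), "\n"; // prints "10 years of software development"
```

Keeping the separators from preg_split means the candidate is rebuilt exactly as it appears in the title, so the substring search only succeeds when the page body repeats the same wording.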


<?php
/**
 * @date: 2011-06-18
 * Description: Given the content of a page, extract the page title.
 * The extracted title does not include SEO keywords.
 * E.g.: for a Sina article, extracting <title> directly gives
 * "College English Test Band 4/6 starts this Saturday, 9.09 million candidates_Sina Education_Sina.com",
 * but the result we want is
 * "College English Test Band 4/6 starts this Saturday, 9.09 million candidates".
 * Scope: extracting the title of final article pages, not topic pages.
 */
class TitlePurify
{
    // Character class of the separators used to cut the title into segments.
    private $matches_preg = '[-_\s|]';

    function getTitle($contents)
    {
        $preg = '/<title[^>]*>(.*?)<\/title>/is';
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }

    function trimMeta($contents)
    {
        // First remove the <title>...</title> and <meta> content,
        // so the title cannot match itself in the full text.
        $preg = '/<title[^>]*>(.*?)<\/title>/is';
        $contents = preg_replace($preg, '', $contents);
        $preg = '/<meta[^>]*>/i';
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }

    // Case-insensitive search for a literal string in the page content.
    private function found($needle, $contents)
    {
        return preg_match('/' . preg_quote($needle, '/') . '/i', $contents) === 1;
    }

    // Return the index of the longest item.
    function getMaxIndex($titles)
    {
        $maxItemIndex = 0;
        $maxLength = 0;
        foreach ($titles as $index => $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $index;
            }
        }
        return $maxItemIndex;
    }

    function trim($title, $titles, $contents, $maxItemIndex)
    {
        // TODO: the search in $contents could be optimized.
        // The longest item was already found, so it is the initial result.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);

        // Iterate to the left from the current index (until the first item
        // is reached or a match fails).
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Match (separators + tempTitle) in the title.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                // Prepend the item on the left and search the full text again.
                $tempTitle = $titles[$leftIndex] . $matches[1];
                if (!$this->found($tempTitle, $contents)) {
                    break; // $result keeps the last successful value
                }
                $result = $tempTitle;
            } else {
                // Should not happen in the normal case.
                break;
            }
            $leftIndex--;
        }

        // When the left side fails, roll back to the last successful value
        // and start from the right.
        $tempTitle = $result;
        $rightIndex = $maxItemIndex + 1;
        while ($rightIndex < $count) {
            // Match (tempTitle + separators) in the title.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$quoted}{$this->matches_preg}+)/i", $title, $matches);
            if (count($matches) > 1) {
                // Append the item on the right and search the full text again.
                $tempTitle = $matches[1] . $titles[$rightIndex];
                if (!$this->found($tempTitle, $contents)) {
                    break;
                }
                $result = $tempTitle;
            } else {
                // Should not happen in the normal case.
                break;
            }
            $rightIndex++;
        }
        return $result;
    }

    function trimTitle($title, $contents)
    {
        $contents = $this->trimMeta($contents);
        // Cut the title into segments according to the separator rule.
        $titles = preg_split("/{$this->matches_preg}+/", $title, -1, PREG_SPLIT_NO_EMPTY);
        if (count($titles) == 0) {
            return $title;
        }
        // Search the full text for the longest item.
        $maxItemIndex = $this->getMaxIndex($titles);
        // If the lookup fails, return the raw title.
        if (!$this->found($titles[$maxItemIndex], $contents)) {
            return $title;
        }
        return $this->trim($title, $titles, $contents, $maxItemIndex);
    }
}

//------------- Test Code -------------
function convertEncoding($contents)
{
    preg_match('/charset=["\']?([\w\-]+)/i', $contents, $match);
    $charset = isset($match[1]) ? $match[1] : 'UTF-8';
    $contents = mb_convert_encoding($contents, 'UTF-8', $charset);
    return $contents;
}

$url = 'http://china.nba.com/news/4/2011/0617/61383331/10451.html';
$contents = file_get_contents($url);
$contents = convertEncoding($contents);
$startTime = microtime(true);
$purify = new TitlePurify();
$title = $purify->getTitle($contents);
$endTime = microtime(true);
echo "title: $title\n";
echo "Cost: " . ($endTime - $startTime) . "\n";
?>


