Extracting page titles in PHP and excluding irrelevant SEO keywords

Source: Internet
Author: User
Tags: preg
Scenario Description:

In the past, when we extracted the title of a web page, we simply took the content of the <title> tag. In practice this is not enough. Take the Javaeye article http://www.iteye.com/news/21643 as an example: its <title> is "10 years of software development taught me the most important 10 things - non-technical - iteye information", but the title we actually want to quote is "10 years of software development taught me the most important 10 things". The real title is followed by a pile of irrelevant keywords (presumably added for SEO), and we want to filter them out. The following approaches can be considered:


1. Look for heading tags such as h1. (After analyzing some Sina News pages, I do not think this is feasible; there is too much interference.)

2. Work from the title back to the full text. Split the title on the separators (_ | -) into segments A1, A2, A3, A4, then take the longest segment (say A3) and search for it in the full text. If the lookup succeeds, extend the candidate to the left (A2, then A1) and search again, until a lookup fails. Once the left side fails, extend to the right in the same way. (This is the method I use here.)
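
The splitting step of method 2 can be sketched roughly as follows. This is a minimal illustration, not the author's code: `extractRawTitle`, `splitTitle`, and `longestIndex` are hypothetical helper names, the separator set (`_`, `|`, `-`) is taken from the description above, the sample HTML is invented, and PHP 7.4 or later is assumed.

```php
<?php
// Hypothetical helpers; a sketch of the "split the title" step only.

/** Return the raw <title> text, or null when the page has none. */
function extractRawTitle(string $html): ?string
{
    // "s" lets "." span newlines; "i" ignores the case of the tag name.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m) !== 1) {
        return null;
    }
    return trim($m[1]);
}

/** Split a title on runs of "_", "|" or "-" into trimmed, non-empty segments. */
function splitTitle(string $title): array
{
    $parts = array_map('trim', preg_split('/[_|\-]+/', $title));
    return array_values(array_filter($parts, fn($p) => $p !== ''));
}

/** Index of the longest segment (the first one wins on ties). */
function longestIndex(array $segments): int
{
    $best = 0;
    foreach ($segments as $i => $segment) {
        if (strlen($segment) > strlen($segments[$best])) {
            $best = $i;
        }
    }
    return $best;
}

// Invented sample page.
$html = '<html><head><title>Ten lessons from ten years of coding _ Careers | Example News</title></head></html>';
$segments = splitTitle(extractRawTitle($html) ?? '');
echo $segments[longestIndex($segments)], "\n"; // Ten lessons from ten years of coding
```

Note that splitting on `-` also breaks hyphenated words inside a real headline, which is one reason method 2 re-joins neighbouring segments by checking them against the page text rather than trusting the longest segment alone.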


PHP code
/**
 * @author PQCC
 * @date: 2011-06-18
 * Description: Given the content of a page, extract the page title. The extracted title does not include SEO keywords.
 * E.g.: for a news article, extracting <title> directly gives "College English Test levels 4 and 6 open this Saturday with 9.09 million candidates _ Sina Education _ Sina Net",
 * but the result we want is "College English Test levels 4 and 6 open this Saturday with 9.09 million candidates".
 * Scope of application: extracting the title of final article pages, not including topic pages.
 */
class titlepurify {

    // Character class of the separators found between title segments.
    private $matches_preg = '[-_\s|-]';

    function GetTitle($contents) { /*{{{*/
        // [\w|\s|\W] matches any character, including newlines.
        $preg = '/<title[^>]*>([\w|\s|\W]*?)<\/title>/i';
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return 'Title extraction failed';
        }
        $title = $matches[1];
        return $this->trimtitle($title, $contents);
    } /*}}}*/

    function Trimmeta($contents) { /*{{{*/
        // First remove the <title> content and the <meta> tags.
        $preg = '/<title[^>]*>([\w|\s|\W]*?)<\/title>/i';
        $contents = preg_replace($preg, '', $contents);
        $preg = '/<meta[^>]*>/i';
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    } /*}}}*/


    // Get the index of the longest item.
    function Getmaxindex($titles) { /*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    } /*}}}*/

    function Trim($title, $titles, $contents, $maxItemIndex) { /*{{{*/
        // @todo: optimize $contents here.
        // If the lookup succeeds, $result = $tempTitle.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);
        // Iterate to the left from the current index (until the first item is reached or a match fails).
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // $tempTitle plus the item to its left.
            // preg_quote() keeps regex metacharacters in the title from breaking the pattern.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                // $temp keeps the previous value so it can be rolled back if the match fails.
                $temp = $tempTitle;
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Try to match the extended $tempTitle against the page content.
                preg_match('/' . preg_quote($tempTitle, '/') . '/i', $contents, $matches);
                // If the lookup fails, roll back and stop.
                if (count($matches) < 1) {
                    $tempTitle = $temp;
                    break;
                } else {
                    $result = $tempTitle;
                }
            } else { // Normally, this condition does not occur.
                break;
            }
            $leftIndex--;
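
The listing above breaks off inside the left-hand loop. As a rough, standalone sketch of the whole expansion described in method 2 (grow left until the page text no longer contains the candidate, then grow right), something like the following could work. It is not the author's code: `purifyTitle` and its parameters are hypothetical, it uses plain `stripos()` substring search instead of the regex matching in the listing, it assumes `$body` has already had its `<title>` and `<meta>` tags removed, and it returns the title unchanged when even the longest segment is not found (an assumption, since the original behaviour in that case is not shown). PHP 7.4 or later is assumed.

```php
<?php
/**
 * Hypothetical sketch: start from the longest title segment and extend it
 * left, then right, while the page body still contains the candidate.
 */
function purifyTitle(string $title, string $body): string
{
    $title = trim($title);
    // Each entry is [segmentText, byteOffsetInTitle].
    $parts = preg_split('/\s*[_|\-]+\s*/', $title, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    if (!$parts) {
        return $title;
    }

    // Find the longest segment (the first one wins on ties).
    $start = 0;
    foreach ($parts as $i => $part) {
        if (strlen($part[0]) > strlen($parts[$start][0])) {
            $start = $i;
        }
    }
    $lo = $hi = $start;

    // Substring of the title covering segments $from..$to, separators included.
    $slice = function (int $from, int $to) use ($title, $parts): string {
        $begin = $parts[$from][1];
        $end = $parts[$to][1] + strlen($parts[$to][0]);
        return substr($title, $begin, $end - $begin);
    };
    $inBody = fn(string $s): bool => stripos($body, $s) !== false;

    if (!$inBody($slice($lo, $hi))) {
        return $title; // Assumption: give up and keep the raw title.
    }
    // Extend to the left while the longer candidate still occurs in the body.
    while ($lo > 0 && $inBody($slice($lo - 1, $hi))) {
        $lo--;
    }
    // Then extend to the right in the same way.
    while ($hi < count($parts) - 1 && $inBody($slice($lo, $hi + 1))) {
        $hi++;
    }
    return $slice($lo, $hi);
}

// Invented sample: the headline itself contains a hyphen.
$title = 'Site Name | Open-source tools for title cleanup | News';
$body  = '<h1>Open-source tools for title cleanup</h1><p>Article text.</p>';
echo purifyTitle($title, $body), "\n"; // Open-source tools for title cleanup
```

Slicing the original title by byte offsets keeps the separators between re-joined segments exactly as they appeared, so a hyphen that belongs to the headline survives while the site name and section suffixes are dropped.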
