Extracting web page titles with PHP and removing irrelevant SEO keywords

Source: Internet
Author: User
Tags: preg

Scenario Description:

In the past, when we extracted a web page's title, we simply took the content between the <title> tags. In practice this is not enough. Take the JavaEye article http://www.iteye.com/news/21643 as an example: its <title> content is "10 years of software development has taught me the most important 10 things-non-technical-ITeye News", but the title we actually want to quote is "10 years of software development has taught me the most important 10 things". A lot of irrelevant keywords are piled up after the real title (presumably for SEO), and we want to filter them out. The following approaches are possible:


1. Look for tags such as h1. (After analyzing some sites such as Sina News, I do not think this is feasible; there is too much interference.)

2. Remove the <title> element from the full page text, then split the title content on the separators (_ | -) into segments a1, a2, a3, and a4. Search the full text for the longest segment, say a3. If it is found, extend the candidate to the left by adding a2 and then a1, searching again each time, until a search fails. Once the left side fails, continue extending to the right in the same way. (This is the method used here.)
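
The second approach can be sketched compactly as follows. This is a minimal illustration rather than the author's class: the function names `stripTitleAndMeta()` and `purifyTitle()`, the separator set, and the sample page are all assumptions made for the example. It uses plain `stripos()` for the search instead of regular expressions, and it measures segment length in bytes.

```php
<?php
// Sketch only: the function names below are hypothetical, not the article's class.

// Drop <title> and <meta> so the title text cannot match itself in the page head.
function stripTitleAndMeta(string $html): string
{
    return preg_replace(['/<title[^>]*>.*?<\/title>/is', '/<meta[^>]*>/i'], '', $html);
}

function purifyTitle(string $title, string $html): string
{
    $title = trim($title);
    $body  = stripTitleAndMeta($html);

    // Split on runs of "-", "_" or "|", remembering each segment's byte offset.
    $parts = preg_split('/\s*[-_|]+\s*/', $title, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    if (!$parts) {
        return $title;
    }

    // Index of the longest segment (byte length, as with strlen()).
    $best = 0;
    foreach ($parts as $i => $part) {
        if (strlen($part[0]) > strlen($parts[$best][0])) {
            $best = $i;
        }
    }

    // Slice of the original title covering segments $from..$to, separators included.
    $span = function (int $from, int $to) use ($title, $parts): string {
        $start = $parts[$from][1];
        $end   = $parts[$to][1] + strlen($parts[$to][0]);
        return substr($title, $start, $end - $start);
    };
    $found = function (string $text) use ($body): bool {
        return stripos($body, $text) !== false;
    };

    if (!$found($span($best, $best))) {
        return $title; // nothing to anchor on: keep the full title
    }
    $lo = $hi = $best;
    while ($lo > 0 && $found($span($lo - 1, $hi))) {
        $lo--; // grow to the left first
    }
    while ($hi < count($parts) - 1 && $found($span($lo, $hi + 1))) {
        $hi++; // then grow to the right
    }
    return $span($lo, $hi);
}

// Made-up page for illustration.
$html = '<html><head><title>Part 2 - Ten Lessons Learned - ITeye News</title>'
      . '<meta name="keywords" content="ITeye News"></head>'
      . '<body><h1>Part 2 - Ten Lessons Learned</h1></body></html>';
echo purifyTitle('Part 2 - Ten Lessons Learned - ITeye News', $html), "\n";
// prints "Part 2 - Ten Lessons Learned"
```

Stripping <title> and <meta> first matters: otherwise the full title would always be found in the page head and nothing would be trimmed. Note that splitting on "-" also breaks hyphenated words into separate segments, which is harmless here because the slices are taken from the original title with its separators intact.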


PHP code
<?php
/**
 * @author pqcc <struts.ec@mgail.com>
 * @date 2011-06-18
 * Description: extracts the title of a web page from the page content. The extracted title excludes SEO keywords.
 * E.g. the text taken directly from <title> for a news item is: "9.09 million students in CET4 and CET6 this Saturday",
 * but the result we want is: "9.09 million students of CET4 and CET6 will be admitted this Saturday".
 * Applicability: final article pages only, not topic pages.
 */

class TitlePurify {

    // Character class of the separators that split a title into segments.
    private $matches_preg = '[-_\s|]';

    function getTitle($contents) {/*{{{*/
        // Capture everything between <title> and </title>, including newlines.
        $preg = "/<title[^>]*>([\w|\W]*?)<\/title>/i";
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }/*}}}*/

    function trimMeta($contents) {/*{{{*/
        // First remove the <title> content and the <meta> tags,
        // so that the title cannot be found in the page head.
        $preg = "/<title[^>]*>([\w|\W]*?)<\/title>/i";
        $contents = preg_replace($preg, '', $contents);
        $preg = "/<meta[^>]*>/i";
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/


    // Return the index of the longest segment (length in bytes).
    function getMaxIndex($titles) {/*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/

    function trim($title, $titles, $contents, $maxItemIndex) {/*{{{*/
        // @todo: the search through $contents could be optimized.
        // Whenever a search succeeds, result = tempTitle.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);
        // Iterate leftwards from the current index; stop when the first segment is reached or a match fails.
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Capture the separator(s) in front of tempTitle so one more segment can be prepended.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                // $temp is used to roll back if the longer candidate is not found.
                $temp = $tempTitle;
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Search the page for the longer candidate.
                preg_match('/' . preg_quote($tempTitle, '/') . '/i', $contents, $matches);
                // If the search fails, restore the previous candidate and stop.
                if (count($matches) < 1) {
                    $tempTitle = $temp;
                    break;
                } else {
                    $result = $tempTitle;
                }
            } else { // normally, this will not happen.
                break;
            }
            $leftIndex--;
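
Both searches in trim() build a regular expression out of title text. Title text can contain characters that are special in a pattern, such as "+", "(", "?" or the "/" delimiter itself, so escaping it before it goes into the pattern is a sensible precaution. A minimal sketch of the idea, assuming a hypothetical helper named `containsLiteral()` and made-up sample strings:

```php
<?php
// Hypothetical helper: case-insensitive literal search that is safe for any title text.
function containsLiteral(string $haystack, string $needle): bool
{
    // preg_quote() escapes regex metacharacters; passing '/' also escapes the delimiter.
    $pattern = '/' . preg_quote($needle, '/') . '/i';
    return preg_match($pattern, $haystack) === 1;
}

$segment = 'C++ tips (part 1/2)';            // made-up title segment
$page    = '<h1>c++ Tips (Part 1/2)</h1>';   // made-up page fragment

var_dump(containsLiteral($page, $segment));  // bool(true)
```

When no pattern features are needed at all, `stripos($haystack, $needle) !== false` does the same literal, case-insensitive check without a regular expression.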
