Extracting webpage titles with PHP and removing irrelevant SEO keywords

Last Update:2013-11-22 Source: Internet

Author: User

Tags preg


Scenario Description:

In the past, when we extracted a webpage title, we simply took the content between the <title> and </title> tags. In practice this is often not what we want. For example, for the JavaEye article http://www.iteye.com/news/21643, the <title> content is "10 years of software development has taught me the most important 10 things-non-technical-ITeye News", but the title we actually want to cite is "10 years of software development has taught me the most important 10 things". Many irrelevant keywords are piled up after the real title (presumably for SEO), so we want to filter them out. Two approaches come to mind:

1. Look for heading tags such as h1. (After analyzing some websites such as Sina News, I think this is not feasible; there is too much interference.)

2. Remove the <title> from the full text, then split the title on the separators (_, | and -) into segments a1, a2, a3 and a4. Starting from the longest segment, say a3, search for it in the full text. If it is found, extend the phrase to the left with a2 and then a1, searching again each time, until a search fails. Once the left side fails, continue extending to the right in the same way. (This is the method used here.)
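
The splitting and longest-segment steps of method 2 can be sketched as follows. This is a minimal illustration, not the class from this article: `splitTitle` and `longestSegmentIndex` are hypothetical helper names, the sample title is made up, and the separator pattern assumes that `_`, `|` and `-` (optionally surrounded by whitespace) are the only delimiters.

```php
<?php
// Hypothetical helper: split a raw <title> string on "_", "|" or "-".
// PREG_SPLIT_DELIM_CAPTURE keeps the separators in the result, so title
// segments sit at even indexes and the separators between them at odd ones.
function splitTitle(string $title): array
{
    return preg_split('/(\s*[-_|]+\s*)/', $title, -1, PREG_SPLIT_DELIM_CAPTURE);
}

// Hypothetical helper: index of the longest segment (even indexes only).
function longestSegmentIndex(array $parts): int
{
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    return $best;
}

// Made-up sample title with two SEO suffixes.
$parts = splitTitle('Ten lessons from ten years of coding - News - Example Site');
$start = longestSegmentIndex($parts);
echo $parts[$start], "\n";
```

Keeping the separators in the array means a later step can glue neighbouring segments back together exactly as they appeared in the title. Note that a hyphen inside a word is also treated as a separator under this assumption.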

PHP code
<?php
/**
 * @author pqcc <struts.ec@mgail.com>
 * @date 2011-06-18
 * Description: extracts the title of a webpage from the page content. The extracted title does not include SEO keywords.
 * E.g. the title of a news page taken directly from <title> is: "9.09 million students in CET4 and CET6 this Saturday",
 * but the result we want is: "9.09 million students of CET4 and CET6 will be admitted this Saturday".
 * Applicability: extracting the title of a final article page; topic pages are not covered.
 */

class TitlePurify {

    // Character class of the separators that may appear between title segments.
    private $matches_preg = '[-_\s|-]';

    function getTitle($contents) {/*{{{*/
        // Capture everything between <title ...> and </title>, including line breaks.
        $preg = '/<title[^>]*>([\w|\W]*?)<\/title>/i';
        preg_match($preg, $contents, $matches);
        if (count($matches) <= 1) {
            return "title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }/*}}}*/

    function trimMeta($contents) {/*{{{*/
        // First remove the <title> content and the <meta> tags.
        $preg = '/<title[^>]*>([\w|\W]*?)<\/title>/i';
        $contents = preg_replace($preg, '', $contents);
        $preg = '/<meta[^>]*>/i';
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/

    // Return the index of the longest item.
    function getMaxIndex($titles) {/*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach ($titles as $item) {
            if (strlen($item) > $maxLength) {
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/

    function trim($title, $titles, $contents, $maxItemIndex) {/*{{{*/
        // @todo: the search over $contents can be optimized here.
        // If the search succeeds, $result = $tempTitle.
        $tempTitle = $titles[$maxItemIndex];
        $result = $tempTitle;
        $count = count($titles);
        // Iterate from the current index to the left; stop when the first index is reached or a match fails.
        $leftIndex = $maxItemIndex - 1;
        while ($leftIndex >= 0) {
            // Match $tempTitle together with the separator(s) on its left.
            // preg_quote() stops characters such as "?" or "/" in the title from being read as regex syntax.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/i", $title, $matches);
            if (count($matches) > 1) {
                // $temp keeps the previous phrase so it can be restored if the search fails.
                $temp = $tempTitle;
                $tempTitle = $titles[$leftIndex] . $matches[1];
                // Search for the extended $tempTitle in the page content.
                preg_match('/' . preg_quote($tempTitle, '/') . '/i', $contents, $matches);
                // If the search fails, roll back and stop.
                if (count($matches) < 1) {
                    $tempTitle = $temp;
                    break;
                } else {
                    $result = $tempTitle;
                }
            } else { // Normally this will not happen.
                break;
            }
            $leftIndex--;
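
The listing above stops partway through the left-hand loop, so the right-hand pass is not shown. As a rough illustration of the left-then-right extension described in method 2, here is a self-contained sketch. It is not the original author's code: `extendTitle` is a hypothetical function, it uses a plain case-insensitive substring search (`stripos`) instead of the regex matching used in the listing, and it assumes `$parts` alternates title segments (even indexes) with the separators between them (odd indexes).

```php
<?php
// Hypothetical helper: grow the starting segment leftwards, then rightwards,
// for as long as the grown phrase still occurs in the page body.
// $body is assumed to be the page text with <title> and <meta> removed.
function extendTitle(array $parts, int $start, string $body): string
{
    $result = $parts[$start];

    // Leftwards: prepend "segment + separator" while the phrase is found.
    for ($i = $start - 2; $i >= 0; $i -= 2) {
        $candidate = $parts[$i] . $parts[$i + 1] . $result;
        if (stripos($body, $candidate) === false) {
            break; // keep the last phrase that was found
        }
        $result = $candidate;
    }

    // Rightwards: append "separator + segment" under the same rule.
    for ($i = $start + 2; $i < count($parts); $i += 2) {
        $candidate = $result . $parts[$i - 1] . $parts[$i];
        if (stripos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    return $result;
}

// Made-up data: the real title itself contains a separator.
$parts = ['PHP tips', ' - ', 'regex pitfalls in practice', ' - ', 'Example Site'];
$body  = '<h1>PHP tips - regex pitfalls in practice</h1><p>Text</p>';
echo extendTitle($parts, 2, $body), "\n";
```

Because a candidate is only accepted when it appears in the body, a title that legitimately contains a separator is kept whole, while a site-name suffix that never appears next to the headline is dropped.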
