Using PHP to obtain a website's Alexa ranking


Most sites that report website rankings currently take their data from Alexa. However, Alexa's ranking data cannot be read simply and directly, because Alexa scrambles it with interference code, which makes programmatic access difficult and cumbersome.

In theory, though, any text that is visible on a page can be captured: fetch the page source, then parse it to extract the specific data. Only information rendered as images is out of reach, since image recognition is still a cutting-edge technology.

First, we examine how http://www.alexa.com reports the ranking of the computer learning network, http://www.why100000.com. On the http://www.alexa.com page, enter the URL http://www.why100000.com and click the "get traffic detail" button. After the page loads, view its source and search for the string "Traffic Rank" (searching for "why100000.com has a Traffic Rank" may fail, because the source contains multiple spaces between the words and the phrase will not match). Near the match you will find the ranking data in a string similar to the following:

<!-- Did you know? Alexa offers this data programmatically. Visit http://aws.amazon.com/awis for more information about the Alexa Web Information Service. --><span class="ca53">61</span>5<span class="c34d">57</span>,<span class="c8a7">78</span><span class="c1db">35</span>4</span><!-- google_ad_section_end(name=default) -->

We say "similar" because the string is generally different each time. The comment points out that Alexa offers this data programmatically through the Alexa Web Information Service; on the web page itself, the real data has been scrambled by a program. Although the page shows the number "557,354" to the eye, selecting, copying, and pasting it from the page yields the string "61557,78354" (which differs after almost every new query). The same "61557,78354" can be read from the markup above by ignoring the tags.
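The effect can be reproduced with a few lines of PHP. This is only a sketch: the fragment is modelled on the markup quoted above (comments and the stray closing tag left out), and the variable names are made up for the illustration. It shows that simply stripping the tags returns the scrambled string rather than the ranking.

```php
<?php
// Sample fragment modelled on the markup quoted above (class names copied from it).
$fragment = '<span class="ca53">61</span>5<span class="c34d">57</span>,'
          . '<span class="c8a7">78</span><span class="c1db">35</span>4';

// strip_tags() removes the markup but keeps the text of every span,
// including the spans that a browser would hide through CSS.
$naive = strip_tags($fragment);

echo $naive, "\n"; // prints 61557,78354
```

A naive scraper therefore needs to know which spans are hidden before it can recover the number.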

Look closely, however, and you will see that the digits of the true ranking, "557354", are contained in the string "61557,78354". In fact they are always contained in the changing string; run the query a few times to confirm it. Some true digits are not wrapped in any tag. The interference characters, such as the 61 in <span class="XXXX">61</span>, are wrapped in <span class="XXXX">......</span>, where the class refers to a style "XXXX" that hides them on the page. Sometimes, however, true digits are also wrapped in <span class="YYYY">......</span>, where the style "YYYY" does not hide its content. This is the idea behind the interference algorithm: digits that are invisible on the page are inserted into the real data, which does not affect the human eye but does interfere with program analysis. Wrapping some of the visible characters in <span>...</span> tags makes the analysis more complex still.

Since the browser displays the correct information, the secret must lie in the page source, most likely in the JavaScript and CSS files it references. Following this idea, we downloaded the JavaScript and CSS files referenced by the page source and analyzed them.

The secret turns out to be in the style sheet http://client.alexa.com/common/css/scramble.css. Open that file and you will see 189 CSS class definitions like the following:

.c11e {
  display: none
}
.c125 {
  display: none
}
.c12d {
  display: none
}
......
......
.cfe9 {
  display: none
}

The "XXXX" in the interfering <span class="XXXX">......</span> is defined here. The style "YYYY" in <span class="YYYY">......</span>, which wraps real characters, may be randomly generated, but it never matches any of the 189 definitions.
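Collecting the hidden class names from the style sheet could look like the sketch below. It assumes PHP 7 or later and a style sheet in which every rule has the simple form shown above; parseHiddenClasses() and $sampleCss are hypothetical names, not part of the original program.

```php
<?php
/**
 * Collect the class names whose rule is "display: none".
 * parseHiddenClasses() is a hypothetical helper, not part of the original listing.
 */
function parseHiddenClasses(string $css): array
{
    // Match ".name { display: none }" with flexible whitespace and an optional semicolon.
    preg_match_all('/\.([A-Za-z0-9_-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}/i', $css, $m);
    // Lower-case the names so that later comparisons are case-insensitive.
    return array_map('strtolower', $m[1]);
}

// Made-up sample: two hiding rules in the style of scramble.css plus one unrelated rule.
$sampleCss = ".c11e {\n  display: none\n}\n.c125 {\n  display: none\n}\n.visible { color: red }\n";
print_r(parseHiddenClasses($sampleCss));
```

Any class that appears in the page but not in this list can then be treated as visible.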

The truth is clear. We can write a program to obtain the real ranking data.

One option is a desktop program, for example a Windows application written in Delphi or C#. It needs a web browser component to fetch the source of the page
http://www.alexa.com/data/details/traffic_details/why100000.com
and then parses it and saves the result to a database.

Another option is a JavaScript script that reads the data from the web page, which requires familiarity with JavaScript. However, the page has to be opened manually before the data can be saved to a back-end database.

A third option is server-side programming. The server-side language must be able to fetch a web page's source in the background; Java, ASP.NET, PHP, and other languages can do this.
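In PHP, fetching a page source on the server might be sketched as follows. This assumes PHP 7.1 or later and that allow_url_fopen is enabled; fetchSource(), the timeout, and the User-Agent string are illustrative choices and do not come from the original program.

```php
<?php
/**
 * Fetch a page's source, or return null on failure.
 * fetchSource() is a hypothetical helper; the timeout and User-Agent values are arbitrary.
 */
function fetchSource(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" options apply to http:// and https:// URLs; they are ignored for local files.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'rank-fetch-example/0.1',
        ],
    ]);
    // Suppress the warning and signal failure through the return value instead.
    $source = @file_get_contents($url, false, $context);
    return $source === false ? null : $source;
}

// Usage (not executed here, since it needs network access):
// $html = fetchSource('http://www.alexa.com/data/details/traffic_details/why100000.com');
```

Returning null on failure lets the caller distinguish a failed request from an empty page.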

All of these approaches call for strong string analysis and processing in the language or script used. Regular expressions, which popular languages support, are well suited to the task.

I chose PHP, a popular web programming language.

Programming process description:

Examine the page source and find the comment:
<!-- Did you know? Alexa offers this data programmatically. Visit http://aws.amazon.com/awis for more information about the Alexa Web Information Service. -->
Its first occurrence in the source marks the start of the ranking information, which ends at
<!-- google_ad_section_end(name=default) -->
This makes programming easier: we first extract the text between these two markers to narrow the scope, which gives the fine-grained processing a small block to work on. The getbody() function does this. The marker strings actually used are "Information Service. -->" and "<!-- google_ad_section_end(name=default) -->", and the function extracts the string between them.

getalexarank() is the main function. Helper functions collect the class names "ZZZZ" from the <span class="ZZZZ">......</span> tags into an array. By traversing the array, the program decides whether each number (string) in the core data block should be deleted or kept, and what remains is the real data. The program relies on PHP's regular expressions and string functions, which makes it much shorter than equivalent code in ASP and other scripting languages.
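The delete-or-keep decision can also be made in a single pass with preg_replace_callback(), which is a different technique from the array traversal described above. The sketch below assumes PHP 7 or later; descramble() is a hypothetical name, and the list of hidden classes is assumed for the example (in practice it would come from scramble.css).

```php
<?php
/**
 * Remove hidden spans, unwrap visible ones, and return the digits that remain.
 * descramble() and $hiddenClasses are hypothetical names used for illustration.
 */
function descramble(string $fragment, array $hiddenClasses): string
{
    $clean = preg_replace_callback(
        '/<span\s+class\s*=\s*"([^"]*)"\s*>(.*?)<\/span>/is',
        function (array $m) use ($hiddenClasses): string {
            // Hidden class: drop the span and its content. Otherwise keep the content.
            return in_array(strtolower($m[1]), $hiddenClasses, true) ? '' : $m[2];
        },
        $fragment
    );
    // Drop leftover tags and comments, then thousands separators and whitespace.
    $clean = strip_tags($clean);
    return preg_replace('/[,\s]+/', '', $clean);
}

$fragment = '<span class="ca53">61</span>5<span class="c34d">57</span>,'
          . '<span class="c8a7">78</span><span class="c1db">35</span>4</span>';
// Assumed for the example: ca53 and c8a7 are defined in scramble.css, the others are not.
echo descramble($fragment, ['ca53', 'c8a7']), "\n"; // prints 557354
```

The pattern assumes that the scrambling spans are not nested, which holds for the fragments quoted in this article.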

The Alexa query results contain plenty of other data that can be extracted with the same approach. For example, to extract the percentage of global Internet users who visit a site, or its average over a week, look at the following part of the page source:

Percent of global Internet users who visit this site
<td><!-- Did you know? Alexa offers this data programmatically.
Visit http://aws.amazon.com/awis for more information about the Alexa Web Information Service. -->
<span class="cded">10</span>0<span class="cd83">.0</span>00<span class="cd81">18</span>%
</td>

The method is the same: find the markers of the section, extract the <span> portion that holds the details, and then analyze it to obtain the data.

This method can also be used to build web page collection programs that gather the many articles on the Internet, such as news and blog posts.

However, if Alexa changes its algorithm, this method will stop working.

Attachment:

index.php file:

<form name="alexaform" method="post" action="get_alexa.php">
URL: <input type="text" name="url" value="http://www.why100000.com" size="40">
&nbsp;<input type="submit" value="">
</form>

get_alexa.php file:

<?php
// PHP version requirement: PHP 4.4.7
// Original work. Please keep this notice:
// Author: Zhang Qing (mesh), Xi'an, Shaanxi
// Websites:
// Computer learning network: http://www.why100000.com
// Blog: http://blog.why100000.com
// Q&A: http://ask.why100000.com
// Demo address: http://www.why100000.com/test/alexa/alexa.php

// Read the submitted URL; default to an empty string if the form was not posted
$url = isset($_POST['url']) ? $_POST['url'] : '';

if ($url != "")
{
  // Escape the user-supplied URL before echoing it back into the page
  echo "Your website " . htmlspecialchars($url) . " is ranked in Alexa: <br>";
  $rank = getalexarank($url);
  echo '[' . $rank . ']';
}

function getalexarank($weburl)
{
  $weburl = strtolower($weburl);
  $tempurl = getdomainurl($weburl);
  // Read the style sheet from http://client.alexa.com/common/css/scramble.css
  $stralexacss = file_get_contents('http://client.alexa.com/common/css/scramble.css');
  $alexarankqueryurl = 'http://www.alexa.com/data/details/traffic_details/' . $tempurl;
  $stralexacontent = file_get_contents($alexarankqueryurl);
  // Narrow the page down to the block that holds the ranking
  $rankcontent = getbody($stralexacontent, 'information service. -->', '<!-- google_ad_section_end(name=default) -->');
  // Debug output: the extracted block
  echo '<xmp>';
  echo $rankcontent;
  echo '</xmp>';
  // Collect the class name of every <span> in the block
  $arrspanclass = getarray($rankcontent, '<span class="', '">');
  echo '<xmp>';
  print_r($arrspanclass);
  echo '</xmp>';
  foreach ($arrspanclass as $css)
  {
    // strpos() returns 0 for a match at the start, so compare with false
    if (strpos($stralexacss, '.' . $css) !== false)
    {
      // Class is defined in scramble.css: hidden, remove the span and its content
      echo $css . '(h)';
      $rankcontent = scripthtml($rankcontent, "span", 2, $css);
    }
    else
    {
      // Class is not defined: visible, remove only the tag
      echo $css . '(s)';
      $rankcontent = scripthtml($rankcontent, "span", 1, $css);
    }
    echo '<xmp>';
    echo $rankcontent;
    echo '</xmp>';
  }
  // Remove the leftover closing tags and the thousands separators
  $rankcontent = str_replace('</span>', '', $rankcontent);
  $rankcontent = str_replace(',', '', $rankcontent);
  return $rankcontent;
}

// Return the (lower-cased) text between $startstr and $endstr, or '' if a marker is missing
function getbody($contentstr, $startstr, $endstr)
{
  $contentstr = strtolower($contentstr);
  $startstr = strtolower($startstr);
  $endstr = strtolower($endstr);
  $startpos = strpos($contentstr, $startstr);
  if ($startpos === false) return '';
  $startpos += strlen($startstr);
  // Look for the end marker only after the start marker
  $endpos = strpos($contentstr, $endstr, $startpos);
  if ($endpos === false) return '';
  return substr($contentstr, $startpos, $endpos - $startpos);
}

// Because the code is too long, the rest is omitted. To obtain it, send an email to zhangking2008@gmail.com
// with the subject "PHP code for Alexa ranking"
//......
?>
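The listing leaves out getdomainurl(), getarray(), and scripthtml(). The stand-ins below are guesses, not the author's code. Their behavior is inferred only from how the listing calls them: mode 2 is used for hidden classes, mode 1 for visible ones, and the leftover </span> tags are stripped afterwards.

```php
<?php
// Hypothetical stand-ins for the omitted helpers; the author's versions may differ.

// Reduce a URL to its host name without a leading "www.".
function getdomainurl($weburl)
{
    $host = preg_replace('~^[a-z][a-z0-9+.-]*://~i', '', trim($weburl)); // drop the scheme
    $host = preg_replace('~[/?#].*$~', '', $host);                       // drop path, query, fragment
    return preg_replace('/^www\./i', '', $host);
}

// Return every substring found between $startstr and $endstr.
function getarray($contentstr, $startstr, $endstr)
{
    $pattern = '/' . preg_quote($startstr, '/') . '(.*?)' . preg_quote($endstr, '/') . '/s';
    preg_match_all($pattern, $contentstr, $matches);
    return $matches[1];
}

// Mode 1: remove only the opening <tag class="$css">, keeping its content.
// Mode 2: remove the whole element together with its content.
function scripthtml($contentstr, $tag, $mode, $css)
{
    $open = '<' . preg_quote($tag, '/') . '\s+class\s*=\s*"' . preg_quote($css, '/') . '"\s*>';
    if ($mode == 2) {
        return preg_replace('/' . $open . '.*?<\/' . preg_quote($tag, '/') . '>/is', '', $contentstr);
    }
    return preg_replace('/' . $open . '/i', '', $contentstr);
}
```

With these three functions in place, the listing has every function it calls, although it still depends on the Alexa pages and the style sheet being served in the form described in this article.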

Zhang Qing (mesh)
Mesh horizon: http://blog.why100000.com
2008-6-9
