An old hand shows you how to get a huge amount of web request data

Source: Internet
Author: User

There is a remarkable site, the Internet Archive (archive.org), and this time we introduce another equally remarkable one, the HTTP Archive (httparchive.org). So what is the relationship between the two? The HTTP Archive site explains it as follows:

Successful societies and institutions recognize the need to record their history - this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web's digitized content.
In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

In other words, archive.org provides historical snapshots of web pages, while httparchive.org records how those pages were requested and served: the attributes of each page request, the number of links, redirects, and other such data.
However, the site itself offers only a general overview of the data, such as trends in web development. For data tailored to your own requirements, the site alone is not enough. Fortunately, the data is hosted on Google's cloud platform, where you can use SQL statements to retrieve exactly what you need. This article briefly describes how to use this database.
Since the data is hosted on Google Cloud, you naturally need to sign in with a Google account. Start from the list of APIs that Google provides.

Click into the BigQuery API, and first create a project in which to use the API.

Click Enable, then click through to the cloud console for this product. This takes you to the BigQuery page, which is where we will work.

Click on our project, then select Display Project.

Enter httparchive as the project ID to display the httparchive data.

The runs dataset holds the tables we want to query. Click the red Compose Query button to bring up the SQL input box, enter the SQL command, and click the red Run Query button to get the output.

The corresponding SELECT statement is as follows:

SELECT url FROM [httparchive:runs.2017_04_15_pages] WHERE rank IS NOT NULL ORDER BY rank ASC LIMIT 1000
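
The same query can also be submitted from a script instead of the web console. The following is only a minimal sketch, assuming the third-party google-cloud-bigquery Python package is installed and Google Cloud credentials are already configured. The function names build_top_sites_query and run_legacy_query, and the project ID you pass in, are illustrative and do not come from the original article. Because the bracketed [project:dataset.table] reference is legacy SQL syntax, the job is explicitly marked as legacy SQL. The sketch does not check whether this 2017 table is still hosted.

```python
def build_top_sites_query(table="httparchive:runs.2017_04_15_pages", limit=1000):
    """Return the article's legacy SQL query for the top-ranked URLs."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT url FROM [{table}] "
        "WHERE rank IS NOT NULL "
        "ORDER BY rank ASC LIMIT {limit}"
    ).format(table=table, limit=limit)


def run_legacy_query(sql, project_id):
    """Run a legacy SQL query and return the rows as a list of dicts.

    Assumption: google-cloud-bigquery is installed and credentials are set up.
    project_id is the ID of your own Google Cloud project (hypothetical here).
    """
    # Imported lazily so the query builder works without the package.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig()
    job_config.use_legacy_sql = True  # [project:dataset.table] is legacy SQL
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Calling run_legacy_query(build_top_sites_query(limit=10), "your-project-id") would, under these assumptions, return up to ten rows, each a dict with a url key. Keeping the query builder separate from the network call makes it possible to inspect the SQL before spending any query quota.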

The runs dataset contains data captured twice a month, split into desktop and mobile segments. Each segment is further divided into page description data and page request data. You can browse the tables themselves for the details.
For the query functions available and examples of their use, see the BigQuery reference documentation.
This is an original article from a CSDN blog. If you reprint it, please remember to add the attribution and a link to the blogger.
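
Since each crawl produces several tables, it can help to build the table reference from its parts. The sketch below is a rough illustration only. The date-plus-"pages" pattern is taken from the table name used in the query above; the "requests" kind and the "_mobile" suffix are assumptions about how the other tables are named, and runs_table_name is a hypothetical helper. Check the actual table list in the console before relying on it.

```python
from datetime import date


def runs_table_name(crawl_date, kind="pages", mobile=False):
    """Build a legacy SQL table reference in the httparchive runs dataset.

    The pattern YYYY_MM_DD_pages matches the table queried in the article.
    ASSUMPTION: request tables use the kind "requests" and mobile tables
    carry a "_mobile" suffix; verify both against the console table list.
    """
    if kind not in ("pages", "requests"):
        raise ValueError("kind must be 'pages' or 'requests'")
    name = "{:%Y_%m_%d}_{}".format(crawl_date, kind)
    if mobile:
        name += "_mobile"  # assumed suffix, not confirmed by the article
    return "httparchive:runs." + name
```

For example, runs_table_name(date(2017, 4, 15)) reproduces the reference httparchive:runs.2017_04_15_pages used in the query above, which can then be placed inside the square brackets of the FROM clause.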
