Linux Web page capture instance (shell + awk)


In the previous blog, we covered the two ways to capture web pages in Linux: curl and wget. This article walks through a concrete Linux webpage capture example: capturing the top game rankings of Google Play in 12 countries around the world.

To capture the Google Play game ranking page, you must first analyze the characteristics and rules of the page:

1. The Google Play game ranking is paginated in an overview-detail form: each ranking page lists a fixed number of games (e.g. 24), and a series of such pages makes up the overall ranking of all games

2. Clicking an individual game link on each page opens a detail page with that game's attributes (rating, release date, version number, SDK version, category, download count, and so on)


Problems to be Solved:

1. How to capture the total ranking of all games?

2. After capturing the overall ranking, how do I splice URLs to capture each individual game webpage?

3. After capturing each individual game webpage, how do I extract the game attribute information (rating, release date, etc.) from the page?

4. After extracting each game's attribute information, how do I save it (MySQL), generate daily reports (HTML), and send the daily reports (email)?

5. Based on the captured game attribute data, how can I query my company's game rankings (JSP), and how can I display the rankings clearly (jfreechart charts)?

6. Harder still: Google Play's game rankings are not globally unified. Google adopts a localization strategy, and dozens of countries have their own ranking algorithms and rules. How do we rank games across 12 countries?


Design Scheme and Technical Selection

After analyzing these requirements, how do we solve them one by one? What do we need to think through, design, and build (module flow and technical implementation)?

Based on the questions raised above, here is the module design and technology selection, item by item:





1. To capture the game rankings of 12 countries, you need proxy servers located in those 12 countries (the ranking algorithms and languages differ per country: Chinese, English, Japanese, Russian, Spanish ...)

2. Webpages are captured with curl + proxy; the downloaded pages are parsed with the awk text analysis tool (you need to know the HTML tags, id/class attributes, and other markup very well in order to extract the game attribute information accurately with awk)

3. Since the IP proxy filtering system, the webpage capture programs, and the attribute extraction modules are all scripts, database creation and record insertion are also implemented as shell scripts, to keep the programming language consistent

4. The captured game attributes are rendered as an HTML table on a webpage, which is clear and intuitive. A shell script splices the HTML strings (table + tr + td + info)

5. The generated HTML report is emailed daily on a schedule to the product director, PMs, RD, and QA, so they can track the rankings of the company's published games as well as the fastest-rising and most popular games worldwide

6. A JSP webpage query system looks up a game's ranking and trend by the entered game name or package name, and displays the game's full attribute information alongside the trend chart


Module technical implementation

1. IP proxy Filtering

Cost-wise, renting a proxy server (VPN) in each country, at the lowest market price of about 1000 yuan/month, comes to 12,000 yuan per country per year. For 12 countries the total is 12 x 12,000 = 144,000, i.e. roughly 144,000 RMB/year in VPN rental fees.

Given that cost and overhead, I later studied proxy servers and free IP sources in depth and built a free IP proxy server screening system to capture the game rankings of the 12 countries.

Free proxy IP addresses mainly come from the two websites described in the previous blog: freeproxylists.net and xroxy.com.

The IP proxy filtering system is covered separately in the next blog, as its text preprocessing and filtering logic is fairly involved.


2. Capture ranking webpages

By carefully analyzing the Google Play game ranking page, we can find that there are regular rules to follow:

Page 1 (top 24): https://play.google.com/store/apps/category/GAME/collection/topselling_free

Page 2 (top 48): https://play.google.com/store/apps/category/GAME/collection/topselling_free?start=24&num=24

Page 3 (top 72): https://play.google.com/store/apps/category/GAME/collection/topselling_free?start=48&num=24

...

Observing the query string at the end of each page URL (?start=24&num=24), the rule emerges. With start=0, the first page can also be written as:

Page 1 (top 24): https://play.google.com/store/apps/category/GAME/collection/topselling_free?start=0&num=24

Based on this rule, you can splice the URL string in a loop and capture each ranking page with curl + proxy (start=`expr $start + 24`)
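
A minimal sketch of that loop (the proxy address here is a placeholder, and the output file naming is illustrative, not taken from my production script):

#!/bin/sh
# fetch the top800 ranking pages, 24 games per page, through a proxy
proxy='125.69.132.100:8080'     # placeholder proxy ip:port
base='https://play.google.com/store/apps/category/GAME/collection/topselling_free'

start=0
page=1
while [ $start -lt 800 ]
do
    # -x: route the request through the proxy; -o: save the page to a numbered file
    curl -x $proxy -o output_page_$page "$base?start=$start&num=24"
    start=`expr $start + 24`
    page=`expr $page + 1`
done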


3. Extract game links

Each ranking page contains 24 game URL hyperlinks. How do we extract these 24 links?

At first I considered XML parsing, since HTML is a hierarchical, XML-like format. But some pages are not strictly standard HTML (for example, some opening tags are never closed), which makes XML parsing fail.

Later, using my HTML and JS knowledge, I analyzed the content structure of the ranking page and found that each game link is preceded by a unique class="title" attribute. The specific format is as follows (for Basketball Shoot):



<a class="title" title="Basketball Shoot" data-a="1" data-c="1" href="/store/apps/details?id=com.game.basketballshoot&feature=apps_topselling_free">Basketball Shoot</a>

In this way, we can use awk to extract the text content near class="title". The specific implementation is as follows:



# split url_24
page_key='class="title"'
page_output='output_page.log'
page_output_url_start='https://play.google.com/store/apps/'
page_output_url='output_top800_url.log'

function page_split(){
    # sort the downloaded ranking pages in ascending numeric order (keeps the rank order),
    # then keep only the lines containing class="title"
    grep "$page_key" $(ls $url_output* | sort -t "_" -k6 -n) > tmp_page_grepURL.log            # $url_output: downloaded ranking pages
    # split on < and >, keep only the <a class="title" ...> tag fragment
    awk -F'[<>]' '{for(i=1;i<=NF;i++){if($i~/'$page_key'/){print $i}}}' tmp_page_grepURL.log > tmp_page_titleURL.log
    # split on double quotes: field 4 is the game title, field 10 is the href
    awk -F'"' '{print $4 $10}' tmp_page_titleURL.log > $page_output

    rm -rf tmp_page_grepURL.log
    rm -rf tmp_page_titleURL.log

    # merge top800 url
    rm -rf $page_output_url
    touch $page_output_url

    # split on /, keep the field starting with "details" (the id + feature query string)
    awk -F'/' '{for(i=1;i<=NF;i++){if($i~/details/){print $i}}}' $page_output > tmp_top800_url.log

    index=1
    while read line
    do
        echo $line
        echo $index
        # prepend the domain prefix to rebuild the full game url
        echo -e $page_output_url_start$line >> $page_output_url

        index=`expr $index + 1`
    done < tmp_top800_url.log

    rm -rf tmp_top800_url.log
}

Script Function Description:

grep: first, sort the downloaded ranking page files in ascending numeric order, which preserves the games' rank order; then use grep to pull only the lines we need into a temporary file, greatly reducing the amount of text to analyze.

awk: specify the multi-character delimiter set <> (awk -F'[<>]') to split each input line into an array of substrings, then test whether each substring contains the $page_key string (if($i~/'$page_key'/)). Matching substrings are written to the temporary file tmp_page_titleURL.log; everything else is discarded.

Each extracted line now carries both the game name (title) and the game hyperlink (href). The next step is to pull out the title and the href, and finally splice the href onto the domain name to form a complete hyperlink.

awk: specify the delimiter " (awk -F'"') to split each line on double quotes, then print fields 4 and 10 of the resulting array (awk fields are numbered from 1); these hold the title and the href.

With that round of text processing, the game name (title) and game link (href) are extracted; next we pull out the part of the link we actually need (the string starting at details).

Intermediate results are kept in temporary files, which are deleted once processing is done (rm -rf ...).

awk: specify the delimiter / (awk -F'/') to split each line into substrings, then test whether each substring contains the string details (if($i~/details/)). Matching substrings are written to the temporary file tmp_top800_url.log; everything else is discarded.

Finally, the while read line loop prepends the domain prefix ($page_output_url_start) to each extracted line and splices the complete hyperlink address (echo -e $page_output_url_start$line >> $page_output_url).

OK! With awk's powerful text processing, the game page hyperlinks are ready. Next, curl + proxy is used to capture each corresponding game webpage.


4. Capture game webpages

Using the game hyperlinks extracted in Module 3, capture each game webpage with the curl + proxy method introduced in the previous blog, for example:

curl -x 125.69.132.100:8080 -o html_2 "https://play.google.com/store/apps/details?id=com.game.basketballshoot&feature=apps_topselling_free"

Opening the captured page html_2 in a browser confirms the game page downloaded correctly.

OK. With the game webpage captured, the next step is to extract each page's game attribute information (the extraction method is similar to Module 3's link extraction, again using the awk text analysis and processing tool)

However, pay attention to the following points:


  • The page-capture statements in my actual script are more involved than the one above: real-world testing has to handle connection timeouts, slow transfers, and proxy IPs that suddenly die. A minimal retry sketch follows this list; if you are interested, study the details yourself.
  • When downloading a game page, what do we do if the proxy IP suddenly fails and the page cannot be captured? (The same problem occurs in Module 2 when capturing the ranking pages. The solution is detailed in the free IP proxy system post, the next blog.)
  • How do we keep the captured game pages in ranking order? (Module 3 already solved this; the answer is in Module 3's script.)
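
Here is that retry sketch, a minimal illustration of the idea rather than my actual capture statement (the timeout values and retry count are illustrative):

# fetch one url through a proxy, retrying on failure
# $1: proxy ip:port    $2: url    $3: output file
function fetch_page(){
    retry=0
    while [ $retry -lt 3 ]
    do
        # --connect-timeout: give up connecting after 15s; -m: cap the whole transfer at 60s
        curl -x $1 --connect-timeout 15 -m 60 -o $3 "$2"
        if [ $? -eq 0 ]; then
            return 0
        fi
        retry=`expr $retry + 1`
    done
    return 1    # still failing: the caller should switch to another proxy IP
}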


5. Extract game attributes

Extract the attribute information from each captured game webpage (i.e., the attribute block in html_2 from Module 4 above). The extraction method is similar to Module 3's, again using the awk text analysis and processing tool.

By analyzing the downloaded page content, you can find the unique markers (id, class, tag, etc.) that identify each attribute. The implementation follows the script shown in Module 3, so the code is not repeated here.
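
For illustration only, here is the same grep + awk pattern applied to a hypothetical attribute marker class="ratingValue" (the real Google Play markup differs and changes over time):

# suppose the page contained: <div class="ratingValue" content="4.5">
# step 1: keep lines with the marker; step 2: keep the tag fragment;
# step 3: split on double quotes, field 4 is the value (here: 4.5)
grep 'class="ratingValue"' html_2 \
    | awk -F'[<>]' '{for(i=1;i<=NF;i++){if($i~/ratingValue/){print $i}}}' \
    | awk -F'"' '{print $4}'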


6. Save the property to the database


  • To save the extracted game attribute information, you must first create a MySQL database and a table. The shell database creation script is as follows:

#!/bin/sh
# Author        : yanggang
# Datetime      : 2011.10.24 21:10:28
# ============================================================


log='SQL_insert_one.sh.log'

if [ ! -e $log ]; then
    touch $log
fi

date=$(date "+%Y-%m-%d__%H:%M:%S")
echo >> $log
echo "=================  $date  ================" >> $log


# mysql database and table to create
HOST='localhost'
PORT='3306'
USER='root'
PWD='xxxxxx'
DBNAME='top800'
TABLENAME='gametop800'

mysql_login=''
mysql_create_db=''
mysql_create_table=''


function mysql_create(){
    echo "login mysql $HOST:$PORT ..." >> $log
    mysql_login="mysql -h $HOST -P $PORT -u $USER -p$PWD"  # mysql -h host -P port -u root -p pwd
    echo | ${mysql_login}        # run an empty session just to test the login
    if [ $? -ne 0 ]; then
        echo "login mysql ${HOST}:${PORT} failed.." >> $log
        exit 1
    else
        echo "login mysql ${HOST}:${PORT} success!" >> $log
    fi

    echo "create database $DBNAME ..." >> $log
    mysql_create_db="create database if not exists $DBNAME"
    echo ${mysql_create_db} | ${mysql_login}
    if [ $? -ne 0 ]; then
        echo "create db ${DBNAME} failed.." >> $log
    else
        echo "create db ${DBNAME} success!" >> $log
    fi

    echo "create table $TABLENAME ..." >> $log
    mysql_create_table="create table $TABLENAME(
        id char(50) not null,
        url char(255),
        top int,
        name char(100),
        category char(50),
        rating char(10),
        ratingcount char(20),
        download char(30),
        price char(20),
        publishdate char(20),
        version char(40),
        filesize char(40),
        requireandroid char(40),
        contentrating char(40),
        country char(10) not null,
        dtime date not null default \"2011-01-01\",
        primary key(id, country, dtime)
    )"

    echo ${mysql_create_table} | ${mysql_login} ${DBNAME} 
    if [ $? -ne 0 ]; then
        echo "create table ${TABLENAME} fail..." >> $log
    else
        echo "create table ${TABLENAME} success!" >> $log
    fi
}

Script Function Description:

First, log in to MySQL to verify that the server, port, user name, and password are correct; if not, the login fails and the script exits (exit 1). Next, create the database: check whether it already exists, create it if not, and continue if it does. (Note: database creation must come after a successful login check, otherwise the operation fails.) Finally, create the table: design its fields first, then create it, with the same exists-check as for the database.

  • Then traverse the game attribute text files and insert all the records into the MySQL database for unified storage and management.

#!/bin/sh
# Author        : yanggang
# Datetime      : 2011.10.24 21:45:09
# ============================================================


# insert mysql
file_input='output_top800_url_page'
file_output='sql_output'

HOST='localhost'
PORT='3306'
USER='root'
PWD='xxxxxx'
DBNAME='top800'
TABLENAME='gametop800'

col_id=''
col_url=''
col_top=1
col_name=''
col_category=''
col_rating=''
col_ratingcount=''
col_download=''
col_price=''
col_publishdate=''
col_version=''
col_filesize=''
col_requireandroid=''
col_contentrating=''
col_country=''
col_dtime=''

sql_insert='insert into gametop800 values("com.mobile.games", "url", 3, "minesweeping", "games", "4.8", "89789", "1000000-5000000000", "free", "2011-2-30", "1.2.1", "1.5M", "1.5 up", "middle", "china", "2011-10-10")'

function mysql_insert(){
    rm -rf $file_output
    touch $file_output

    DBNAME=$1
    col_dtime=$2
    col_country=$3
    echo 'col_dtime========='$col_dtime

    while read line
    do
        col_id=$(echo $line | cut -f 1 -d "%" | cut -f 1 -d "&" | cut -f 2 -d "=")
        col_url=$(echo $line | cut -f 1 -d "%")
        col_name=$(echo $line | cut -f 2 -d "%")
        col_category=$(echo $line | cut -f 3 -d "%")
        col_rating=$(echo $line | cut -f 4 -d "%")
        col_ratingcount=$(echo $line | cut -f 5 -d "%")
        col_download=$(echo $line | cut -f 6 -d "%")
        col_price=$(echo $line | cut -f 7 -d "%")
        col_publishdate=$(echo $line | cut -f 8 -d "%")
        col_version=$(echo $line | cut -f 9 -d "%")
        col_filesize=$(echo $line | cut -f 10 -d "%")
        col_requireandroid=$(echo $line | cut -f 11 -d "%")
        col_contentrating=$(echo $line | cut -f 12 -d "%")


        sql_insert='insert into '$TABLENAME' values('"\"$col_id\", \"$col_url\", $col_top, \"$col_name\", \"$col_category\", \"$col_rating\", \"$col_ratingcount\", \"$col_download\", \"$col_price\", \"$col_publishdate\", \"$col_version\", \"$col_filesize\", \"$col_requireandroid\", \"$col_contentrating\", \"$col_country\", \"$col_dtime\""');'
        
        echo $sql_insert >> $file_output
        mysql -h $HOST -P $PORT -u $USER -p$PWD -e "use $DBNAME; $sql_insert"

        col_top=`expr $col_top + 1`
    done < $file_input
}

Script Function Description:

The insert script is comparatively simple, implementing two main functions: the game rank counter (col_top) and the database insert statement ($sql_insert).
The while read line loop reads the game attribute text file produced by Module 5, splits each line into its fields (cut -f 2 -d "%" and so on), substitutes them into the insert statement ($sql_insert), and runs mysql -h $HOST -P $PORT -u $USER -p$PWD -e "use $DBNAME; $sql_insert" to log in to MySQL and execute the insert.



7. Generate HTML reports

A shell script splices the strings table + tr + td + info to generate the HTML webpage report. For details, see my previous blog: shell converts TXT to HTML.
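
The core of the splicing looks roughly like this (a minimal sketch; the input format and the three columns are assumptions, not my real report script):

# input line format assumed: name%rating%download
html_out='report.html'

echo "<html><body><table border=1>" > $html_out
echo "<tr><td>name</td><td>rating</td><td>download</td></tr>" >> $html_out

while read line
do
    name=$(echo $line | cut -f 1 -d "%")
    rating=$(echo $line | cut -f 2 -d "%")
    download=$(echo $line | cut -f 3 -d "%")
    echo "<tr><td>$name</td><td>$rating</td><td>$download</td></tr>" >> $html_out
done < output_attr.log

echo "</table></body></html>" >> $html_out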


8. Mail sending report

The email module uses /usr/bin/mutt. The United States report (the default) is shown as HTML in the mail body; the other countries' reports are sent as attachments. For more information, see my previous blog: sending email attachments in Linux shell.

crontab is used to send the mail on a schedule. For detailed configuration and usage, see my previous blog: Linux scheduled run command script - crontab.
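
A minimal mutt invocation in that spirit (the recipient and file names are placeholders):

# mail the US report as an HTML body, with another country's report attached
mutt -e "set content_type=text/html" \
     -s "top800 daily report" \
     -a top800_india.html -- \
     pm@example.com < top800_usa.html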


9. Web page query report

Use JSP to read the game attribute information stored in MySQL and loop over it to generate the game ranking webpages. For more information, see my previous blog: Linux JSP connection to MySQL database.
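
The JSP code itself is in that blog; the lookup it performs boils down to a query like this (shown here via the mysql client rather than JSP, with a placeholder package name and country value):

# rank history of one game in one country, against the gametop800 table above
mysql -h localhost -P 3306 -u root -pxxxxxx -e \
  "use top800; select dtime, top, rating, download from gametop800 \
   where id='com.game.basketballshoot' and country='usa' order by dtime;"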


10. Ranking trend chart

The trend chart uses the third-party jfreechart chart generation tool. For details, see my previous blog: jfreechart learning example.

After a game's ranking trend chart is generated, it is embedded into the JSP page for display. For the complete ranking trend chart, see my previous blog: jfreechart project instance.



Automated Master Control script

The 12-country game ranking system runs as one pipeline: free IP proxy screening -- webpage capture -- database save -- ranking report generation -- scheduled email report -- game ranking query -- trend chart generation.

All the processes are automated. The following describes the script implementation and functions of each module:


The pipeline is run on a schedule via the server's crontab: the master control script top10_all.sh is started automatically at 00:01 every day.
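
The corresponding crontab entry looks like this (the script path is a placeholder):

# min hour day month weekday command -- run the master script at 00:01 every day
1 0 * * * /home/top800/top10_all.sh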


The master script saves each day's captured data, analyzed data, and result data into per-day report folders, from which the daily report is generated.


Note: the folder data above is last year's test data, copied over. The rankings were not captured on my own notebook, because capturing the top rankings of 12 countries through remote proxies consumes network resources, memory, and time, which would seriously hurt the internet experience.



Architecture Design and Evaluation

The ranking capture system was not as complicated as this at the start. The first version took only a week to build and captured the domestic Google Play game rankings (using curl without a proxy).

Later, as new requirements kept coming in, we kept adding, refactoring, and improving during development, a bit like rapid prototyping.

From requirements to demo prototype, then on to email sending, webpage viewing, trend chart queries, and the free IP proxy system, the design and implementation of the whole system took about two months, with two games being developed in parallel... busy!


In general, I followed two principles during the design:

1. Webpage capture, text processing, database storage, and the other data-handling steps are all implemented uniformly as scripts, keeping the development language pure.

2. Each functional module is broken into sub-problems and implemented independently; the pipeline is designed in layers, and a glue language assembles the pieces like building blocks.


Practice later proved this design method and these principles completely correct.

The original requirement covered the top rankings of 10 countries; a European country and India were added later. Capturing, testing, and releasing the two newly added countries took only about one evening, two or three hours!

Based on the above architecture design, adding a new country takes only a few steps (take adding India as an example):

A. Find free IP proxies for the new country (India) on freeproxylists.net, put them into the corresponding directory of the free proxy filtering system (testproxy), and add that country's information (the next blog explains this in detail).

B. Copy the China script to an India script: cp top800_proxy_china.sh top800_proxy_india.sh, then change every china in top800_proxy_india.sh to india. The batch modification command is as follows:

sed -i "s/china/india/g" top800_proxy_india.sh

C. In the 12-country master control script (top800_proxy_all.sh), add the India capture script top800_proxy_india.sh, and add the India HTML report to the email attachments.

D. In the database script (sql_insert_x_country.sh), add the India capture folder so that India's game information is saved to MySQL; in the JSP webpage, add an India item to the query options.

E. OK, added!



Overall Evaluation of this System Architecture

Advantages:

1. The functional modules are relatively independent, making features easy to extend and maintain.

2. The development languages, shell + awk, make it easy to call and combine the module pipelines.

3. Adding a new country's ranking only requires changing a few configurations; there is no need to understand the modules' internal implementation.

4. The trend chart tool jfreechart is implemented in Java, making it easy for JSP to call and embed, and it displays the game ranking trends clearly.

5. The MySQL database is backed up off-site: the automated master control script does a daily remote backup (scp), reducing the risk of losing historical data (remote backup requires a trust relationship between the two machines; for more information, see my previous blog).
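
A minimal sketch of that daily backup step, assuming ssh key trust is already set up (the host and paths are placeholders):

# dump the top800 database and copy it to the backup machine
backup_file=top800_$(date "+%Y-%m-%d").sql.gz
mysqldump -h localhost -P 3306 -u root -pxxxxxx top800 | gzip > $backup_file
scp $backup_file backup@192.168.1.100:/data/top800_backup/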


Further improvement is required:

1. The generated 12-country ranking reports contain nearly 10 languages, including Chinese, English, Japanese, Russian, Spanish, Korean, French, German, and Italian... it reads like the Eight-Nation Alliance... ha.

Later, Google's Translate API could be called to translate all 10 report languages into Chinese or English for convenient reading, instead of looking games up in a dictionary one by one...

2. The system is currently deployed on one server running several capture scripts in parallel (./xxx.sh &); capturing and downloading the top rankings of 12 countries takes a long time. A later improvement is to capture from multiple servers simultaneously.

3. Although the system automates the full pipeline, and I wrote README function documentation for the corresponding modules, maintenance by others remains a problem: learning the scripts takes time and energy, script languages are not very intuitive, and the terse command parameter syntax is not that easy to read. Later a good GUI wrapper could be added to ease maintenance.

4. Competition in Android development is now fierce, and grasping the game rankings, especially game development trends, is very important. Later I will see whether this system can be packaged into a commercial application.



Summary

The whole system was independently designed, implemented, tested, launched, and maintained by myself.


Looking back, this system involved many knowledge points and technologies, many of which I had never touched before, such as awk, JSP, Tomcat, mutt, crontab, and jfreechart.

But it is precisely such opportunities and challenges that train the ability to think, analyze, and solve problems independently, and above all the ability to learn quickly.



Next, let's talk about the difficulties, detours, and experiences in designing and implementing this system:

Difficulties encountered

1. It was unclear which implementation approach to choose

When starting on the ranking capture, I did not know whether to build an application or write scripts. An entrepreneurial team pays great attention to efficiency and cost and needs to see a prototype and results as soon as possible; script development is clearly faster.

2. Content extraction after webpage capture

Should the whole HTML file be parsed as XML, or is there a better approach to content extraction? At first I chose XML parsing for analysis and extraction; later I found this was wrong, because some HTML has a non-standard structure that breaks XML parsing.

3. How to send email and run processes on a schedule in Linux?

I asked a technical expert and got a definite answer: certain Linux commands can do it. So I searched and verified online and found a mature solution (some solutions on the internet do not work; in the end I combined several approaches).

4. Should a third-party open-source engine generate the game trend charts, or should Java's built-in graphics tools draw them directly?

For development efficiency, an open-source or free third-party charting tool is clearly preferable. At the time I found that both jchart and jfreechart could do the job, and both are written in Java.



Experience

1. Wide knowledge

The system uses a variety of tools: shell scripts, proxy servers, MySQL databases, HTML webpages, mutt email sending, JSP website construction, and jfreechart research (see my Baidu blog for details).

Front-end development (HTML, JSP), backend services (Tomcat setup, database connections), scripting languages (shell, awk), and charting tools (jfreechart): like a general store, you need a broad inventory in stock.


2. Fast learning ability

I had learned some shell scripting at Baidu back then, but designing and developing a ranking system implemented entirely in shell clearly required further study and accumulation; during the project, whatever technology was needed, I had to learn it on the spot.

I had done website development both at university and at a research institute, so I knew some HTML/CSS and ASP.NET. In my first month at Baidu I picked up a few JS tricks from my mentor, so splicing tables and HTML came easily, and learning JSP was easy too; after all, there was some carried-over advantage.

The MySQL moves had all been practiced before, so that part was easy. Setting up Tomcat just took an online tutorial and was done in no time; like IIS and Apache, it is a server that parses and serves web pages.

jchart and jfreechart are both Java chart generation tools, but they had to be weighed comprehensively on cost, ease of learning, documented examples, and versatility; in the end, the jfreechart scheme was chosen.


3. Learn more with an open mind

When facing technology selection and knowledge blind spots, consult more material, investigate, and adapt mature technical solutions for your own use. If you cannot find one, ask technical experts modestly; they are usually very enthusiastic to help.

Then, following the experts' hints, work out the solution. Once the problem is solved, reflect on it: why didn't I think of that at the time... and accumulate. Over time, experience grows.


