How to use Python, C#, and other languages to crawl static web pages, crawl dynamic web pages, and simulate logging in to a website

Source: Internet
Author: User
Reposted from: http://www.crifan.com/how_to_use_some_language_python_csharp_to_implement_crawl_website_extract_dynamic_webpage_content_emulate_login_website/

Background

When working with networks, web pages, and websites, many people want to use some language (Python, C#, etc.) to meet a particular need. The common needs fall into these categories:

  • extracting some content from a static web page
  • crawling some content from a dynamic web page
  • simulating logging in to a website

The basic logic behind all of these needs is the same.

The sections below explain how to implement them.

Understand the HTTP logic involved in accessing web pages

First, understand the logic behind accessing a URL.

What you need to provide:

  • the URL
  • headers: some are optional, some are required
  • cookies (optional)
  • POST data, when the POST method is used

What you get back:

  • the HTML source (or other content: a JSON string, image data, and so on)
  • cookies (possibly): subsequent requests to other URLs may need to send the (new) cookies returned here
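
The inputs and outputs listed above can be sketched in Python with the standard library. This is a minimal illustration, not the method used in the referenced posts: the helper names `build_request` and `fetch`, the `example.com` URL, and all header, cookie, and form values are placeholders chosen for this example.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_request(url, headers=None, cookies=None, post_data=None):
    """Assemble the inputs: URL, headers, cookies, and optional POST data."""
    all_headers = dict(headers or {})
    if cookies:
        # Cookies travel in a single "Cookie" header as name=value pairs.
        all_headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    body = None
    if post_data is not None:
        # Supplying a body makes urllib send POST instead of GET.
        body = urllib.parse.urlencode(post_data).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=all_headers)


def fetch(request, cookie_jar=None):
    """Send the request; return (body bytes, response headers, cookie jar)."""
    jar = cookie_jar if cookie_jar is not None else CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(request, timeout=10) as response:
        # The jar now holds any cookies the server returned.
        return response.read(), response.headers, jar


# Placeholder URL and values, for illustration only.
request = build_request(
    "http://example.com/login",
    headers={"User-Agent": "Mozilla/5.0"},
    cookies={"session": "abc123"},
    post_data={"username": "alice", "password": "secret"},
)
print(request.get_method())          # POST
print(request.get_header("Cookie"))  # session=abc123
```

Calling `fetch(request)` would actually send the request. It is left uncalled here because the URL is only a placeholder.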

For a detailed explanation, see:

[Notes] Crawling web pages, analyzing web content, and simulating website login: logic, process, and considerations

Tips:

1. Character encoding (charset) of HTML

For background on the encoding of HTML pages, it is best to read this as well:

[Notes] An explanation of the character encoding (charset) formats (GB2312, GBK, UTF-8, ISO8859-1, etc.) of HTML web page source code

Understand the logic of the web page you want to handle

Simply put, for the URL you want to handle, work out which values you need to provide and which values you need to get back.

Only after you understand the logic you care about can you move on to implementing it in code.

If the process is simple, you do not need any analysis tool; reading the page source is enough to work it out.

However, the process is often complex, so you generally need the corresponding developer tools to analyze it.

For example, use IE9's F12 developer tools to capture the requests that are made, then analyze the parts of the page's logic that you care about.

For a detailed explanation and demonstration, see:

[Tutorial] How to use a tool (IE9 F12) to analyze the internal logic of simulating a login to a website (the Baidu homepage)

Tips:

1. Other analysis tools

If you are not familiar with tools such as IE9's F12, first read:

[Summary] Developer tools in browsers (IE9 F12 and Chrome Ctrl+Shift+I): tools for web page analysis

On this topic, there is another post to refer to:

[Notes] Developer tools in various browsers: IE9's F12, Chrome's Ctrl+Shift+J, Firefox's Firebug

2. Analysis of complex parameter values

While analyzing with a tool, you will find that some of the values are relatively complex and cannot be obtained directly, so you need to debug to work out where they come from.

For how to analyze the way complex parameter values are obtained, see:

How to use IE9's F12 to analyze complex values (parameters, cookies, etc.) during a website's login process

3. Another example

A later example analyzes how to find the real address of a song from the address of its Songtaste playback page:

[Tutorial] How to use IE9 F12 to find the real address of a Songtaste song

Implement the above logic in a language

Once you understand the whole logical process you have to handle and its order of execution, implement that process in a language.

Tips:

Some logic is common to most implementations:

1. Encoding and decoding of URL addresses

If URL encoding or decoding is involved, see:

[Notes] Encoding (encode) and decoding (decode) of URL addresses in HTTP (GET or POST) requests

2. How to handle headers, cookies, and POST data

Although the following post was written for crawling dynamic web pages, the basic logic is the same:

[Notes] Logic for handling header information, cookie information, and POST data when crawling web pages, simulating login, crawling dynamic web content, and so on

3. How to extract web content

For simple HTML pages, you can of course use regular expressions.
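
A minimal sketch of the regular-expression approach, using Python's `re` module. The HTML snippet, including the `h1user` class name, is a made-up sample standing in for a downloaded page.

```python
import re

# Sample HTML standing in for a downloaded page.
html = (
    "<html><head><title>Example Page</title></head>"
    '<body><h1 class="h1user">crifan</h1></body></html>'
)

# A non-greedy group captures the text between the opening and closing tags.
title_match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
user_match = re.search(r'<h1\s+class="h1user">(.+?)</h1>', html)

print(title_match.group(1))  # Example Page
print(user_match.group(1))   # crifan
```

Patterns like these depend on the exact markup, so they tend to break as soon as attributes are reordered or tags are nested.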

For complex content extraction, however, a dedicated third-party library is recommended:

[Notes] Suggestions for handling HTML code with regular expressions

Among them:

Python: for libraries for parsing HTML, the recommended reading is:

[Summary] Using Python's third-party library BeautifulSoup

For code examples, the tutorials are grouped by the three categories above.

Extracting some content from a static web page

Python version

[Tutorial] Crawl a web page and extract the information you need (Python version)

C# version

[Tutorial] Crawl a web page and extract the information you need (C# version)

Crawling some content from a dynamic web page

Here, the existing tutorial explains the whole process from the beginning:

[Tutorial] How to crawl dynamic web page content
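
As a rough illustration only: dynamic content is often delivered by a separate request that returns JSON, sometimes wrapped in a JSONP callback. That is an assumption made for this sketch, not something the tutorial above is confirmed to cover. Once the developer tools reveal that request, the response can be parsed as below. The function name `parse_json_or_jsonp` and the response body are made up.

```python
import json
import re


def parse_json_or_jsonp(text):
    """Parse a JSON body, unwrapping a JSONP callback such as cb({...}); if present."""
    text = text.strip()
    match = re.match(r"^[\w$.]+\s*\((.*)\)\s*;?$", text, re.DOTALL)
    if match:
        # Keep only the argument passed to the callback.
        text = match.group(1)
    return json.loads(text)


# Made-up response body, standing in for what a data URL found with the
# developer tools might return.
body = 'callback({"views": 1234, "comments": [{"user": "alice", "text": "nice"}]});'
data = parse_json_or_jsonp(body)
print(data["views"])                # 1234
print(data["comments"][0]["user"])  # alice
```

A plain JSON body passes through unchanged, so the same helper handles both shapes.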
