Capturing HTML content with Python

WPS for Linux Alpha 7 was released today. First of all, I would like to thank the WPS team for their hard work. The forum is very lively, and we are very much looking forward to the beta release next year.
However, the forum has a problem: post content is visible to everyone, including guests, so a large number of email addresses are exposed. Below, I will use Python to scrape these email addresses from the web pages and practice the Python standard library along the way. (Old hands, feel free to skip this.)
The libraries involved are http.client (HTTP handling), re (regular expressions), and threading (multithreading).
  
First, to capture the page content, you must obtain the HTML page; http.client.HTTPConnection is used for this purpose. In the http.client.HTTPConnection constructor, host specifies the web server address and port specifies the port (80 by default).
The following forms all have the same effect:
>>> h1 = http.client.HTTPConnection('www.cwi.nl')
>>> h2 = http.client.HTTPConnection('www.cwi.nl:80')
>>> h3 = http.client.HTTPConnection('www.cwi.nl', 80)
  
The constructor returns an HTTPConnection object that represents the current HTTP connection. You can then call this object's request method to request a page. Its four parameters, method, url, body, and headers, are easy to understand and not described in detail here. Note that headers is a dict: store the header fields and their values in it as key: value pairs. If you want to send cookies, simply add a Cookie entry to headers.
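For instance, a GET request to the forum that carries custom headers, including a cookie, might be sketched like this (the User-Agent and cookie values below are just placeholders):

>>> import http.client
>>> conn = http.client.HTTPConnection('bbs.wps.cn')
>>> headers = {'User-Agent': 'Mozilla/5.0', 'Cookie': 'sessionid=xxxx'}   # placeholder values
>>> conn.request('GET', '/thread-22351621-1-1.html', headers=headers)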
After sending the request, use the getresponse method to obtain the response. getresponse returns an HTTPResponse object, representing one HTTP response; it contains the response headers, body, status code, and so on. Use the HTTPResponse read method to read the HTML page. Note that in Python 3, read returns a bytes object, which you must decode before matching it with the regular expressions below.
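Continuing the sketch above, reading the response might look like this (assuming the page is UTF-8 encoded, as the forum pages here are):

>>> res = conn.getresponse()        # an http.client.HTTPResponse object
>>> res.status, res.reason          # status code and reason phrase, e.g. (200, 'OK')
>>> html = res.read()               # bytes in Python 3
>>> text = html.decode('utf-8')     # decode before regex matching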
  
With the page content in hand, you can parse it. Python has many libraries for parsing HTML, such as BeautifulSoup, PyQuery, and regular expressions. Here I chose regular expressions. For the regular expression syntax itself, see these two articles: the Python Regular Expression operation guide and a 30-minute getting-started tutorial on regular expressions. I will mainly cover the usage of Python's regular expression module, re.
The re module can be used in two ways, with similar effect: call the functions in re directly, or compile the pattern into a regex object with re.compile and then use that object's methods. The latter is a little more efficient. The general process is as follows:
Obtain a match object with the search function, then call the match object's group/groups methods to retrieve the matched groups.
You can also use findall to search directly for all occurrences of a pattern.
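A minimal sketch of both styles, reusing the mailto pattern from the full code below on a hypothetical line of forum HTML:

>>> import re
>>> pattern = re.compile(r'(?<=<a href="mailto:)[\w\W]*(?=">)')
>>> line = '<a href="mailto:someone@example.com">someone@example.com</a>'   # illustrative fragment
>>> match = pattern.search(line)     # compiled-object style
>>> match.group(0)
'someone@example.com'
>>> re.findall(pattern, line)        # module-function style
['someone@example.com']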
  
Because a single thread is too slow, I use multithreading (the threading module) to speed things up. threading.Thread is used to create a thread. In the threading.Thread constructor, target is the worker function object containing the thread's execution logic, args is a tuple of positional arguments passed to the worker function, and kwargs can also be used to pass arguments to the worker as a dict of parameter name: value pairs.

Start a created thread with its start method; the join method waits for the thread to exit. Because this program needs no synchronization between threads, there are no mutexes, semaphores, or other synchronization mechanisms; see the official documentation for more on those.
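A minimal sketch of creating, starting, and waiting for a thread, using a hypothetical greet function purely for illustration:

>>> import threading
>>> def greet(name, punctuation='!'):          # hypothetical worker function
...     print('Hello, %s%s' % (name, punctuation))
...
>>> t = threading.Thread(target=greet, args=('world',), kwargs={'punctuation': '?'})
>>> t.start()                                  # run greet in a new thread
>>> t.join()                                   # block until the thread exits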

Code:

#!/usr/bin/env python3
import http.client
import re
import threading

# Email addresses appear in the forum pages as <a href="mailto:...">
pattern = re.compile(r'(?<=<a href="mailto:)[\w\W]*(?=">)')

def Parse(data, fp):
    # Scan each line of the page for a mailto link and record the address
    lines = data.split('\r\n')
    for line in lines:
        match = pattern.search(line)
        if match:
            email = match.group(0)
            print(email)
            fp.write(email + '\n')

def worker(begin, end, No):
    # Fetch pages [begin, end) of the thread and write found emails to a per-thread file
    print('Thread %d initiated.' % No)
    conn = http.client.HTTPConnection('bbs.wps.cn')
    url = '/thread-22351621-%d-1.html'
    index = begin
    fp = open('email%02d.txt' % No, 'a')
    while index < end:
        conn.request('GET', url % index)
        try:
            res = conn.getresponse()
        except Exception:
            continue
        print('Thread %d: %d, %s' % (No, res.status, url % index))
        if res.status == 200:
            Parse(res.read().decode('utf8'), fp)
        index += 1
    fp.close()

# Split 180 pages evenly across the worker threads
total = 10
step = 180 // total
threads = [threading.Thread(target=worker, args=(i * step, (i + 1) * step, i))
           for i in range(total)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()