Capturing HTML content with Python

WPS for Linux Alpha 7 was released today. First of all, I would like to thank the WPS team for their hard work. The forum is very lively, and we are very much looking forward to the beta release next year.
However, the forum has a problem: post content is visible to everyone, including guests, so a large number of email addresses are exposed. Below, I will use Python to scrape these email addresses from the web pages and practice the Python standard library along the way. (Old hands, feel free to skip this.)
The libraries involved are http.client (HTTP handling), re (regular expressions), and threading (multithreading).
  
First, to capture the page content, you must obtain the HTML page; http.client.HTTPConnection is used for this purpose. In the http.client.HTTPConnection constructor, host specifies the web server address and port specifies the port (80 by default).
The following forms all have the same effect:
>>> h1 = http.client.HTTPConnection('www.cwi.nl')
>>> h2 = http.client.HTTPConnection('www.cwi.nl:80')
>>> h3 = http.client.HTTPConnection('www.cwi.nl', 80)
  
The constructor returns an HTTPConnection object that represents the current HTTP connection. You can then call this object's request method to request a page. Its four parameters, method, url, body, and headers, are easy to understand and not described in detail here. Note that headers is a dict: store the header fields and their values in it as key: value pairs. If you want to send cookies, simply add a Cookie entry to headers.
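For instance, a GET request to the forum that carries custom headers, including a cookie, might be sketched like this (the User-Agent and cookie values below are just placeholders):

>>> import http.client
>>> conn = http.client.HTTPConnection('bbs.wps.cn')
>>> headers = {'User-Agent': 'Mozilla/5.0', 'Cookie': 'sessionid=xxxx'}   # placeholder values
>>> conn.request('GET', '/thread-22351621-1-1.html', headers=headers)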
After sending the request, use the getresponse method to obtain the response. getresponse returns an HTTPResponse object, representing one HTTP response; it contains the response headers, body, status code, and so on. Use the HTTPResponse read method to read the HTML page. Note that in Python 3, read returns a bytes object, which you must decode before matching it with the regular expressions below.
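Continuing the sketch above, reading the response might look like this (assuming the page is UTF-8 encoded, as the forum pages here are):

>>> res = conn.getresponse()        # an http.client.HTTPResponse object
>>> res.status, res.reason          # status code and reason phrase, e.g. (200, 'OK')
>>> html = res.read()               # bytes in Python 3
>>> text = html.decode('utf-8')     # decode before regex matching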
  
With the page content in hand, you can parse it. Python has many libraries for parsing HTML, such as BeautifulSoup, PyQuery, and regular expressions. Here I chose regular expressions. For the regular expression syntax itself, see these two articles: the Python Regular Expression operation guide and a 30-minute getting-started tutorial on regular expressions. I will mainly cover the usage of Python's regular expression module, re.
The re module can be used in two ways, with similar effect: call the functions in re directly, or compile the pattern into a regex object with re.compile and then use that object's methods. The latter is a little more efficient. The general process is as follows:
Obtain a match object with the search function, then call the match object's group/groups methods to retrieve the matched groups.
You can also use findall to search directly for all occurrences of a pattern.
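A minimal sketch of both styles, reusing the mailto pattern from the full code below on a hypothetical line of forum HTML:

>>> import re
>>> pattern = re.compile(r'(?<=<a href="mailto:)[\w\W]*(?=">)')
>>> line = '<a href="mailto:someone@example.com">someone@example.com</a>'   # illustrative fragment
>>> match = pattern.search(line)     # compiled-object style
>>> match.group(0)
'someone@example.com'
>>> re.findall(pattern, line)        # module-function style
['someone@example.com']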
  
Because a single thread is too slow, I use multithreading (the threading module) to speed things up. threading.Thread is used to create a thread. In the threading.Thread constructor, target is the worker function object containing the thread's execution logic, args is a tuple of positional arguments passed to the worker function, and kwargs can also be used to pass arguments to the worker as a dict of parameter name: value pairs.

Start a created thread with its start method; the join method waits for the thread to exit. Because this program needs no synchronization between threads, there are no mutexes, semaphores, or other synchronization mechanisms; see the official documentation for more on those.
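A minimal sketch of creating, starting, and waiting for a thread, using a hypothetical greet function purely for illustration:

>>> import threading
>>> def greet(name, punctuation='!'):          # hypothetical worker function
...     print('Hello, %s%s' % (name, punctuation))
...
>>> t = threading.Thread(target=greet, args=('world',), kwargs={'punctuation': '?'})
>>> t.start()                                  # run greet in a new thread
>>> t.join()                                   # block until the thread exits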

Code:

#!/usr/bin/env python3
import http.client
import re
import threading

# Email addresses appear in the forum pages as <a href="mailto:...">
pattern = re.compile(r'(?<=<a href="mailto:)[\w\W]*(?=">)')

def Parse(data, fp):
    # Scan each line of the page for a mailto link and record the address
    lines = data.split('\r\n')
    for line in lines:
        match = pattern.search(line)
        if match:
            email = match.group(0)
            print(email)
            fp.write(email + '\n')

def worker(begin, end, No):
    # Fetch pages [begin, end) of the thread and write found emails to a per-thread file
    print('Thread %d initiated.' % No)
    conn = http.client.HTTPConnection('bbs.wps.cn')
    url = '/thread-22351621-%d-1.html'
    index = begin
    fp = open('email%02d.txt' % No, 'a')
    while index < end:
        conn.request('GET', url % index)
        try:
            res = conn.getresponse()
        except Exception:
            continue
        print('Thread %d: %d, %s' % (No, res.status, url % index))
        if res.status == 200:
            Parse(res.read().decode('utf8'), fp)
        index += 1
    fp.close()

# Split 180 pages evenly across the worker threads
total = 10
step = 180 // total
threads = [threading.Thread(target=worker, args=(i * step, (i + 1) * step, i))
           for i in range(total)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()