1. Comparison of three methods (HTMLParser, pyquery, and regular expressions) for extracting URLs from pages crawled with Python.
2. Python raw strings, as the name implies, keep characters literal: the backslash and the characters after it are not escaped. Declare one by prefixing the string with 'r' or 'R'.
3. In re.findall, can a regular expression be used directly without escaping it?
4. The re.X (verbose) and re.I (ignore-case) flags.
5. `.*?` (non-greedy) and `(?:...)` (non-capturing) matching cases.
6. The most common functions for reading keyboard input in Python 2 are raw_input() and input(); prefer the former, which returns a string.
7. print output is readable only after encoding; the stream can be read only once.
8. The object returned by urlopen can be read() only once; the second read() gives an empty str. It is best to unify the Exception types caught and pass the timeout option to urlopen.
9. Use try/except to avoid unexpected errors.
10. Python error "'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)": try decode(); if that fails, try encoding into a byte stream.
11. Comparison of the three methods in note 1 (HTMLParser, pyquery, regular expressions) for capturing hyperlinks (URLs) from a page with Python ==> http://www.myexception.cn/html-css/639814.html
12. To test for emptiness in Python, use `if xx is None` or `if not xx`; the latter applies more widely and works better.
13. Remove '\n' when reading URLs line by line: for line in file.readlines(): line = line.strip('\n')
14. For details on urlsplit, urlparse, and urlunparse: http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html and http://hi.baidu.com/springemp/item/64613c7457731517d0dcb3a7
15. To get a page's status code, use the requests module: http://www.oschina.net/code/snippet_862981_23032
16. "local variable 'xx' referenced before assignment": the variable needs to be declared global.
17. If the URL stays the same but the content redirects (an anti-scraping measure), you can open it directly with urllib and catch the resulting error.
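Several of the notes above (raw strings, re.findall with flags, non-greedy `.*?`, non-capturing `(?:...)`, stripping newlines, and decoding bytes) can be illustrated in one short sketch. This is a minimal Python 3 example; the HTML snippet and URLs are invented for demonstration.

```python
import re

# Invented sample input standing in for a crawled page.
html = '<a href="http://example.com/a">A</a>\n<A HREF="http://example.com/b">B</A>\n'

# Raw string: the backslash is kept literally, so \s is the regex
# whitespace class, not an escaped character (notes 2-3).
pattern = r'href\s*=\s*"(.*?)"'  # .*? is non-greedy, stopping at the first quote

# re.I ignores case (matches both href and HREF); re.X would additionally
# allow whitespace and comments inside the pattern (note 4).
urls = re.findall(pattern, html, re.I)
print(urls)  # ['http://example.com/a', 'http://example.com/b']

# A non-capturing group (?:...) matches but produces no group (note 5):
schemes = re.findall(r'(?:http|https)://', html)
print(len(schemes))  # 2

# Stripping the trailing newline from each line (note 13):
lines = [line.strip('\n') for line in html.splitlines(True)]
print(lines[0])  # <a href="http://example.com/a">A</a>

# Decoding bytes with the correct codec avoids the
# "'ascii' codec can't decode byte" error from note 10:
data = '中文'.encode('utf-8')   # bytes
print(data.decode('utf-8'))     # 中文
```

In Python 3, `bytes.decode()` and `str.encode()` make the direction of each conversion explicit, which removes most of the guesswork behind note 10.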
Ex: http://segmentfault.com/q/1010000000095769 (Nginx configuration)
18. urllib2's geturl() returns the final URL after the jump (a 302 redirect?).
19. How to get a page's status code:
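Notes 17-19 can be demonstrated end to end. Below is a self-contained Python 3 sketch: the Python 2 httplib maps to http.client here, and the local test server that answers with a 302 is an assumption standing in for a real redirecting site. http.client does not follow redirects, so the raw 302 is visible, which is equivalent to `requests.get(url, allow_redirects=False).status_code`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectHandler(BaseHTTPRequestHandler):
    """Stand-in server simulating 'URL unchanged, content jumps' (note 17)."""
    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', '/elsewhere')
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo output quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does NOT follow redirects, so we see the raw status code.
conn = http.client.HTTPConnection('127.0.0.1', server.server_port, timeout=5)
conn.request('GET', '/')
response = conn.getresponse()
print(response.status)  # 302
conn.close()
server.shutdown()
```

With `urllib.request.urlopen` the redirect would be followed automatically and `geturl()` on the response would show the final URL, matching note 18.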
    # Approach 1: urllib
    f = urllib.urlopen("xxxxxx")
    print f.getcode()
    ======================
    # Approach 2: requests (a third-party library; it is not in the
    # Python 2.6/2.7 standard library and must be installed separately)
    import requests
    def getStatusCode(url):
        r = requests.get(url, allow_redirects=False)
        return r.status_code
    ======================
    # Approach 3: httplib
    conn = httplib.HTTPConnection("192.168.1.212")
    # the data could be submitted with GET as well
    conn.request(method="POST", url="/newsadd.asp?action=newnew", body=params, headers=headers)
    # get the response object
    response = conn.getresponse()
    # check whether the submission succeeded
    if response.status == 302:
20. httplib usage: request() sends the request and getresponse() returns the response data.
21. Use get_header to detect whether a remote file exists; check whether the result is None.
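The existence check in note 21 is usually done with a HEAD request, which fetches only the headers and status without downloading the file. A minimal Python 3 sketch, assuming a local stand-in server (the host, port, and file paths are invented for the demo; Python 2's httplib is http.client in Python 3):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FileHandler(BaseHTTPRequestHandler):
    """Stand-in server: pretends exactly one file exists."""
    existing = {'/present.txt'}
    def do_HEAD(self):
        self.send_response(200 if self.path in self.existing else 404)
        self.end_headers()
    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), FileHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def remote_file_exists(host, port, path):
    # HEAD returns status and headers only; a 200 indicates the file exists.
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request('HEAD', path)
    status = conn.getresponse().status
    conn.close()
    return status == 200

print(remote_file_exists('127.0.0.1', server.server_port, '/present.txt'))  # True
print(remote_file_exists('127.0.0.1', server.server_port, '/missing.txt'))  # False
server.shutdown()
```

The same idea works on the response headers themselves: `getresponse().getheader('Content-Length')` returns None when the header is absent, which matches the "check whether it is None" advice in note 21.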