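# encoding=gb2312
# Python 2 script: fetch a page and pull out the text inside <strong> tags.
import urllib
import re

def gethtml(url):
    # Download the page and return its raw bytes.
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    # Collect everything that sits between <strong> and </strong>.
    reg = r'<strong>(.*)</strong>'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

html = gethtml('http://yjs.teacher.com.cn/dsjyss/jswk11104/info/Kcjjx.htm')
imglist = getimg(html)
print imglist  # printed this way, the output is a pile of escape codes
print imglist[0]
#for img in imglist:
#    print img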
The above is a simple example from my study of Python crawlers. I modified a widely circulated example that downloads Baidu images, and used it to crawl the course names of some online courses.
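The extraction step can be tried without touching the network. This is only a minimal sketch: the HTML snippet `SAMPLE_HTML` is made up, and the helper name `extract_strong` is hypothetical, neither comes from the script above. It uses a non-greedy `(.*?)` so that two `<strong>` tags on the same line are not merged into one match.

```python
# -*- coding: utf-8 -*-
import re

# Made-up sample page for illustration; this is not the real course page.
SAMPLE_HTML = (
    u"<p><strong>Topic 1</strong> and <strong>Topic 2</strong></p>\n"
    u"<strong>网络技术</strong>"
)

# Non-greedy group: stops at the first closing tag, not the last one on the line.
STRONG_RE = re.compile(u"<strong>(.*?)</strong>")


def extract_strong(html):
    """Return the text inside every <strong>...</strong> pair in html."""
    return STRONG_RE.findall(html)


titles = extract_strong(SAMPLE_HTML)
```

With the greedy `(.*)` pattern, the first line of the sample would come back as a single match that still contains the inner tags.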
But I ran into an interesting problem, as the code shows.
If I print the regex result directly, I get:
['\xe7\xbd\x91\xe7\xbb\x9c\xe6\x8a\x80\xe6\x9c\xaf\xe4\xb8\x8e\xe5\xa4\x9a\xe5\xaa\x92\xe4\xbd\x93\xe6\x8a\x80\xe6\x9c\xaf', 'Network Technology and multimedia Technology', '1. \xe7\x9f\xa5\xe8\xaf\x86\xe4\xb8\x8e\xe6\x8a\x80\xe8\x83\xbd', '2. \xe8\xbf\x87\xe7\xa8\x8b\xe4\xb8\x8e\xe6\x96\xb9\xe6\xb3\x95', '3. \xe6\x83\x85\xe6\x84\x9f\xe6\x80\x81\xe5\xba\xa6\xe4\xb8\x8e\xe4\xbb\xb7\xe5\x80\xbc\xe8\xa7\x82', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xb8\x80\xef\xbc\x9a\xe5\xa4\x9a\xe5\xaa\x92\xe4\xbd\x93\xe6\x8a\x80\xe6\x9c\xaf1', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xba\x8c\xef\xbc\x9a\xe5\xa4\x9a\xe5\xaa\x92\xe4\xbd\x93\xe8\xaf\xbe\xe4\xbb\xb6\xe8\xae\xbe\xe8\xae\xa1', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xb8\x89\xef\xbc\x9a\xe5\xa4\x9a\xe5\xaa\x92\xe4\xbd\x93\xe8\xaf\xbe\xe4\xbb\xb6\xe5\xbc\x80\xe5\x8f\x91', '\xe4\xb8\x93\xe9\xa2\x98\xe5\x9b\x9b\xef\xbc\x9a \xe7\xbd\x91\xe7\xbb\x9c\xe8\xaf\xbe\xe7\xa8\x8b\xe8\xae\xbe\xe8\xae\xa1', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xba\x94\xef\xbc\x9a \xe7\xbd\x91\xe7\xbb\x9c\xe8\xaf\xbe\xe7\xa8\x8b\xe5\xbc\x80\xe5\x8f\x91', '\xe4\xb8\x93\xe9\xa2\x98\xe5\x85\xad\xef\xbc\x9a\xe5\xa4\x9a\xe5\xaa\x92\xe4\xbd\x93\xe6\x8a\x80\xe6\x9c\xaf2', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xb8\x83\xef\xbc\x9a\xe6\xa0\xa1\xe5\x9b\xad\xe5\xb1\x80\xe5\x9f\x9f\xe7\xbd\x91\xe7\x9a\x84\xe6\x9e\x84\xe5\xbb\xba', '\xe4\xb8\x93\xe9\xa2\x98\xe5\x85\xab\xef\xbc\x9a\xe7\xbd\x91\xe7\xbb\x9c\xe6\x9c\x8d\xe5\x8a\xa1\xe5\x99\xa8\xe9\x85\x8d\xe7\xbd\xae\xe4\xb8\x8e\xe7\xae\xa1\xe7\x90\x86', '\xe4\xb8\x93\xe9\xa2\x98\xe4\xb9\x9d\xef\xbc\x9a\xe7\xbd\x91\xe7\xbb\x9c\xe8\xae\xbe\xe5\xa4\x87\xe4\xba\x92\xe8\xbf\x9e', '\xe4\xb8\x93\xe9\xa2\x98\xe5\x8d\x81\xef\xbc\x9a\xe7\xbd\x91\xe7\xbb\x9c\xe5\xae\x89\xe5\x85\xa8']
But if I loop over the list, or use print imglist[0], the output is Chinese.
This problem has bothered me for a day, and I still have not solved it.
Why doesn't printing the result directly output Chinese? It's strange.
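One likely explanation, offered as a sketch rather than a definitive answer: printing a list shows the `repr()` of each element, and the `repr()` of a byte string escapes every non-ASCII byte as `\xNN`. Printing a single element writes the bytes themselves, and the terminal then renders them as Chinese. The example below reuses a few bytes from the output above and assumes they are UTF-8; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# A few bytes copied from the list output above (assumed to be UTF-8).
raw_items = [b"\xe7\xbd\x91\xe7\xbb\x9c", b"1. \xe7\x9f\xa5\xe8\xaf\x86"]

# Printing a list goes through repr() for each element, so the bytes stay escaped.
as_list = repr(raw_items)
print(as_list)

# Decoding each element turns the bytes into text.
decoded = [item.decode("utf-8") for item in raw_items]

# Joining the decoded items gives one readable string instead of a list repr.
readable = u", ".join(decoded)
```

On a UTF-8 terminal, `print(readable)` should then show the Chinese text on one line, which is the same effect as looping over the list and printing each item.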
A small question about printing Chinese in Python