Getting started with Python: capturing web page data and writing a simple crawler
These are beginner-level notes, good for just getting started.
1. Use feedparser:
Tip: use Universal Feed Parser to handle RSS
http://www.ibm.com/developerworks/cn/xml/x-tipufp.html
Visit feedparser.org to learn more about the Universal Feed Parser; the site also hosts downloads and documentation.
Feedparser downloads:
http://code.google.com/p/feedparser/downloads/list
2. In addition, you need to add a UTF-8 BOM to the output file, which means writing raw hexadecimal bytes from Python:
http://linux.byexamples.com/archives/478/python-writing-binary-file/
Writing hexadecimal bytes in Python:
file.write("\x5F\x9D\x3E")
file.close()
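The snippet above is Python 2-style. In Python 3 the same idea uses bytes literals, and the standard library's codecs module already provides the three BOM bytes; the file name here is arbitrary:

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 BOM: b"\xef\xbb\xbf"
with open("bom_test.txt", "wb") as f:  # "wb": write raw bytes
    f.write(codecs.BOM_UTF8)
    f.write("hello".encode("utf-8"))

# Read it back: the BOM precedes the payload.
with open("bom_test.txt", "rb") as f:
    data = f.read()
print(data)  # b'\xef\xbb\xbfhello'
```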
3. To debug the output file, change the open mode to "w".
Python code
import urllib
import sys
import re
from feedparser import _getCharacterEncoding as enc

class TagParser:
    def __init__(self, value):
        self.value = value

    def get(self, start, end):
        regx = re.compile(r'<' + start + r'.*?>.*</' + end + r'>')
        return re.findall(regx, self.value)

if __name__ == "__main__":
    baseurl = "http://data.book.163.com/book/section/000BAfLU/000BAfLU"
    f = open("test_01.txt", "w")
    f.write("\xef\xbb\xbf")  # UTF-8 BOM
    # for ndx in range(0, 56):
    for ndx in range(0, 1):
        url = baseurl + str(ndx) + ".html"
        print "get content from " + url
        src = urllib.urlopen(url)
        text = src.read()
        f1 = open("tmp_" + str(ndx) + ".txt", "w")
        f1.write(text)
        f1.close()
        encoding = enc(src.headers, text)[0]
        tp = TagParser(text)
        title = tp.get('h1 class="f26s tC"', 'h1')
        article = tp.get('p class="ti2em"', 'p')
        t = re.sub(r'</.+>', '\n', title[0])
        t = re.sub(r'<.+>', '\n', t)
        data = t
        c = ""
        for p in article:
            pt = re.sub(r'</p>', '\n', p)
            c += pt
        c = re.sub(r'<.+>', '\n', c)
        data += c
        data = data.decode(encoding)
        f.write(data.encode('utf-8', 'ignore'))
    f.close()
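The script above is Python 2 (urllib.urlopen, the print statement, bytes/str mixing). A rough Python 3 sketch of the same TagParser idea, run against an inline HTML snippet instead of the live 163.com URL; the class names below are made-up sample markup, and the inner `.*?` is made non-greedy so adjacent tags match separately:

```python
import re

class TagParser:
    def __init__(self, value):
        self.value = value

    def get(self, start, end):
        # Naive tag matching, as in the original: from the opening tag
        # (with attributes) to the closing tag, non-greedily.
        regx = re.compile(r'<' + start + r'.*?>.*?</' + end + r'>')
        return re.findall(regx, self.value)

# Made-up sample page standing in for the downloaded text.
html = ('<h1 class="f26s tC">Chapter One</h1>'
        '<p class="ti2em">First paragraph.</p>'
        '<p class="ti2em">Second paragraph.</p>')

tp = TagParser(html)
title = tp.get('h1 class="f26s tC"', 'h1')
paras = tp.get('p class="ti2em"', 'p')
print(title)       # ['<h1 class="f26s tC">Chapter One</h1>']
print(len(paras))  # 2
```

Regex-based extraction like this is fragile; for anything beyond a single known page layout, an HTML parser (see the Beautiful Soup note below) is the safer route.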
How can I extract the main content of a captured webpage in Python?
Use Beautiful Soup.
There is too much specific code to reproduce here; see the link.
Reference: www.crummy.com/...h.html
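A minimal Beautiful Soup sketch (bs4, third-party). The markup here is a made-up sample; real content extraction needs selectors tailored to the target site:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="content"><h1>Story</h1><p>Once upon a time.</p></div>
</body></html>
"""

# html.parser is the stdlib backend; lxml is a faster alternative.
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")
print(content.h1.get_text())         # Story
print(content.find("p").get_text())  # Once upon a time.
```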
Python: capturing webpage data
s = the captured webpage content
Try the following four lines, one by one (by the way, are you from Minhang Tech?):
s = s.decode("utf8")
s = s.decode("gbk")
s = s.encode("utf8")
s = s.encode("gbk")
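Rather than running those guesses by hand, candidate encodings can be tried in order with try/except. This is a crude Python 3 sketch of that trial-and-error idea (the function name is made up; the third-party chardet library is the more robust route):

```python
def guess_decode(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in turn; return (text, encoding).

    Mirrors the manual decode attempts above; raises if nothing fits.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise UnicodeDecodeError("unknown", raw, 0, len(raw),
                             "no candidate encoding fit")

# GBK-encoded bytes for the Chinese word for "China" (sample data).
raw = "\u4e2d\u56fd".encode("gbk")
text, used = guess_decode(raw)
print(used)  # gbk
```

Because GBK bytes are usually not valid UTF-8, the UTF-8 attempt fails first and the loop falls through to GBK.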