First, the libraries you need:
urllib, urllib2 ----- build access requests
sys ----- Python standard library
BeautifulSoup ----- parse crawl results
First, build access requests
In Python, urllib.urlencode can convert a dictionary of key-value pairs into a query string of the form a=1&b=2, for example:
from urllib import urlencode
data = {'ie': 'utf-8', 'word': 'test'}
print data
print urlencode(data)
Baidu search keywords can be Chinese, and Chinese parameters must be transcoded before being placed in a URL; urlencode performs this encoding for you.
quote() in the urllib library can also encode Chinese text for use in a URL.
sys.stdin.encoding reports the current environment's encoding (note: 'cp936' is actually GBK).
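The snippets above are Python 2. In Python 3 these helpers moved into the urllib.parse module; a minimal sketch of the same urlencode and quote calls, assuming Python 3:

```python
# Python 3 sketch: urlencode and quote now live in urllib.parse
import sys
from urllib.parse import urlencode, quote

data = {'ie': 'utf-8', 'word': 'test'}
print(urlencode(data))      # joins pairs into key=value&key=value form

# quote() percent-encodes a single string, e.g. Chinese text for a URL
print(quote('程序员'))

# the interpreter's stdin encoding (may be None when stdin is piped)
print(sys.stdin.encoding)
```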
Finally, we can build the URL of the search page and use urllib2.urlopen(url).read() to get the content of the web page.
import sys
import urllib
import urllib2

question_word = "Programmer"
data = {'wd': question_word, 'ie': 'utf-8'}
data = urllib.urlencode(data)
url = "http://www.baidu.com/s" + '?' + data
htmlpage = urllib2.urlopen(url)
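For readers on Python 3, where urllib2 was merged into urllib.request, the same request looks roughly like this (a sketch; the search URL and parameters are the ones used above, and the actual fetch is left commented out):

```python
# Python 3 sketch of the same request; urllib2 became urllib.request
from urllib.parse import urlencode
from urllib.request import urlopen

question_word = "Programmer"
params = urlencode({'wd': question_word, 'ie': 'utf-8'})
url = "http://www.baidu.com/s" + '?' + params
print(url)

# htmlpage = urlopen(url)      # uncomment to actually fetch the page
# content = htmlpage.read()
```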
Second, get the search result links
from bs4 import BeautifulSoup
import requests

outfile = open('csv_test.txt', 'ab+')
content = htmlpage.read()  # htmlpage is the response returned by urllib2.urlopen above
soup = BeautifulSoup(content)
for member in soup('div'):
    # BeautifulSoup returns the class attribute as a list of class names
    if member.get('class') and "result" in member.get('class'):
        for page in member('a'):
            href = page.get('href', None)
            if href is not None:
                # Baidu result links are redirects; requests.get follows them to the real URL
                temp = requests.get(href.rstrip())
                print temp.url
                outfile.write(temp.url)
                outfile.write('\n')
outfile.close()
Looking at the source code of the Baidu results page, you can see that the information we want sits inside <div class="result..."><a href="..."></a>...</div>, so we only need to find the parts of the source containing this markup, and finally save the crawled page links to a txt file.
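The div-filtering logic above can be exercised offline on a small HTML snippet. A sketch using bs4 (the snippet, class names, and URLs below are invented for illustration, modeled on Baidu's result markup):

```python
# Offline sketch of the div-filtering step, using bs4 (pip install beautifulsoup4)
from bs4 import BeautifulSoup

# invented snippet imitating Baidu's <div class="result ..."> markup
html = '''
<div class="result c-container"><a href="http://example.com/1">hit 1</a></div>
<div class="sidebar"><a href="http://example.com/skip">not a result</a></div>
<div class="result"><a href="http://example.com/2">hit 2</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
links = []
for member in soup('div'):
    # the class attribute comes back as a list of class names
    if member.get('class') and "result" in member.get('class'):
        for page in member('a'):
            href = page.get('href', None)
            if href is not None:
                links.append(href.rstrip())

print(links)
```

Only the two divs whose class list contains "result" contribute links; the sidebar div is skipped, mirroring the crawl code above.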