Knowledge points involved
1. Crawling data
2. Paginated crawling
Pattern analysis
1. Crawling the data: inspecting the page shows that each search result is a div carrying a data-tools attribute
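The data-tools attribute holds a JSON string containing the result's title, which can be decoded directly with the json module. A minimal sketch of the idea, using a hypothetical HTML fragment in place of a real Baidu page:

```python
import json
import re

# Hypothetical fragment of a result page: the div carries a data-tools
# attribute whose value is a JSON object with the result title.
html = '<div class="result" data-tools=\'{"title": "example result", "url": "http://example.com"}\'></div>'

# Pull out the attribute value and decode it as JSON.
match = re.search(r"data-tools='(.*?)'", html)
d = json.loads(match.group(1))
print(d['title'])  # -> example result
```

The full script below does the same extraction with BeautifulSoup instead of a regular expression.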
2. Pagination analysis
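Baidu paginates through the pn query parameter, which is the index of the first result on the page: page N starts at result N*10. A small sketch of how the paginated URL can be built (`build_search_url` is a hypothetical helper name):

```python
# Baidu's pn parameter is the zero-based index of the first result,
# so page N (counting from 0) corresponds to pn = N * 10.
def build_search_url(keyword, page):
    return 'http://www.baidu.com/s?wd={}&pn={}'.format(keyword, page * 10)

print(build_search_url('test', 0))  # first page, pn=0
print(build_search_url('test', 2))  # third page, pn=20
```

The script below achieves the same effect by string concatenation, appending '0' to the page number.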
Code
```python
import requests
from bs4 import BeautifulSoup
import re
import json
import jieba

# Fetch one page of Baidu search results for a keyword.
def getKeywordResult(keyword, pagenum):
    # Appending '0' to the page number yields pn = page * 10.
    url = 'http://www.baidu.com/s?wd=' + keyword + '&pn=' + pagenum + '0'
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""

# Parse the page and extract the result titles.
def parserLinks(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Each result div carries a data-tools attribute holding a JSON string.
    for div in soup.find_all('div', {'data-tools': re.compile('title')}):
        data = div.attrs['data-tools']
        d = json.loads(data)
        links.append(d['title'])
        words_all.append(d['title'])
    return links, words_all

# Word-frequency statistics over all collected titles.
def words_ratio(words_all):
    words = []
    for title in words_all:
        for tmp_word in jieba.lcut(title):  # segment each title into words
            words.append(tmp_word)
    counts = {}
    for word in words:
        if len(word) == 1:  # skip single characters
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(30):  # print the 30 most frequent words
        word, count = items[i]
        print("{0:<10}{1:>5} ratio: {2}".format(word, count, count / len(words)))

def main():
    for pagenum in range(0, 50):
        html = getKeywordResult('老张', str(pagenum))  # search keyword and page count
        ls, words_all = parserLinks(html)
        count = pagenum + 1
        for title in ls:
            print("[{:^3}]{}".format(count, title))
    words_ratio(words_all)

if __name__ == '__main__':
    words_all = []
    main()
```
Results
Follow-up thinking
The code is very simple; once you have mastered it, you will know how to extend it. The data has now been crawled down, but it is messy and still needs manual analysis. I call this kind of data naked data; ideal data is readable and relevant, and I call that gold data.
This conversion and analysis process involves two questions:
1. How do we make the data readable?
You can delete bad entries from the frequency dictionary with the del statement.
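For instance, junk tokens (HTML remnants, URL fragments) can be removed from the counts dictionary before reporting. A small sketch; the counts and the noise blacklist here are hypothetical examples:

```python
# Hypothetical word-frequency dictionary produced by the crawl.
counts = {'python': 12, 'baidu': 9, 'nbsp': 7, 'www': 5}

# Hypothetical blacklist of tokens that add noise rather than meaning.
noise = ['nbsp', 'www']
for word in noise:
    if word in counts:
        del counts[word]  # drop the unreadable entry

print(counts)  # only meaningful words remain
```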
2. How do we make the data relevant?
Analyze the naked data a second time: group related word items together, then run the statistics again.
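One way to sketch that second pass is to fold related terms onto a single canonical key before re-counting, so variants of the same concept are aggregated. The alias table and counts below are hypothetical examples:

```python
# Hypothetical alias table mapping variant terms to a canonical term.
aliases = {'py': 'python', 'python3': 'python'}

# Hypothetical first-pass word counts from the naked data.
raw_counts = {'python': 5, 'py': 3, 'python3': 2, 'crawler': 4}

merged = {}
for word, count in raw_counts.items():
    key = aliases.get(word, word)        # fold aliases into the canonical term
    merged[key] = merged.get(key, 0) + count

print(merged)  # {'python': 10, 'crawler': 4}
```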
"Data Analysis": analyzing Baidu search keyword frequency with Python