The following is the generated word cloud
My environment: Mac, Anaconda, Python 2.7, plus a variety of Python libraries. First, a word about Anaconda.
Anaconda is a Python distribution for scientific computing. It supports Linux, Mac, and Windows and bundles the commonly used scientific computing packages. It solves two big pain points of official Python. First, it provides package management, which addresses the frequent failures when installing third-party packages on Windows. Second, it provides environment management, similar to virtualenv, which solves the problem of multiple Python versions coexisting and switching between them.
conda is the package and environment management tool that comes with Anaconda, functionally similar to a combination of pip and virtualenv. After a successful installation, conda is added to the environment variables by default, so you can run the conda command directly in a command-line window.
conda's environment management works much like virtualenv's.
# View help
conda -h
# Create an environment named python36 based on Python 3.6
conda create --name python36 python=3.6
# Activate this environment (Linux/Mac)
source activate python36
# Check the Python version again; it should show 3.6
python -V
# Exit the current environment
source deactivate
# Delete the environment
conda remove -n python36 --all
# View all installed environments
conda info -e
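The version check can also be done from inside Python, which helps confirm that the activated environment is the interpreter actually running a script. A minimal sketch; the helper name python_version is made up for illustration:

```python
import sys


def python_version():
    """Return the running interpreter's version as 'major.minor'."""
    return "{}.{}".format(sys.version_info[0], sys.version_info[1])


# Inside an environment created with python=3.6 this is expected to print 3.6.
print(python_version())
```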
conda's package management works much like pip's. Of course, installing packages with pip is fine too.
# Install matplotlib
conda install matplotlib
# View installed packages
conda list
# Update a package
conda update matplotlib
# Delete a package
conda remove matplotlib
Everything is a package in conda. conda itself, the Python interpreter, and Anaconda are all treated as packages, so these three can be updated just like ordinary third-party packages. For example, conda update conda updates conda itself.
Anaconda's default mirror is hosted outside China, so installing packages with conda can be very slow. A domestic mirror is currently available from Tsinghua University. Modify ~/.condarc (Linux/Mac):
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
show_channel_urls: true
If installing packages with conda is still very slow, consider using pip instead, and likewise switch pip's index to a domestic mirror; the Douban mirror is fast. Modify ~/.pip/pip.conf (Linux/Mac):
[global]
trusted-host = pypi.douban.com
index-url = http://pypi.douban.com/simple
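pip.conf uses INI syntax, so a quick way to catch a typo before running pip is to parse the text with Python's standard config parser. The sketch below is only an illustration: the helper name read_pip_settings is hypothetical, and the config text is embedded as a string instead of being read from ~/.pip/pip.conf.

```python
import io

try:
    from configparser import RawConfigParser  # Python 3
except ImportError:
    from ConfigParser import RawConfigParser  # Python 2.7

# Same settings as the pip.conf in this post, embedded for the example.
PIP_CONF = u"""[global]
trusted-host = pypi.douban.com
index-url = http://pypi.douban.com/simple
"""


def read_pip_settings(text):
    """Parse INI-style text and return the [global] section as a dict."""
    parser = RawConfigParser()
    buf = io.StringIO(text)
    if hasattr(parser, "read_file"):
        parser.read_file(buf)  # Python 3
    else:
        parser.readfp(buf)  # Python 2.7
    return dict(parser.items("global"))


settings = read_pip_settings(PIP_CONF)
print(settings["index-url"])
```

If the section header or a key is misspelled, the lookup fails immediately instead of pip silently falling back to the default index.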
With the environment set up, you can happily start on the data analysis.
Crawl dynamic content
Because the content of a dynamic page is loaded dynamically, we need to keep scrolling so the page loads more data, then switch to the frame that holds the content (there may be no frame at all, so check the specific page), get the page source, parse it into a tree that XPath can query, and read the data.
import time
from lxml import etree

# driver is an already opened selenium WebDriver instance.
# Scroll down so that the browser loads the dynamically loaded content.
# i runs from 1 up to (but not including) 6, so five batches of data are loaded.
for i in range(1, 6):
    height = 20000 * i  # on pass i, scroll a further 20000*i pixels
    strWord = "window.scrollBy(0," + str(height) + ")"
    driver.execute_script(strWord)
    time.sleep(4)

# Many web pages are made up of multiple <frame> or <iframe> elements, and WebDriver defaults to the outermost frame,
# so you need to switch into the right frame, otherwise the page elements below cannot be found.
driver.switch_to.frame("app_canvas_frame")
selector = etree.HTML(driver.page_source)
divs = selector.xpath('//*[@id="msgList"]/li/div[3]')
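To see what the XPath expression selects without opening a browser, here is a small self-contained sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so it runs with no extra installs. ElementTree only accepts well-formed markup and a subset of XPath, whereas lxml's etree.HTML tolerates real-world HTML. The sample markup and the helper name extract_posts are invented for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the page source inside the frame.
SAMPLE = """
<html><body>
<ol id="msgList">
  <li><div>avatar</div><div>name</div><div>first post</div></li>
  <li><div>avatar</div><div>name</div><div>second post</div></li>
</ol>
</body></html>
"""


def extract_posts(markup):
    """Return the text of the third <div> under each <li> of #msgList."""
    root = ET.fromstring(markup.strip())
    nodes = root.findall(".//*[@id='msgList']/li/div[3]")
    return ["".join(node.itertext()).strip() for node in nodes]


posts = extract_posts(SAMPLE)
print(posts)
```

The path reads from left to right: find the element whose id is msgList anywhere in the tree, take each li child, and from each one pick the third div.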
Generate Word Cloud
Generating the word cloud requires these libraries: wordcloud, which generates the word cloud; matplotlib, which renders the word cloud image; and jieba, which segments Chinese text so that it can be displayed.
# coding: utf-8
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba


# Generate the word cloud
def create_word_cloud(filename):
    text = open("{}.txt".format(filename)).read()
    # Segment the text with jieba
    wordlist = jieba.cut(text, cut_all=True)
    wl = " ".join(wordlist)

    # Configure the word cloud
    wc = WordCloud(
        # Background color
        background_color="white",
        # Maximum number of words to display
        max_words=2000,
        # Font file; this one ships with macOS at the usual path
        font_path='/System/Library/Fonts/PingFang.ttc',
        height=1200,
        width=1600,
        # Maximum font size
        max_font_size=100,
        # Number of random states, that is, how many color schemes
        random_state=30,
    )

    myword = wc.generate(wl)  # generate the word cloud
    # Display the word cloud image
    plt.imshow(myword)
    plt.axis("off")
    plt.show()
    wc.to_file('py_book.png')  # save the word cloud to a file


if __name__ == '__main__':
    create_word_cloud('qq_word')
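A word cloud sizes each word by how often it occurs, which is why the segmented words are joined with spaces before being passed to generate. The sketch below only illustrates that counting idea with the standard library. It is not wordcloud's actual implementation, and the helper name word_frequencies and the sample string are invented.

```python
import re
from collections import Counter


def word_frequencies(text, max_words=5):
    """Count space-separated tokens and return the most frequent ones."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return Counter(tokens).most_common(max_words)


sample = "data cloud data word cloud data"
top = word_frequencies(sample)
print(top)  # [('data', 3), ('cloud', 2), ('word', 1)]
```

In the real script, max_words=2000 plays the same role as max_words here: only the most frequent words make it into the picture.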
The complete code is on GitHub.
GitHub address: https://github.com/Jimmy9876/QZone_spider
http://www.aibbt.com/a/22275.html