Selenium + PhantomJS can be used to crawl web content that is loaded via AJAX.
1. pip install selenium
2. Download PhantomJS (it is a separate download and is not installed with pip)
The homepage of Wuhan University of Science and Technology (wust.edu.cn) has a block of web content that is loaded asynchronously with JavaScript.
The idea for grabbing this block is to first determine whether it has finished loading, and then crawl it with Selenium.
To judge whether loading has completed, check whether a link containing 'school-enterprise cooperation' has appeared.
(PS: Strictly speaking, it would be more reasonable to wait for the last-loaded element inside the asynchronous content, but in this example that element has no distinguishing features to select on.)
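The "wait until it has loaded" idea can be sketched without a browser. The helper below, wait_until, and the WaitTimeout exception are hypothetical names used only for illustration; they mimic the polling approach that Selenium's WebDriverWait takes (check a condition repeatedly, return as soon as it is truthy, give up after a timeout). The simulated "page" is an assumption made so the example can run on its own.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Simulated page: the asynchronous block "appears" on the third check.
checks = {"count": 0}


def link_has_appeared():
    checks["count"] += 1
    if checks["count"] >= 3:
        return "school-enterprise cooperation"
    return None


found = wait_until(link_has_appeared, timeout=2.0, poll=0.01)
print(found)
```

Unlike a fixed time.sleep, the loop stops as soon as the marker shows up, so a fast page costs almost no waiting and a slow page still gets the full timeout.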
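    # coding: utf-8
    # Written against the Selenium 2/3 API: webdriver.PhantomJS and the
    # find_element_by_* helpers were removed in Selenium 4.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS(executable_path='C://python27//scripts//phantomjs-2.1.1-windows//bin//phantomjs')
    driver.get("http://www.wust.edu.cn/default.html")

    try:
        # Wait up to 10 s for a link whose text contains the marker string.
        # presence_of_element_located takes a single (by, value) tuple.
        # Link text as given in this article; use the text that actually appears on the page.
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'School-Enterprise cooperation')))
    finally:
        # This part runs whether or not the wait succeeded.
        ul = driver.find_element_by_id('infocont_137575764138965434_148645613741998292')
        status = 'False:'
        # find_elements_* returns an empty list (not None) when nothing matches.
        lis = ul.find_elements_by_tag_name('li')
        if not lis:
            print('Query failed')
        for li in lis:
            text = li.find_element_by_tag_name('a').text
            # Some li entries exist in the DOM but have no visible text.
            if text != '':
                status = 'True:'
            print(status + text)
    # quit() closes the window and also ends the PhantomJS process.
    driver.quit()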
The script performs the following steps:
Wait until a link containing the string "school-enterprise cooperation" is present;
Find the ul tag with ID infocont_137575764138965434_148645613741998292;
Find the li tags inside that ul tag;
Find the a tag in each li tag and extract the text of the a tag.
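The ul, li, a walk in the steps above can be tried on a static snippet without starting a browser. This is only a sketch: the markup in SAMPLE is a made-up, simplified stand-in for the real block (only the ul ID is taken from the article), and it uses the standard library's xml.etree.ElementTree instead of Selenium, so it shows the traversal order rather than the Selenium calls themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the asynchronously loaded block.
SAMPLE = """
<div>
  <ul id="infocont_137575764138965434_148645613741998292">
    <li><a href="/a.html">School-enterprise cooperation</a></li>
    <li><a href="/b.html">Campus news</a></li>
    <li><a href="/c.html"></a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# Step 1: find the ul by its ID.
ul = root.find(".//ul[@id='infocont_137575764138965434_148645613741998292']")

titles = []
if ul is not None:
    # Step 2: the li tags directly inside the ul.
    for li in ul.findall("li"):
        # Step 3: the a tag inside each li, then its text.
        a = li.find("a")
        text = (a.text or "").strip() if a is not None else ""
        if text:  # skip entries that carry no text
            titles.append(text)

print(titles)
```

The third li is deliberately empty to show why each extracted text is checked before it is used.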
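It is worth noting that:
On Windows, declare the source encoding on the first line (Python 2 needs this declaration whenever the file contains non-ASCII text);
Use WebDriverWait to judge the page load status; it works better than time.sleep because it returns as soon as the condition is met instead of always waiting a fixed time;
The asynchronous load may return more li tags than are displayed: they can be seen when inspecting the element but are not shown on the page, so you need to check text != '';
In this example the tags are located level by level (ul, then li, then a) rather than directly across levels.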
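Two of the notes above (wait instead of sleeping, and skip empty entries) can be combined into one custom wait condition. The class below is a sketch and not part of the original script: LinksHaveText, FakeLink and FakeDriver are hypothetical names, and the fake classes exist only so the example runs without a browser. It also assumes a descendant CSS selector as one possible way to reach the a tags in a single lookup, as an alternative to the level-by-level search.

```python
class LinksHaveText(object):
    """Wait condition: truthy once at least one matched link has non-empty text."""

    def __init__(self, css_selector):
        self.css_selector = css_selector

    def __call__(self, driver):
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        links = driver.find_elements("css selector", self.css_selector)
        texts = [link.text for link in links if link.text != ""]
        return texts or False


# Stand-ins so the sketch runs without a browser (hypothetical classes).
class FakeLink(object):
    def __init__(self, text):
        self.text = text


class FakeDriver(object):
    def __init__(self, texts):
        self._texts = texts

    def find_elements(self, by, value):
        return [FakeLink(t) for t in self._texts]


condition = LinksHaveText("#infocont_137575764138965434_148645613741998292 li a")

print(condition(FakeDriver(["", ""])))
print(condition(FakeDriver(["School-Enterprise cooperation", ""])))
```

With a real driver, the same object could be passed as WebDriverWait(driver, 10).until(condition), since until accepts any callable that takes the driver and returns the callable's first truthy result.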
A first experience with Selenium + PhantomJS