Python makes it very convenient to write a web crawler. Below I first post a piece of code; once the URL and a few settings are filled in, it can fetch some data directly:
Programming environment: Sublime Text
# Import the packages we need; remember to install BeautifulSoup (bs4) first
from bs4 import BeautifulSoup
import urllib2

# URL of the site to scrape. timeout is the time limit in seconds: if the data
# cannot be fetched within that time the call gives up, as a protective measure
pagesource = urllib2.urlopen("http://www.ly.com/scenery/", timeout=8)
# Read the page data
sourcedata = pagesource.read()
sitesoup = BeautifulSoup(sourcedata, "html.parser")
# The classes the data belongs to
selectkeys = sitesoup.find_all("div", attrs={"class": "S_com_detail"})
selectkeyz = sitesoup.find_all("span", attrs={"class": "S_dis"})
# Output: the first loop prints several fields of each item at once through a
# format string; the second loop prints a single field per item
for plink in selectkeys:
    print "%s,%s,%s" % (plink.find_all("p")[0].find(text=True),
                        plink.find_all("i")[0].find(text=True),
                        plink.find_all("b")[0].find(text=True))
for blink in selectkeyz:
    print blink.find(text=True)
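The script above targets Python 2 (urllib2 and the print statement). On Python 3, where urllib2 was folded into urllib.request, the fetch step could be sketched roughly as follows. The helper name fetch_html and the UTF-8 fallback are illustrative assumptions, not part of the original script.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, timeout=8):
    """Return the page at `url` as text, or None if it cannot be fetched.

    Hypothetical helper for illustration; not part of the original script.
    """
    try:
        # timeout is in seconds, as in the original urllib2.urlopen call
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Assumption: fall back to UTF-8 when no charset is declared
            charset = response.headers.get_content_charset() or "utf-8"
    except (urllib.error.URLError, socket.timeout) as exc:
        print("Could not fetch %s: %s" % (url, exc))
        return None
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.ly.com/scenery/") would then stand in for the urlopen and read lines, and the returned text can be passed to BeautifulSoup exactly as before.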
To fetch data from a different website, the parts of the program that need to be changed are as follows:
The steps are as follows:
Step one: get the target URL. Open the site you want, copy its URL directly, and put it in box 1 of the program.
Step two: right-click on the web page and choose Inspect to open the following interface
On the right is the page source; the data we need has to be found in this source
Step three: locate the data we want to download:
Click this button:
Then click on the page the data you want to download:
The page source now jumps to the place where this data sits:
Step four: find the class that the data belongs to
From the located position, look upward through the source; the first class you meet is the class this data belongs to
The div that comes before class and the detail value that comes after it correspond, respectively, to what is written before and after "class" in the program
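To make the correspondence in step four concrete, here is a small sketch on a made-up HTML fragment. The fragment and the class name demo_detail are hypothetical placeholders, not the real markup of the site above; only the BeautifulSoup calls mirror the program.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the page source seen in the browser
sample_html = """
<div class="demo_detail"><p>Scenic spot A</p></div>
<div class="other"><p>Unrelated block</p></div>
<div class="demo_detail"><p>Scenic spot B</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tag name written before "class" in the source -> first argument ("div")
# Class value written after "class" in the source -> value inside attrs
blocks = soup.find_all("div", attrs={"class": "demo_detail"})

# Collect the text of the first <p> in every matching block
names = [block.find("p").get_text() for block in blocks]
print(names)
```

Only the two div elements carrying the chosen class are matched; the block with a different class is skipped.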
Step five: find the unique identifier of this data
The data we located in the figure (i.e. 109) sits between <b> and </b>, so b is the unique identifier for this data. In this page's source, every value of this kind (the price) is marked with b, which corresponds to the contents of the last red box in the program.
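Step five can be sketched in the same way: once the enclosing block is selected, the tag that wraps the value (b for the price) picks it out. The fragment below is again hypothetical, as are the class name demo_detail and the surrounding text; only the value 109 wrapped in b comes from the figure.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the <b>109</b> wrapping matches the figure
sample_html = """
<div class="demo_detail">
  <p>Scenic spot A</p><i>Some city</i><b>109</b>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

prices = []
for block in soup.find_all("div", attrs={"class": "demo_detail"}):
    price_tag = block.find("b")  # the tag that marks the price
    if price_tag is not None:  # skip blocks that carry no price
        prices.append(price_tag.get_text(strip=True))

print(prices)
```

Using find("b") with a None check, rather than find_all("b")[0], is a small design choice here: it avoids an IndexError when a block happens to lack the tag.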
That completes the procedure. If anything is unclear or wrong, please point it out.