Some time ago, my girlfriend's boss asked her to go through the first 100 pages of the Amazon France top reviewer list, 1,000 reviewers in total, and collect each user's contact information. That means checking 1,000 users one by one and recording what she found, and not every reviewer even lists personal contact details. So here is the problem: this kind of work is tedious and time-consuming when done by hand. It took her two days just to get through the first 30 pages (she had other work to do as well), and she was exhausted. Out of sympathy (it is not easy for a programmer to find a girlfriend, so you have to take care of her), I wanted to do something to help.
My own job is game client development, mainly in Lua and C++; I had never touched web or website-related work. I had only used Python for small work scripts, and once, while looking up Python material online, I saw people writing about using Python to build crawlers. So I wondered: could I also use Python to write a crawler and scrape data from Amazon's website? With that thought, I started learning and hammering out the code you see now.
Environment:
Windows 7
Python 2.7
Python modules used:
urllib2 and urllib, for opening web pages;
re, for regular-expression matching;
codecs, for encoding conversion and saving the data.
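For reference, a minimal sketch of the imports this setup implies (Python 2.7):

# -*- coding: utf-8 -*-
# Imports implied by the module list above (Python 2.7).
import urllib      # open list pages
import urllib2     # open detail pages with custom request headers
import re          # regular-expression matching
import codecs      # write the scraped data out with an explicit encoding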
Features currently implemented:
Crawl the Amazon France top reviewer list, first 100 pages, 1,000 users in total: names, contact details (website link or email), nationality (Amazon France may have buyers from Luxembourg or Switzerland), links to each user's review detail page, and so on.
Starting from the Amazon France top reviewer data, I extended the crawler to the Amazon China and Amazon US top reviewer lists. In principle, simple modifications are enough to crawl the top reviewer list of any country's Amazon site.
Areas to be improved:
After the code was written, I found the crawl to be inefficient: fetching 1,000 records took a long time, and after a few pages or a few dozen pages the program would stop responding, so I could only kill it and restart. Before extending the crawler to Amazon China and the US, I thought of some possible causes:
- The regular expressions have room for optimization, since I had never worked with regular expressions before;
- The Amazon France site is slow to access from China, which slows down the crawl;
- I have not studied Python systematically and am not familiar with some of the syntax or with third-party helper modules.
Those are the three likely causes of the low crawl efficiency I could think of. Later I reused the same code on Amazon China and Amazon US to see how much the second cause affected the overall crawl, and the impact turned out to be huge! With the same bandwidth and hardware, crawling the first 100 pages (1,000 reviewers) of the China and US lists took about half an hour each, while crawling the 1,000 French records took me nearly a whole afternoon (it kept getting stuck; I suspect urllib's open got no response from the site and my program had no handling for that, so I had to keep restarting the crawl). Still, it beats opening pages one by one and copying things down by hand, and at least a program never gets fed up! Compared with the half hour for the China and US data, I honestly cannot judge whether the time I spent on the French run was more than doing it manually or about the same. But as a developer, a program can always be optimized!
Approach:
The crawlers I had seen other people write all worked by opening a web page and matching the information they needed. My approach follows the same idea: use Python's urllib and urllib2 modules to open a page and read its HTML, then use Python's re module to do regular-expression matching to extract the list pages, each user's detail page, and the user's contact information.
Specific implementation:
1. The top reviewer list on Amazon France has about 100 pages, with 10 users per page. Except for the first page, every page's link contains its page number; for example, a link like http://xxx.page23.xxx points to page 23, so the links for all 100 pages can be built with simple string concatenation. That is how I get the link for each page. Sample code follows.
A. Concatenate the page links. The first page's format differs slightly from the rest, so it is handled separately:
if 1 == i:
    url = "http://www.amazon.fr/review/top-reviewers/ref=cm_cr_tr_link_" + str(i)
else:
    url = "http://www.amazon.fr/review/top-reviewers/ref=cm_cr_tr_link_" + str(i) + "?ie=UTF8&page=" + str(i)
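For context, a minimal sketch of wrapping that concatenation in a loop over the 100 list pages; the helper name buildPageUrl is mine, not from the original script:

# A minimal sketch, not the original script; buildPageUrl is a hypothetical helper.
def buildPageUrl(i):
    base = "http://www.amazon.fr/review/top-reviewers/ref=cm_cr_tr_link_" + str(i)
    if 1 == i:
        return base
    return base + "?ie=UTF8&page=" + str(i)

pageUrls = [buildPageUrl(i) for i in range(1, 101)]  # pages 1 to 100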
B. Open the page and read its HTML:
def getHtml2(url):
    try:
        page = urllib.urlopen(url)
        html = page.read()
        return html
    except:
        print "getHtml2 error"
I added the try/except only to see whether it would let me handle pages that fail to open (my guess is that the Amazon France crawl kept getting stuck because the site was not responding), but it had no effect.
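One thing I have not tried: since Python 2.6, urllib2.urlopen accepts a timeout, so the open can be retried instead of blocking forever. A rough sketch of that idea, not part of the script above; the function name is hypothetical:

# A rough sketch (not in the original script): open a URL with a timeout
# and retry a few times instead of blocking forever.
import urllib2

def getHtmlWithRetry(url, retries=3, timeout=30):
    for attempt in range(retries):
        try:
            page = urllib2.urlopen(url, timeout=timeout)  # timeout in seconds
            return page.read()
        except Exception:
            print "open failed, attempt %d of %d" % (attempt + 1, retries)
    return ""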
2. Each page lists 10 users, and clicking a user jumps to that user's detail page. By looking at different detail page links I found they all have a similar form: a variable that is probably the user name, plus a number representing the user's rank in the reviewer list. So if I can get the user name (my guess is that it is a user name, or a unique identifier Amazon stores) and the user's rank in the list, I can piece together the detail page link. Looking at the page source, each user's information appears in a similar form, something like /gp/pdp/profile/xxxx/, so a simple regular expression can extract the xxxx part, which I will call the user's unique identifier for now. That is how I get the detail page link. Example code:
A. Match each user's unique identifier:
reg = r'href="(/gp/pdp/profile/.+?)"><b>'
capturere = re.compile(reg)
results = re.findall(capturere, html)
B. Piece together the detail-page links:
num = (i - 1) * 10 + index
url = "http://www.amazon.fr" + results[index] + "/ref=cm_cr_tr_tbl_" + str(num) + "_name"
Here index is the position of the user among the 10 entries on the page, and num is the user's rank in the reviewer list; since each page holds 10 users, the rank can be computed from the page number and the index.
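Putting A and B together, a minimal sketch of building all 10 detail-page links for one list page; it assumes html holds the list page and i its page number, and any names beyond those above are my own:

# A minimal sketch combining A and B; assumes `html` is one list page and
# `i` its page number. Names other than reg, num and url are hypothetical.
import re

reg = r'href="(/gp/pdp/profile/.+?)"><b>'
results = re.findall(reg, html)              # the profile paths on this page
detailUrls = []
for index, profilePath in enumerate(results):
    num = (i - 1) * 10 + index               # overall rank in the reviewer list
    detailUrls.append("http://www.amazon.fr" + profilePath +
                      "/ref=cm_cr_tr_tbl_" + str(num) + "_name")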
C. Open the detail page and read its HTML:
def getHtml(url):
    headers = {  # masquerade as a browser when crawling
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url, headers=headers)
    page = ""
    try:
        page = urllib2.urlopen(req)
        html = page.read()
        return html
    except:
        print "getHtml error"
As you can see, this differs from the earlier way of opening a page. I found that the HTML obtained with the earlier method was not the same as what the browser shows when you right-click and view the source, and my regular expression for the contact information kept failing to match; I think that difference was the cause. So the detail pages are fetched in the form above. Other people also say this helps avoid Amazon blocking your IP for frequent access.
3. Not every reviewer provides contact information, and when they do, it comes in only two forms: a link to a website (a blog or similar) or an email address. Looking at the page source, website links carry a nofollow attribute, while email addresses appear with a mailto: keyword. With these two clues I can write the corresponding regular expressions. That is how I get the contact information. The two regular expressions are matched much like the code above, so I will not repeat them here (the sketch below shows the kind of patterns involved).
4. The reviewer's name, nationality and other details are also obtained by regular matching, in the same way as in step 3.
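For illustration, a rough sketch of the kind of expressions step 3 implies; the exact patterns are my assumptions about the page markup, not the ones from the original script:

# A rough sketch of the contact matching in step 3; the patterns are
# assumptions about the page markup, not the original expressions.
import re

siteReg = r'<a href="(http[^"]+)"[^>]*rel="nofollow"'   # website links
mailReg = r'mailto:([^"?]+)'                             # email addresses

sites = re.findall(siteReg, html)
mails = re.findall(mailReg, html)
contact = sites[0] if sites else (mails[0] if mails else "")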
Shortcomings:
- As mentioned above, crawl efficiency is a big problem. Even the fast case takes about half an hour to crawl and match 1,000 records; I feel the efficiency could still be improved.
- When crawling the Amazon France top reviewer data, the program kept getting stuck, and I have not yet managed to handle this; it needs further investigation.
Written at the end: the little script took about an afternoon to knock together, plus some more time to cover my girlfriend's other requirements, and the next day I sent her the data to use. Overall it probably saved about a week compared with doing it by hand, while I was busy with other things at the same time. And I learned something myself: data scraping! My version is only the simplest possible implementation, but it still counts as a first step through the door and a bit of accumulated knowledge.