Taobao shop content can only be crawled through the shop's URL, so the first thing we need is a URL.
Once you have the URL, the crawling can begin. Based on its domain name, the URL has to be sorted into one of two kinds: a Taobao shop or a Tmall shop. This is because Taobao and Tmall shops have different DOM structures. How to extract the domain name from a URL is not covered here; if you don't know how, search for it yourself.
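For readers who would rather not search, here is a minimal sketch of the domain check using PHP's built-in parse_url. The function name shopTypeFromUrl and the return values 'tmall' and 'taobao' are my own choices for illustration, and the check assumes shop pages live on tmall.com or taobao.com (or a subdomain of either).

```php
<?php
// Hypothetical helper: decide which parser a shop URL needs, based on its host.
function shopTypeFromUrl(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host found, so not a usable absolute URL
    }
    $host = strtolower($host);

    // Match the bare domain or any subdomain of it (assumed domains).
    if ($host === 'tmall.com' || substr($host, -10) === '.tmall.com') {
        return 'tmall';
    }
    if ($host === 'taobao.com' || substr($host, -11) === '.taobao.com') {
        return 'taobao';
    }
    return null; // neither kind of shop
}
```

The caller can then branch on the result and run the Tmall or the Taobao selectors described below.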
Let's start with the simpler one, Tmall.
The Tmall shop level sits in an element with the class tm-shiop-age-content, so with phpQuery, pq('.tm-shiop-age-content')->text() returns the Tmall shop level directly.
Next is the shop's rating.
The Tmall shop rating is in a div with the class main-info, so pq('.main-info')->text() works here too. The returned data looks like "description 4.8 Service 4.7 Logistics 4.7".
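If you want the three scores as numbers rather than one string, the returned text can be split with a regular expression. This is only a sketch: the function name parseShopScores and the array keys are hypothetical, and it assumes the first three numbers in the text are the description, service and logistics scores, in that order.

```php
<?php
// Hypothetical helper: pull the three scores out of the text
// returned by pq('.main-info')->text().
// Assumes the first three numbers are description, service, logistics.
function parseShopScores(string $text): ?array
{
    // Collect every integer or decimal number in the text.
    if (preg_match_all('/\d+(?:\.\d+)?/', $text, $m) < 3) {
        return null; // fewer than three numbers: unexpected format
    }
    return [
        'description' => (float) $m[0][0],
        'service'     => (float) $m[0][1],
        'logistics'   => (float) $m[0][2],
    ];
}

print_r(parseShopScores('description 4.8 Service 4.7 Logistics 4.7'));
```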
Now for the more troublesome one, Taobao.
First, Taobao shop levels come in several types, such as crown and diamond, and each type is shown as a number of icons. The level is marked up on the page as an a tag with three classes.
The first and third classes are fixed. The second class has the form tb-rank-xxx, where xxx corresponds to the level type as follows:
crown: gold crown
cap: crown
blue: diamond
red: heart
So we have to work out the level type of the Taobao shop from the value of this class.
pq('.shop-rank .rank-icon-v2')->attr('class') gets the value of the class attribute first. Note that the selector starts from the parent element, shop-rank, because the shop level appears twice on the page and we only need one of them.
The next step is to get the number that goes in front of the level type.
The number of i tags inside this a tag is the number of icons for the level. So we also fetch the i tags inside the a tag: pq('.shop-rank .rank-icon-v2')->html().
With these two values we have everything we need. Write a function that handles both and returns the shop level in a readable form.
The trimall function strips whitespace from the crawled HTML. It is in my earlier blog post on crawling Baidu data, so I won't post it again here.
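As a rough sketch of what such a function could look like (not the original one): the name shopLevel and its parameters are hypothetical, and the class-to-level table is taken from the list above. This version counts the i tags with a regular expression, so it does not call trimall.

```php
<?php
// Hypothetical helper combining the two values into a readable level.
// $class: value of the class attribute, containing a tb-rank-xxx class
// $html:  inner HTML of the same a tag, containing one <i> per icon
function shopLevel(string $class, string $html): ?string
{
    // Mapping of the tb-rank-xxx suffix to the level type.
    $names = [
        'crown' => 'gold crown',
        'cap'   => 'crown',
        'blue'  => 'diamond',
        'red'   => 'heart',
    ];

    // Find the tb-rank-xxx class and look up its suffix.
    if (!preg_match('/\btb-rank-([a-z]+)/', $class, $m) || !isset($names[$m[1]])) {
        return null; // unknown or missing level class
    }

    // Count opening <i> tags only; tags such as <img> are not matched.
    $count = preg_match_all('/<i[\s>\/]/i', $html);
    if ($count < 1) {
        return null; // no icons found
    }

    return $count . ' ' . $names[$m[1]] . ($count === 1 ? '' : 's');
}
```

It would be called with the two values fetched earlier, for example shopLevel($class, $html), where $class comes from attr('class') and $html from html().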
In the end we get data like this: 5 crowns.
Next is grabbing the Taobao shop rating.
The Taobao shop rating is in an a tag with the class mini-dsr, so we get the text of this tag directly: pq('.mini-dsr')->text().
Once crawled, the text turns out to be garbled. This is because Taobao's pages are encoded in GBK, and phpQuery recognizes GB2312 but not GBK, so it automatically treats the content as ISO-8859-1. The captured data therefore needs its encoding converted.
First convert the UTF-8 string back to ISO-8859-1 to recover the original bytes, then convert those bytes from GBK to UTF-8:
$str = mb_convert_encoding($str, 'iso-8859-1', 'utf-8');
$str = mb_convert_encoding($str, 'utf-8', 'GBK');
This gives us rating data in the same form as the Tmall data above.
That is how I did the crawling and the problems I ran into. If anyone has a better approach, you are welcome to discuss it.
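To see why the two conversions work, here is a small self-contained sketch that simulates the garbling and then undoes it. The function name fixGbkMojibake and the sample string are my own; the sketch assumes the mbstring extension is installed and that the garbled text is GBK bytes that were wrongly read as ISO-8859-1, as described above.

```php
<?php
// Hypothetical helper wrapping the two conversions described above.
function fixGbkMojibake(string $str): string
{
    // Step 1: undo the wrong ISO-8859-1 reading to get the original bytes back.
    $raw = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    // Step 2: decode those bytes as GBK, the encoding the page really uses.
    return mb_convert_encoding($raw, 'UTF-8', 'GBK');
}

// Simulate the problem: GBK bytes misread as ISO-8859-1.
$original = '描述 4.8';
$gbkBytes = mb_convert_encoding($original, 'GBK', 'UTF-8');
$garbled  = mb_convert_encoding($gbkBytes, 'UTF-8', 'ISO-8859-1');

// Repair the garbled string and print it.
echo fixGbkMojibake($garbled), PHP_EOL;
```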