Crawling Autohome's Used Car Product Library
Project address: https://github.com/go-crawler ...
Goal
Recently I keep hearing people mention Autohome, and I am also curious about used car prices in China, so this time the target site is Autohome's used car product library.
Analysis of the target source:
- Each page lists 24 entries
- Results are paginated, but this old product library misbehaves after page 100, so we crawl only 99 pages
- Every city can be accessed
- In total, 190,000+ records can be crawled
Begin
Crawl steps
- Get all the cities
- Assemble all the city URLs and push them into the queue
- Parse the structure of a used car page
- Push the next-page URL into the queue
- Loop to pull the used car data from every page
- Loop to pull the used car data for each city in the queue
- Wait until it is certain that no new URLs remain in the queue
- Store the crawled used car data in the database
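The steps above can be sketched as a simple queue-driven loop. This is a minimal, single-goroutine sketch and not the project's actual code: the Page type, the fetch function and the URLs are hypothetical stand-ins, and a real crawler would perform the HTTP request and HTML parsing inside fetch.

```go
package main

import "fmt"

// Page is a hypothetical parsed result: the records found on one page
// plus the URL of the next page ("" when there is none).
type Page struct {
	Cars []string
	Next string
}

// crawl drains the queue: it fetches each URL, collects the records,
// and pushes the next-page URL back until no new URLs remain.
func crawl(seeds []string, fetch func(url string) Page) []string {
	queue := append([]string{}, seeds...)
	seen := make(map[string]bool)
	var cars []string

	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		if seen[url] {
			continue // already crawled, do not fetch it again
		}
		seen[url] = true

		page := fetch(url)
		cars = append(cars, page.Cars...)
		if page.Next != "" {
			queue = append(queue, page.Next)
		}
	}
	return cars
}

func main() {
	// A fake two-city site standing in for the network.
	site := map[string]Page{
		"/city-a/p1": {Cars: []string{"car 1", "car 2"}, Next: "/city-a/p2"},
		"/city-a/p2": {Cars: []string{"car 3"}},
		"/city-b/p1": {Cars: []string{"car 4"}},
	}
	cars := crawl([]string{"/city-a/p1", "/city-b/p1"}, func(u string) Page {
		return site[u]
	})
	fmt.Println(len(cars), "records crawled")
}
```

The seen map keeps the loop from spinning forever when a page links back to a URL that was already visited.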
Getting the cities
Looking at the page, you can find a list of all used-car cities in the city filter area. Inspect the source carefully, though, and you will find that the list is loaded by JS, with all the cities held in a single variable.
There are two ways to extract it:
- Analyze the JS variable and extract it
- Copy areaJson directly and parse it as a variable
Here we can simply copy and paste it, because this value rarely changes.
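For comparison, the first method (extracting the variable from the JS source) might look roughly like the sketch below. The article does not show what areaJson actually contains, so the City fields, the sample value and the assumption that the value is a valid JSON array are all made up for illustration, as is building the city path from a pinyin field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// City is a hypothetical shape for one areaJson entry.
type City struct {
	Name   string `json:"name"`
	Pinyin string `json:"pinyin"`
}

// areaJSONPattern captures the array literal assigned to areaJson.
// It assumes the form "var areaJson = [...];".
var areaJSONPattern = regexp.MustCompile(`(?s)var\s+areaJson\s*=\s*(\[.*?\])\s*;`)

// extractCities pulls the areaJson value out of JS source and decodes it.
func extractCities(js string) ([]City, error) {
	m := areaJSONPattern.FindStringSubmatch(js)
	if m == nil {
		return nil, fmt.Errorf("areaJson variable not found")
	}
	var cities []City
	if err := json.Unmarshal([]byte(m[1]), &cities); err != nil {
		return nil, err
	}
	return cities, nil
}

func main() {
	// Made-up sample; the real variable may be structured differently.
	js := `var areaJson = [{"name":"Hangzhou","pinyin":"hangzhou"},{"name":"Beijing","pinyin":"beijing"}];`
	cities, err := extractCities(js)
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	for _, c := range cities {
		fmt.Println(c.Name, "/2sc/"+c.Pinyin+"/")
	}
}
```

If the real value is a JS object literal rather than strict JSON (for example with unquoted keys), json.Unmarshal will reject it, which is one more reason copying the value by hand is the simpler route here.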
Getting pagination
Analyzing the page shows that the paging links follow a regular pattern, for example /2sc/hangzhou/a0_0msdgscncgpi1ltocsp2exb4/. The pattern is sp%d: sp followed by the page number.
By the usual routine, you could predict all the paging links, push them into the queue, and pull everything quickly in one go.
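Predicting the links in advance could be sketched as follows. The template is derived from the single example link above, and it assumes that everything in the path other than the city and the page number stays fixed, which the article does not confirm.

```go
package main

import "fmt"

// pageURLs builds the paging paths for one city by filling in the
// page number after "sp". The rest of the path is assumed constant.
func pageURLs(city string, pages int) []string {
	urls := make([]string, 0, pages)
	for p := 1; p <= pages; p++ {
		urls = append(urls, fmt.Sprintf("/2sc/%s/a0_0msdgscncgpi1ltocsp%dexb4/", city, p))
	}
	return urls
}

func main() {
	for _, u := range pageURLs("hangzhou", 3) {
		fmt.Println(u)
	}
}
```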
But this old product library has a problem: past page 100, the next page is always page 101.
So we take the more traditional approach and follow each page's next-page link, which also adapts to possible changes in the paging links. What the pages show after page 100 is also quite odd, so we ignore it for now.
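One way to follow next-page links while still stopping at page 99 is sketched below. It is only an illustration: nextURL and maxPage are hypothetical names, example.com is a placeholder host, and the href is assumed to have been taken from the page's next-page anchor already.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strconv"
)

// maxPage is the last page we trust, per the 99-page limit above.
const maxPage = 99

// pageNumber matches the "sp<digits>" part of a paging link.
var pageNumber = regexp.MustCompile(`sp(\d+)`)

// nextURL resolves the next-page href against the current page URL.
// It reports false when the href is empty, has no page number,
// or points past maxPage.
func nextURL(current, href string) (string, bool) {
	if href == "" {
		return "", false
	}
	m := pageNumber.FindStringSubmatch(href)
	if m == nil {
		return "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil || n > maxPage {
		return "", false
	}
	base, err := url.Parse(current)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	return base.ResolveReference(ref).String(), true
}

func main() {
	current := "https://example.com/2sc/hangzhou/a0_0msdgscncgpi1ltocsp2exb4/"
	next, ok := nextURL(current, "/2sc/hangzhou/a0_0msdgscncgpi1ltocsp3exb4/")
	fmt.Println(next, ok)
}
```

Because the link is read from the page rather than generated, a change to the other path segments would not break the crawl; only the sp number is inspected.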
Getting the used car data
The page structure is fairly fixed, so routine HTML parsing and cleaning is enough.
    // GetCars extracts every used car listed on one page.
    func GetCars(doc *goquery.Document) (cars []QcCar) {
    	cityName := GetCityName(doc)
    	doc.Find(".piclist ul li:not(.line)").Each(func(i int, selection *goquery.Selection) {
    		title := selection.Find(".title a").Text()
    		price := selection.Find(".detail .detail-r").Find(".colf8").Text()
    		kilometer := selection.Find(".detail .detail-l").Find("p").Eq(0).Text()
    		year := selection.Find(".detail .detail-l").Find("p").Eq(1).Text()

    		// Keep only the numeric parts of the mileage and year text.
    		kilometer = strings.Join(compileNumber.FindAllString(kilometer, -1), "")
    		year = strings.Join(compileNumber.FindAllString(strings.TrimSpace(year), -1), "")

    		// Parse errors are ignored; a bad value becomes zero.
    		priceS, _ := strconv.ParseFloat(price, 64)
    		kilometerS, _ := strconv.ParseFloat(kilometer, 64)
    		yearS, _ := strconv.Atoi(year)

    		cars = append(cars, QcCar{
    			CityName:  cityName,
    			Title:     title,
    			Price:     priceS,
    			Kilometer: kilometerS,
    			Year:      yearS,
    		})
    	})

    	return cars
    }
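The cleaning step in the function above relies on a compiled regular expression (compileNumber) whose definition is not shown in the article. The sketch below shows how such a helper could work; the pattern [0-9.]+ and the sample strings are assumptions, and the original may well use a different pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// numberPattern is an assumed stand-in for compileNumber:
// it matches runs of digits and decimal points.
var numberPattern = regexp.MustCompile(`[0-9.]+`)

// cleanNumber trims s, keeps only its numeric pieces, and joins them.
func cleanNumber(s string) string {
	return strings.Join(numberPattern.FindAllString(strings.TrimSpace(s), -1), "")
}

func main() {
	// Hypothetical raw strings; the real page text may look different.
	km, err := strconv.ParseFloat(cleanNumber("3.5万公里"), 64)
	fmt.Println(km, err)

	year, err := strconv.Atoi(cleanNumber(" 2014年 "))
	fmt.Println(year, err)
}
```

Checking the returned error, rather than discarding it, makes it easier to notice listings whose text did not match the expected layout.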
Data
In the comparison of average prices by city, three of the four first-tier cities (Beijing, Shanghai, Guangzhou, Shenzhen) make the list, namely Beijing, Shanghai and Shenzhen, while Hangzhou, which has had strong momentum in recent years, takes the top spot outright, with a noticeable gap to the cities behind it.
The other cities show a roughly stepwise decline. Used cars in first-tier cities do not look cheap, although of course this is only the average price.
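A per-city average like the one compared here could be computed along these lines. The QcCar struct mirrors the fields used by the extraction code, with field types assumed; CityAvg, averagePriceByCity and the sample numbers are hypothetical and are not the crawled data.

```go
package main

import (
	"fmt"
	"sort"
)

// QcCar mirrors the fields filled in by the extraction code (types assumed).
type QcCar struct {
	CityName  string
	Title     string
	Price     float64
	Kilometer float64
	Year      int
}

// CityAvg pairs a city with its average listing price.
type CityAvg struct {
	City string
	Avg  float64
}

// averagePriceByCity groups cars by city and returns the averages,
// highest first.
func averagePriceByCity(cars []QcCar) []CityAvg {
	sum := make(map[string]float64)
	count := make(map[string]int)
	for _, c := range cars {
		sum[c.CityName] += c.Price
		count[c.CityName]++
	}
	out := make([]CityAvg, 0, len(sum))
	for city, s := range sum {
		out = append(out, CityAvg{City: city, Avg: s / float64(count[city])})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Avg > out[j].Avg })
	return out
}

func main() {
	// Made-up sample values for illustration only.
	cars := []QcCar{
		{CityName: "Hangzhou", Price: 20},
		{CityName: "Hangzhou", Price: 30},
		{CityName: "Beijing", Price: 18},
	}
	for _, a := range averagePriceByCity(cars) {
		fmt.Printf("%s: %.2f\n", a.City, a.Avg)
	}
}
```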
Comparing price against mileage, the gap is fairly large in Shanghai, Chengdu and Zhengzhou. It feels like a combined measure of price and mileage is needed.
This chart is a bit interesting: it roughly sums up total mileage. The cities that led the average price ranking do not appear here; instead Hohhot, Daqing, Zhongshan and others are at the top.
Does this indirectly reflect that vehicles in first-tier cities are replaced faster, while vehicles in lower-ranked cities are replaced more slowly and so rack up high mileage?
Analysis of the titles shows that names in the vehicle product library follow the pattern brand name + automatic/manual + XXXX + attributes, so the title alone gives an overview of the car.
Reference
Crawler project address
- https://github.com/go-crawler ...