Crawling and analyzing data with Python: Lianjia (requests + BeautifulSoup)

Source: Internet
Author: User

This is the first article on crawling data with Python, and it uses the requests + BeautifulSoup approach to fetch and parse pages. We use the requests library to crawl the second-hand housing list pages of Lianjia, parse the pages with BeautifulSoup, and extract each listing's price, area, house type, and attention data.

Preparatory work

Before crawling, import the required libraries. The two main ones are requests and BeautifulSoup; the time library is used to set the pause between fetches. These are not all the libraries we need; others are imported later as they come up.

Crawl the list pages

Before starting the crawl, look at the structure of the target page or site; the most important part is the structure of the URL. Lianjia's second-hand housing list has 100 pages in total, with URLs of the form http://bj.lianjia.com/ershoufang/pg9/, where bj is the city, /ershoufang/ is the channel name, and pg9 is the page number. We want to crawl Beijing's second-hand housing channel, so the front part of the URL never changes and is the fixed part, while the page number runs from 1 to 100 and is the variable part. We therefore split the URL in two: the fixed part is assigned to a variable named url, and the variable part is generated with a for loop.
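The split into a fixed and a variable part can be sketched as follows. This is a minimal sketch rather than the original code: the helper name build_page_urls and the variable page_urls are hypothetical, while the base URL and the 1 to 100 range come from the text above.

```python
# Fixed part of the URL: Beijing (bj), second-hand housing channel.
url = "http://bj.lianjia.com/ershoufang/"


def build_page_urls(base, first=1, last=100):
    """Return the list-page URLs for page numbers first..last (inclusive)."""
    return [f"{base}pg{page}/" for page in range(first, last + 1)]


page_urls = build_page_urls(url)
print(len(page_urls))   # number of list pages
print(page_urls[8])     # the URL for page 9
```

Building the list up front keeps the URL logic separate from the network code, so it can be checked without sending any requests.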

In addition, set header information on the HTTP request, otherwise the crawler is easily blocked. Example header values can be found on the web, or you can inspect the headers a browser sends with a tool such as HttpWatch. Adjust the details to the specific situation.

Use a for loop to generate the numbers 1 to 100, convert each number to a string, and append it to the fixed part of the URL to form the address of the page to crawl. Here we pause 0.5 seconds between consecutive pages. The content of the crawled pages is saved in html.
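A minimal sketch of the fetch loop, assuming the requests library is installed. The original code is not shown in this article, so the header values, the helper name fetch_pages, the timeout, and the use of a Session are all assumptions for illustration. The 0.5 second pause matches the text.

```python
import time

import requests

# Illustrative header values (assumptions); copy real ones from your own browser.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}


def fetch_pages(base, first=1, last=100, delay=0.5, session=None):
    """Fetch list pages first..last and return their HTML joined into one string."""
    if session is None:
        session = requests.Session()
    html = ""
    for page in range(first, last + 1):
        page_url = f"{base}pg{page}/"          # fixed part + variable part
        response = session.get(page_url, headers=HEADERS, timeout=10)
        response.raise_for_status()             # stop on HTTP error responses
        html += response.text
        time.sleep(delay)                       # pause after each page
    return html
```

Calling fetch_pages("http://bj.lianjia.com/ershoufang/") would request all 100 pages; it is safer to try a small range such as last=2 first.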

Parse pages and extract information

The fetched pages cannot be read and extracted from directly; they need to be parsed first. We use BeautifulSoup to parse the pages into the same form we see with View Source in the browser.

Once the pages are parsed, the key information can be extracted. Below we extract three parts separately: the total price of each listing, the listing information, and the attention level.

Extract the div tags with class=priceinfo from the page and use a for loop to store the total price of each listing in tp.

The listing information and the attention level are extracted in the same way as the total price. The listing information is stored in hi and the attention data is stored in fi.
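A minimal sketch of parsing and price extraction, assuming the bs4 package is installed. The HTML below is a hypothetical stand-in for a fetched page: only the priceinfo class name comes from the text, while the inner markup and the values are invented for illustration. The live site's markup, including the exact casing of the class name, may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the fetched list pages.
html = """
<div class="priceinfo"><div class="totalprice"><span>450</span>万</div></div>
<div class="priceinfo"><div class="totalprice"><span>1280</span>万</div></div>
"""

# Parse the raw HTML with Python's built-in parser.
lj = BeautifulSoup(html, "html.parser")

# Every div whose class attribute is "priceinfo".
price = lj.find_all("div", attrs={"class": "priceinfo"})

tp = []
for block in price:
    # block.span is the first <span> inside the div; keep its text.
    tp.append(block.span.get_text(strip=True))

print(tp)
```

The values in tp are still strings at this point; they need to be converted to numbers before any calculation.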
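The same pattern can be sketched for the other two fields. The class names houseinfo and followinfo, and the sample text inside the tags, are assumptions for illustration and are not given in this article; only the names hi and fi come from the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names and field contents are assumptions.
html = """
<div class="houseinfo">小区A | 2室1厅 | 88.62平米 | 南 北</div>
<div class="followinfo">56人关注 / 共12次带看</div>
<div class="houseinfo">小区B | 3室2厅 | 132.5平米 | 南</div>
<div class="followinfo">0人关注 / 共0次带看</div>
"""

lj = BeautifulSoup(html, "html.parser")

# Listing information: one string per listing, fields separated by "|".
hi = [div.get_text(strip=True)
      for div in lj.find_all("div", attrs={"class": "houseinfo"})]

# Attention information: one string per listing, fields separated by "/".
fi = [div.get_text(strip=True)
      for div in lj.find_all("div", attrs={"class": "followinfo"})]

print(hi)
print(fi)
```

Using get_text() rather than .string keeps the sketch working when a tag contains nested elements, since .string returns None in that case.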

Create a data table and clean the data

Import the pandas library and combine the previously extracted total price, listing information, and attention data into a data table, which makes later analysis easier.

The extracted information cannot be used directly; it has to be split and cleaned before analysis. In the listing information, the community name, house type, area, and orientation of each listing all sit in a single field, so that field must first be split into columns. The rule is clear: each piece of information is separated by a vertical bar, so we only need to split on the vertical bar.

The split produces a new data table in which each piece of listing information is a separate field.

The new data table is then joined back onto the original data table so that the new fields can be used together with the other fields in later analysis.

The joined data table contains both the original fields and the new fields produced by the split.

Use the same method to split and join the attention field. The separator here is a slash.
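A minimal sketch of building the table, assuming pandas is installed. The sample values and the column names totalprice, houseinfo, and followinfo are hypothetical; tp, hi, and fi are the lists named in the text.

```python
import pandas as pd

# Hypothetical values standing in for the lists extracted from the pages.
tp = ["450", "1280"]
hi = ["小区A | 2室1厅 | 88.62平米 | 南 北", "小区B | 3室2厅 | 132.5平米 | 南"]
fi = ["56人关注 / 共12次带看", "0人关注 / 共0次带看"]

# One row per listing, one column per extracted field.
house = pd.DataFrame({"totalprice": tp, "houseinfo": hi, "followinfo": fi})

print(house.head())
```

The three lists must have the same length, otherwise pandas raises a ValueError; a mismatch usually means one of the extraction steps missed some listings.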
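The split and join steps can be sketched as follows, assuming pandas is installed. The sample rows and the new column names (community, housetype, area, orientation) are hypothetical, and this may differ from the code the original article used.

```python
import pandas as pd

# Hypothetical table with the combined listing-information field.
house = pd.DataFrame({
    "totalprice": ["450", "1280"],
    "houseinfo": ["小区A | 2室1厅 | 88.62平米 | 南 北",
                  "小区B | 3室2厅 | 132.5平米 | 南"],
})

# Split on the vertical bar; expand=True returns one column per piece.
houseinfo_split = house["houseinfo"].str.split("|", expand=True)
houseinfo_split.columns = ["community", "housetype", "area", "orientation"]

# Join the new columns back onto the original table by row index.
house = pd.merge(house, houseinfo_split, left_index=True, right_index=True)

print(house.columns.tolist())
```

The same two steps with "/" as the separator handle the attention field. Note that the pieces still carry the spaces that surrounded the separator, so they need stripping before use.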

Distribution of listing house types

Earlier we obtained the orientation, house type, and other details from the listing information. Here we summarize the listings by house type to see how the house types of second-hand homes for sale in Beijing are distributed.

First, count the number of listings by house type.

Import the numerical computing library NumPy and use Matplotlib to draw a bar chart of the house type distribution.

Second-hand homes for sale in Beijing are spread widely across nearly 20 house types, from 1 room 0 halls to 7 rooms 3 halls. The most common house type is 2 rooms 1 hall, followed by 3 rooms 1 hall, 3 rooms 2 halls, and 2 rooms 2 halls. The smaller 1 room 1 hall is also fairly common, while larger house types are rarer. From this we can also speculate about the distribution of the people selling homes.
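A minimal sketch of the count, assuming pandas is installed and that the house type has already been split into its own column. The column name housetype and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical house types for six listings.
house = pd.DataFrame({
    "housetype": ["2室1厅", "3室1厅", "2室1厅", "1室1厅", "2室1厅", "3室1厅"],
})

# Number of listings per house type, most common first.
housetype_counts = (
    house.groupby("housetype")
    .size()
    .sort_values(ascending=False)
)

print(housetype_counts)
```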
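A minimal sketch of such a bar chart, assuming NumPy and Matplotlib are installed. The labels and counts are invented for illustration and are not the article's results; the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical house types and listing counts.
labels = ["1 room 1 hall", "2 room 1 hall", "2 room 2 hall", "3 room 1 hall"]
counts = [5, 12, 6, 8]

positions = np.arange(len(labels))  # one y position per house type

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(positions, counts, align="center")
ax.set_yticks(positions)
ax.set_yticklabels(labels)
ax.set_xlabel("Number of listings")
ax.set_title("House type distribution")
fig.tight_layout()
# fig.savefig("housetype.png")  # uncomment to write the chart to a file
```

Horizontal bars leave room for long category labels. If the labels are Chinese, Matplotlib must be configured with a font that contains CJK glyphs, otherwise the labels render as empty boxes.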

Distribution of housing area

In the data table, the listing area has been split out into its own column, but it mixes numbers with Chinese characters and cannot be used directly. We need to split the area field a second time to extract the numeric value. The method is similar to the earlier split: we split the area field on "ping". The result is then joined back onto the original data table.

The split data must be cleaned before use; the usual operations are stripping whitespace and converting formats. Below we first strip the whitespace from both ends of the area values and then convert them to a numeric format for later calculation.

Once the area data is cleaned, analysis can begin. First look at the range of areas across all of Beijing's second-hand listings: the area ranges from 18.85 to 332.63.

Knowing the range, we can group the areas. We divide the housing area into 7 groups at intervals of 50 and count how the listings are distributed across these 7 groups.

Use the area group field to count the number of listings in each group and draw a bar chart.

Among all listings, the 50-100 group is the largest, followed by 100-150. The number of listings falls as the area increases. Small listings under 50 also account for a certain share.
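A minimal sketch of the second split and the cleaning step, assuming pandas is installed. It assumes that "ping" in the text stands for the character 平 in values such as 88.62平米; the column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical area values as they look after the first split.
house = pd.DataFrame({"area": [" 88.62平米 ", " 132.5平米 ", " 18.85平米 "]})

# Second split: keep only the text before "平" (assumed separator).
area_text = house["area"].str.split("平", expand=True)[0]

# Strip spaces from both ends, then convert the strings to floats.
house["area_num"] = area_text.str.strip().astype(float)

print(house["area_num"].min(), house["area_num"].max())
```

If some rows do not contain a number, astype(float) raises a ValueError; pd.to_numeric(..., errors="coerce") is an alternative that turns such rows into NaN instead.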
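The grouping can be sketched with pandas.cut, assuming pandas is installed. The bin edges follow the text (7 groups at intervals of 50); the sample areas and the group labels are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned area values.
area = pd.Series([18.85, 45.0, 88.62, 132.5, 210.0, 332.63])

# Eight edges give seven intervals: (0, 50], (50, 100], ..., (300, 350].
bins = [0, 50, 100, 150, 200, 250, 300, 350]
labels = ["under 50", "50-100", "100-150", "150-200",
          "200-250", "250-300", "300-350"]

group_area = pd.cut(area, bins=bins, labels=labels)

# Count the listings in each group, keeping the groups in bin order.
area_counts = group_area.value_counts(sort=False)

print(area_counts)
```

Because the result is categorical, groups with no listings still appear in the counts with a value of 0, which keeps the bars of a chart aligned with the full set of groups.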

Distribution of listing attention

The attention data is in a similar state to the area data: after the first split it still mixes numbers with Chinese characters and cannot be used directly. It has to be split a second time to extract the attention value, which is then cleaned and converted to a numeric format.

After cleaning, look at the range of attention across all listings: it runs from 0 to 725. In other words, some listings are very popular while others get no attention at all. This may be related to when a listing was published and updated. Sales speed also matters: a popular listing may sell as soon as it goes online and so never accumulate attention. Here we simplify and ignore these complications for now, and only count the distribution of attention.

Divide the attention values into 8 groups at intervals of 100 and count the number of listings in each attention group to see how attention is distributed across the listings for sale.

Draw a bar chart of the listing attention distribution.

Of the 3,000 listings, nearly 2,500 have an attention value below 100, and very few have an attention value above 400. Again, attention data cannot accurately represent how popular a listing is: a popular listing may have a low attention value because it sells quickly. The attention data is therefore for reference only.
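A minimal sketch of the attention grouping, assuming pandas is installed. The sample values are hypothetical. The choice of right=False is an assumption of this sketch, not something the article states: with the default right-closed intervals, a listing with an attention value of exactly 0 would fall outside the first interval (0, 100] and become NaN.

```python
import pandas as pd

# Hypothetical cleaned attention values, including the extremes from the text.
follow = pd.Series([0, 3, 99, 100, 250, 725])

# Edges 0, 100, ..., 800 give eight intervals of width 100.
bins = list(range(0, 900, 100))

# right=False makes intervals left-closed: [0, 100), [100, 200), ...
group_follow = pd.cut(follow, bins=bins, right=False)

follow_counts = group_follow.value_counts(sort=False)

print(follow_counts)
```

With left-closed intervals a value of 100 belongs to the second group rather than the first, so the boundary convention should be stated when reporting the counts.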

Clustering analysis of listings

Finally, we cluster all listings for sale by total price, area, and attention, grouping them into categories according to their similarity on these three measures.

We divide the listings into three categories; each category has a center point with coordinates in total price, area, and attention.

Based on the center points of the three categories in total price, area, and attention, we sort the listings for sale into three categories. The first is listings with low total price, small area, and high attention. The second is listings with medium total price, medium area, and medium attention. The third is listings with high total price, large area, and low attention.

From a marketing and user experience perspective, ads and the default ordering of list pages should give higher weight to listings with a total price of around 4 million and an area of around 80. This category of listings attracts the most user attention.
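One way to obtain three categories and their center points is k-means clustering. This article does not name the algorithm it used, so k-means via scikit-learn is an assumption of this sketch, as are the sample rows, which are invented and are not the article's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical rows of (total price, area, attention).
data = np.array([
    [250, 45, 120], [260, 50, 110], [240, 48, 130],
    [400, 80, 60], [420, 85, 55], [390, 78, 65],
    [900, 160, 10], [950, 170, 5], [880, 150, 12],
], dtype=float)

# Three clusters; random_state fixes the initialization for repeatability.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

centers = model.cluster_centers_   # one row per cluster, one column per feature
labels = model.labels_             # cluster index assigned to each listing

print(centers)
print(labels)
```

k-means uses Euclidean distance, so a feature with a large numeric range (here the total price) dominates the result unless the features are scaled first. Whether to standardize the columns before clustering is a design choice worth stating alongside the center points.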

A helpful reader has provided a code example. Link: http://pan.baidu.com/s/1skSlVUt Password: IGVV

End.

When reprinting, please credit 36 Big Data (36dsj.com)

http://www.36dsj.com/archives/71046
