Python Project in Practice: Data Collection for the Forbes Series

1 Data Collection Overview

To start a data analysis project, the first thing you need is the raw data, and there are many ways to obtain it. For example:

    1. Download a ready-made dataset file
    2. Collect data with a web crawler
    3. Obtain Excel, CSV, or other data files directly
    4. Other ways ...

In the Forbes data analysis project series, the data is collected mainly with a web crawler, supplemented by other data sources for comparison.

This article mainly introduces the idea and the steps of using a crawler to collect the data.

The collection covers the Forbes Global 2000 list of the world's largest listed companies for the years 2007 to 2017, spanning more than 10 years.

The target site spreads the data across a number of web pages, and the layout of these pages differs from year to year. Although the idea and the steps are similar, each layout needs its own scraping code and is collected separately.

2 Data Collection Steps

Data collection is broadly divided into several steps:

    1. Download the content of the target main page
    2. Collect the data on the main page
    3. Collect the links to the other paged pages from the main page
    4. Download and collect the data from each paged page
    5. Save the collected data

The Python libraries involved include requests, BeautifulSoup, and csv. The following walks through each step, using the collection of a single year's data as the example.

import requests
from bs4 import BeautifulSoup
import csv

2.1 Data Download Module

It is mainly based on requests; the code is as follows:

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/59.0.3071.115 Safari/537.36'}
    response = requests.get(url, headers=headers)
    # print(response.status_code)
    return response.text

This module is used both for downloading the main page and for downloading each sub-page, so it is a shared, general-purpose module.
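
Because this downloader is called for every page, it may be worth adding a timeout and simple retry logic. The following is only a minimal sketch of such a variant, not part of the original code; the retry count and backoff are arbitrary choices:

import time

def download_with_retry(url, num_retries=3, timeout=10):
    """Download a page, retrying a few times on network errors or 5xx responses."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/59.0.3071.115 Safari/537.36'}
    for attempt in range(num_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            # Retry only on server-side errors; anything below 500 is returned as-is.
            if response.status_code < 500:
                return response.text
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # simple exponential backoff
    return None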

2.2 Collection of Data on the Main Page

The main page is structured in two parts: a set of links pointing to the pages that hold the rest of the data, and a list of company data shown on the main page itself in table form.
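
Before writing the parsers, it can help to inspect the downloaded HTML and confirm where the company table and the pagination links actually sit. A minimal exploratory sketch, assuming the same div id and table class used by the parsing code below (the real page may differ):

html = download('URL')  # 'URL' is a placeholder for the main page address
soup = BeautifulSoup(html, 'lxml')
body_content = soup.body.find('div', {'id': 'bodycontent'})

# How many tables are there, and which classes do they carry?
for i, table in enumerate(body_content.find_all('table')):
    print(i, table.get('class'))

# Where do the pagination links live?
center_div = body_content.find('div', {'align': 'center'})
print(center_div.find('p').find('b').find_all('a')[:5])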

This data can be parsed with BeautifulSoup. The code modules are as follows:

    • Parse the company data list on the main page
def get_content_first_page(html, year):
    """Get the list of companies ranked 1-100, including the table header."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})
    # There are 3 tables, and the last one is what we want.
    trs = tables[-1].find_all('tr')

    # Get the header names.
    # trs[1]: note that this index is not the same in other years.
    row_title = [item.text.strip() for item in trs[1].find_all('th')]
    row_title.insert(0, 'Year')

    rank_list = []
    rank_list.append(row_title)
    for i, tr in enumerate(trs):
        if i == 0 or i == 1:
            continue
        tds = tr.find_all('td')
        # Get the company ranking and listing information.
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)
    return rank_list

    • Parse the links to the other paged pages from the main page
def get_page_urls(html):
    """Get the page links for the companies ranked 101-2000."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    label_div = body_content.find('div', {'align': 'center'})
    label_a = label_div.find('p').find('b').find_all('a')
    # 'basic_url' stands in for the site's base URL, which is not given in this article.
    page_urls = ['basic_url' + item.get('href') for item in label_a]
    return page_urls
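
As a quick check, the two functions above can be run together on a downloaded main page. A hypothetical usage sketch (the URL is a placeholder, as elsewhere in this article):

html = download('URL')
rank_list = get_content_first_page(html, 2007)
print(rank_list[0])    # the header row, with 'Year' prepended
print(len(rank_list))  # expected: 1 header row plus the top-100 company rows

page_urls = get_page_urls(html)
print(len(page_urls))  # links to the pages holding ranks 101-2000
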
2.3 Data Collection on Each Paged Page

The steps here are again a web page download followed by scraping the table data. The code is similar to that for the main page, with some differences in the details.
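
The article does not show get_content_other_page, which the main function in section 2.5 relies on. Below is only a minimal sketch, assuming each paged page carries the same 'xxxxtable' layout with a single header row; the real layout may differ per year, as noted earlier:

def get_content_other_page(html, year):
    """Get the companies ranked 101-2000 from one paged page (sketch)."""
    soup = BeautifulSoup(html, 'lxml')
    body_content = soup.body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})
    trs = tables[-1].find_all('tr')

    rank_list = []
    for i, tr in enumerate(trs):
        # Assume only the first row is a header row on the paged pages.
        if i == 0:
            continue
        tds = tr.find_all('td')
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)
    return rank_list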

2.4 Data Storage

The collected data is finally saved to a CSV file. The module code is as follows:

def save_data_to_csv_file(data, file_name):
    """Save data to a CSV file."""
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
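
The file is opened in append mode ('a'), so repeated calls keep adding rows to the same CSV, which is what the main function below relies on when it writes one page after another. A quick usage sketch with made-up rows:

data = [
    ['Year', 'Rank', 'Company', 'Country'],    # header row
    [2007, 1, 'SampleCorp', 'United States'],  # made-up example row
]
save_data_to_csv_file(data, 'forbes_2007.csv')

# A later call appends to the same file rather than overwriting it.
save_data_to_csv_file([[2007, 2, 'AnotherCorp', 'Japan']], 'forbes_2007.csv')
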
2.5 Data Collection Main Function

def get_forbes_global_year_2007(year=2007):
    url = 'URL'  # placeholder: the real page address is not given in this article
    html = download(url)
    # print(html)
    data_first_page = get_content_first_page(html, year)
    # print(data_first_page)
    save_data_to_csv_file(data_first_page, 'forbes_' + str(year) + '.csv')

    page_urls = get_page_urls(html)
    # print(page_urls)
    for url in page_urls:
        html = download(url)
        data_other_page = get_content_other_page(html, year)
        # print(data_other_page)
        print('Saving data ...', url)
        save_data_to_csv_file(data_other_page, 'forbes_' + str(year) + '.csv')


if __name__ == '__main__':
    # Get data from Forbes Global
    get_forbes_global_year_2007()
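
Because the page layout differs from year to year, the article writes one collection function per year. Below is a hedged sketch of how such per-year functions might be dispatched from a single entry point; only get_forbes_global_year_2007 appears in this article, and the other names are hypothetical, following the same pattern:

def get_forbes_global_all_years():
    """Run the per-year collection functions one after another (sketch)."""
    year_collectors = {
        2007: get_forbes_global_year_2007,
        # 2008: get_forbes_global_year_2008,  # hypothetical, same pattern
        # ...
        # 2017: get_forbes_global_year_2017,
    }
    for year, collector in sorted(year_collectors.items()):
        print('Collecting year', year)
        collector(year=year)
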
3 Summary

This article only introduces the idea of data collection and the individual modules; it does not provide links to the target web pages. On the one hand, the original web page data is messy, so collecting it requires writing several separate scraping programs; on the other hand, our focus is on the subsequent data analysis, and we do not want the attention to stay on data crawling.

In the subsequent analysis we will look at the structure of the data, its integrity, and related information. You are welcome to follow the public account (id: pydataroad).

Recommended readings for this issue:

    • Forbes series of data analysis ideas

    • Pandas data Processing: Forbes Global list of listed companies

    • Python job search: the top 10 cities, and whether yours is among them
