Python Project in Practice: Data Collection for the Forbes Series

1 Data Collection Overview

To start a data analysis project, the first thing you need is the raw data, and there are many ways to obtain it. For example:

    1. Download a ready-made dataset file
    2. Collect data with a web crawler
    3. Obtain Excel, CSV, or other data files directly
    4. Other ways ...

In the Forbes data analysis project series, the data is collected mainly with a web crawler, supplemented by other data sources for comparison.

This article mainly introduces the idea and the steps of using a crawler to collect the data.

The collection covers the Forbes Global 2000 list of the world's largest listed companies for the years 2007 to 2017, spanning more than 10 years.

The target site spreads the data across a number of web pages, and the layout of these pages differs from year to year. Although the idea and the steps are similar, each layout needs its own scraping code and is collected separately.

2 Data Collection Steps

Data collection is broadly divided into several steps:

    1. Download the content of the target main page
    2. Collect the data on the main page
    3. Collect the links to the other paged pages from the main page
    4. Download and collect the data from each paged page
    5. Save the collected data

The Python libraries involved include requests, BeautifulSoup, and csv. The following walks through each step, using the collection of a single year's data as the example.

import requests
from bs4 import BeautifulSoup
import csv

2.1 Data Download Module

It is mainly based on requests; the code is as follows:

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/59.0.3071.115 Safari/537.36'}
    response = requests.get(url, headers=headers)
    # print(response.status_code)
    return response.text

This module is used both for downloading the main page and for downloading each sub-page, so it is a shared, general-purpose module.
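
Because this downloader is called for every page, it may be worth adding a timeout and simple retry logic. The following is only a minimal sketch of such a variant, not part of the original code; the retry count and backoff are arbitrary choices:

import time

def download_with_retry(url, num_retries=3, timeout=10):
    """Download a page, retrying a few times on network errors or 5xx responses."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/59.0.3071.115 Safari/537.36'}
    for attempt in range(num_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            # Retry only on server-side errors; anything below 500 is returned as-is.
            if response.status_code < 500:
                return response.text
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # simple exponential backoff
    return None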

2.2 Collection of Data on the Main Page

The main page is structured in two parts: a set of links pointing to the pages that hold the rest of the data, and a list of company data shown on the main page itself in table form.
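
Before writing the parsers, it can help to inspect the downloaded HTML and confirm where the company table and the pagination links actually sit. A minimal exploratory sketch, assuming the same div id and table class used by the parsing code below (the real page may differ):

html = download('URL')  # 'URL' is a placeholder for the main page address
soup = BeautifulSoup(html, 'lxml')
body_content = soup.body.find('div', {'id': 'bodycontent'})

# How many tables are there, and which classes do they carry?
for i, table in enumerate(body_content.find_all('table')):
    print(i, table.get('class'))

# Where do the pagination links live?
center_div = body_content.find('div', {'align': 'center'})
print(center_div.find('p').find('b').find_all('a')[:5])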

This data can be parsed with BeautifulSoup. The code modules are as follows:

    • Parse the company data list on the main page
def get_content_first_page(html, year):
    """Get the list of companies ranked 1-100, including the table header."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})
    # There are 3 tables, and the last one is what we want.
    trs = tables[-1].find_all('tr')

    # Get the header names.
    # trs[1]: note that this index is not the same in other years.
    row_title = [item.text.strip() for item in trs[1].find_all('th')]
    row_title.insert(0, 'Year')

    rank_list = []
    rank_list.append(row_title)
    for i, tr in enumerate(trs):
        if i == 0 or i == 1:
            continue
        tds = tr.find_all('td')
        # Get the company ranking and listing information.
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)
    return rank_list

    • Parse the links to the other paged pages from the main page
def get_page_urls(html):
    """Get the page links for the companies ranked 101-2000."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    label_div = body_content.find('div', {'align': 'center'})
    label_a = label_div.find('p').find('b').find_all('a')
    # 'basic_url' stands in for the site's base URL, which is not given in this article.
    page_urls = ['basic_url' + item.get('href') for item in label_a]
    return page_urls
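
As a quick check, the two functions above can be run together on a downloaded main page. A hypothetical usage sketch (the URL is a placeholder, as elsewhere in this article):

html = download('URL')
rank_list = get_content_first_page(html, 2007)
print(rank_list[0])    # the header row, with 'Year' prepended
print(len(rank_list))  # expected: 1 header row plus the top-100 company rows

page_urls = get_page_urls(html)
print(len(page_urls))  # links to the pages holding ranks 101-2000
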
2.3 Data Collection on Each Paged Page

The steps here are again a web page download followed by scraping the table data. The code is similar to that for the main page, with some differences in the details.
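
The article does not show get_content_other_page, which the main function in section 2.5 relies on. Below is only a minimal sketch, assuming each paged page carries the same 'xxxxtable' layout with a single header row; the real layout may differ per year, as noted earlier:

def get_content_other_page(html, year):
    """Get the companies ranked 101-2000 from one paged page (sketch)."""
    soup = BeautifulSoup(html, 'lxml')
    body_content = soup.body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})
    trs = tables[-1].find_all('tr')

    rank_list = []
    for i, tr in enumerate(trs):
        # Assume only the first row is a header row on the paged pages.
        if i == 0:
            continue
        tds = tr.find_all('td')
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)
    return rank_list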

2.4 Data Storage

The collected data is finally saved to a CSV file. The module code is as follows:

def save_data_to_csv_file(data, file_name):
    """Save data to a CSV file."""
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
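
The file is opened in append mode ('a'), so repeated calls keep adding rows to the same CSV, which is what the main function below relies on when it writes one page after another. A quick usage sketch with made-up rows:

data = [
    ['Year', 'Rank', 'Company', 'Country'],    # header row
    [2007, 1, 'SampleCorp', 'United States'],  # made-up example row
]
save_data_to_csv_file(data, 'forbes_2007.csv')

# A later call appends to the same file rather than overwriting it.
save_data_to_csv_file([[2007, 2, 'AnotherCorp', 'Japan']], 'forbes_2007.csv')
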
2.5 Data Collection Main Function

def get_forbes_global_year_2007(year=2007):
    url = 'URL'  # placeholder: the real page address is not given in this article
    html = download(url)
    # print(html)
    data_first_page = get_content_first_page(html, year)
    # print(data_first_page)
    save_data_to_csv_file(data_first_page, 'forbes_' + str(year) + '.csv')

    page_urls = get_page_urls(html)
    # print(page_urls)
    for url in page_urls:
        html = download(url)
        data_other_page = get_content_other_page(html, year)
        # print(data_other_page)
        print('Saving data ...', url)
        save_data_to_csv_file(data_other_page, 'forbes_' + str(year) + '.csv')


if __name__ == '__main__':
    # Get data from Forbes Global
    get_forbes_global_year_2007()
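
Because the page layout differs from year to year, the article writes one collection function per year. Below is a hedged sketch of how such per-year functions might be dispatched from a single entry point; only get_forbes_global_year_2007 appears in this article, and the other names are hypothetical, following the same pattern:

def get_forbes_global_all_years():
    """Run the per-year collection functions one after another (sketch)."""
    year_collectors = {
        2007: get_forbes_global_year_2007,
        # 2008: get_forbes_global_year_2008,  # hypothetical, same pattern
        # ...
        # 2017: get_forbes_global_year_2017,
    }
    for year, collector in sorted(year_collectors.items()):
        print('Collecting year', year)
        collector(year=year)
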
3 Summary

This article only introduces the idea of data collection and the individual modules; it does not provide links to the target web pages. On the one hand, the original web page data is messy, so collecting it requires writing several separate scraping programs; on the other hand, our focus is on the subsequent data analysis, and we do not want the attention to stay on data crawling.

In the subsequent analysis we will look at the structure of the data, its integrity, and related information. You are welcome to follow the public account (id: pydataroad).

Recommended readings for this issue:

    • Forbes series of data analysis ideas

    • Pandas data Processing: Forbes Global list of listed companies

    • Python job search: the top 10 cities, and whether yours is among them
