1 Data Collection Overview
The first step in any data analysis project is obtaining the raw data, and there are many ways to get it. For example:
- Download a ready-made dataset file
- Collect data with a web crawler
- Read data directly from Excel, CSV, and other data files
- Other ways ...
In the Forbes data analysis project series, the data comes mainly from crawler-based collection, supplemented by other data sources for comparison.
This article introduces the approach and steps for collecting the data with a crawler.
The collected data is the Forbes Global 2000 list of the world's largest listed companies, covering the years 2007 to 2017, a span of more than ten years.
The target data is spread across a number of web pages, and the layout of those pages differs from year to year; although the idea and steps are similar, each layout has to be coded and collected separately.
2 Data Collection Steps
Data collection is broadly divided into several steps:
- Download the main page of the target site
- Extract the company data from the main page
- Extract the links to the other paginated pages from the main page
- Download and parse the data on each paginated page
- Save the collected data
The Python libraries involved are requests, BeautifulSoup, and csv. The following walks through each step, using the data for a single year as a case study.
```python
import csv

import requests
from bs4 import BeautifulSoup
```
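Note that BeautifulSoup is imported from the bs4 package, and the 'lxml' parser used below is a separate dependency; both can be installed with, for example, `pip install requests beautifulsoup4 lxml`.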
2.1 Data Download Module
This module is based mainly on requests; the code is as follows:
```python
def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/59.0.3071.115 Safari/537.36'}
    response = requests.get(url, headers=headers)
    # print(response.status_code)
    return response.text
```
This module is used for downloading both the main page and each sub-page, so it is shared across all of the collection steps.
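A minimal usage sketch (the URL below is a stand-in, since the real target URL is omitted throughout this article):

```python
# Hypothetical usage -- 'https://example.com/global2000' is a placeholder URL.
html = download('https://example.com/global2000')
print(html[:200])  # peek at the first 200 characters of the downloaded page
```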
2.2 Collecting Data from the Main Page
The main page contains two things of interest: the links to the pages holding the rest of the data, and a list of company data displayed as an HTML table.
Both parts can be parsed with BeautifulSoup. The code modules are as follows:
- Parse the company list on the main page:
```python
def get_content_first_page(html, year):
    """Get the list of companies ranked 1-100, including the table header."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})

    # There are 3 tables; the last one is the one we want
    trs = tables[-1].find_all('tr')

    # Get the header names
    # trs[1] -- note this differs from the pages for other years
    row_title = [item.text.strip() for item in trs[1].find_all('th')]
    row_title.insert(0, 'Year')

    rank_list = []
    rank_list.append(row_title)

    for i, tr in enumerate(trs):
        if i == 0 or i == 1:
            continue
        tds = tr.find_all('td')
        # Get the ranking and listing info for each company
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)

    return rank_list
```
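The function returns a list of rows ready to be written to CSV, with the header as the first row. A hypothetical usage (the column names shown are illustrative; the actual header is read from the page):

```python
# Hypothetical usage -- the column names are illustrative only.
rank_list = get_content_first_page(html, 2007)
print(rank_list[0])  # header row, e.g. ['Year', 'Rank', 'Company', ...]
print(rank_list[1])  # first data row, prefixed with the year
```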
- Parse the links to the other pages from the main page:
```python
def get_page_urls(html):
    """Get the page links for the companies ranked 101-2000."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    body_content = body.find('div', {'id': 'bodycontent'})
    label_div = body_content.find('div', {'align': 'center'})
    label_a = label_div.find('p').find('b').find_all('a')
    # 'basic_url' is a placeholder for the site's base URL, which the
    # author intentionally omits
    page_urls = ['basic_url' + item.get('href') for item in label_a]
    return page_urls
```
2.3 Data Collection on Each Paginated Page
The steps are the same: download the page, then scrape the tabular data. The code is similar to that for the main page, with some differences in the details; a sketch of what such a parser might look like follows.
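The article does not list this code, but the main function in section 2.5 calls a function named get_content_other_page. A minimal sketch, assuming the paginated pages use a table layout similar to the main page (the selectors and the number of header rows to skip are assumptions that would need adjusting to the real pages):

```python
def get_content_other_page(html, year):
    """Sketch only: parse the company rows on a paginated page.

    Assumes a layout similar to the main page; the selectors and the
    number of header rows are guesses, not the author's actual code.
    """
    soup = BeautifulSoup(html, 'lxml')
    body_content = soup.body.find('div', {'id': 'bodycontent'})
    tables = body_content.find_all('table', {'class': 'xxxxtable'})
    trs = tables[-1].find_all('tr')

    rank_list = []
    for i, tr in enumerate(trs):
        if i == 0:  # skip the header row; the count may differ per page
            continue
        tds = tr.find_all('td')
        row = [item.text.strip() for item in tds]
        row.insert(0, year)
        rank_list.append(row)
    return rank_list
```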
2.4 Data Storage
The collected data is finally saved to a CSV file. The module code is as follows:
```python
def save_data_to_csv_file(data, file_name):
    """Save data to a CSV file."""
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
```
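The file is opened in append mode ('a') so that rows from successive pages accumulate in the same file, and newline='' keeps the csv module from inserting blank lines on Windows. A hypothetical usage with made-up rows:

```python
# Hypothetical rows -- illustrative values only, not actual Forbes data.
rows = [
    [2007, '101', 'Company A', 'Country A'],
    [2007, '102', 'Company B', 'Country B'],
]
save_data_to_csv_file(rows, 'forbes_2007.csv')
```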
2.5 Main Collection Function
```python
def get_forbes_global_year_2007(year=2007):
    url = 'url'  # placeholder -- the target page URL is omitted by the author
    html = download(url)
    # print(html)

    data_first_page = get_content_first_page(html, year)
    # print(data_first_page)
    save_data_to_csv_file(data_first_page, 'forbes_' + str(year) + '.csv')

    page_urls = get_page_urls(html)
    # print(page_urls)

    for url in page_urls:
        html = download(url)
        data_other_page = get_content_other_page(html, year)
        # print(data_other_page)
        print('Saving data ...', url)
        save_data_to_csv_file(data_other_page, 'forbes_' + str(year) + '.csv')


if __name__ == '__main__':
    # Get data from Forbes Global 2000
    get_forbes_global_year_2007()
```
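Since each year's pages have a different layout, one collection function is written per year. A hedged sketch of how they might be driven together (only get_forbes_global_year_2007 appears in this article; the other functions are assumed to be written along the same lines):

```python
# Sketch only: drive one collection function per year.  Functions other
# than get_forbes_global_year_2007 are assumed, not shown in this article.
def collect_all_years():
    collectors = {
        2007: get_forbes_global_year_2007,
        # 2008: get_forbes_global_year_2008,
        # ... one function per year's layout, through 2017
    }
    for year, collect in collectors.items():
        collect(year=year)
```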
3 Summary
This article introduces only the approach to data collection and the individual modules, without providing links to the target web pages. This is partly because the original pages' data is messy and collecting it required writing several separate scraping programs, and partly because our focus is on the subsequent data analysis, so I hope not to dwell on the crawling itself.
In the follow-up analysis we will look at the structure of the data, its integrity, and related information. You are welcome to follow the public account (id: pydataroad).
Recommended reading for this series:
- Forbes Series: Data Analysis Ideas
- Pandas Data Processing: The Forbes Global List of Listed Companies
- Top 10 Cities for Python Jobs: Is Yours Among Them?