No doubt about it: who will lift the World Cup is the question on everyone's mind. As a "senior" football fan, I naturally have to put my own expertise to work and use Python to simulate the 2018 World Cup, just to quench our thirst in advance.
Objective
The World Cup is about to kick off. Everything is still unknown, but the full schedule has been set, so we can follow that schedule and simulate the scores of all 64 matches 10,000 times, then estimate how each of groups A-H plays out, the probability of each team reaching the top four (semi-finals), and finally the probability of each team winning the championship.
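To make the objective concrete, here is a minimal sketch of the simulation step, assuming a simple independent-Poisson goal model with made-up attack strengths; the actual model, fitted from the scraped match history, may well differ:

import numpy as np
from collections import Counter

def simulate_match(strength_a, strength_b):
    # Draw one score line from independent Poisson goal counts.
    # The strength parameters are hypothetical; in practice they would
    # be estimated from each team's scraped match history.
    return np.random.poisson(strength_a), np.random.poisson(strength_b)

N = 10000  # number of Monte Carlo repetitions per fixture
outcomes = Counter()
for _ in range(N):
    a, b = simulate_match(1.6, 1.1)  # e.g. a strong side vs. a weaker one
    outcomes['home win' if a > b else 'away win' if b > a else 'draw'] += 1
print({k: round(v / N, 3) for k, v in outcomes.items()})

Repeating this over the whole 64-game bracket, carrying winners forward round by round, yields the advancement and championship probabilities.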
Data acquisition
Find the resources and sites that carry the data you need, then get ready to collect. This walkthrough uses the Scout network (zq.win007.com) as the example source:
First find the respective page links for the 32 national teams, then visit each of the 32 links in turn and collect that team's match records;
Analyze the site, sketch the overall approach, and implement the crawler to do the collection. Since the site is static, it is easy to scrape: during acquisition we first locate each national team's link, build the pairing of team link (ID) and team name, and then collect all the historical match data from each team's page.
When looking up a team's link, pay attention to getting it exactly right: on the Scout network each team has its own page keyed by an ID. Brazil's ID, for example, is 778, so its link is http://zq.win007.com/cn/team/CTeamSche/778.html. If you are unsure about a link, paste it into a browser first to check that the page loads.
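As a quick sanity check (a sketch, assuming the third-party requests package is available), you can confirm a team ID resolves to a real page before pointing the crawler at it:

import requests

def team_url(team_id):
    # Team pages follow the pattern shown above for Brazil (ID 778).
    return 'http://zq.win007.com/cn/team/CTeamSche/%d.html' % team_id

resp = requests.get(team_url(778))
print(resp.status_code)  # 200 means the link is valid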
The complete code for data acquisition follows:
from __future__ import print_function, division

from selenium import webdriver
import pandas as pd


class Spider(object):

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(30)  # seconds; the original timeout value was lost in formatting
        self.verification_errors = []
        self.accept_next_alert = True

    def get_all_team_data(self):
        # Get all 32 team IDs (used to form each team's URL) from the World Cup homepage.
        self.get_team_ids()

        # Loop over the teams and collect each one's match records.
        data = []
        for i, [team_id, team_name] in enumerate(self.team_list):
            print(i, team_id, team_name)
            df = self.get_team_data(team_id, team_name)
            data.append(df)

        output = pd.concat(data)
        output.reset_index(drop=True, inplace=True)
        output.to_csv('data_2018worldcup.csv', index=False, encoding='utf-8')
        self.driver.close()

    def get_team_ids(self):
        main_url = 'http://zq.win007.com/cn/CupMatch/75.html'
        self.driver.get(main_url)

        # NOTE: the inline-style value in this XPath was garbled in the original
        # post; adjust it to match the actual <td> markup on the page.
        teams = self.driver.find_elements_by_xpath(
            "//td[@style='background-color:#fff;text-align:left;']")

        data = []
        for team in teams:
            link = team.find_element_by_xpath(".//a")
            # The team ID is the last path component of the href, e.g. .../778.html -> 778.
            team_id = int(link.get_attribute('href').split('/')[-1].split('.')[0])
            team_name = link.text
            print(team_id, team_name)
            data.append([team_id, team_name])
        self.team_list = data
        # self.team_list = pd.DataFrame(data, columns=['team_name', 'team_id'])
        # self.team_list.to_excel('national_team_ids.xlsx', index=False)

    def get_team_data(self, team_id, team_name):
        """Collect all historical match data for one national team. TODO: no paging."""
        url = 'http://zq.win007.com/cn/team/CTeamSche/%d.html' % team_id
        self.driver.get(url)

        table = self.driver.find_element_by_xpath(
            "//div[@id='Tech_schedule' and @class='data']")
        matches = table.find_elements_by_xpath(".//tr")
        print(len(matches))

        # Grab the match rows and save them as a DataFrame.
        data = []
        for i, match in enumerate(matches):
            if i == 0:
                # The first row holds the table headers.
                headers = match.find_elements_by_xpath(".//th")
                h1, h2, h3, h4, h5 = (headers[0].text, headers[1].text,
                                      headers[2].text, headers[3].text, headers[4].text)
                print(h1, h2, h3, h4, h5)
                continue
            try:
                info = match.find_elements_by_xpath(".//td")
                cup = info[0].text        # .text is already unicode; no explicit encode needed
                match_time = info[1].text
                home_team = info[2].text
                fts = info[3].text        # full-time score, e.g. '2-1'
                fs_a, fs_b = int(fts.split('-')[0]), int(fts.split('-')[1])
                away_team = info[4].text
                print(cup, match_time, home_team, away_team, fs_a, fs_b)
                data.append([cup, match_time, home_team, away_team, fs_a, fs_b, team_name])
            except Exception:
                # Rows past the schedule table fail to parse; stop there.
                break

        df = pd.DataFrame(data, columns=['tournament', 'time', 'home', 'away',
                                         'home_goals', 'away_goals', 'team_name'])
        return df


if __name__ == '__main__':
    spider = Spider()
    # Step 1: grab the IDs of the 32 World Cup teams.
    # Step 2: loop over each team and collect its match data.
    spider.get_all_team_data()
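Once the crawl finishes, everything lands in data_2018worldcup.csv. A quick way to sanity-check the output (column names as in the DataFrame above):

import pandas as pd

df = pd.read_csv('data_2018worldcup.csv', encoding='utf-8')
print(df.shape)                   # total matches collected across all teams
print(df['team_name'].nunique())  # should be 32 national teams
print(df.head())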