# weiboheat.py
# -*- coding: utf-8 -*-
"""
This script crawls popular movie information from the WAP (mobile) version of Weibo,
in particular the number of discussions and the number of reads for each movie topic.
"""
import json
import time

import requests
from pandas import DataFrame

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
i = 1  # the page number, the only part of the URL that changes
movies = []  # initialize the movie list
csvname = 'Wh_allmovies.csv'  # name of the output file
cards = [1]  # cold start: make the cards list non-empty so the loop runs once

# The page is loaded dynamically, so keep requesting pages for as long as
# the server still returns content.
while cards != []:
    try:
        # The movie list sits in card 2 on the first page and in card 0 afterwards.
        if i == 1:
            j = 2
        else:
            j = 0
        url = ('http://m.weibo.cn/page/pageJson?containerid=&containerid='
               '100803_ctg1_100_-_page_topics_ctg1__100&luicode=10000011&lfid='
               '100808d35a54c4ae10c8311e64ae96c776f206&v_p=11&ext=&fid='
               '100803_ctg1_100_-_page_topics_ctg1__100&uicode='
               '10000011&next_cursor=&page=' + str(i))
        resp = requests.get(url, headers=headers)
        time.sleep(0.1)
        # The text of the response is JSON; its layout was worked out by inspection.
        content = json.loads(resp.text)
        cards = content['cards']
        card = cards[j]
        card_group = card['card_group']
        # card_group is a list of 10 dictionaries, one per movie, each holding
        # various information about that movie. Append them to the movies list.
        movies = movies + card_group
        print(i * 10)  # progress marker
        i += 1  # move on to the next page
    except Exception:
        # On the page after the last one, cards is empty, cards[j] raises,
        # and the loop condition then ends the crawl. A failed request leaves
        # cards unchanged, so the same page is requested again.
        print('error')
    finally:
        # After every page, convert the movies list to a DataFrame and save it.
        movies_df = DataFrame(movies)
        # df1 = DataFrame({'title': movies_df.loc[:, 'card_type_name'],
        #                  'heat': movies_df.loc[:, 'desc2'],
        #                  'scheme': movies_df.loc[:, 'scheme'],
        #                  'pic': movies_df.loc[:, 'pic']})
        movies_df.to_csv(csvname, index=False, encoding='utf-8')
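The paging logic of weiboheat.py can be tried out without touching the network. The sketch below is only an illustration under stated assumptions: `collect_card_groups`, `fake_fetch`, and the `max_pages` limit are hypothetical names that do not appear in the script, and the fake pages merely imitate the `cards` / `card_group` layout that the script relies on.

```python
def collect_card_groups(fetch_page, max_pages=1000):
    """Gather movie dictionaries page by page until a page has no cards.

    fetch_page is a hypothetical callable that maps a page number to the
    decoded JSON dictionary of that page.
    """
    movies = []
    for page in range(1, max_pages + 1):
        content = fetch_page(page)
        cards = content.get('cards', [])
        if not cards:
            break  # an empty page means there is nothing left to load
        # Same rule as in the script: card 2 on the first page, card 0 afterwards.
        j = 2 if page == 1 else 0
        if j >= len(cards):
            break  # unexpected layout, stop instead of raising
        movies.extend(cards[j].get('card_group', []))
    return movies


def fake_fetch(page):
    """Stand-in for the HTTP request: two pages of made-up data, then nothing."""
    if page == 1:
        return {'cards': [{}, {}, {'card_group': [
            {'card_type_name': 'Movie A', 'desc2': '1万讨论 2亿阅读'}]}]}
    if page == 2:
        return {'cards': [{'card_group': [
            {'card_type_name': 'Movie B', 'desc2': '300讨论 4万阅读'}]}]}
    return {'cards': []}


movies = collect_card_groups(fake_fetch)
titles = [m['card_type_name'] for m in movies]
```

Replacing `fake_fetch` with a function that performs the real request would reproduce the crawl, and the `max_pages` cap keeps the loop from running forever if the server never returns an empty page.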
# weiboheat_treatment.py
# -*- coding: utf-8 -*-
"""
This script processes the CSV file produced by weiboheat.py.
It adds the topic discussion count (discussnum), the topic read count (readnum),
and a heat score derived from the read count (score_weibo).
"""
import pandas as pd
from pandas import DataFrame

df = pd.read_csv('Wh_allmovies.csv')
# Pick out the columns we need and give them our own column names.
df1 = DataFrame({'title': df.loc[:, 'card_type_name'],
                 'heat': df.loc[:, 'desc2'],
                 'scheme': df.loc[:, 'scheme'],
                 'pic': df.loc[:, 'pic']})
# Take the heat column out of the DataFrame.
heat = df1.loc[:, 'heat']


# Function: convert a string like '24亿阅读' (2.4 billion reads) into the int 2400000000.
# Note: the input must be a text (Unicode) string.
def getnum(heat):
    if u'亿' in heat:  # 亿 = 100 million
        temp = list(heat)  # convert to a list so the Chinese characters can be removed
        temp.pop()
        temp.pop()
        temp.pop()  # three pops remove a suffix like '亿阅读'
        temp = ''.join(temp)  # join what is left back into a str
        temp = float(temp) * 100000000  # convert to float, then multiply by 100 million
    elif u'万' in heat:  # 万 = 10 thousand
        temp = list(heat)
        temp.pop()
        temp.pop()
        temp.pop()
        temp = ''.join(temp)
        temp = float(temp) * 10000  # multiply by 10,000
    else:
        temp = list(heat)
        temp.pop()
        temp.pop()  # two pops remove the two-character suffix
        temp = ''.join(temp)
        temp = float(temp)  # no multiplier needed
    return int(temp)  # return the value as an int


# Function: derive a movie's score from its read count.
def getscore(i):
    if i >= 0 and i < 100000000:
        return 1
    elif i >= 100000000 and i < 300000000:
        return 2
    elif i >= 300000000 and i < 500000000:
        return 3
    elif i >= 500000000 and i < 700000000:
        return 4
    elif i >= 700000000:
        return 5
    else:
        return None


discussnum = []  # initialize the list of discussion counts
readnum = []  # initialize the list of read counts
score_weibo = []  # initialize the list of Weibo heat scores
for i in range(len(heat)):
    heat_i = heat[i]  # take each heat item
    # Split each heat item on whitespace into a list of length 2.
    heat_ilist = heat_i.split()
    heat_discuss = heat_ilist[0]  # first item: the discussion count, like '275.8万讨论'
    heat_read = heat_ilist[1]  # second item: the read count, like '13亿阅读'
    discussnum.append(getnum(heat_discuss))  # convert with getnum, then store
    readnum.append(getnum(heat_read))
    score_weibo.append(getscore(getnum(heat_read)))  # score the read count with getscore
df2 = DataFrame({'discussnum': discussnum, 'readnum': readnum,
                 'score_weibo': score_weibo})  # build a DataFrame from the three lists
df3 = pd.concat([df1, df2], axis=1)
df3.to_csv('Wh_allmovies_discussreadscore.csv', index=False)
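The unit conversion and the scoring thresholds of weiboheat_treatment.py can also be written more compactly. The following is a minimal sketch of an alternative, not the script's own method: it swaps the repeated `pop()` calls for a regular expression and the `if`/`elif` chain for a `bisect` lookup. The names `parse_count`, `parse_heat`, and `score_reads` are hypothetical, and the sample strings assume the "number + 亿/万 + two-character suffix" shape described in the script's comments.

```python
import re
from bisect import bisect_right

# Multipliers for the Chinese units: 亿 = 100 million, 万 = 10 thousand.
UNIT_FACTOR = {'亿': 100000000, '万': 10000, '': 1}
COUNT_PATTERN = re.compile(r'^([0-9]+(?:\.[0-9]+)?)(亿|万)?')

# Lower bounds of scores 2, 3, 4 and 5, taken from the thresholds in the script.
SCORE_BOUNDS = [100000000, 300000000, 500000000, 700000000]


def parse_count(text):
    """Turn a string such as '13亿阅读' into an int such as 1300000000."""
    match = COUNT_PATTERN.match(text)
    if match is None:
        raise ValueError('unrecognised count: %r' % text)
    number = float(match.group(1))
    unit = match.group(2) or ''
    # round() guards against float products that land just below a whole number.
    return int(round(number * UNIT_FACTOR[unit]))


def parse_heat(heat):
    """Split 'discussions reads' on whitespace and convert both parts."""
    discuss, read = heat.split()
    return parse_count(discuss), parse_count(read)


def score_reads(reads):
    """Map a read count to a score from 1 to 5, or None for negative input."""
    if reads < 0:
        return None
    return bisect_right(SCORE_BOUNDS, reads) + 1


discussions, reads = parse_heat('275.8万讨论 13亿阅读')
score = score_reads(reads)
```

Because the pattern only looks at the leading number and the optional unit, it does not depend on the suffix being exactly two characters long, which is the assumption built into the `pop()` version.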
1-3 Crawling the popularity of movie topics on Weibo (number of reads and discussions per topic)