Python crawler: some approaches to cleaning crawled data (5)

Source: Internet
Author: User

A crawler exists, of course, to put data to use. Before the data can be analyzed, it has to be cleaned. Cleaning includes removing useless columns and dimensions, deleting duplicate records, and correcting the data.

Crawling several large news sites will inevitably pick up the same story more than once. Earlier, when I wanted to dig deeper into the crawled news, I consulted this site: http://blog.reetsee.com/archives/237. I did no further data mining, but the data did at least get processed. Duplicate news can be handled with jieba, the Python Chinese word segmentation library. Collect word-segmentation statistics for each article: if the keyword dictionaries differ too much, or the word frequencies in them differ too much, the articles can be counted as different news.

Checking a news content string for duplicates takes three operations: keyword extraction, term-frequency counting, and duplicate checking.

Keyword extraction function:

    import jieba.analyse

    def extractTagsFromContent(content, num_of_tags):
        # Extract the top num_of_tags keywords from the text
        tags = jieba.analyse.extract_tags(content, topK=num_of_tags)
        return tags

Term-frequency function:

    import jieba

    def getTermFreqFromContent(tags, content):
        tfdict = {}
        for tag in tags:
            tfdict[tag] = 0  # initialize the frequency of each keyword to 0
        seg_list = jieba.cut(content)  # cut the news content into words
        has_words = False
        for word in seg_list:
            if word in tfdict:
                tfdict[word] += 1  # count the frequency
                has_words = True
        if has_words:
            return tfdict
        else:
            return None

Cosine similarity function:

    import math

    def cosinSimilarity(vector1, vector2):
        if len(vector1) != len(vector2):
            # str() is needed because the vectors are lists, not strings
            print("error: vector1: " + str(vector1) + " and vector2: " + str(vector2) + " have different dimensions")
            return None
        numerator = 0.0
        v1_square = 0.0
        v2_square = 0.0
        for i in range(0, len(vector1)):
            numerator += vector1[i] * vector2[i]
            v1_square += vector1[i] * vector1[i]
            v2_square += vector2[i] * vector2[i]
        denominator = math.sqrt(v1_square * v2_square)
        if denominator == 0:
            return None
        else:
            return numerator / denominator

Duplicate-check function:

    import heapq

    # getTermFreqFromFile, cosinSimilarityForDict and SimilarPassage are
    # helpers that are not shown in this post.
    def findSimilarPassageFromSet(news_set, example_tf):
        heap = []
        tags = []
        for tag in example_tf.keys():
            tags.append(tag)
        for file_path in news_set:
            tf = getTermFreqFromFile(tags, file_path)
            if tf is None:
                continue
            similarity = cosinSimilarityForDict(example_tf, tf)
            # insert into the heap
            if similarity is not None:
                heap.append(SimilarPassage(similarity * -1.0, file_path))
        # pop the highest similarity (because of the * -1, popping the minimum is actually popping the maximum)
        heapq.heapify(heap)
        if len(heap) == 0:
            return None
        result = heapq.heappop(heap)
        if result.relevant():
            print("similarity: " + str(result.similarity))
            news_set.discard(result.file_path)
            return result.file_path
        else:
            return None
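
The duplicate-check function calls a dictionary-based cosine similarity and a heap entry class whose definitions do not appear in this post. The following is only a minimal sketch of what they might look like: the names cosinSimilarityForDict, SimilarPassage and relevant() follow the calls above, while their bodies, the 0.8 threshold and the sample term frequencies are assumptions made for illustration.

```python
import heapq
import math

RELEVANCE_THRESHOLD = 0.8  # assumed cut-off; tune it on real data


class SimilarPassage:
    """Heap entry; similarity is stored negated so heapq pops the best match first."""

    def __init__(self, similarity, file_path):
        self.similarity = similarity
        self.file_path = file_path

    def __lt__(self, other):
        # heapq only needs "<" to order entries
        return self.similarity < other.similarity

    def relevant(self):
        # Undo the negation before comparing with the threshold
        return -self.similarity >= RELEVANCE_THRESHOLD


def cosinSimilarityForDict(tf1, tf2):
    """Cosine similarity of two term-frequency dicts; missing terms count as 0."""
    terms = set(tf1) | set(tf2)
    dot = sum(tf1.get(t, 0) * tf2.get(t, 0) for t in terms)
    norm = math.sqrt(sum(v * v for v in tf1.values()) * sum(v * v for v in tf2.values()))
    return dot / norm if norm else None


# Hypothetical term frequencies for one reference article and two candidates
example_tf = {"gold": 2, "medal": 1, "team": 1}
candidates = {
    "a.txt": {"gold": 4, "medal": 2, "team": 2},  # same proportions as the reference
    "b.txt": {"gold": 0, "medal": 0, "team": 3},  # mostly unrelated
}

heap = []
for path, tf in candidates.items():
    sim = cosinSimilarityForDict(example_tf, tf)
    if sim is not None:
        heapq.heappush(heap, SimilarPassage(-sim, path))

best = heapq.heappop(heap)
print(best.file_path, -best.similarity, best.relevant())
```

Because cosine similarity compares proportions rather than absolute counts, a longer copy of the same story still scores close to 1. Where to draw the line between "duplicate" and "related" is a judgment call that depends on the data.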

With these functions, the news items can be checked for duplicates.

When crawling Sina Weibo data, the WAP site is easier to crawl and its data is cleaner.

But the data crawled for a keyword can be somewhat odd:

: Spring Update Sales: "Machine A Girl" hot "cherry quest" violent death is considered "buy hand to send CD-ROM", "Machine a girl frame ARMS GIRL" The 1th volume has achieved good results. and P.a.works animation company after the "Flower Kai Yi Iroha" "white box" after the third work theme animation "Cherry blossom task" Sales of only 1392, in the spring is the bottom.<BR/>2017 Spring sales and rankings (as of July 23, 2017)<BR/><BR/>1th Place: "Blue fantasy" 53571 photos<BR/><BR/>2nd place: "Idol master Cinderella Girl Theater" 42959 photos<BR/><BR/>3rd place: "Erromango teacher" 10417 Zhang<BR/><BR/>4th place: "Star Opera of the University", 9827 photos in the 2nd quarter<BR/><BR/>5th place: "ARMS GIRL" 7614 photos<BR/><BR/>6th place: "Attacking Titan", 2nd season, 7525 photos<BR/><BR/>7th place: "Royal teacher Heine" 6532 Zhang<BR/><BR/>8th place: "Natsume Friends Account" 6th season 4862<BR/><BR/>9th place: "My Hero Academy", 2nd season, 4145 photos<BR/><BR/>10th place: "Sin seven sins" 3305 photos<BR/><BR/>11th place: "Re:creators" 2631 photos<BR/><BR/>12th place: "Indecent magic instructor and Taboo code" 2485 Zhang<BR/><BR/>13th place: What are you doing at the end of the day? Do you have a free time? Can you come and save? "1674 Sheets<BR/><BR/>14th place: "Sword Daishi San Tan" 1656 photos<BR/><BR/>15th place: "Armed girls" 1425 photos<BR/><BR/>16th place: "Cherry blossom quest" 1392 photos<BR/><BR/>17th place: "Clock organ star" 896 photos<BR/><BR/>18th place: "The Strange Guardian God" 834 photos<BR/><BR/>19th place: "Covering noise" 823 sheets<BR/><BR/>20th place: "The hubbub of the long B-girl" 673 Zhang<BR/><BR/>21st Place: "Love Tyrant" 556 photos
I meant Nintendo's game ARMS ...
: Unfortunately, lost Stars no audio version, not better to listen! Strength to sing will! Luckily, open arms has it.<ahref= "/N/M%E9%B9%BFM">@m Deer M</a>Of<ahref= "https://weibo.cn/sinaurl?f=w&amp;u=http%3A%2F%2Ft.cn%2FRorgJGY&amp;ep=FerXXxPbm%2C1763629124% 2cferxxxpbm%2c1763629124 ">Trigger (Set it off)</a>Just hit the list!<ahref= "http://weibo.cn/pages/100808topic?extparam=%E4%BA%9A%E6%B4%B2%E6%96%B0%E6%AD%8C%E6%A6%9C&amp;from= Feed ">#亚洲新歌榜 #</a>Now participate in the list, but also have the opportunity to receive the August 27 Asian new song list 2017 annual event tickets!???
It looks like the keyword "arms" is too ambiguous.
<href= "https://weibo.cn/sinaurl?f=w&amp;u=http%3a%2f%2ft.cn%2fr9yv0fz& amp;ep=ferbm7lqy%2c1764127957%2cferbm7lqy%2c1764127957 "> second shot video </a>  ???
This one is normal, but the symbols after the topic are obviously garbled.
:<ahref= "Http://weibo.cn/pages/100808topic?extparam=%E5%AD%A6%E5%AD%90%E9%A3%8E%E9%87%87&amp;from=feed" >#学子风采 #</a>"Great, set the university son won the 41st annual ACM International College Student Program Design Competition Asian Regional Competition Bronze Award", ACM/ICPC (International College Program Design Competition) Asian regional competition in Qingdao curtain. 186 teams from 115 universities such as Peking University, Fudan University, Wuhan and Xiamen universities competed. After fierce competition, the University of Computer Engineering 2014 students Wu Xiaoren, Shan Ai, Chen Mingzhen composed of large ACM Training team (instructor: Lin Yangbin) won a bronze medal.<BR/><BR/>The ACM International College Student Program Design Competition (abbreviated ACM-ICPC) is sponsored by the International computer community with a long history of the authoritative organization ACM Institute (Association for Computing Machinery), the world recognized the largest, highest level, The largest number of participants in the International University Program Design competition, known as the "Olympic" competition in the IT industry.<ahref= "/n/%e9%9b%86%e5%a4%a7%e8%ae%a1%e7%ae%97%e6%9c%ba%e5%b7%a5%e7%a8%8b%e5%ad%a6%e9%99%a2%e5%ad%a6%e7%94%9f% e4%bc%9a ">@ Set Big Computer Engineering College student Union</a>
The well-known introductory passage in the second half could be removed ...
<href= "https://weibo.cn/sinaurl?f=w&amp;u=http%3a%2f%2ft.cn%2frk1sg4m& amp;ep=fdolv9dkf%2c6286510827%2cfdolv9dkf%2c6286510827 ">http://t.cn/RK1sG4m</ a > ???
I know this is just a link to the post itself ... If the title is wrong, removing the clutter becomes more troublesome.
: "Our school students in the ACM International College Student Program Design Competition National Invitational Gold" in May 2017, the ACM International College Student Program Design Competition (ACM-ICPC) National Invitational held at Northwestern University of Technology. By our school students Zhiyuan Li, Xu Cheng, Chen Litian Three students composed of the team "challenge" won the gold medal, holding back the history of our school's first ACM-ICPC gold medal. Details visible <href= "https://weibo.cn/sinaurl?f=w&amp;u=http%3A%2F%2Ft.cn% 2frol0sxx&amp;ep=f9vlbao8b%2c1845850033%2cf9vlbao8b%2c1845850033 ">http://t.cn/Rol0sxX  </a> ???
Is the data behind the details link useful? Does it need a parsing strategy of its own?
: "Our school students in the ACM International College Student Program Design Competition National Invitational Gold" in May 2017, the ACM International College Student Program Design Competition (ACM-ICPC) National Invitational held at Northwestern University of Technology. By our school students Zhiyuan Li, Xu Cheng, Chen Litian Three students composed of the team "challenge" won the gold medal, holding back the history of our school's first ACM-ICPC gold medal. Details visible <href= "https://weibo.cn/sinaurl?f=w&amp;u=http%3A%2F%2Ft.cn% 2frol0sxx&amp;ep=f9vlbao8b%2c1845850033%2cf9vlbao8b%2c1845850033 ">http://t.cn/Rol0sxX  </a> ???
The same search result was collected twice ...
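
For posts that are collected more than once word for word, a full similarity check is not needed. Below is a minimal sketch of exact-duplicate removal, assuming the crawled posts are held as a list of strings; the function name dedupe_posts and the sample posts are hypothetical.

```python
import hashlib


def dedupe_posts(posts):
    """Return posts with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for text in posts:
        # Collapse runs of whitespace so trivially different copies hash the same
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


posts = [
    "Our school students won gold at the ACM-ICPC invitational",
    "Japan • Yokohama",
    "Our school students won gold  at the ACM-ICPC invitational",
]
unique_posts = dedupe_posts(posts)
print(len(posts), "->", len(unique_posts))
```

Storing digests rather than full texts keeps the seen set small when many long posts are crawled.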
<href= "https://weibo.cn/sinaurl?f=w&amp;u=http%3a%2f%2ft.cn%2fru14lzk& amp;ep=feuqn8bxp%2c1886986281%2cfeuqn8bxp%2c1886986281 "> Japan • Yokohama </a>  ???
Should the location be recorded as a variable?
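
Several of the problems in the samples above (leftover tags, escaped entities, topic markers, short links) can be handled with a few regular expressions. The following is only a rough sketch under stated assumptions: the function name clean_post, the patterns and the sample fragment are hypothetical, topics are assumed to be wrapped in paired # signs, and t.cn short links are assumed to be noise.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TOPIC_RE = re.compile(r"#\s*([^#]+?)\s*#")
SHORT_URL_RE = re.compile(r"https?://t\.cn/\w+")


def clean_post(raw):
    """Return (plain_text, topics) for one crawled post fragment."""
    text = BR_RE.sub("\n", raw)          # keep line breaks as newlines
    text = TAG_RE.sub("", text)          # drop remaining tags such as <a ...>
    text = html.unescape(text)           # turn entities such as &amp; back into characters
    topics = TOPIC_RE.findall(text)      # text between paired '#' signs
    text = TOPIC_RE.sub("", text)        # remove the topic markers from the body
    text = SHORT_URL_RE.sub("", text)    # assumed noise: short links back to the post
    text = re.sub(r"[ \t]+", " ", text).strip()
    return text, topics


raw = ('<a href="http://weibo.cn/pages/100808topic?from=feed">#学子风采#</a>'
       'Team wins bronze medal.<br/>http://t.cn/Rol0sxX')
text, topics = clean_post(raw)
print(repr(text), topics)
```

Returning the topics separately keeps them available as a feature for later analysis instead of discarding them with the markup.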

The problems above are only some of the ones that come up ... Crawling other data will raise more questions to consider. I will think about cleaning strategies when I have time.
