Python general forum body extraction / Python forum comment extraction / Python forum user information extraction

Source: Internet
Author: User
Tags: repetition


Background

I participated in a data mining competition and really learned a lot this time, and in the end I nearly completed everything that was required, with acceptable accuracy. The whole thing takes no more than 500 lines of code, and the idea behind it is fairly simple: it relies mainly on the short-text characteristics of forum pages and the similarity between the floors (posts). (In plain terms: strip out the noise, then keep only the relatively regular dates and content.)

Preparation

    1. Software and development environment: PyCharm, Python 2.7, Linux

    2. Main Python packages used: jieba, requests, BeautifulSoup, goose, Selenium, PhantomJS, pymongo, etc. (installation of some of these is covered in my earlier blog posts)
Web preprocessing

First, because many of these sites are dynamic, BeautifulSoup alone cannot retrieve some of the information, so we use Selenium with PhantomJS to render the page and save it locally, and then process the saved file.

The relevant code is

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def save(baseUrl):
    driver = webdriver.PhantomJS()
    driver.get(baseUrl)
    try:
        element = WebDriverWait(driver, 10).until(isload(driver) is True)
    except Exception, e:
        print e
    finally:
        data = driver.page_source  # page content after the JavaScript has loaded
    driver.quit()
    return data
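
The isload() predicate used above is not shown in the post. Note that WebDriverWait.until() expects a callable that receives the driver, rather than an already evaluated boolean, so a minimal, hypothetical readiness check (an assumption, not the author's code) could look like this:

from selenium.webdriver.support.ui import WebDriverWait

# hypothetical helper: wait until the browser reports that the DOM has finished loading
def wait_until_loaded(driver, timeout=10):
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete")

With such a helper, the try block above would simply call wait_until_loaded(driver).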

Since there is a lot of noise on the page (advertisements, images, and so on), we first need to remove everything that is inconsistent with what we want to extract. We start by removing tags that typically carry only noise, such as script, and we use BeautifulSoup to do it.

The code is roughly as follows:

for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
[s.extract() for s in soup('script')]
[s.extract() for s in soup('meta')]
[s.extract() for s in soup('style')]
[s.extract() for s in soup('link')]
[s.extract() for s in soup('img')]
[s.extract() for s in soup('input')]
[s.extract() for s in soup('br')]
[s.extract() for s in soup('li')]
[s.extract() for s in soup('ul')]
print (soup.prettify())
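
For context, here is a minimal sketch of how the saved page source could be fed into BeautifulSoup before applying the cleanup above; the wrapper name and the URL are illustrative assumptions, not part of the original code:

from bs4 import BeautifulSoup
from bs4.element import Comment

# illustrative wrapper around the cleanup steps shown above
def clean_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    # drop HTML comments
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    # drop typical noise tags
    for tag in ('script', 'meta', 'style', 'link', 'img', 'input', 'br', 'li', 'ul'):
        [s.extract() for s in soup(tag)]
    return soup

soup = clean_page(save("http://bbs.example.com/thread-1-1.html"))  # placeholder URL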

Comparison of web pages after processing

You can see that the page has much less noise, but there is still too much of it to extract what we want directly.

Since we do not need the tags themselves, only the text inside them, we can use BeautifulSoup to extract the text content and then analyze it.

for string in soup.stripped_strings:
    print(string)
    with open(os.path.join(os.getcwd()) + "/data/3.txt", 'a') as f:
        f.writelines(string.encode('utf-8') + '\n')

You can see that the result is still very messy, but it is also very regular. The text of each floor is virtually the same and highly repetitive, consisting of certain fixed words such as "jump to floor", "bench", "sofa" (forum slang for early replies) and so on, so we need to delete these words and then analyze further.

The method I used is to segment the page text with the jieba tokenizer and count word frequencies; the words with the highest frequency are also the ones most likely to appear in the noise. The code is as follows:

import jieba.analyse

text = open(r"./data/get.txt", "r").read()
dic = {}
cut = jieba.cut_for_search(text)
for fc in cut:
    if fc in dic:
        dic[fc] += 1
    else:
        dic[fc] = 1
blog = jieba.analyse.extract_tags(text, topK=1000, withWeight=True)
for word_weight in blog:
    # print (word_weight[0].encode('utf-8'), dic.get(word_weight[0], 'not found'))
    with open('cut.txt', 'a') as f:
        f.writelines(word_weight[0].encode('utf-8') + "    " + str(dic.get(word_weight[0], 'not found')) + '\n')

Once the statistics are out, we test and sift through them to pick out the stop words, which include the following:

Replies
Integral
Post
Login
Forum
Registered
Offline
Time
Author
Sign
Theme
Essence
Client
Cell phone
Download
Share

About 200 such words have been collected so far.
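
The filtering code further below iterates over a stop_words object. Here is a minimal sketch, assuming the roughly 200 words are kept one per line in a file named stop_words.txt (the filename is an assumption):

# -*- coding: utf-8 -*-
# assumed layout: one stop word per line
stop_words = open('stop_words.txt', 'r').readlines()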

And then there's the work of removing duplicate text.

import re

# de-duplication function
def remove_dup(items):
    pattern1 = re.compile(r'posted in')    # marker phrase before a date (Chinese in the original)
    pattern2 = re.compile('\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}')
    pattern3 = re.compile('\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern4 = re.compile('\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern5 = re.compile(r'[^0-9a-zA-Z]{7,}')
    # A set is used as the container for part of the duplicate check; the other part is
    # decided by the regex matches. yield turns the function into a generator, so the
    # surviving text can be iterated over outside the function.
    seen = set()
    for item in items:
        match1 = pattern1.match(item)
        match2 = pattern2.match(item)
        match3 = pattern3.match(item)
        match4 = pattern4.match(item)
        match5 = pattern5.match(item)
        if item not in seen or match1 or match2 or match3 or match4 or match5:
            yield item
            seen.add(item)  # add the item to the set; the set automatically discards duplicates
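
Since remove_dup() is a generator, a short usage sketch might look like the following; the variable names are assumptions, and the raw lines would be the ones written out by soup.stripped_strings above:

# de-duplicate the raw text lines; date-like lines always pass through
after_string = list(remove_dup(raw_lines))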

Looking at the page text further, we find another kind of noise that cannot be ignored: pure numbers. The page text contains many purely numeric lines that are not duplicates, such as like counts, so I planned to match them with a regular expression and delete them. But that raises a problem: some usernames are purely numeric, so we would delete those usernames as well. To solve this we keep purely numeric lines that are longer than 7 characters, which removes most of the useless numbers while preserving usernames as much as possible.

The relevant code is as follows

st = []
for stop_word in stop_words:
    st.append(stop_word.strip('\n'))
# t is a tuple; unlike a list it cannot be modified, and is written (...) rather than [...]
t = tuple(st)
lines = []
# remove stop words and short pure-number lines
for j in after_string:
    # if the line does not start with a stop word, keep it
    if not j.startswith(t):
        # keep the line if it is not all digits, or if the number is longer than 7 digits
        # (to distinguish irrelevant numbers from numeric usernames)
        if not re.match('\d+$', j) or len(j) > 7:
            lines.append(j.strip())
            # strip whitespace and print
            print (j.strip())

The processed text now looks as follows, and the pattern is very obvious.

And then it's time for us to extract the content.

Content Extraction

Content extraction is essentially locating the comment blocks, and the comment blocks are already quite clear in the figure above, so we naturally think of separating them by date. After observation, the forums use dates in only 5 formats (only 5 seen so far; more can of course be added later). We can use regular expressions to match the lines containing dates; the content between the line numbers of two dates is the comment and the username, which completes the extraction of the comment content.

We pass in the processed text and record the line numbers where dates are matched:

# match dates and return get_list
def match_date(lines):
    pattern1 = re.compile(r'posted in')      # marker phrase before a date (Chinese in the original)
    pattern2 = re.compile('\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}')
    pattern3 = re.compile('\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern4 = re.compile('\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern5 = re.compile(r'Release date')   # another marker phrase (Chinese in the original)
    pre_count = -1
    get_list = []
    # match the date lines
    for string in lines:
        match1 = pattern1.match(string)
        match2 = pattern2.match(string)
        match3 = pattern3.match(string)
        match4 = pattern4.match(string)
        match5 = pattern5.match(string)
        pre_count += 1
        if match1 or match2 or match3 or match4 or match5:
            get_dic = {'count': pre_count, 'date': string}
            get_list.append(get_dic)
    # return the matched date information
    return get_list

Threads with replies and threads without replies have to be handled differently, so we need to treat the cases separately. Because a comment lies between two matched dates, there is a problem: the boundary of the last comment's content area is unclear. Considering that the last reply is usually a single line, we can temporarily use a value of 3 (sub == 3, i.e. one line of comment plus one line of username). A more scientific approach would be to judge the text density of the following lines: if it is very low, a single-line comment is more likely.

The following code records the line numbers of the dates and the differences between consecutive date lines:

# return my_count
def get_count(get_list):
    my_count = []
    date = []
    # collect the line numbers where dates occur
    for i in get_list:
        k, t = i.get('count'), i.get('date')
        my_count.append(k)
        date.append(t)
    if len(get_list) > 1:
        # temporarily use 3 lines for the last comment block
        my_count.append(my_count[-1] + 3)
        return my_count
    else:
        return my_count

# get the differences in line numbers between consecutive dates
def get_sub(my_count):
    sub = []
    for i in range(len(my_count) - 1):
        sub.append(my_count[i + 1] - my_count[i])
    return sub
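
Putting the date handling together, here is a minimal sketch of how match_date(), get_count() and get_sub() chain, assuming lines holds the cleaned text lines of one thread page:

# lines: the cleaned, de-duplicated text lines of one thread page
get_list = match_date(lines)      # [{'count': line_number, 'date': date_text}, ...]
my_count = get_count(get_list)    # line numbers of the dates, plus a +3 sentinel at the end
sub = get_sub(my_count)           # gaps between consecutive date lines
print (sub)                       # mostly 3 for date - comment - username blocks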

Then we classify the cases:

    1. If there is only the original poster and no comments (i.e. my_count == 1), we can use the open source body extraction library Goose to extract the main text.

    2. If there are comments, we need to classify by the value of sub. If sub == 2 is the majority (accounting for more than sub == 3), we assume the usernames have been deleted. There are many possible reasons for this: for example, when someone replies to a specific floor, the repeated username may be removed, or the site may use unusual tags for usernames so that they were stripped out along with the tags. This situation is complex and not very frequent, so it is not considered further; in any case it does not affect comment extraction, we just classify it for completeness.

Note: the cosine similarity discussed below dates from when I was overthinking the problem. Most of the time the order is date - comment - username, and when I later dropped the cosine similarity classification the code got shorter and the accuracy did not decrease. I keep it here to preserve the thought process; just skim the code, and the revised source is at the end.

    3. Then there is the most common case, where sub == 3 is the majority. Since most comments are a single line of text, when sub == 3 we need to decide which line holds the comment. In plain terms, are the three lines date - comment - username, or date - username - comment? Although the first case is by far the most common, we cannot ignore the second. How do we tell these two situations apart? It really had me thinking for a long time, and then I realized the problem could be solved with cosine similarity (an introduction to cosine similarity is easy to find). Simply put, usernames have similar lengths, while comment lengths vary enormously: a username is around 7 characters long, while a comment can be hundreds of characters or just one. So we can compare the candidates pairwise with cosine similarity and take the average; the position with the higher similarity is the username. This lets us identify the comment content and extract it. That is the main idea; the rest is the implementation.
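
As a tiny, hypothetical illustration of the idea (the lengths below are made up): length pairs built from username lines are nearly proportional to each other, so their pairwise cosine similarity stays close to 1, while pairs built from comment lines vary wildly and score lower.

import math

def cos_sim(a, b):
    # cosine similarity of two equal-length vectors
    up = sum(x * y for x, y in zip(a, b))
    down = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return up / down if down else None

# made-up (date length, candidate length) pairs for two floors
username_pairs = [(16, 7), (16, 9)]    # username candidates: similar lengths
comment_pairs = [(16, 240), (16, 3)]   # comment candidates: lengths vary wildly

print (cos_sim(*username_pairs))       # about 0.99
print (cos_sim(*comment_pairs))        # about 0.25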

The relevant code is posted below:

import math
import numpy
from goose import Goose
from goose.text import StopWordsChinese

# get_title() and the pymongo collection spiderbbs_info are defined elsewhere in the full source

# use Goose to extract the body when a post has no replies
def goose_content(my_count, lines, my_url):
    g = Goose({'stopwords_class': StopWordsChinese})
    content_1 = g.extract(url=my_url)
    host = {}
    my_list = []
    host['content'] = content_1.cleaned_text
    host['date'] = lines[my_count[0]]
    host['title'] = get_title(my_url)
    result = {"post": host, "replys": my_list}
    spiderbbs_info.insert(result)

# calculate the cosine similarity of two vectors
def cos_dist(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for a1, b1 in zip(a, b):
        part_up += a1 * b1
        a_sq += a1 ** 2
        b_sq += b1 ** 2
    part_down = math.sqrt(a_sq * b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down

# determine which line of a 3-line block holds the comment
# (the comment may be the middle line or the last line of the block)
def get_3_comment(my_count, lines):
    get_pd_1 = []
    get_pd_2 = []
    # if the interval is 3, take the lengths of the candidate lines
    test_sat_1 = []
    test_sat_2 = []
    for num in range(len(my_count) - 1):
        if my_count[num + 1] - 3 == my_count[num]:
            pd_1 = (len(lines[my_count[num]]), len(lines[my_count[num] + 2]))
            get_pd_1.append(pd_1)
            pd_2 = (len(lines[my_count[num]]), len(lines[my_count[num] + 1]))
            get_pd_2.append(pd_2)
    for i_cos in range(len(get_pd_1) - 1):
        for j_cos in range(i_cos + 1, len(get_pd_1)):
            # calculate the cosine similarity of the length pairs
            test_sat_1.append(cos_dist(get_pd_1[j_cos], get_pd_1[i_cos]))
            test_sat_2.append(cos_dist(get_pd_2[j_cos], get_pd_2[i_cos]))
    # compare the means of the cosine similarities and return which layout is more likely
    get_mean_1 = numpy.array(test_sat_1)
    print (get_mean_1.mean())
    get_mean_2 = numpy.array(test_sat_2)
    print (get_mean_2.mean())
    if get_mean_1.mean() >= get_mean_2.mean():
        return 1
    elif get_mean_1.mean() < get_mean_2.mean():
        return 2

# extract the comments
def solve__3(num, my_count, sub, lines, my_url):
    # if get_3_comment() returned 1, the last line of a block is more likely the username;
    # otherwise the middle line is more likely the username
    if num == 1:
        host = {}
        my_list = []
        host['content'] = ''.join(lines[my_count[0] + 1:my_count[1] + sub[0] - 1])
        host['date'] = lines[my_count[0]]
        host['title'] = get_title(my_url)
        for use in range(1, len(my_count) - 1):
            pl = {'content': ''.join(lines[my_count[use] + 1:my_count[use + 1] - 1]),
                  'date': lines[my_count[use]],
                  'title': get_title(my_url)}
            my_list.append(pl)
        result = {"post": host, "replys": my_list}
        spiderbbs_info.insert(result)
    if num == 2:
        host = {}
        my_list = []
        host['content'] = ''.join(lines[my_count[0] + 2:my_count[1] + sub[0]])
        host['date'] = lines[my_count[0]]
        host['title'] = get_title(my_url)
        for use in range(1, len(my_count) - 1):
            pl = {'content': ''.join(lines[my_count[use] + 2:my_count[use + 1]]),
                  'date': lines[my_count[use]],
                  'title': get_title(my_url)}
            my_list.append(pl)
        result = {"post": host, "replys": my_list}
        spiderbbs_info.insert(result)
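
Finally, here is a sketch of how these pieces might be dispatched for one thread page; the flow is assembled from the description above, and the names are assumptions rather than the author's exact driver code:

# hypothetical per-thread driver built from the steps described above
def process_thread(my_url, lines):
    get_list = match_date(lines)
    my_count = get_count(get_list)
    if len(my_count) == 1:
        # only the original post, no replies: fall back to Goose body extraction
        goose_content(my_count, lines, my_url)
        return
    sub = get_sub(my_count)
    # decide whether the comment sits before (1) or after (2) the username in a 3-line block
    num = get_3_comment(my_count, lines)
    solve__3(num, my_count, sub, lines, my_url)
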
Outlook

To improve extraction accuracy further, more BBS sites should be analyzed, the removal of repeated words should be refined (it is too crude at the moment), the stop word list should be optimized, short threads without replies should be handled better, the original poster's username should be extracted accurately, and so on, but time was too tight for further optimization. My knowledge is still shallow and I have only been learning Python for a few months, so the code inevitably has unreasonable parts; I hope you will offer your valuable comments.

