[Reading notes] Machine Learning: Practical Case Analysis (4)


Chapter 4 — Ranking: Priority Inbox

Supervised versus unsupervised learning: supervised learning has explicit examples of the output it should produce, while unsupervised learning has no known output instances available in advance when it processes the data.
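As a minimal illustration of the distinction (toy data, not from the book): a supervised method is handed the outputs it should learn, while an unsupervised method sees only the inputs.

```r
# Toy data: x is the input, y the known output (the labels).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Supervised: lm() learns the mapping from x to the known outputs y.
fit <- lm(y ~ x)

# Unsupervised: kmeans() groups the observations with no outputs given.
clusters <- kmeans(cbind(x, y), centers = 2)
```

The priority-inbox problem in this chapter is of the second kind: there is no labeled "priority" column, so the chapter constructs a ranking from proxy features instead.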

In theory, the priority features of a message are:

    • Social features: the degree of interaction between the recipient and the sender
    • Content features: feature terms correlated with the recipient's actions on a message (reply, tag, etc.)
    • Thread features: a record of the user's interaction within the current thread
    • Label features: the labels (tags) the user has assigned to messages via filters

Because the available data is limited, this chapter substitutes approximate priority features (the element to extract is shown in parentheses):

    • Social feature: volume of messages from a single sender (sender address)
    • Temporal measure: (date received)
    • Thread feature: thread activity (subject)
    • Content feature: high-frequency term analysis (message body)

# Machine Learning for Hackers
# Chapter 4

library(tm)
library(ggplot2)
library(plyr)

easyham.path <- "ml_for_hackers/03-classification/data/easy_ham/"

  

###############################################
# Extract the feature set
###############################################
# Input: the path to an email file; returns its feature information.
# The file path is saved as the last column of the data, which makes the test
# phase easier to sort (stated in the book; the reason is not explained there).

parse.email <- function(path) {
  full.msg <- msg.full(path)
  date <- get.date(full.msg)
  from <- get.from(full.msg)
  subj <- get.subject(full.msg)
  msg <- get.msg(full.msg)
  return(c(date, from, subj, msg, path))
}

  

# Helper functions
# Read the file contents line by line; each line becomes one element of a vector.

msg.full <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  msg <- readLines(con)
  close(con)
  return(msg)
}
   

  

# Extract information with regular expressions.
# Extract the sender address. Two formats occur:
#   From: jm<[email protected]>;  From: [email protected] (JM)
# grepl() with "From: " as the condition returns TRUE/FALSE for each line.
# Square brackets create a character class (quote, colon, angle brackets,
# space) used as split points; take the first element of the resulting list.
# Filter out the empty elements.
# Find the element containing the "@" character and return it.

get.from <- function(msg.vec) {
  from <- msg.vec[grepl("From: ", msg.vec)]
  from <- strsplit(from, '[":<> ]')[[1]]
  from <- from[which(from != "" & from != " ")]
  return(from[grepl("@", from)][1])
}

  

# Extract the message body

get.msg <- function(msg.vec) {
  msg <- msg.vec[seq(which(msg.vec == "")[1] + 1, length(msg.vec), 1)]
  return(paste(msg, collapse = "\n"))
}

  

# Extract the subject.
# Regex-match the subject header (some messages have no subject).
# If the match has length greater than 0, return the 2nd element of the split
# (the 1st element is "Subject: "); otherwise return an empty string.
# Without this condition, grepl() would propagate a zero-length value such as
# integer(0) or character(0) when nothing matches.

get.subject <- function(msg.vec) {
  subj <- msg.vec[grepl("Subject: ", msg.vec)]
  if (length(subj) > 0) {
    return(strsplit(subj, "Subject: ")[[1]][2])
  } else {
    return("")
  }
}

  

# Extract the date.
# Problems to solve:
# 1. Many lines in a message can match "Date:", but the one we need starts the
#    line, so anchor the regular expression with a caret: "^Date:".
# 2. The pattern may still match inside the body, so keep only the first
#    successful match.
# 3. Clean the text: split on plus, minus, or colon.
# 4. Strip leading and trailing whitespace.
# 5. Keep only the part matching the date format by trimming to 25 characters.

get.date <- function(msg.vec) {
  date.grep <- grepl("^Date: ", msg.vec)
  date.grep <- which(date.grep == TRUE)
  date <- msg.vec[date.grep[1]]
  date <- strsplit(date, "\\+|\\-|: ")[[1]][2]
  date <- gsub("^\\s+|\\s+$", "", date)
  return(strtrim(date, 25))
}

  

# Process the email files

easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs != "cmds")]
easyham.parse <- lapply(easyham.docs,
                        function(p) parse.email(paste(easyham.path, p, sep = "")))
ehparse.matrix <- do.call(rbind, easyham.parse)
allparse.df <- data.frame(ehparse.matrix, stringsAsFactors = FALSE)
names(allparse.df) <- c("Date", "From.EMail", "Subject", "Message", "Path")

  

# The date formats are not uniform; further processing converts them to POSIX
# objects in R so the messages can be sorted.
# Convert strings of the forms "Wed, 04 Dec 2002 11:36:32" and
# "04 Dec 2002 11:49:23" to POSIX format.

date.converter <- function(dates, pattern1, pattern2) {
  pattern1.convert <- strptime(dates, pattern1)
  pattern2.convert <- strptime(dates, pattern2)
  pattern1.convert[is.na(pattern1.convert)] <- pattern2.convert[is.na(pattern1.convert)]
  return(pattern1.convert)
}

  

# Specify the two formats to convert

pattern1 <- "%a, %d %b %Y %H:%M:%S"
pattern2 <- "%d %b %Y %H:%M:%S"

  

# System locale: the LC_TIME setting affects as.POSIXlt() and strptime().
# Without it, the Date column ends up missing in the subsequent processing.

Sys.setlocale("LC_TIME", "C")
allparse.df$Date <- date.converter(allparse.df$Date, pattern1, pattern2)

  

# Convert to lowercase for a uniform format

allparse.df$Subject <- tolower(allparse.df$Subject)
allparse.df$From.EMail <- tolower(allparse.df$From.EMail)

  

# Sort by date. Note the sorting idiom with(df, order(...)): not intuitive,
# but commonly used; the default order is ascending.
# Produce the training set: the first half of the sorted data.

priority.df <- allparse.df[with(allparse.df, order(Date)), ]
priority.train <- priority.df[1:(round(nrow(priority.df) / 2)), ]

  

######################################
# Set the weighting strategy
######################################
# ddply() operates on the data frame grouped by From.EMail; summarise()
# creates a Freq column holding each group's frequency.
# The Date column interferes with the ddply() operation, so either drop the
# column or convert it to character; otherwise an error occurs.
# Applying the book's code directly raises exactly that error.
# Two workarounds are given here; either one works.

priority.train.temp1 <- priority.train[, c(2, 3, 4, 5)]
priority.train.temp2 <- priority.train
priority.train.temp2$Date <- as.character(priority.train.temp2$Date)
from.weight <- ddply(priority.train.temp2, .(From.EMail),
                     summarise, Freq = length(Subject))

  

# The book's source files give another approach; the result is the same.

library(reshape2)
from.weight <- melt(with(priority.train, table(From.EMail)), value.name = "Freq")

  

# Visualize the result: sort, keep only senders with frequency greater
# than 6, and plot.

from.weight <- from.weight[with(from.weight, order(Freq)), ]
from.ex <- subset(from.weight, Freq > 6)
from.scales <- ggplot(from.ex) +
  geom_rect(aes(xmin = 1:nrow(from.ex) - 0.5,
                xmax = 1:nrow(from.ex) + 0.5,
                ymin = 0,
                ymax = Freq,
                fill = "lightgrey",
                color = "darkblue")) +
  scale_x_continuous(breaks = 1:nrow(from.ex), labels = from.ex$From.EMail) +
  coord_flip() +
  scale_fill_manual(values = c("lightgrey" = "lightgrey"), guide = "none") +
  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
  ylab("Number of Emails Received (truncated at 6)") +
  xlab("Sender Address") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 5, hjust = 1))
print(from.scales)
   

  

# The bar chart shows that a small number of senders account for a large
# volume; in extreme cases this would skew the weights.
# Solution: rescale the weights so that a few extreme senders do not distort
# the overall threshold calculation.
# Log transforms: compare the absolute counts with the natural-log and
# common-log (base-10) transformed weights.

from.weight <- transform(from.weight,
                         Weight = log(Freq + 1),
                         log10Weight = log10(Freq + 1))
from.rescaled <- ggplot(from.weight, aes(x = 1:nrow(from.weight))) +
  geom_line(aes(y = Weight, linetype = "ln")) +
  geom_line(aes(y = log10Weight, linetype = "log10")) +
  geom_line(aes(y = Freq, linetype = "Absolute")) +
  scale_linetype_manual(values = c("ln" = 1,
                                   "log10" = 2,
                                   "Absolute" = 3),
                        name = "Scaling") +
  xlab("") +
  ylab("Number of emails received") +
  theme_bw() +
  theme(axis.text.y = element_blank(), axis.text.x = element_blank())
print(from.rescaled)

  

# The absolute counts are too steep, and the common-log transform flattens the
# differences too much; the natural-log transform is the most reasonable.

# A warning from the book: the feature set must not contain observations with
# a value of 0, or log() returns -Inf (negative infinity) and ruins the whole
# result.
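That warning is easy to demonstrate with toy numbers (not from the book), and it is why the weights above use Freq + 1 before taking the logarithm:

```r
raw.freq <- c(0, 1, 9)   # a sender with zero messages would break log()

log(raw.freq)            # first element is -Inf, which would poison any
                         # product of weights downstream
log(raw.freq + 1)        # the +1 smoothing keeps every weight finite
```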

# Computing thread-activity weights.
# A subject beginning with "re:" marks a reply, which identifies the thread.
# Return the sender and initial thread subject for every message with "re:".

find.threads <- function(email.df) {
  # Subjects were lowercased earlier, so split on "re: " (not "Re: ").
  response.threads <- strsplit(email.df$Subject, "re: ")
  is.thread <- sapply(response.threads,
                      function(subj) ifelse(subj[1] == "", TRUE, FALSE))
  threads <- response.threads[is.thread]
  senders <- email.df$From.EMail[is.thread]
  threads <- sapply(threads,
                    function(t) paste(t[2:length(t)], collapse = "re: "))
  return(cbind(senders, threads))
}

threads.matrix <- find.threads(priority.train)

  

# Assign weights to senders based on their activity within threads, but only
# for senders who appear in threads.matrix.
# Again use natural-log-transformed weights.

email.thread <- function(threads.matrix) {
  senders <- threads.matrix[, 1]
  senders.freq <- table(senders)
  senders.matrix <- cbind(names(senders.freq),
                          senders.freq,
                          log(senders.freq + 1))
  senders.df <- data.frame(senders.matrix, stringsAsFactors = FALSE)
  row.names(senders.df) <- 1:nrow(senders.df)
  names(senders.df) <- c("From.EMail", "Freq", "Weight")
  senders.df$Freq <- as.numeric(senders.df$Freq)
  senders.df$Weight <- as.numeric(senders.df$Weight)
  return(senders.df)
}

senders.df <- email.thread(threads.matrix)

  

# Append weights based on threads known to be active: assuming the thread is
# known, users will consider the more active threads more important.
# unique() collects all thread names; thread.counts() computes each thread's
# activity and weight.
# Finally merge, returning the thread name, frequency, time span, and weight.

get.threads <- function(threads.matrix, email.df) {
  threads <- unique(threads.matrix[, 2])
  thread.counts <- lapply(threads,
                          function(t) thread.counts(t, email.df))
  thread.matrix <- do.call(rbind, thread.counts)
  return(cbind(threads, thread.matrix))
}

  

# Given a thread subject and the training data, count how many messages the
# thread received in the training data using all its date/time stamps.
# thread.times holds the thread's timestamps; its vector length is the number
# of messages the thread received.
# time.span is how long the thread existed within the training data, used to
# compute the activity level.
# log.trans.weight is the common-log (base-10) transformed weight.
# A thread with only one message may have started before the training data
# began collecting or ended after it stopped; such threads are excluded by
# returning missing values.
# In practice the frequency is small and the time span large, so trans.weight
# is far below 1 and its logarithm is negative. To keep negative values out of
# the weight calculation, apply an affine transformation: add 10.

thread.counts <- function(thread, email.df) {
  thread.times <- email.df$Date[which(email.df$Subject == thread |
                                      email.df$Subject == paste("re:", thread))]
  freq <- length(thread.times)
  min.time <- min(thread.times)
  max.time <- max(thread.times)
  time.span <- as.numeric(difftime(max.time, min.time, units = "secs"))
  if (freq < 2) {
    return(c(NA, NA, NA))
  } else {
    trans.weight <- freq / time.span
    log.trans.weight <- 10 + log(trans.weight, base = 10)
    return(c(freq, time.span, log.trans.weight))
  }
}

  

# Generate the weight data frame, coerce the columns, and finally drop the
# rows with missing values using subset().

thread.weights <- get.threads(threads.matrix, priority.train)
thread.weights <- data.frame(thread.weights, stringsAsFactors = FALSE)
names(thread.weights) <- c("Thread", "Freq", "Response", "Weight")
thread.weights$Freq <- as.numeric(thread.weights$Freq)
thread.weights$Response <- as.numeric(thread.weights$Response)
thread.weights$Weight <- as.numeric(thread.weights$Weight)
thread.weights <- subset(thread.weights, is.na(thread.weights$Freq) == FALSE)

  

As you can see, even threads with the same frequency receive different weights depending on their response time. Some people may not sort their mail this way, but as a general-purpose solution this kind of quantification is needed.
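The weighting rule above can be sketched in isolation (toy numbers, not from the book): two threads with the same message count get different weights when their time spans differ.

```r
# Thread-activity weight: affine-shifted log10 of messages per second.
thread.weight <- function(freq, span.secs) {
  10 + log10(freq / span.secs)
}

thread.weight(5, 1000)    # 5 messages in ~17 minutes: more active, higher weight
thread.weight(5, 100000)  # same 5 messages over ~28 hours: lower weight
```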

# Weighting frequent terms in threads: assume that high-frequency terms in the
# subjects of active threads matter more than low-frequency terms and terms
# that appear in inactive threads.
# term.counts(): given a vector of text and a list of TDM options, build the
# term-document matrix and return each term's frequency across all threads.

term.counts <- function(term.vec, control) {
  vec.corpus <- Corpus(VectorSource(term.vec))
  vec.tdm <- TermDocumentMatrix(vec.corpus, control = control)
  return(rowSums(as.matrix(vec.tdm)))
}

  

# Compute the term frequencies and keep only the terms themselves.
# Assign each term a weight: the mean of the weights of all threads whose
# subject contains the term.
# Convert the vector to a data frame, move the terms out of the names into a
# column, and replace the row names with row numbers.

thread.terms <- term.counts(thread.weights$Thread,
                            control = list(stopwords = stopwords()))
thread.terms <- names(thread.terms)
term.weights <- sapply(thread.terms,
                       function(t) mean(thread.weights$Weight[grepl(t, thread.weights$Thread, fixed = TRUE)]))
term.weights <- data.frame(list(Term = names(term.weights),
                                Weight = term.weights),
                           stringsAsFactors = FALSE,
                           row.names = 1:length(term.weights))

  

# Message term weights: assume a message that resembles previously read
# messages is more important than a completely unfamiliar one.
# Count the terms appearing in the message bodies and use log-transformed
# frequencies as weights.
# Drop the non-positive weights.

msg.terms <- term.counts(priority.train$Message,
                         control = list(stopwords = stopwords(),
                                        removePunctuation = TRUE,
                                        removeNumbers = TRUE))
msg.weights <- data.frame(list(Term = names(msg.terms),
                               Weight = log(msg.terms, base = 10)),
                          stringsAsFactors = FALSE,
                          row.names = 1:length(msg.terms))
msg.weights <- subset(msg.weights, Weight > 0)

  

##############################################################
# Summary: five weight data frames in total:
# from.weight    (social feature)
# senders.df     (sender activity within threads)
# thread.weights (thread activity)
# term.weights   (terms from active threads)
# msg.weights    (common terms across all messages)
##############################################################

####################################
# Train and test the ranking algorithm
####################################
# The idea is to give every email a priority rank by multiplying together all
# the weights described above.
# So each email must be parsed, its features extracted and matched against the
# weight data frames, and the weight values looked up.
# The product of these weight values is the ranking criterion.
####################
# First, the weight lookup, i.e. for subject and body terms.
# Three parameters: the term to find (a string), the lookup target (a weight
# data frame), and the lookup type (TRUE for a term, FALSE for a thread).
# Returns the weight value.
# Failed lookups:
# 1. Check that the search term passed to get.weights() has length greater
#    than 0; for invalid input return 1, which does not affect the product.
# 2. match() returns NA for elements with no match; replace NA with 1 by
#    checking whether match.weights is empty.

get.weights <- function(search.term, weight.df, term = TRUE) {
  if (length(search.term) > 0) {
    if (term) {
      term.match <- match(names(search.term), weight.df$Term)
    } else {
      term.match <- match(search.term, weight.df$Thread)
    }
    match.weights <- weight.df$Weight[which(!is.na(term.match))]
    # The book's code has an error here, but the intent is expressed correctly.
    if (length(match.weights) < 1) {
      return(1)
    } else {
      return(mean(match.weights))
    }
  } else {
    return(1)
  }
}

  

# Input: an email path; returns the ranking weight (rank).

rank.message <- function(path) {
  # Extract the features; msg: [1] date, [2] sender, [3] subject,
  # [4] body, [5] path
  msg <- parse.email(path)

  # Weighting based on message author.
  # First, just the sender's total frequency: look up the sender-address
  # weight, returning 1 if unmatched.
  from <- ifelse(length(which(from.weight$From.EMail == msg[2])) > 0,
                 from.weight$Weight[which(from.weight$From.EMail == msg[2])],
                 1)

  # Second, based on senders in threads, and the threads themselves: look up
  # the sender's thread-activity weight, returning 1 if unmatched.
  thread.from <- ifelse(length(which(senders.df$From.EMail == msg[2])) > 0,
                        senders.df$Weight[which(senders.df$From.EMail == msg[2])],
                        1)

  # Parse the subject to decide whether the message is part of a thread.
  subj <- strsplit(tolower(msg[3]), "re: ")
  is.thread <- ifelse(subj[[1]][1] == "", TRUE, FALSE)

  # Look up and match the thread-activity weight.
  if (is.thread) {
    activity <- get.weights(subj[[1]][2], thread.weights, term = FALSE)
  } else {
    activity <- 1
  }

  # Next, weight based on terms.
  # Weight based on terms in threads: match against the active-thread terms.
  thread.terms <- term.counts(msg[3], control = list(stopwords = TRUE))
  thread.terms.weights <- get.weights(thread.terms, term.weights)

  # Weight based on terms in all messages: match the body terms.
  msg.terms <- term.counts(msg[4],
                           control = list(stopwords = TRUE,
                                          removePunctuation = TRUE,
                                          removeNumbers = TRUE))
  msg.weights <- get.weights(msg.terms, msg.weights)

  # Calculate rank by interacting all weights: the ranking criterion is the
  # product of every weight found.
  rank <- prod(from, thread.from, activity,
               thread.terms.weights, msg.weights)

  # Return date, sender, subject, and the rank product.
  return(c(msg[1], msg[2], msg[3], rank))
}

  

# Run the ranking algorithm.
# Split into training and test data by time. Note that round() on a .5 value
# returns the nearest even number.

train.paths <- priority.df$Path[1:(round(nrow(priority.df) / 2))]
test.paths <- priority.df$Path[((round(nrow(priority.df) / 2)) + 1):nrow(priority.df)]
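The round() caveat is easy to check directly: R rounds halves to the nearest even number (IEC 60559 semantics), which determines exactly where the train/test split falls when the row count is odd.

```r
round(0.5)  # 0  (rounds to even, not up)
round(1.5)  # 2
round(2.5)  # 2  (also 2: the nearest even number)
```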

  

# Process the training data, returning the rank values.
# The warnings can be ignored; wrap the call in suppressWarnings().

train.ranks <- suppressWarnings(lapply(train.paths, rank.message))
train.ranks.matrix <- do.call(rbind, train.ranks)
train.ranks.matrix <- cbind(train.paths, train.ranks.matrix, "TRAINING")
train.ranks.df <- data.frame(train.ranks.matrix, stringsAsFactors = FALSE)
names(train.ranks.df) <- c("Message", "Date", "From", "Subj", "Rank", "Type")
train.ranks.df$Rank <- as.numeric(train.ranks.df$Rank)

  

# Compute the priority threshold (the median rank) and store the
# recommendation in a new column.

priority.threshold <- median(train.ranks.df$Rank)
train.ranks.df$Priority <- ifelse(train.ranks.df$Rank >= priority.threshold, 1, 0)

  

# Visualize the result of the chosen threshold:

threshold.plot <- ggplot(train.ranks.df, aes(x = Rank)) +
  stat_density(aes(fill = "darkred")) +
  geom_vline(xintercept = priority.threshold, linetype = 2) +
  scale_fill_manual(values = c("darkred" = "darkred"), guide = "none") +
  theme_bw()
print(threshold.plot)

  

You can see that the threshold is about 24, and the rank distribution is clearly heavy-tailed, which suggests the ranking algorithm performs well on the training set.

The book also mentions using the standard deviation as the threshold, which gives a value of about 90. The median is the more inclusive choice; a standard-deviation threshold would exclude the majority of messages.
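To see the difference between the two threshold choices, here is a synthetic heavy-tailed sample (not the book's data, and reading "standard deviation threshold" as mean plus one standard deviation): the median flags exactly half the observations as priority, while the standard-deviation cutoff flags far fewer.

```r
set.seed(1)
ranks <- rlnorm(1000, meanlog = 3, sdlog = 1)  # heavy-tailed stand-in for Rank

median.cut <- median(ranks)
sd.cut <- mean(ranks) + sd(ranks)

mean(ranks >= median.cut)  # 0.5: half the messages marked priority
mean(ranks >= sd.cut)      # a much smaller fraction marked priority
```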

# Evaluate on the test set

test.ranks <- suppressWarnings(lapply(test.paths, rank.message))
test.ranks.matrix <- do.call(rbind, test.ranks)
test.ranks.matrix <- cbind(test.paths, test.ranks.matrix, "TESTING")
test.ranks.df <- data.frame(test.ranks.matrix, stringsAsFactors = FALSE)
names(test.ranks.df) <- c("Message", "Date", "From", "Subj", "Rank", "Type")
test.ranks.df$Rank <- as.numeric(test.ranks.df$Rank)
test.ranks.df$Priority <- ifelse(test.ranks.df$Rank >= priority.threshold, 1, 0)

  

# Merge the training and test sets

final.df <- rbind(train.ranks.df, test.ranks.df)
Sys.setlocale("LC_TIME", "C")
final.df$Date <- date.converter(final.df$Date, pattern1, pattern2)
final.df <- final.df[rev(with(final.df, order(Date))), ]

  

Stack the test-data density on top of the training-data ranking density:

testing.plot <- ggplot(subset(final.df, Type == "TRAINING"), aes(x = Rank)) +
  stat_density(aes(fill = Type, alpha = 0.65)) +
  stat_density(data = subset(final.df, Type == "TESTING"),
               aes(fill = Type, alpha = 0.65)) +
  geom_vline(xintercept = priority.threshold, linetype = 2) +
  scale_alpha(guide = "none") +
  scale_fill_manual(values = c("TRAINING" = "darkred", "TESTING" = "darkblue")) +
  theme_bw()
print(testing.plot)

  

Chart analysis:

The test-data distribution has more density in the tail, which indicates that more messages receive a low priority value.

The density estimate is less smooth, indicating that the test data contain many features that never appear in the training data.

Looking at the top-ranked messages in the test data:

The ranking algorithm behaves well at the thread level: messages from the same thread land in roughly the same group, while identical subjects receive different priorities for different senders.

