Objective
Following on from the earlier post on reading the Langya List novels in R, this post continues with some simple text processing in R, splitting and tokenizing text. The idea, again, is to use simple text-processing methods in R to optimize our reading experience, if reading email and reading code also count as reading. The code involved is very simple and uses no extra packages.
Below are two examples, followed at the end by some grumbling and a summary.
1) Splitting R-bloggers subscription emails
2) A quick way to read through an R code base
If you are reading this anywhere other than cnblogs (the Blog Park), note that this post belongs to http://www.cnblogs.com/weibaar.
Clean layout and correctly displayed code blocks and images are only guaranteed there; if you repost on another site, please keep the author information and respect the copyright.
1. Splitting R-bloggers subscription emails
The text data in this case comes from the R-bloggers website. R-bloggers specializes in collecting and republishing articles related to the R language, and it offers a daily mail subscription that sends the day's best articles straight to your mailbox. The site itself is reachable directly, but confirming the subscription requires getting past the firewall (it goes through Google).
I subscribed for about half a year. At first I read it daily, but each message carries a lot of information, some articles did not interest me, my interests gradually shifted elsewhere, and a pile of unread mail accumulated.
We can use Outlook to export the text of these messages in bulk: in Outlook, select all the R-bloggers mail, choose Save As, and all the messages are saved into a single TXT file.
Next is the code example.
The source data text file can be downloaded here:
http://vdisk.weibo.com/s/o_UNVWL3aJf
```r
# 1. Read the data
r_blog <- readLines("F:/R/R-readbooks/R-blogger.txt")
r_blog[20:30]

# 2. Using "Posted:" lines as anchors, extract each article's time, title, and author
# Article release time
sample(r_blog[grep("Posted:", r_blog)], 10)
# Article title (two lines above the "Posted:" line)
sample(r_blog[grep("Posted:", r_blog) - 2], 10)
# Article author (three lines below the "Posted:" line)
sample(r_blog[grep("Posted:", r_blog) + 3], 5)

# 3. Based on the anchors above, split the long text (4.5 MB) by article

# 4. Use library() calls as the condition to see which packages are popular lately
library_list <- strsplit(r_blog[grep("^library\\(", r_blog)], "\\(|\\)|\\,")
library_list <- sapply(library_list, function(e) e[2])
library_list <- gsub("\\p{P}", "", library_list, perl = TRUE)
a <- sort(table(library_list), decreasing = TRUE)
# Most popular
head(a, 20)

# 5. Select all GitHub addresses mentioned in the mails (http lines, excluding images)
url_raw <- r_blog[grepl("http", r_blog) & !grepl("\\.jpg|\\.png", r_blog)]
url_list <- sapply(strsplit(url_raw, "<|>"), function(e) e[2])
url_list <- unique(url_list[!is.na(url_list)])
sample(grep("github", url_list, value = TRUE), 10)
```
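Step 3 above is only described, not shown. Here is a minimal sketch of one way to do the split (my own illustration, not code from the original post): use the "Posted:" anchors to cut the line vector into per-article chunks. The tiny `r_blog` vector below is made-up demo data standing in for the real mail dump read with `readLines()`.

```r
# Made-up miniature of the mail dump; the real vector comes from readLines()
r_blog <- c("Title One", "", "Posted: 01 Nov 2015", "body A", "body B",
            "Title Two", "", "Posted: 02 Nov 2015", "body C")

# Each article starts two lines above its "Posted:" line (where the title sits)
starts <- grep("Posted:", r_blog) - 2

# findInterval() labels every line with the article it falls under; split() groups them
articles <- split(r_blog, findInterval(seq_along(r_blog), starts))
length(articles)   # 2 articles
articles[[1]][1]   # "Title One"
```

The same two-line result could be checked against the title extraction in step 2, since both rely on the fixed offset between the title line and the "Posted:" line.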
Sample diagram:
2. Code text analysis
As just mentioned, R can handle simple text. So let's extend the idea one step: why can't a .R code file also be treated as text to process, and scanned with the same logic as before?
Especially since everyone says to write more code, read other people's code, and accumulate reusable function blocks; yet every time you open someone else's code, the thousands of lines of English are daunting. If we process the code with a program instead, can we extract some patterns and do statistics and analysis from a more objective, more agile angle?
The source data comes from the book Machine Learning for Hackers (Chinese edition title: Machine Learning: Practical Case Analysis).
The book's code can be downloaded directly from its GitHub repository:
https://github.com/johnmyleswhite/ML_for_Hackers
Take a look at the code example.
```r
# 1. Read the raw data. The folder contains more than one .R file, so instead of
#    reading a single file we traverse the folder and read every file in as text
fileslist <- list.files("F:/code/ML_for_Hackers-master/", recursive = TRUE,
                        pattern = "\\.R$", full.names = TRUE)
code_detail <- NULL
for (i in 1:length(fileslist)) {
  code_detail <- c(code_detail, readLines(fileslist[i]))
}

# 2. See which R packages are used
library_list <- strsplit(code_detail[grep("^library\\(", code_detail)], "\\(|\\)|\\,")
library_list <- sapply(library_list, function(e) e[2])
library_list <- gsub("\\p{P}", "", library_list, perl = TRUE)
a <- sort(table(library_list), decreasing = TRUE)
a

# 3. See how much of the total code consists of comments
zhushi <- code_detail[grep("#", code_detail)]   # "zhushi" = comments
length(zhushi) / length(code_detail)
# It accounts for about 30%

# 4. See which functions are custom-defined
function_list <- code_detail[grepl("function\\(", code_detail) &
                               grepl("<-", code_detail) &
                               !grepl("apply", code_detail)]
function_list <- sapply(strsplit(function_list, " "), function(e) e[1])
sample(function_list, 30)
```
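The first-token trick in step 4 only works when there is a space before `<-`. As a small alternative sketch (my own, not from the book or the post), a regex that strips everything from the assignment arrow onward handles both spacings; the `code_detail` lines below are made-up demo data:

```r
# Made-up code lines standing in for code_detail read from the .R files
code_detail <- c("my.mean <- function(x) sum(x) / length(x)",
                 "plot(x, y)  # a call, not a definition",
                 "normalize<-function(v) (v - min(v)) / diff(range(v))")

# Keep definition lines, then drop everything from "<-" to the end of the line
defs <- code_detail[grepl("function\\(", code_detail) & grepl("<-", code_detail)]
fun_names <- sub("\\s*<-.*$", "", defs)
fun_names
# "my.mean"   "normalize"
```

`table(fun_names)` on the real data would then show whether any helper is defined more than once across the repository.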
3. Summary
Now let's summarize the characteristics of the three examples (counting the earlier Langya List post).
Framed as data analysis, they all follow the basic flow of requirements analysis, data acquisition, data processing, and data analysis and integration. In fact, this is the same set of steps that our daily analysis and reporting follow.
And using this kind of text data to get started with R's data-processing and analysis workflow is also a good way to keep a learner interested.
When I was learning R, my biggest pain was that in Coursera's JHU R course the teachers always liked to use social statistics, weather data, or biological data for their examples. That data is too specialized, and much of it concerns European and American society, so working with it always felt painful.
The point of the earlier Langya List post was that when the data is something we care about every day, what we learn actually gets used: everyone has their own analytical tricks and can easily design a data-analysis workflow that derives conclusions from the data. Used well, these methods greatly improve our reading experience and speed up how we accumulate and organize knowledge.
We are surrounded by data, so why not treat text as just another kind of data? Anyone who wants to head toward data work should at least be able to discover, in daily life, the patterns and numbers hiding behind text.
In addition, it should be noted that:
1) The data has its own characteristics, and processing it has its own purpose. Connecting with the business and understanding how it operates is definitely not empty talk.
When reading the Langya List novels, we focus on the characters to string together the plot, along with the relationships between characters, so multi-condition filters such as text[grepl("Fei", text) & grepl("Shan", text)] come up everywhere.
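As a concrete illustration of that kind of multi-condition filter (a sketch with made-up toy lines; the names are placeholders, and the real text would come from `readLines()` on the novel):

```r
# Toy lines standing in for the novel text
text <- c("Fei Liu leapt down from the eaves.",
          "Fei Liu followed them up the Shan road.",
          "The banquet lasted until dawn.")

# Keep only lines where both names co-occur: chain grepl() conditions with &
text[grepl("Fei", text) & grepl("Shan", text)]
# "Fei Liu followed them up the Shan road."
```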
For R-bloggers, the data comes from many authors across a long time span, so what matters more is how to select authors effectively, decide what is useful to us, and track how people's attention shifts over time. That may even call for some word segmentation, text mining, and word-frequency statistics.
For code, we have to master regular expressions so that we can pick out the common patterns among all the {} and () pairs, filter out the arbitrarily named variables, and keep only the functions and the key techniques being applied. The hard part here is extracting usable numbers at all, because code is not as structured as the text in the first two cases.
These are just three simple cases, and their treatment differs a great deal. If you don't understand how characters drive the plot of the novel, don't have your own habits for reading the subscription mail, or can't read code at all, there is no way to take the processing above any further.
So being familiar with the business means being familiar not only with the characteristics of its data but also with its needs; only then can we find the deep spots worth digging into.
2) As a trend, programming for everyone will become an increasingly widespread hard requirement.
Here are a few flights of imagination:
From the angle of text processing: when memorizing vocabulary, why not take the subtitles of an American TV show or a movie, import them into R, match them against an IELTS/TOEFL word list or your own word book, and read each word in the very passage where it appears? (Source of inspiration: the book The Word Social Network.)
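A minimal sketch of that subtitle idea (my own illustration, not from the original post): split subtitle lines into words and keep the lines that contain a word from the word book. Both inputs here are tiny made-up samples; the real ones would be a subtitle file and a vocabulary list read with `readLines()`.

```r
# Made-up subtitle lines and a tiny word book
subtitle <- c("I never meant to deceive you",
              "We will meet again tomorrow",
              "It was sheer coincidence")
wordbook <- c("deceive", "sheer", "pragmatic")

# Lowercase, split each line into words, and flag lines containing a target word
words_per_line <- strsplit(tolower(subtitle), "[^a-z']+")
hit <- sapply(words_per_line, function(w) any(w %in% wordbook))
subtitle[hit]
# lines 1 and 3 contain word-book words
```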
Could the painful editorial work of compiling summary collections and combing through character roles be replaced by a simple program, freeing people from mindless page-turning to hunting down allusions and focusing on the internal logic? No more cutting pages by hand or clipping newspapers; just remember the keywords and the source. (Think of the annotated classical poems we read in high school.)
For running a website: if you often need to watch what promotions a competitor is pushing, could you write a simple crawler that regularly fetches the other side's prices and promotions, so as to understand their operational strategy? (A number of big e-commerce companies already do this, but the tools stay inside the operations teams and are not widely shared.)
If you write scripts, imagine writing your own program to pull every classic trope, in the mood you want, out of thousands of novels; how much efficiency that would bring. With just a little programming, we can be freed from repetitive labor to do genuinely valuable things. I think this is the most valuable part of amateur programming for people outside computing.
By the way, I have recently been taking a Python course on Codecademy. Thanks to the people in this world who are willing to make the boring process of learning to program as lively, interactive, and game-like as it is there. The more such efforts at popularizing programming there are, the more people will be able to write scripts far more complex than the ones in this article; the threshold for programming keeps getting lower.
---
Finally: read more, and actually look at what you collect. I wrote this code to read a bit faster, remember a bit more, and organize things sooner, definitely not to hoard information without reading it. If you go to the trouble of writing code that pulls out all the interesting text and then never read any of it, and never connect the data analysis with the business, how are you different from a fool who knows where the truth is but never looks at it?
By the way, here are some other things I have played with in R; comments welcome:
- R language: installing the xlsx package, and pairing it with VBA to quickly convert xlsx files to CSV
- R language: three examples of exception handling with tryCatch
- R language: a first try at web scraping with the rvest package
- R language: a refined ggplot2 chart example
- R language: scraping Kindle special-offer books (rvest), plus an example of outputting an HTML page from R
- R language: reading the Langya List in R
PS on top of the PS:
This article, together with the earlier one on reading the Langya List novels in R, was prepared as demo material for a talk. But I was too nervous, had prepared some other material, and in the end forgot to mention it, hahaha. Conclusion: if you are going to speak on stage, write what you absolutely must say on a small cheat sheet, or put the main points in the slides; otherwise you will certainly forget. =
Finally, this finishes a post at the end of November and keeps the 2015 one-post-a-month mission alive...
R language: optimizing our reading experience with simple text-processing methods