Web crawler: crawling network data with regular expressions
Crawling data from the network is not specific to iOS development; it comes up in other kinds of development too, where it is also known as web crawling. There are roughly two ways to implement it:
- 1: Regular expressions
- 2: A toolkit in another language, such as Java or Python
Let's take a look at the fundamentals of Web crawlers:
The framework of a generic web crawler:
The basic workflow of a web crawler is as follows:
1. Select a set of seed URLs.
2. Put these URLs into the queue of URLs to crawl.
3. Take a URL from the to-crawl queue, resolve DNS to get the host's IP address, download the page for that URL, and store it in the downloaded-page library. Then move the URL into the crawled-URL queue.
4. Analyze the pages behind the URLs in the crawled-URL queue, extract the other URLs they contain, and put those into the to-crawl queue, which starts the next loop.
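The four steps above can be sketched as a small loop. This is only an illustration, not the author's code: `CrawlFromSeeds`, `PageFetcher`, `LinkExtractor`, and the `maxPages` limit are hypothetical names introduced here, and downloading (including DNS resolution) and link extraction are passed in as blocks so the loop stays independent of both.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper types: one block downloads a page (returning nil on
// failure), the other extracts the URLs found in a downloaded page.
typedef NSString *(^PageFetcher)(NSString *url);
typedef NSArray<NSString *> *(^LinkExtractor)(NSString *html);

// Seed URLs -> to-crawl queue -> download and store -> extract new URLs and
// enqueue them. Returns the downloaded pages keyed by URL.
static NSDictionary<NSString *, NSString *> *CrawlFromSeeds(NSArray<NSString *> *seeds,
                                                            NSUInteger maxPages,
                                                            PageFetcher fetch,
                                                            LinkExtractor extractLinks) {
    NSMutableArray<NSString *> *toCrawl = [NSMutableArray array];   // queue of URLs to crawl
    NSMutableSet<NSString *> *seen = [NSMutableSet set];            // every URL ever queued
    NSMutableDictionary<NSString *, NSString *> *pages = [NSMutableDictionary dictionary];

    // Steps 1 and 2: put the seed URLs into the to-crawl queue.
    for (NSString *seed in seeds) {
        if (![seen containsObject:seed]) {
            [seen addObject:seed];
            [toCrawl addObject:seed];
        }
    }

    while (toCrawl.count > 0 && pages.count < maxPages) {
        // Step 3: take the next URL, download it, store the page.
        NSString *url = toCrawl.firstObject;
        [toCrawl removeObjectAtIndex:0];

        NSString *html = fetch(url);
        if (html == nil) {
            continue;                                               // download failed, skip it
        }
        pages[url] = html;

        // Step 4: extract further URLs and queue the ones not seen before.
        for (NSString *link in extractLinks(html)) {
            if (![seen containsObject:link]) {
                [seen addObject:link];
                [toCrawl addObject:link];
            }
        }
    }
    return pages;
}
```

Keeping one `seen` set for both queued and crawled URLs is a simplification of the two queues described above; it is enough to stop the loop from visiting the same URL twice.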
The following is my personal understanding.
Network data fetching
- Concept: network data crawling, also known as web crawling, means fetching the data on a web page from within our iOS program.
- Use: when you want to use some of a website's data, you need a data-crawling technique.
- Recommendation: while writing the crawling code, make heavy use of categories and write some category methods. This improves the readability of the program and also its efficiency.
Let's introduce the first one today: regular expressions
Points to note:
Crawling network data is actually very simple, but it does rely on regular expressions. Some people say these are not difficult, some say they are difficult, and some say they are very difficult; in fact, for crawling data we only use three symbols: ".", "*", and "?".
In a regular expression, "." matches any single character except a line break, and "*" repeats the preceding item any number of times, so ".*" matches any run of characters. Adding "?" after ".*" makes the match stop at the nearest occurrence of whatever follows; without it, the match extends to the farthest one.
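The nearest versus farthest behaviour is easiest to see by running the same text through both forms. The sketch below is an illustration only; `FirstCapture` is a hypothetical helper introduced here, built on `NSRegularExpression`.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text captured by the first group of the
// first match of `pattern` in `text`, or nil when there is no match.
static NSString *FirstCapture(NSString *pattern, NSString *text) {
    if (text.length == 0) {
        return nil;
    }
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text options:0 range:NSMakeRange(0, text.length)];
    if (match == nil || match.numberOfRanges < 2) {
        return nil;
    }
    NSRange captured = [match rangeAtIndex:1];
    if (captured.location == NSNotFound) {
        return nil;                      // the group did not take part in the match
    }
    return [text substringWithRange:captured];
}
```

For the text `<b>one</b><b>two</b>`, the pattern `<b>(.*?)</b>` captures `one`, because the match stops at the nearest `</b>`. The pattern `<b>(.*)</b>` captures `one</b><b>two`, because without "?" the match runs on to the farthest `</b>`.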
    NSString *pantten = @"<ul class=\"cs_list\">(.*?)</ul>";

    NSRegularExpression *regx = [NSRegularExpression regularExpressionWithPattern:pantten
                                                                          options:NSRegularExpressionCaseInsensitive | NSRegularExpressionDotMatchesLineSeparators
                                                                            error:NULL];
Two of the options passed here are important to understand:
- NSRegularExpressionCaseInsensitive: matching ignores case
- NSRegularExpressionDotMatchesLineSeparators: lets the "." character match line separators as well
To capture data, the main skill is writing the matching string:
- (.*?) marks the content you want to capture.
- .*? stands for content to skip, whatever it happens to be.
- Escaping inside the string: escape double quotes with \ and escape brackets with \\
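Putting these rules together, one way to collect every captured block from a page is sketched below. It is an illustration under assumptions: `CapturesInHTML` is a hypothetical helper name, and the sample HTML in the usage note is made up.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of capture group 1 for every match of
// `pattern` in `html`. Matching ignores case and lets "." cross line breaks.
static NSArray<NSString *> *CapturesInHTML(NSString *html, NSString *pattern) {
    if (html.length == 0) {
        return @[];
    }
    NSRegularExpressionOptions options =
        NSRegularExpressionCaseInsensitive | NSRegularExpressionDotMatchesLineSeparators;
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:options error:&error];
    if (regex == nil) {
        NSLog(@"invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *captures = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        if (match.numberOfRanges < 2) {
            continue;                    // the pattern has no capture group
        }
        NSRange captured = [match rangeAtIndex:1];
        if (captured.location != NSNotFound) {
            [captures addObject:[html substringWithRange:captured]];
        }
    }
    return captures;
}
```

Called with the pattern `@"<ul class=\"cs_list\">(.*?)</ul>"` on a page containing two such lists, this returns two strings, one per list. A list whose content spans several lines is captured whole, because the "." is allowed to match line breaks.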
When developing a project we often need data from the Internet, and in that case we may have to write a crawler to fetch it. Typically we use regular expressions to match against the HTML and pull out the data we need. The work usually breaks down into three steps:
1. Get the HTML of the web page
2. Use regular expressions to get the data we need
3. Analyze and use the obtained data (for example, save it to a database)
Next we analyze the code:
1. Get HTML for Web pages
For pages that do not require submitting POST data, we can simply use the NSURL class to get the HTML we need and decode it as kCFStringEncodingGB_18030_2000, which solves the problem of garbled Chinese text.

    + (NSString *)urlstring:(NSString *)strurl {
        NSURL *url = [NSURL URLWithString:strurl];
        NSData *data = [NSData dataWithContentsOfURL:url];
        if (data == nil) {
            return nil; // nothing was downloaded
        }

        // Decode the bytes as GB18030 so that Chinese text is not garbled.
        NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
        NSString *retStr = [[NSString alloc] initWithData:data encoding:enc];

        // NSLog(@"html = %@", retStr);
        return retStr;
    }
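Not every page is served as GB18030, so one possible variation is to try UTF-8 first and fall back to GB18030 only when that fails. The sketch below assumes ARC, and `DecodeHTML` is a hypothetical helper name. It is a heuristic rather than a reliable detector, since some GB18030 byte sequences can also happen to be valid UTF-8.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes downloaded bytes as UTF-8 when they are valid
// UTF-8, and otherwise as GB18030. Returns nil when neither decoding works.
static NSString *DecodeHTML(NSData *data) {
    if (data.length == 0) {
        return nil;
    }
    // initWithData:encoding: returns nil when the bytes do not fit the encoding.
    NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    if (html != nil) {
        return html;
    }
    NSStringEncoding gb =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    return [[NSString alloc] initWithData:data encoding:gb];
}
```

When the page states its encoding in the HTTP headers or in a meta tag, reading that declaration is more dependable than guessing from the bytes.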
For web pages that require a POST submission, we can take advantage of the powerful ASIFormDataRequest class, for example:

    // urlPost and FCLog are defined elsewhere in the project.
    + (void)getPostResult:(NSString *)startqi {
        ASIFormDataRequest *request = [[ASIFormDataRequest alloc] initWithURL:[NSURL URLWithString:urlPost]];

        [request setPostValue:startqi forKey:@"startqi"];
        [request setPostValue:@"20990101001" forKey:@"endqi"];
        [request setPostValue:@"qihao" forKey:@"searchType"]; // the search method used by the web page
        [request startSynchronous];

        NSData *data = [request responseData];

        if (data == nil) {
            FCLog(@"have not data");
        }
        else {
            NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
            NSString *retStr = [[NSString alloc] initWithData:data encoding:enc];
            FCLog(@"html = %@", retStr);
        }
    }
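ASIFormDataRequest comes from a third-party library. As an alternative that uses only Foundation, the same kind of form submission can be expressed as a plain `NSURLRequest`. This is a swapped-in technique, not the author's method; it assumes ARC, and `FormEncode`, `FormPostRequest`, and the URL and values in the usage note are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Percent-encodes one form key or value, leaving only unreserved ASCII as-is.
static NSString *FormEncode(NSString *value) {
    NSCharacterSet *unreserved = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    return [value stringByAddingPercentEncodingWithAllowedCharacters:unreserved] ?: @"";
}

// Hypothetical helper: builds a POST request whose body carries `fields` as
// application/x-www-form-urlencoded data. Keys are sorted for a stable body.
static NSURLRequest *FormPostRequest(NSURL *url, NSDictionary<NSString *, NSString *> *fields) {
    NSMutableArray<NSString *> *pairs = [NSMutableArray array];
    for (NSString *key in [fields.allKeys sortedArrayUsingSelector:@selector(compare:)]) {
        [pairs addObject:[NSString stringWithFormat:@"%@=%@",
                          FormEncode(key), FormEncode(fields[key])]];
    }
    NSString *body = [pairs componentsJoinedByString:@"&"];

    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"POST";
    [request setValue:@"application/x-www-form-urlencoded" forHTTPHeaderField:@"Content-Type"];
    request.HTTPBody = [body dataUsingEncoding:NSUTF8StringEncoding];
    return request;
}
```

A request built this way, for example `FormPostRequest(url, @{@"startqi": @"2013001", @"endqi": @"20990101001"})` with placeholder values, can then be sent with `[[NSURLSession sharedSession] dataTaskWithRequest:completionHandler:]`, and the returned data decoded in the completion handler as before.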
That way, we've got two ways to get the HTML we need.
2. Analyzing HTML
For regular-expression matching, I extended the NSString class with a category method, - (NSMutableArray *)substringByRegular:(NSString *)regular, which returns an array of all the matches for the regular expression passed in.

    @implementation NSString (StringRegular)

    - (NSMutableArray *)substringByRegular:(NSString *)regular {
        NSString *reg = regular;
        NSMutableArray *arr = [NSMutableArray array];

        // First match; a failed search returns the range {NSNotFound, 0}.
        NSRange r = [self rangeOfString:reg options:NSRegularExpressionSearch];
        int i = 0;

        while (r.location != NSNotFound && r.length != 0) {
            FCLog(@"index = %i regIndex = %lu loc = %lu", (++i), (unsigned long)r.length, (unsigned long)r.location);

            NSString *substr = [self substringWithRange:r];
            FCLog(@"substr = %@", substr);
            [arr addObject:substr];

            // Continue searching in the part of the string after this match.
            NSRange startr = NSMakeRange(r.location + r.length, [self length] - r.location - r.length);
            r = [self rangeOfString:reg options:NSRegularExpressionSearch range:startr];
        }
        return arr;
    }

    @end
Here we first need the regular expression for the data we want. I won't go into the Martian-looking syntax of regular expressions, which I struggle with too, but one thing holds: the regular expression you write must match the data we need while filtering out invalid information. A single match may not be enough to get there, so regular expressions can be applied several times. Below is my expression; in my example I apply regular expressions twice.

    NSString *regstr = @"<td class=\'z_bg_05\'>\\w{11}</td><td class=\'z_bg_13\'>(\\w{2}\\s{0,1})*</td>";
    NSMutableArray *arr = [strHtml substringByRegular:regstr];
3. Analyze or use the data. Here I simply save the data to the database (SQLite3), using the method described in my previous blog post.
Each element of the arr array corresponds to one record in my database table, but I do not need information such as the td class, so I use regular expressions again to analyze each NSString.

    if (arr != nil && [arr count] > 0) {

        NSString *preReg = @"\\w{11}";
        NSString *backReg = @"(\\w{2}\\s{0,1}){8}";

        TicketResultService *service = [[TicketResultService alloc] init];
        [[Sqlite3Helper instance] openDB];
        for (NSString *sub in arr) {

            TicketResult *r = [[[TicketResult alloc] init] autorelease];

            // First pass: the 11-character section id.
            NSMutableArray *preArr = [sub substringByRegular:preReg];
            if (preArr != nil && [preArr count] > 0) {
                r.sectionId = (NSString *)[preArr objectAtIndex:0];
            }
            else {
                continue;
            }

            // Second pass: the eight two-character result fields.
            NSMutableArray *backArr = [sub substringByRegular:backReg];
            if (backArr != nil && [backArr count] > 0) {
                r.result = [backArr objectAtIndex:0];
            }
            else {
                continue;
            }

            // Skip records that are already stored.
            if ([service isExist:r.sectionId]) {
                continue;
            }

            r.type = [NSNumber numberWithInt:1];
            [service addModel:r];
        }
        [[Sqlite3Helper instance] closeDB];
        [service release];
    }
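As an alternative to running two passes over each matched row, both fields can be pulled out with capture groups in a single expression. This sketch is a variation on the author's pattern, not the author's code: `ParseTicketRows` and the dictionary keys are hypothetical, and the row markup is assumed to look like the one matched above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: finds every result row in `html` and returns one
// dictionary per row with the keys "sectionId" and "result".
static NSArray<NSDictionary<NSString *, NSString *> *> *ParseTicketRows(NSString *html) {
    if (html.length == 0) {
        return @[];
    }
    // Group 1: the 11-character id. Group 2: the two-character fields.
    NSString *pattern = @"<td class='z_bg_05'>(\\w{11})</td>"
                        @"<td class='z_bg_13'>((?:\\w{2}\\s?)*)</td>";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];

    NSMutableArray<NSDictionary<NSString *, NSString *> *> *rows = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        NSString *sectionId = [html substringWithRange:[match rangeAtIndex:1]];
        NSString *result = [[html substringWithRange:[match rangeAtIndex:2]]
            stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
        [rows addObject:@{@"sectionId": sectionId, @"result": result}];
    }
    return rows;
}
```

Each returned dictionary could then be copied into a TicketResult and saved as in the loop above. The two-pass version has the advantage that each expression stays short and can be checked on its own.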
That completes the crawler. Strictly speaking, there is a step 0 before all of this: check the device's current network status. If there is no connection, there is no point in running the crawler, because it cannot fetch any data. To check the network status I use Apple's official Reachability sample; there are plenty of examples of it online, so I won't go into it here. Many thanks to all the experts who share so much good material, which lets me write this kind of code much faster.