[iOS] Using regular expressions to capture web page data and build a small dictionary app

Source: Internet
Author: User

An application does not have to supply all of its own data; it is worth learning to reuse data that already exists.

The web is huge, and search engines crawl it every day. This article uses regular expressions to capture data from a website and build a small dictionary app.

I. Use of Regular Expressions

1. Decide on the matching scheme, that is, the pattern.

2. Instantiate NSRegularExpression with the pattern.

3. Call a matching method to start matching.

To match once, use the firstMatchInString:options:range: method.

To match multiple times, use the matchesInString:options:range: method.
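
The three steps above can be sketched as a small command-line program. This is only an illustration: the sample text and the pattern are made up, and the regular expression is created with the class factory method rather than alloc/init.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSString *text = @"ios-7 ios-8 ios-9";   // made-up sample input
        NSString *pattern = @"ios-(\\d)";        // step 1: decide on the pattern

        // Step 2: instantiate NSRegularExpression with the pattern.
        NSError *error = nil;
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:pattern
                                                      options:0
                                                        error:&error];
        if (regex == nil) {
            NSLog(@"invalid pattern: %@", error);
            return 1;
        }

        NSRange whole = NSMakeRange(0, text.length);

        // Step 3a: match once.
        NSTextCheckingResult *first =
            [regex firstMatchInString:text options:0 range:whole];
        if (first != nil) {
            NSLog(@"first match: %@", [text substringWithRange:first.range]);
        }

        // Step 3b: match multiple times and read capture group 1 of each match.
        NSArray *all = [regex matchesInString:text options:0 range:whole];
        for (NSTextCheckingResult *match in all) {
            NSLog(@"digit: %@", [text substringWithRange:[match rangeAtIndex:1]]);
        }
    }
    return 0;
}
```

Range index 0 is always the whole match; indexes 1 and up correspond to the parenthesized groups in the pattern.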


Regular expression reference table (a good table I found online; regular expression syntax is largely the same across languages):

http://www.jb51.net/shouce/jquery1.82/regexp.html


The following test code matches "xn4545945":

// Test the regular expression
- (void)findAnswerInHTMLTest {
    NSString *srcStr = @"http://blog.csdn.net/xn4545945";
    NSString *pattern = @"xn[^\\s]*"; // match "xn" followed by any run of non-whitespace characters

    // Instantiate the regular expression with two options:
    // NSRegularExpressionCaseInsensitive: ignore case
    // NSRegularExpressionDotMatchesLineSeparators: let . also match line breaks
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:pattern options:NSRegularExpressionCaseInsensitive | NSRegularExpressionDotMatchesLineSeparators error:nil];

    // Match
    NSTextCheckingResult *checkResult = [regex firstMatchInString:srcStr options:NSMatchingReportCompletion range:NSMakeRange(0, srcStr.length)];

    // Retrieve the matched content (index 0 is the whole match)
    NSString *result = [srcStr substringWithRange:[checkResult rangeAtIndex:0]];
    NSLog(@"data is ----->%@", result);
}


II. Capture web page data and make a small dictionary

The Haici online dictionary (http://dict.cn) is used as the data source.

Appending a word to the URL looks it up directly. For example, to look up "hello", request http://dict.cn/hello
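
Building the lookup URL might be sketched as follows. The helper name XNLookupURL and the constant kExampleBaseURL are hypothetical, and percent-encoding only the word with the URL path character set is one possible choice, not necessarily what the original project does.

```objc
#import <Foundation/Foundation.h>

// Assumed base URL, taken from the example address in the article.
static NSString *const kExampleBaseURL = @"http://dict.cn/";

// Hypothetical helper: percent-encode the word, then append it to the base URL.
static NSURL *XNLookupURL(NSString *word) {
    NSString *escaped =
        [word stringByAddingPercentEncodingWithAllowedCharacters:
                  [NSCharacterSet URLPathAllowedCharacterSet]];
    if (escaped.length == 0) {
        return nil; // nothing to look up, or the word could not be encoded
    }
    return [NSURL URLWithString:[kExampleBaseURL stringByAppendingString:escaped]];
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", XNLookupURL(@"hello"));
        NSLog(@"%@", XNLookupURL(@"你好")); // non-ASCII input is percent-encoded
    }
    return 0;
}
```

For "hello" this yields http://dict.cn/hello, and for "你好" it yields http://dict.cn/%E4%BD%A0%E5%A5%BD.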

View the source code of the web page and locate the part to capture, for example the entry for "hello".



Tips:

* Use (.*?) to retrieve the desired content. Only content enclosed in parentheses can be extracted. This is enough for capturing web pages.
    . matches any character
    * repeats the preceding item zero or more times
    *? matches as few characters as possible
* Replace the content you want to capture with (.*?). Replace line breaks and long runs of whitespace with .*? so that they are ignored (this relies on the NSRegularExpressionDotMatchesLineSeparators option).
* Use the browser's find function to search the page source and check whether the chosen keywords are repeated (this makes it easier to find a unique anchor).
* Choose a slightly larger block (one whose tag carries an id), which is more likely to be unique. (If the selection is too small, small tags may be repeated many times on the page.)
* Then call the regular expression method several times to narrow the scope step by step.
* Quotation marks must be escaped with a backslash. Chinese characters must be percent-escaped (see the code for usage).
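
To make the first two tips concrete, here is a minimal sketch comparing greedy and lazy matching. The HTML fragment is made up for illustration, and XNFirstGroup is a hypothetical helper, not part of the original project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns capture group 1 of the first match, or nil.
static NSString *XNFirstGroup(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text options:0 range:NSMakeRange(0, text.length)];
    if (match == nil || match.numberOfRanges < 2) {
        return nil;
    }
    NSRange group = [match rangeAtIndex:1];
    return group.location == NSNotFound ? nil : [text substringWithRange:group];
}

int main(void) {
    @autoreleasepool {
        // Made-up fragment with a line break and two <strong> elements.
        NSString *html = @"<li><span>n.</span>\n  <strong>robot</strong></li>"
                         @"<li><span>adj.</span><strong>mechanical</strong></li>";

        // Greedy: (.*) runs on to the last </strong> in the text.
        NSLog(@"greedy: %@", XNFirstGroup(@"<strong>(.*)</strong>", html));

        // Lazy: (.*?) stops at the first </strong>.
        NSLog(@"lazy: %@", XNFirstGroup(@"<strong>(.*?)</strong>", html));

        // .*? between the tags skips the line break and the indentation.
        NSLog(@"skipped: %@",
              XNFirstGroup(@"<span>n\\.</span>.*?<strong>(.*?)</strong>", html));
    }
    return 0;
}
```

The greedy pattern captures everything from "robot" through "mechanical", including the tags in between, while both lazy patterns capture only "robot". Without NSRegularExpressionDotMatchesLineSeparators the third pattern would fail, because . would not cross the line break.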
The code is as follows:
@implementation XNSpider

- (void)loadHTMLWithWord:(NSString *)word {
    // 1. Send an HTTP request and get the returned web page (converted to a string)
    NSString *urlString = [NSString stringWithFormat:@"%@%@", kBaseURL, word]; // concatenate the request URL
    urlString = [urlString stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding]; // percent-escape Chinese characters
    NSURL *url = [NSURL URLWithString:urlString]; // build the URL
    NSURLRequest *request = [NSURLRequest requestWithURL:url cachePolicy:NSURLRequestUseProtocolCachePolicy timeoutInterval:5.0f];

    [NSURLConnection sendAsynchronousRequest:request queue:[NSOperationQueue mainQueue] completionHandler:^(NSURLResponse *response, NSData *data, NSError *connectionError) {
        // Convert the received data to a string
        NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        // NSLog(@"%@", html);

        // 2. Filter the string with a regular expression (see findAnswerInHTML: below),
        //    then pass the result to the delegate so that it can update the UI
        NSString *result = [self findAnswerInHTML:html];
        NSLog(@"%@", result);

        if ([self.delegate respondsToSelector:@selector(finishSpider:)]) {
            [self.delegate finishSpider:result]; // the handler runs on the main queue, so the delegate is called on the UI thread
        }
    }];
}

/**
 *  Core method: match the string with a regular expression
 *
 *  @param html the entire web page as a string
 *
 *  @return the matching result
 */
- (NSString *)findAnswerInHTML:(NSString *)html {
    // Replace the content to capture with (.*?); replace line breaks and long runs of whitespace with .*?
    NSString *pattern = @"<ul class=\"dict-basic-ul\">.*?<li><span>(.*?)</span><strong>(.*?)</strong></li>";

    // Instantiate the regular expression with two options:
    // NSRegularExpressionCaseInsensitive: ignore case
    // NSRegularExpressionDotMatchesLineSeparators: let . also match line breaks
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:pattern options:NSRegularExpressionCaseInsensitive | NSRegularExpressionDotMatchesLineSeparators error:nil];

    // Match
    NSTextCheckingResult *checkResult = [regex firstMatchInString:html options:NSMatchingReportCompletion range:NSMakeRange(0, html.length)];

    // Retrieve the matched content. The index corresponds to the parentheses (.*?); index 0 is the entire match.
    NSString *result = [html substringWithRange:[checkResult rangeAtIndex:2]];
    NSLog(@"data is ----->%@", result);
    return result;
}

@end

The matching result is finally passed to the main thread through a delegate to update the UI. For example, after querying "android", the data captured from the web page is "robot".
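
The delegate hand-off could look roughly like the sketch below. The article does not show the header, so the protocol name XNSpiderDelegate and its exact shape are assumptions; only the finishSpider: selector comes from the listing. XNResultPrinter is a hypothetical stand-in for a view controller.

```objc
#import <Foundation/Foundation.h>

// Assumed shape of the delegate protocol; only finishSpider: appears in the article.
@protocol XNSpiderDelegate <NSObject>
@optional
- (void)finishSpider:(NSString *)result;
@end

// Hypothetical receiver standing in for the view controller that updates the UI.
@interface XNResultPrinter : NSObject <XNSpiderDelegate>
@end

@implementation XNResultPrinter
- (void)finishSpider:(NSString *)result {
    NSLog(@"definition: %@", result);
}
@end

int main(void) {
    @autoreleasepool {
        id<XNSpiderDelegate> delegate = [[XNResultPrinter alloc] init];

        // Same guard as in the listing: the method is optional, so check first.
        if ([delegate respondsToSelector:@selector(finishSpider:)]) {
            [delegate finishSpider:@"robot"];
        }
    }
    return 0;
}
```

Declaring the method as @optional is what makes the respondsToSelector: check meaningful; with a required method the check would be redundant.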

Errors are not handled in this example.
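
One way to add basic error handling to the matching step is sketched below. The function name XNFindAnswer, the error domain, and the error codes are hypothetical; the pattern is the one from the listing. In the network callback, connectionError and a nil data value would also need to be checked before this function is called.

```objc
#import <Foundation/Foundation.h>

// Hypothetical error domain and codes, introduced only for this sketch.
static NSString *const XNSpiderErrorDomain = @"XNSpiderErrorDomain";

static NSError *XNMakeError(NSInteger code, NSString *message) {
    return [NSError errorWithDomain:XNSpiderErrorDomain
                               code:code
                           userInfo:@{NSLocalizedDescriptionKey : message}];
}

// Returns capture group 2 of the first match, or nil with *outError set.
static NSString *XNFindAnswer(NSString *html, NSError **outError) {
    if (html.length == 0) {
        if (outError) *outError = XNMakeError(1, @"empty or undecodable response");
        return nil;
    }

    NSString *pattern = @"<ul class=\"dict-basic-ul\">.*?<li><span>(.*?)</span>"
                        @"<strong>(.*?)</strong></li>";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive |
                                                          NSRegularExpressionDotMatchesLineSeparators
                                                    error:outError];
    if (regex == nil) {
        return nil; // outError was filled in by NSRegularExpression
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:html options:0 range:NSMakeRange(0, html.length)];
    if (match == nil || match.numberOfRanges < 3 ||
        [match rangeAtIndex:2].location == NSNotFound) {
        if (outError) *outError = XNMakeError(2, @"definition not found in page");
        return nil;
    }
    return [html substringWithRange:[match rangeAtIndex:2]];
}

int main(void) {
    @autoreleasepool {
        NSError *error = nil;
        NSString *answer = XNFindAnswer(@"<html>no dictionary entry here</html>", &error);
        if (answer == nil) {
            NSLog(@"lookup failed: %@", error.localizedDescription);
        }
    }
    return 0;
}
```

Returning nil together with an NSError lets the caller tell an empty response from a page whose layout no longer matches the pattern.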

Program source code: http://download.csdn.net/detail/xn4545945/7619349

When reprinting, please cite the source: http://blog.csdn.net/xn4545945



How to use regular expressions to capture website content

An HTML web page is a text document, and regular expressions match specific strings in a text document. Instead of matching only a literal such as "text", a regular expression uses a flexible syntax to describe a string pattern and matches according to that pattern.

When extracting content from an HTML document, either the HTML tags or the text content can serve as matching targets and reference points, so you must first understand the structure of the target document. Regular expressions are also not easy to read. Because HTML documents are semi-structured, divided into blocks by HTML tags, there is another way to extract content: XPath or XQuery, whose syntax is much easier to grasp.

The MetaSeeker website capturing software works on this principle: it uses XPath as the main method and string processing functions as a supplement to extract website content. GooSeeker's website has a lot of technical material, and the software can be downloaded and used for free.

Regular expressions capture web page content

Pattern p = Pattern.compile("([\\d-]+)<\\/span><a onclick=\"openWin\\('([^']+)'\\)\" href=\"#\">([^<]+)<");
