There are plenty of libraries for parsing XML and JSON, but how do we parse HTML?
TFHpple is a small wrapper for parsing HTML. It is built on top of libxml, and queries are written in XPath.
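Today I came across an article that parses HTML with libxml directly, see: http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090 The diagram there makes the node structure clear at a glance and is well worth bookmarking. The source code in that article cannot traverse the whole HTML document, so I modified it slightly so that it walks the entire document and prints it: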
// data is an NSData containing the document data
// encoding is the NSStringEncoding of the data
// baseURL is the document's base URL, i.e. its location
CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
// Both calls may return NULL; with a NULL encoding libxml detects it itself
const char *enc = cfencstr ? CFStringGetCStringPtr(cfencstr, 0) : NULL;

// NSData is not NUL-terminated, so pass the length explicitly
htmlDocPtr _htmlDocument = htmlReadMemory([data bytes], (int)[data length],
                                          [[baseURL absoluteString] UTF8String], enc,
                                          HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

xmlNodePtr rootNode = (xmlNodePtr)_htmlDocument;
xmlNodePtr currentNode = rootNode;

while (currentNode)
{
    // output node if it is an element
    if (currentNode->type == XML_ELEMENT_NODE)
    {
        NSMutableArray *attrArray = [NSMutableArray array];
        for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
        {
            // an attribute without a value has no child node
            xmlNodePtr contents = attrNode->children;
            [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name,
                                  contents ? (const char *)contents->content : ""]];
        }

        NSString *attrString = [attrArray componentsJoinedByString:@" "];
        if ([attrString length])
        {
            attrString = [@" " stringByAppendingString:attrString];
        }
        NSLog(@"<%s%@>", currentNode->name, attrString);
    }
    else if (currentNode->type == XML_TEXT_NODE)
    {
        NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content
                                        encoding:NSUTF8StringEncoding]);
    }
    else if (currentNode->type == XML_COMMENT_NODE)
    {
        NSLog(@"/* %s */", currentNode->content);
    }

    if (currentNode->children)
    {
        // descend into the first child
        currentNode = currentNode->children;
        continue;
    }

    // no children: close nodes while climbing until a sibling is found
    while (currentNode)
    {
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
        }
        if (currentNode == rootNode)
        {
            // back at the starting node: the walk is finished
            currentNode = NULL;
            break;
        }
        if (currentNode->next)
        {
            currentNode = currentNode->next;
            break;
        }
        currentNode = currentNode->parent;
    }
}

// free the document only after the traversal is done
if (_htmlDocument)
{
    xmlFreeDoc(_htmlDocument);
}
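The traversal above depends on only three links of each node: children, next and parent. As a minimal sketch of that pattern in isolation, the plain C below uses a hypothetical `Node` struct as a stand-in for `xmlNode` (it is not the libxml type), and the helper names `node_append`, `buf_append` and `node_serialize` are made up for illustration. It writes into a buffer instead of calling NSLog so the result is easy to check.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for xmlNode: only the links the walk needs. */
typedef struct Node {
    const char *name;   /* element name, or NULL for a text node */
    const char *text;   /* text content, or NULL for an element  */
    struct Node *parent;
    struct Node *children;
    struct Node *next;
} Node;

/* Attach child as the last child of parent. */
static void node_append(Node *parent, Node *child) {
    child->parent = parent;
    child->next = NULL;
    if (!parent->children) {
        parent->children = child;
        return;
    }
    Node *last = parent->children;
    while (last->next) last = last->next;
    last->next = child;
}

/* Append s to the NUL-terminated buffer out, truncating if it is full. */
static void buf_append(char *out, size_t cap, const char *s) {
    size_t used = strlen(out);
    if (used + 1 < cap) strncat(out, s, cap - used - 1);
}

/* Walk the subtree under root without recursion, writing opening tags,
   text and closing tags into out. */
static void node_serialize(const Node *root, char *out, size_t cap) {
    assert(out && cap > 0);
    out[0] = '\0';
    const Node *cur = root;
    while (cur) {
        if (cur->name) {
            buf_append(out, cap, "<");
            buf_append(out, cap, cur->name);
            buf_append(out, cap, ">");
        } else if (cur->text) {
            buf_append(out, cap, cur->text);
        }

        if (cur->children) {          /* descend first */
            cur = cur->children;
            continue;
        }

        /* Leaf reached: close nodes while climbing until a sibling exists. */
        while (cur) {
            if (cur->name) {
                buf_append(out, cap, "</");
                buf_append(out, cap, cur->name);
                buf_append(out, cap, ">");
            }
            if (cur == root) {        /* back at the start: done */
                cur = NULL;
                break;
            }
            if (cur->next) {          /* move on to the next sibling */
                cur = cur->next;
                break;
            }
            cur = cur->parent;        /* otherwise keep climbing */
        }
    }
}
```

For a tree html > body > (p > "hi", br), `node_serialize` on the html node produces `<html><body><p>hi</p><br></br></body></html>`. Stopping when the walk returns to `root` is what keeps it from running past the subtree it was started on.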
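Still, I prefer TFHpple because it is simple and easy to use, although its feature set is incomplete. For example, it cannot return a node's child nodes, so I wrote two methods: one that returns the child nodes and one that returns all the contents. In addition, the key for an attribute's content is the same as the key for the node's content, both @"nodeContent", when the attribute's key should really be @"attributeContent". So I wrote the following function, which also changes the content key used for node attributes.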
NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

    if (currentNode->name) {
        NSString *currentNodeName = [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
        [resultForNode setObject:currentNodeName forKey:@"nodeName"];
    }

    if (currentNode->content) {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];
        if (currentNode->type == XML_TEXT_NODE) {
            // text under an element is stored on the parent as nodeContent
            if (currentNode->parent->type == XML_ELEMENT_NODE) {
                [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
                return nil;
            }
            // text under an attribute is stored on the attribute as attributeContent
            if (currentNode->parent->type == XML_ATTRIBUTE_NODE) {
                [parentResult setObject:[currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                                 forKey:@"attributeContent"];
                return nil;
            }
        }
    }

    xmlAttr *attribute = currentNode->properties;
    if (attribute) {
        NSMutableArray *attributeArray = [NSMutableArray array];
        while (attribute) {
            NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
            NSString *attributeName = [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
            if (attributeName) {
                [attributeDictionary setObject:attributeName forKey:@"attributeName"];
            }
            if (attribute->children) {
                NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
                if (childDictionary) {
                    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
                }
            }
            if ([attributeDictionary count] > 0) {
                [attributeArray addObject:attributeDictionary];
            }
            attribute = attribute->next;
        }
        if ([attributeArray count] > 0) {
            [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
        }
    }

    // recurse into child nodes; text children return nil and are skipped here
    xmlNodePtr childNode = currentNode->children;
    if (childNode) {
        NSMutableArray *childContentArray = [NSMutableArray array];
        while (childNode) {
            NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
            if (childDictionary) {
                [childContentArray addObject:childDictionary];
            }
            childNode = childNode->next;
        }
        if ([childContentArray count] > 0) {
            [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
        }
    }

    return resultForNode;
}
In TFHppleElement.m I added two key constants:
NSString * const TFHppleNodeAttributeContentKey = @"attributeContent";
NSString * const TFHppleNodeChildArrayKey = @"nodeChildArray";
and changed the method that returns the attributes to:
- (NSDictionary *) attributes
{
    NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
    for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
        [translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
                                 forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
    }
    return translatedAttributes;
}
and added methods for getting the child nodes:
- (BOOL) hasChildren
{
    NSArray *childs = [node objectForKey:TFHppleNodeChildArrayKey];
    if (childs) {
        return YES;
    }
    return NO;
}

- (NSArray *) children
{
    if ([self hasChildren])
        return [node objectForKey:TFHppleNodeChildArrayKey];
    return nil;
}
Finally, I also added a method that returns all the content:
- (NSString *)contentsAt:(NSString *)xPathOrCss;
See the source code for details.
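The implementation of `contentsAt:` is not shown here. One plausible reading of "all the content" is the concatenation of every text node below the matched element, in document order; that reading is an assumption, not the author's confirmed behaviour. A minimal C sketch of that idea, again using a hypothetical `Node` struct rather than the libxml type, with a made-up function name `collect_text`:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for xmlNode, reduced to what this sketch needs. */
typedef struct Node {
    const char *name;   /* element name, or NULL for a text node */
    const char *text;   /* text content, or NULL for an element  */
    struct Node *children;
    struct Node *next;
} Node;

/* Append the text of every text node below node to out, in document order.
   out must already hold a NUL-terminated string (possibly empty);
   text that does not fit in cap bytes is truncated. */
static void collect_text(const Node *node, char *out, size_t cap) {
    assert(out && cap > 0);
    for (const Node *child = node->children; child; child = child->next) {
        if (child->text) {
            size_t used = strlen(out);
            if (used + 1 < cap) strncat(out, child->text, cap - used - 1);
        }
        collect_text(child, out, cap);   /* recurse into nested elements */
    }
}
```

For a paragraph node whose children are the text "Hello ", a b element containing "world", and the text "!", calling `collect_text` with an empty buffer yields `Hello world!`. This differs from the single @"nodeContent" value, which holds only one text child of the element.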
See also: http://giles-wang.blogspot.com/2011/08/iphoneansi.html