There are plenty of libraries for parsing XML and JSON. But how do we parse HTML?
TFHpple is a small wrapper for parsing HTML. It wraps libxml, and queries are written in XPath.
Today I came across an article that uses libxml directly to parse HTML, see: http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090 The diagram there makes the structure clear at a glance, so it is worth bookmarking. The source code in that article does not traverse the whole HTML document, so I made some modifications to print the full traversal.
#import <libxml/HTMLparser.h>

// data is an NSData that contains the document data
// encoding is the NSStringEncoding of the data
// baseURL is the document's base URL, i.e. its location
CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
const char *enc = CFStringGetCStringPtr(cfencstr, 0);

// htmlReadMemory takes an explicit length, because NSData bytes are not NUL-terminated
htmlDocPtr _htmlDocument = htmlReadMemory([data bytes], (int)[data length],
                                          [[baseURL absoluteString] UTF8String], enc,
                                          HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

// The walk starts at the document node and stops when it climbs back to it.
// (My original compared against nodes->nodeTab[0], the first node of an XPath result not shown here.)
xmlNodePtr rootNode = (xmlNodePtr)_htmlDocument;
xmlNodePtr currentNode = rootNode;

while (currentNode)
{
    if (currentNode->type == XML_ELEMENT_NODE)
    {
        // output the opening tag together with its attributes
        NSMutableArray *attrArray = [NSMutableArray array];
        for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
        {
            // an attribute without a value has no child node
            xmlNodePtr contents = attrNode->children;
            const char *value = (contents && contents->content) ? (const char *)contents->content : "";
            [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name, value]];
        }
        NSString *attrString = [attrArray componentsJoinedByString:@" "];
        if ([attrString length])
        {
            attrString = [@" " stringByAppendingString:attrString];
        }
        NSLog(@"<%s%@>", currentNode->name, attrString);
    }
    else if (currentNode->type == XML_TEXT_NODE && currentNode->content)
    {
        NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding]);
    }
    else if (currentNode->type == XML_COMMENT_NODE)
    {
        // the comment text is in content; name is always "comment"
        NSLog(@"/* %s */", currentNode->content);
    }

    if (currentNode->children)
    {
        currentNode = currentNode->children;
        continue;
    }

    // no children: close nodes while climbing until a sibling is found
    while (currentNode)
    {
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
            if (strcmp((const char *)currentNode->name, "table") == 0)
            {
                NSLog(@"over");
            }
        }
        if (currentNode == rootNode)
        {
            currentNode = NULL; // back at the start node, the traversal is complete
            break;
        }
        if (currentNode->next)
        {
            currentNode = currentNode->next;
            break;
        }
        currentNode = currentNode->parent;
    }
}

// free the document only after the traversal has finished
if (_htmlDocument)
{
    xmlFreeDoc(_htmlDocument);
}
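To make the order of the walk easier to see in isolation, here is a minimal sketch in plain C. It does not use libxml: Node is a hypothetical stand-in for xmlNode that keeps only the name, text, parent, children and next fields, and dump_tree and append are names I made up for this illustration. It writes into a buffer instead of calling NSLog, so the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for libxml2's xmlNode: only the links the walk needs. */
typedef struct Node {
    const char *name;       /* element name, or NULL for a text node */
    const char *text;       /* text content, or NULL for an element  */
    struct Node *parent;
    struct Node *children;  /* first child  */
    struct Node *next;      /* next sibling */
} Node;

/* Append s to out, never writing more than cap bytes including the NUL. */
static void append(char *out, size_t cap, const char *s) {
    size_t used = strlen(out);
    if (used + 1 < cap) {
        strncat(out, s, cap - used - 1);
    }
}

/* Write the subtree under root into out as <open>text</close> markup.
   The walk is iterative and never leaves the subtree. */
void dump_tree(const Node *root, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    const Node *cur = root;
    while (cur) {
        /* Visit the node on the way down. */
        if (cur->name) {
            append(out, cap, "<");
            append(out, cap, cur->name);
            append(out, cap, ">");
        } else if (cur->text) {
            append(out, cap, cur->text);
        }
        if (cur->children) {
            cur = cur->children;
            continue;
        }
        /* No children: close nodes while climbing until a sibling is found. */
        for (;;) {
            if (cur->name) {
                append(out, cap, "</");
                append(out, cap, cur->name);
                append(out, cap, ">");
            }
            if (cur == root) {
                return;             /* back at the start node: done */
            }
            if (cur->next) {
                cur = cur->next;    /* visit the sibling next */
                break;
            }
            cur = cur->parent;      /* keep climbing */
        }
    }
}
```

For a tree html > body that contains a p with the text "hi" followed by an empty br, dump_tree produces <html><body><p>hi</p><br></br></body></html>. Every element is closed exactly once, including leaf elements that have a following sibling.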
Still, I prefer TFHpple because it is simple and easy to use, although it is not complete. For example, it cannot return a node's children, so I wrote two methods: one to get the child nodes and one to get all of the content. In addition, the key that stores an attribute's content is the same as the key that stores a node's content, @"nodeContent". The attribute should use @"attributeContent" instead.
So I wrote the following function, which also changes the content key used for attributes.
NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

    if (currentNode->name)
    {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
        [resultForNode setObject:currentNodeContent forKey:@"nodeName"];
    }

    if (currentNode->content)
    {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];
        if (currentNode->type == XML_TEXT_NODE)
        {
            // text directly inside an element becomes the parent's node content
            if (currentNode->parent->type == XML_ELEMENT_NODE)
            {
                [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
                return nil;
            }
            // text inside an attribute becomes the attribute's content, under its own key
            if (currentNode->parent->type == XML_ATTRIBUTE_NODE)
            {
                [parentResult setObject:[currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                                 forKey:@"attributeContent"];
                return nil;
            }
        }
    }

    // collect the attributes
    xmlAttr *attribute = currentNode->properties;
    if (attribute)
    {
        NSMutableArray *attributeArray = [NSMutableArray array];
        while (attribute)
        {
            NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
            NSString *attributeName = [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
            if (attributeName)
            {
                [attributeDictionary setObject:attributeName forKey:@"attributeName"];
            }
            if (attribute->children)
            {
                NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
                if (childDictionary)
                {
                    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
                }
            }
            if ([attributeDictionary count] > 0)
            {
                [attributeArray addObject:attributeDictionary];
            }
            attribute = attribute->next;
        }
        if ([attributeArray count] > 0)
        {
            [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
        }
    }

    // collect the child nodes recursively
    xmlNodePtr childNode = currentNode->children;
    if (childNode)
    {
        NSMutableArray *childContentArray = [NSMutableArray array];
        while (childNode)
        {
            NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
            if (childDictionary)
            {
                [childContentArray addObject:childDictionary];
            }
            childNode = childNode->next;
        }
        if ([childContentArray count] > 0)
        {
            [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
        }
    }

    return resultForNode;
}
In TFHppleElement.m, add two key constants:
NSString * const TFHppleNodeAttributeContentKey = @"attributeContent";
NSString * const TFHppleNodeChildArrayKey = @"nodeChildArray";
Then modify the method that returns the attributes:
- (NSDictionary *) attributes
{
    NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
    for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey])
    {
        [translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
                                 forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
    }
    return translatedAttributes;
}
And add the methods for getting the child nodes:
- (BOOL) hasChildren
{
    NSArray *childs = [node objectForKey: TFHppleNodeChildArrayKey];
    if (childs)
    {
        return YES;
    }
    return NO;
}

- (NSArray *) children
{
    if ([self hasChildren])
        return [node objectForKey: TFHppleNodeChildArrayKey];
    return nil;
}
Finally, I added a main method for getting all of the content at a given path:
- (NSString *)contentsAt:(NSString *)xPathOrCss;
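The implementation of contentsAt: is not reproduced in this post. As a rough illustration of what "getting all content" can mean, here is a minimal sketch in plain C that concatenates the text of every text node below a start node, in document order. This is my assumption about the idea, not the actual method: Node is a hypothetical stand-in for xmlNode, all_text, collect_into and append are made-up names, and the XPath lookup that would find the start node is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for libxml2's xmlNode: only the fields used here. */
typedef struct Node {
    const char *name;       /* element name, or NULL for a text node */
    const char *text;       /* text content, or NULL for an element  */
    struct Node *children;  /* first child  */
    struct Node *next;      /* next sibling */
} Node;

/* Append s to out, never writing more than cap bytes including the NUL. */
static void append(char *out, size_t cap, const char *s) {
    size_t used = strlen(out);
    if (used + 1 < cap) {
        strncat(out, s, cap - used - 1);
    }
}

/* Depth-first: add this node's text, then recurse into each child in order. */
static void collect_into(const Node *node, char *out, size_t cap) {
    if (!node) {
        return;
    }
    if (node->text) {
        append(out, cap, node->text);
    }
    for (const Node *child = node->children; child; child = child->next) {
        collect_into(child, out, cap);
    }
}

/* Fill out with all text found under root; siblings of root are ignored. */
void all_text(const Node *root, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    collect_into(root, out, cap);
}
```

For a div that contains a p with the text "Hello, " followed by a b with the text "world", all_text on the div gives "Hello, world", while all_text on the p alone gives "Hello, ".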
See the source code.
See: http://giles-wang.blogspot.com/2011/08/iphoneansi.html