Parsing HTML on iOS

Source: Internet
Author: User

XML and JSON have a large number of libraries for parsing. How can we parse HTML?

TFHpple is a small wrapper around libxml that can be used to parse HTML. Queries are written in XPath.

Today I came across an article that uses libxml directly to parse HTML: http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090. The diagram in it makes the node structure clear at a glance and is worth saving. The source code in that article does not traverse the whole HTML document, so I made some modifications to print the full traversal.

// NSData data contains the document data
// encoding is the NSStringEncoding of the data
// baseURL the documents base URL, i.e. location
// NOTE: `nodes` is not defined in this excerpt. It is presumably an
// xmlNodeSetPtr from an earlier XPath query, whose first node marks
// where the traversal should stop.

CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
const char *enc = CFStringGetCStringPtr(cfencstr, 0);

htmlDocPtr _htmlDocument = htmlReadDoc([data bytes],
                                       [[baseURL absoluteString] UTF8String],
                                       enc,
                                       XML_PARSE_NOERROR | XML_PARSE_NOWARNING);

xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;
while (currentNode)
{
    // output node if it is an element
    if (currentNode->type == XML_ELEMENT_NODE)
    {
        NSMutableArray *attrArray = [NSMutableArray array];
        for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
        {
            // an attribute without a value has no child text node
            xmlNodePtr contents = attrNode->children;
            [attrArray addObject:[NSString stringWithFormat:@"%s='%s'",
                                  attrNode->name,
                                  contents ? (const char *)contents->content : ""]];
        }
        NSString *attrString = [attrArray componentsJoinedByString:@" "];
        if ([attrString length])
        {
            attrString = [@" " stringByAppendingString:attrString];
        }
        NSLog(@"<%s%@>", currentNode->name, attrString);
    }
    else if (currentNode->type == XML_TEXT_NODE)
    {
        //NSLog(@"%s", currentNode->content);
        NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content
                                        encoding:NSUTF8StringEncoding]);
    }
    else if (currentNode->type == XML_COMMENT_NODE)
    {
        NSLog(@"/* %s */", currentNode->name);
    }

    if (currentNode && currentNode->children)
    {
        currentNode = currentNode->children;
    }
    else if (currentNode && currentNode->next)
    {
        currentNode = currentNode->next;
    }
    else
    {
        currentNode = currentNode->parent;
        // close node
        if (currentNode && currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
        }
        // the parent may be NULL at the top of the tree, so check before dereferencing
        if (currentNode && currentNode->next)
        {
            currentNode = currentNode->next;
        }
        else
        {
            // climb until an ancestor with a next sibling is found
            while (currentNode)
            {
                currentNode = currentNode->parent;
                if (currentNode && currentNode->type == XML_ELEMENT_NODE)
                {
                    NSLog(@"</%s>", currentNode->name);
                    if (strcmp((const char *)currentNode->name, "table") == 0)
                    {
                        NSLog(@"over");
                    }
                }
                if (currentNode == nodes->nodeTab[0])
                {
                    break;
                }
                if (currentNode && currentNode->next)
                {
                    currentNode = currentNode->next;
                    break;
                }
            }
        }
    }
    if (currentNode == nodes->nodeTab[0])
    {
        break;
    }
}

// free the document only after the traversal has finished
if (_htmlDocument)
{
    xmlFreeDoc(_htmlDocument);
}

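The traversal order that the listing above is aiming for (open a node, descend into its children, close it, move to the next sibling, climb when there is none) can be shown more compactly in plain C. The sketch below does not use libxml2: `Node`, `emit`, and `dump_tree` are hypothetical names invented for illustration, and `Node` only mimics the `children`, `next`, and `parent` links of `xmlNode`. It also writes into a buffer instead of calling `NSLog`, so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for libxml2's xmlNode: only the fields the walk needs. */
typedef enum { NODE_ELEMENT, NODE_TEXT } NodeType;

typedef struct Node {
    NodeType type;
    const char *name;      /* tag name, used for elements */
    const char *content;   /* text, used for text nodes */
    struct Node *children; /* first child */
    struct Node *next;     /* next sibling */
    struct Node *parent;
} Node;

/* Append open + s + close to out without writing past cap bytes. */
static void emit(char *out, size_t cap, const char *open, const char *s, const char *close)
{
    size_t used = strlen(out);
    if (used < cap)
        snprintf(out + used, cap - used, "%s%s%s", open, s, close);
}

/* Walk the subtree under root without recursion, writing tags and text to out. */
void dump_tree(Node *root, char *out, size_t cap)
{
    assert(root && out && cap > 0);
    out[0] = '\0';

    Node *cur = root;
    while (cur) {
        if (cur->type == NODE_ELEMENT)
            emit(out, cap, "<", cur->name, ">");
        else
            emit(out, cap, "", cur->content, "");

        if (cur->children) {
            cur = cur->children;
            continue;
        }

        /* No children: close this node, then climb until a sibling exists. */
        for (;;) {
            if (cur->type == NODE_ELEMENT)
                emit(out, cap, "</", cur->name, ">");
            if (cur == root)
                return;               /* never walk outside the subtree */
            if (cur->next) {
                cur = cur->next;
                break;
            }
            cur = cur->parent;
            if (!cur)
                return;               /* malformed tree: no way back to root */
        }
    }
}
```

Closing a node in the same loop that climbs to the parent means every element, including a leaf with no children, gets exactly one closing tag, and comparing against `root` plays the role that the `nodes->nodeTab[0]` check plays in the listing above.
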
However, I still prefer TFHpple because it is simple and easy to use, although it is incomplete. For example, it cannot return a node's child nodes, so I wrote two methods: one to get the children and one to get all of the content. In addition, the key under which an attribute's content is stored is the same as the key for a node's content, @"nodeContent". An attribute's content should instead be stored under @"attributeContent".

So I wrote the following function, which also changes the key used for attribute content.

NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

    if (currentNode->name)
    {
        NSString *currentNodeContent =
            [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
        [resultForNode setObject:currentNodeContent forKey:@"nodeName"];
    }

    if (currentNode->content)
    {
        NSString *currentNodeContent =
            [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];
        if (currentNode->type == XML_TEXT_NODE)
        {
            // text under an element is stored on the parent as node content
            if (currentNode->parent->type == XML_ELEMENT_NODE)
            {
                [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
                return nil;
            }
            // text under an attribute is stored on the parent as attribute content
            if (currentNode->parent->type == XML_ATTRIBUTE_NODE)
            {
                [parentResult setObject:[currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                                 forKey:@"attributeContent"];
                return nil;
            }
        }
    }

    xmlAttr *attribute = currentNode->properties;
    if (attribute)
    {
        NSMutableArray *attributeArray = [NSMutableArray array];
        while (attribute)
        {
            NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
            NSString *attributeName =
                [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
            if (attributeName)
            {
                [attributeDictionary setObject:attributeName forKey:@"attributeName"];
            }
            if (attribute->children)
            {
                NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
                if (childDictionary)
                {
                    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
                }
            }
            if ([attributeDictionary count] > 0)
            {
                [attributeArray addObject:attributeDictionary];
            }
            attribute = attribute->next;
        }
        if ([attributeArray count] > 0)
        {
            [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
        }
    }

    xmlNodePtr childNode = currentNode->children;
    if (childNode)
    {
        NSMutableArray *childContentArray = [NSMutableArray array];
        while (childNode)
        {
            NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
            if (childDictionary)
            {
                [childContentArray addObject:childDictionary];
            }
            childNode = childNode->next;
        }
        if ([childContentArray count] > 0)
        {
            [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
        }
    }

    return resultForNode;
}

In TFHppleElement.m, add two key constants:

NSString * const TFHppleNodeAttributeContentKey  = @"attributeContent";
NSString * const TFHppleNodeChildArrayKey        = @"nodeChildArray";

Then modify the method that returns the attributes:

- (NSDictionary *) attributes
{
  NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
  for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
    [translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
                             forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
  }
  return translatedAttributes;
}

Next, add the methods for accessing the child nodes:

- (BOOL) hasChildren
{
    NSArray *childs = [node objectForKey: TFHppleNodeChildArrayKey];
    if (childs) {
        return YES;
    }
    return NO;
}

- (NSArray *) children
{
    if ([self hasChildren])
        return [node objectForKey: TFHppleNodeChildArrayKey];
    return nil;
}

Finally, I added a main method for getting all of the content at a given path:

- (NSString *)contentsAt:(NSString *)xPathOrCss;

See the source code.
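
The implementation of contentsAt: is not reproduced here. As a rough illustration of what "getting all content" could mean, the plain C sketch below concatenates the text of a node and all of its descendants in document order. This is an assumption about the intent, not the author's code: `ContentNode` and `collect_text` are hypothetical names, and the struct only mimics the `content`, `children`, and `next` fields of libxml2's `xmlNode`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical minimal node: text nodes carry content, elements carry children. */
typedef struct ContentNode {
    const char *content;           /* non-NULL only for text nodes */
    struct ContentNode *children;  /* first child */
    struct ContentNode *next;      /* next sibling */
} ContentNode;

/*
 * Append the text of node and of all its descendants to out, in document order.
 * out must already hold a NUL-terminated string (usually "") that fits in cap bytes.
 */
void collect_text(const ContentNode *node, char *out, size_t cap)
{
    assert(node && out && cap > 0);

    if (node->content) {
        size_t used = strlen(out);
        size_t room = cap - used - 1;   /* keep one byte for the terminator */
        strncat(out, node->content, room);
    }

    /* Recurse into each child; siblings are visited left to right. */
    for (const ContentNode *c = node->children; c; c = c->next)
        collect_text(c, out, cap);
}
```

With a tree shaped like `<p>Hello, <b>world</b></p>`, calling `collect_text` on the `p` node with an empty buffer leaves `Hello, world` in it. Text that does not fit is silently truncated.
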

See: http://giles-wang.blogspot.com/2011/08/iphoneansi.html
