Combining XmlReader and XElement to read large XML documents
Introduction
The .NET Framework offers many class libraries and APIs for working with XML data. Since .NET Framework 3.5, however, LINQ to XML has generally been the preferred choice.
Both XElement.Load and XElement.Parse load the entire XML document into memory, which is unsuitable when the XML file is very large.
The best approach for a large XML file is to read only part of it at a time and work through the whole file incrementally, which is exactly what the XmlReader class does.
XmlReader is very efficient, but it is not as convenient to work with as LINQ to XML. We would like the advantages of both: the efficiency of XmlReader together with the convenience of LINQ to XML.
Ideas
The XNode class has a static method, ReadFrom, that accepts an XmlReader parameter. Because XElement derives from XNode, it can also be called as XElement.ReadFrom: XNode.ReadFrom Method (XmlReader)
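As a minimal sketch of how ReadFrom behaves, the following positions an XmlReader on an element and turns that element into an XElement. The class name ReadFromDemo, the method ReadFirstItem, and the sample XML are made up for illustration and are not part of the MSDN material.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class ReadFromDemo
{
    // Returns the first <item> element of the given XML text, or null if there is none.
    // (Hypothetical helper; the element name "item" is an assumption of this sample.)
    public static XElement ReadFirstItem(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Advance node by node until the reader sits on an <item> start tag.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    // ReadFrom consumes the whole element, including its children.
                    return XNode.ReadFrom(reader) as XElement;
                }
            }
        }
        return null;
    }

    static void Main()
    {
        string xml = "<root><item id=\"1\"><name>a</name></item><item id=\"2\"/></root>";
        XElement item = ReadFirstItem(xml);
        Console.WriteLine((string)item.Attribute("id"));   // prints 1
        Console.WriteLine(item.Element("name").Value);     // prints a
    }
}
```

Only the one element that was read is materialized as an XElement; the rest of the document is never held in memory.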
MSDN already provides a method that combines the two, in a topic with a fitting name: How to Perform Streaming Transform of Large XML Documents
static IEnumerable<XElement> StreamXElements(string uri, string matchname)
{
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreComments = true;
    settings.IgnoreWhitespace = true;
    using (XmlReader reader = XmlReader.Create(uri, settings))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == matchname)
                    {
                        XElement el = XElement.ReadFrom(reader) as XElement;
                        if (el != null)
                        {
                            yield return el;
                        }
                    }
                    break;
            }
        }
    }
}
The code above calls XmlReader.Read in a loop. Whenever it reaches a node of type XmlNodeType.Element with the matching name, it calls XElement.ReadFrom(reader) to construct an XElement. The most important part is the final yield return, which hands elements to the caller one at a time instead of collecting them all in memory.
So far, so good.
However, testing revealed a serious bug in this method: every time it reads an XElement, it skips the next one.
For example, after node 470002048 is read, node 470002049 is skipped.
This is the "read too far" problem that is easy to run into with XmlReader: the reader is advanced one more time than intended. The usual reading pattern can be understood as follows:
initial read;
while ("we're not at the end")
{
    do stuff;
    read;
}
Returning to the code above: once XElement.ReadFrom(reader) has constructed an XElement, the reader has already been advanced internally to the node that follows the element. The while condition then calls reader.Read() again, which steps past that node, so when it is the next matching element it is never returned.
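The skipping can be reproduced on a small in-memory document. The sketch below keeps the same loop shape as the first StreamXElements but reads from a string; the class name SkipDemo, the method ReadIds, and the sample XML are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class SkipDemo
{
    // Collects the id attribute of every <item> the buggy loop manages to see.
    public static List<string> ReadIds(string xml)
    {
        var ids = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                    {
                        ids.Add((string)el.Attribute("id"));
                    }
                    // The reader now sits on the node after this <item>.
                    // The reader.Read() in the loop condition steps past it.
                }
            }
        }
        return ids;
    }

    static void Main()
    {
        // No whitespace between the items, so each <item> is directly followed by the next one.
        string xml = "<root><item id=\"1\"/><item id=\"2\"/><item id=\"3\"/><item id=\"4\"/></root>";
        Console.WriteLine(string.Join(",", ReadIds(xml)));   // prints 1,3
    }
}
```

Items 2 and 4 are lost because each of them is the node the reader was already positioned on when the loop condition called Read() again.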
Once the cause is known, the fix is simple: use reader.EOF as the loop condition and call reader.Read() only when ReadFrom has not already advanced the reader. The code is as follows:
static IEnumerable<XElement> StreamXElements(string uri, string matchname)
{
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreComments = true;
    settings.IgnoreWhitespace = true;
    using (XmlReader reader = XmlReader.Create(uri, settings))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == matchname)
            {
                XElement el = XElement.ReadFrom(reader) as XElement;
                if (el != null)
                {
                    yield return el;
                }
            }
            else
            {
                reader.Read();
            }
        }
    }
}
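Because the method returns IEnumerable<XElement>, its result can be queried with ordinary LINQ operators while elements are still read one at a time. The sketch below is one possible usage, not the MSDN sample: it uses a variant of the corrected method that takes a TextReader so it can run on an in-memory string, and the class name StreamingQueryDemo, the method ExpensiveIds, and the price attribute are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class StreamingQueryDemo
{
    // Variant of the corrected StreamXElements that reads from a TextReader
    // (an assumption of this sample, so that it works on a string).
    public static IEnumerable<XElement> StreamXElements(TextReader input, string matchname)
    {
        var settings = new XmlReaderSettings { IgnoreComments = true, IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == matchname)
                {
                    // ReadFrom already moves the reader past the element, so no Read() here.
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                    {
                        yield return el;
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Hypothetical query: ids of the first two items whose price is above the limit.
    public static List<string> ExpensiveIds(string xml, decimal limit)
    {
        return StreamXElements(new StringReader(xml), "item")
            .Where(e => (decimal)e.Attribute("price") > limit)
            .Select(e => (string)e.Attribute("id"))
            .Take(2)   // enumeration stops here, so later items are never read
            .ToList();
    }

    static void Main()
    {
        string xml = "<root><item id=\"1\" price=\"5\"/><item id=\"2\" price=\"20\"/>"
                   + "<item id=\"3\" price=\"30\"/><item id=\"4\" price=\"40\"/></root>";
        Console.WriteLine(string.Join(",", ExpensiveIds(xml, 10m)));   // prints 2,3
    }
}
```

Items 2 and 3 are adjacent and both appear in the result, which shows that the corrected loop no longer skips elements.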
Summary
The combination of XmlReader and XElement is already covered in the relevant MSDN articles, but there is still a lot to learn from working through it yourself. See the following articles:
http://stackoverflow.com/questions/2299632/why-does-xmlreader-skip-every-other-element-if-there-is-no-whitespace-separator
https://msdn.microsoft.com/en-us/library/mt693229.aspx
http://stackoverflow.com/questions/2441673/reading-xml-with-xmlreader-in-c-sharp
https://blogs.msdn.microsoft.com/xmlteam/2007/03/24/streaming-with-linq-to-xml-part-2/