In some projects, developers need to extract data from Word documents and export it to a database. The biggest challenge is supporting existing Word documents.
There are thousands of Word documents, each containing multiple data blocks in the same format. The document format was not designed to be read by another system: there are no bookmarks or merge fields, and nothing that distinguishes the actual data from the standard instructions. Fortunately, all input fields are inside tables, but the tables themselves differ in layout. Some have a single row or cell, and others have a varying number of rows and cells.
We can use Aspose.Words to create and manipulate Word documents.
First, create a simple table model in C# so that we can use it later when reading the document.
As shown below, the WordDocumentTable class has three properties: TableID, RowID, and ColumnID. As mentioned earlier, the documents carry no identifiers for tables or rows, so these properties simply denote a position within the Word document. Indexes are assumed to start at 0.
    public class WordDocumentTable
    {
        public WordDocumentTable(int PiTableID)
        {
            MiTableID = PiTableID;
        }

        public WordDocumentTable(int PiTableID, int PiColumnID)
        {
            MiTableID = PiTableID;
            MiColumnID = PiColumnID;
        }

        public WordDocumentTable(int PiTableID, int PiColumnID, int PiRowID)
        {
            MiTableID = PiTableID;
            MiColumnID = PiColumnID;
            MiRowID = PiRowID;
        }

        private int MiTableID = 0;
        public int TableID
        {
            get { return MiTableID; }
            set { MiTableID = value; }
        }

        private int MiRowID = 0;
        public int RowID
        {
            get { return MiRowID; }
            set { MiRowID = value; }
        }

        private int MiColumnID = 0;
        public int ColumnID
        {
            get { return MiColumnID; }
            set { MiColumnID = value; }
        }
    }
Now we come to the extraction stage. The property below defines the set of tables and cells that I want to read from the document.
    private List<WordDocumentTable> WordDocumentTables
    {
        get
        {
            List<WordDocumentTable> wordDocTable = new List<WordDocumentTable>();

            // Reads the data from the first table of the document.
            wordDocTable.Add(new WordDocumentTable(0));

            // Reads the data from the second table and its second column.
            // This table has only one row.
            wordDocTable.Add(new WordDocumentTable(1, 1));

            // Reads the data from the third table, second row and second cell.
            wordDocTable.Add(new WordDocumentTable(2, 1, 1));

            return wordDocTable;
        }
    }
The following method uses Aspose.Words to extract data from the document based on the table, row, and cell positions.
    // Requires: using System; using System.IO; using Aspose.Words; using Aspose.Words.Tables;
    public void ExtractTableData(byte[] PobjData)
    {
        using (MemoryStream LobjStream = new MemoryStream(PobjData))
        {
            Document LobjAsposeDocument = new Document(LobjStream);

            foreach (WordDocumentTable wordDocTable in WordDocumentTables)
            {
                // Locate the table by its zero-based position in the document.
                Aspose.Words.Tables.Table table = (Aspose.Words.Tables.Table)
                    LobjAsposeDocument.GetChild(NodeType.Table, wordDocTable.TableID, true);

                // Default: the text of the whole table.
                string cellData = table.Range.Text;

                // A ColumnID of 0 is treated as "no column specified".
                if (wordDocTable.ColumnID > 0)
                {
                    if (wordDocTable.RowID == 0)
                    {
                        // No row given: index into all cells of the table.
                        NodeCollection LobjCells = table.GetChildNodes(NodeType.Cell, true);
                        cellData = LobjCells[wordDocTable.ColumnID].ToTxt();
                    }
                    else
                    {
                        // Row and column given: pick the cell within that row.
                        NodeCollection LobjRows = table.GetChildNodes(NodeType.Row, true);
                        cellData = ((Row)(LobjRows[wordDocTable.RowID])).Cells[wordDocTable.ColumnID].ToTxt();
                    }
                }

                Console.WriteLine(String.Format("Data in Table {0}, Row {1}, Column {2} : {3}",
                    wordDocTable.TableID, wordDocTable.RowID, wordDocTable.ColumnID, cellData));
            }
        }
    }
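One limitation of the method above is that a ColumnID of 0 doubles as "no column specified", so the first column of a table cannot be read on its own. Below is a minimal sketch of one possible way around this, assuming nullable indexes. To keep it runnable without Aspose.Words, a plain jagged array (tables, then rows, then cell texts) stands in for the document. The names CellSelector, SelectorDemo, and Resolve, as well as the sample data, are hypothetical and are not part of the original code or of Aspose.Words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical selector: null means "not specified", so index 0 stays addressable.
public class CellSelector
{
    public int TableID { get; }
    public int? RowID { get; }
    public int? ColumnID { get; }

    public CellSelector(int tableID, int? rowID = null, int? columnID = null)
    {
        TableID = tableID;
        RowID = rowID;
        ColumnID = columnID;
    }
}

public static class SelectorDemo
{
    // "tables" is an in-memory stand-in for a document: tables -> rows -> cell texts.
    public static string Resolve(string[][][] tables, CellSelector s)
    {
        string[][] table = tables[s.TableID];

        if (s.ColumnID == null)
        {
            // No column given: return the whole row if a row was given,
            // otherwise the whole table, with cells joined by spaces.
            IEnumerable<string[]> rows =
                s.RowID == null ? table : new[] { table[s.RowID.Value] };
            return string.Join(" ", rows.SelectMany(r => r));
        }

        // Column given: use the requested row, or the first row if none was given.
        string[] row = table[s.RowID ?? 0];
        return row[s.ColumnID.Value];
    }

    public static void Main()
    {
        // Made-up sample data for illustration only.
        var tables = new[]
        {
            new[] { new[] { "Name", "Ada" } },
            new[] { new[] { "City", "Paris" } },
            new[] { new[] { "Qty", "1" }, new[] { "Price", "9.50" } },
        };

        Console.WriteLine(Resolve(tables, new CellSelector(0)));                         // Name Ada
        Console.WriteLine(Resolve(tables, new CellSelector(1, columnID: 0)));            // City
        Console.WriteLine(Resolve(tables, new CellSelector(2, rowID: 1, columnID: 1)));  // 9.50
    }
}
```

The same idea could be carried over to the WordDocumentTable class by changing RowID and ColumnID to int? and testing HasValue instead of comparing against 0.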