1. Reference
HTML table
Table tags
Tag | Description |
<table> | Defines a table. |
<caption> | Defines the table title (caption). |
<th> | Defines a header cell of the table. |
<tr> | Defines a table row. |
<td> | Defines a table data cell. |
<thead> | Defines the header section of the table. |
<tbody> | Defines the body section of the table. |
<tfoot> | Defines the footer section of the table. |
<col> | Defines the properties applied to table columns. |
<colgroup> | Defines a group of table columns. |
Locating table elements
Check the page source: it contains no thead or tbody elements ...
<table class="wikitable sortable" style="text-align:center; font-size:85%; width:auto; table-layout:fixed;">
<caption>list of text editors</caption>
<tr>
<th style="width:12em">Name</th>
<th>Creator</th>
<th>first public release</th>
<th data-sort-type="number">latest stable version</th>
<th>latest Release date</th>
<th><a href="/wiki/programming_language" title="programming Language">programming language</a></th>
<th data-sort-type="currency">cost (<a href="/wiki/united_states_dollar" title="states dollar">US$</a>)</th>
<th><a href="/wiki/software_license" title="Software License">software license</a></th>
<th><a href="/wiki/free_and_open-source_software" title="Free and Open-source software">open source</a></th>
<th><a href="/wiki/command-line_interface" title="command-line interface">CLI available</a></th>
<th>minimum installed size</th>
</tr>
<tr> <th
2. Extracting tabular data
The table caption may contain a hyperlink, which splits the title text across several nodes,
or the table may have no caption at all.
For example:
<caption>Text editor support for remote file editing over
<a href="/wiki/lists_of_network_protocols" title="Lists of network protocols">network protocols</a>
</caption>
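The caption case above can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses xml.etree.ElementTree instead of the Scrapy selectors used later in this post, the SAMPLE markup is a simplified, well-formed stand-in for the real page, and caption_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for a table whose caption contains a link.
SAMPLE = (
    '<table>'
    '<caption>Text editor support for remote file editing over '
    '<a href="/wiki/lists_of_network_protocols">network protocols</a>'
    '</caption>'
    '<tr><th>Name</th></tr>'
    '</table>'
)


def caption_text(table):
    """Return the full caption text, or '' when the table has no caption."""
    caption = table.find('caption')
    if caption is None:
        return ''
    # itertext() walks the caption and all of its descendants, so the text
    # inside the <a> element is joined with the text around it.
    joined = ''.join(caption.itertext())
    # Collapse runs of whitespace left over from the markup layout.
    return ' '.join(joined.split())


print(caption_text(ET.fromstring(SAMPLE)))
print(repr(caption_text(ET.fromstring('<table><tr><td>x</td></tr></table>'))))
```

Joining every descendant text node plays the same role as the XPath string() function: the title comes back as one string even when a link splits it.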
Cell contents may span several lines
<td>
<a href="/wiki/plan_9_from_bell_labs" title="Plan 9 from Bell Labs">Plan 9</a>
and
<a href="/wiki/inferno_(operating_system)" title="Inferno (operating system)">Inferno</a>
</td>
Tag rule
table | | | | |
thead TR1 | th | th | th | th |
tbody TR2 | td/th | td |
tbody TR3 | td/th |
tbody TR3 | td/th |
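The layout above suggests that row extraction should not depend on whether thead and tbody are present, and that a row may start with th as well as td. A minimal sketch of that idea, assuming the standard library's xml.etree.ElementTree rather than Scrapy; the two sample tables and the table_rows helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same table written with and without thead/tbody.
WITH_SECTIONS = (
    '<table>'
    '<thead><tr><th>Name</th><th>Creator</th></tr></thead>'
    '<tbody><tr><th>Acme</th><td>Example Corp</td></tr></tbody>'
    '</table>'
)
WITHOUT_SECTIONS = (
    '<table>'
    '<tr><th>Name</th><th>Creator</th></tr>'
    '<tr><th>Acme</th><td>Example Corp</td></tr>'
    '</table>'
)


def table_rows(table):
    """Return the text of every th/td cell, one list per tr."""
    rows = []
    # iter('tr') finds rows at any depth, so thead/tbody wrappers do not
    # matter. Note that it would also descend into nested tables.
    for tr in table.iter('tr'):
        cells = [cell for cell in tr if cell.tag in ('th', 'td')]
        rows.append([''.join(cell.itertext()).strip() for cell in cells])
    return rows


print(table_rows(ET.fromstring(WITH_SECTIONS)))
print(table_rows(ET.fromstring(WITHOUT_SECTIONS)))
```

Both calls yield the same rows, which is the property the extraction code relies on.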
2.1 Extract the list of all table captions
import re

filenames = []
for index, table in enumerate(response.xpath('//table')):
    # string(./caption) returns all text inside the caption tag,
    # including the text of child nodes
    caption = table.xpath('string(./caption)').extract_first()
    # This also works: caption = ''.join(table.xpath('./caption//text()').extract())
    # XPath positions start at [1], so tables are numbered from 1
    filename = str(index + 1) + '_' + caption if caption else str(index + 1)
    # Remove special symbols
    filenames.append(re.sub(r'[^\w\s()]', '', filename))

In [233]: filenames
Out[233]:
[u'1_list of text editors',
 u'2_text Editor support for various operating systems',
 u'3_available Languages for the UI',
 u'4_text Editor support for common document interfaces',
 u'5_text Editor support for basic editing features',
 u'6_text Editor support for programming features (see Source code Editor)',
 u'7_text Editor support for other programming features',
 '8',
 u'9_text Editor support for key bindings',
 u'10_text Editor support for remote file editing over network Protocols',
 u'11_text Editor support for some of the most common character encodings',
 u'12_right to Left (RTL) bidirectional (BIDI) support',
 u'13_support for newline characters on line endings']
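The naming rule used above can be isolated and tried without a Scrapy response. The sketch below assumes Python 3; make_filename is a hypothetical helper name and the sample captions are illustrative.

```python
import re


def make_filename(index, caption):
    """Build '<n>_<caption>' (or just '<n>' for a missing caption), 1-based."""
    name = '%d_%s' % (index + 1, caption) if caption else str(index + 1)
    # Keep only word characters, whitespace and parentheses.
    return re.sub(r'[^\w\s()]', '', name)


print(make_filename(0, 'List of text editors'))
print(make_filename(7, ''))
print(make_filename(11, 'Right-to-left (RTL) support'))
```

Because hyphens, slashes and colons are dropped rather than replaced, two different captions could in principle collapse to the same name; the leading table number keeps the filenames unique.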
2.2 Write each table to its own CSV file
import csv

for index, filename in enumerate(filenames):
    print filename
    with open('%s.csv' % filename, 'wb') as fp:
        writer = csv.writer(fp)
        # (//table)[n] selects the n-th table in document order,
        # matching the order of enumerate() in 2.1
        for tr in response.xpath('(//table)[%s]/tr' % (index + 1)):
            # './*' takes every child of the row; an XPath union such as
            # tr.xpath('./th | ./td') restricts it to the cell tags
            writer.writerow([
                i.xpath('string(.)').extract_first()
                 .replace(u'\xa0', u' ').strip()
                 .encode('utf-8', 'replace')
                for i in tr.xpath('./*')
            ])
Encoding handling: .replace(u'\xa0', u' ')
The HTML entity &nbsp; represents a non-breaking space, whose Unicode value is u'\xa0'; this appears to fall outside the GBK encoding range.
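The GBK question can be checked directly. This is a small sketch assuming Python 3 (the post's own code is Python 2); can_encode is a hypothetical helper and 'Plan&nbsp;9' is a made-up cell value.

```python
import html


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# &nbsp; decodes to the single character U+00A0.
nbsp = html.unescape('&nbsp;')
print(repr(nbsp))

# U+00A0 has no GBK mapping, but UTF-8 can represent it.
print(can_encode(nbsp, 'gbk'))
print(can_encode(nbsp, 'utf-8'))

# Replacing it with an ordinary space makes the cell safe for either codec.
cell = html.unescape('Plan&nbsp;9')
clean = cell.replace('\xa0', ' ').strip()
print(repr(clean), can_encode(clean, 'gbk'))
```

Replacing the character before encoding avoids both the GBK error and invisible non-breaking spaces ending up in the CSV.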
Opening the CSV file with 'w' causes the following problem; opening it with 'wb' solves it:
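The 'wb' fix applies to Python 2, which the code in this post uses. On Python 3 the csv module expects a text-mode file opened with newline='' instead. A minimal sketch of that variant, assuming Python 3; the sample rows and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for one extracted table.
rows = [['Name', 'Creator'], ['Vim', 'Bram Moolenaar']]

path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# newline='' stops the file object from translating the '\r\n' that
# csv.writer emits, which is what produces blank lines on Windows.
with open(path, 'w', newline='', encoding='utf-8') as fp:
    csv.writer(fp).writerows(rows)

# Read the raw bytes back to see the exact line endings that were written.
with open(path, 'rb') as fp:
    raw = fp.read()
print(repr(raw))
```

Each row ends with exactly one '\r\n', so spreadsheet programs do not show an empty row between records.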
"Resolved" Python has an extra blank line in the contents of the CSV Writerow output – on the way
To write all tables to separate worksheets of the same Excel file, you need to use xlwt:
Python: Creating an Excel workbook and dumping a CSV file as a worksheet
Python: extracting web page tables and saving them as CSV