Python: extract tables from a web page and save them as CSV

Source: Internet
Author: User
Tags: xpath

1. Reference

HTML table

Tag         Description
<table>     Defines a table.
<caption>   Defines the table caption (title).
<th>        Defines a header cell.
<tr>        Defines a table row.
<td>        Defines a table cell.
<thead>     Groups the header rows of the table.
<tbody>     Groups the body rows of the table.
<tfoot>     Groups the footer rows of the table.
<col>       Defines properties for a table column.
<colgroup>  Defines a group of table columns.

Locating the table elements

The page source shows that the tables have no thead or tbody elements ...

<table class="wikitable sortable" style="text-align:center; font-size:85%; width:auto; table-layout:fixed;">
<caption>list of text editors</caption>
<tr>
<th style="width:12em">Name</th>
<th>Creator</th>
<th>first public release</th>
<th data-sort-type="number">latest stable version</th>
<th>latest Release date</th>
<th><a href="/wiki/programming_language" title="programming Language">programming language</a></th>
<th data-sort-type="currency">cost (<a href="/wiki/united_states_dollar" title="states dollar">US$</a>)</th>
<th><a href="/wiki/software_license" title="Software License">software license</a></th>
<th><a href="/wiki/free_and_open-source_software" title="Free and Open-source software">open source</a></th>
<th><a href="/wiki/command-line_interface" title="command-line interface">CLI available</a></th>
<th>minimum installed size</th>
</tr>
<tr>
<th

2. Extracting tabular data

The table caption may contain a hyperlink, which splits the title across several text nodes, or the table may have no caption at all.

For example, a caption that ends with a link:
<a href="/wiki/lists_of_network_protocols" title="Lists of network protocols">network protocols</a>
</caption>
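
A caption that is split by a hyperlink can still be read as one string by joining every text node under the element, which is the effect of XPath string(./caption). The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree on Python 3 rather than the selectors used in this article, the markup is a hand-written well-formed sample, and caption_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample; real pages usually need a lenient HTML parser.
SAMPLE = (
    '<table>'
    '<caption>Text editor support for remote file editing over '
    '<a href="/wiki/lists_of_network_protocols">network protocols</a>'
    '</caption>'
    '<tr><th>Name</th></tr>'
    '</table>'
)


def caption_text(table):
    """Return all text inside <caption>, or '' if the table has no caption."""
    caption = table.find('caption')
    if caption is None:
        return ''
    # itertext() walks the element and its descendants in document order,
    # so the text before and inside the <a> link is joined back together.
    return ''.join(caption.itertext()).strip()


table = ET.fromstring(SAMPLE)
no_caption = ET.fromstring('<table><tr><td>x</td></tr></table>')
print(caption_text(table))
print(repr(caption_text(no_caption)))
```

Returning an empty string for a missing caption keeps the caller's "if caption" check simple.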

The contents of a table cell may wrap across several lines

<td>
<a href="/wiki/plan_9_from_bell_labs" title="Plan 9 from Bell Labs">plan 9</a>
and
<a href="/wiki/inferno_(operating_system)" title="Inferno (operating system)">Inferno</a>
</td>
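
One way to turn such a multi-line cell into a single value is to join its text nodes and collapse every run of whitespace into one space. This is a minimal Python 3 sketch using the standard library; cell_text is a hypothetical helper and the markup is a simplified copy of the cell above.

```python
import xml.etree.ElementTree as ET

CELL = '''<td>
<a href="/wiki/plan_9_from_bell_labs">plan 9</a>
and
<a href="/wiki/inferno_(operating_system)">Inferno</a>
</td>'''


def cell_text(cell):
    """Join all text in a cell and collapse runs of whitespace to one space."""
    raw = ''.join(cell.itertext())
    # str.split() with no argument splits on any whitespace, including
    # newlines and the non-breaking space U+00A0.
    return ' '.join(raw.split())


text = cell_text(ET.fromstring(CELL))
print(text)
```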

Tag structure

table
  thead  tr1  th  th  th  th
  tbody  tr2  td/th  td
  tbody  tr3  td/th
  tbody  tr3  td/th
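
The structure above means a row can hold th cells, td cells, or both, so a row reader should accept either tag. The following Python 3 sketch, with placeholder data and a hypothetical table_rows helper, shows one way to do that with the standard library.

```python
import xml.etree.ElementTree as ET

# Placeholder data: the first cell of each body row is a <th>, the rest are <td>.
SAMPLE = '''<table>
<tr><th>Name</th><th>Creator</th></tr>
<tr><th>Editor A</th><td>Person A</td></tr>
<tr><th>Editor B</th><td>Person B</td></tr>
</table>'''


def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    # iter('tr') finds rows directly under <table> as well as rows wrapped
    # in <thead>/<tbody>; note it would also descend into nested tables.
    for tr in table.iter('tr'):
        cells = [c for c in tr if c.tag in ('th', 'td')]
        rows.append([' '.join(''.join(c.itertext()).split()) for c in cells])
    return rows


rows = table_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```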

2.1 Extract the list of table captions

import re

# `response` is assumed to be the Scrapy response for the page (e.g. in scrapy shell).
filenames = []
for index, table in enumerate(response.xpath('//table')):
    # string(./caption) returns all text inside the caption tag, including the
    # text of child nodes. Equivalent to:
    # caption = ''.join(table.xpath('./caption//text()').extract())
    caption = table.xpath('string(./caption)').extract_first()
    # XPath positions start at [1], so the tables are numbered from 1.
    filename = str(index + 1) + '_' + caption if caption else str(index + 1)
    # Remove special characters.
    filenames.append(re.sub(r'[^\w\s()]', '', filename))

In [233]: filenames
Out[233]:
[u'1_list of text editors',
 u'2_text Editor support for various operating systems',
 u'3_available Languages for the UI',
 u'4_text Editor support for common document interfaces',
 u'5_text Editor support for basic editing features',
 u'6_text Editor support for programming features (see Source code Editor)',
 u'7_text Editor support for other programming features',
 '8',
 u'9_text Editor support for key bindings',
 u'10_text Editor support for remote file editing over network Protocols',
 u'11_text Editor support for some of the most common character encodings',
 u'12_right to Left (RTL) bidirectional (BIDI) support',
 u'13_support for newline characters on line endings']
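
The naming rule can be checked on its own, without a page to scrape. This sketch (Python 3) applies the same regular expression to made-up captions; make_filename is a hypothetical helper and the sample captions are not taken from the page.

```python
import re


def make_filename(position, caption):
    """Build '<position>_<caption>' (or just '<position>') and strip symbols."""
    name = '%d_%s' % (position, caption) if caption else str(position)
    # Keep only word characters, whitespace and parentheses.
    return re.sub(r'[^\w\s()]', '', name)


samples = ['List (demo)', '', 'A & B: C/D']
names = [make_filename(i, c) for i, c in enumerate(samples, start=1)]
print(names)
```

Removing a symbol that sits between two spaces leaves a double space in the name, as the third sample shows; collapse whitespace afterwards if that matters.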

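2.2 Write each table to its own CSV file

import csv

for index, filename in enumerate(filenames):
    print filename
    # Python 2: open the file in binary mode ('wb') to avoid blank lines between rows.
    with open('%s.csv' % filename, 'wb') as fp:
        writer = csv.writer(fp)
        # (//table)[n] selects the n-th table in document order, matching enumerate() above.
        for tr in response.xpath('(//table)[%s]/tr' % (index + 1)):
            # './*' takes every child of the row; an XPath union such as
            # tr.xpath('./th | ./td') restricts it to header and data cells.
            writer.writerow([i.xpath('string(.)').extract_first().replace(u'\xa0', u' ').strip().encode('utf-8', 'replace')
                             for i in tr.xpath('./*')])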
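
For readers on Python 3, the same loop can be sketched with the standard library alone. This is an illustration under stated assumptions, not the article's code: the page is a small hand-written, well-formed sample, xml.etree.ElementTree stands in for the XPath selectors, and write_tables and clean are hypothetical helper names.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

# Hand-written sample page; &#160; is the non-breaking space (U+00A0).
PAGE = '''<html><body>
<table><caption>First table</caption>
<tr><th>Name</th><th>Cost</th></tr>
<tr><th>Editor A</th><td>Free</td></tr>
</table>
<table>
<tr><th>Name</th><th>Notes</th></tr>
<tr><th>Editor B</th><td>Plan&#160;9 port</td></tr>
</table>
</body></html>'''


def clean(element):
    """All text under the element, with non-breaking spaces replaced."""
    return ''.join(element.itertext()).replace('\xa0', ' ').strip()


def write_tables(page, out_dir):
    """Write every table in the page to '<n>_<caption>.csv' or '<n>.csv'."""
    paths = []
    root = ET.fromstring(page)
    for position, table in enumerate(root.iter('table'), start=1):
        caption = table.find('caption')
        title = clean(caption) if caption is not None else ''
        name = '%d_%s' % (position, title) if title else str(position)
        path = os.path.join(out_dir, name + '.csv')
        # Python 3: text mode with newline='' replaces Python 2's 'wb'.
        with open(path, 'w', newline='', encoding='utf-8') as fp:
            writer = csv.writer(fp)
            for tr in table.iter('tr'):
                writer.writerow([clean(c) for c in tr if c.tag in ('th', 'td')])
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
paths = write_tables(PAGE, out_dir)
print([os.path.basename(p) for p in paths])
```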

Encoding handling: .replace(u'\xa0', u' ')

The HTML entity &nbsp; represents a non-breaking space, which is u'\xa0' in Unicode. This character appears to fall outside the GBK encoding range, so it is replaced before the text is written.
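
The GBK question can be checked directly. The sketch below (Python 3, standard library) decodes the entity and tries to encode the result; can_encode is a hypothetical helper name.

```python
import html


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cell = html.unescape('Plan&nbsp;9')  # the entity becomes U+00A0
print(repr(cell))
print(can_encode(cell, 'gbk'), can_encode(cell, 'utf-8'))
print(can_encode(cell.replace('\xa0', ' '), 'gbk'))
```

With Python's gbk codec the raw cell fails to encode while the cleaned one succeeds, which matches the error typically seen when printing such text to a GBK console.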

Opening the CSV file with 'w' produces extra blank lines between the rows; opening it with 'wb' solves the problem. See:

[Resolved] Python: extra blank lines in the output of csv writerow (from the blog On the Way)

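The 'wb' fix applies to Python 2. On Python 3 the csv module expects a text-mode file opened with newline='' instead, which serves the same purpose. A minimal sketch with placeholder rows and a hypothetical write_rows helper:

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file without platform newline translation."""
    # newline='' lets the csv writer emit its own '\r\n' line endings unchanged.
    with open(path, 'w', newline='', encoding='utf-8') as fp:
        csv.writer(fp).writerows(rows)


path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_rows(path, [['Name', 'Creator'], ['Editor A', 'Person A']])
with open(path, 'rb') as fp:
    data = fp.read()
print(data)
```

Without newline='', Windows would turn each '\r\n' into '\r\r\n', which many viewers display as a blank line after every row.
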
To write all the tables to separate worksheets of a single Excel file instead, use xlwt. See:

Python: Creating an Excel workbook and dumping a CSV file as a worksheet
