A simple method for parsing HTML tables in Python

Source: Internet
Author: User
This article shows a simple way to parse HTML tables in Python. It is shared here for your reference. The details are as follows:

The code relies on libxml2dom, so make sure to install it first. Import the module into your script and call the parse_tables() function with the following arguments:
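Since the parser depends on a third-party module, it can help to confirm that the module is importable before running anything. The following is only a minimal sketch for Python 3 using the standard library; the helper name module_available is hypothetical and not part of libxml2dom.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no such top-level module is installed
    return importlib.util.find_spec(name) is not None


if not module_available("libxml2dom"):
    print("libxml2dom is not installed; install it before calling parse_tables()")
```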

1. source = a string containing the source code. You can pass in just the table or the code of the entire page.

2. headers = a list of ints OR a list of strings
If headers is a list of ints, the table is treated as having no headers: just list the 0-based indexes of the cells in each row from which you want to extract data.
If headers is a list of strings, the table is treated as having header columns (marked up with th tags), and the information is pulled from the specified columns.

3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the table you want to parse is the third table in the code, pass in the number 2 here.
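The headers argument must be either all ints or all strings. A check along the following lines could catch a mixed or empty list before any parsing starts. This is a sketch under assumptions: the helper name headers_kind and the choice to raise ValueError are illustrative and do not come from the original code.

```python
def headers_kind(headers):
    """Return 'index' for a list of ints or 'name' for a list of strings.

    Raises ValueError for an empty list or a list of mixed types.
    """
    if not headers:
        raise ValueError("headers must not be empty")
    # bool is a subclass of int, so exclude it explicitly
    if all(isinstance(h, int) and not isinstance(h, bool) for h in headers):
        return "index"
    if all(isinstance(h, str) for h in headers):
        return "name"
    raise ValueError("headers must be all ints or all strings")


print(headers_kind([0, 2]))             # index
print(headers_kind(["Name", "Price"]))  # name
```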

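It will return a list of lists. Each inner list contains the parsed information from one row. If the requested table index does not exist, it returns a one-item list holding an error message instead.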

The specific code is as follows:

# The goal of the table parser is to get specific information from specific
# columns in a table.
# Input: source code from a typical website
# Arguments: a list of headers the user wants to return
# Output: a list of lists of the data in each row
import libxml2dom


def parse_tables(source, headers, table_index):
    """parse_tables(string source, list headers, int table_index)

    headers may be a list of strings if the table has headers defined, or a
    list of ints if no headers are defined; the ints select cells by index.
    This method returns a list of lists.
    """
    print('Printing headers: %s' % (headers,))
    # Determine whether the headers list holds ints or strings (only the
    # first element is checked) and route to the correct function
    if isinstance(headers[0], int):
        # run the no_header function
        return no_header(source, headers, table_index)
    elif isinstance(headers[0], str):
        # run the header_given function
        return header_given(source, headers, table_index)
    else:
        # return None if the headers aren't correct
        return None


# This function takes in the source code of the whole page, a string list of
# headers and the index number of the table on the page. It returns a list of
# lists with the scraped information.
def header_given(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # initiate a list to hold the index numbers of the data in the rows
    header_index = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except (IndexError, TypeError):
        # if the table doesn't exist then return an error
        return ['The table index is not found']
    # get a list of headers in the table
    table_headers = main_table.getElementsByTagName('th')
    # loop through each header looking for matches
    for position, header in enumerate(table_headers):
        # if the header is in the desired headers list, record its position
        if header.textContent in headers:
            header_index.append(position)
    # get the rows from the table
    rows = main_table.getElementsByTagName('tr')
    # flag detecting if the first row is being viewed
    first_row = True
    # loop through the rows in the table, skipping the first row
    for row in rows:
        if first_row:
            first_row = False
            continue
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # iterate through all of the header indexes
        for i in header_index:
            # append the cell's text content to the cell_list
            cell_list.append(cells[i].textContent)
        # append the cell_list to the return_list
        return_list.append(cell_list)
    # return the return_list
    return return_list


# This function takes in the source code of the whole page, an int list of
# headers indicating the index number of the needed item, and the index number
# of the table on the page. It returns a list of lists with the scraped info.
def no_header(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except (IndexError, TypeError):
        # if the table doesn't exist then return an error
        return ['The table index is not found']
    # get all of the rows out of the main_table
    rows = main_table.getElementsByTagName('tr')
    # loop through each row
    for row in rows:
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # loop through the list of desired headers
        for i in headers:
            try:
                # try to add text from the cell into the cell_list
                cell_list.append(cells[i].textContent)
            except IndexError:
                # the row has no cell at this index, so just continue
                continue
        # append the data scraped into the return_list
        return_list.append(cell_list)
    # return the return list
    return return_list
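For readers who cannot install libxml2dom, the same idea can be sketched with Python 3's standard-library html.parser module. This is a swapped-in technique, not the article's method: the names TableCollector and parse_tables_stdlib are hypothetical, the sketch assumes every tr, td and th tag is explicitly closed, it does not handle nested tables, and it strips surrounding whitespace from cell text. Unlike the code above, it treats the first row as the header row and counts th cells as ordinary cells when integer indexes are used.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each table as a list of rows; each row is a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>, in document order
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # keep text only while inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def parse_tables_stdlib(source, headers, table_index):
    """Hypothetical stand-in for parse_tables() built on html.parser."""
    collector = TableCollector()
    collector.feed(source)
    collector.close()
    try:
        rows = collector.tables[table_index]
    except IndexError:
        return ["The table index is not found"]
    if not rows:
        return []
    if isinstance(headers[0], str):
        # map the wanted header names to column positions using the first row
        wanted = [i for i, name in enumerate(rows[0]) if name in headers]
        body = rows[1:]
    else:
        wanted = list(headers)
        body = rows
    # skip indexes that a short row does not have
    return [[row[i] for i in wanted if i < len(row)] for row in body]


page = """
<table>
  <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
  <tr><td>Apple</td><td>1.20</td><td>14</td></tr>
  <tr><td>Pear</td><td>0.90</td><td>3</td></tr>
</table>
"""
print(parse_tables_stdlib(page, ["Name", "Stock"], 0))
# [['Apple', '14'], ['Pear', '3']]
```

Passing a list of ints such as [0, 1] instead returns the first two cells of every row, including the header row, because this sketch does not distinguish th from td.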

Hopefully this article will help you with Python programming.
