Python scraping: data storage

Python web data collection, part 3: storing data in CSV and MySQL

As a warm-up, let's grab the links to all the pictures on a page.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.16193'}
start_url = 'https://www.pythonscraping.com'
r = requests.get(start_url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
# Get all img tags
img_tags = soup.find_all('img')
for tag in img_tags:
    print(tag['src'])
http://pythonscraping.com/img/lrg%20(1).jpg

Store a Web page table in a CSV file

Take this page as an example: the Wikipedia article "Comparison of text editors" contains several tables, and we will scrape the first one.

import csv
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.16193'}
url = 'https://en.wikipedia.org/wiki/Comparison_of_text_editors'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
# Only the first table is needed
rows = soup.find('table', class_='wikitable').find_all('tr')
# The csv module writes an extra blank line after every row unless newline is set to ''
with open('editors.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.find_all(['th', 'td']):
            csv_row.append(cell.text)
        writer.writerow(csv_row)

One thing to note: when opening the file you need to specify newline='', otherwise an extra blank line is written after every row of the CSV file.

Reading a CSV file from the network

The previous example stored the contents of a web page in a CSV file. What if we want to read a CSV file from the Internet? We do not want to download it first and then read it from disk. However, a network request returns a string rather than a file object, and csv.reader() needs a file object, so the string has to be converted into one. Python's built-in StringIO and BytesIO classes (in the io module) let you treat strings and bytes as files. The csv module's reader expects strings, so use StringIO; if you are processing binary data, use BytesIO. Once converted to a file-like object, the data can be processed with the csv module.
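For binary data the pattern is the same. Here is a minimal sketch, reusing the image URL from the warm-up example above purely as an illustration, that wraps the raw bytes of a response in BytesIO:

from io import BytesIO
import requests

# Any URL returning binary content works the same way; this image URL is just an example
resp = requests.get('http://pythonscraping.com/img/lrg%20(1).jpg')
data_file = BytesIO(resp.content)  # resp.content is bytes, so BytesIO is the right wrapper
print(data_file.read(10))          # the result behaves like a file: read(), seek(), etc.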

The key line in the code below is data_file = StringIO(csv_data.text), which converts the string into a file-like object.

from io import StringIO
import csv
import requests

csv_data = requests.get('http://pythonscraping.com/files/MontyPythonAlbums.csv')
data_file = StringIO(csv_data.text)
reader = csv.reader(data_file)
for row in reader:
    print(row)
['Name', 'Year']
["Monty Python's Flying Circus", '1970']
['Another Monty Python Record', '1971']
["Monty Python's Previous Record", '1972']
['The Monty Python Matching Tie and Handkerchief', '1973']
['Monty Python Live at Drury Lane', '1974']
['An Album of the soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
['Monty Python Live at City Center', '1977']
['The Monty Python Instant Record Collection', '1977']
["Monty Python's Life of Brian", '1979']
["Monty Python's cotractual obligation Album", '1980']
["Monty Python's The Meaning of Life", '1983']
['The Final Rip Off', '1987']
['Monty Python Sings', '1989']
['The Ultimate Monty Python Rip Off', '1994']
['Monty Python Sings Again', '2014']

DictReader lets you access the data like a dictionary, using the first row of the table (usually the header) as the keys. You can then read each row's value for a given key.
Each row of data is an OrderedDict, accessible by key. Look at the first line of printed output above: the columns are described by the two keys Name and Year. You can also inspect the keys with reader.fieldnames.

from io import StringIO
import csv
import requests

csv_data = requests.get('http://pythonscraping.com/files/MontyPythonAlbums.csv')
data_file = StringIO(csv_data.text)
reader = csv.DictReader(data_file)
# View the keys
print(reader.fieldnames)
for row in reader:
    print(row['Year'], row['Name'], sep=': ')
['Name', 'Year']
1970: Monty Python's Flying Circus
1971: Another Monty Python Record
1972: Monty Python's Previous Record
1973: The Monty Python Matching Tie and Handkerchief
1974: Monty Python Live at Drury Lane
1975: An Album of the soundtrack of the Trailer of the Film of Monty Python and the Holy Grail
1977: Monty Python Live at City Center
1977: The Monty Python Instant Record Collection
1979: Monty Python's Life of Brian
1980: Monty Python's cotractual obligation Album
1983: Monty Python's The Meaning of Life
1987: The Final Rip Off
1989: Monty Python Sings
1994: The Ultimate Monty Python Rip Off
2014: Monty Python Sings Again

Storing data

The ability to store large amounts of data and interact with it is already a priority when developing new programs.

There are two main ways to store media files: save only the URL links, or download the source files directly.

Advantages of directly referencing URL links:

Crawlers run faster and consume less bandwidth, because only the links are fetched and no files have to be downloaded.

You can save a lot of storage space because you just need to store the URL link.

The code that stores only URLs is easier to write, because no file-download logic needs to be implemented.

Not downloading files can reduce the load on the target host server.

Disadvantages of directly referencing URL links:

Embedding these external URLs in your own site or app is called hotlinking, and most sites implement anti-hotlinking measures.

Because the linked files sit on someone else's server, your app runs at that server's pace.

Hotlinked content can change at any time. If a hotlinked image on your blog is noticed by the other server, it may well be swapped for something embarrassing. And if you save a URL link to use later, it may already be dead by then, or the content may have become completely irrelevant.

In Python 3, urllib.request.urlretrieve can download a file given its URL:

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com')
bsObj = BeautifulSoup(html)
imageLocation = bsObj.find('a', {'id': 'logo'}).find('img')['src']
urlretrieve(imageLocation, 'logo.jpg')

CSV (comma-separated values) is a common file format for storing tabular data.

A common task in web data collection is fetching an HTML table and writing it to a CSV file.

Except for user-defined identifiers, MySQL is case-insensitive; by convention, MySQL keywords are written in uppercase.

Connections and cursors (connection/cursor) are the two central objects in database programming:

Besides connecting to the database, a connection also sends data to it, handles rollback operations, creates cursor objects, and so on.

A connection can create multiple cursors. A cursor keeps track of state information, such as which database is in use, and it also holds the results of the last executed query; you retrieve them with cursor methods such as fetchall().

After using a cursor and connection, be sure to close them, otherwise you will leak connections that keep consuming database resources.

Use a try ... finally statement to ensure that the database connection and cursor are closed.
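A minimal sketch of this pattern with pymysql (host, user, password and database name are placeholders):

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='admin',
                       db='wiki', charset='utf8')
cur = conn.cursor()
try:
    cur.execute('SELECT VERSION();')
    print(cur.fetchone()[0])
finally:
    # Always close the cursor and the connection, even if the query above fails
    cur.close()
    conn.close()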

Several ways to make your database more efficient:

Add an id field to each table; databases rarely choose a good primary key on their own.

Use intelligent indexes, for example: CREATE INDEX definition ON dictionary (id, definition(16));

Choose an appropriate normal form (level of normalization).

Sending email: fetch information with a crawler or an API, and send an email automatically when a condition is met. That is exactly how email subscription services work.
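A minimal sketch of automated email sending with Python's built-in smtplib (the addresses, SMTP server and trigger condition are placeholders):

import smtplib
from email.mime.text import MIMEText

def send_mail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'me@example.com'        # placeholder sender
    msg['To'] = 'subscriber@example.com'  # placeholder recipient
    with smtplib.SMTP('localhost') as s:  # placeholder SMTP server
        s.send_message(msg)

# For example, after a crawler or API call has fetched some value:
# if new_value != old_value:
#     send_mail('Something changed', f'New value: {new_value}')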

Saving the links between pages

For example, on page A you can find a link to page B; this can be represented as A -> B. We just want to save this connection to the database. First build the tables:

The pages table stores only the URL of each page.

CREATE TABLE `pages` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
)

The links table stores the fromId and toId of each link, which are exactly the ids in pages. For example, 1 -> 2 means that from the URL with ID 1 in pages you can reach the URL with ID 2.

CREATE TABLE `links` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `fromId` int(11) DEFAULT NULL,
  `toId` int(11) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
)

The statements above look a bit verbose; I actually built the tables with a visual tool first and then viewed the generated DDL with a statement like SHOW CREATE TABLE pages.

import re
import pymysql
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.16193'}
conn = pymysql.connect(host='localhost', user='root', password='admin', db='wiki', charset='utf8')
cur = conn.cursor()

def insert_page_if_not_exists(url):
    cur.execute(f"SELECT * FROM pages WHERE url='{url}';")
    # This URL has not been inserted yet
    if cur.rowcount == 0:
        # Insert it
        cur.execute(f"INSERT INTO pages (url) VALUES ('{url}');")
        conn.commit()
        # Return the id of the row just inserted
        return cur.lastrowid
    # Otherwise the URL already exists; URLs are unique, so take the id of the first row
    else:
        return cur.fetchone()[0]

def insert_link(from_page, to_page):
    print(from_page, '->', to_page)
    cur.execute(f"SELECT * FROM links WHERE fromId={from_page} AND toId={to_page};")
    # If no such row exists yet, insert the ids of the two pages, i.e. of the two URLs
    if cur.rowcount == 0:
        cur.execute(f"INSERT INTO links (fromId, toId) VALUES ({from_page}, {to_page});")
        conn.commit()

# Pages that have already been visited
pages = set()

# Get all links on a page
def get_links(page_url, recursion_level):
    global pages
    if recursion_level == 0:
        return
    # Insert (or look up) the page just reached
    page_id = insert_page_if_not_exists(page_url)
    r = requests.get('https://en.wikipedia.org' + page_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    link_tags = soup.find_all('a', href=re.compile('^/wiki/[^:/]*$'))
    for link_tag in link_tags:
        # page_id is the id of the current URL; insert_page_if_not_exists() is called again
        # to get the id of each URL reachable from it, so that pairs such as 1 -> 2 and 1 -> 3
        # can be recorded
        insert_link(page_id, insert_page_if_not_exists(link_tag['href']))
        if link_tag['href'] not in pages:
            new_page = link_tag['href']
            pages.add(new_page)
            # Recurse, at most recursion_level levels deep
            get_links(new_page, recursion_level - 1)

if __name__ == '__main__':
    try:
        get_links('/wiki/Kevin_Bacon', 5)
    except Exception as e:
        print(e)
    finally:
        cur.close()
        conn.close()
1 -> 2
2 -> 1
...

The printed output is easy to read. Looking at the first two lines: the URL with ID 1 in the pages table links to the URL with ID 2, the URL with ID 2 links back to the URL with ID 1, and so on.

First, insert_page_if_not_exists(page_url) gets the ID of the current page's URL, and then insert_link(fromId, toId) records the link. fromId is the ID of the current page's URL, and toId is the ID of a URL that can be reached from the current page; the reachable URLs are returned as a list by BS4. The current URL's ID is page_id, so the second argument of insert_link calls insert_page_if_not_exists(link) again to get the ID of each URL in that list. That is how the connections are formed: for example, if the page just inserted has ID 1 and the URLs reachable from it get IDs 2, 3, 4, ..., links like 1 -> 2 and 1 -> 3 are created.

Look at the database. This is the pages table, where each ID corresponds to a URL.

And this is the links table, whose fromId and toId columns refer to ids in pages. Its contents are of course the same as the printed output, but printed output scrolls past and is gone, whereas data saved in the database is very useful whenever you need to analyze it later.
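For example, a minimal sketch (assuming the same wiki database and credentials as above) that joins links against pages to turn the stored id pairs back into URL pairs:

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='admin',
                       db='wiki', charset='utf8')
cur = conn.cursor()
try:
    # Join pages twice to resolve (fromId, toId) into (from_url, to_url)
    cur.execute("""
        SELECT p1.url, p2.url
        FROM links
        JOIN pages AS p1 ON links.fromId = p1.id
        JOIN pages AS p2 ON links.toId = p2.id
        LIMIT 10;
    """)
    for from_url, to_url in cur.fetchall():
        print(from_url, '->', to_url)
finally:
    cur.close()
    conn.close()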
