[Python] A beginner's handwritten web crawler, the whole process

Source: Internet
Author: User

Write a crawler using only Python's built-in string functions, and store the data in a TXT file. The point is not really to learn web crawling; it is to train my own programming ability against a concrete requirement. The most important thing is to have a clear plan.

Target website: http://bohaishibei.com/post/category/main/ (a very interesting site: each entry is a short joke with a picture, great fun). The site looks like this:

Objective: split the big goal into several small goals. Since this is my first time doing this, and I know my own limits, I will complete them in order from simple to complex:

1. Crawl the content of one issue, including the titles and the URLs of the pictures

2. Place the data in a local TXT file

3. Crawl as many issues as I want

4. Write a website and show it. (Purely for learning)

Step One:

I use Google Chrome in developer mode, with the page element selector, to look at the structure of the page and find the tags that hold the data we want.

All the content of one issue of bohaishibei sits inside the <article class="article-content"> tag, like this:

The first red box: the element selector in the page.

The second: the content tag.

The third: a title.

After the analysis, all I need is the content of the <article class="article-content"> tag, so I wrote the following method:

def content(html):
    # split the article content out by its tags
    str = '<article class="article-content">'
    content = html.partition(str)[2]
    str1 = '<div class="article-social">'
    content = content.partition(str1)[0]
    return content  # the content of the page

Something to mention here: before writing this crawler I had decided to handle the matching with nothing but the string built-in functions, so I went to the Python string reference at http://www.w3cschool.cc/python/ to see which string methods exist.

The partition() method splits a string around a specified separator.

If the string contains the specified separator, it returns a 3-element tuple: the first element is the substring to the left of the separator, the second is the separator itself, and the third is the substring to the right of the separator.

The partition() method was added in Python 2.5.

So with two partition() calls I get only the content I want from the string. Clean!
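
Just to make the idea concrete, a tiny sketch of those two partition() calls on a made-up string (Python 2, like the rest of the post):

html = '<article class="article-content">the content I want<div class="article-social">share buttons</div>'
body = html.partition('<article class="article-content">')[2]  # keep what comes after the opening tag
body = body.partition('<div class="article-social">')[0]       # keep what comes before the next block
print body  # -> the content I want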

Step Two:

Get the titles. The format of a title is as follows; I only want the text after the 【2】 marker, and I'll leave the img below it aside for now, one step at a time.

<p>【2】This is my recent state, please tell me that I am not alone!</p><p><img src=http://ww4.sinaimg.cn/mw690/005cfbldtw1etay8ifthnj30an0aot8w.jpg /></p>

I wrote the following method:

def title(content, beg=0):
    # the idea is to use str.index() and slicing
    try:
        title_list = []
        while True:
            num1 = content.index('【', beg)
            num2 = content.index('</p>', num1)
            title_list.append(content[num1:num2])
            beg = num2
    except ValueError:
        return title_list

The try...except is there because I didn't know how else to break out of the loop... if any expert has a better way, please tell me.

I leave the loop through the ValueError that index() raises when the marker is no longer found; at that point the list is returned and the loop simply ends.
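
One possible alternative (not what I actually use here) is str.find(), which returns -1 instead of raising ValueError, so a plain break ends the loop; title_with_find below is just a sketch of that idea:

def title_with_find(content, beg=0):
    # same slicing idea, but find() returns -1 when nothing is found, so the loop can end with break
    title_list = []
    while True:
        num1 = content.find('【', beg)
        if num1 == -1:   # marker not found any more: we are done
            break
        num2 = content.find('</p>', num1)
        if num2 == -1:
            break
        title_list.append(content[num1:num2])
        beg = num2
    return title_list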

num1 is the position of 【, num2 is the position of </p>, and a slice between them gives the data I want. The thing to note here is that slices "include the head but not the tail", so the data we get looks like this:

Ugh, what is this?! So that's what "including the head" means: the 【 marker is still there!

Then I thought: "Just add 1 to num1 and it's done, right?" I was so naïve...

It turned out to need +3. I think the reason is that it is a Chinese character! (Guidance from the experts is welcome.)
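
If I understand it correctly, the +3 is because in Python 2 a plain str is a byte string, index() counts bytes, and '【' takes three bytes in UTF-8. A small check (my own sketch, assuming a UTF-8 encoded source file):

# -*- coding: utf-8 -*-
marker = '【'
print len(marker)   # -> 3: in Python 2 a plain str is bytes, and '【' is 3 bytes in UTF-8
sample = '【2】This is my recent state'
print sample[sample.index('【') + 3:]   # the 3-byte '【' is skipped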

Step Three:

A recap of what I did last night, recording the time: 10:01. Next I want to crawl the URLs of the pictures. A note here: if you want to download the pictures, the most important step is to get their URLs, and then download and save them locally (with file IO).
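
The downloading itself is not implemented in this post; purely for reference, a minimal sketch of that last part with urllib (the url and the save path below are made up):

import urllib

img_url = 'http://example.com/picture.jpg'    # made-up url, just for illustration
data = urllib.urlopen(img_url).read()          # fetch the raw bytes of the picture
with open('/tmp/picture.jpg', 'wb') as fo:     # save it locally with plain file IO
    fo.write(data)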

First, get the URLs. The principle is the same as for the titles, so I wondered whether to reuse the title method or write a new one. I wrote a separate method, but in fact I copied the title method and only changed the strings being matched. The code is as follows:

def img(content, beg=0):
    # the idea is to use str.index() and slicing
    try:
        img_list = []
        while True:
            src1 = content.index('http', beg)
            src2 = content.index('/></p>', src1)
            img_list.append(content[src1:src2])
            beg = src2
    except ValueError:
        return img_list

The result diagram is as follows:

Here I found that sometimes one title comes with many pictures. After thinking about it, I had the following ideas:

1. I need a method that captures the URLs when one title has multiple images. This needs a check: when the matched string is longer than a single URL, call that method.

2. How should the URLs of multiple images be stored: separated by a symbol, or nested in a list? I'll separate them with '|' here, so only one extra line is needed, plus the length check on the URL, and it's done.
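
For idea 2, a minimal sketch of what the '|'-separated form looks like next to a plain list (the urls are made up):

urls = ['http://example.com/a.jpg', 'http://example.com/b.jpg']   # made-up urls
joined = '|'.join(urls) + '|'                # 'http://example.com/a.jpg|http://example.com/b.jpg|'
back = [u for u in joined.split('|') if u]   # the list can be recovered later when filtering
print joined
print back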

I'll set this problem aside for now, because the URLs will need filtering anyway when I download them; the next step is to write the data to a local txt file, and it won't be too late to solve it there.

Step Four:

Save the data to a local txt.

It is important to note that when writing text, remember to close() the file, and pay attention to the mode used to open it.

When concatenating strings, join() is more efficient than '+'.
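
A tiny sketch of the two styles, just to illustrate the point (toy data):

parts = ['line one', 'line two', 'line three']
text = '\n'.join(parts)      # join() builds the final string in one pass
text2 = ''
for p in parts:
    text2 += p + '\n'        # '+' copies the growing string on every loop iteration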

So I wrote the following code:

def data_out(data):
    # the benefit of writing this as a method is that all the text output lives here
    fo = open("/home/qq/foo.txt", "a+")  # note: change this path to your own
    #for i,e in enumerate(data):
    #    print '%d, title: %s' % (i, e)
    fo.write("\n".join(data))
    # close the open file
    fo.close()

This creates a problem; see the picture:

It causes the last list and the new list to be written on the same line. Also, using with...as is better. The modified code is as follows:

def data_out(data):
    # write to the text file
    with open("/home/qq/foo.txt", "a+") as fo:
        fo.write('\n')
        fo.write("\n".join(data))

Next, decide what format to use when storing the title and img in the txt file:

Title$img
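
For reference, a minimal sketch of how such a line could be read back later, again with partition(); the line below is made up:

line = 'some title$http://example.com/a.jpg|http://example.com/b.jpg|'   # made-up stored line
title_part, sep, img_part = line.partition('$')
print title_part   # -> some title
print img_part     # -> http://example.com/a.jpg|http://example.com/b.jpg|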

Here I had confused a concept: the efficiency difference between '+' and join() only matters when connecting many strings; here I connect only once, so there is no need to worry about it.

def data_out(title, img):
    # write to the text file
    with open("/home/qq/foo.txt", "a+") as fo:
        fo.write('\n')
        size = 0
        for size in range(0, len(title)):
            fo.write(title[size] + '$' + img[size] + '\n')

The contents of the text are as follows:

With the text output done, it's time to solve the multiple-pictures problem:

def many_img(data, beg=0):
    # used to match the urls inside a multi-image entry
    try:
        many_img_str = ''
        while True:
            src1 = data.index('http', beg)
            src2 = data.index('/><br/>', src1)
            many_img_str = many_img_str + data[src1:src2] + '|'  # separate the urls with '|'
            beg = src2
    except ValueError:
        return many_img_str

def data_out(title, img):
    # write to the text file
    with open("/home/qq/foo.txt", "a+") as fo:
        fo.write('\n')
        for size in range(0, len(title)):
            if len(img[size]) > 70:  # longer than a single url, so the entry holds several images
                img[size] = many_img(img[size])  # call the many_img() method
            fo.write(title[size] + '$' + img[size] + '\n')

The output is as follows:

Vitality Maiden Ivy Chen by @TopFashionStyle$http://ww2.sinaimg.cn/mw690/005cfbldtw1etay848iktj30bz0bcq4x.jpg|http://ww1.sinaimg.cn/mw690/005cfbldtw1etay83kv5pj30c10bkjsr.jpg|http://ww3.sinaimg.cn/mw690/005cfbldtw1etay82qdvsj30c10bkq3z.jpg|http://ww1.sinaimg.cn/mw690/005cfbldtw1etay836z8lj30c00biq40.jpg|http://ww4.sinaimg.cn/mw690/005cfbldtw1etay8279qmj30ac0a0q3p.jpg|http://ww1.sinaimg.cn/mw690/005cfbldtw1etay81ug5kj30c50bnta6.jpg|http://ww2.sinaimg.cn/mw690/005cfbldtw1etay8161ncj30c20bgmyt.jpg|http://ww2.sinaimg.cn/mw690/005cfbldtw1etay804oy7j30bs0bgt9r.jpg|

The basic functionality works for now; the remaining problems will be fixed as changes come up... a novice takes it one step at a time!!!

So far, the first two small goals have been completed:

1. Crawl the content of one issue, including the titles and the URLs of the pictures

2. Place the data in a local TXT file

The full code is as follows:

# coding: utf-8
import urllib

####### crawler v0.1: urllib and the string built-in functions #######

def gethtml(url):
    # get the page content
    page = urllib.urlopen(url)
    html = page.read()
    return html

def content(html):
    # split the article content out by its tags
    str = '<article class="article-content">'
    content = html.partition(str)[2]
    str1 = '<div class="article-social">'
    content = content.partition(str1)[0]
    return content  # the content of the page

def title(content, beg=0):
    # match the titles
    # the idea is to use str.index() and slicing
    try:
        title_list = []
        while True:
            num1 = content.index('【', beg) + 3
            num2 = content.index('</p>', num1)
            title_list.append(content[num1:num2])
            beg = num2
    except ValueError:
        return title_list

def get_img(content, beg=0):
    # match the urls of the pictures
    # the idea is to use str.index() and slicing
    try:
        img_list = []
        while True:
            src1 = content.index('http', beg)
            src2 = content.index('/></p>', src1)
            img_list.append(content[src1:src2])
            beg = src2
    except ValueError:
        return img_list

def many_img(data, beg=0):
    # used to match the urls inside a multi-image entry
    try:
        many_img_str = ''
        while True:
            src1 = data.index('http', beg)
            src2 = data.index('/><br/>', src1)
            many_img_str = many_img_str + data[src1:src2] + '|'  # separate the urls with '|'
            beg = src2
    except ValueError:
        return many_img_str

def data_out(title, img):
    # write to the text file
    with open("/home/qq/foo.txt", "a+") as fo:
        fo.write('\n')
        for size in range(0, len(title)):
            if len(img[size]) > 70:
                img[size] = many_img(img[size])  # call the many_img() method
            fo.write(title[size] + '$' + img[size] + '\n')

content = content(gethtml("http://bohaishibei.com/post/10475/"))
title = title(content)
img = get_img(content)
data_out(title, img)
# crawls the titles and picture urls of a single issue and writes them to the text file

Next, re-analyze the site. I can already get the content of one issue; now I want to get the URLs of the other issues, so that I can crawl as many as I like.

Destination URL: http://bohaishibei.com/post/category/main/

Following the same method as above, enter developer mode, analyze the site structure, find the tags holding the target data, and grab them!

The data needed from the index page all sits inside the <div class="content"> tag, split out as follows:

def main_content(html):
    # split the index page content out by its tags
    str = '<div class="content">'
    content = html.partition(str)[2]
    str1 = '</div>'
    content = content.partition(str1)[0]
    return content  # the content of the index page

The data I need for now: the name of each issue and the URL of each issue.

After my analysis: the URL of each issue has the form "http://bohaishibei.com/post/10189/"; only the number changes.

Then I found that both pieces of data I want sit under the same tag:

def page_url(content, beg=0):
    try:
        url = []
        while True:
            url1 = content.index('

As for the issue titles: after thinking about it, they don't actually matter much. A user is not going to say "I want to see that particular issue"; they only need to enter how many issues to view, so the issue title has no practical meaning (unlike the titles inside an issue, which help you get the joke). So in this version I will simply let you enter how many issues you want to see, and return that many!

Then we need a strategy:

http://bohaishibei.com/post/category/main/ (20 issues in total)

http://bohaishibei.com/post/category/main/page/2/ (20 issues in total)

......

After checking, each page holds 20 issues.

When the number of issues you want to view is more than 20, increase the page value and fetch from the next page.

The last page is this one: http://bohaishibei.com/post/category/main/page/48/
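
So the arithmetic is: num / 20 full pages, plus num % 20 entries from one more page. A quick check with 55 issues (the same number used in the test run below):

num = 55
page = num / 20     # -> 2 full pages (integer division in Python 2)
order = num % 20    # -> 15 entries left over, taken from page 3
print page, order   # -> 2 15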

Now for the implementation; I need to think about how to write it. It's my first time writing a crawler, so don't mock me!

Time: 17:09

It feels like it's coming together; still writing:

def get_order(num):
    page = num / 20
    order = num % 20  # the entries beyond the last full page
    for i in range(1, page+1):  # the +1 is needed to reach the last full page
        url = 'http://bohaishibei.com/post/category/main/page/%d' % i
        print url
        if (i == page) & (order > 0):
            url = 'http://bohaishibei.com/post/category/main/page/%d' % (i+1)
            print url + ", %d" % order

get_order(55)

Run result:

http://bohaishibei.com/post/category/main/page/1
http://bohaishibei.com/post/category/main/page/2
http://bohaishibei.com/post/category/main/page/3, 15 entries

Here's what I'm thinking: I need to rewrite page_url and add one more parameter, as follows:

# add one more parameter, order, with a default of 20
def page_url(content, order=20, beg=0):
    try:
        url = []
        i = 0
        while i < order:
            url1 = content.index('

The next method takes the parameter num (how many issues are wanted); each page holds 20, and it returns the URL of every issue. The code is as follows:

def get_order(num):
    # num is how many entries to fetch
    url_list = []
    page = num / 20
    order = num % 20  # the entries beyond the last full page
    if num < 20:
        # fewer than 20 entries wanted (one page holds 20): crawl num entries from the first page directly
        url = 'http://bohaishibei.com/post/category/main'
        main_html = gethtml(url)
        clean_content = main_content(main_html)
        url_list = url_list + page_url(clean_content, num)
    for i in range(1, page+1):  # the +1 is needed to reach the last full page
        url = 'http://bohaishibei.com/post/category/main/page/%d' % i  # crawl a full page of entries
        main_html = gethtml(url)
        clean_content = main_content(main_html)
        url_list = url_list + page_url(clean_content)  # get the whole page
        if (i == page) & (order > 0):  # reached the last full page; if entries are left over, crawl another order entries
            url = 'http://bohaishibei.com/post/category/main/page/%d' % (i+1)
            main_html = gethtml(url)
            clean_content = main_content(main_html)
            url_list = url_list + page_url(clean_content, order)
            #print len(page_url(clean_content, order))
    return url_list

Now, go go go:

order = get_order(  )  # pass in how many issues you want
for i in range(0, len(order)):  # this way of walking the list is too ugly; changed to: for i in order
    html = gethtml(order[i])
    content_data = content(html)
    title_data = title(content_data)
    img_data = get_img(content_data)
    data_out(title_data, img_data)

Okay, all the code is finished.

The complete code has been uploaded to my GitHub: https://github.com/521xueweihan/PySpider/blob/master/Spider.py

While testing I found a bug, because sometimes an entry on the site has no img address at all. For example:

My code then breaks: the title and img lists end up with different lengths, and since the write loop runs over len(title), the index goes out of range.
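
In other words, a toy reproduction of the crash (not the real data):

title = ['a', 'b', 'c']
img = ['url1', 'url2']             # one entry had no picture, so img came out shorter
for size in range(0, len(title)):
    print title[size], img[size]   # IndexError: list index out of range when size == 2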

Noting it down here; now to get rid of the bug.

OK, bug eliminated. I changed the IMG matching method as follows:

def get_img(content, beg=0):
    # match the urls of the pictures
    # the idea is to use str.index() and slicing
    try:
        img_list = []
        while True:
            src1 = content.index('src=', beg)  # this way it also matches src="/" (entries with no picture)
            src2 = content.index('/></p>', src1)
            img_list.append(content[src1:src2])
            beg = src2
    except ValueError:
        return img_list

Main function:

order = get_order(  )  # get_order takes a parameter: how many issues of data to fetch
for i in order:  # iterate over the list of urls
    html = gethtml(i)
    content_data = content(html)
    title_data = title(content_data)
    img_data = get_img(content_data)
    data_out(title_data, img_data)

The crawled data:

Finally finished!
