Python makes the selection of furniture more convenient

Source: Internet
Author: User

Original link: https://mp.weixin.qq.com/s/tQ6uGBrxSLfJR4kk_GKB1Q

I wanted to buy some furniture for home, and a friend told me the Likou furniture market in Suzhou is quite famous. Since I work in Suzhou, I went over to take a look. In short, my legs were worn out long before I finished browsing, not to mention how tiring the shopping was.
I also browsed the furniture mall's official website, trying to pick the most suitable items within a certain budget. As a programmer with a restless heart, I decided to crawl the site myself and export the listings to an Excel table, which also makes it easy for my parents to browse, and doubles as crawler practice.
The table can also serve as a reference when I go back later to buy in person.
Here are two other practical crawler articles:

Python itchat: crawl friend information

Python crawler learning: crawl QQ Shuoshuo posts and generate a word cloud, full of memories

Excel table:

Word Frequency Statistics:

Crawler Analysis

Open the official website http://www.likoujiaju.com/ and you can see the categories; here I take "sofa" as an example.

There are 8 pages of data in total. The first page's URL is sell/list-66.html and the second page's is sell/list-66-2.html; sell/list-66-1.html also returns the first page, so it is easy to traverse the URLs by page number when fetching the data.

BeautifulSoup is used to parse the data. Press F12 in the browser to find the tags that hold the title, price, and picture.
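
To make the structure concrete, here is a small stand-alone sketch: the HTML fragment is a simplified stand-in I made up to mirror the class names used by the crawler below, not the site's actual markup.

from bs4 import BeautifulSoup

# Simplified stand-in for one listing; the class names match the ones the crawler looks for
html = """
<div class="sm-offer"><ul><li>
  <span class="sm-offer-pricenum">¥1280.00/件</span>
  <div class="sm-offer-title"><a href="/sell/detail-1.html">sample sofa</a></div>
  <div class="sm-offer-photo"><a href="/sell/detail-1.html"><img src="/img/1.jpg"/></a></div>
</li></ul></div>
"""
content = BeautifulSoup(html, "lxml")
li = content.find("div", class_="sm-offer").ul.find_all("li")[0]
print(li.find("span", class_="sm-offer-pricenum").get_text())    # price text
print(li.find("div", class_="sm-offer-title").a.get_text())      # title text
print(li.find("div", class_="sm-offer-photo").a.img.get("src"))  # image URL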

import requests
from bs4 import BeautifulSoup


def get_data():
    # Define a list to store the data
    furniture = []
    # Collects all furniture names, used later to generate the word frequency
    title_all = ""
    # Fetch the data page by page
    for num in range(1, 9):
        url = "http://www.likoujiaju.com/sell/list-66-%d.html" % num
        response = requests.get(url)
        content = BeautifulSoup(response.content, "lxml")
        # Find the div block where the data resides
        sm_offer = content.find("div", class_="sm-offer")
        lis = sm_offer.ul.find_all("li")
        # Traverse each item
        for li in lis:
            # Price
            price_span = li.find("span", class_="sm-offer-pricenum")
            price = price_span.get_text()
            # Name
            title_div = li.find("div", class_="sm-offer-title")
            title = title_div.a.get_text()
            title_all = title_all + title + " "
            # Picture
            photo_div = li.find("div", class_="sm-offer-photo")
            photo = photo_div.a.img.get("src")
            # Details link
            href = photo_div.a.get("href")
            # Each entry in the list is a tuple
            furniture.append((price, title, photo, href))
    # Sort by price
    furniture.sort(key=take_price, reverse=True)
    # Generate the Excel file
    create_excel(furniture, title_all)

The crawled price is a string, and some prices are not a plain number (some listings say "面议", i.e. negotiable), so the price has to be processed before sorting. The list's sort(key=take_price) method is used here; key=take_price specifies the function whose return value is used for the comparison.

# The argument is each element of the list, i.e. a tuple here
def take_price(enum):
    # Take the first item of the tuple (the price) and turn it into a number for comparison
    price = enum[0]
    if "面议" in price:  # "negotiable" prices are treated as 0
        return 0
    start = price.index("¥")
    end = price.index("/")
    new_price = price[start + 1:end]
    return float(new_price)
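
As a quick sanity check, take_price can be tried on hand-written price strings; the "¥.../件" format here is assumed from the index("¥") and index("/") calls above rather than copied from the live site.

# Hypothetical tuples in the same (price, title, photo, href) shape used by get_data
print(take_price(("¥1280.00/件", "sofa", "", "")))  # 1280.0
print(take_price(("面议", "sofa", "", "")))          # 0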

Then sort the list; reverse=True sorts in descending order.

furniture.sort(key=take_price, reverse=True)
Generate table

The xlsxwriter library is used here because it makes inserting images easy. Install it with pip install xlsxwriter.
The main methods used:
xlsxwriter.Workbook("")Create an Excel table.
add_worksheet("")Create a worksheet.
write(row, col, *args)Writes data to a cell based on row and column coordinates.
set_row(row, height)Sets the row height.
set_column(first_col, last_col, width)Set the column width, first_col Specify the Start column position, and last_col Specify the end column position.
insert_image(row, col, image[, options])Used to insert a picture into a specified cell
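
A minimal sketch of how these calls fit together; the file name demo.xlsx and the image path demo.png are placeholders for illustration, not part of the original project.

import xlsxwriter

workbook = xlsxwriter.Workbook("demo.xlsx")  # create the Excel file
sheet = workbook.add_worksheet("sheet1")     # create a worksheet
sheet.write(0, 0, "Price")                   # write into row 0, column 0
sheet.set_column(0, 0, 20)                   # make the first column 20 characters wide
sheet.set_row(1, 100)                        # make row index 1 (the second row) tall enough for an image
sheet.insert_image(1, 0, "demo.png")         # insert a local image (assumed to exist) into that row
workbook.close()                             # save and close the file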

Create two worksheets: one for the crawled data and one for the word frequency.

import xlsxwriter
from io import BytesIO
from urllib.request import urlopen


# Create the Excel file
def create_excel(furniture, title_all):
    # Create the Excel workbook
    file = xlsxwriter.Workbook("furniture.xlsx")
    # Create worksheet 1
    sheet1 = file.add_worksheet("sheet1")
    # Define the headers
    headers = ["Price", "Title", "Picture", "Details link"]
    # Write the headers into the first row
    for i, header in enumerate(headers):
        sheet1.write(0, i, header)
    # Set the column widths
    sheet1.set_column(0, 0, 20)
    sheet1.set_column(1, 1, 50)
    sheet1.set_column(2, 2, 34)
    sheet1.set_column(3, 3, 40)
    for row in range(len(furniture)):  # rows
        # Set the row height so the picture fits
        sheet1.set_row(row + 1, 100)
        for col in range(len(headers)):  # columns
            # col == 2 is the picture column: read the image from its URL so it shows in the cell
            if col == 2:
                url = furniture[row][col]
                image_data = BytesIO(urlopen(url).read())
                sheet1.insert_image(row + 1, 2, url, {"image_data": image_data})
            else:
                sheet1.write(row + 1, col, furniture[row][col])
    # Create worksheet 2 to hold the word frequency
    sheet2 = file.add_worksheet("sheet2")
    # Generate the word frequency
    word_count(title_all, sheet2)
    # Close the workbook
    file.close()

The furniture.xlsx file is generated in the current directory.
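
Assuming the functions above are kept in a single script together with their imports, a minimal entry point is enough to produce the file:

if __name__ == "__main__":
    get_data()  # crawls the listings, sorts them by price, and writes furniture.xlsx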

Generate word frequency

Use jieba to segment the furniture names into words, count the occurrences of each word with a dictionary, and write the results to Excel.

import re
import jieba


# Generate the word frequency
def word_count(title_all, sheet):
    word_dict = {}
    # Segment the text with jieba
    word = jieba.cut(title_all)
    word_str = ",".join(word)
    # Strip out special characters
    new_word = re.sub("[ 【】-]", "", word_str)
    # Split the string into a list of words
    word_list = new_word.split(",")
    for item in word_list:
        if item not in word_dict:
            word_dict[item] = 1
        else:
            word_dict[item] += 1
    # Sort the dictionary entries by count
    val = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
    # Write to Excel
    for row in range(len(val)):
        for col in range(0, 2):
            sheet.write(row, col, val[row][col])
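
To get a feel for the segmentation and cleanup steps, the same calls can be run on a made-up title string (the example title below is hypothetical, not taken from the crawled data):

import re
import jieba

sample = "【特价】北欧布艺沙发 三人位-组合"
words = ",".join(jieba.cut(sample))     # segment the title into comma-separated words
cleaned = re.sub("[ 【】-]", "", words)  # strip spaces, 【】 brackets and hyphens
print(cleaned.split(","))               # the word list that would be counted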

With the word-frequency statistics, when buying in person you can also ask the seller about the items behind the most frequent words.

The crawler knowledge used in this article is fairly basic, and the Excel file is produced with the xlsxwriter library; turning the data into a table also makes it easy for my parents to browse. Of course, crawled data can be put to use in many other places as well.

For the full code, see the GitHub repository: https://github.com/taixiang/furniture

You are welcome to follow my blog: https://blog.manjiexiang.cn/
You are also welcome to follow my WeChat official account: "Spring Breeze for Ten Miles Is Not as Good as Knowing You".

There is a "Buddha system code Farming Circle", welcome everyone to join the chat, Happy good!

If the group QR code has expired, you can add me on WeChat (tx467220125) and I will pull you into the group.
