Original link: https://mp.weixin.qq.com/s/tQ6uGBrxSLfJR4kk_GKB1Q
We wanted to buy some furniture for our home, and a friend told me that the furniture from Likou in Suzhou ("Li" is pronounced in the second tone) is quite well known. Since I work in Suzhou, I went over to have a look. It is simply ... you could walk until your legs give out and still not finish, never mind how tiring the shopping is.
I also browsed the official website of the furniture city to pick out the most suitable pieces within our budget. Being a programmer with a restless heart, I decided to crawl the site myself and list the results in an Excel table. That makes it easy for my parents to look through, and it gave me some crawler practice along the way.
The table can also serve as a reference when we go back to buy in person.
Two other hands-on crawler articles:
Python itchat: crawling WeChat friend information
Python crawler practice: crawling QQ Shuoshuo posts and generating a word cloud, full of memories
Excel table:
Word Frequency Statistics:
Crawler analysis
Open the official website, http://www.likoujiaju.com/, and you can see the product categories. The "sofa" category is used as the example here.
There are 8 pages of data in total. The URL of the first page is sell/list-66.html and the second page is sell/list-66-2.html, so sell/list-66-1.html also returns the first page. This makes it easy to iterate over the page URLs and fetch the data.
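The URL pattern described above can be sketched as a small helper. This is only an illustration: `page_urls`, `BASE_URL`, and the parameter names are hypothetical, and the category id 66 and the page count of 8 are the values observed for the sofa category.

```python
BASE_URL = "http://www.likoujiaju.com/sell/list-%d-%d.html"


def page_urls(category_id, page_count):
    """Build the list-page URLs for one category, pages 1..page_count."""
    return [BASE_URL % (category_id, page) for page in range(1, page_count + 1)]


# Sofa category: id 66, 8 pages (values observed on the site, not fixed constants)
urls = page_urls(66, 8)
print(urls[0])   # http://www.likoujiaju.com/sell/list-66-1.html
print(urls[-1])  # http://www.likoujiaju.com/sell/list-66-8.html
```

Keeping the category id as a parameter would let the same loop cover other categories, assuming they follow the same URL scheme.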
BeautifulSoup is used to parse the data. Press F12 to open the browser developer tools and find the tags that hold the title, price, and picture.
    import requests
    from bs4 import BeautifulSoup

    def get_data():
        # List that stores the crawled data
        furniture = []
        # Concatenated furniture names, used later to generate the word frequency
        title_all = ""
        # Fetch the data page by page
        for num in range(1, 9):
            url = "http://www.likoujiaju.com/sell/list-66-%d.html" % num
            response = requests.get(url)
            content = BeautifulSoup(response.content, "lxml")
            # Find the div block that holds the data
            sm_offer = content.find("div", class_="sm-offer")
            lis = sm_offer.ul.find_all("li")
            # Iterate over each item
            for li in lis:
                # Price
                price_span = li.find("span", class_="sm-offer-pricenum")
                price = price_span.get_text()
                # Name
                title_div = li.find("div", class_="sm-offer-title")
                title = title_div.a.get_text()
                title_all += title
                # Picture
                photo_div = li.find("div", class_="sm-offer-photo")
                photo = photo_div.a.img.get("src")
                # Details link
                href = photo_div.a.get("href")
                # Each entry in the list is a tuple
                furniture.append((price, title, photo, href))
        # Sort by price, highest first
        furniture.sort(key=take_price, reverse=True)
        # Generate the Excel file
        create_excel(furniture, title_all)
The crawled price is a string, and some items do not list a clear price, so the price has to be processed before sorting. This uses the list method sort(key=take_price), where key=take_price names the function whose return value is used to compare the elements.
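    # The argument is one element of the list, here a tuple
    def take_price(enum):
        # Take the first item of the tuple (the price) and convert it to a number for comparison
        price = enum[0]
        if "面议" in price:
            # "面议" means "price negotiable"; treat it as 0
            return 0
        start = price.index("¥")
        end = price.index("/")
        new_price = price[start + 1:end]
        return float(new_price)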
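As a point of comparison, the same extraction could be written with a regular expression instead of index(). This is a sketch rather than the article's method: `parse_price` and `PRICE_RE` are hypothetical names, and the sample strings are assumed price formats inferred from the "¥" and "/" markers that take_price looks for. One difference in behavior is that any string without a recognizable price yields 0.0 here, whereas index() raises ValueError when a marker is missing.

```python
import re

# Matches the number that follows the currency sign (half-width or full-width).
PRICE_RE = re.compile(r"[¥￥]\s*(\d+(?:\.\d+)?)")


def parse_price(text):
    """Return the numeric price found in `text`, or 0.0 when none is found."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else 0.0


print(parse_price("¥1200.00/件"))  # 1200.0
print(parse_price("面议"))          # 0.0
```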
Then sort the list; reverse=True sorts in descending order:
    furniture.sort(key=take_price, reverse=True)
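One detail worth knowing about this call: Python's sort is stable, and that still holds with reverse=True, so items that share the same key (for example, every "price negotiable" entry mapped to 0) keep their original relative order. A small sketch with made-up data, where the titles and prices are purely illustrative:

```python
# (price, title) pairs; a price of 0 stands for "price negotiable"
items = [(0, "sofa A"), (3500.0, "sofa B"), (0, "sofa C"), (1200.0, "sofa D")]

# Sort by the numeric price, highest first
items.sort(key=lambda item: item[0], reverse=True)

for price, title in items:
    print(price, title)
```

After sorting, "sofa A" still comes before "sofa C" at the end of the list, because the two share the key 0.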
Generate table
The xlsxwriter library is used here because it makes inserting images easy. Install it with pip install xlsxwriter.
The main methods used:
xlsxwriter.Workbook("")
Create an Excel table.
add_worksheet("")
Create a worksheet.
write(row, col, *args)
Writes data to a cell based on row and column coordinates.
set_row(row, height)
Sets the row height.
set_column(first_col, last_col, width)
Sets the column width; first_col specifies the first column of the range and last_col specifies the last column.
insert_image(row, col, image[, options])
Used to insert a picture into a specified cell
Two worksheets are created: one for the crawled data and one for the word frequency.
    import xlsxwriter
    from io import BytesIO
    from urllib.request import urlopen

    # Create the Excel file
    def create_excel(furniture, title_all):
        # Create the workbook
        file = xlsxwriter.Workbook("furniture.xlsx")
        # Create worksheet 1
        sheet1 = file.add_worksheet("Sheet1")
        # Define the header
        headers = ["Price", "Title", "Picture", "Details link"]
        # Write the header
        for i, header in enumerate(headers):
            # The first row is the header
            sheet1.write(0, i, header)
        # Set the column widths (the widths of columns 0 to 2 are assumed values)
        sheet1.set_column(0, 0, 20)
        sheet1.set_column(1, 1, 40)
        sheet1.set_column(2, 2, 30)
        sheet1.set_column(3, 3, 40)
        for row in range(len(furniture)):  # rows
            # Set the row height (assumed value)
            sheet1.set_row(row + 1, 100)
            for col in range(len(headers)):  # columns
                # Column 2 holds the picture: read the image from its URL and insert it
                if col == 2:
                    url = furniture[row][col]
                    image_data = BytesIO(urlopen(url).read())
                    sheet1.insert_image(row + 1, 2, url, {"image_data": image_data})
                else:
                    sheet1.write(row + 1, col, furniture[row][col])
        # Create worksheet 2 to hold the word frequency
        sheet2 = file.add_worksheet("Sheet2")
        # Generate the word frequency
        word_count(title_all, sheet2)
        # Close the workbook
        file.close()
The file furniture.xlsx is generated in the current directory.
Generate word frequency
The furniture names are segmented into words with jieba, a dictionary keeps the count of each word, and the result is written to Excel.
    import re
    import jieba

    # Generate the word frequency
    def word_count(title_all, sheet):
        word_dict = {}
        # Segment the text with jieba
        word = jieba.cut(title_all)
        word_str = ",".join(word)
        # Strip special characters
        new_word = re.sub("[ 【】-]", "", word_str)
        # Split the string into a list
        word_list = new_word.split(",")
        for item in word_list:
            if item not in word_dict:
                word_dict[item] = 1
            else:
                word_dict[item] += 1
        # Sort the dictionary items by count, highest first
        val = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
        # Write to Excel
        for row in range(len(val)):
            for col in range(0, 2):
                sheet.write(row, col, val[row][col])
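The counting and sorting steps could also be expressed with collections.Counter from the standard library. This is an alternative sketch, not the code used in the article; the token list is made up and merely stands in for what jieba.cut() might return for a few sofa titles.

```python
from collections import Counter

# Hypothetical tokens standing in for jieba.cut() output
# (leather, sofa, fabric, sofa, set, sofa, leather)
tokens = ["真皮", "沙发", "布艺", "沙发", "组合", "沙发", "真皮"]

counts = Counter(tokens)
# most_common() returns (word, count) pairs ordered by count, highest first
ranking = counts.most_common()
print(ranking[0])  # ('沙发', 3)
```

Each (word, count) pair in `ranking` has the same shape as the entries of `val`, so the loop that writes to the worksheet would not need to change.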
With the word frequency statistics in hand, you can also ask the seller about the corresponding keywords when buying in person.
The crawler knowledge used in this article is fairly basic, and the Excel file is produced with the xlsxwriter library. A table is convenient for parents to look through, and of course the crawled data can be put to many other uses.
For the full code, see GitHub: https://github.com/taixiang/furniture
Welcome to follow my blog: https://blog.manjiexiang.cn/
You are even more welcome to follow my WeChat official account: Spring Breeze ten miles Better know you
There is also a "Buddha system code Farming Circle" chat group; everyone is welcome to join and chat.
If the invitation has expired, add me (tx467220125) and I will pull you into the group.
Python makes the selection of furniture more convenient