Web Crawler: Crawling Book Information from allitebooks.com and Capturing Prices from amazon.com (2): Crawling Book Information and ISBN Codes from allitebooks.com


This article crawls the book information from the book lists on allitebooks.com, along with the ISBN code for each book.
1. Analyze the requirements and website structure
The structure of allitebooks.com is very simple: paginated list pages, each holding a book list, with a detail page per book. To capture each book's details and ISBN code, we traverse all the list pages, and from each book list we follow the link to each book's detail page, where the details and ISBN code can be extracted.
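All of the snippets below share the same small set of imports. A minimal setup, assuming Python 3 with the beautifulsoup4 and lxml packages installed, looks like this:

# Shared imports for all snippets below (assumes Python 3,
# with beautifulsoup4 and lxml installed, e.g. via pip)
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup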
2. Traverse the book list on each page
# Get the next page URL from the current page URL
def get_next_page_url(url):
    page = urlopen(url)
    soup_page = BeautifulSoup(page, 'lxml')
    page.close()
    # Get the pagination tags for the current page and the next page
    current_page_tag = soup_page.find(class_="current")
    next_page_tag = current_page_tag.find_next_sibling()
    # Check if the current page is the last one
    if next_page_tag is None:
        next_page_url = None
    else:
        next_page_url = next_page_tag['href']
    return next_page_url
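As a quick sanity check, calling the function on the first list page of a category (the same starting URL used in the integration step below) should return the URL of page 2; the printed value is illustrative:

next_url = get_next_page_url("http://www.allitebooks.com/programming/net/page/1/")
print(next_url)  # e.g. http://www.allitebooks.com/programming/net/page/2/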


3. Find the detail page links in the book list
In the book list, both the title and the cover image of a book link to its detail page, so either one works; here we use the title.

# Get the book detail URLs from a list page URL
def get_book_detail_urls(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    page.close()
    urls = []
    book_header_tags = soup.find_all(class_="entry-title")
    for book_header_tag in book_header_tags:
        urls.append(book_header_tag.a['href'])
    return urls
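To verify the extraction, print the detail URLs found on a single list page (one URL per book; the output is illustrative):

detail_urls = get_book_detail_urls("http://www.allitebooks.com/programming/net/page/1/")
for detail_url in detail_urls:
    print(detail_url)  # one detail page URL per book in the list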


4. Capture the title and ISBN code from the book detail page
# Get the book detail info from a book detail URL
def get_book_detail_info(url):
    page = urlopen(url)
    book_detail_soup = BeautifulSoup(page, 'lxml')
    page.close()
    title_tag = book_detail_soup.find(class_="single-title")
    title = title_tag.string
    isbn_key_tag = book_detail_soup.find(text="Isbn:").parent
    isbn_tag = isbn_key_tag.find_next_sibling()
    isbn = isbn_tag.string.strip()  # Remove the whitespace with the strip method
    return {'title': title, 'isbn': isbn}
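The function can be tested on its own with any detail URL collected in the previous step; the URL and output here are placeholders, not a real book page:

# Hypothetical detail page URL, for illustration only
info = get_book_detail_info("http://www.allitebooks.com/some-book/")
print(info)  # e.g. {'title': 'Some Book', 'isbn': '9781234567890'}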


5. Integrate the three parts of the code
def run():
    url = "http://www.allitebooks.com/programming/net/page/1/"
    book_info_list = []

    def scapping(page_url):
        book_detail_urls = get_book_detail_urls(page_url)
        for book_detail_url in book_detail_urls:
            book_info = get_book_detail_info(book_detail_url)
            print(book_info)
            book_info_list.append(book_info)
        next_page_url = get_next_page_url(page_url)
        if next_page_url is not None:
            scapping(next_page_url)

    scapping(url)
    return book_info_list  # Return the collected records so they can be saved in step 6
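One design note: scapping() calls itself once per page, so a category with many hundreds of pages could in principle run into Python's default recursion limit (about 1000 frames). A loop-based sketch with the same behavior, using a hypothetical run_iterative() name, avoids that:

def run_iterative():
    # Same behavior as run(), but iterative instead of recursive
    page_url = "http://www.allitebooks.com/programming/net/page/1/"
    book_info_list = []
    while page_url is not None:
        for book_detail_url in get_book_detail_urls(page_url):
            book_info = get_book_detail_info(book_detail_url)
            print(book_info)
            book_info_list.append(book_info)
        page_url = get_next_page_url(page_url)
    return book_info_list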


Running result: one {'title': ..., 'isbn': ...} record is printed per book as the crawl progresses.

6. Write the results to a file for further processing.

def save_to_csv(book_list):
    with open('books.csv', 'w', newline='') as fp:
        writer = csv.writer(fp, delimiter=',')
        writer.writerow(['title', 'isbn'])
        # Each record is a dict, so write its values in a fixed column order
        writer.writerows([[book['title'], book['isbn']] for book in book_list])
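Putting everything together, a minimal driver, assuming run() returns book_info_list as in the version above, crawls all pages and writes the CSV in one go:

if __name__ == '__main__':
    books = run()        # Crawl every page and collect the book records
    save_to_csv(books)   # Write them to books.csv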


To be continued...

The complete code is on GitHub: https://github.com/backslash112/book_scraper_python

For Beautiful Soup basics, see the first article in this series: Web Crawler: Crawling Book Information from allitebooks.com and Capturing Prices from amazon.com (1): Beautiful Soup Basics. We are in the big data era; if you are interested in data processing, also see another series of essays: Using Python for Data Analysis, basic series summary.

The next article will use the ISBN codes obtained here to look up the corresponding price for each book on amazon.com, process the resulting data with basic data-analysis techniques, and finally output it to a CSV file. If you are interested, please follow this blog and leave a comment for discussion.
