Interim summary of the beiyou BBS crawler

Source: Internet
Author: User

Overall, the functionality is implemented. The Python code comes to about 400 lines, but it is far from a real crawler; it can only be regarded as a customized information-capture program. The back-end search uses the open-source coreseek, so the search side required little work from me.

The following describes the process:

    1. The BBS has many boards. Manually enter the RSS address of each board to be crawled in the seed file.
    2. Read each RSS address and parse out the links and content. BeautifulSoup is used here, and the results are inserted into the database. A page that has already been crawled is not crawled again.
    3. Strip the HTML tags from the content and store the plain text in a database field.
    4. Install coreseek and edit its conf file so that it indexes the crawled data. coreseek integrates the mmseg word segmentation software, so you don't have to handle word segmentation yourself.
    5. Write two web pages in Python that connect to the coreseek search program: they submit the query keywords, take the IDs of the matching documents from the response, then look up the corresponding links in the database and display them as search results.
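
As a rough illustration of steps 2, 3, and the database lookup in step 5, here is a minimal sketch. It is not the original 400-line program: it swaps in the standard library (`xml.etree.ElementTree`, `html.parser`, `sqlite3`) for BeautifulSoup and the real database so that it runs on its own. The table layout, the function names, and the `bbs.example` URLs are all assumptions made for the example, and downloading the feed and calling coreseek are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical feed standing in for one board's RSS output.
SAMPLE_FEED = (
    "<rss version='2.0'><channel><title>Lost and Found</title>"
    "<item><title>Card found</title><link>http://bbs.example/post/1</link>"
    "<description>&lt;p&gt;Found a &lt;b&gt;campus card&lt;/b&gt; "
    "near the library&lt;/p&gt;</description></item>"
    "<item><title>Umbrella</title><link>http://bbs.example/post/2</link>"
    "<description>&lt;p&gt;Black umbrella left in room 301&lt;/p&gt;"
    "</description></item>"
    "</channel></rss>"
)


class _TextExtractor(HTMLParser):
    """Collects text nodes only, so every HTML tag is dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Step 3: return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


def parse_rss(xml_text):
    """Step 2: yield (link, title, description) for each RSS <item>."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link:
            title = (item.findtext("title") or "").strip()
            yield link, title, item.findtext("description") or ""


def open_db(path=":memory:"):
    """Create the (assumed) posts table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, link TEXT UNIQUE, title TEXT, content TEXT)"
    )
    return conn


def store_feed(conn, xml_text):
    """Insert posts not seen before; return how many rows were added."""
    added = 0
    for link, title, description in parse_rss(xml_text):
        seen = conn.execute(
            "SELECT 1 FROM posts WHERE link = ?", (link,)
        ).fetchone()
        if seen:
            continue  # already crawled, skip it
        conn.execute(
            "INSERT INTO posts (link, title, content) VALUES (?, ?, ?)",
            (link, title, strip_tags(description)),
        )
        added += 1
    conn.commit()
    return added


def links_for_ids(conn, ids):
    """Step 5: map document IDs from the search engine back to links."""
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, link FROM posts WHERE id IN ({marks})", list(ids)
    ).fetchall()
    by_id = dict(rows)
    # Keep the order the search engine returned; drop unknown IDs.
    return [by_id[i] for i in ids if i in by_id]


if __name__ == "__main__":
    db = open_db()
    print("new posts:", store_feed(db, SAMPLE_FEED))
    print("new posts on second run:", store_feed(db, SAMPLE_FEED))
    print("links for hits:", links_for_ids(db, [2, 1]))
```

Keying the duplicate check on the post link keeps a second run over the same feed from inserting anything, which is one simple way to avoid crawling the same page twice.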

Looking back, there was not really much to it, but there were a few bright spots.

    • BeautifulSoup parses tags, is quick to learn, and I got productive with it quickly.
    • I wrote a number of scripts, including shell scripts, which refreshed my familiarity with some things.

There was an episode along the way: I lost my card on Friday. At first I opened the pages of the lost-and-found board one by one looking for my name, which was quite troublesome. Then I searched for my name and my card on the search page and found some information.

So this is wrapping up for now. Next week I will be busy with company work and the start of classes. This code is a demo and will be improved later.

