Overall, the feature works. It is about 400 lines of Python code, but it is far from a real crawler; at best it is a customized information-scraping program. The back-end search uses the open-source coreseek, so the project as a whole was not much trouble for me.
The following describes the process:
- The BBS has many boards. Manually enter the RSS address of each board to be crawled in the seed file.
- Read each RSS address and parse out the links and content with BeautifulSoup, then insert them into the database. Pages that have already been crawled are skipped.
- Strip the HTML tags from the content and store the resulting plain text in a database field.
- Install coreseek and configure its conf file so that it indexes the crawled data. coreseek integrates the mmseg word segmentation software, so you don't have to handle word segmentation yourself.
- Write two web pages in Python that connect to the coreseek search program, query it for keywords, and extract the IDs of the hits; then retrieve the matching links from the database and display them on the search results page.
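The crawl-and-store steps above can be sketched roughly as follows. This is only an illustration, not the original code: it swaps BeautifulSoup for the standard library (`xml.etree.ElementTree` and `html.parser`) so that it runs without extra packages, uses SQLite in place of whatever database the project actually used, and assumes a plain RSS 2.0 feed. The table name `posts`, its columns, and every function name here are hypothetical. The last function shows how IDs returned by a search engine such as coreseek might be mapped back to stored links; the search call itself is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment with tags removed and whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


def parse_rss(xml_text):
    """Yield (link, title, description_html) for each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield (
            item.findtext("link", default="").strip(),
            item.findtext("title", default="").strip(),
            item.findtext("description", default=""),
        )


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) posts table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " link TEXT UNIQUE NOT NULL,"
        " title TEXT,"
        " content TEXT)"
    )
    return conn


def store_feed(conn, xml_text):
    """Insert items whose link is not stored yet; return how many were added."""
    added = 0
    for link, title, html in parse_rss(xml_text):
        if not link:
            continue
        seen = conn.execute("SELECT 1 FROM posts WHERE link = ?", (link,)).fetchone()
        if seen:
            continue  # same page as before, skip it
        conn.execute(
            "INSERT INTO posts (link, title, content) VALUES (?, ?, ?)",
            (link, title, strip_html(html)),
        )
        added += 1
    conn.commit()
    return added


def links_for_ids(conn, ids):
    """Map document IDs from the search engine to (title, link), keeping their order."""
    ids = list(ids)
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title, link FROM posts WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: (row[1], row[2]) for row in rows}
    return [by_id[i] for i in ids if i in by_id]
```

Downloading each feed listed in the seed file is omitted here; in practice the XML text passed to `store_feed` would come from an HTTP request for each RSS address.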
Looking back, there really isn't much to it, but there were a few bright spots:
- BeautifulSoup for parsing tags: it is quick to learn and quick to put to use.
- I wrote a few scripts, including shell scripts, so I got reacquainted with some things.
There was also a small episode along the way: I lost my card on Friday. At first I opened the lost-and-found posts one by one looking for my name, which was quite tedious. Then I searched for my name and my card on the search page and found some information.
So this project is coming to an end for now. Next week I will be busy with the company and with classes starting. This code is a demo and will be improved later.