Easily Build Your Own Vertical Search Engine

Source: Internet
Author: User
Keywords: search engine, data collection


Recently I needed data collection software for work. I started by downloading several collection tools, but each one either restricted its features or could not meet my needs (the sites I want to collect from have fairly complex structures; exactly which sites is confidential). Then I happened upon Topfisher, a data collection program I had not seen before. According to its website, it can accurately collect numeric and date data and has no feature restrictions. That is exactly what I wanted, so I downloaded it to try.

After downloading it I was a little dismayed: Topfisher analyzes a site's structure through script code that you write, unlike other tools, which are configured through a long series of dialog boxes. Still, I started by running some of the example scripts, and tried three of them. The first collects Baidu search results. It really does work, and although that is impressive, it is not very useful. The second collects and downloads data from a mobile phone wallpaper site. This one is really good: it downloads the pictures to a specified directory and also writes the data associated with each picture straight into an MDB file. The third collects data from a mobile phone number site, which I believe most webmasters (myself included) would find useful. It also works well: the results again go directly into an MDB file, and although the target site passes its parameters via POST, Topfisher handles that easily.
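To make the third example more concrete, here is a minimal Python sketch of the same idea: submit a form via POST and store the results in a database file. It is not Topfisher script code, which the article does not show. The URL, the `number` form field, and the `result` column are hypothetical, and SQLite stands in for the MDB file because Python's standard library can write it directly.

```python
import sqlite3
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the article does not name the target site.
LOOKUP_URL = "http://example.com/lookup"


def build_post_request(number):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode({"number": number}).encode("utf-8")
    return Request(
        LOOKUP_URL,
        data=body,  # supplying a body makes urllib use the POST method
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def save_rows(conn, rows):
    """Store (number, result) pairs and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS numbers (number TEXT PRIMARY KEY, result TEXT)"
    )
    # INSERT OR REPLACE keeps repeated collection runs from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO numbers VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM numbers").fetchone()[0]
```

To actually send the request, pass it to `urllib.request.urlopen` and parse the response body before calling `save_rows`.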

The software clearly has something unique about it, and fortunately I have some programming ability, so I decided to spend some time learning it. Looking back at the script code for the three examples I had tried, each is very short, only about 10 lines. A rough read showed that the code is much like any general-purpose programming language, so people with a programming background like me should find it fairly easy to learn. After almost a full day I finally had Topfisher figured out and had collected exactly the data I wanted, haha. Along the way I found Topfisher to be very powerful: it provides many string manipulation functions, so collected data can be filtered very cleanly, and the flexible script-based approach adapts to the vast majority of websites. Unless a site's pages follow no regular pattern at all, Topfisher code can parse them.
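The kind of string filtering described here can be sketched in Python as follows. The article does not list Topfisher's own string functions, so `between` and `clean` are hypothetical helpers that only illustrate the general technique: cut a fragment out of the page by its surrounding markers, then strip tags and entities from it.

```python
import html
import re


def between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    i = text.find(start)
    if i == -1:
        return ""
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else ""


def clean(fragment):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()
```

For example, `clean(between(page, '<td class="name">', '</td>'))` pulls one table cell out of a page and reduces it to plain text. This only works when the pages follow a regular pattern, which is the same condition the article names.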

Over the next few days I tried collecting from several other sites, and also tried scheduled collection and writing data directly into an MSSQL database. Both work very well; only configuring the MSSQL stored procedures is a little troublesome. To sum up, Topfisher has the following advantages and disadvantages:

Advantages:
1. Flexible script code lets the software handle the vast majority of websites.
2. Tag attributes in a web page are accessed directly in an array-like way, so data is located precisely.
3. Topfisher's script executor is very stable. I configured a scheduled collection task and put it on a server, where it has now been running for more than a week without problems, usually occupying only a few hundred KB of memory. It really feels like having a robot entering data in the background that I almost never have to look after, hehe.
4. It provides a function for adjusting the collection frequency, which helps you avoid having your IP blocked for accessing a site too often.
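Points 2 and 4 can be illustrated with a short Python sketch. This is not Topfisher's syntax, which the article does not show. `TagIndex`, `index_tags`, and `Throttle` are hypothetical names for two generic techniques: indexing tags by name and position so attributes can be read like array elements, and enforcing a minimum interval between requests.

```python
import time
from html.parser import HTMLParser


class TagIndex(HTMLParser):
    """Collect every start tag so attributes can be read by tag name and position."""

    def __init__(self):
        super().__init__()
        self.tags = {}  # tag name -> list of attribute dicts, in page order

    def handle_starttag(self, tag, attrs):
        self.tags.setdefault(tag, []).append(dict(attrs))


def index_tags(page):
    """Parse a page and return its tag index."""
    parser = TagIndex()
    parser.feed(page)
    parser.close()
    return parser.tags


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honor the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

With this, `index_tags(page)["a"][1]["href"]` reads the `href` of the second link on a page, and calling `Throttle(2.0).wait()` before each request keeps requests at least two seconds apart.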

Disadvantages:
1. Because it relies on script code, it is bound to be hard to learn for people who do not program. Fortunately there are plenty of text and video tutorials, so at least people who can program will pick it up fairly quickly.
2. Execution is single-threaded: even if you put two scripts in the task queue at the same time, they run one after the other.
3. File downloading does not support multi-threaded downloads or resuming interrupted downloads, so downloading larger files is inconvenient.

Compared with similar software, Topfisher is hard to get started with but highly efficient once learned. At my current level, as long as the target site is not too complex, I can write a complete collection script in about an hour. Topfisher is also very strong at collecting numeric and date data from sites, and with its stable script interpreter, building a vertical search engine site of your own is not a problem. I plan to pick an industry and build an industry search engine for fun, hey.
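Typed numeric and date values matter for a vertical search engine because they allow range filters and sorting, for example by price or by release date. The article does not show how Topfisher converts text to these types, so the following Python sketch is only an assumption about what such a step could look like. `parse_number`, `parse_date`, and the accepted date formats are hypothetical.

```python
import re
from datetime import datetime
from decimal import Decimal

# Date layouts this sketch accepts; a real collector would list the
# formats actually used by the target sites.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def parse_number(text):
    """Extract the first number from text, ignoring thousands separators."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def parse_date(text):
    """Return a date if text matches one of DATE_FORMATS, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Values that fail to parse come back as `None`, so a collector can skip or flag them instead of storing bad data.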
