I Knew Crawlers Could Do This -- The Train Collector


Preface

Ahem... well, this broke nobody has also started the internship stage. Beijing is actually pretty good: the pressure is high, but the job opportunities are relatively plentiful. Anyway, back to today's topic. Before learning Python crawlers, I always thought that crawler work meant writing your own code and crawling page by page, but that is not (necessarily) the case. The day before yesterday the company handed me a web-crawling tool, and I have been tinkering with it ever since. Having finally gotten results this afternoon, I learned how to do a simple crawl of web data. So here I am summarizing the simple use of a website data collector: the Train Collector.

Body

First of all, download the Train Collector; plenty of download links can be found online.

This is the Train Collector folder after installation completes.

Usage steps

1. After logging in to your account (it seems that registering an account costs money), first create a new group. Take care to choose the right group settings before clicking OK.

2. Right-click the group you want to work in and select New Task.

3. Edit the task, taking the IT industry news section of the HC network as an example. Because the listing is spread across linked pages, we need to select the "Bulk/Multi-page" option and replace the changing number in the URL with (*). You can also set the numeric sequence of the link URLs to suit your own needs; a small sketch of this expansion follows below. Then click "Add" and click "Finish".
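The (*) replacement is essentially URL pattern expansion. Here is a minimal Python sketch of the same idea; the example URL and page range are my own illustration, not the HC network's real addresses:

```python
# Hypothetical sketch: expand a URL pattern containing (*) into a
# numeric sequence of page URLs, like the "Bulk/Multi-page" rule.

def expand_urls(pattern, start, stop, step=1):
    """Replace (*) in the pattern with each number in the range."""
    return [pattern.replace("(*)", str(n)) for n in range(start, stop + 1, step)]

# Example: pages 1 through 5 of a made-up news listing.
for url in expand_urls("http://example.com/it-news/list-(*).html", 1, 5):
    print(url)
```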

4. Set things up in the "multi-level URL acquisition" section. I chose to fill in the link-address rules manually, which requires analyzing the web page's source code and cutting out the right pieces. Note that the two blank boxes under "Extract URLs from this selection" are filled with the source code that appears immediately before and after the links we need on the site's front page; in other words, the links we want sit between the code placed in the two boxes. Finally, click Save. A rough code equivalent of this rule is sketched below.
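For readers who prefer code to screenshots, here is a rough Python equivalent of that before/after rule. The marker strings and URL are placeholders; the real values depend on the target page's source code:

```python
# Sketch: keep only the slice of page source between a "before" marker
# and an "after" marker, then pull every link out of that slice.

import re
import urllib.request

PAGE_URL = "http://example.com/it-news/list-1.html"  # placeholder URL
BEFORE = '<div class="news-list">'  # code just before the links we want
AFTER = "</div>"                    # code just after the links we want

html = urllib.request.urlopen(PAGE_URL).read().decode("utf-8", errors="replace")

start = html.find(BEFORE)
end = html.find(AFTER, start)
section = html[start + len(BEFORE):end] if start != -1 and end != -1 else ""

# Every href inside the marked-off section is a candidate URL.
links = re.findall(r'href="([^"]+)"', section)
print(links)
```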

5. Set the content collection rules. Each label name is a piece of information we want to crawl from the page; double-click a label name to add the surrounding code, on the same principle as step 4. While extracting the content we can also apply data processing: click Add and choose a processing step. A minimal sketch of both ideas follows below.
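The same before/after idea applies to each label, and a data-processing step can be chained on afterwards. A minimal sketch, with made-up markers and a toy HTML string:

```python
# Sketch: a "label" is defined by the code before and after it, and the
# extracted value can then go through data processing (here: tag stripping).

import re

def extract(html, before, after):
    """Return the text between the before/after markers, or '' if absent."""
    m = re.search(re.escape(before) + r"(.*?)" + re.escape(after), html, re.S)
    return m.group(1) if m else ""

def strip_tags(text):
    """Example data-processing step: drop HTML tags and trim whitespace."""
    return re.sub(r"<[^>]+>", "", text).strip()

html = '<h1 class="title">Example headline</h1><div id="content"><p>Body text.</p></div>'
title = strip_tags(extract(html, '<h1 class="title">', "</h1>"))
content = strip_tags(extract(html, '<div id="content">', "</div>"))
print(title, "|", content)
```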

6. Save the crawled content to the local computer. Note that the Train Collector comes with a default template; if the label names of our collected content do not match the default template, we need to modify the template so that it is consistent with our label names. Click Save. The sketch below shows the template idea in code.
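Conceptually, the template is just a format string whose fields must line up one-to-one with the label names. A small illustrative sketch (the label names and file name are my own examples):

```python
# Sketch: write collected records to a local file using a template whose
# fields ({title}, {content}) must match the label names exactly.

records = [
    {"title": "Example headline", "content": "Body text."},
]

TEMPLATE = "[title]{title}\n[content]{content}\n\n"

with open("output.txt", "w", encoding="utf-8") as f:
    for record in records:
        f.write(TEMPLATE.format(**record))
```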

7. Start crawling the site's data. First, tick these three options.

Then right-click to start the task and wait for the data to be collected.

8. The crawl completed and reported success, but when I opened the local file I saw no data, and the label names were garbled. I had no idea what was going on. Was I holding it wrong? I tried several sites, ran it several times, and carefully re-read the source code again and again, but could not find the mistake anywhere, which was maddening. Later I finally figured it out: the default format of the .txt file is not UTF-8, so we need to change it; a quick Save As fixes it. Then run the tool again and look at the file. Sure enough, the data is there: the site's data was crawled successfully, and the links were crawled out too.
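The lesson generalizes beyond this tool: garbled output usually signals an encoding mismatch. In Python, for instance, being explicit about UTF-8 on both write and read avoids the garbling entirely (the file name is just an example):

```python
# Sketch: write and read the file with an explicit UTF-8 encoding so
# mixed Chinese/English text round-trips without garbling.

text = "标题: Example headline"

with open("data.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())  # reads back cleanly instead of as garbled bytes
```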

Summary

This was just a simple start; the Train Collector still has many operations I need to learn, such as saving data into a database and crawling images.

Come on, keep trying!!!
