I Knew Crawlers Could Do This -- The Train Collector


Preface

Ahem... well, this broke nobody has also started the internship stage. Beijing is actually pretty good: the pressure is high, but the job opportunities are relatively plentiful. Anyway, back to today's topic. Before learning Python crawlers, I always thought that crawler work meant writing your own code and crawling page by page, but that is not (necessarily) the case. The day before yesterday the company handed me a web-crawling tool, and I have been tinkering with it ever since. Having finally gotten results this afternoon, I learned how to do a simple crawl of web data. So here I am summarizing the simple use of a website data collector: the Train Collector.

Body

First of all, download the Train Collector; plenty of download links can be found online.

This is the Train Collector folder after installation completes.

Usage steps

1. After logging in to your account (it seems that registering an account costs money), first create a new group. Take care to choose the right group settings before clicking OK.

2. Right-click the group you want to work in and select New Task.

3. Edit the task, taking the IT industry news section of the HC network as an example. Because the listing is spread across linked pages, we need to select the "Bulk/Multi-page" option and replace the changing number in the URL with (*). You can also set the numeric sequence of the link URLs to suit your own needs; a small sketch of this expansion follows below. Then click "Add" and click "Finish".
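The (*) replacement is essentially URL pattern expansion. Here is a minimal Python sketch of the same idea; the example URL and page range are my own illustration, not the HC network's real addresses:

```python
# Hypothetical sketch: expand a URL pattern containing (*) into a
# numeric sequence of page URLs, like the "Bulk/Multi-page" rule.

def expand_urls(pattern, start, stop, step=1):
    """Replace (*) in the pattern with each number in the range."""
    return [pattern.replace("(*)", str(n)) for n in range(start, stop + 1, step)]

# Example: pages 1 through 5 of a made-up news listing.
for url in expand_urls("http://example.com/it-news/list-(*).html", 1, 5):
    print(url)
```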

4. Set things up in the "multi-level URL acquisition" section. I chose to fill in the link-address rules manually, which requires analyzing the web page's source code and cutting out the right pieces. Note that the two blank boxes under "Extract URLs from this selection" are filled with the source code that appears immediately before and after the links we need on the site's front page; in other words, the links we want sit between the code placed in the two boxes. Finally, click Save. A rough code equivalent of this rule is sketched below.
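For readers who prefer code to screenshots, here is a rough Python equivalent of that before/after rule. The marker strings and URL are placeholders; the real values depend on the target page's source code:

```python
# Sketch: keep only the slice of page source between a "before" marker
# and an "after" marker, then pull every link out of that slice.

import re
import urllib.request

PAGE_URL = "http://example.com/it-news/list-1.html"  # placeholder URL
BEFORE = '<div class="news-list">'  # code just before the links we want
AFTER = "</div>"                    # code just after the links we want

html = urllib.request.urlopen(PAGE_URL).read().decode("utf-8", errors="replace")

start = html.find(BEFORE)
end = html.find(AFTER, start)
section = html[start + len(BEFORE):end] if start != -1 and end != -1 else ""

# Every href inside the marked-off section is a candidate URL.
links = re.findall(r'href="([^"]+)"', section)
print(links)
```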

5. Set the content collection rules. Each label name is a piece of information we want to crawl from the page; double-click a label name to add the surrounding code, on the same principle as step 4. While extracting the content we can also apply data processing: click Add and choose a processing step. A minimal sketch of both ideas follows below.
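The same before/after idea applies to each label, and a data-processing step can be chained on afterwards. A minimal sketch, with made-up markers and a toy HTML string:

```python
# Sketch: a "label" is defined by the code before and after it, and the
# extracted value can then go through data processing (here: tag stripping).

import re

def extract(html, before, after):
    """Return the text between the before/after markers, or '' if absent."""
    m = re.search(re.escape(before) + r"(.*?)" + re.escape(after), html, re.S)
    return m.group(1) if m else ""

def strip_tags(text):
    """Example data-processing step: drop HTML tags and trim whitespace."""
    return re.sub(r"<[^>]+>", "", text).strip()

html = '<h1 class="title">Example headline</h1><div id="content"><p>Body text.</p></div>'
title = strip_tags(extract(html, '<h1 class="title">', "</h1>"))
content = strip_tags(extract(html, '<div id="content">', "</div>"))
print(title, "|", content)
```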

6. Save the crawled content to the local computer. Note that the Train Collector comes with a default template; if the label names of our collected content do not match the default template, we need to modify the template so that it is consistent with our label names. Click Save. The sketch below shows the template idea in code.
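Conceptually, the template is just a format string whose fields must line up one-to-one with the label names. A small illustrative sketch (the label names and file name are my own examples):

```python
# Sketch: write collected records to a local file using a template whose
# fields ({title}, {content}) must match the label names exactly.

records = [
    {"title": "Example headline", "content": "Body text."},
]

TEMPLATE = "[title]{title}\n[content]{content}\n\n"

with open("output.txt", "w", encoding="utf-8") as f:
    for record in records:
        f.write(TEMPLATE.format(**record))
```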

7. Start crawling the site's data. First, tick these three options.

Then right-click to start the task and wait for the data to be collected.

8. The crawl completed and reported success, but when I opened the local file I saw no data, and the label names were garbled. I had no idea what was going on. Was I holding it wrong? I tried several sites, ran it several times, and carefully re-read the source code again and again, but could not find the mistake anywhere, which was maddening. Later I finally figured it out: the default format of the .txt file is not UTF-8, so we need to change it; a quick Save As fixes it. Then run the tool again and look at the file. Sure enough, the data is there: the site's data was crawled successfully, and the links were crawled out too.
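The lesson generalizes beyond this tool: garbled output usually signals an encoding mismatch. In Python, for instance, being explicit about UTF-8 on both write and read avoids the garbling entirely (the file name is just an example):

```python
# Sketch: write and read the file with an explicit UTF-8 encoding so
# mixed Chinese/English text round-trips without garbling.

text = "标题: Example headline"

with open("data.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())  # reads back cleanly instead of as garbled bytes
```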

Summary

This was just a simple start; the Train Collector still has many operations I need to learn, such as saving data into a database and crawling images.

Come on, keep trying!!!
