The magical uses of the data warehouse

Source: Internet
Author: User

Whether to build a data warehouse feature into SS caused some debate within the team. The argument centered on why we should provide the feature and whether anyone really needs it, but in the end it shipped. I expect new users will have the same question, so I would like to introduce a few uses of the data warehouse and, along the way, settle that doubt.

Use 1: Temporarily save intermediate data

Take collecting NetEase international news as an example. Open http://news.163.com/world/ and you will see a list page: each page holds dozens of news items, and there are many pages. Clicking a news item opens its body text, and that text is what we ultimately want to collect. Doing all of this in a single script is somewhat complicated; even if you are technically superb, you have to think it through before you start. I recommend a simpler approach that needs almost no planning, so you can start writing right away!

1. Create a dataset named news.163.com.list in the data warehouse.

2. Script A collects http://news.163.com/world/ and appends each result (title, URL) to news.163.com.list:

DataManager.AppendData("news.163.com.list", DataEntry.Create().Set("Title", ...).Set("Url", ...));

3. Write Script B to read each link (title, URL) from news.163.com.list, then open it to collect the text:

var de = DataManager.ReadData("news.163.com.list");
var title = de.Get("Title");
var url = de.Get("Url");

4. Run
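
The four steps above can be modelled outside SS as a two-stage pipeline with a dataset in the middle. The sketch below is in Python only because the SS script objects are not available outside the tool; the `Dataset` class, the `script_a` and `script_b` functions, and the sample URLs are all hypothetical stand-ins, not the SS API.

```python
from typing import Optional


class Dataset:
    """In-memory stand-in for one data warehouse dataset (hypothetical)."""

    def __init__(self) -> None:
        self._rows: list[dict] = []
        self._cursor = 0  # index of the next unread row

    def append_data(self, row: dict) -> None:
        self._rows.append(row)

    def read_data(self) -> Optional[dict]:
        """Return the next unread row, or None when nothing is left."""
        if self._cursor >= len(self._rows):
            return None
        row = self._rows[self._cursor]
        self._cursor += 1
        return row


def script_a(list_ds: Dataset, list_page: list[tuple[str, str]]) -> None:
    """Stage 1: store every (title, url) pair found on the list page."""
    for title, url in list_page:
        list_ds.append_data({"Title": title, "Url": url})


def script_b(list_ds: Dataset, content_ds: Dataset, fetch_text) -> None:
    """Stage 2: read the stored links one by one and collect each article."""
    while (entry := list_ds.read_data()) is not None:
        content_ds.append_data({**entry, "Content": fetch_text(entry["Url"])})


# Made-up sample data standing in for the real list page and article pages.
list_page = [("News 1", "http://example.com/1"), ("News 2", "http://example.com/2")]
news_list, news_content = Dataset(), Dataset()
script_a(news_list, list_page)
script_b(news_list, news_content, fetch_text=lambda url: f"text of {url}")
```

Because the intermediate dataset is the only thing the two stages share, each stage can be written, tested, and rerun on its own.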

Isn't that convenient? Working this way, you will probably have your scripts written before anyone else has finished thinking through their approach :)

Use 2: Parallel operation to improve collection performance

SS comes with the collection sprite, an interesting little program. It can run SS scripts independently and write the results back to the data warehouse. If we run several collection sprites at the same time, we get parallel crawling. Don't worry that they will crawl the same item twice: DataManager.ReadData works from a cursor that only moves forward, advancing by one entry on every read, so each read returns a different entry.
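
The forward-only cursor described above can be sketched as follows. This is a Python model under stated assumptions, not the SS implementation: `CursorDataset` and `worker` are hypothetical names, threads stand in for separate collection sprites, and the lock is my assumption about how a shared read-and-advance step would be kept atomic.

```python
import threading
from typing import Optional


class CursorDataset:
    """Hypothetical dataset read through one shared, forward-only cursor."""

    def __init__(self, rows: list[str]) -> None:
        self._rows = list(rows)
        self._cursor = 0
        self._lock = threading.Lock()  # makes read-and-advance a single step

    def read_data(self) -> Optional[str]:
        """Return the next unread row and advance; None when exhausted."""
        with self._lock:
            if self._cursor >= len(self._rows):
                return None
            row = self._rows[self._cursor]
            self._cursor += 1
            return row


def worker(dataset: CursorDataset, seen: list[str]) -> None:
    """Keep reading until the dataset is exhausted, recording each row."""
    while (row := dataset.read_data()) is not None:
        seen.append(row)


urls = [f"http://example.com/{i}" for i in range(100)]  # made-up URLs
dataset = CursorDataset(urls)
results = [[] for _ in range(4)]  # one private result list per worker
threads = [threading.Thread(target=worker, args=(dataset, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every URL ends up in exactly one worker's list.
collected = [row for r in results for row in r]
```

How the rows are split between workers varies from run to run, but no row is handed out twice, which is the property that makes running several readers against one dataset safe.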

Here is a more complete script example for your reference:

Script A

public void Run()
{
    // Open the list page and wait until it has loaded.
    Default.Navigate("http://news.163.com/world/");
    Default.Ready();
    while (Default.Available)
    {
        // Store the title and link of every news item on the current page.
        var rows = Default.SelectNodes("...");
        foreach (var r in rows)
        {
            var title = r.SelectSingleNode("a").Text();
            var url = r.SelectSingleNode("a").Attr("href");
            DataManager.AppendData("news.163.com.list", DataEntry.Create().Set("Title", title).Set("Url", url));
        }
        // Move to the next page; stop when there is none.
        var nextPage = Default.SelectSingleNode("...");
        if (nextPage.IsEmpty()) return;
        nextPage.Click();
        Default.Reset();
        Default.Ready();
    }
}

Script B

public void Run()
{
    while (Default.Available)
    {
        // Read the next stored link; null means the list is exhausted.
        var de = DataManager.ReadData("news.163.com.list");
        if (de == null) return;
        var url = de.Get("Url");
        Default.Navigate(url);
        Default.Ready();
        var content = Default.SelectSingleNode("...");
        DataManager.AppendData("news.163.com.content", de.Set("Content", content)); // Please create the dataset news.163.com.content in advance
    }
}

So, are you itching to run several collection sprites at the same time? (Although the trial version can run only one collection sprite, you can still gain speed by having SS and the collection sprite run Script B at the same time.)

Finally, enjoy~!
