Ruiji scraper basics: Ruiji Expression Model

Source: Internet
Author: User
Preface

Ruiji scraper is a visual crawler that runs as a browser extension. It is a data collection tool suited to people working in finance, news editing, and new media, as well as to personal website owners and crawler developers.

Ruiji expressions are the extraction model used by both Ruiji scraper and the open-source crawler framework Ruiji.Net. Ruiji.Net is an open-source project on GitHub, and its contributor is also the author of Ruiji scraper.

The Ruiji expression model was distilled from experience with a large number of crawlers. It is intended to apply to any webpage that needs to be crawled.

Ruiji scraper is currently available only for Firefox: https://addons.mozilla.org/zh-CN/firefox/addon/ruiji-scraper/

Data Block

When designing a webpage, designers often use styles, IDs, and similar attributes to divide the page into areas, with different areas displaying different content. This is a convention shared by most web designers, and it also produces visually identifiable areas in the final product, which lets users quickly locate the content they care about.

A block is an area we need to extract. A page may contain multiple such areas. Take the Google search results for "Ruiji scraper" as an example to illustrate blocks.

Google's search results page (the "All" tab) is roughly divided into three parts: video results, image results, and web results, each presented in a different form. Video results are laid out horizontally and include a video preview, duration, title, author, source, and upload time. Image results are laid out horizontally as images. Web results are laid out vertically, each with a title, URL, and abstract. We may be interested in some or all of this content.

Block is used to locate the areas we are interested in.

When the block is not defined, the Ruiji expression uses the body as the block by default.
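The default-to-body rule can be sketched in a few lines of Python. This is only an illustration built on the standard library's ElementTree and a small well-formed sample page. The function name `select_block`, the sample markup, and the path syntax are assumptions made for this example; they are not Ruiji's actual expression syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not real Google markup).
PAGE = """
<html>
  <body>
    <div id="videos"><p>video results</p></div>
    <div id="web"><p>web results</p></div>
  </body>
</html>
"""

def select_block(root, block_path=None):
    """Return the area to extract from; fall back to <body> when no block is defined."""
    if block_path is None:
        return root.find("body")
    return root.find(block_path)

root = ET.fromstring(PAGE)
web_block = select_block(root, ".//div[@id='web']")  # an explicitly defined block
default_block = select_block(root)                   # no block defined

print(web_block.get("id"))  # web
print(default_block.tag)    # body
```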

Data Slice Tile

As described above, each block may have its own presentation form (or share one with other blocks), but the items within a block all have the same structure. In web development, developers usually use a loop to render the data as a series of identical child elements on the page. Take the web results in the Google search results for "Ruiji scraper" as an example.

If we ignore the content inside each tile, we find that the block consists of the same child element repeated in a loop to show the data to the user.

Tile is used to define repeated child elements in a block.
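As a rough sketch of the idea, the repeated child elements of a block can be collected with a single path query. The markup, the class names, and the helper `select_tiles` below are hypothetical and chosen only for illustration; they do not reflect how Ruiji scraper itself expresses tiles.

```python
import xml.etree.ElementTree as ET

# Hypothetical block: the same child element repeated once per search result.
BLOCK = """
<div id="web">
  <div class="result"><h3>First title</h3></div>
  <div class="result"><h3>Second title</h3></div>
  <div class="result"><h3>Third title</h3></div>
  <div class="footer">not a result</div>
</div>
"""

def select_tiles(block, tile_path):
    """Return every repeated child element of the block that matches tile_path."""
    return block.findall(tile_path)

block = ET.fromstring(BLOCK)
tiles = select_tiles(block, "./div[@class='result']")

print(len(tiles))                            # 3
print([t.find("h3").text for t in tiles])    # ['First title', 'Second title', 'Third title']
```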

Metadata Meta

Metadata is the data we actually want to extract. We capture it and save it to a document or database for later use. Ruiji scraper groups the extracted data by tile and block and converts it into structured data. For example, a single tile in the web search results provides a title, a link, and an abstract. How many meta items you define depends on your needs.

Meta is used to describe the data we are interested in within a tile.
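Putting the three concepts together, the sketch below walks block, then tile, then meta, and produces one structured record per tile. It assumes a small well-formed sample page and uses Python's standard ElementTree. The function `extract`, the `METAS` mapping, and all paths are hypothetical names introduced for this example, not the real Ruiji expression format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample page with one block of web results.
PAGE = """
<html><body>
  <div id="web">
    <div class="result">
      <a href="https://example.com/a"><h3>Title A</h3></a>
      <p class="abstract">Abstract A</p>
    </div>
    <div class="result">
      <a href="https://example.com/b"><h3>Title B</h3></a>
      <p class="abstract">Abstract B</p>
    </div>
  </div>
</body></html>
"""

# Each meta maps a field name to (path inside the tile, attribute or None for text).
METAS = {
    "title": ("./a/h3", None),
    "link": ("./a", "href"),
    "abstract": ("./p[@class='abstract']", None),
}

def extract(root, block_path, tile_path, metas):
    """Locate the block, iterate over its tiles, and read each meta from every tile."""
    block = root.find(block_path) if block_path else root.find("body")
    rows = []
    for tile in block.findall(tile_path):
        row = {}
        for name, (path, attr) in metas.items():
            node = tile.find(path)
            if node is None:
                row[name] = None            # meta missing in this tile
            elif attr:
                row[name] = node.get(attr)  # read an attribute such as href
            else:
                row[name] = "".join(node.itertext()).strip()  # read the text content
        rows.append(row)
    return rows

rows = extract(ET.fromstring(PAGE), ".//div[@id='web']", "./div[@class='result']", METAS)
print(json.dumps(rows, indent=2))
```

Each entry in `rows` corresponds to one tile, so the output is already grouped the way the text describes and can be written to a document or database as needed.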
