Hawk: an advanced crawler & ETL tool written in C#/WPF
1. Introduction to the Software
Hawk is named after the bird of prey that strikes quickly and accurately.
Hawk is written in C#, with a front end built in WPF, and it supports plug-in extensions. Thanks to its graphical operation, a crawling solution can be set up quickly.
GitHub address: https://github.com/ferventdesert/Hawk
Its Python equivalent is etlpy:
http://www.cnblogs.com/buptzym/p/5320552.html
The author has also published a set of ready-made project files for it on GitHub:
https://github.com/ferventdesert/Hawk-Projects
To use one, click File, then Load Project, and the project will be loaded.
If you do not want to compile the code yourself, the executable files are available at:
https://github.com/ferventdesert/Hawk/tree/master/Versions
To compile it yourself, open the solution at:
Hawk.core\hawk.core.sln
For example, to collect all Beijing restaurants from Dianping, the software can be configured in about 10 minutes, crawl all the content in parallel within an hour, and monitor the performance of each worker thread. Writing the code by hand, even in Python, could take a skilled programmer more than a day:
Video demonstrations, in order of increasing complexity:
- Lianjia housing listings
- Public platform
- Dianping: Beijing restaurants
2. Interface and Component Introduction
2.1 Interface Introduction
The software uses a dockable layout similar to Visual Studio and Eclipse; all panels can be floated and toggled. It includes the following core components:
- Upper left area: the main work area, for module management.
- Bottom area: debugging output and task management, showing each task's completion percentage.
- Upper right area: the property manager, for setting the properties of different modules.
- Lower right area: displays all currently loaded data tables and modules.
2.2 Data Management
Here you can add connectors to different data sources and load and manage their data:
Right-click in the blank area to add a new connector. Double-click a table under a connector to view a sample, or right-click it to load the data into memory. You can also load it as a virtual dataset; the system then maintains a virtual collection that queries the database dynamically whenever the upper layer requests a page of data, which greatly improves performance.
2.3 Module Management
At present, the system provides only two modules: the web collector and the data-cleansing ETL. Double-click a module to load a new instance.
Modules that have already been configured can be saved as tasks; double-click a task to load it:
2.4 System State Management
Datasets and modules that have been loaded can be viewed and edited in system state management:
Right-click a dataset to delete it, rename it, and so on. You can also drag a dataset onto the icons below, or drag it to the recycle bin to delete the module.
Double-click a dataset or module to view its contents. Drag a dataset onto the data-cleansing icon (the first icon below the data view) to clean that dataset directly.
3. Web Collector
3.1 Principle (recommended reading)
The web collector's job is to extract data from web pages (obviously). In general, the target is either a list (such as a shopping-cart list) or a fixed field on a page (such as the price and description of a product on JD.com, which appear only once per page), so its read mode has to be set accordingly. Traditional collectors require you to write regular expressions, which is overly complicated. Once you realize that HTML is a tree, you only need to find the node that holds the data. XPath is a syntax for describing a path through that tree; given an XPath, the matching nodes can be found in the tree.
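To make the idea concrete, here is a minimal sketch of locating a node by XPath in C#, assuming the HtmlAgilityPack library (a common HTML parser); the HTML snippet and class names are invented for illustration and this is not Hawk's own extraction code:

```csharp
// Minimal sketch: find the node that hosts the data by walking the HTML tree
// with an XPath expression. Assumes the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var html = "<div class='house'><span class='price'>700</span></div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // An XPath is simply a path through the tree of elements.
        var node = doc.DocumentNode.SelectSingleNode(
            "//div[@class='house']/span[@class='price']");
        Console.WriteLine(node?.InnerText);   // prints: 700
    }
}
```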
Writing XPath by hand is still complicated, so the software can work out the XPath automatically from keywords: you provide a keyword, and the software recursively searches the tree for the leaf node that contains it. The keyword should therefore be as unique as possible on the page.
As shown in the figure, as long as the two keywords "Beijing" and "42" are provided, their parent node can be found, yielding the two list elements div[0] and div[1]. By comparing the nodes div[0] and div[1], we can automatically discover their identical child nodes (Name, Amount) and their differing ones (Beijing vs. Shanghai, 37 vs. 42). The identical nodes are saved as property names, and the differing nodes become the property values. However, you cannot provide "Beijing" and "37" together: their common node is div[0] itself, which is not a list.
Even if no keywords are provided, the software can use characteristics of the HTML document to infer which node is most likely to be the parent of the list; but on particularly complex pages the guess may be wrong, so you should provide at least two keywords (attributes).
The algorithm behind this is original; you can read the source code or leave a comment to discuss it.
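As a rough illustration of the idea (a simplified reconstruction, not Hawk's actual source), the sketch below locates the nodes containing two keywords, walks up to their lowest common ancestor as the candidate list root, and then reads each child as a list item:

```csharp
// Sketch of keyword-based list inference, assuming HtmlAgilityPack.
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class ListInferenceSketch
{
    // Collect a node's chain of ancestors, starting from the node itself.
    static List<HtmlNode> Ancestors(HtmlNode node)
    {
        var chain = new List<HtmlNode>();
        for (var n = node; n != null; n = n.ParentNode) chain.Add(n);
        return chain;
    }

    static void Main()
    {
        var html = "<ul><li><b>Beijing</b><i>37</i></li>" +
                   "<li><b>Shanghai</b><i>42</i></li></ul>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Leaf nodes whose text contains the two keywords.
        var a = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'Beijing')]");
        var b = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'42')]");

        // Lowest common ancestor = candidate root of the list.
        var ancestorsOfB = new HashSet<HtmlNode>(Ancestors(b));
        var root = Ancestors(a).First(ancestorsOfB.Contains);
        Console.WriteLine(root.Name);   // ul

        // Each child of the root is a list item; identical tags across items
        // become property names, and the differing text becomes the values.
        foreach (var item in root.ChildNodes.Where(c => c.NodeType == HtmlNodeType.Element))
            Console.WriteLine(string.Join(", ",
                item.ChildNodes.Where(c => c.NodeType == HtmlNodeType.Element)
                    .Select(c => $"{c.Name}={c.InnerText}")));
        // b=Beijing, i=37
        // b=Shanghai, i=42
    }
}
```

If "Beijing" and "37" were given instead, the common ancestor would be the first list item itself, which is exactly why that pair cannot define a list.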
3.2 Basic List
We take crawling Lianjia housing listings as an example to introduce how the web collector is used. First, double-click the icon to load a collector:
In the address bar at the top, enter the URL to be collected, in this case http://bj.lianjia.com/ershoufang/, and click Refresh Page. The HTML text that was fetched is shown below; the original page on the site looks like this:
Because the software does not know exactly what to extract, you need to provide a few keywords manually; Hawk searches for them and works out where the data is located.
Taking the page above as an example, by searching for 8.2 million (the total price) and 51789 (the unit price; the values differ on every fetch), we can use their paths in the DOM tree to find the root node of the whole listing.
Here are the actual steps:
Because we are crawling a list, set the read mode to List. Enter 700 as the search keyword; once the XPath is obtained successfully, name the property "Total Price" and click Add Field to add it. Similarly, enter 30535 and set the property name to "Unit Price" to add another attribute.
If you make a mistake, click Edit Collection to delete, modify, or reorder the attributes.
You can add all the other fields you want to grab in the same way, or simply click Good Luck, and the system will infer the remaining properties from the ones already defined:
The property names are inferred automatically; if you are not satisfied, edit the name in the first column of the list and press Enter in that cell to commit the change. The properties are then added to the attribute list automatically.
While working, you can click Extraction Test at any time to see what the collector is currently able to capture. With that, a Lianjia web collector is complete. At the top of the property manager you can rename the collector module, so that the data-cleansing module can refer to it by that name later.
4. Data Cleansing
The data-cleansing module consists of dozens of sub-modules in four categories: generation, transformation, filtering, and execution.
4.0 Principle (can be skipped)
4.0.1 Explanation of the C# version
The essence of data cleansing is the dynamic assembly of LINQ. Its data chain is IEnumerable<IFreeDocument>, where IFreeDocument extends the IDictionary<string, object> interface. LINQ's Select function transforms the stream, in this case by operating on individual columns of the dictionary (adding, modifying, or removing them), and the chain of modules defines a complete LINQ flow:
result = source.Take(mount).Where(d => module0.Func(d)).Select(d => module1.Func(d)).Select(d => module2.Func(d))…
Thanks to the C# compiler, LINQ easily supports streaming data, so even an enormous collection (hundreds of billions of elements) can be handled efficiently.
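As a sketch of this idea, the following code assembles a chain of column operations dynamically and still evaluates lazily, one document at a time. The types here are simplified stand-ins, not Hawk's real IFreeDocument or module interfaces:

```csharp
// Sketch: dynamically assembling a LINQ chain over dictionary "documents".
using System;
using System.Collections.Generic;
using System.Linq;

interface IColumnModule
{
    IDictionary<string, object> Transform(IDictionary<string, object> doc);
}

class AddColumn : IColumnModule
{
    public string Name;
    public Func<IDictionary<string, object>, object> Value;

    public IDictionary<string, object> Transform(IDictionary<string, object> doc)
    {
        doc[Name] = Value(doc);   // operate on one column of the dictionary
        return doc;
    }
}

class Program
{
    static void Main()
    {
        // Source stream; because everything below is lazy, it could just as
        // well be millions of rows streamed from a database or a crawler.
        IEnumerable<IDictionary<string, object>> source =
            Enumerable.Range(1, 100).Select(i =>
                (IDictionary<string, object>)new Dictionary<string, object> { ["id"] = i });

        var modules = new List<IColumnModule>
        {
            new AddColumn
            {
                Name = "url",
                Value = d => $"http://bj.lianjia.com/ershoufang/pg{d["id"]}/"
            },
        };

        // Dynamic assembly: fold each module into the LINQ chain.
        var result = modules.Aggregate(source, (acc, m) => acc.Select(m.Transform));

        foreach (var doc in result.Take(3))
            Console.WriteLine(doc["url"]);
    }
}
```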
4.0.2 Explanation of the Python version
Because Python has no LINQ, generators are assembled instead; manipulating the generators defines a complete LINQ-like chain:
for tool in tools: generator = transform(tool, generator)
For the detailed source code, refer to the open-source project on GitHub: https://github.com/ferventdesert/etlpy/
4.1 Crawling Lianjia as an Example
4.1.1 Constructing the URL list
The previous section described how to collect a single page, but how do we collect all of the second-hand housing listings? This involves paging.
Taking Lianjia as an example, when we turn the pages we can see how the URL changes:
http://bj.lianjia.com/ershoufang/pg2/
http://bj.lianjia.com/ershoufang/pg3/
…
Therefore we need to construct a series of URLs like these. As you have probably guessed, we should first generate a sequence of numbers from 1 to 100 (assuming we crawl only the first 100 pages).
- In the search box on the left side of the data-cleansing ETL, search for Generate Interval Numbers, and drag the module into the upper-right column:
- In the right column, double-click Generate Interval Numbers to open its settings window: set the column name to ID, set the maximum value to 100, and leave the generation mode at its default, Append:
Why are only the first 20 rows shown? This is the program's virtualization mechanism, which avoids loading all the data; you can change the sample size (the default is 20) in the Debugging section of the ETL properties.
- To convert a number into a URL, readers familiar with C# will think of String.Format (or Python's % operator): search for Merge Multiple Columns, drag it onto the ID column we just generated, and fill in the Format field to turn the original numeric column into a set of URLs (a code sketch follows the notes below).
(If several columns need to be merged into one, fill in the names of the other columns in the Other Items field, separated by spaces, and reference them in the format string as {1}, {2}, and so on.)
(Due to a design limitation, data-viewer columns are at most 150 pixels wide, so long text is truncated; you can click View Sample in the properties panel on the right, and the pop-up editor supports copying the data and adjusting the column width.)
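For readers who prefer code, the two steps above (generating interval numbers and merging them into a URL with a format string) amount to something like the following sketch, using the Lianjia paging pattern shown earlier:

```csharp
// Sketch of "Generate Interval Numbers" + "Merge Multiple Columns":
// a numeric sequence formatted into page URLs.
using System;
using System.Linq;

class UrlListDemo
{
    static void Main()
    {
        var urls = Enumerable.Range(1, 100)   // ids 1..100
            .Select(i => string.Format("http://bj.lianjia.com/ershoufang/pg{0}/", i));

        foreach (var url in urls.Take(3))
            Console.WriteLine(url);
        // http://bj.lianjia.com/ershoufang/pg1/ ... up to pg100
    }
}
```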
4.1.2 Using the configured web collector
Now that the URLs are generated, we can combine the web collector we just configured with this series of URLs.
Drag the From Crawler transform onto the URL column, double-click the module, and fill in the crawler selection field with the name of the web collector we configured earlier.
The system will then crawl and transform the first 20 rows:
As you can see, the "Attribute 3" column contains HTML escape characters; dragging the HTML character-escape module onto that column converts all of the escape sequences automatically.
If you want to rename a column, edit the name directly at the top of the column and press Enter to commit the change.
The location (area) column contains numbers; if you want to extract them, drag the number-extraction module onto the column and all of the numbers will be extracted.
Similarly, you can split or replace text by dragging a character-split or regex-split module onto a column; we will not repeat the details here.
Some columns may be empty; drag the empty-object filter onto such a column, and any row whose value is empty will be filtered out automatically.
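Conceptually, the column operations above (HTML unescaping, number extraction, and empty-value filtering) behave roughly like the sketch below over dictionary rows; the column names and sample values are only illustrative, not Hawk's internals:

```csharp
// Rough sketch of the cleansing steps applied to each row:
// HTML-unescape one column, extract digits from another, drop empty rows.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

class CleanseDemo
{
    static void Main()
    {
        var rows = new List<Dictionary<string, object>>
        {
            new() { ["Attribute 3"] = "Chaoyang &amp; Wangjing", ["area"] = "89.5 sqm" },
            new() { ["Attribute 3"] = "", ["area"] = "" },   // will be filtered out
        };

        var cleaned = rows
            .Select(r => { r["Attribute 3"] = WebUtility.HtmlDecode((string)r["Attribute 3"]); return r; })
            .Select(r => { r["area"] = Regex.Match((string)r["area"], @"[\d.]+").Value; return r; })
            .Where(r => !string.IsNullOrEmpty((string)r["area"]));   // empty-object filter

        foreach (var r in cleaned)
            Console.WriteLine($"{r["Attribute 3"]} | {r["area"]}");
        // Chaoyang & Wangjing | 89.5
    }
}
```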
4.1.4 Saving and exporting data
When you need to save the data, you can write it to a file, to temporary storage (the software's data manager), or to a database. To do so, drag an executor module onto the tail end of the cleansing chain:
Drag the data-table writer onto any column and fill in the new table name (such as "Lianjia housing").
Below is the full list of sub-modules used in this operation:
After that, the whole workflow can be run:
Choose serial mode or parallel mode. Parallel mode uses a thread pool, and you can set the maximum number of concurrent threads (preferably no more than 100); parallel mode is recommended.
Click the Execute button, and you can watch the data being captured in the task management view.
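Conceptually, parallel mode with a capped thread count behaves like the PLINQ sketch below; this is only an analogy, and Hawk's own scheduler and thread-pool settings may differ:

```csharp
// Sketch: crawl a list of URLs in parallel with a bounded number of workers.
using System;
using System.Linq;

class ParallelDemo
{
    static void Main()
    {
        var urls = Enumerable.Range(1, 100)
            .Select(i => $"http://bj.lianjia.com/ershoufang/pg{i}/");

        var results = urls
            .AsParallel()
            .WithDegreeOfParallelism(8)   // cap concurrent workers, like the max-thread setting
            .Select(Fetch)
            .ToList();

        Console.WriteLine($"fetched {results.Count} pages");
    }

    // Placeholder; a real crawl would download and parse the page here.
    static string Fetch(string url) => url;
}
```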
Afterwards, you can export the collected data to an external file: right-click the data table under data management, choose Save As, and export it to Excel, JSON, and so on.
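As an illustration of what a JSON export of the cleaned rows might look like (a sketch using System.Text.Json, not Hawk's actual exporter):

```csharp
// Sketch: serialize dictionary rows to a JSON file.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

class ExportDemo
{
    static void Main()
    {
        var rows = new List<Dictionary<string, object>>
        {
            new() { ["Total Price"] = 700, ["Unit Price"] = 30535 },
            new() { ["Total Price"] = 820, ["Unit Price"] = 51789 },
        };

        var json = JsonSerializer.Serialize(rows, new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText("lianjia.json", json);
        Console.WriteLine(json);
    }
}
```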
Similarly, you can drag an executor into the middle of the cleansing chain to save intermediate results, or drag several executors onto the end to write to a database and a file at the same time, which gives great flexibility.
4.1.5 Saving Tasks
Right-click any module in the algorithm view in the lower right corner and choose Save Task; a new task (with the same name as the module) is saved in the task view and can be loaded directly next time. If a task with the same name already exists, it is overwritten.
In an empty area of the algorithm view, click Save All Modules to save all of the tasks in one batch.
You can save a batch of tasks as a project file (XML) and load and distribute them later.
5. Summary
Above, we used crawling the real-estate website Lianjia as an example to walk through the overall workflow of the software. Of course, Hawk's functionality goes far beyond this; a series of follow-up articles will describe how to use it in more depth.
Note that because the software is continuously upgraded, the videos and screenshots may not match the current version exactly; the software's actual behavior takes precedence over this introduction.