Overview:
Core Technical Requirements: No loss of orders, distributed order fetching
Order fetching technology selection: push, or pull (crawl)? At present, scheduled crawling is the more reasonable method, because it controls the rate at which orders flow in and keeps the back-end system's processing capacity from being overwhelmed. We generally take a time-slicing approach: each task execution fetches the orders within one period of time, and to ensure no order is lost, the boundaries of adjacent slices overlap by a few seconds.
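A minimal sketch of this overlapped slicing; OVERLAP_SECONDS and sliceSeconds are illustrative parameters, and orders re-fetched in the overlapped boundary are assumed to be deduplicated downstream:

```java
import java.time.Instant;

// A minimal sketch of overlapped time slicing: each slice starts a few
// seconds before the previous slice ended, so boundary orders are never lost.
public class TimeSlicer {
    static final long OVERLAP_SECONDS = 5; // illustrative overlap width

    // Given the end of the previous slice, compute the next [from, to) window.
    static Instant[] nextSlice(Instant prevEnd, long sliceSeconds) {
        Instant from = prevEnd.minusSeconds(OVERLAP_SECONDS); // re-fetch the boundary
        Instant to = prevEnd.plusSeconds(sliceSeconds);
        return new Instant[] { from, to };
    }
}
```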
As an order hub, we often fetch orders from multiple platforms, so we can consider allocating multiple servers to the order-fetching work to improve throughput: a platform with many orders can be given more than one server, while for platforms with fewer orders a single server can handle several platforms. How to dispatch tasks efficiently is a challenge. In particular, when multiple servers serve a single platform, each server must crawl a different time period in parallel so that no order is repeated or omitted. One approach is to assign a dedicated server to do task scheduling, but that server may become a single-point bottleneck: once it goes down, the entire crawl stalls. A better practice is to have the servers negotiate among themselves through zookeeper to determine the time period each should take.
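One plausible shape for this coordination, sketched under assumptions (the node path comes from the data structure described below; using the ephemeral node for liveness detection is an inference, not stated in the original):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// A hedged sketch: each fetch server announces itself under its assigned
// platform with an ephemeral node, so peers can see which servers are alive
// without any single scheduler being a point of failure.
public class ServerRegistration {
    static void register(ZooKeeper zk, String platform, String serverId)
            throws KeeperException, InterruptedException {
        String path = "/schedulers/assignment/" + platform + "/" + serverId;
        // Ephemeral: the node disappears automatically when this server's
        // zookeeper session dies, signalling that its work needs reassignment.
        zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
```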
Configuration Management:
We currently use zookeeper for the configuration management of order fetching. For configuration management, zookeeper's biggest advantage is the ability to manage configuration information centrally: when the configuration changes, all nodes are automatically notified and updated. Two levels of configuration are currently supported for every option: all platforms can share a configuration (platform number = default), or a platform can be personalized (overriding the default value).
Configuration Management zookeeper Data structure:
path | zookeeper data | node type
--- | --- | ---
/tasks/cfg/starttime/[platform number] | Start time of the order fetch task for each platform (accurate to the second); for example, to fetch the most recent 3 months of orders, the start time is the current time minus 3 months | Persistent
/tasks/cfg/interval/[platform number] | Time slice length (in seconds) of each platform's order fetch task | Persistent
/tasks/cfg/timeout/[platform number] | Timeout of each platform's order fetch task | Persistent
/tasks/cfg/retries/[platform number] | Number of times each platform's order fetch task is retried on failure | Persistent
/tasks/cfg/retryintv/[platform number] | Interval between retries after each platform's order fetch task fails | Persistent
/tasks/cfg/…/[platform number] | Number of concurrent order fetch threads on each node for each platform | Persistent
/schedulers/assignment/[platform number]/[server node number] | Empty string | Ephemeral
Note: the correspondence between servers and platforms cannot yet be assigned automatically; it must be created manually in zookeeper before starting the servers.
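A minimal sketch of the two-level lookup, assuming values are stored as UTF-8 strings (the storage format is not specified in the original):

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Read a per-platform option, falling back to the shared default
// (platform number = default) when no override node exists.
public class FetchConfig {
    static String get(ZooKeeper zk, String option, String platform)
            throws KeeperException, InterruptedException {
        try {
            byte[] v = zk.getData("/tasks/cfg/" + option + "/" + platform, false, null);
            return new String(v, StandardCharsets.UTF_8);
        } catch (KeeperException.NoNodeException e) {
            // No per-platform override; fall back to the shared default.
            byte[] v = zk.getData("/tasks/cfg/" + option + "/default", false, null);
            return new String(v, StandardCharsets.UTF_8);
        }
    }
}
```

For example, get(zk, "interval", platformId) returns the platform's own slice length if an override node exists, otherwise the shared default value.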
Task scheduling algorithm:
The nodes coordinate with each other through zookeeper; the zookeeper data structure is as follows:
path | zookeeper data | node type
--- | --- | ---
/tasks/runtime/prevcomplete/[platform number] | The most recent execution time of each platform's task (the start time of the next time slice) | Persistent
/tasks/runtime/inprogress/[platform number]-[task time slice start]-[task time slice end] | Start execution time of the actual task | Persistent
Get task:
1. Get the most recent execution time from /tasks/runtime/prevcomplete/ in zookeeper; if it is empty, use the configured start time (/tasks/cfg/starttime/) instead.
2. Calculate the next execution time slice from the time slice length.
3. Save the start and end of the next time slice to /tasks/runtime/inprogress/ and the new start time of the current platform's task to /tasks/runtime/prevcomplete/. Note: the zookeeper multi command is used to make the two saves atomic (both succeed or both fail); when two different nodes try to acquire the same time slice, only the first save succeeds and the second throws a KeeperException.
4. If the save fails, re-read the most recent execution time and repeat steps 1-3; if still unsuccessful after n attempts, report that fetching the task failed.
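A sketch of steps 1-3, assuming times are stored as epoch-second strings and omitting the empty-prevcomplete fallback to /tasks/cfg/starttime/ for brevity; the versioned setData inside multi is what makes the losing contender fail:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class TaskAcquirer {
    static boolean tryAcquire(ZooKeeper zk, String platform, long sliceSeconds)
            throws KeeperException, InterruptedException {
        String prevPath = "/tasks/runtime/prevcomplete/" + platform;
        Stat stat = new Stat();
        byte[] raw = zk.getData(prevPath, false, stat);  // step 1: last end time
        long prevEnd = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
        long end = prevEnd + sliceSeconds;               // step 2: next slice
        String taskPath = "/tasks/runtime/inprogress/" + platform + "-" + prevEnd + "-" + end;
        try {
            // Step 3: claim the slice and advance the start time atomically.
            zk.multi(Arrays.asList(
                Op.create(taskPath,
                          Long.toString(System.currentTimeMillis())
                              .getBytes(StandardCharsets.UTF_8), // start execution time
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.setData(prevPath,
                           Long.toString(end).getBytes(StandardCharsets.UTF_8),
                           stat.getVersion())));         // versioned: losers fail here
            return true;
        } catch (KeeperException e) {
            return false; // another node won this slice; caller repeats steps 1-3
        }
    }
}
```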
Task Completion:
Delete the task's node under /tasks/runtime/inprogress/ in zookeeper.
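Continuing the sketch above, completion is just a delete of the claim node (version -1 matches any version); prevcomplete was already advanced when the task was acquired:

```java
// Deleting the inprogress node marks the slice as done.
static void complete(ZooKeeper zk, String platform, long start, long end)
        throws KeeperException, InterruptedException {
    zk.delete("/tasks/runtime/inprogress/" + platform + "-" + start + "-" + end, -1);
}
```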
Exception Handling:
Each fetch task that fails is retried a certain number of times (/tasks/cfg/retries). The interval between retries grows gradually (retry count * /tasks/cfg/retryintv) so that frequent retries do not further deteriorate the network condition. When the number of retries exceeds the limit, the task is placed in a failed queue and the administrator is notified; troubleshooting and subsequent manual processing may be required.
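A sketch of this retry policy under assumptions (fetchOnce and the failed queue are hypothetical stand-ins; the interval is taken in milliseconds):

```java
import java.util.Queue;

// Retry with a linearly growing wait (attempt * retryintv), then hand the
// task to a failure queue for the administrator. fetchOnce is a hypothetical
// callback that throws on failure.
public class RetryPolicy {
    static void runWithRetries(Runnable fetchOnce, int maxRetries,
                               long retryIntvMillis, Queue<Runnable> failedQueue) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                fetchOnce.run();
                return; // success
            } catch (RuntimeException e) {
                try {
                    Thread.sleep(attempt * retryIntvMillis); // back off harder each time
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        failedQueue.add(fetchOnce); // retries exhausted: queue for manual handling
    }
}
```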
Another situation is that, after a task has been acquired, server downtime, a crashed crawl thread, or a similar failure prevents the node under /tasks/runtime/inprogress/ from being deleted; the task is then never completed and its processing state is unknown. The cluster needs to elect a leader server through zookeeper to monitor the task list (zookeeper leader election is described in the official recipes and is not discussed here). A monitoring thread on the leader periodically scans all tasks under the /tasks/runtime/inprogress/ node; if a task has existed for longer than a threshold, it is deleted and likewise placed in the failed queue for subsequent processing.
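A sketch of the leader's sweep, assuming each inprogress node stores its start execution time as an epoch-millisecond string (consistent with the acquisition sketch above) and that stale tasks go to the same failed queue:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Queue;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// The elected leader periodically scans /tasks/runtime/inprogress and reaps
// tasks whose start execution time is older than a threshold: their owner
// likely died, so the claim is removed and queued for manual follow-up.
public class StaleTaskReaper {
    static void sweep(ZooKeeper zk, long staleMillis, Queue<String> failedQueue)
            throws KeeperException, InterruptedException {
        List<String> tasks = zk.getChildren("/tasks/runtime/inprogress", false);
        long now = System.currentTimeMillis();
        for (String task : tasks) {
            String path = "/tasks/runtime/inprogress/" + task;
            byte[] raw = zk.getData(path, false, null); // start execution time
            long started = Long.parseLong(new String(raw, StandardCharsets.UTF_8));
            if (now - started > staleMillis) {
                zk.delete(path, -1);   // processing state unknown: drop the claim
                failedQueue.add(task); // and queue it for subsequent processing
            }
        }
    }
}
```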