Wangchinglang Wang Ming Wang
As a global commerce platform and a leader in the payments industry, eBay has a huge amount of user behavior data. Its existing Hadoop-based big data processing could no longer meet the needs of real-time business. Drawing on its past experience with big data processing and on the latest technology, eBay built a platform for real-time collection, processing, distribution, and analysis of massive data streams. At the end of February 2015, eBay open-sourced this platform: Pulsar.
Pulsar is a complex event processing platform that is fast, accurate, and flexible. It provides low latency and high reliability end to end, which meets eBay's need for real-time data analysis at second-level latency. It can also process millions of events per second, which helps give customers a better personalized experience, lets them monitor business information in real time and customize real-time marketing strategies, and supports timely detection of online fraud and reduction of bot activity. Pulsar is also a standards-based, distributed cloud deployment that spans multiple data centers, so the cluster does not need downtime during system upgrades and topology updates.
The Pulsar platform provides a complete solution for real-time big data analytics:
The platform collects event streams in real time, enriches events in real time, and pushes them to different real-time applications, while also computing statistics and analysis in real time to give the business key insights.
Inside the Pulsar platform, an event stream is treated like a database table, and processing on it is defined with a declarative 4GL. The framework that supports this, a new big data stream processing framework, was open-sourced at the same time.
pulsar.stream is a new, generic processing framework for big data streams. It implements an open, auto-discovered topology: different applications can be deployed in different data centers, they discover each other and establish connections over the network automatically, and data is actively pushed from producer to subscriber. The pipeline topology is open and dynamically extensible, and the corresponding 4GL EPL can also be updated dynamically without service interruption.
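The push model described above can be illustrated with a very small in-process sketch. It is not Pulsar's implementation: the `Topic` class and its methods are hypothetical names, and network discovery across data centers is not modeled at all. It only shows data being pushed from a producer to subscribers, and a subscriber joining while the producer keeps running.

```python
class Topic:
    """Hypothetical in-process topic that pushes each event to its subscribers."""

    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self, callback):
        """Register a subscriber; allowed at any time, also after publishing began."""
        self._subscribers.append(callback)

    def publish(self, event):
        """Push the event to every subscriber known at this moment."""
        # Iterate over a copy so a callback may subscribe others safely.
        for callback in list(self._subscribers):
            callback(event)


topic = Topic("TOPIC1")
first_app, second_app = [], []

topic.subscribe(first_app.append)
topic.publish({"D1": 1})

# A second application joins later without stopping the producer.
topic.subscribe(second_app.append)
topic.publish({"D1": 2})
```

After the second `publish`, `first_app` holds both events and `second_app` holds only the later one, which is the behavior expected from a topology that grows while data keeps flowing.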
A typical deployment structure
EPL Sample:
Event filtering and routing
insert into Substream select D1, D2, D3, D4
from Rawstream where D1 = 2045573 or D2 = 2047936 or D3 = 2051457 or D4 = 2053742; // filtering
@PublishOn(topics="TOPIC1") // publish the sub stream at TOPIC1
@OutputTo("Outboundmessagechannel")
@ClusterAffinityTag(column = D1); // partition key based on column D1
select * from Substream;
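To make the semantics of the filtering and routing sample concrete, here is a minimal Python sketch of the same idea. It is an illustration under assumptions, not how Pulsar executes EPL: the function names are hypothetical, events are plain dictionaries, and the CRC32-modulo choice of a downstream node is only a stand-in for whatever partitioning the affinity tag actually triggers.

```python
from collections import defaultdict
from zlib import crc32

# Values copied from the where clause of the EPL sample.
WANTED = {"D1": 2045573, "D2": 2047936, "D3": 2051457, "D4": 2053742}


def matches(event):
    """True when any dimension equals its wanted value (the EPL `or` chain)."""
    return any(event.get(dim) == value for dim, value in WANTED.items())


def project(event):
    """Keep only columns D1..D4, like `select D1, D2, D3, D4`."""
    return {dim: event.get(dim) for dim in ("D1", "D2", "D3", "D4")}


def pick_node(event, node_count):
    """Choose a node from D1 so that equal D1 values land on the same node.

    The CRC32-modulo scheme is an assumption made for illustration only.
    """
    return crc32(str(event["D1"]).encode()) % node_count


def route(raw_events, node_count=3):
    """Filter raw events and group the surviving sub stream by target node."""
    buckets = defaultdict(list)
    for event in raw_events:
        if matches(event):
            sub = project(event)
            buckets[pick_node(sub, node_count)].append(sub)
    return dict(buckets)
```

Keying the partition on D1 means that any per-D1 state kept downstream only ever lives on one node, which is the usual reason for an affinity key.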
Aggregate computation
// create 10-second time window context
create context Mccontext start @now end pattern [timer:interval(10)];
// aggregate event count along dimensions D1 and D2 within the specified time window
context Mccontext insert into AGGREGATE select count(*) as METRIC1, D1, D2 from Rawstream group by D1, D2 output snapshot when terminated;
select * from AGGREGATE;
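The aggregate sample counts events per (D1, D2) pair in consecutive 10-second windows and emits the counts when each window ends. The Python sketch below mimics that behavior under simplifying assumptions: the function name is hypothetical, events carry their own timestamps and arrive sorted, windows are aligned to multiples of the window length rather than to the moment the statement starts, and windows without events produce no output.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=10):
    """Yield (window_start, Counter) for each window that saw events.

    `events` is an iterable of (timestamp_seconds, D1, D2) tuples sorted by
    timestamp. Each Counter maps (D1, D2) to the event count in that window.
    """
    window_start = None
    counts = Counter()
    for ts, d1, d2 in events:
        start = ts - (ts % window_seconds)
        if window_start is not None and start != window_start:
            # The previous window has ended: emit its snapshot and reset.
            yield window_start, counts
            counts = Counter()
        window_start = start
        counts[(d1, d2)] += 1
    if counts:
        # Emit the final, still-open window once the input is exhausted.
        yield window_start, counts
```

Resetting the counter at each boundary is what makes the windows non-overlapping: every event contributes to exactly one emitted snapshot.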
TopN computation
// create 60-second time window context
create context Mccontext start @now end pattern [timer:interval(60)];
// sort to find the top ten event counts along dimensions D1, D2, and D3
// within the specified time window
context Mccontext insert into Topitems select count(*) as TotalCount, D1, D2, D3 from Raweventstream group by D1, D2, D3 order by count(*) desc limit 10;
select * from Topitems;
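The TopN sample groups the events of one window by (D1, D2, D3), counts each group, and keeps the ten largest counts. A minimal Python sketch of that last step follows. The function name is hypothetical, events are assumed to be dictionaries, and the caller is assumed to pass in only the events that belong to a single 60-second window.

```python
from collections import Counter


def top_items(window_events, limit=10):
    """Return up to `limit` ((D1, D2, D3), total_count) pairs, largest first."""
    counts = Counter((e["D1"], e["D2"], e["D3"]) for e in window_events)
    # most_common sorts by count in descending order and truncates to `limit`.
    return counts.most_common(limit)
```

Only the grouped counts have to be held for the window, not the raw events, so memory grows with the number of distinct (D1, D2, D3) combinations rather than with the traffic volume.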
For more information, see
www.ebaytechblog.com/2015/02/23/announcing-pulsar-real-time-analytics-at-scale
Related Events :
1. Pulsar at QCon Shanghai 2014:
http://www.infoq.com/cn/presentations/ebay-user-behavior-data-stream-processing-system
2. http://www.milibo.com/talent/events.aspx?id=34
eBay open-sources Pulsar: a real-time big data analytics platform