Relational databases and the Hadoop ecosystem are becoming ever more intertwined, and the demands on both keep growing. This article investigates how to capture MySQL update data in real time and land it in HDFS.
This article is a research report only.
Preliminary research covered Canal (Alibaba) + Kafka Connect + Kafka, Maxwell (Zendesk) + Kafka, and mysql_streamer (Yelp) + Kafka. All of these tools capture changes from MySQL by reading the binlog: they simulate the protocol between a MySQL master and its slaves (the MySQL replication architecture, which provides multi-point data backup, higher data availability, read/write splitting, and better cluster concurrency (not load balancing), and allows some non-real-time data operations to be offloaded to slaves) in order to obtain updates in real time.
First, a primer on Canal.
Canal Introduction
Principle
Canal schematic diagram
The principle is relatively simple (see the sketch after this list):
- Canal simulates the MySQL slave interaction protocol: it disguises itself as a MySQL slave and sends the dump protocol to the MySQL master.
- The MySQL master receives the dump request and starts pushing binary logs to the slave (i.e. canal); the push happens in response to the slave's dump request, the master does not otherwise push to slaves on its own.
- Canal parses the binary log objects (originally a byte stream).
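As a rough illustration of what "pretending to be a slave and receiving binlog events" looks like in code, here is a minimal sketch that uses the open-source mysql-binlog-connector-java library rather than canal itself; the host, port, credentials, and server id are placeholders, not values from this article.

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;

public class BinlogTailer {
    public static void main(String[] args) throws Exception {
        // Connect to MySQL as if we were a replica (the same trick canal uses).
        BinaryLogClient client = new BinaryLogClient("127.0.0.1", 3306, "repl_user", "repl_pass");
        client.setServerId(1234);                      // must not clash with real replicas
        client.registerEventListener(event ->          // called for every binlog event received
                System.out.println(event.getHeader().getEventType() + " -> " + event));
        client.connect();                              // blocks and streams events
    }
}
```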
Architecture
Canal architecture diagram
Component Description:
- A server represents one running canal instance, corresponding to one JVM.
- An instance corresponds to one data queue (1 server corresponds to 1..n instances).
An instance module is composed of EventParser (data source access; simulates the slave protocol to interact with the master and parses the protocol), EventSink (the connector between parser and store; filters, processes, and distributes data), EventStore (data storage), and MetaManager (incremental subscription & consumption information manager).
Before sending the dump command to MySQL, EventParser fetches the position of the last successful parse from the log position (on first start, it uses the initially specified position or the current binlog position). After MySQL accepts the dump command, EventParser pulls the binlog data from MySQL, parses it, and passes it to EventSink (handing data to the EventSink module for storage is a blocking operation that only returns once storage succeeds); the log position is updated only after the hand-off succeeds. The flowchart is as follows:
Eventparser flowchart
EventSink acts like a channel: it supports data filtering, distribution/routing (1:N), merging (N:1), and processing. EventSink is the bridge between EventParser and EventStore.
The current EventStore implementation is the in-memory mode: the memory structure is a ring buffer, and three pointers (put, get, and ack) identify where data is written, read, and acknowledged.
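As a conceptual sketch (not canal's actual implementation), a ring buffer with those three cursors might look like this: put writes new events, get hands them to a consumer, and ack frees slots only after consumption is confirmed.

```java
/** Minimal sketch of a ring buffer with put/get/ack cursors; an illustration, not canal source. */
public class RingBuffer<T> {
    private final Object[] slots;
    private long put, get, ack;          // monotonically increasing sequence numbers

    public RingBuffer(int size) {        // size should be a power of 2, like buffer.size
        this.slots = new Object[size];
    }

    public synchronized boolean put(T item) {
        if (put - ack >= slots.length) return false;   // full: unacked data would be overwritten
        slots[(int) (put++ % slots.length)] = item;
        return true;
    }

    @SuppressWarnings("unchecked")
    public synchronized T get() {
        if (get >= put) return null;                   // nothing new to read
        return (T) slots[(int) (get++ % slots.length)];
    }

    public synchronized void ack(long count) {
        ack = Math.min(get, ack + count);              // free slots up to what was consumed
    }

    public synchronized void rollback() {
        get = ack;                                     // re-deliver everything not yet acked
    }
}
```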
MetaManager is the incremental subscription & consumption information manager. The protocol between incremental subscription and consumption includes get/ack/rollback, namely (see the sketch after this list):
- Message getWithoutAck(int batchSize): lets you specify a batchSize and fetch multiple records at once; each call returns a Message containing a batch id (unique identifier) and entries (the actual data objects).
- void rollback(long batchId): as the name suggests, rolls back the last get request so the data can be fetched again; it is submitted with the batchId obtained from get, to avoid mis-operation.
- void ack(long batchId): as the name suggests, confirms that consumption succeeded and notifies the server that the data can be deleted; it is submitted with the batchId obtained from get, to avoid mis-operation.
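A minimal consumption loop against the official canal client API might look like the following sketch; the server address 127.0.0.1:11111, the destination name example, and the batch size are assumptions for illustration.

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.Message;

public class CanalStreamingConsumer {
    public static void main(String[] args) {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        try {
            connector.connect();
            connector.subscribe(".*\\..*");      // same syntax as canal.instance.filter.regex
            while (true) {
                Message message = connector.getWithoutAck(100);   // batchSize = 100
                long batchId = message.getId();
                if (batchId == -1 || message.getEntries().isEmpty()) continue;  // no new data yet
                try {
                    process(message);            // hand the batch to downstream processing
                    connector.ack(batchId);      // confirm consumption; the server may discard the data
                } catch (Exception e) {
                    connector.rollback(batchId); // re-deliver this batch on the next get
                }
            }
        } finally {
            connector.disconnect();
        }
    }

    private static void process(Message message) { /* e.g. write entries to Kafka */ }
}
```

The pattern is exactly the protocol above: getWithoutAck fetches a batch, ack confirms it on success, and rollback re-queues it on failure.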
The protocol interactions between incremental subscriptions and consumption are as follows:
Incremental subscription and consumption protocol
Canal's get/ack/rollback protocol differs from a conventional JMS protocol in that it allows get and ack to be processed asynchronously: for example, you can call get several times in a row and then commit the ack/rollback asynchronously, in order. In the project this is referred to as the streaming API.
Benefits of the streaming API design:
- Get and ack are asynchronous, which reduces the network latency and overhead caused by acks (99% of the time the state is normal; the abnormal rollback is the exceptional case, and there is no need to sacrifice overall performance for it).
- When get returns data but business consumption hits a bottleneck or needs multi-process/multi-threaded consumption, you can keep polling get and keep dispatching tasks, improving parallelism. (A case from a real business: consumption had to cross a China-to-US network link, so a single operation took at least 200 ms, and parallelization was necessary to reduce latency.)
The streaming API is designed as follows:
Streaming API
- Each get operation generates a mark in the meta; marks increase monotonically to guarantee their uniqueness during a run (a bookkeeping sketch follows this list).
- Each get operation continues from the cursor recorded by the previous mark; if no mark exists, it continues fetching from the last ack cursor.
- Acks must be issued in mark order and cannot skip ahead. An ack deletes the current mark and updates the corresponding position to the cursor of the last ack.
- If an exception occurs, the client can initiate a rollback, which resets everything: all marks are deleted, the get positions are cleaned up, and the next request continues fetching from the last ack cursor.
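The following is a small bookkeeping sketch of these rules (an assumption about the behavior, not canal's source code): marks are handed out in increasing order, an ack must match the oldest outstanding mark, and a rollback drops every outstanding mark and falls back to the last acked cursor.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of in-order ack bookkeeping for the streaming API. */
public class BatchTracker {
    private final Deque<Long> pendingMarks = new ArrayDeque<>(); // marks of outstanding gets
    private long ackedCursor = 0;   // cursor of the last acked batch
    private long getCursor = 0;     // cursor reached by get
    private long nextMark = 1;

    /** Each get produces a new, strictly increasing mark. */
    public synchronized long onGet(long newCursor) {
        long mark = nextMark++;
        pendingMarks.addLast(mark);
        getCursor = newCursor;
        return mark;
    }

    /** Acks must arrive in mark order; skipping a batch is rejected. */
    public synchronized void ack(long mark, long cursorOfBatch) {
        Long head = pendingMarks.peekFirst();
        if (head == null || head != mark) {
            throw new IllegalStateException("ack out of order: expected " + head + " but got " + mark);
        }
        pendingMarks.removeFirst();
        ackedCursor = cursorOfBatch;    // advance the durable cursor
    }

    /** Rollback drops every outstanding mark; the next get resumes from the acked cursor. */
    public synchronized long rollback() {
        pendingMarks.clear();
        getCursor = ackedCursor;
        return ackedCursor;
    }
}
```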
Isn't this streaming API similar to the way an HDFS write transmits packets through a pipeline? A packet is first placed in the dataQueue and then transmitted downstream, at which point it moves to the ackQueue to await the downstream ack, which is likewise asynchronous.
HA mechanism
Canal supports HA. Its implementation also relies on ZooKeeper, using features such as watchers and ephemeral nodes (whose lifetime is bound to the client session), similar to HDFS HA.
Canal's HA is split into two parts: the canal server and the canal client each have their own HA implementation.
- Canal server: to reduce dump requests against MySQL, only one of the instances on different servers (the same instance deployed on different servers) may be running at any time; the others stay in standby (standby is a state of the instance).
- Canal client: to guarantee ordering, an instance can only be operated by one canal client at a time; otherwise the order in which clients receive data cannot be guaranteed.
The architecture diagram for canal server HA is as follows:
Ha
Approximate steps:
- When a canal server wants to start a canal instance, it first attempts to acquire the start-up right via ZooKeeper (implementation: create an ephemeral node; whoever creates it successfully is allowed to start).
- After the ZooKeeper node is created successfully, the corresponding canal server starts the canal instance; on the servers that failed to create the node, the instance stays in standby.
- Once ZooKeeper detects that the instance node created by canal server A has disappeared, it notifies the other canal servers to repeat step 1 and re-elect a canal server to start the instance.
- Each time the canal client connects, it first asks ZooKeeper which server is currently running the canal instance and then links to it; once the link becomes unavailable, it retries the connection.
The canal client works in a similar way to the canal server: it also relies on ZooKeeper's preemptive creation of ephemeral nodes for coordination.
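The election itself boils down to "whoever creates the ephemeral node wins". Below is a minimal ZooKeeper sketch of that idea; the node path and server name are assumptions, not necessarily the exact layout canal uses.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RunningNodeElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("10.20.144.51:2181", 30000, event -> { });
        String path = "/otter/canal/destinations/example/running";   // assumed path layout
        try {
            // Ephemeral node: it disappears automatically when this server's session dies,
            // which is what lets the standby servers detect the failure and re-elect.
            zk.create(path, "server-A".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("won the election, starting the canal instance");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another server is running; staying in standby and watching the node");
            zk.exists(path, true);   // watch for the node to disappear, then retry the create
        }
    }
}
```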
Canal Deployment and Use
MySQL Configuration
Canal synchronizes data by reading MySQL's binlog, but the binlog is off by default and must be enabled. To guarantee the consistency of the synchronized data, the log format must be row-based replication (RBR). Enable the binlog in my.cnf:

```
[mysqld]
log-bin=mysql-bin     # adding this line enables the binlog
binlog-format=ROW     # choose row mode
server_id=1           # required for MySQL replication; must not collide with canal's slaveId
```
After changing my.cnf you need to restart MySQL; there are many ways to restart, so use whichever fits your setup.
Canal configuration
As introduced above, canal is composed of a server and instances; one server can contain many instances, and each instance corresponds to one database instance. Accordingly, canal's configuration falls into two categories: one is the server configuration, named canal.properties; the other is the instance configuration, named instance.properties, which usually lives in a newly created directory under conf/ with the same name as the instance.
First, several key properties in canal.properties:
| Parameter name | Parameter description | Default value |
| --- | --- | --- |
| canal.destinations | List of instances deployed on the current server | none |
| canal.conf.dir | Path of the conf/ directory | ../conf |
| canal.instance.global.spring.xml | Component file for the global spring configuration | classpath:spring/file-instance.xml (the spring directory is relative to canal.conf.dir) |
| canal.zkServers | Connection string for the ZooKeeper cluster used by the canal server | none |
| canal.zookeeper.flush.period | How often canal flushes persistent data to ZooKeeper, in milliseconds | 1000 |
| canal.file.data.dir | Directory where canal persists data to file | ../conf (same directory as instance.properties by default, which is convenient for operations and backup) |
| canal.file.flush.period | How often canal flushes persistent data to file, in milliseconds | 1000 |
| canal.instance.memory.batch.mode | Caching mode of the canal in-memory store: 1. ITEMSIZE: limit by buffer.size, i.e. only the number of records; 2. MEMSIZE: limit the cache size by buffer.size * buffer.memunit | MEMSIZE |
| canal.instance.memory.buffer.size | Number of buffer slots that can be cached in the canal in-memory store; must be a power of 2 | 16384 |
| canal.instance.memory.buffer.memunit | Unit size of one memory record (default 1 KB); together with buffer.size it determines the final memory footprint | 1024 |
Here's a look at instance.properties, which has fewer attributes:
| Parameter name | Parameter description | Default value |
| --- | --- | --- |
| canal.instance.mysql.slaveId | The serverId concept from MySQL replication; must be unique within the current MySQL cluster | 1234 |
| canal.instance.master.address | Connection address of the MySQL primary | 127.0.0.1:3306 |
| canal.instance.master.journal.name | Binlog file to start from when connecting to the MySQL primary | none |
| canal.instance.master.position | Binlog offset to start from when connecting to the MySQL primary | none |
| canal.instance.master.timestamp | Binlog timestamp to start from when connecting to the MySQL primary | none |
| canal.instance.dbUsername | MySQL database account | canal |
| canal.instance.dbPassword | MySQL database password | canal |
| canal.instance.defaultDatabaseName | Default schema when connecting to MySQL | none |
| canal.instance.connectionCharset | Character set used when parsing MySQL data | UTF-8 |
| canal.instance.filter.regex | Tables of interest when parsing MySQL data, written as Perl regular expressions; separate multiple patterns with commas (,); escape characters need a double backslash | .*\\..* |
Besides the two configuration files above, one directory under conf/ deserves emphasis: the spring directory, which stores the instance.xml configuration files. The currently supported instance.xml files are memory-instance.xml, file-instance.xml, default-instance.xml, and group-instance.xml. They mainly govern how the incremental subscription and consumption relationship information (the parse position and the consumption position) is maintained.
These two position components currently have several implementations:
- memory (used in memory-instance.xml)
- zookeeper
- mixed
- file (used in file-instance.xml, set up as file + memory mode: write to memory first, periodically flush the data to a local file)
- period (used in default-instance.xml, set up as zookeeper + memory mode: write to memory first, periodically flush the data to ZooKeeper)
The behavior of each of these configurations is described below.
memory-instance.xml: all components (parser, sink, store) use the in-memory mode, and the position records are also kept in memory, so after a restart parsing returns to the initial position.
Features: fastest, fewest dependencies (no ZooKeeper required).
Scenario: usually applied to quickstart demos or to data analysis after a problem has occurred; it should not be used in a production environment.
file-instance.xml: all components (parser, sink, store) use the file-based persistence mode (where on file is the component content persisted???); note that the HA mechanism is not supported.
Features: supports single-machine persistence.
Scenario: production environments without HA requirements; simple to use.
default-instance.xml: all components (parser, sink, store) use the persistence mode; currently persistence mainly writes to ZooKeeper so that the data is shared across the cluster. (Is the persisted content only the position information???)
Features: supports HA.
Scenario: production environments with clustered deployment.
group-instance.xml: mainly for multi-database consolidation; multiple physical instances can be combined into a single logical instance that clients access.
Scenario: sharded databases. For example, if product data is split across 4 databases, each database has its own instance; without group, a business consumer would need to start 4 clients and link to the 4 instances separately. With group, they can be merged into one logical instance on the canal server, so only 1 client needs to be started and linked to this logical instance.
Canal Example Deployment
- Create a user in the MySQL database to be synchronized, for replicating the data. Here both the new user name and the password are canal; the commands are as follows:

```sql
CREATE USER canal IDENTIFIED BY 'canal';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
-- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%';
FLUSH PRIVILEGES;
```
- After MySQL has the canal user with the required permissions, set up canal's configuration files (canal.properties and instance.properties).
The default canal.properties and instance.properties are sufficient here (this just runs a sample; for production, set the specific parameters described above).
- After canal is configured, start the canal client (the client's job is to persist the binlog parsed by canal into a storage medium).
Canal itself does not ship a client component; one has to be developed against its API. Here the official example client code was packaged into a jar to consume canal messages.
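As an illustration of what such a client could do, here is a hedged sketch that converts one canal row-change entry into a crude JSON string and publishes it to Kafka; the topic name mysql_binlog and the JSON layout are assumptions for illustration, not an official format.

```java
import java.util.Properties;
import com.alibaba.otter.canal.protocol.CanalEntry;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CanalToKafka {
    private final KafkaProducer<String, String> producer;

    public CanalToKafka(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Turn one row-change entry into a crude JSON string and send it to Kafka. */
    public void publish(CanalEntry.Entry entry) throws Exception {
        if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) return;   // skip transaction markers
        CanalEntry.RowChange rowChange = CanalEntry.RowChange.parseFrom(entry.getStoreValue());
        String db = entry.getHeader().getSchemaName();
        String table = entry.getHeader().getTableName();
        for (CanalEntry.RowData row : rowChange.getRowDatasList()) {
            StringBuilder json = new StringBuilder("{\"database\":\"").append(db)
                    .append("\",\"table\":\"").append(table)
                    .append("\",\"type\":\"").append(rowChange.getEventType()).append("\",\"data\":{");
            boolean first = true;
            for (CanalEntry.Column col : row.getAfterColumnsList()) {
                if (!first) json.append(',');
                json.append('"').append(col.getName()).append("\":\"").append(col.getValue()).append('"');
                first = false;
            }
            json.append("}}");
            producer.send(new ProducerRecord<>("mysql_binlog", db + "." + table, json.toString()));
        }
    }
}
```

In a real deployment this publish method would be called from the getWithoutAck/ack/rollback loop shown earlier.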
Canal HA Configuration
Canal's HA mechanism relies on ZooKeeper. You need to modify canal.properties as follows:

```
# zookeeper cluster address
canal.zkServers = 10.20.144.51:2181
# choose the position-record mode
canal.instance.global.spring.xml = classpath:spring/default-instance.xml
```
Then change the instance.properties of the instance on both canal machines as follows:

```
canal.instance.mysql.slaveId = 1234   # change to 1235 on the other machine; slaveIds must not repeat
canal.instance.master.address = 10.20.144.15:3306
```
After configuration, start the canal process by executing sh bin/startup.sh on both servers.
When consuming, the client can either specify the ZooKeeper address and instance name directly, or let the canal client automatically look up the currently working node from the running node in ZooKeeper and establish a link to it.
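With the official client API this typically means using a cluster connector that resolves the working server through ZooKeeper; a minimal sketch (the addresses and destination name are assumptions) is:

```java
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;

public class ClusterClient {
    public static void main(String[] args) {
        // Ask ZooKeeper which canal server currently holds the running node for this
        // destination; the connector re-resolves and reconnects if that server goes away.
        CanalConnector connector = CanalConnectors.newClusterConnector(
                "10.20.144.51:2181",   // canal.zkServers
                "example",             // instance (destination) name
                "", "");               // username / password if configured
        connector.connect();
        connector.subscribe(".*\\..*");
        // ... getWithoutAck / ack loop as in the earlier sketch ...
        connector.disconnect();
    }
}
```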
Maxwell Introduction
Maxwell also captures MySQL data in real time based on the binlog. Compared with canal, Maxwell is more like canal server + a real-time client (data extraction + data conversion).
Maxwell integrates a Kafka producer: it reads updates from the binlog and writes them directly into Kafka, whereas canal requires you to develop your own real-time client to write the binlog content read by canal into Kafka.
Maxwell Features:
- Supports bootstrap, to synchronize historical data
- Integrates Kafka and lands data directly into Kafka
- DML and DDL in the binlog are pattern-matched and decoded into JSON with a schema (which makes it easier to later reassemble into statements a NoSQL store can accept)
{"Database": "Test", "table": "E", "type": "Update", "TS": 1488857869, "XID": 8924, "commit": true, "data": {"id": 1, "M" : 5.556666, "Torvalds": null}, "old": {"M": 5.55}}
Disadvantages:
- One MySQL instance requires one Maxwell process
- Bootstrap is implemented with a select *
Maxwell has only one configuration file, config.properties, in its home directory. Besides the address of the MySQL master and the Kafka address, you also need to configure a MySQL address where Maxwell stores its own metadata; Maxwell records binlog information there, such as the binlog name and position.
Tool comparison
The above covered canal's principles and deployment. Maxwell and mysql_streamer work similarly for real-time MySQL data capture, so their principles are not introduced again; here we only compare the three:
| Feature | Canal | Maxwell | mysql_streamer |
| --- | --- | --- | --- |
| Language | Java | Java | Python |
| Activity | active | active | inactive |
| HA | supported | needs customization | supported |
| Data landing | customized | lands into Kafka | lands into Kafka |
| Partitioning | supported | not supported | not supported |
| Bootstrap | not supported | supported | supported |
| Data format | free format | JSON (fixed format) | JSON (fixed format) |
| Documentation | fairly detailed | fairly detailed | somewhat rough |
| Random read | supported | supported | supported |
Each of the above only synchronizes the binlog of real-time change data from MySQL into Kafka. To update Hadoop in real time, you also need a real-time database to hold the data, plus custom development that parses the Kafka data into DML the NoSQL database can recognize and applies it in real time, keeping it in sync with the data in MySQL.
Infrastructure architecture
The architecture diagram is as follows:
Infrastructure diagram
The dashed box is an optional scenario
Scenario Comparison
- Scenario 1 uses Alibaba's open-source canal to extract MySQL binlog data, and additionally requires developing a data conversion tool that turns the data parsed from the binlog into JSON with its own schema and writes it to Kafka. Scenario 2 uses Maxwell, which both extracts the MySQL binlog data and converts it into JSON with its own schema before writing it to Kafka.
- Scenario 1 does not support synchronizing historical data that already exists in a table; that feature would have to be developed (using Sqoop for historical data synchronization is not flexible enough, since it makes the result table identical to the source table structure, which differs from the schema the data exchange platform requires). Scenario 2 provides a solution for synchronizing historical data.
- Scenario 1 supports HA deployment, while Scenario 2 does not support HA.
The differences between Scenario 1 and Scenario 2 all lie before Kafka; once the data is cached in Kafka, both need a custom data routing component to parse the data with its schema and write it into the target store.
The data routing component is mainly responsible for reading the data from Kafka in real time and writing it into the target store (for example, saving all log data to HDFS; the data can also be landed in any JDBC-capable database, in HBase, Elasticsearch, and so on).
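A minimal sketch of such a routing component, assuming Maxwell-style JSON records on an assumed topic named maxwell, could be a plain Kafka consumer plus a JSON parser; the actual write to HBase/Elasticsearch/JDBC is left as a comment.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DataRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "data-router");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("maxwell"));   // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    JsonNode row = mapper.readTree(record.value());
                    String database = row.get("database").asText();
                    String table = row.get("table").asText();
                    String type = row.get("type").asText();            // insert / update / delete
                    JsonNode data = row.get("data");
                    // Route to the target store: build an upsert/delete for HBase,
                    // Elasticsearch, HDFS, or any JDBC database here.
                    System.out.printf("%s %s.%s -> %s%n", type, database, table, data);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```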
In conclusion:
The features that need to be developed for Scenario 1 are:
- Bootstrap functionality
- A real-time data conversion tool
- A data routing tool
The features that need to be developed for Scenario 2 are:
- A data routing tool
- An HA module (HA is not supported initially, so the development urgency is low)
Since the data routing tool has to be developed in both scenarios, I lean toward the second scenario: in the initial trial phase it can deliver results in the short term, lets us validate the idea faster, and makes it easier to find problems early and adjust the plan in time. Even if Maxwell ultimately fails to meet the requirements in Scenario 2, we may still be able to align the output format of our real-time data conversion tool with Maxwell's, so that the data routing tool we invested effort in early on can continue to be used without redevelopment.
Use the incremental log as the foundation of all systems; downstream data consumers consume the log by subscribing to Kafka.
For example:
Big data users can save the data into Hive tables or Parquet files for Hive or Spark queries;
Users providing a search service can save it into Elasticsearch or HBase;
Consumers providing a caching service can cache the logs in Redis or Alluxio;
Data synchronization users can save the data into their own databases;
Because Kafka logs can be consumed repeatedly and are cached for a period of time, each consumer can stay consistent with the database by consuming the Kafka log, while also preserving real-time behavior.
{"Database": "Test", "table": "E", "type": "Update", "TS": 1488857869, "XID": 8924, "commit": true, "data": {"id": 1, "M" : 5.556666, "Torvalds": null}, "old": {"M": 5.55}}
{"Database": "Test", "table": "E", "type": "Insert", "TS": 1488857922, "XID": 8932, "commit": true, "data": {"id": 2, "M": 4.2 , "Torvalds": null}}