The application of hashing and SOLR in a mass-data distributed search engine

Tags: hash, HTTP request, SOLR, backup

SOLR is a standalone enterprise-class search application server that provides a Web-service-like API. Users submit XML files in a specific format to the search engine server via HTTP requests, and the server generates the index.
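As a sketch of that submission flow, the snippet below builds the XML `<add>` payload that SOLR's update handler accepts, and shows (commented out) how it might be posted over HTTP. The URL, core name, and field names are assumptions for illustration, not taken from this article.

```python
import urllib.request

def solr_add_xml(docs):
    """Render records as the XML <add> format SOLR's update handler accepts."""
    fields = "".join(
        "<doc>"
        + "".join(f'<field name="{k}">{v}</field>' for k, v in d.items())
        + "</doc>"
        for d in docs
    )
    return f"<add>{fields}</add>"

# Field names "id" and "title" are illustrative.
payload = solr_add_xml([{"id": 1, "title": "hello"}])

# Posting to a local SOLR instance (host, port, and core name are assumptions):
# req = urllib.request.Request(
#     "http://localhost:8983/solr/mycore/update?commit=true",
#     data=payload.encode("utf-8"),
#     headers={"Content-Type": "text/xml"})
# urllib.request.urlopen(req)
```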

Most Internet startups are grassroots ventures: no powerful servers, and no money for an expensive, massive database. That batch after batch of entrepreneurs have succeeded under such harsh conditions owes a great deal to today's open-source technologies and big-data architectures. With open-source software such as MySQL and Nginx, a sound architecture and low-cost servers can support systems serving tens of millions of users. Large Internet companies like Sina Weibo, Taobao, and Tencent have built their platforms on plenty of free, open-source systems. So it does not matter which tools you use, as long as you apply a reasonable solution in the situation at hand.

So how do you build a good system architecture? That topic is far too big, so here we only discuss how data is routed. For example, suppose our database server can store only 200 records, and a sudden promotion is expected to bring 600.

There are two ways to expand: horizontally or vertically.

Vertical scaling means upgrading the server's hardware resources. But the higher a machine's performance and configuration, the higher its price, which the average small company cannot afford.

Horizontal scaling means using a number of low-cost machines to provide the service. If one such machine can handle only 200 records, then 3 machines can handle 600, and if traffic grows in the future, more machines can be added quickly. In most cases you choose horizontal scaling, as in the following figure:

Now there is the question of how these 600 records get routed to the corresponding machines. For a balanced allocation, assume our 600 records have uniform IDs from 1 to 600; to divide them into 3 piles we can take the ID mod 3. In the real world, the ID may well be a string rather than a number, in which case you convert the string to a hash code first and then take the modulo.
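The ID-mod-3 routing just described can be sketched as follows. This is illustrative only; note that Python's built-in hash() is randomized per process for strings, so a real system would use a stable hash such as CRC32 or MD5.

```python
def route(key, n_machines=3):
    """Route a record to a machine by taking its key modulo the machine count."""
    if not isinstance(key, int):
        # String keys: convert to an integer hash code first.
        # (Python's hash() is per-process randomized for str; use a
        # stable hash such as CRC32 or MD5 in production.)
        key = hash(key)
    return key % n_machines

# Buckets 0..2 stand in for the article's 3 machines.
buckets = {0: [], 1: [], 2: []}
for record_id in range(1, 601):
    buckets[route(record_id)].append(record_id)
# Each bucket ends up with exactly 200 of the 600 IDs.
```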

This seems to solve our problem for the moment: all the data is evenly distributed and no machine exceeds its load. But once the data must be stored and read back, it is not that easy. What happens when business grows? Following the horizontal-scaling approach above, you add another server. But adding that server creates problems. Consider the following example: 9 records in total, placed on 2 machines (1 and 2). Machine 1 stores 1, 3, 5, 7, 9 and machine 2 stores 2, 4, 6, 8. What if we add a machine 3? A large data migration occurs: machine 1 now stores 1, 4, 7; machine 2 stores 2, 5, 8; machine 3 stores 3, 6, 9. As shown in the figure:

As the figure shows, machine 1 migrates out 3, 5, 9 and machine 2 migrates out 4, 6; everything is redistributed according to the new ordering. Redistributing a small amount of data is cheap, but with hundreds of billions of records at the terabyte level, the operating cost is very high, taking anywhere from a few hours to a few days. Moreover, the migration puts a heavy load on the original database machines. So one may well doubt whether this style of horizontal scaling is reasonable at all.
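The migration cost in this 9-record example can be checked with a few lines of code (a sketch; machine numbering starts at 1, following the article):

```python
def assign(item, n_machines):
    # Machine numbering starts at 1: item i goes to machine ((i - 1) % n) + 1,
    # which reproduces the article's layout (odd/even on 2 machines, etc.).
    return (item - 1) % n_machines + 1

items = range(1, 10)
before = {i: assign(i, 2) for i in items}   # 2 machines
after = {i: assign(i, 3) for i in items}    # a 3rd machine is added
moved = [i for i in items if before[i] != after[i]]
# Items 3, 5, 9 leave machine 1 and items 4, 6 leave machine 2:
# 5 of the 9 records have to migrate just to add one machine.
```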


Consistent hashing was proposed in exactly this context and is now widely used in distributed caching, for example in memcached. Below is a brief introduction to the basic principles of consistent hashing. The earliest paper: http://dl.acm.org/citation.cfm?id=258660. There are also many well-written Chinese articles online, such as: http://blog.csdn.net/x15594/article/details/6270242

The following simple example illustrates consistent hashing.

Preparation: 3 machines, numbered 1, 2, 3

9 numbers (1, 2, 3, 4, 5, 6, 7, 8, 9) that have yet to be allocated

Consistent hash algorithm architecture

Steps

First, construct a virtual ring of 2^32 points. Computers live in a binary world, so partitioning by a power of two makes it easy to distribute data evenly. Moreover, 2^32 is about 4.2 billion; however many servers we have, we will never exceed 4.2 billion, so both expansion and balance are guaranteed.

Second, compute a hash code for each of the three machines from its IP address (the hostname works too, as long as it uniquely distinguishes the machines) and map it onto the 2^32 ring. For example, suppose machine 1's hash code mod 2^32 comes out to 123 (a made-up value), machine 2's to 2300420, and machine 3's to 90203920. The three machines are thus mapped onto this virtual ring of 4.2 billion nodes.

Third, compute the data items (1-9) with the same method, taking each hash code mod 2^32 to place it on the ring. Suppose the resulting positions are 1:10, 2:23564, 3:57, 4:6984, 5:5689632, 6:86546845, 7:122, 8:3300689, 9:135468. Starting from the position where a data item lands, search clockwise and store the item on the first machine node found; if no node is found before passing 2^32, wrap around and store it on the first node. Here, items 1, 3, 7 (positions below 123) go to machine 1; items 2, 4, 9 (positions between 123 and 2300420) go to machine 2; and items 5, 6, 8 (positions between 2300420 and 90203920) go to machine 3.

At this point you may object that consistent hashing brings no visible benefit over traditional modulo, only added complexity. So let's perform the critical operation: adding a machine. With plain modulo we would have to redistribute all the data across four machines. What does consistent hashing do? Machine 4 is added, and its hash value mod 2^32 comes out to 12302012. Items 5 and 8 (positions greater than 2300420 and less than 12302012) now map to machine 4, while item 6 (position greater than 12302012 and less than 90203920) stays on machine 3. The only adjustment is deleting 5 and 8 from machine 3 and adding them to machine 4.
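The three steps above, and the effect of adding machine 4, can be sketched in code. This is a minimal ring that reuses the article's made-up positions directly instead of hashing IPs, so the numbers can be checked against the walkthrough.

```python
import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring over a 2**32 keyspace."""
    SPACE = 2 ** 32

    def __init__(self):
        self._positions = []   # sorted machine positions on the ring
        self._machines = {}    # position -> machine name

    def add_machine(self, name, position):
        bisect.insort(self._positions, position)
        self._machines[position] = name

    def remove_machine(self, position):
        self._positions.remove(position)
        del self._machines[position]

    def locate(self, key_position):
        # Clockwise lookup: first machine at or past the key's position,
        # wrapping around past 2**32 to the first machine on the ring.
        i = bisect.bisect_left(self._positions, key_position % self.SPACE)
        if i == len(self._positions):
            i = 0
        return self._machines[self._positions[i]]

# The article's (fictional) machine positions.
ring = ConsistentHashRing()
ring.add_machine("machine-1", 123)
ring.add_machine("machine-2", 2300420)
ring.add_machine("machine-3", 90203920)

# The article's data-item positions on the ring.
data = {1: 10, 2: 23564, 3: 57, 4: 6984, 5: 5689632,
        6: 86546845, 7: 122, 8: 3300689, 9: 135468}

before = {k: ring.locate(v) for k, v in data.items()}

ring.add_machine("machine-4", 12302012)
after = {k: ring.locate(v) for k, v in data.items()}
moved = [k for k in data if before[k] != after[k]]
# Only items 5 and 8 move (from machine-3 to machine-4).
```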

Likewise for deleting a machine: if machine 2 dies, the only data affected is what was stored on machine 2, and it migrates to the next node clockwise on the ring, machine 4 in the figure above.

By now you should understand the basic principle of consistent hashing. The algorithm still has a flaw, however: when there are few machine nodes and a large volume of data, the distribution may be quite unbalanced, leaving one server with far more data than the others. To solve this we introduce virtual server nodes. Say we have three machines in total, 1, 2, and 3, but cannot obtain any more; what then? Virtualize each machine into 3 nodes, i.e. 1a 1b 1c, 2a 2b 2c, 3a 3b 3c, making 9 nodes in all. Nodes 1a, 1b, 1c still correspond to physical machine 1, but 9 nodes are now distributed around the ring, so the data spreads out a little more evenly. As shown in the figure:
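One way to sketch the virtual-node scheme is below. MD5 is used here only as a convenient, stable, well-mixed hash, and the machine names are illustrative; the point is that each physical machine appears on the ring several times, and lookups map a virtual node back to its physical machine.

```python
import hashlib

SPACE = 2 ** 32

def position(key: str) -> int:
    """Place a key on the 2**32 ring using a stable hash (MD5 here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SPACE

# 3 physical machines, each virtualized into 3 ring nodes (a, b, c).
physical = ["machine-1", "machine-2", "machine-3"]
ring = {}
for m in physical:
    for suffix in ("a", "b", "c"):
        # Each virtual node maps back to its physical machine.
        ring[position(f"{m}-{suffix}")] = m

positions = sorted(ring)

def locate(key: str) -> str:
    """Clockwise lookup from the key's position, wrapping past 2**32."""
    pos = position(key)
    for p in positions:
        if p >= pos:
            return ring[p]
    return ring[positions[0]]

# Spread 10,000 items over the ring; more virtual nodes -> smoother spread.
counts = {m: 0 for m in physical}
for n in range(10000):
    counts[locate(f"item-{n}")] += 1
```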

After all this talk of consistent hashing, what does it have to do with distributed search? We are using SOLR 4 to build a distributed search system. In testing the SolrCloud distributed platform, submitting 20 documents actually took dozens of seconds, so we abandoned SolrCloud. Instead we hacked together our own SOLR platform, without ZooKeeper as the distributed-consistency manager, managing the data-distribution mechanism ourselves. Since we manage data distribution ourselves, we have to handle index creation and index updates ourselves, and that is where our consistent hashing comes in. The overall architecture is shown below:

The index build/update machines need to maintain the data-distribution map, so that for any record's key they can find where its data lives and update it. The main consideration here is how to build and update the index efficiently and reliably.

The backup servers guard against a server going down; service can be restored quickly from a backup server.

The read servers implement read/write separation, so that index writes do not affect query performance.

The cluster-management server monitors server status and raises alarms across the whole cluster.

As the business grows, the whole cluster can be partitioned by data type, such as users, microblogs, and so on. Each type is built according to the diagram above, which satisfies distributed search at ordinary performance levels.
