One of my most fulfilling tasks as a developer is optimization: making parts of our product, Kiln, run faster. Even if your product has the best features and the most appealing interface, it's still worthless if it's slow and painful to use. Last year, my team got the chance to optimize the slowest part of Kiln, and we ended up making it dramatically faster than it was before.
This article describes how a great tool called Elasticsearch helped us make Kiln 1,000 times faster.
Kiln is a source control tool that provides hosting for Mercurial and Git repositories, along with code review and other practical features. We released Kiln in early 2010, and admittedly the v1 feature set was minimal, consisting only of repository management, code review, and code push and pull with permission controls. Still, as a core product we knew it delivered real value to users. But as we used Kiln to develop Kiln itself, we began to notice some gaps.
One of the features we wanted most was search. Manually digging through a mountain of source code is a near-impossible task, and tools like grep require you to keep a copy of the source on your own machine. So when we built version 2.0 in the year after the initial release, we decided that search over commit messages, file names, and file contents would be one of the hallmark features of the update.
SQL Server
During that time, we evaluated a number of different search engines. For commit-message search, we ultimately chose a tool we already knew: Microsoft SQL Server. Commits were stored in a database table in an easily queried format, so we could implement search simply by enabling SQL Server's Full-Text Search feature. We took a similar approach to file-name search: store the names in a database table and let SQL Server do the rest.
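The original queries aren't shown here, but a minimal sketch of the approach might look like the following, assuming a hypothetical Commits table with a full-text index on its Message column:

```python
# Sketch of commit-message search via SQL Server Full-Text Search.
# The table (Commits), columns, and connection string are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=KilnDemo;Trusted_Connection=yes;"
)

def search_commits(term: str):
    # CONTAINS hits the full-text index instead of a slow LIKE '%...%' scan.
    sql = "SELECT TOP 50 CommitId, Message FROM Commits WHERE CONTAINS(Message, ?)"
    return conn.cursor().execute(sql, (f'"{term}"',)).fetchall()

for commit_id, message in search_commits("timeout"):
    print(commit_id, message)
```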
OpenGrok
Searching the code itself was a much bigger challenge, requiring a different set of tools. After comparing various code search engines, we settled on OpenGrok, an excellent tool that seemed to meet our needs. OpenGrok starts from the code under a folder, parses it with Ctags (which natively supports many languages), and builds an index with Apache Lucene. OpenGrok doesn't just index every class, method, and variable in the code; it also distinguishes definitions from references, so you can search not only for where a method is defined but also for every place it is used.
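Roughly speaking, that definition-versus-reference distinction looks like the sketch below. The instance URL is hypothetical; the defs=/refs= parameters follow OpenGrok's web search form.

```python
# Sketch of OpenGrok's definition-vs-reference search.
# The host is hypothetical; defs=/refs= follow OpenGrok's search form.
import requests

BASE = "http://opengrok.example.com/source/search"

# Where is the symbol 'parse_commit' *defined*?
definitions = requests.get(BASE, params={"defs": "parse_commit"})

# And where is it *referenced*?
references = requests.get(BASE, params={"refs": "parse_commit"})

print(definitions.url)  # .../search?defs=parse_commit
print(references.url)   # .../search?refs=parse_commit
```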
We shipped Kiln 2.0 with search and many other features, and we were very satisfied with what we had built: it let you drill into your code, including its history, commit messages, and every line of code your team had committed.
But as Kiln grew, we gradually realized its search wasn't as good as we had hoped. To be honest, even under the best conditions, searches were slower than we expected, and at peak times the feature was essentially unusable. OpenGrok is an impressive tool, but it wasn't up to handling tens of thousands of repositories and several terabytes of code. Index updates failed constantly, and indexing the live code required keeping a full checkout of every repository rather than just its history, which multiplied our storage requirements. SQL Server's full-text search was starting to look slow at our scale, and it put a huge burden on the database server. On top of that, it had a number of limitations that blocked new features we wanted to add to Kiln.
In early 2012, we decided it was time to rethink our search architecture. Kiln's original design created a new database for every Kiln account, and SQL Server scales poorly when a single server has to maintain thousands of databases, so we decided to redesign Kiln as a multi-tenant database application, with all account data kept in one shared database. Since we had to migrate every account's separate database into the new one anyway, we had an opportunity to make some fundamental changes to how we store data, which meant we could also replace our search engine. It was a shame to throw away the existing OpenGrok and Full-Text Search work, but what we gained in exchange was far greater.
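The new schema isn't described in detail here, but the multi-tenant pattern generally looks like the sketch below: one shared set of tables keyed by account, instead of one database per account. All names are hypothetical, and sqlite3 stands in for SQL Server to keep the example self-contained.

```python
# Sketch of the multi-tenant pattern: every row carries a tenant key
# (account_id), so thousands of accounts share one database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE commits (
        account_id INTEGER NOT NULL,  -- the tenant key
        commit_id  TEXT    NOT NULL,
        message    TEXT    NOT NULL,
        PRIMARY KEY (account_id, commit_id)
    )
""")
db.execute("INSERT INTO commits VALUES (1, 'abc123', 'Fix search timeout')")
db.execute("INSERT INTO commits VALUES (2, 'def456', 'Add repo hosting')")

# Every query is scoped to one account, so tenants never see each
# other's data even though they share the same tables.
print(db.execute(
    "SELECT commit_id, message FROM commits WHERE account_id = ?", (1,)
).fetchall())
```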
Elasticsearch vs. Solr
So, back at the drawing board, we rethought what an excellent search feature should look like in 2012. We wanted search to be Kiln's best feature, a flagship people would rave about. After some research, we narrowed the field to two different but superficially similar search engines: Elasticsearch and Apache Solr. Both are built on Lucene, the most powerful open-source search library in the world, and both provide a friendly interface on top of it that hides much of Lucene's complexity. Both offer JSON-based APIs, each with its own query capabilities, while exposing the full power of Lucene. So which one was the better fit for Kiln?
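For a sense of what those JSON APIs look like, here is a minimal Elasticsearch query sketch over plain HTTP; the index name (commits) and field (message) are hypothetical, and a local node on port 9200 is assumed:

```python
# Sketch of a search using Elasticsearch's JSON query DSL.
# Index and field names are hypothetical; assumes a local node on :9200.
import json
import requests

query = {"query": {"match": {"message": "search timeout"}}, "size": 10}

resp = requests.post(
    "http://localhost:9200/commits/_search",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])
```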
After reading up on and experimenting with each tool, Elasticsearch seemed to have the upper hand: it was easy to use, powerful, scalable, and fast. Getting Elasticsearch running is very simple: if you have Java installed, just download the latest version and start it. Within a few minutes I was able to define a schema and store test data. The available queries are described in Elasticsearch's documentation, but being able to run my own sample queries against real data helped me learn the best way to use it. The final test was to make sure Elasticsearch could stand up to Kiln's entire dataset on demand. We weren't going to buy new servers just for a test, so we resorted to a bit of a hack. Elasticsearch runs well on commodity hardware and can pool the resources of every machine you give it into a single cluster, so we went around to nearly every developer in the company and had them download Elasticsearch and join a cluster on the office network. In an afternoon of testing, we loaded hundreds of gigabytes of data into Elasticsearch and queried it back out. Not only did it hold up, it returned results within milliseconds, even under the pressure of a large volume of writes. Solr failed to meet our expectations: its read performance degraded noticeably under heavy writes, while Elasticsearch stayed fast. Clearly, Elasticsearch was our solution.
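How the test data was loaded isn't spelled out above, but bulk loading in Elasticsearch normally goes through its _bulk endpoint, as in the sketch below (which uses the modern form of the API; the index name and fields are hypothetical). It's also worth noting that Elasticsearch of that era discovered peers via multicast by default, so every office machine running a node with the same cluster.name joined the cluster automatically.

```python
# Sketch of bulk-loading documents through Elasticsearch's _bulk endpoint,
# the usual way to push large datasets in quickly. Names are hypothetical.
import json
import requests

docs = [
    {"commit_id": "abc123", "message": "Fix search timeout"},
    {"commit_id": "def456", "message": "Add repo hosting"},
]

# The bulk API takes newline-delimited JSON: an action line, then the doc.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "commits"}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"  # the trailing newline is required

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False when every document indexed cleanly
```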