Project notes: preparing data for enterprise-level search indexing
1. First, identify the scope of data that needs to be searchable
2. Create a corresponding index library for this base data, to store the data that needs to be indexed
To give a concrete example of the core of such a solution from my own hands-on experience: I previously built a similar enterprise document search solution, in which the various kinds of technical documents were indexed into Autonomy, and the front-end platform services then called the RESTful interface provided by Autonomy to retrieve the data.
Data Preparation steps:
1. Prepare the required database tables to record the data that needs to be indexed
Business table:
Node document business table (node): the main table holding the documents that need to be indexed
Delta tables:
Index_cm_inc (General Document Delta Table): stores the data that has been added, deleted, or modified; a trigger on the main business table (node) monitors changes and records the corresponding operation in this delta table
Index_kn_inc (Knowledge Base Document Delta Table): same as above; it exists separately only because documents are stored in different index libraries according to their category
Index tables:
INDEX_CM (General Document Index Table): the Autonomy crawler reads this table's records and indexes the corresponding documents into the Autonomy document library
INDEX_KN (Knowledge Base Document Index Table): same as above
Index status tables:
Index_cm_status (General Document Index Status Table): after the Autonomy crawler finishes, this table records, for each document, whether the crawl and the index succeeded
Index_kn_status (Knowledge Base Document Index Status Table): same as above
(A minimal schema sketch of these tables follows.)
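To make the layout concrete, here is a minimal sketch of what the general-document tables might look like. The column names (doc_id, op_type, file_path, success, and so on) are my own illustrative assumptions, not the project's real schema, and SQLite is used only to keep the example self-contained; the knowledge-base tables (index_kn_*) would mirror these.

```python
import sqlite3

# Sketch of the three table roles described above (delta, index, status),
# shown for the "cm" (general document) category only.
DDL = """
CREATE TABLE IF NOT EXISTS index_cm_inc (      -- general document delta table
    doc_id    TEXT PRIMARY KEY,
    op_type   TEXT,                            -- 'I' insert / 'U' update / 'D' delete
    op_time   TEXT
);
CREATE TABLE IF NOT EXISTS index_cm (          -- general document index table
    doc_id    TEXT PRIMARY KEY,
    title     TEXT,
    content   TEXT,
    file_path TEXT                             -- location of the physical file
);
CREATE TABLE IF NOT EXISTS index_cm_status (   -- general document index status table
    doc_id     TEXT PRIMARY KEY,
    index_date TEXT,
    success    INTEGER                         -- 1 = indexed successfully, 0 = failed
);
"""

conn = sqlite3.connect("search_prep.db")
conn.executescript(DDL)
conn.commit()
```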
2. Data preparation and processing flow
2.1 First, populate the delta tables
Create triggers on the business table (node) for inserts, updates, and deletes; the triggers capture the changed data and write it into the delta table of the corresponding type. (Alternatively, you can skip the triggers and have the application itself insert into the corresponding delta table; each approach has its own advantages and disadvantages, in my view. A minimal sketch of the application-level alternative is shown below.)
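This is a rough sketch of the trigger-free alternative, where the application records the change itself. The record_change helper, table names, and op_type codes are illustrative assumptions built on the schema sketch above, not the project's actual code.

```python
import sqlite3
from datetime import datetime

def record_change(conn, doc_id, op_type, category="cm"):
    """Record one change to the business table in the delta table of the
    matching category ('cm' or 'kn'). op_type is 'I', 'U' or 'D'."""
    table = f"index_{category}_inc"
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (doc_id, op_type, op_time) "
        "VALUES (?, ?, ?)",
        (doc_id, op_type, datetime.now().isoformat()),
    )
    conn.commit()

# Example: the application saves a document and records the change itself,
# instead of relying on a database trigger on the business table.
conn = sqlite3.connect("search_prep.db")
record_change(conn, doc_id="DOC-1001", op_type="I", category="cm")
```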
2.2 Synchronize delta table data into the index table (this table stores all the field data that needs to be indexed, the location of the physical file, and so on; in other words, all the content fields that must go into the index library)
A scheduled task calls the corresponding stored procedure, which computes the required fields and synchronizes them into the corresponding index table. In our project this task was usually set to run at 12 o'clock at night. During synchronization, the previous day's data is first cleaned out to prevent re-crawling; failed records also need to be pulled from the index status table back into the index table so that they are indexed again. (Disadvantage: this policy only processes the current day's data, so a newly published document does not take effect immediately.) A rough sketch of such a nightly sync job follows.
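In the project this logic lived in a stored procedure called by the scheduled task; the sketch below only shows the flow in Python, using the assumed tables from the earlier schema sketch, and omits the joins against the business table that compute the indexed fields.

```python
import sqlite3
from datetime import date

def nightly_sync(conn, category="cm"):
    """Scheduled sync for one category: clear out yesterday's rows, copy the
    delta rows into the index table, and re-queue documents whose last index
    attempt failed."""
    inc = f"index_{category}_inc"
    idx = f"index_{category}"
    status = f"index_{category}_status"

    # 1. Clean up the previous day's data so the crawler does not re-crawl it.
    conn.execute(f"DELETE FROM {idx}")

    # 2. Sync the delta rows into the index table (the real stored procedure
    #    also computed the fields to be indexed from the business table).
    conn.execute(
        f"INSERT OR REPLACE INTO {idx} (doc_id) "
        f"SELECT doc_id FROM {inc} WHERE op_type != 'D'"
    )

    # 3. Re-queue documents that failed the previous day's check.
    conn.execute(
        f"INSERT OR REPLACE INTO {idx} (doc_id) "
        f"SELECT doc_id FROM {status} WHERE success = 0"
    )

    # 4. The delta rows have been consumed.
    conn.execute(f"DELETE FROM {inc}")
    conn.commit()
    print(f"{date.today()}: {idx} prepared for the crawler")

nightly_sync(sqlite3.connect("search_prep.db"))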
2.3 Start the crawler (also started by a scheduled task)
Configure each document category to work against its corresponding index table, specifying the fields to crawl, the physical files, and so on; see the mapping sketch below.
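As an illustration only, the category-to-index-table dispatch might look like the following; the job keys and the GeneralDocs/KnowledgeBase database names are hypothetical, and the actual launch of the Autonomy connector is not shown.

```python
# Illustrative per-category crawler dispatch: each category points at its own
# index table and target Autonomy database. All names here are hypothetical.
CRAWLER_JOBS = {
    "cm": {"index_table": "index_cm", "idol_database": "GeneralDocs"},
    "kn": {"index_table": "index_kn", "idol_database": "KnowledgeBase"},
}

def start_crawler(category):
    job = CRAWLER_JOBS[category]
    # In the real solution this would launch the Autonomy crawler/connector
    # for the category; here we only show the dispatch by category.
    print(f"crawling {job['index_table']} into {job['idol_database']}")

for cat in CRAWLER_JOBS:
    start_crawler(cat)
```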
2.4 After the crawler finishes, start the check task
For each kind of index table, check whether that day's data was indexed successfully, and mark the status of both the successful and the failed records, because the next day's run needs to re-index the failed data. How is the check done? Query the Autonomy index library by document ID: if the document can be retrieved, the index succeeded; otherwise it failed. A sketch of such a check task is shown below.
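In this sketch the Autonomy/IDOL query URL, action, and parameter names are placeholders that depend on the installation and are not the project's real values; the point is only the flow of querying by document ID and writing the result into the status table.

```python
import sqlite3
import urllib.parse
import urllib.request
from datetime import date

# Placeholder query endpoint: host, port, action and parameters are assumed.
IDOL_URL = "http://idol-host:9000/action=GetContent&Reference={ref}"

def document_indexed(doc_id):
    """Return True if the document can be retrieved from the index library."""
    url = IDOL_URL.format(ref=urllib.parse.quote(doc_id))
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        # Crude success test: the response mentions the document id at all.
        return doc_id in body
    except OSError:
        return False

def check_task(conn, category="cm"):
    """For every document in today's index table, record in the status table
    whether it actually made it into the Autonomy index library."""
    idx, status = f"index_{category}", f"index_{category}_status"
    today = date.today().isoformat()
    for (doc_id,) in conn.execute(f"SELECT doc_id FROM {idx}"):
        ok = 1 if document_indexed(doc_id) else 0
        conn.execute(
            f"INSERT OR REPLACE INTO {status} (doc_id, index_date, success) "
            "VALUES (?, ?, ?)",
            (doc_id, today, ok),
        )
    conn.commit()

check_task(sqlite3.connect("search_prep.db"))
```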
The above is the rough flow of basic data preparation and indexing. Each step can be broken down further, and different strategies can be customized to make sure the data gets indexed successfully, for example re-crawling the data recorded in the status table after the check, and so on.
This solution is not tied to Autonomy; the search engine can be swapped out, for example for ES (Elasticsearch) or another Lucene-based search framework.
The disadvantage is the lack of real-time indexing. The strategy could be adjusted on top of this idea to make it (near) real-time, but I have not yet worked out how; I hope to follow up on that later.
Enterprise Search Solution Process