First of all, the background of the project, once done a project, according to the data in Excel batch to crawl the data on the Web page, the crawl of the data in bulk backfill to excel. There are many rows in this Excel record (there will be thousands of rows in many cases), and each row of data stores the criteria for querying a unique piece of data on a Web page. The Operation webpage section uses the Microsoft Mshtml, does not do the redundant introduction here. The main point here is how to get the final query results and fill in the results back into the Excel section.
This may sound simple, the simplest way is to store the data in a list or DataTable in memory without querying the data in the Web page, and then, when the execution is complete, get the data from memory into Excel. However, there is a problem that the system may be forced to restart due to an upgrade during execution. And the process is usually long, the user cannot stare at the screen all the time, so if the data is stored in memory, it is easy to lose the previous execution result at the end. This allows the user to re-execute the data fetching process, but it also wastes a lot of time to execute, and sometimes even lets the user miss the task deadline! My implementation is to store the acquired data in the database every time, in the database to record the conditions of the query and the results of the query, after the final execution completes the bulk of the data from the database, synchronized to Excel. This ensures that the obtained results are not lost to the fullest extent, and that the performance of the system is guaranteed (compared to writing the results directly to excel each time).
So how do you get data from the database? The simplest way to do this is to loop each piece of data in Excel and get the records in the database, but obviously this is very inefficient and requires multiple accesses to the database. On the basis of this approach, one step closer is to spell out a SQL at the end, similar to
select * query result table where query condition in ( " query condition 1 ' , " query condition 2 ...)
However, there is a problem that many databases (SQL Server, Oracle) have a limit of the number of conditions that can be used in in, all 1000, and other databases may have some similar limitations. So once you have more than 1000 data in Excel, the SQL will get an error. So to use this method, according to the number of rows of Excel data, the dynamic grouping of data, each 1000 bar is 1 groups. Then loop through each group, dynamically spell out a SQL on each set of data, commit to the database, and finally merge the results of the query. To give a specific example, if there are 1005 data, then you need to divide the data into 2 groups, the 1th to 1000th is a group, the 1001th to 1005th is a group, the 2 sets of data into the database query. If this is the only way to not be unacceptable, but in the process of acquiring another problem to consider, that is, some data may not have a corresponding record in the database, this time to make a prompt to the user. The reason this happens is that when you get data from a Web page, there may be problems with the Web page, which can cause data fetching to fail. The user must manually reset the page, and delete the previously acquired data from Excel, and then restart execution to continue, so there is a risk of erasure, it is possible to synchronize the data in the last time there is no Excel data in the database.
If implemented in this way, then the final implementation is probably similar to the following code
var group number =math.cell (Excel data/+); for (int i=0;i< Group; i++) {List<string> Group data = get grouped data; block SQL based on grouped data; Execute SQL query result set based on grouped data if no data is queried in the result set, the error information is logged in Excel to synchronize the get data into Excel}
The result of this is to loop Excel to get the data of the logic and synchronization of Excel (the actual synchronization of Excel code is still more complex, here just to simplify the problem, and let this part of the logic seems relatively simple) of the logic mixed together, not only the code looks more chaotic, And it's not easy to do unit testing on this piece of code. To solve this problem is to separate the loop logic, there are 23 design patterns just one can solve the problem, that is, the iterator mode. The standard implementation of this pattern is as follows
This I do not introduce, there are many introductions on the Internet, but the implementation of this mode requires multiple classes of collaboration, although the logical separation can be achieved, but the amount of development is still relatively large. In the actual project I used a variant of the pattern under. NET, using the. NET yield keyword to implement iterators. The code is roughly as follows
/// <summary> ///Get iterators/// </summary> /// <param name= "DataSource" ></param> /// <param name= "Batchloadsize" >Bulk Load Size</param> /// <param name= "Loaddatacallback" >load Data callback, the first parameter is the data source to be loaded, returns the actual data</param> /// <param name= "Gettargetdatabysourcedatacallback" >get objects that have already been loaded based on the data source</param> /// <returns></returns> PublicIenumerable<keyvaluepair<tsource,ttarget>> getenumerable (ilist<tsource> DataSource,intBatchloadsize, Func<ilist<tsource>, ilist<ttarget>> loaddatacallback,func<tsource,ilist< Ttarget>,ttarget>gettargetdatabysourcedatacallback) { intBeginindex =0; while(true) {List<TSource> Loadsourcedata =NewList<tsource>(); inti; //loop to get the data to be loaded in bulk for(i=0; i<batchloadsize&&i+beginindex<datasource.count;i++) {Loadsourcedata.add (Datasource[beginindex+i]); } //loading data from the databaseIlist<ttarget> targetdatalist =Loaddatacallback (Loadsourcedata); //get a key value pair for the source and target data foreach(varSourceinchloadsourcedata) {Ttarget target=gettargetdatabysourcedatacallback (source, targetdatalist); KeyValuePair<tsource, ttarget> Sourcetargetkeypair =NewKeyvaluepair<tsource, ttarget>(Source,target); yield returnSourcetargetkeypair; } beginindex+=i; //If you have looped to the last group, exit the Loop if(beginindex>=datasource.count) { Break; } } }
The method for calling the iterator is as follows
foreach inch New Batchloadenumerable<tsource,ttarget> (). Getenumerable (Excel data,$, delegate that loads data, the delegate that queries data from the load results based on Excel data)) {if(Excel data and query results. query result = = NULL {log query error continue;} Synchronizing Data}
With this method, the logic of the loop is completely separated from the synchronized data, not only that the code looks more concise, but also that it is easier to do unit testing. I have done a generalized processing of this iterator, as long as it is possible to replace the loading data with a delegate that queries the data in the result set based on the original data in a different business scenario, and hopefully it will help you.
Bulk data acquisition using the iterator pattern (C # Implementation)