I recently finished a project on data quality analysis and subject-area analysis, built on the Integration Services (IS) and Analysis Services (AS) components of SQL Server 2005. Along the way I picked up a lot of painful lessons and some experience. Most of it amounts to "low-level mistakes", but I still want to write it down so similar problems can be avoided in the future:
1. Disk space
I suggested to the team that the system disk be given at least 30 GB. Even with Win2003 + VS2005 + SQL Server 2005 + Office 2007 installed, everything comes to only about 15 GB, which looks like plenty. Experience shows, however, that the C drive keeps shrinking as you use it. I originally allocated 20 GB, and later there were only a few megabytes left on C, which was awkward.
If I were doing a similar project again, I would recommend an even larger C drive. Much of the time you will be using fast loads, and that process involves temporary tables; if there is not enough space, then when you import 20 million rows the first 19.99 million may go through without a problem, only for the load to fail on the last 10,000 because the disk ran out. (See the sketch right below.)
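As a rough sketch of how to guard against this (the staging table name and file path are made up for illustration), you can check how much space each drive has left before kicking off a big load, and commit the load in batches so that an out-of-space failure near the end does not throw away every row already loaded:

    -- Check free space (in MB) on each fixed drive before starting a large load.
    -- xp_fixeddrives is undocumented, but has long been available on SQL Server.
    EXEC master.dbo.xp_fixeddrives;

    -- Hypothetical bulk load into a staging table, committed in batches so that
    -- a failure near the end does not roll back everything loaded so far.
    BULK INSERT dbo.StagingFact
    FROM 'D:\extract\fact_data.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        BATCHSIZE       = 100000   -- each 100,000-row batch is committed on its own
    );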
Of course, the database engine instance and its data files can be installed on or moved to other disks, and you can arrange this as appropriate for the volume of data you expect. A minimal sketch of relocating tempdb, which the fast loads lean on heavily, follows.
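For example, assuming a larger data drive at D:\SQLData (the path is just an assumption), tempdb can be moved off the C drive like this:

    -- Move tempdb's data and log files to a larger drive.
    -- The logical names tempdev/templog are the SQL Server 2005 defaults.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = 'D:\SQLData\tempdb.mdf');

    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = 'D:\SQLData\templog.ldf');

    -- The new location takes effect only after the SQL Server service restarts.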
This is something you have to think about up front when dealing with a lot of data; if you don't, you will end up spending unnecessary time in the middle of the task reinstalling or redeploying.
2. A reasonable structure
When processing a cube, you often find that processing stalls at one particular step and simply will not move on. When I ran into this, I waited two hours, was finally forced to kill the task, and went back to re-examine the cube structure. The design turned out to be seriously flawed: a fact table that was referenced by only one dimension had also been turned into a dimension table itself, which is a disaster. After the cube was rebuilt, processing finished in five minutes. One way to avoid this is to materialize a small, dedicated dimension table instead of pointing the dimension at the fact table, as sketched below.
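A minimal sketch of that fix (the table and column names dbo.FactSales, dbo.DimRegion and RegionCode are hypothetical): build the dimension from the distinct member values rather than letting it read the multi-million-row fact table directly.

    -- Build a small, dedicated dimension table from the distinct member values
    -- instead of treating the huge fact table itself as a dimension source.
    SELECT DISTINCT RegionCode, RegionName
    INTO dbo.DimRegion
    FROM dbo.FactSales;

    -- Give the dimension table a proper key for the cube to join on.
    ALTER TABLE dbo.DimRegion ALTER COLUMN RegionCode int NOT NULL;
    ALTER TABLE dbo.DimRegion ADD CONSTRAINT PK_DimRegion PRIMARY KEY (RegionCode);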
One more thing worth mentioning: after a team member finishes a cube, it is best to review it yourself. This is not about distrusting team members; you are ultimately responsible for the process and the quality, and the technical level across the team is uneven, so this work is unavoidable. I once ran into an intern who, just to produce the fact table and dimension table structure, had stacked five or six views on top of one another.
If such a problem is only discovered while the job is running, it will cost you far more time to track down.
3. The edition problem.
What I mean here is the edition of SQL Server. The ones you see most in the community are the Developer and Enterprise editions. From the point of view of an ordinary business system, the edition does not seem to matter much, but when it comes to processing large volumes of data, team members reported different behavior. One colleague's observation was that with the Developer edition, the memory consumed while processing data stays allocated unless you restart the service, whereas the Enterprise edition cleans up the battlefield once processing finishes. This has never been officially confirmed; I have only heard it mentioned in passing by friends at Microsoft. But in order not to delay the task, if you have the Enterprise edition, I recommend using it. You can confirm which edition an instance is actually running with the query below.
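A quick check, using the built-in SERVERPROPERTY function:

    -- Confirm the installed edition and build of the instance.
    SELECT SERVERPROPERTY('Edition')         AS Edition,
           SERVERPROPERTY('ProductVersion')  AS ProductVersion;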
4. Backup of data.
Microsoft systems used to blue-screen when they died; nowadays blue screens are rarer, and instead the machine simply stops responding to anything you do. After this painful lesson, I no longer recommend doing all the work on one machine. In our project, the ETL, the cube processing, and all the various programs ran on a single machine. One of the main reasons there was no backup was that the data volume was so large and the processing so complex, but that nearly turned into a disaster: after running for a while, the machine suddenly hung, completely unresponsive, right in the middle of processing data. We had no choice but to reboot it. It came back to the login screen, but after entering the password nothing happened at all. An experienced colleague said this was SQL Server repairing its tables, so all we could do was wait, and an hour went by just like that. Bear in mind we were working overtime at night.
If another machine had been available to take over at that moment, it would not have cost us so much time. So: back up your data, and if the tasks can be split up, deploy them across two machines to improve performance (sadly, every step of our task was sequential). At the very least, take regular full backups of the key databases to another machine, as sketched below.
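A minimal sketch of such a backup; the database name and the UNC path to the second machine are assumptions:

    -- Full backup of the warehouse database to a share on another machine,
    -- so a dead workstation does not take the data with it.
    BACKUP DATABASE DataQualityDW
    TO DISK = '\\BackupServer\SqlBackups\DataQualityDW_Full.bak'
    WITH INIT, STATS = 10;   -- overwrite the previous file, report progress every 10%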
5. The Network
In many cases you may be working in a network environment whose administrators are not very good at managing it, so attacks are everywhere; this is very common. I remember that, back on a university intranet, an ordinary Win2003 machine would be compromised the moment you plugged in the network cable unless SP1 was installed. This can be hard to avoid in any environment, and the bigger the company, the more dangerous it gets. More often than not the problem does not come from the team you are responsible for, but from the compromised "zombie" machines of other teams or projects. You do not want the network to collapse just when two machines are collaborating on data processing. (I have seen it get to the point where the two machines could not reach each other at all.)
This may sound a bit far-fetched, but it happens all the time. So besides the usual security precautions, I recommend putting your own team's machines behind a separate router; if a problem appears, just unplug the router's uplink cable, so that a company-wide network problem does not suddenly derail the progress of your task.
Other:
The team members working with AS, whether on mining models or on datasets with large dimensions, all reported that the machine hangs easily, and that is indeed the case. Besides improving the mining structure or the cube structure, the hardware requirements are also quite high. Our development machines were P4 3.0 GHz boxes with 2 GB of RAM; if you are going to work with AS and do not have a server, I recommend requesting a relatively high-spec machine.
In a project like this, the data-processing cycle is very long; even the smallest unit of processing can easily take an hour, so a problem at any point can waste an hour outright, and that does not include the time needed to fix the problem itself. And when you discover the problem was caused by the system and you have to reinstall something, the time lost becomes impossible to estimate.