Performance tuning is the improvement of system performance. The motivation for such activity is called a performance problem, which can be real or anticipated. In many cases the system performance decreases for example with increasing load volumes or the number of executed applications. A system’s ability to accommodate to changing conditions or circumstances is called scalability. If it is necessary to modify a system in order to handle the new conditions like higher loads or to achieve other goals like decreasing process times, it is called performance tuning.
Further we will deal especially with ETL performance tuning. The main performance requirement for an ETL process is to load a certain amount of data in a defined amount of time from a source to a destination. During the ETL process several changes can be applied to the data. So the main indicators to identify performance problems in ETL are:
Data Volume: The data volume describes the amount of data which is loaded during the ETL process from the source to the destination. Usually this amount is measured in MB (Megabyte) or GB (Gigabyte). It is also possible to measure the data volume by the number of rows loaded. As rows can differ according to their number and type of columns, this approach is not recommended as in most cases it is not precise enough.
Processing Time: Processing time means the time a certain ETL package or the complete ETL process, including many ETL packages, takes to complete.
Throughput: The data volume loaded in a certain processing time is called throughput. It is typically measured in MB/sec or GB/sec. Throughput tells you how fast you are able to load a certain data volume on a certain environment.
So we are talking about a performance issue when the ETL data volume is not processed in the expected or defined processing time. In this case the process described in the following should be utilized.
Important: The data volume, the expected processing time and other restrictions have to be defined before the tuning process. Tuning need certain goals to be achieved otherwise it ends up tuning for eternity.
The following described process and explanations are related to the tuning process presented by Thomas Kejser („Designing highly scalable data loading“). These techniques are Best Practices and I like to thank Thomas for providing this material.
Systematic tuning follows these steps:
- Assess the problem and establish numeric values that categorize acceptable behavior.
- Measure the performance of the system before modification.
- Identify the part of the system that is critical for improving the performance. (Bottlenecks)
- Modify that part of the system to remove the bottleneck.
- Measure the performance of the system after modification.
A commonly accepted approach to analyze and handle performance issues is shown in the figure below:
Generate Hypothesis
In the first step it is necessary to agree on hypothesis in order to get a common baseline. Usually the motivation for this step is driven by already known performance issues or also by the need and desire to improve system or process performance in order to achieve defined goals.
A performance problem may be identified by slow or unresponsive systems or by long loading times. This usually occurs because high system loading, causing some part of the system to reach a limit in its ability to respond. This limit within the system is referred to as a bottleneck. The bottleneck is the part of a system which is at capacity. Other parts of the system will be idle waiting for it to perform its task. In the process of finding and removing bottlenecks, it is important to prove their existence, before acting to remove them. There is a strong temptation to guess. Guesses are often wrong, and investing only in guesses can be highly time-consuming without any effects on the dedicated goal.
In this case the hypothesis should always refer to the known performance issue, like:
“The hard drive I/O causes the bottleneck and is responsible for the high processing time.”
Even if an application, seen for itself, performs well, it can happen due to overall system performance or resource government that it is necessary to deal with performance tuning. A typical scenario can be that the application is running on a server which also operates other application. In this case it is imaginable that restrictions due to resource limitations are set up for our application. So a certain hypothesis can be:
Read more: http://nexxtjump.com/2011/12/21/ssis-performance-tuning-methodology-and-general-approach/
By Marcel Franke, from: http://nexxtjump.com/2011/12/21/ssis-performance-tuning-methodology-and-general-approach/