Inaccurate table statistics can have severe performance implications for your ETL and reporting processes. In this post I describe why it is important to consider gathering statistics as part of your ETL process, and I provide an example of how this can be achieved.
Databases store information (statistics) about the data in your tables. This information is quite elaborate; common statistics include the number of rows in a table, the number of distinct values in a column, and a column's minimum and maximum values.
The database uses these statistics to produce execution plans for your queries. The execution plan decides vital steps such as whether to use an index, whether to execute steps in parallel, etc.
Accurate statistics are vital for the database to produce execution plans that run quickly while using an appropriate amount of resources.
For example, suppose you search for a particular value in a table and, based on the statistics, the database assumes the query will return a single row. The database will optimize the query under this assumption and predict it will finish in seconds. If that value actually exists in a million rows, however, you might not see the query finish anytime soon.
This is a common problem that occurs when the table statistics do not accurately reflect the actual data in the table.
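To make this concrete, here is a toy sketch in Python of how a row-count estimate drives the choice of access path. The cost numbers are entirely made up for illustration; real optimizers use far more sophisticated cost models.

```python
# Toy cost model: a stale row-count estimate can flip the plan choice.
# The per-row costs below are illustrative, not real optimizer costs.

def choose_plan(estimated_rows: int, total_rows: int) -> str:
    """Pick an access path the way an optimizer might: an index lookup
    is cheap for a handful of rows, while a full scan is cheaper once
    you fetch a large fraction of the table."""
    index_cost = estimated_rows * 3     # one index probe + row fetch per row
    full_scan_cost = total_rows * 0.1   # sequential reads are cheap per row
    return "INDEX SCAN" if index_cost < full_scan_cost else "FULL SCAN"

# Stale statistics say the value is rare -> the optimizer picks an index scan,
# which is catastrophic if the value actually matches a million rows:
print(choose_plan(estimated_rows=1, total_rows=10_000_000))          # INDEX SCAN
# With accurate statistics, the cheaper full scan would have been chosen:
print(choose_plan(estimated_rows=1_000_000, total_rows=10_000_000))  # FULL SCAN
```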
ETL & Statistics
During the ETL process, data is loaded into existing tables in the data warehouse and/or the data marts.
It is common practice to modify and insert data in moderation into your dimension tables. This means the existing statistics for these target dimension tables are likely to remain accurate enough (to a certain extent).
Fact tables, in contrast, usually receive new data in high volumes, and are often partitioned by some criterion.
This means that the existing statistics for these particular tables/partitions are likely to be inaccurate.
Please note that some systems re-load all dimension data and/or populate dimension tables in high volumes; in those cases the statistics for dimension tables will be inaccurate as well.
It Only Gets Worse
A common ETL procedure contains steps that load data into a data warehouse and/or data marts based on newly available data residing in your DB:
- Staging > Data warehouse – The ETL fetches data from the staging area and loads it into the dimension and fact tables.
- Data warehouse > Data mart – Newly available fact data is aggregated into the data mart.
- Reporting – Data is fetched daily from aggregated data marts and presented in reports/dashboards etc.
This means that all these steps have potential issues with inaccurate statistics:
- Tables with new data in the staging area are likely to have inaccurate statistics, impacting the process of loading this data into the data warehouse.
- New fact data in partitions residing in the data warehouse are likely to have inaccurate statistics, impacting the aggregation into the data mart.
- Reports querying only recently aggregated data are likely to rely on inaccurate statistics.
These inaccurate statistics can delay ETL and reporting processes, and in some scenarios even bring them to a halt. Therefore I highly recommend considering statistics collection as part of your ETL process.
Some of the issues described above are resolved by DB vendor solutions.
However, there is often a disconnect between ETL development and DB administration, resulting in inaccurate statistics.
As mentioned, DB vendors provide their own solution for gathering of statistics.
For example, Oracle provides automated maintenance tasks that gather statistics for any tables/partitions that have recently undergone significant modifications.
Oracle decides whether the current statistics are likely to be inaccurate (based on the percentage of rows modified since the statistics were last gathered). Such tables are marked as having stale statistics, and the automated maintenance task gathers fresh statistics during predefined maintenance windows.
The problem with this solution is that Oracle's predefined windows are likely to occur in the evening and/or after your ETL process is done.
In addition, Oracle may gather the statistics in parallel with the ETL process, resulting in completely wrong table statistics (e.g. NUM_ROWS=0 for a partition that was still being loaded).
This solution is good for the overall maintenance of the DB statistics. However, it does not resolve the inaccurate statistics for newly loaded fact data, which is likely to be used within the ETL process itself.
Gathering statistics from within the ETL has the advantage that you are familiar with the type of data being loaded, and you know when and where that data will be used.
Fact tables are usually prime candidates for statistics gathering, though dimension tables with a high modification rate should also be considered.
For example, if new sales data is loaded into (daily) partitions in the data warehouse, it would be wise to gather statistics for the recent daily partitions after the load. This produces better execution plans for any queries that use this data (the aggregation process of the data mart and/or reports querying recent data).
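Such a post-load step essentially boils down to one DBMS_STATS call per freshly loaded partition. The Python helper below only builds the PL/SQL block; the owner, table, and partition names are hypothetical, and actually executing the block against the database is left to your ETL tool:

```python
def gather_partition_stats_sql(owner: str, table: str, partition: str) -> str:
    """Build a PL/SQL block that refreshes optimizer statistics
    for a single partition via Oracle's DBMS_STATS package."""
    return (
        "BEGIN\n"
        "  DBMS_STATS.GATHER_TABLE_STATS(\n"
        f"    ownname => '{owner}',\n"
        f"    tabname => '{table}',\n"
        f"    partname => '{partition}',\n"
        "    cascade => TRUE);\n"  # refresh the partition's index statistics too
        "END;"
    )

# e.g. right after loading yesterday's sales into its daily partition
# (DWH, FACT_SALES and P20240101 are made-up example names):
print(gather_partition_stats_sql("DWH", "FACT_SALES", "P20240101"))
```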
This can be achieved in many ways. You can use functionality from the DB to automate this process, or you can build your own scripts/procedures to perform this task.
Below I provide an example of a simple and effective solution that gathers statistics based on a combination of your own knowledge and the DB's knowledge.
It is a hybrid approach, gathering statistics only when necessary, at predefined steps in the ETL.
This solution is provided in a simple ‘SAP Data Services’ script for an Oracle DB. However this logic can easily be translated into any ETL / DB scripting language.
The script uses two rules:
- NUM_ROWS – Gather statistics for recently loaded tables/partitions if the current statistics are wrong.
- STALE_STATS – Gather statistics for important tables before their usage, in case Oracle marked them as having stale statistics.
The script was written quickly, purely to illustrate the logic. It can easily be refined to be less redundant and more dynamic.
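For readers without Data Services, the two rules above can be sketched in plain Python. The 10% deviation threshold and the input values are my own illustrative assumptions; STALE_STATS and NUM_ROWS mirror the columns of the same names in Oracle's *_TAB_STATISTICS views:

```python
def should_gather(num_rows_stat, actual_rows, stale_stats, max_deviation=0.1):
    """Decide whether to gather statistics for a table/partition.

    num_rows_stat -- NUM_ROWS as currently recorded in the data dictionary
                     (None if the object was never analyzed)
    actual_rows   -- a quick count of the rows actually present
    stale_stats   -- Oracle's STALE_STATS flag ('YES'/'NO')
    """
    # STALE_STATS rule: trust Oracle's own staleness flag.
    if stale_stats == "YES":
        return True
    # NUM_ROWS rule: gather when the recorded row count is missing or
    # deviates too far from reality (e.g. NUM_ROWS=0 because the automated
    # task analyzed the partition while it was still empty).
    if num_rows_stat is None:
        return True
    if actual_rows == 0:
        return num_rows_stat != 0
    return abs(num_rows_stat - actual_rows) / actual_rows > max_deviation

print(should_gather(0, 1_000_000, "NO"))        # stats say empty, 1M rows loaded -> True
print(should_gather(995_000, 1_000_000, "NO"))  # within 10% of reality -> False
print(should_gather(500, 500, "YES"))           # flagged stale by Oracle -> True
```

The ETL would call such a check right before the steps that consume the freshly loaded data, and issue a gather-statistics command only when it returns true.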