This post is a first in a set of posts, which discuss various lessons learned from migrating a BI infrastructure from Oracle to Redshift
This particular post discusses the migration of Oracle data types to Redshift, obstacles, tips and some important function usage differences
DATE to TIMESTAMP(TZ)
The source (Oracle) database stored most date time related columns in a DATE data type.
This is a 7 bytes fixed length data type, which stores the date time up to a precision of seconds.
DATE is a commonly used data type in Oracle, and many Oracle date time related functions support DATE as both the input and the output.
Auto-Suggested target TIMESTAMP
Using the AWS Schema Conversion Tool, it suggested to migrate all Oracle DATE columns to TIMESTAMP (aka: TIMESTAMP WITHOUT TIMEZONE).
TIMESTAMP is an 8 bytes data type which supports up to 6 digits of precision for fractional seconds.
The disadvantage to using TIMESTAMP is that the data is always assumed to be in UTC. The input is stored as is, and any timezone input is ignored.
This can create a lot of confusion & ambiguity, and for this reason it is highly recommended to use instead TIMESTAMPTZ (aka: TIMESTAMP WITH TIMEZONE)
Chosen Target TIMESTAMPTZ
AWS Schema Conversion Tool did not suggest TIMESTAMPTZ (aka: TIMESTAMP WITH TIMEZONE) as our target, due to Oracle DATE not containing a timezone indication.
However as we are aware of the original timezone, we decided to utilize TIMESTAMPTZ and explicitly specify the input timezone.
TIMESTAMPTZ is also an 8 bytes data type which supports up to 6 digits of precision for fractional seconds.
It is important to clarify that TIMESTAMPTZ stores the data in UTC. It does not store any timezone details related to the original input value.
When the input value contains a timezone indication, this indication is used to convert the value into UTC and is stored as such (if no timezone indication is provided, the source data is assumed to be by default in UTC).
Setting a default TIMEZONE
By default TIMESTAMPTZ assumes input and output data to be inserted and retrieved in UTC.
Because in our case, all source databases and business users are expected to use Amsterdam as a consistent timezone (even when not explicitly mentioning it). We decided to modify the Redshift users to have a default timezone.
alter user <username> set timezone to 'Europe/Amsterdam';
By setting the ETL user(s) to a default timezone, it sets all input data to be processed by this default timezone (unless a timezone is specified).
By setting the Reporting user(s) to a default timezone, it makes sure that all output data is displayed in the default timezone, in addition, any date time filters are assumed to be in the default timezone.
Thus by modifying the users timezone, we abstract the fact that all data is stored in UTC, and we eliminate the need to explicitly specify the input/output timezone.
VARCHAR2 to VARCHAR
The source (Oracle) database stored most character related fields as VARCHAR2 with a specified CHAR length.
By using CHAR length instead of BYTE length, it allowed to easily specify the exact maximum limit of characters, including special characters or languages where each character can take up to 4 bytes per character.
Using the AWS Schema Conversion Tool, it suggested to migrate all Oracle VARCHAR2 columns to VARCHAR with BYTE length (aka: CHARACTER VARYING).
It is important to note that it will automatically suggest to multiply the VARCHAR2 CHAR length by 4, in order to produce a BYTE length which supports the same limit of characters – this means a column definition of VARCHAR2(10 CHAR) will be converted to a VARCHAR(40).
Oracle uses variable length for in-memory processing, this means if you have a VARCHAR2 column of 100 Bytes, but only 20 Bytes are filled. It will only occupy about 20 bytes in-memory.
Redshift however, will occupy the amount of bytes as per the column definition, rather than the actual usage. For this reason, it is highly recommended to limit the length of columns to reflect actual value length stored.
In order to determine your required byte length, you can use in Oracle the function LENGTHB, which returns the length in bytes.
Unless you have a better knowledge of the data, I would suggest to take your maximum actual length in bytes and add 10-20%, in order to set your new target length size on Redshift.
Redshift provides multiple compression algorithms in order to compress your data. This allows to save storage space and provide faster processing.
Choosing the appropriate compression method is an important optimization task.
It is worth to go over all the different compression encoding, even if you don’t plan to utilize them.
Redshift suggests that you utilize the COPY command, which will automatically determine the compression to utilize in the target table.
I personally prefer to utilize ANALYZE COMPRESSION and/or the ColumnEncodingUtility– as it allows you to analyze the data, receive a suggestion and then you can decide what to use from these suggestions.
I found out that many times my own knowledge about the data combined with knowing all possible compression encoding, allowed me to choose a better fit than the auto suggested one.
There are a number of functions that are different between Oracle and Redshift – in input, output and syntax:
TIMESTAMPTZ vs TIMESTAMP as input
Many of the date related functions require an input of TIMESTAMP and do not support TIMESTAMPTZ or vice versa, additionally some allow both TIMESTAMP & TIMESTAMPTZ. This inconsistency between functions can be sometimes confusing.
Therefore, if you use the recommended TIMESTAMPTZ you will most likely find yourself having to CAST it many times in your function usage to TIMESTAMP.
Calculation with implicit conversion
When doing math calculations with implicit conversions, the output values might be unexpected. This is because Redshift uses the format of the input to determine the output format, and will not treat ‘2’ the same as ‘2.0’.
To solve this, simply be more explicit with your input values and/or use conversion functions.
Example of this unexpected behavior:
- 45/100 – Returns an unexpected value of zero
4/3 – Returns an unexpected value of one
- 45.00/100.00 – Returns the expected value of 0.45 (this is because a decimal point was used in the input values).
Example of issue with date calculation/condition:
- cast(‘2020-01-01 08:00:00’ as timestamp) – cast(‘2020-01-01 07:00:00’ as timestamp)
- Returns INTERVAL data type: ’01:00:00′ – To apply condition on INTERVAL result you cannot use 1/24 to represent an hour, you will have to use ‘INTERVAL 1 HOUR’ in condition.
A few other noticeable syntax differences:
- Using capital ‘J’ – You must use a capital ‘J’ in Redshift to convert a date to a Julian (in Oracle you could also use a lowercase ‘j’).
- TO_NUMBER requires input format – You must specify the format of the input value in order to convert it to a number (in Oracle the format is optional).
- REGEXP is not dynamic – You cannot use dynamic regular expressions with REGEXP functions (you can read more about this in my previous post)
- INTERVAL – in Oracle sysdate – INTERVAL ‘1’ DAY, in Redshift it is sysdate – INTERVAL ‘1 DAY’
- SUBSTR on Oracle treats position zero as one, while SUBSTRING on Redshift will not.