Dynamic RegExp pattern matching in Redshift

Recently I needed to match regular expressions patterns in Redshift, where both the string and the regular expression were dynamic values provided from table columns. Unfortunately REGEXP_COUNT and other similar functions in Redshift, do not support columns as the regular expression. Therefore I decided to solve this with a UDF.

Below is a simple Python UDF in Redshift which returns any matched regular expression patterns – where both string and regular expressions are dynamic (provided by a table).

 

CREATE OR REPLACE FUNCTION <schema>.<function_name> (string_value VARCHAR(255), regexp_pattern VARCHAR(255)) RETURNS VARCHAR(255) IMMUTABLE as $$ 
    import re

    #Compile regular expression pattern
    reg = re.compile(r'%s'%regexp_pattern)

    #Create list of patterns matched
    matches_found = reg.findall(string_value)

    #Join list into one
    return '|'.join(matches_found)
$$ LANGUAGE plpythonu;

In my particular case, I needed to verify when there was no match found. Therefore I simply used the condition:
dev.udf_regexp_match(column1,column2) = ''

Below is the error that started it all. When attempting to use dynamic values in regexp_count & regexp_instr functions of Redshift:

The pattern must be a valid UTF-8 literal character expression

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s