Recently I needed to match regular expressions patterns in Redshift, where both the string and the regular expression were dynamic values provided from table columns. Unfortunately REGEXP_COUNT and other similar functions in Redshift, do not support columns as the regular expression. Therefore I decided to solve this with a UDF.
Below is a simple Python UDF in Redshift which returns any matched regular expression patterns – where both string and regular expressions are dynamic (provided by a table).
CREATE OR REPLACE FUNCTION <schema>.<function_name> (string_value VARCHAR(255), regexp_pattern VARCHAR(255)) RETURNS VARCHAR(255) IMMUTABLE as $$ import re #Compile regular expression pattern reg = re.compile(r'%s'%regexp_pattern) #Create list of patterns matched matches_found = reg.findall(string_value) #Join list into one return '|'.join(matches_found) $$ LANGUAGE plpythonu;
In my particular case, I needed to verify when there was no match found. Therefore I simply used the condition:
dev.udf_regexp_match(column1,column2) = ''
Below is the error that started it all. When attempting to use dynamic values in regexp_count & regexp_instr functions of Redshift:
The pattern must be a valid UTF-8 literal character expression