Recently I needed to match regular expressions patterns in Redshift, where both the string and the regular expression were dynamic values provided from table columns. Unfortunately REGEXP_COUNT and other similar functions in Redshift, do not support columns as the regular expression. Therefore I decided to solve this with a UDF.
Below is a simple Python UDF in Redshift which returns any matched regular expression patterns – where both string and regular expressions are dynamic (provided by a table).
CREATE OR REPLACE FUNCTION <schema>.<function_name> (string_value VARCHAR(255), regexp_pattern VARCHAR(255)) RETURNS VARCHAR(255) IMMUTABLE as $$
import re
#Compile regular expression pattern
reg = re.compile(r'%s'%regexp_pattern)
#Create list of patterns matched
matches_found = reg.findall(string_value)
#Join list into one
return '|'.join(matches_found)
$$ LANGUAGE plpythonu;
In my particular case, I needed to verify when there was no match found. Therefore I simply used the condition:
dev.udf_regexp_match(column1,column2) = ''
Below is the error that started it all. When attempting to use dynamic values in regexp_count & regexp_instr functions of Redshift:
The pattern must be a valid UTF-8 literal character expression