ColdFusion Regex Backreferences

I want to take a string that contains HTML and increment all heading tags by 1. For example, I want to turn <h3>…</h3> into <h4>…</h4>.

I might be able to accomplish this with a long ReplaceList call. However, I decide against this, as I’ve run into problems with ReplaceList not replacing items in the order the documentation claims it does. Plus I would have to hard-code every heading tag (both opening and closing), and that would just look ugly. So let’s try to be clever.

This sounds like a job for regular expressions (regex). I opt for the REReplace function over REReplaceNoCase because REReplaceNoCase seems lazy to me, whereas REReplace forces me to take case into consideration explicitly. I like that. Better to deal with potential input problems now than to wait for them to become problems later.

So my search string will be something like "<(/?[hH])([1-6])". This will search for both opening and closing heading tags, and I can use the number of the heading as a backreference (that is, I can refer to it in my replacement string). I’m only searching for H1 through H6 because the HTML spec defines only six heading levels; H7 and so on aren’t valid HTML tags.
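To see what the two groups in that pattern actually capture, here is a minimal sketch using REFind with its returnsubexpressions argument set to true. The variable names (sample, match) are only placeholders for illustration.

```cfml
<cfset variables.sample = "</H3>" />
<!--- With the fourth argument set to true, REFind returns a struct holding pos and len arrays --->
<cfset variables.match = REFind( "<(/?[hH])([1-6])", variables.sample, 1, true ) />
<!--- Entry 1 describes the whole match; entries 2 and 3 describe the two captured groups --->
<cfoutput>
group 1: #Mid( variables.sample, variables.match.pos[2], variables.match.len[2] )#<br />
group 2: #Mid( variables.sample, variables.match.pos[3], variables.match.len[3] )#
</cfoutput>
```

For the closing tag above, group 1 should come back as "/H" and group 2 as "3", which is the number I want to increment.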

So my REReplace line might look something like REReplace( variables.input_string, "<(/?[hH])([1-6])", "<\1#Val( \2 + 1 )#", "ALL" ).

But this will not work.

The reason is that ColdFusion evaluates every #...# expression before it performs the regex operation. So the “\2” inside the Val() call is never treated as a backreference to the number of the heading. ColdFusion tries to parse it as part of a CFML expression, which it isn’t, and that triggers a parsing error.

So the workaround is a bit ugly, but still more elegant than one big ReplaceList call: loop from 1 to 5 and perform the regex operation one heading level at a time. Except there’s another gotcha here. If my loop starts at 1 and works up to 5, the first pass turns all H1 tags into H2 tags. In the second pass, all H2 tags, including the former H1 tags, become H3 tags. In the end my HTML will be all H6 tags. The fix is simple: just loop backwards, from 5 to 1. Easy! The code would look something like this:

<cfloop from="5" to="1" index="variables.h" step="-1">
    <cfset variables.input_string = REReplace( variables.input_string, "<(\/?[hH])#variables.h#", "<\1#Val( variables.h + 1 )#", "ALL" ) />
</cfloop>

Simple. Except this code won’t work. All the heading tags get dropped for some reason. “<h4>” becomes “<>”. What’s going on?

Refer back to how ColdFusion will evaluate all variables before performing the regex operation. In the replace string I use the backreference “\1”. But ColdFusion evaluates the Val() operation before performing the regex operation, so the backreference the regex operation sees (for an H1 replacement) is “\12”. The regex operation looks for a 12th backreference, which doesn’t exist, so the replaced string is empty.
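A quick way to confirm this is to build the replacement string on its own and print it. This is only a sketch; the variable name replacement is a placeholder I made up for the test.

```cfml
<cfset variables.h = 1 />
<!--- The hash expression is resolved while this string is built, before any regex runs --->
<cfset variables.replacement = "<\1#Val( variables.h + 1 )#" />
<!--- HTMLEditFormat() escapes the angle bracket so the browser shows the raw string --->
<cfoutput>#HTMLEditFormat( variables.replacement )#</cfoutput>
```

This should display <\12, which is exactly the string REReplace receives as its replacement argument for the H1 pass.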

In other regex engines, this is easily fixed because the replacement syntax lets me delimit the group number explicitly. Perl and .NET accept ${1}, for example, and Python accepts \g<1>. But ColdFusion’s regex engine is a rather bastardized one, and as far as I can tell it offers nothing like that.

I need to separate my backreference (\1) from my replacement variable (#Val( variables.h + 1 )#). A space won’t work because my H4 tag becomes “H 4”, which isn’t an HTML tag. Can’t use HTML entities as this isn’t text to be output to the user, it’s HTML that the HTML engine needs to interpret. So what can I do?

Well, here’s the trick I’ve decided to go with.

Looking at ColdFusion’s “Using backreferences” page I see references to special characters used to make letters uppercase (\u) or lowercase (\l). I also see I can perform this case change over a string of characters by inserting \U or \L at the start of the case change, and then insert a \E where the case change should end. This is what I need. These are single, alpha-character special regex characters that do not generate output by themselves. I can use a “\U\E” sequence to separate my backreference from my variable! So my final code looks something like this:

<cfloop from="5" to="1" index="variables.h" step="-1">
    <cfset variables.input_string = REReplace( variables.input_string, "<(\/?[hH])#variables.h#", "<\1\U\E#Val( variables.h + 1 )#", "ALL" ) />
</cfloop>

A quick test shows me I don’t even need the \U, I can stick with just a \E to get the job done. But I’ll keep the \U for completeness.

The reason I start with 5 and not 6 is that I don’t want to create H7 tags (which are not valid HTML), so H6 tags will remain H6 tags.
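If this needs to be reused, the loop could be wrapped in a function. The following is a minimal sketch, assuming the \U\E separator behaves as described above; the names incrementHeadings, html, result and sample are hypothetical and not part of any ColdFusion API.

```cfml
<cffunction name="incrementHeadings" returntype="string" output="false">
    <cfargument name="html" type="string" required="true" />
    <cfset var result = arguments.html />
    <cfset var h = 0 />
    <!--- Count down from 5 so a tag that was just renumbered is never matched again --->
    <cfloop from="5" to="1" index="h" step="-1">
        <!--- \U\E produces no output; it only keeps \1 apart from the digit that follows --->
        <cfset result = REReplace( result, "<(/?[hH])#h#", "<\1\U\E#Val( h + 1 )#", "ALL" ) />
    </cfloop>
    <cfreturn result />
</cffunction>

<cfset variables.sample = "<h1>Title</h1><H3>Sub</H3><h6>Fine print</h6>" />
<cfoutput>#HTMLEditFormat( incrementHeadings( variables.sample ) )#</cfoutput>
```

With the sample string above, I would expect the H1 pair to come back as H2, the H3 pair as H4 (keeping its uppercase H), and the H6 pair to be left alone.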

And now we come to the end of my post. This is where someone posts a comment and points out that my solution is, in fact, unnecessary because there IS a mechanism in ColdFusion’s regex engine to handle such a situation and I just didn’t RTFM as closely as I should have. Well, I truly welcome such a revelation should it exist. If not, perhaps Adobe could look into beefing up their regex engine a bit.