In an odd turn, I was given text like below to display on a page.
2&nbsp;&nbsp;&nbsp;&nbsp;A.&nbsp;&nbsp;Some&nbsp;subheader&nbsp;here<br />
3&nbsp;&nbsp;&nbsp;&nbsp;B.&nbsp;&nbsp;Some&nbsp;other&nbsp;subheader&nbsp;here<br />
Two issues here (aside from the fact that it should have been an HTML list): 1) it needed to retain the spacing format, and 2) it needed to wrap within a sized element. A non-breaking space, when viewed, appears as a space ( ), but isn't an actual space, so the browser doesn't know where to break the text when wrapping it in an element. Hmmmm....
So, we needed to replace any &nbsp; that is preceded and followed by a printable character, leaving multiple consecutive &nbsp; intact for the fake indentation. I figured this was best left to a regular expression used in ColdFusion's ReReplace() method, but my RegEx is pretty rusty, so I reached out on Twitter.
Andy Matthews and Kris Jones both reached out to me with possible expressions for this, but nothing was working. What was happening is it was finding the characters around an &nbsp; and removing both the nonbreaking space and the characters. Hmmmm....
But both of these folks had pointed me in the right direction. Seems in some RegEx replace engines you can reference groups within expressions in your replacement output. Unfortunately, you can't do this with ReReplace() (or if you can I haven't figured out how).
So I said to myself, "Self" ('cause that's what I call myself) "Self, what about tapping into ColdFusion's underlying Java?" Fingers flying I hit Google. BAM! Up pops Ben Nadel talking pattern matching with the underlying java.util.regex package, with code examples all over (here's one). Time to play.
First I needed the Java RegEx Pattern object:
Then I needed to define the pattern for which I was searching. This RegEx Glossary gave me a ton of info on Java RegEx, which I used to define my matching pattern:
The \p{Print} identifies any printable character (don't want to include my <br /> tags), and I want only nonbreaking spaces bracketed by printable characters. The next step is defining the matcher (what the expression will be run against):
And then, the final step, replacing the &nbsp; with a space. The expression returns the characters as well, so I need group 1 + space + group 2 in my output (what I couldn't do in ReReplace). That RegEx Glossary helped with this too:
The groups in the expression, the bits within parens (), are available in the replacement string you pass to replaceAll: $1 for whatever the first group matched, $2 for the second, and so on. The entire thing then looks something like this:
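For anyone following along outside ColdFusion, here is a minimal sketch of those same steps in plain Java. It assumes the text holds literal `&nbsp;` entities; the class and method names are made up for illustration, and it is not necessarily the exact code used on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original post.
final class NbspGroupReplace {
    // A literal "&nbsp;" entity with one printable character on each side.
    static final Pattern SINGLE_NBSP =
            Pattern.compile("(\\p{Print})&nbsp;(\\p{Print})");

    static String replace(String input) {
        // The matcher is the pattern bound to the text it will run against.
        Matcher matcher = SINGLE_NBSP.matcher(input);
        // $1 and $2 put the two captured neighbours back around a real space.
        return matcher.replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        // prints "Some subheader here<br />"
        System.out.println(replace("Some&nbsp;subheader&nbsp;here<br />"));
    }
}
```

The pattern is compiled once and reused, which is the main reason to hold on to a Pattern object rather than rebuilding it per call.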
Worked like a charm! Thanks to all who helped me get my head around this one.


#1 by Gus on 2/28/11 - 9:21 AM
reReplaceNoCase(str,'([[:print:]])(&nbsp;)([[:print:]])','\1 \3','all')
Gus
#2 by Peter Boughton on 2/28/11 - 11:19 AM
You get exactly the same result with this:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(\p{Print})&nbsp;(\p{Print})","$1 $2");
Remember, a CFML String is a Java String, so you can apply Java String operations directly to it.
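A rough Java sketch of that equivalence, assuming literal `&nbsp;` entities in the text (the class and method names are hypothetical): `String.replaceAll` and the explicit Pattern/Matcher route give the same result.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only String.replaceAll and Pattern are real APIs.
final class StringReplaceAllDemo {
    static final String REGEX = "(\\p{Print})&nbsp;(\\p{Print})";

    // One-liner: the String method compiles the regex internally.
    static String viaString(String input) {
        return input.replaceAll(REGEX, "$1 $2");
    }

    // Long form: explicit Pattern and Matcher objects.
    static String viaMatcher(String input) {
        return Pattern.compile(REGEX).matcher(input).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        String text = "Some&nbsp;other&nbsp;subheader";
        System.out.println(viaString(text));   // prints "Some other subheader"
        System.out.println(viaMatcher(text));  // prints "Some other subheader"
    }
}
```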
And before I go into detail of why this is all wrong ;) I'll just show a quick general optimisation:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<=\p{Print})&nbsp;(?=\p{Print})"," ");
This uses lookbehind and lookahead to ensure that the characters are there, but does not include them in the match, so it does not need to remember the actual characters themselves - it simply remembers the location and then replaces those six characters with a single space.
That's of course trivial in this situation, but if you were matching more than a single character each side and including this expression in a big loop, it might become important, and it's useful to be aware of things like this.
However, DON'T switch to using this second version, because it will highlight why this approach is not actually doing what you think it is, and will in fact result in most of your nbsp being replaced.
1. The main problem is that this doesn't explicitly ignore consecutive non-breaking spaces. In a consecutive set it replaces every other one. (Because of the way the matcher/replaceAll works, each replacement continues from the end of the last match, and for the first nbsp of a sequence you match, you're then consuming two characters before looking for the next nbsp of a sequence - at which point you're already on the B.) Given the use-case, this might not actually matter for you, but it's certainly something to be aware of if you tried to apply the same technique in other situations.
2. Something else that might not be significant in this situation, but is important to be aware of for this general type of problem - you're looking specifically for a printable character before and after your nbsp - this means that an nbsp at the start/end of the string will NOT get matched. (May or may not be desired behaviour.)
3. The other problem here is that you're matching "any printable character", but seem to be misunderstanding what that actually means, (or perhaps how Regex works) and it's not doing what you think it is.
Regex works against plain text. Regex has no knowledge of HTML at all. It does not treat < any differently to any other character, and in regex the print class means "every VISIBLE character, plus space". Basically, everything from ASCII Chr 32 to Chr 126 (not sure about higher chars) is matched by \p{Print} (or [[:print:]]). It will not treat an nbsp inside a tag any differently from one outside a tag. (However, it will prevent a single nbsp before or after a newline from being replaced, which may or may not be desired.)
If you want to preserve the nbsp in <img alt="x&nbsp;y"/> - then you're getting into the foggy area of HTML parsing, which is something which can easily choke regex.
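These problems are easy to reproduce. A small Java sketch, assuming the input holds literal `&nbsp;` entities (class and method names are hypothetical), runs both versions against a fake-indented line and against a tag attribute:

```java
// Hypothetical demo class showing how the two expressions behave.
final class NbspProblemDemo {
    // Capturing version: consumes one character on each side of the entity.
    static String withGroups(String input) {
        return input.replaceAll("(\\p{Print})&nbsp;(\\p{Print})", "$1 $2");
    }

    // Lookaround version: checks the neighbours without consuming them.
    static String withLookaround(String input) {
        return input.replaceAll("(?<=\\p{Print})&nbsp;(?=\\p{Print})", " ");
    }

    public static void main(String[] args) {
        String indented = "2&nbsp;&nbsp;&nbsp;&nbsp;A.";
        // Every other entity is replaced, because "&" and ";" are printable
        // and each match swallows the "&" of the following entity.
        System.out.println(withGroups(indented));      // prints "2 &nbsp; &nbsp;A."
        // Nothing is consumed, so all four entities are replaced.
        System.out.println(withLookaround(indented));  // prints "2    A."
        // Regex has no idea this entity sits inside a tag attribute.
        System.out.println(withGroups("<img alt=\"x&nbsp;y\"/>"));
    }
}
```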
So those are the three problems, and now you probably just want a solution. :)
To avoid horrible complexity, I'm going to ignore the HTML aspect - if you do need to perform differently inside tags to outside, and this is a recurring issue not a one-off, then you need a solution more complex than a single regex, which is aware of when it is inside/outside a tag and behaves appropriately.
Problem 1: How do you match a single instance of X, but don't match a consecutive set of Xs.
Answer: Use negative lookbehind and negative lookahead.
Like this: (?<!x)x(?!x)
Or this: (?<!&nbsp;)&nbsp;(?!&nbsp;)
If either or both lookbehind/ahead matches an nbsp the match will fail. Therefore, the match will only succeed when an nbsp is alone (irrespective of what characters are before or after it).
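As a minimal Java sketch of that negative lookaround (literal `&nbsp;` entities assumed, names hypothetical), only the lone entities are replaced and the indentation runs survive:

```java
// Hypothetical demo class for the "single X, not a run of Xs" expression.
final class LoneNbspDemo {
    // Match "&nbsp;" only when no "&nbsp;" sits directly before or after it.
    static final String LONE_NBSP = "(?<!&nbsp;)&nbsp;(?!&nbsp;)";

    static String replace(String input) {
        return input.replaceAll(LONE_NBSP, " ");
    }

    public static void main(String[] args) {
        String line = "2&nbsp;&nbsp;&nbsp;&nbsp;A.&nbsp;&nbsp;Some&nbsp;subheader&nbsp;here";
        // prints "2&nbsp;&nbsp;&nbsp;&nbsp;A.&nbsp;&nbsp;Some subheader here"
        System.out.println(replace(line));
    }
}
```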
Problem 2: How do you match at the start/end of a string.
Answer: Using the same solution - because negative lookbehind/lookahead confirm the ABSENCE of something, they work at the start and end of a string too.
But if you did want to exclude the start/end of a string, you can do that like this:
(?<!x|^)x(?!x|$)
Where the "^" indicates start of string and the "$" indicates end of string.
You can also use "(?<!x|\A)x(?!x|\z)" which is slightly more correct, since "^" and "$" can sometimes match start/end of lines, not just the entire string, whereas "\A" and "\z" always apply to the whole string.
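A short Java sketch of the difference, again assuming literal `&nbsp;` entities and using made-up names: the plain negative lookarounds replace an entity at either end of the string, while the `\A`/`\z` variant leaves those alone.

```java
// Hypothetical demo class comparing the two start/end behaviours.
final class NbspAnchorDemo {
    // Lone entity anywhere, including the very start or end of the string.
    static String anywhere(String input) {
        return input.replaceAll("(?<!&nbsp;)&nbsp;(?!&nbsp;)", " ");
    }

    // Lone entity, but never at the start (\A) or end (\z) of the string.
    static String notAtEdges(String input) {
        return input.replaceAll("(?<!&nbsp;|\\A)&nbsp;(?!&nbsp;|\\z)", " ");
    }

    public static void main(String[] args) {
        String text = "&nbsp;a&nbsp;b&nbsp;";
        System.out.println(anywhere(text));    // prints " a b "
        System.out.println(notAtEdges(text));  // prints "&nbsp;a b&nbsp;"
    }
}
```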
Problem 3: How to only match a single X not surrounded by non-visible characters.
As I said, correctly handling HTML is too much for a single expression of this type.
However, if you do want to exclude an nbsp that appears on its own, but surrounded by newlines or spaces, instead of "\p{Print}" you might consider simply using "\s" which indicates "any whitespace" and is probably enough for what you need.
This can work in the same way as above: (?<!x|\s)x(?!x|\s)
That will only match individual Xs which are not preceded OR succeeded by whitespace. If you want to allow spaces but not newlines or tabs, then using "[\r\n\t\v]" instead of "\s" can do that.
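Sketched in Java with the nbsp as a literal `&nbsp;` entity (names are hypothetical), the whitespace-aware version looks roughly like this:

```java
// Hypothetical demo class for the whitespace-excluding expression.
final class NbspWhitespaceDemo {
    // Lone entity that has neither another entity nor whitespace beside it.
    static final String REGEX = "(?<!&nbsp;|\\s)&nbsp;(?!&nbsp;|\\s)";

    static String replace(String input) {
        return input.replaceAll(REGEX, " ");
    }

    public static void main(String[] args) {
        // The entity after the newline is left alone; the first one is replaced.
        System.out.println(replace("a&nbsp;b\n&nbsp;c"));
    }
}
```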
So, in summary, the closest functionality to your original expression (but only replacing non-consecutive ones) is probably:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<!&nbsp;|\A|[\r\n\t\v])&nbsp;(?!&nbsp;|\z|[\r\n\t\v])"," ");
But personally, I think I'd be inclined to go with just:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<!&nbsp;|^|\s)&nbsp;(?!&nbsp;| )"," ");
(And yeah, they might both be more complex expressions than what you first had, but they're also more accurate/correct.)
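For reference, a Java sketch of the first summary expression, assuming literal `&nbsp;` entities. The names are hypothetical, and the vertical tab is written as `\x0B` here rather than `\v`, on the assumption that a plain vertical tab is what was meant.

```java
// Hypothetical demo class for the combined summary expression.
final class NbspSummaryDemo {
    // Characters that should block a replacement when next to the entity.
    static final String BREAKS = "[\\r\\n\\t\\x0B]";

    // Lone entity, not at the string edges, not next to a line break or tab.
    static final String REGEX =
            "(?<!&nbsp;|\\A|" + BREAKS + ")&nbsp;(?!&nbsp;|\\z|" + BREAKS + ")";

    static String replace(String input) {
        return input.replaceAll(REGEX, " ");
    }

    public static void main(String[] args) {
        String text = "&nbsp;&nbsp;A.&nbsp;Some&nbsp;text\n&nbsp;x&nbsp;";
        // Only the two entities between "A.", "Some" and "text" are replaced.
        System.out.println(replace(text));
    }
}
```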
Anyway... yikes... this has been a long reply - but hopefully it's been a helpful one!
If I've been unclear about anything, or you just have questions about it, let me know and I'll try to explain/clarify.