In an odd turn, I was given text like the following to display on a page:
2&nbsp;&nbsp;&nbsp;&nbsp;A.&nbsp;&nbsp;Some&nbsp;subheader&nbsp;here<br />
3&nbsp;&nbsp;&nbsp;&nbsp;B.&nbsp;&nbsp;Some&nbsp;other&nbsp;subheader&nbsp;here<br />
Two issues here (aside from the fact that it should have been an HTML list): 1) it needed to retain the spacing format, and 2) it needed to wrap within a sized element. A non-breaking space (&nbsp;), when viewed, appears as a space, but isn't an actual space, so the browser doesn't know where to break the text when wrapping it in an element. Hmmmm....


#1 by Gus on 2/28/11 - 9:21 AM
reReplaceNoCase(str,'([[:print:]])(&nbsp;)([[:print:]])','\1 \3','all')
Gus
#2 by Peter Boughton on 2/28/11 - 11:19 AM
You get exactly the same result with this:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(\p{Print})&nbsp;(\p{Print})","$1 $2");
Remember, a CFML String is a Java String, so you can apply Java String operations directly to it.
And before I go into detail of why this is all wrong ;) I'll just show a quick general optimisation:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<=\p{Print})&nbsp;(?=\p{Print})"," ");
This uses lookbehind and lookahead to ensure that the characters are there, but does not include them in the match, so it does not need to remember the actual characters themselves - it simply remembers the location and then replaces those six characters (&nbsp;) with a single space.
That's of course trivial in this situation, but if you were matching more than a single character each side and including this expression in a big loop, it might become important, and it's useful to be aware of things like this.
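To make the difference between the two expressions concrete, here is a minimal Java sketch. It assumes the text contains the literal six-character entity `&nbsp;`; the class and method names are made up for illustration. It prints what each pattern actually consumes on a simple input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundMatchDemo {
    // Capturing version: the neighbouring characters are part of the match.
    static final Pattern WITH_GROUPS =
            Pattern.compile("(\\p{Print})&nbsp;(\\p{Print})");

    // Lookaround version: the neighbours are checked but not consumed.
    static final Pattern WITH_LOOKAROUND =
            Pattern.compile("(?<=\\p{Print})&nbsp;(?=\\p{Print})");

    // Returns the text consumed by the first match, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "Some&nbsp;subheader";

        System.out.println(firstMatch(WITH_GROUPS, input));     // e&nbsp;s
        System.out.println(firstMatch(WITH_LOOKAROUND, input)); // &nbsp;

        // On a lone entity both replacements give the same text.
        System.out.println(WITH_GROUPS.matcher(input).replaceAll("$1 $2"));
        System.out.println(WITH_LOOKAROUND.matcher(input).replaceAll(" "));
    }
}
```

Both replacements print `Some subheader` here; the only difference is how much text each match covers.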
However, DON'T switch to using this second version, because it will highlight why this approach is not actually doing what you think it is, and will in fact result in most of your nbsp being replaced.
1. The main problem is that this doesn't explicitly ignore consecutive non-breaking spaces. In a consecutive set it replaces every other one. (Because of the way the matcher/replaceAll works, each search continues from the end of the previous match. When you match the first nbsp of a sequence, the match also consumes the character on either side of it, and the character after it is the "&" of the next nbsp. That second nbsp can therefore no longer match, and the search resumes at the third.) Given the use-case, this might not actually matter for you, but it's certainly something to be aware of if you tried to apply the same technique in other situations.
2. Something else that might not be significant in this situation, but is important to be aware of for this general type of problem - you're looking specifically for a printable character before and after your nbsp - this means that an nbsp at the start/end of the string will NOT get matched. (May or may not be desired behaviour.)
3. The other problem here is that you're matching "any printable character", but seem to be misunderstanding what that actually means, (or perhaps how Regex works) and it's not doing what you think it is.
Regex works against plain text. Regex has no knowledge of HTML at all. It does not treat < any differently to any other character, and in regex the print class means "every VISIBLE character, plus space". Basically, everything from ASCII Chr 32 to Chr 126 (not sure about higher chars) is matched by \p{Print} (or [[:print:]]). It will not treat an nbsp inside a tag any differently from one outside a tag. (However, it will prevent a single nbsp before or after a newline from being replaced, which may or may not be desired.)
If you want to preserve the nbsp in <img alt="x&nbsp;y"/> - then you're getting into the foggy area of HTML parsing, which can easily choke regex.
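The three points above can be checked with a short Java sketch. It assumes the input holds the literal entity `&nbsp;`, and the class and method names are hypothetical, chosen only for this illustration.

```java
final class ThreeProblemsDemo {
    // Original approach: capture the neighbours and put them back.
    static String withGroups(String s) {
        return s.replaceAll("(\\p{Print})&nbsp;(\\p{Print})", "$1 $2");
    }

    // Lookaround variant: neighbours are checked but not consumed.
    static String withLookaround(String s) {
        return s.replaceAll("(?<=\\p{Print})&nbsp;(?=\\p{Print})", " ");
    }

    // True if the given single-character string is matched by \p{Print}.
    static boolean isPrint(String ch) {
        return ch.matches("\\p{Print}");
    }

    public static void main(String[] args) {
        // 1. Consecutive entities: the capturing version replaces every
        //    other one, the lookaround version replaces all of them.
        String run = "3&nbsp;&nbsp;&nbsp;&nbsp;B.";
        System.out.println(withGroups(run));     // 3 &nbsp; &nbsp;B.
        System.out.println(withLookaround(run)); // 3    B.

        // 2. An entity at the very start or end has no neighbour on one
        //    side, so neither version touches it.
        System.out.println(withLookaround("&nbsp;x&nbsp;")); // &nbsp;x&nbsp;

        // 3. \p{Print} is just "visible character or space"; it knows
        //    nothing about HTML, so text inside a tag is treated the same.
        System.out.println(isPrint("<"));  // true
        System.out.println(isPrint(" "));  // true
        System.out.println(isPrint("\n")); // false
        System.out.println(withLookaround("<img alt=\"x&nbsp;y\"/>")); // <img alt="x y"/>
    }
}
```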
So those are the three problems, and now you probably just want a solution. :)
To avoid horrible complexity, I'm going to ignore the HTML aspect - if you do need to perform differently inside tags to outside, and this is a recurring issue not a one-off, then you need a solution more complex than a single regex, which is aware of when it is inside/outside a tag and behaves appropriately.
Problem 1: How do you match a single instance of X, but don't match a consecutive set of Xs.
Answer: Use negative lookbehind and negative lookahead.
Like this: (?<!x)x(?!x)
Or this: (?<!&nbsp;)&nbsp;(?!&nbsp;)
If either or both lookbehind/ahead matches an nbsp the match will fail. Therefore, the match will only succeed when an nbsp is alone (irrespective of what characters are before or after it).
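As a rough Java sketch of that expression (again assuming the literal entity `&nbsp;`; the class and method names are invented for the example), applied to text shaped like the sample in the post:

```java
final class LoneNbspDemo {
    // Replace an &nbsp; only when it is not directly preceded or followed
    // by another &nbsp;.
    static String replaceLone(String s) {
        return s.replaceAll("(?<!&nbsp;)&nbsp;(?!&nbsp;)", " ");
    }

    public static void main(String[] args) {
        String line = "3&nbsp;&nbsp;&nbsp;&nbsp;B.&nbsp;&nbsp;Some&nbsp;subheader";
        // Runs of four and two are left alone; the single one becomes a space.
        System.out.println(replaceLone(line));
        // 3&nbsp;&nbsp;&nbsp;&nbsp;B.&nbsp;&nbsp;Some subheader
    }
}
```

The runs keep the spacing format, while the lone entities between words become ordinary spaces the browser can wrap on.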
Problem 2: How do you match at the start/end of a string.
Answer: Use the same solution - because a negative lookbehind/lookahead confirms the ABSENCE of something, it works at the start and end of a string too.
But if you did want to exclude the start/end of a string, you can do that like this:
(?<!x|^)x(?!x|$)
Where the "^" indicates start of string and the "$" indicates end of string.
You can also use "(?<!x|\A)x(?!x|\z)" which is slightly more correct, since "^" and "$" can sometimes match start/end of lines, not just the entire string, whereas "\A" and "\z" always apply to the whole string.
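A small Java sketch of the start/end behaviour, assuming the literal entity `&nbsp;` in place of "x" (class and method names are hypothetical). The last method turns on multiline mode with `(?m)` to show the case where "^" matches at the start of a line rather than only at the start of the string.

```java
final class AnchorExclusionDemo {
    // Plain negative lookarounds: start and end of string are allowed.
    static String loneAnywhere(String s) {
        return s.replaceAll("(?<!&nbsp;)&nbsp;(?!&nbsp;)", " ");
    }

    // \A and \z always refer to the whole string.
    static String loneNotAtEdges(String s) {
        return s.replaceAll("(?<!&nbsp;|\\A)&nbsp;(?!&nbsp;|\\z)", " ");
    }

    // With (?m), ^ also matches right after a line break.
    static String loneNotAtLineStart(String s) {
        return s.replaceAll("(?m)(?<!&nbsp;|^)&nbsp;(?!&nbsp;)", " ");
    }

    public static void main(String[] args) {
        String edges = "&nbsp;a&nbsp;b&nbsp;";
        System.out.println(loneAnywhere(edges));   // " a b " (all three replaced)
        System.out.println(loneNotAtEdges(edges)); // &nbsp;a b&nbsp;

        String twoLines = "a\n&nbsp;b";
        // \A only excludes the start of the string, so this one is replaced.
        System.out.println(loneNotAtEdges(twoLines).equals("a\n b"));     // true
        // In multiline mode ^ excludes the start of the second line too.
        System.out.println(loneNotAtLineStart(twoLines).equals(twoLines)); // true
    }
}
```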
Problem 3: How to only match a single X not surrounded by non-visible characters.
As I said, correctly handling HTML is too much for a single expression of this type.
However, if you do want to exclude an nbsp that appears on its own, but surrounded by newlines or spaces, instead of "\p{Print}" you might consider simply using "\s" which indicates "any whitespace" and is probably enough for what you need.
This can work in the same way as above: (?<!x|\s)x(?!x|\s)
That will only match individual Xs which are not preceded OR succeeded by whitespace. If you want to allow spaces but not newlines or tabs, then using "[\r\n\t\v]" instead of "\s" can do that.
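Here is how the "\s" variant might look in Java, assuming the literal entity `&nbsp;` in place of "x". The class and method names are illustrative only.

```java
final class WhitespaceNeighbourDemo {
    // Replace a lone &nbsp; unless it touches another &nbsp; or any whitespace.
    static String replace(String s) {
        return s.replaceAll("(?<!&nbsp;|\\s)&nbsp;(?!&nbsp;|\\s)", " ");
    }

    public static void main(String[] args) {
        System.out.println(replace("Some&nbsp;subheader")); // Some subheader
        System.out.println(replace("a &nbsp;b"));           // a &nbsp;b (space before it)
        // An entity followed by a newline is also left untouched.
        System.out.println(replace("a&nbsp;\nb").equals("a&nbsp;\nb")); // true
    }
}
```

Unlike the "\p{Print}" version, this still replaces a lone entity at the very start or end of the string, because the negative lookarounds only rule out whitespace and other entities.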
So, in summary, the closest functionality to your original expression (but only replacing non-consecutive ones) is probably:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<!&nbsp;|\A|[\r\n\t\v])&nbsp;(?!&nbsp;|\z|[\r\n\t\v])"," ");
But personally, I think I'd be inclined to go with just:
REQUEST.finalValue = REQUEST.matchThis.replaceAll("(?<!&nbsp;|^|\s)&nbsp;(?!&nbsp;| )"," ");
(And yeah, they might both be more complex expressions than what you first had, but they're also more accurate/correct.)
Anyway... yikes... this has been a long reply - but hopefully it's been a helpful one!
If I've been unclear about anything, or you just have questions about it, let me know and I'll try to explain/clarify.