May 14, 2003

Borland and Ox output parsing

Thanks very much to Patrick of OnlyTheBestFreeware fame for providing sample output for both Borland C++ 5.5.1 and Ox 3.20. I've decided (for the moment) to add built-in support for Borland C++ and have checked that custom parsing supports Ox.

So, Borland C++ messages are now handled by default using this regular expression:

(Error|Warning) ((E|W)[0-9]{4} )?(?U)(?P<f>.+) ((?P<c>[0-9]+) )?(?P<l>[0-9]+): [^\s]

This assumes that error or warning codes always start with E or W and are followed by four digits. If you know different, then obviously let me know. This one regular expression matches both the error and warning formats from Borland C++, and also the error output from the Borland resource compiler. I'm beginning to really like regular expressions - I've never had much cause to use them in anger before.

Does anyone who is more experienced at regular expressions have any suggestions for improvements here?

For those not familiar with regular expressions, I'll provide a quick explanation of the large expression above. I'm not going to explain actual regular expression syntax - that is too much information for here. I'll assume that from my description of the functional blocks of the expression you can figure the rest out. First let me show you some trimmed example messages:

Borland C++ Error: Error E2034 main.cpp 207: message...
Borland C++ Warning: Warning W8070 main.cpp 208: message...
Borland Resource Compiler Error: Error resources.rc 14 18: message...

So, the first thing we notice is that all messages begin with either "Error" or "Warning". We can match either of these by using: (Error|Warning)

Next, in the C++ messages there is an error or warning code, which takes the form of E or W followed by four digits. So we build a simple bit of regular expression that matches that:

(E|W)[0-9]{4}

The resource compiler message doesn't have one of these codes, so we need to make it an optional match. Notice also that if it is present, there is an extra space to be matched before the next item in the string. We add the space, and make the entire phrase optional:

((E|W)[0-9]{4} )?
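
For anyone who wants to try the two pieces built so far, here is a minimal sketch using Python's re module rather than PCRE (this part of the syntax is the same in both). The names PREFIX and samples are made up for the illustration; the three lines are the trimmed example messages from above.

```python
import re

# Leading keyword, a space, then an optional "E1234 " / "W1234 " code phrase.
PREFIX = re.compile(r"(Error|Warning) ((E|W)[0-9]{4} )?")

samples = [
    "Error E2034 main.cpp 207: message...",
    "Warning W8070 main.cpp 208: message...",
    "Error resources.rc 14 18: message...",
]

for line in samples:
    m = PREFIX.match(line)
    # group(2) is the optional code phrase; it is None when no code is present.
    print(repr(m.group(1)), repr(m.group(2)), "rest:", repr(line[m.end():]))
```

For the resource compiler line the optional phrase simply matches nothing, so whatever follows the keyword is left for the filename part of the expression.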

After this comes a special PCRE processing instruction: (?U) tells PCRE that quantifiers from here on should be un-greedy. Without this flag the filename match is greedy, meaning it matches as much as it can, so the resource compiler column number becomes part of the filename. Turning on un-greedy mode makes the filename match as little as it can, which lets the next expression match the column number before we get to the line number. This took me a while to work out and is, at least I think so, pretty damn smart - if a bit difficult to get your head around. The reason that this is partway through the expression is that putting it at the start causes the E or W code to become part of the filename for some reason. I'm too tired to get my head around why!
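
The effect of greediness can be reproduced with a small Python sketch. Python's re module has no (?U) instruction, so this is a translation rather than the expression PN uses: each quantifier that (?U) would affect is made lazy by hand with a trailing ?. The third pattern is a guess at what happens when (?U) is moved to the start, on the assumption that it also flips the optional ? on the code phrase. If that is right, the code phrase prefers to match nothing and the lazy filename grows until it has swallowed the code. The names greedy, lazy_tail, lazy_all and fields are invented for the example.

```python
import re

rc_line = "Error resources.rc 14 18: message..."
cpp_line = "Error E2034 main.cpp 207: message..."

# 1. Greedy filename (no un-greedy mode): .+ swallows the first number.
greedy = re.compile(
    r"(Error|Warning) ((E|W)[0-9]{4} )?(?P<f>.+) ((?P<c>[0-9]+) )?(?P<l>[0-9]+): [^\s]"
)
# 2. Un-greedy from the filename onwards, mirroring where (?U) sits above.
lazy_tail = re.compile(
    r"(Error|Warning) ((E|W)[0-9]{4} )?(?P<f>.+?) ((?P<c>[0-9]+?) )??(?P<l>[0-9]+?): [^\s]"
)
# 3. Un-greedy from the very start: the optional code phrase becomes lazy too.
lazy_all = re.compile(
    r"(Error|Warning) ((E|W)[0-9]{4} )??(?P<f>.+?) ((?P<c>[0-9]+?) )??(?P<l>[0-9]+?): [^\s]"
)

def fields(pattern, line):
    """Return the (filename, column, line) strings captured by the named groups."""
    m = pattern.match(line)
    return m.group("f"), m.group("c"), m.group("l")

print(fields(greedy, rc_line))     # filename absorbs the first number
print(fields(lazy_tail, rc_line))  # filename and both numbers separated
print(fields(lazy_all, cpp_line))  # error code ends up inside the filename
```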

Next comes the filename, which we want to attach a "name" to so that we can pluck it out of the regular expression easily later. In PCRE, the library that PN is using for this, names are attached to phrases in the same way as in Python: begin a group with the standard bracket, and then include ?P<name>. We don't bother limiting the filename to any particular characters, because it can include most things, so we just match "one or more characters": .+.

I won't describe all of the rest because it's pretty similar - there's an optional match for the column number included in the resource compiler output and then a match for the line number. The line number and column number are named "l" and "c" respectively.
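
To show how the named groups are plucked out afterwards, here is a hedged Python sketch of the whole expression. It is a translation, not PN's code: (?U) is replaced by explicit lazy quantifiers because Python's re module does not support it, and the parse function and the non-matching "Linking main.exe" line are invented for the illustration.

```python
import re

# Python translation of the Borland expression; lazy quantifiers stand in for (?U).
BORLAND = re.compile(
    r"(Error|Warning) ((E|W)[0-9]{4} )?"
    r"(?P<f>.+?) ((?P<c>[0-9]+?) )??(?P<l>[0-9]+?): [^\s]"
)

def parse(text):
    """Return (filename, line, column) or None; column is None when absent."""
    m = BORLAND.search(text)
    if m is None:
        return None
    col = m.group("c")
    return (m.group("f"), int(m.group("l")), int(col) if col else None)

for text in (
    "Error E2034 main.cpp 207: message...",
    "Warning W8070 main.cpp 208: message...",
    "Error resources.rc 14 18: message...",
    "Linking main.exe",  # made-up line that should not be recognised
):
    print(parse(text))
```

The "f", "l" and "c" names are the same ones used in the expression, so nothing depends on counting brackets to find the right group.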

Ox messages, which take the general form:

filename (line): message

can be handled simply with:

%f \(%l\): .
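
As a rough illustration of how such a custom pattern could be turned into a full regular expression, here is a Python sketch. The expansions chosen for %f and %l are assumptions made for the example and may not be what PN actually substitutes; the expand function and the sample line "myprog.ox (12): message..." are likewise invented, following the general form shown above.

```python
import re

# Hypothetical expansions for the placeholders; PN's real substitutions may differ.
PLACEHOLDERS = {
    "%f": r"(?P<f>.+?)",     # filename: one or more characters, as few as possible
    "%l": r"(?P<l>[0-9]+)",  # line number: one or more digits
}

def expand(custom):
    """Replace each placeholder with a named group and compile the result."""
    for token, group in PLACEHOLDERS.items():
        custom = custom.replace(token, group)
    return re.compile(custom)

ox = expand(r"%f \(%l\): .")
m = ox.match("myprog.ox (12): message...")
print(m.group("f"), m.group("l"))
```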

I'm basing my decisions for default inclusion on whether the Scintilla output lexer currently supports an output format or not. Borland errors (and in fact ones almost the same as the Ox ones) are currently supported by Scintilla - which means that I can use Scintilla as an indication of which formats to parse.

In cases where Scintilla's own built-in error lexer does not support a format, the regular expression based lexer has to be used for the error to be recognised as a hotspot, so these formats will currently not be "built-in" to PN. I will provide a page on the website with common output formats so that people don't need to work them out themselves.

Whew! There's a devlog entry and a half for you! My thanks again to Patrick for so quickly responding with output examples.

Posted by Simon at May 14, 2003 10:24 PM

Glad I could help! I think it's okay to natively support Borland C++ (very well known and widely used), but let users create their own parser for Ox (hardly known and only used in academia).

Thanks for explaining these horrific Perl regular expressions; I understand half of it now ;-)


P.S. I've posted some more compiler output in the 'request for help' thread.

Posted by Patrick at May 15, 2003 9:30 AM

"This assumes that error or warning codes always start with E or W and are followed by four digits."

As far as I know and have been able to check, this is indeed correct.

Posted by Patrick at May 16, 2003 8:57 AM