
RegExp insanity

OK so I'm trying to parse a CSV in Perl, and I'm butting my head against a problem that should be EASY to solve. Clearly lack of sleep is at play here.

All right, so the file is comma separated. Some of the fields CONTAIN commas, so that's handled by enclosing the field value in double quotes. I don't need to retain these non-delimiter commas for my purposes, so I'm just eliminating them.

I wrote a regexp:

$row =~ s/"(.*?),(.*?)"/$1$2/g;

Which has the obvious flaw that it will only work if there's only a single comma in the quote-contained field. How do I fix this so that it works for an arbitrary number of non-consecutive (intervening text) commas?

Like I said, this should be cake.
Permalink Generic Error 
January 13th, 2006
Use a Perl CSV library.
Permalink Almost H. Anonymous 
January 13th, 2006
This might help http://rath.ca/Misc/Perl_CSV/index.shtml
Permalink Simon Lucy 
January 13th, 2006
Better still: http://search.cpan.org/search?query=csv&mode=all
Permalink Almost H. Anonymous 
January 13th, 2006
No, it isn't cake. Using a regexp for parsing certain kinds of strings can in the worst case be impossible.

You really really need to find and use the right Perl module for this kind of thing, rather than trying to reinvent wheels.
Permalink Ian Boys 
January 13th, 2006
I don't want a library, I want a regexp. :P
Permalink Generic Error 
January 13th, 2006
I will absolutely guarantee you that the parsing I need to do in this case can ABSOLUTELY be done with a regexp.
Permalink Generic Error 
January 13th, 2006
Go ahead, reinvent the wheel!
Permalink Almost H. Anonymous 
January 13th, 2006
Let's see, what do I want to do here:

write a single line of code that will do exactly what I need

OR

import a 50,000 LOC library module for the one fucking function I want?
Permalink Generic Error 
January 13th, 2006
Regardless it's a fun question. Offhand nothing springs to mind apart from recursive searching, or maybe something to do with making it non-greedy. Hrmm.
Permalink Dennis Forbes 
January 13th, 2006
Waste an hour (or more) trying to write the single line of code (perhaps discovering you cannot do it in a single line).

OR

Waste 5 minutes downloading and installing a module. And they do have SMALL CSV parsing libraries.
Permalink Almost H. Anonymous 
January 13th, 2006
"I will absolutely guarantee you that the parsing I need to do in this case can ABSOLUTELY be done with a regexp."

Well, then do it yourfuckingself Mr. Smartypants. Sheesh!
Permalink Star Wars Kid 
January 13th, 2006
Look at Text::ParseWords

It's standard, so nothing to download and install.
Permalink Ian Boys 
January 13th, 2006
Assuming I understand your problem correctly, why not try this:

$row =~ s/"((.*?),(.*?))*"/$1/g;

(btw, the third option is to get hapless ?off people to write your one line for you)
Permalink Almost H. Anonymous 
January 13th, 2006
Or don't, 'cause that's not going to work! I don't think you can do that with a search-and-replace regexp, but you could match it, extract the component pieces, and recombine them. More than 1 line but still not a terribly large number.

I don't remember enough about perl to bang out a 3-line solution to the problem.
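
For what it's worth, the match, extract, and recombine idea might look something like the sketch below. It assumes quoted fields never contain a double quote themselves, and the sub name strip_quoted_commas_by_split is made up for illustration. split with a capturing group returns the quoted fields along with the text between them, so only the quoted pieces get touched.

```perl
use strict;
use warnings;

# Hypothetical helper: remove commas that sit inside "quoted" fields only.
# Assumes a quoted field never contains a double quote of its own.
sub strip_quoted_commas_by_split {
    my ($row) = @_;

    # The capturing group makes split return the quoted fields as well
    # as the text between them.
    my @pieces = split /("[^"]*")/, $row;

    for my $piece (@pieces) {
        # Only complete quoted fields lose their commas.
        $piece =~ tr/,//d if $piece =~ /\A"[^"]*"\z/;
    }

    return join '', @pieces;
}

print strip_quoted_commas_by_split('a,"b,c,d",e,"f,g"'), "\n";
```

A row with an unbalanced quote passes through unchanged, since nothing in it matches a complete quoted field.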
Permalink Almost H. Anonymous 
January 13th, 2006
Any time "arbitrary number of" and "regexp" appear in the same sentence, you should start hearing alarm bells.
Permalink Ian Boys 
January 13th, 2006
I think it could easily be done with a non-greedy search, which is how you can accommodate an arbitrary number of matches: it requires the bounding elements but doesn't actually include them as part of the "match". However, there are lots of caveats, such as: what if there is a quote in the quoted field?
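
On the quote-in-a-quoted-field caveat: many CSV writers escape an embedded quote by doubling it (""), though nothing in this thread says the file in question does. Under that assumption, a single substitution with the /e flag can match each quoted field whole and strip the commas from it. A sketch, with a made-up sub name:

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes embedded quotes are escaped by doubling
# them (""), which is common for CSV but not confirmed for this file.
sub strip_commas_in_quotes {
    my ($row) = @_;

    # Match one whole quoted field: an opening quote, then any run of
    # non-quote characters or doubled quotes, then the closing quote.
    # /e evaluates the replacement as code, /g repeats for every field.
    $row =~ s{"((?:[^"]|"")*)"}{ (my $f = $1) =~ tr/,//d; qq("$f") }ge;

    return $row;
}

print strip_commas_in_quotes('1,"Smith, ""Bob"", Jr.",x'), "\n";
```

Because each match covers an entire field, the number of commas inside it doesn't matter, and the delimiter commas between fields are never part of a match.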
Permalink Dennis Forbes 
January 13th, 2006
write a single line of code that will do exactly what I need

OR

import a 50,000 LOC library module for the one fucking function I want


That's like asking - Is 2*2 = 4 OR 3 - 1

There's a chance you'll go bald overnight, though.
Permalink Vineet Reynolds 
January 13th, 2006
Vineet = Rick Tang ?
Permalink Ward Bush 
January 13th, 2006
Heck no.
But you made me visit his site.
That's the oldest "updated" page I've seen though.

http://www.geocities.com/SouthBeach/7273/
Permalink Vineet Reynolds 
January 13th, 2006
You sound like him.

I can't tell if you've found the right Rich Tang. Yours has a hot girlfriend, I'd do her...
Permalink Ward Bush 
January 13th, 2006
CSV cannot be parsed by a regular expression. It requires a push-down automaton to recognize the language.
Permalink Devil's Advocate 
January 13th, 2006
I'm not trying to parse the entire CSV with a single regular expression. I'm trying to remove non-delimiter commas from a string.
Permalink Mark Warner 
January 13th, 2006
Let's take the following string:

"Hello, fun world!","Oh, that isn't nice.","Good bye, bad, cruel world!"

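There are six commas in that string, and every one of them sits between a pair of quotes: the four you want to remove are inside quoted fields, and the two delimiters sit between one field's closing quote and the next field's opening quote. So first of all you have to distinguish the commas you want to remove from the ones you don't. Maybe you can say there has to be at least one non-comma character after the quote and before the comma.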

Then you might think to use the /g flag on the substitution to catch as many commas as there are in the string. But that doesn't work because there could be more than one comma between a single pair of quotes and the next search will pick up where the previous one left off. So you might be better off repeating the replacement over and over until there are no more matches. One way could be like this:

$string=q/"Hello, fun world!","Oh, that isn't nice.","Good bye, bad, cruel world!"/;
while ($string=~s/("[^,]+),/$1/g) {} # repeat substitution until no more matches

There may be other ways, but this is the first way that comes to mind.
Permalink Ian Boys 
January 13th, 2006
Oh, and you can spot the error in my regex if you like. It doesn't quite work as given.
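
One reading of the error (a guess, since the thread never spells it out): [^,] also matches a double quote, so once the inner commas are gone, a later pass matches from an opening quote straight across the closing quote and eats the delimiter comma too. Excluding the quote from the character class is one possible fix. The sketch below runs both versions on the same sample string; the variable names are made up, and it still assumes no quotes inside quoted fields.

```perl
use strict;
use warnings;

my $string = q/"Hello, fun world!","Oh, that isn't nice.","Good bye, bad, cruel world!"/;

# The loop as posted: [^,]+ can run across a closing quote, so a later
# pass also removes the delimiter commas.
my $as_posted = $string;
while ($as_posted =~ s/("[^,]+),/$1/g) {}

# Possible fix: also exclude the quote, so a match cannot extend past
# the field it started in.
my $tightened = $string;
while ($tightened =~ s/("[^",]+),/$1/g) {}

print "as posted: $as_posted\n";
print "tightened: $tightened\n";
```

With the tightened class, the two delimiter commas survive and only the four commas inside the fields are removed.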
Permalink Ian Boys 
January 13th, 2006
However, I still think this is a better way to go:

use Text::ParseWords;

$string = q/"Hello, fun world!","Oh, that isn't nice.","Good bye, bad, cruel world!"/;

@words = parse_line(',', 0, $string);

$, = "\n";
print @words;
Permalink Ian Boys 
January 13th, 2006
Right, once you've captured your quoted fields, 'splice()' out the comma(s) from it (delete the comma).
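
A sketch of that last step, building on the Text::ParseWords example above. Since splice() works on arrays rather than strings, this uses tr///d to delete the commas from each captured field, then joins the fields back into a row. Note that parse_line with a false second argument drops the surrounding quotes, and that it treats single quotes as quote characters too, so an unquoted field containing an apostrophe may not parse the way you expect.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $string = q/"Hello, fun world!","Oh, that isn't nice.","Good bye, bad, cruel world!"/;

# Split on delimiter commas only; keep = 0 drops the enclosing quotes.
my @fields = parse_line(',', 0, $string);

# Delete every remaining comma inside each field.
tr/,//d for @fields;

# Reassemble the row now that the only commas left are delimiters.
my $row = join ',', @fields;
print "$row\n";
```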
Permalink LinuxOrBust 
January 14th, 2006

This topic was originally posted to the off-topic forum of the
Joel on Software discussion board.
