awk -- escaping the delimiter



Kennedy
05-26-2010, 09:03 PM
Okay, let's say I have the following file:
var0:0x9453!var1:Some random string!var2:"Stop!" the woman!var3:00432123432123432885etc, etc, etc.

Okay, now what I want to do is break out the various data points on some delimiter. In this case it would be the !, though in the "real" file it would probably be :. I would expect to use awk on this line (after tailing the last line, which is the only line I'm really interested in), looping as many times as I need to get all the data. So, the loop would look something like this:
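LINE=`tail -n1 filename`
for i in 0 1 2 3 ; do
    # awk fields are numbered from 1, so field j holds var$i
    j=$((i+1))
    # -F (capital) sets the field separator; lowercase -f names a program file
    CMDSTR="VAR${i}=\`echo \"\$LINE\" | awk -F'!' -v n=${j} '{print \$n}'\`"
    eval "$CMDSTR"
done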
But the problem comes in where I have the ! after var2's Stop. So what I need is a way to allow escaping (or quoting) of the "text" in each field so that the awk command doesn't split on it.

Any ideas?

MK27
05-27-2010, 08:31 AM
I dunno awk, but if I understand your problem correctly it is a common one when parsing, say, XML or HTML, where you may want to ignore things inside and/or outside of <>. It is not as simple as you might think. My approach is to split the line something like this:



[root~] LINE='var0:0x9453!var1:Some random string!var2:"Stop!" the woman!var3:0043212343212343288'
[root~] echo $LINE | perl -ne '@ray=split/"/,$_; $i=-1;
foreach(@ray){ $i++; next if($i%2); $_=~s/\!/_MK_/g; }
$line=join("\"",@ray); @ray=split/_MK_/,$line;
$line=join("\n",@ray); print $line'
var0:0x9453
var1:Some random string
var2:"Stop!" the woman
var3:0043212343212343288


The _MK_ placeholder is not a satisfying hack, but that's what you get in "one line".* But it does work. In reality, if that is not feasible, I'd store the various parts and reassemble them without it. In any case it is better done with an external script. I imagine awk is capable; it has arrays and regexps, right? So do it in an external script. I suppose you could do this with a combination of bash and awk in a short function too. I don't enjoy bash that much, though.

I'd love to hear if someone has a simpler alternative to my divide and conquer algorithm**. The only other solution I see is full-blown multipass parsing (and the one-liner above does in fact make multiple passes).

*also, that one presumes the line does not begin with ". Again, that is something better dealt with in a real function or stdin->stdout script.
** In case that is not clear: split the line into an array on ". Replace your delimiter with a placeholder that cannot be present (e.g., _MK_), but only in the even-numbered elements (0, 2, 4, etc.), which are the parts outside the quotes. Join the array back into a line. Now use the placeholder as the delimiter.
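
For anyone without perl, the steps in that footnote could be sketched in awk roughly as follows. This is only a sketch: split_outside_quotes is a made-up function name, and it assumes that _MK_ never appears in the data and that the quotes on each line are balanced.

```shell
# split_outside_quotes is a hypothetical helper name, not from the thread.
# Reads lines on stdin and prints one field per line, splitting on !
# only where the ! sits outside double quotes.
split_outside_quotes() {
    awk '{
        n = split($0, part, "\"")          # pieces between the quotes
        line = ""
        for (i = 1; i <= n; i++) {
            # split() numbers from 1, so odd i is outside the quotes
            if (i % 2) gsub(/!/, "_MK_", part[i])
            line = line (i > 1 ? "\"" : "") part[i]
        }
        m = split(line, field, "_MK_")     # placeholder is now the delimiter
        for (i = 1; i <= m; i++) print field[i]
    }'
}
```

It would be used as tail -n1 filename | split_outside_quotes. Note that awk's split() counts from 1, so the "even numbered elements (0,2,4)" of the perl version become the odd ones here.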

Kennedy
05-27-2010, 12:59 PM
::sighs heavily:: I don't have perl.

I may have a way to do this, however, that compares the whole lines themselves rather than looking at the fragments of the lines. The split would then be a header with a colon, then the data. This is easy enough, as I can get the header through an awk script, then sed out the header to get the data.
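
The header/data split described here might look something like the sketch below. The names field_header and field_data are invented for illustration, and it assumes the header itself never contains a colon, so everything up to the first : is the header and the rest is data.

```shell
# field_header and field_data are hypothetical helper names.
# Each takes one "header:data" string as its first argument.

# Print the part before the first colon.
field_header() {
    printf '%s\n' "$1" | awk -F: '{ print $1 }'
}

# Strip everything up to and including the first colon.
field_data() {
    printf '%s\n' "$1" | sed 's/^[^:]*://'
}
```

Because sed only removes the first header, any later colons or ! characters in the data come through untouched.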

And yes, you do understand the problem, and YES, it is a pain in the butt to do this in a C app, much more so in "simple" bash scripting (I don't have a full-blown bash, as this is an embedded system).

zacs7
05-27-2010, 08:20 PM
awk to the rescue!

Somewhat of a hack, but it works as far as my short testing goes:


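BEGIN {
    inq = 0;        # 1 while inside an open double quote
    RS = "!";       # record separator
    FS = "\n";
}

# An odd number of quotes in a record opens or closes a quoted run;
# gsub() returns how many quotes it matched.
/"/ {
    if (gsub(/"/, "&") % 2) {
        inq = !inq;
    }
}

{
    if (inq) {
        # the ! was inside quotes: put it back and keep the line open
        printf "%s%s", $0, RS;
    } else {
        print $0;
    }
}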


Running it:


zac@breeze:cboard (0) $ cat line | awk -f test.awk
var0:0x9453
var1:Some random string
var2:"Stop!" the woman
var3:00432123432123432885

zac@breeze:cboard (0) $ cat line
var0:0x9453!var1:Some random string!var2:"Stop!" the woman!var3:00432123432123432885

MK27
05-27-2010, 09:03 PM
Hmmm. Me likes the look of that. Should grok awk I guess.

I think that BS about requiring multipass parsing I spouted earlier was because going byte by byte is actually a pain in perl (so I wouldn't bother using a "state" flag), and I've never bothered to do this in C. After that I just stopped thinking.

zacs7
05-27-2010, 10:12 PM
I don't think it's a pain in the butt in a C program at all... Finite state machine anyone? :)

Proof-of-concept:


zac@neux:cboard (0) $ cat line | ./fs
var0:0x9453
var1:Some random string
var2:"Stop!" the woman
var3:00432123432123432885
zac@neux:cboard (0) $ cat fs.c
#include <stdio.h>

/* Print line, turning each '!' outside double quotes into a newline. */
void magic(const char * line)
{
    int state = 0; /* 0 = outside quotes (start state), 1 = inside quotes */

    while(*line)
    {
        switch(state)
        {
            case 0:
                if(*line == '"')
                {
                    /* start quote */
                    state = 1;
                    putchar(*line);
                }else if(*line == '!'){
                    /* unquoted delimiter: end the field */
                    putchar('\n');
                }else{
                    putchar(*line);
                }
                break;

            case 1:
                putchar(*line);

                /* end quote */
                if(*line == '"')
                {
                    state = 0;
                }
                break;

            default:
                puts("Zac can't program");
        }

        ++line;
    }
}

int main(void)
{
    char line[256];

    /* only the first line of input is handled */
    if(fgets(line, sizeof line, stdin) == NULL)
    {
        return 1;
    }

    magic(line);

    return 0;
}


It's the same principle for escaping...
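
To make the escaping case concrete, here is a rough awk sketch of the same state-machine idea, kept in awk since perl is not available on the target. The backslash as escape character and the name split_escaped are assumptions for illustration; nothing in the thread fixes either one.

```shell
# split_escaped is a hypothetical helper name; using backslash as the
# escape character is an assumption, not something the thread settled on.
# Reads lines on stdin and prints one field per line, splitting on !
# unless the ! is preceded by a backslash.
split_escaped() {
    awk '{
        out = ""; esc = 0
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (esc)            { out = out c; esc = 0 }   # escaped char is literal
            else if (c == "\\") { esc = 1 }                # drop the backslash itself
            else if (c == "!")  { print out; out = "" }    # unescaped delimiter
            else                { out = out c }
        }
        print out                                          # last field on the line
    }'
}
```

The esc flag plays the role of the quote state in the C version: it is set by the backslash and cleared by whatever character follows, so the delimiter only splits when the flag is off.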

MK27
05-27-2010, 10:54 PM
I don't think it's a pain in the butt in a C program at all... Finite state machine anyone? :)

Probably the best idea.

itCbitC
06-07-2010, 12:14 AM
just my 2c


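awk -F! '{
    flag = 0    # 1 while inside an open double quote
    for (i=1;i<=NF;i++) {
        # an odd number of quotes in a field opens or closes a quoted run
        s = $i
        if (gsub(/"/, "&", s) % 2) flag = !flag
        if (flag)
            printf("%s%s", $i, FS)    # inside quotes: put the ! back
        else
            print $i
    }
}' file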

output

var0:0x9453
var1:Some random string
var2:"Stop!" the woman
var3:00432123432123432885