How to separate a string which contains white space for different instructions

**RagingOrc** · 01-03-2013

Hey ppl,

So I'm playing around with C and I fancied creating an assembler as a little side project.

I'm going to pull in text, in a simple machine format, from files in the directory. I want to separate my instructions with whitespace, i.e. one space.

Now I was thinking of scanning the file with fgetc, reading one char at a time until it hits whitespace. That said, I'm having a lot of issues with it. As in, how do I hold and then work with that part of the text, how do I progress to the next part after the space, etc.

Have I got the right idea? How would you do it?

Thanks

**laserlight** · 01-03-2013

Well, you are on one possibly right track with respect to parsing the input. Of course, for what you have described, a simpler alternative could be to designate a maximum length for a token, then just scanf your way through the entire input, but then if you want to allow for comments, this might not be so easy to adapt as your character by character read idea.

**RagingOrc** · 01-03-2013

Indeed, I would preferably like to add comments, as they are seen in code every day, so it seems logical to include them. I'll also be using multiple lines, just to clarify if that wasn't obvious.

So is fgetc the best course of action for my code if I want comments and spaces to separate instructions etc.? Any more tips?

Many thanks

**laserlight** · 01-03-2013

I wouldn't say best, but it can work. In any case, you need to be clear on what exactly is the format of the input that you are going to parse.

**Malcolm McLean** · 01-03-2013

Treating input as a stream and pulling out each token with fgetc() will work. You will also need ungetc() to push back the tokens. For example, you'll want to read all the whitespace until you hit a non-whitespace character, then put it back for the instruction getter to read.
The snag is that you can't easily extend the program to use multi-character tokens, which is often a more intuitive way of parsing the stream.

You might want to download a copy of MiniBasic, how to write a script interpreter, if you are serious about creating an assembler.

**RagingOrc** · 01-03-2013

An example of a command parsed could be

ldc 5

a more complicated command could be something like...

Load Accum: ldc 5 ; a label and an instruction

The first part being the label, the second the instruction and the third a comment.

The first and third would be ignored etc.

I'm a little confused by what you mean Malcolm could you elaborate a bit or show an example?

Laserlight, I hope that clarifies things.

I'm still having trouble breaking it up and then using the first section before the break and the section after the break.

I don't really understand this.

It's weird, I've written a fair few C programs but creating this assembler just points out a large gaping hole in my knowledge base. It's good though in a way, as I get to learn more and can build this into my knowledge bank.

Any help would be very much appreciated

Many Thanks

**laserlight** · 01-03-2013

Originally Posted by RagingOrc

An example of a command parsed could be

ldc 5

a more complicated command could be something like...

Load Accum: ldc 5 ; a label and an instruction

The first part being the label, the second the instruction and the third a comment.

The first and third would be ignored etc.

(...)

Laserlight, I hope that clarifies things.

The clarification is not for me; it is for you.

Beyond examples, it may be helpful to express the syntax in, say, some variant of BNF notation, even though assembly languages tend to have pretty simple syntax.

Also, note that if you are going to allow for jumps, then it might not be feasible to ignore labels.

Originally Posted by RagingOrc

I'm a little confused by what you mean Malcolm could you elaborate a bit or show an example?

It depends on how you want to do it, so I don't agree that "you will also need ungetc() to push back the tokens", even though that is an approach that I might take myself.

That said, imagine that you have a piece of code responsible for discarding whitespace. So, it just keeps reading and doing nothing with the characters until it detects a non-whitespace character. At this point, you could have it process the character, but it can be simpler to just put the character back into the input stream (hence the use of ungetc()) then exit this piece of code to return to the main loop that reads in characters and processes them.
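As a rough sketch of that idea (the names skip_whitespace and read_word are made up for illustration, and a "word" here is assumed to be anything delimited by whitespace), the whitespace discarder and a reader that relies on it might look like this:

```c
#include <ctype.h>
#include <stdio.h>

/* Discard whitespace from `in`. The first non-whitespace character is
 * pushed back with ungetc() so that the caller's next read sees it.
 * Returns that character, or EOF if the input ran out. */
int skip_whitespace(FILE *in)
{
    int c;

    while ((c = fgetc(in)) != EOF && isspace(c))
        ;                       /* nothing to do with whitespace */
    if (c != EOF)
        ungetc(c, in);
    return c;
}

/* Read one whitespace-delimited word into buf (capacity n, n >= 1).
 * Characters that do not fit are read and dropped.
 * Returns the number of characters stored; 0 means end of input. */
size_t read_word(FILE *in, char *buf, size_t n)
{
    size_t len = 0;
    int c;

    skip_whitespace(in);
    while ((c = fgetc(in)) != EOF && !isspace(c))
        if (len + 1 < n)
            buf[len++] = (char)c;
    if (c != EOF)
        ungetc(c, in);          /* leave the delimiter for the caller */
    buf[len] = '\0';
    return len;
}
```

Because the delimiter is pushed back rather than consumed, the caller can still check whether a word was followed by a newline, which matters once comments run to the end of a line.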

Originally Posted by RagingOrc

I'm still having trouble breaking it up and then using the first section before the break and the section after the break.

One approach is to have lexical analysis as a separate step: you first break up the input into a list of tokens (e.g., a dynamic array). In this, I also disagree with the assertion that "you can't easily extend the program to use multi-character tokens". You can do it if you keep track of the current token, and store it in the list when you detect a character that is whitespace or the start of the next token.
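A minimal sketch of such a separate lexing step, under some assumptions that are not from the thread: labels are single words, ':' is treated as a token of its own, ';' starts a comment that runs to the end of the line, and struct token_list with its helper functions is a hypothetical design.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* A growable list of heap-allocated token strings (hypothetical type). */
struct token_list {
    char **items;
    size_t count;
    size_t capacity;
};

/* Append a copy of text[0..len) to the list.
 * Returns 0 on success, -1 on allocation failure. */
static int token_list_add(struct token_list *list, const char *text, size_t len)
{
    char *copy;

    if (list->count == list->capacity) {
        size_t cap = list->capacity ? list->capacity * 2 : 8;
        char **p = realloc(list->items, cap * sizeof *p);
        if (p == NULL)
            return -1;
        list->items = p;
        list->capacity = cap;
    }
    copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);
    copy[len] = '\0';
    list->items[list->count++] = copy;
    return 0;
}

/* Split src into tokens. Whitespace ends the current token, ':' is stored
 * as its own token, and ';' starts a comment that is skipped up to the end
 * of the line. Returns 0 on success, -1 on allocation failure. */
int tokenize(const char *src, struct token_list *list)
{
    const char *start = NULL;   /* start of the token being built, if any */
    const char *p;

    for (p = src; ; p++) {
        int boundary = *p == '\0' || *p == ':' || *p == ';'
                       || isspace((unsigned char)*p);
        if (!boundary) {
            if (start == NULL)
                start = p;
            continue;
        }
        if (start != NULL) {    /* a boundary ends the current token */
            if (token_list_add(list, start, (size_t)(p - start)) != 0)
                return -1;
            start = NULL;
        }
        if (*p == ':' && token_list_add(list, p, 1) != 0)
            return -1;
        if (*p == ';')
            while (*p != '\0' && *p != '\n')
                p++;            /* skip the comment text */
        if (*p == '\0')
            break;
    }
    return 0;
}

/* Release every token and reset the list to empty. */
void token_list_free(struct token_list *list)
{
    size_t i;

    for (i = 0; i < list->count; i++)
        free(list->items[i]);
    free(list->items);
    list->items = NULL;
    list->count = list->capacity = 0;
}
```

Starting from a zero-initialised list (struct token_list list = {0};), a later pass can walk list.items without caring about whitespace or comments at all. Note that this sketch throws away line boundaries; if the grammar needs them, a newline token could be stored as well.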

**Barney McGrew** · 01-03-2013

Well, what you asked for in your first post is done fairly simply with scanf, like so:

Code:

char token[64 + 1];

while (scanf("%64s", token) == 1) {
    /* handle token */
}

The downside of this method is that you can't discriminate between the newline and space characters, so you won't be able to handle single line comments. I can think of two good ways to get around this:
• Implement a function that behaves similarly to scanf's 's' format specifier but also indicates the value of the whitespace characters it eats, then design a finite state machine based on the class of the token you've read.
• Read an entire line using fgets and break up your input using sscanf, perhaps storing all the useful information of the line in a struct.

If you decide to try the first method, the function you'd use for reading a token could be as simple as this:

Code:

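#include <ctype.h>
#include <stdio.h>

/* Reads one token from stdin into s, which has room for n bytes (n >= 1).
   Returns the whitespace character that ended the token, or EOF. */
int readtoken(char *s, size_t n)
{
    int c;

    while ((c = getchar()) != EOF && !isspace(c))
        if (n > 1) {    /* keep one byte for the terminator */
            *s++ = c;
            n--;
        }
    *s = '\0';
    return c;
}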
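For the second option in the list above (fgets plus sscanf), a rough sketch could look like the following. The struct layout, the field sizes, the function names and the line shape "[label:] [op [arg]] [; comment]" with single-word labels and at most one operand are all assumptions made for illustration, not something fixed by the thread.

```c
#include <stdio.h>
#include <string.h>

/* Useful parts of one source line (hypothetical layout, arbitrary sizes).
 * A field that is absent from the line is left as an empty string. */
struct asm_line {
    char label[32];
    char op[16];
    char arg[32];
};

/* Parse one line of the form "[label:] [op [arg]] [; comment]".
 * Returns 0 on success, -1 if the label is empty or has several words,
 * or if anything follows the operand. */
int parse_line(const char *text, struct asm_line *out)
{
    char buf[256];
    char extra[2];
    char *rest, *colon;

    /* Work on a copy so the comment and newline can be cut off. */
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    buf[strcspn(buf, ";\n")] = '\0';

    out->label[0] = out->op[0] = out->arg[0] = '\0';
    rest = buf;
    if ((colon = strchr(buf, ':')) != NULL) {
        *colon = '\0';
        if (sscanf(buf, "%31s %1s", out->label, extra) != 1)
            return -1;
        rest = colon + 1;
    }
    /* sscanf returns EOF for a blank remainder, which is fine here. */
    if (sscanf(rest, "%15s %31s %1s", out->op, out->arg, extra) > 2)
        return -1;
    return 0;
}

/* Read `in` line by line with fgets() and count the lines that carry an
 * instruction. Returns -1 at the first line that fails to parse.
 * Lines longer than the buffer are seen as several lines by this sketch. */
int count_instructions(FILE *in)
{
    char line[256];
    struct asm_line parsed;
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (parse_line(line, &parsed) != 0)
            return -1;
        if (parsed.op[0] != '\0')
            count++;
    }
    return count;
}
```

Cutting the line at the first ';' before looking for ':' means a colon inside a comment is never mistaken for a label.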

**c99tutorial** · 01-04-2013

Originally Posted by RagingOrc

An example of a command parsed could be

ldc 5

a more complicated command could be something like...

Load Accum: ldc 5 ; a label and an instruction

You could use a regular expression engine. First write out some sample commandlines that might be interpreted by your assembler:

Code:

label1: mov ax 4 ; here is a comment
mov bx 123;here is another comment
a:push 456 ;...

Generalize each possible commandline with the different combinations of components that might occur:
(1) LABEL: CMD ARG1 ARG2 ;COMMENT
(2) CMD ARG1 ARG2 ;COMMENT
(3) LABEL: CMD ARG1 ;COMMENT
...
(N)
If a commandline matches any of these N cases, it is a valid commandline. Otherwise it is invalid.

Translate each case 1..N into a regular expression and use capture groups to pick out each component you are interested in. Here is a pseudocode outline:

Code:

// (Case 1) LABEL: CMD ARG1 ARG2 ;COMMENT
if Commandline matches regexp ^\s*(\w+):\s*(\w+)\s+(\w+)\s+(\w+)\s*;(.*)$
    Label := capture[1]
    Cmd := capture[2]
    Arg1 := capture[3]
    Arg2 := capture[4]
    Comment := capture[5]
else 
// (Case 2) CMD ARG1 ARG2 ;COMMENT
if Commandline matches regexp ^\s*(\w+)\s+(\w+)\s+(\w+)\s*;(.*)$
    Label := NONE
    Cmd := capture[1]
    Arg1 := capture[2]
    Arg2 := capture[3]
    Comment := capture[4]
else
// ...
else
    print "Invalid commandline: ", Commandline
end if

There are several regex libraries which provide these matching abilities and are callable from C. An example is PCRE.
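To make the outline a bit more concrete, here is a sketch of cases 1 and 2 in C. It uses the POSIX regcomp()/regexec() interface from <regex.h> rather than PCRE, so \w and \s are spelled with bracket expressions. The struct, the field sizes and the function names are hypothetical.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define WORD  "([[:alnum:]_]+)"   /* stands in for (\w+) */
#define SPACE "[[:space:]]"       /* stands in for \s */

/* Fields picked out of one line; unused fields are empty strings. */
struct command {
    char label[32], cmd[32], arg1[32], arg2[32], comment[128];
};

/* Copy one capture group of line into dst (capacity n), truncating if needed. */
static void copy_group(char *dst, size_t n, const char *line, const regmatch_t *m)
{
    size_t len = (size_t)(m->rm_eo - m->rm_so);

    if (len >= n)
        len = n - 1;
    memcpy(dst, line + m->rm_so, len);
    dst[len] = '\0';
}

/* Return 1 if line matches pattern, filling groups[0..ngroups).
 * The pattern is recompiled on every call, which keeps the sketch short
 * but would be worth caching in real code. */
static int match(const char *pattern, const char *line,
                 regmatch_t *groups, size_t ngroups)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    rc = regexec(&re, line, ngroups, groups, 0);
    regfree(&re);
    return rc == 0;
}

/* Try case 1, then case 2. Returns 0 on a match, -1 for an invalid line. */
int parse_command(const char *line, struct command *out)
{
    regmatch_t g[6];

    memset(out, 0, sizeof *out);
    /* (Case 1) LABEL: CMD ARG1 ARG2 ;COMMENT */
    if (match("^" SPACE "*" WORD ":" SPACE "*" WORD SPACE "+" WORD
              SPACE "+" WORD SPACE "*;(.*)$", line, g, 6)) {
        copy_group(out->label, sizeof out->label, line, &g[1]);
        copy_group(out->cmd, sizeof out->cmd, line, &g[2]);
        copy_group(out->arg1, sizeof out->arg1, line, &g[3]);
        copy_group(out->arg2, sizeof out->arg2, line, &g[4]);
        copy_group(out->comment, sizeof out->comment, line, &g[5]);
        return 0;
    }
    /* (Case 2) CMD ARG1 ARG2 ;COMMENT */
    if (match("^" SPACE "*" WORD SPACE "+" WORD SPACE "+" WORD
              SPACE "*;(.*)$", line, g, 5)) {
        copy_group(out->cmd, sizeof out->cmd, line, &g[1]);
        copy_group(out->arg1, sizeof out->arg1, line, &g[2]);
        copy_group(out->arg2, sizeof out->arg2, line, &g[3]);
        copy_group(out->comment, sizeof out->comment, line, &g[4]);
        return 0;
    }
    return -1;
}
```

Only the first two cases are covered, so a line such as the third sample (a label, a command and a single argument) is reported as invalid until its own pattern is added.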
