Thread: How to separate a string which contains white space for different instructions

  1. #1
    Registered User
    Join Date
    Jan 2013
    Posts
    3

    How to separate a string which contains white space for different instructions

    Hey ppl,

    So I'm playing around with C and I fancied creating an assembler as a little side project.

    I'm going to pull in text, in a simple machine-readable format, from files in the directory. I want to separate my instructions with whitespace, i.e. a single space.

    Now I was thinking of scanning the file with fgetc, reading one char at a time until it hits whitespace. That said, I'm having a lot of issues with it: how do I hold on to that part of the text and work with it, how do I progress to the next part after the space, etc.

    Have I got the right idea? How would you do it?

    Thanks

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Well, you are on one possibly right track with respect to parsing the input. Of course, for what you have described, a simpler alternative could be to designate a maximum length for a token, then just scanf your way through the entire input, but then if you want to allow for comments, this might not be so easy to adapt as your character by character read idea.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Jan 2013
    Posts
    3
    Indeed, I would preferably like to add comments, as they are seen in code every day, so it seems logical to include them. I'll also be using multiple lines, just to clarify in case that wasn't obvious.

    So is fgetc the best course of action for my code if I want comments and spaces to separate instructions etc? Any more tips?

    Many thanks

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I wouldn't say best, but it can work. In any case, you need to be clear on what exactly is the format of the input that you are going to parse.

  5. #5
    Registered User
    Join Date
    May 2012
    Posts
    505
    Treating input as a stream and pulling out each token with fgetc() will work. You will also need ungetc() to push back the tokens. For example, you'll want to read all the whitespace until you hit a non-whitespace character, then put it back for the instruction getter to read.
    The snag is that you can't easily extend the program to use multi-character tokens, which is often a more intuitive way of parsing the stream.

    You might want to download a copy of MiniBasic, how to write a script interpreter, if you are serious about creating an assembler.
    I'm the author of MiniBasic: How to write a script interpreter and Basic Algorithms
    Visit my website for lots of associated C programming resources.
    https://github.com/MalcolmMcLean


  6. #6
    Registered User
    Join Date
    Jan 2013
    Posts
    3
    An example of a command parsed could be

    ldc 5

    a more complicated command could be something like...

    Load Accum: ldc 5 ; a label and an instruction

    The first part being the label, the second the instruction and the third a comment.

    The first and third would be ignored etc.

    I'm a little confused by what you mean, Malcolm. Could you elaborate a bit or show an example?

    Laserlight, I hope that clarifies things.

    I'm still having trouble breaking it up and then using the first section before the break and the section after the break.

    I don't really understand this.

    It's weird, I've written a fair few C programs, but creating this assembler just points out a large gaping hole in my knowledge base. It's good though in a way, as I get to learn more and can build this into my knowledge bank.

    Any help would be very much appreciated

    Many Thanks

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by RagingOrc
    An example of a command parsed could be

    ldc 5

    a more complicated command could be something like...

    Load Accum: ldc 5 ; a label and an instruction

    The first part being the label, the second the instruction and the third a comment.

    The first and third would be ignored etc.

    (...)

    Laserlight, I hope that clarifies things.
    The clarification is not for me; it is for you.

    Beyond examples, it may be helpful to express the syntax in, say, some variant of BNF notation, even though assembly languages tend to have pretty simple syntax.

    Also, note that if you are going to allow for jumps, then it might not be feasible to ignore labels.

    Quote Originally Posted by RagingOrc
    I'm a little confused by what you mean, Malcolm. Could you elaborate a bit or show an example?
    It depends on how you want to do it, so I don't agree that "you will also need ungetc() to push back the tokens", even though that is an approach that I might take myself.

    That said, imagine that you have a piece of code responsible for discarding whitespace. So, it just keeps reading and doing nothing with the characters until it detects a non-whitespace character. At this point, you could have it process the character, but it can be simpler to just put the character back into the input stream (hence the use of ungetc()) then exit this piece of code to return to the main loop that reads in characters and processes them.
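    The whitespace-discarding step described above can be sketched as follows. This is only a minimal illustration: the function name skip_whitespace and the choice to take a FILE * are assumptions, not something from the thread.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: discards whitespace from fp. The first
   non-whitespace character is pushed back with ungetc() so that the
   caller's next fgetc() sees it again.
   Returns that character, or EOF if the input ran out. */
int skip_whitespace(FILE *fp)
{
    int c;

    while ((c = fgetc(fp)) != EOF && isspace(c))
        ;                       /* nothing to do with whitespace */
    if (c != EOF)
        ungetc(c, fp);          /* leave it for the main loop */
    return c;
}
```

    The main loop can call this before reading each token; because the character is put back, the token reader does not need a special case for "first character already consumed".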

    Quote Originally Posted by RagingOrc
    I'm still having trouble breaking it up and then using the first section before the break and the section after the break.
    One approach is to have lexical analysis as a separate step: you first break up the input into a list of tokens (e.g., a dynamic array). In this, I also disagree with the assertion that "you can't easily extend the program to use multi-character tokens". You can do it if you keep track of the current token, and store it in the list when you detect a character that is whitespace or the start of the next token.
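    A rough sketch of that separate lexical-analysis step, reading character by character and storing each finished token in a growable array. The names token_list, token_list_push, tokenize and token_list_free, and the 64-character limit, are made up for illustration; only whitespace is treated as a token boundary here.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TOKEN 64            /* assumed maximum token length */

struct token_list {
    char **items;               /* heap-allocated copy of each token */
    size_t count;
    size_t capacity;
};

/* Appends a copy of text to the list. Returns 0 on success, -1 on failure. */
static int token_list_push(struct token_list *list, const char *text)
{
    char *copy;

    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 8;
        char **grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        list->items = grown;
        list->capacity = new_cap;
    }
    copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    list->items[list->count++] = copy;
    return 0;
}

/* Reads fp one character at a time, keeping track of the current token,
   and stores it in list whenever whitespace or end of input is reached.
   Returns 0 on success, -1 on allocation failure. */
int tokenize(FILE *fp, struct token_list *list)
{
    char current[MAX_TOKEN + 1];
    size_t len = 0;
    int c;

    for (;;) {
        c = fgetc(fp);
        if (c == EOF || isspace(c)) {
            if (len > 0) {              /* a token just ended */
                current[len] = '\0';
                if (token_list_push(list, current) != 0)
                    return -1;
                len = 0;
            }
            if (c == EOF)
                return 0;
        } else if (len < MAX_TOKEN) {
            current[len++] = (char)c;   /* overlong tokens are truncated */
        }
    }
}

/* Releases everything owned by the list and resets it to empty. */
void token_list_free(struct token_list *list)
{
    size_t i;

    for (i = 0; i < list->count; i++)
        free(list->items[i]);
    free(list->items);
    list->items = NULL;
    list->count = list->capacity = 0;
}
```

    Start with a zero-initialised list (struct token_list list = {0};), call tokenize(), then walk list.items[0] to list.items[list.count - 1] in a second pass. Handling ':' or ';' as the start of a new token would be an extra branch next to the isspace() test.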

  8. #8
    Stoned Witch Barney McGrew's Avatar
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    Well, what you asked for in your first post is done fairly simply with scanf, like so:
    Code:
    char token[64 + 1];
    
    while (scanf("%64s", token) == 1) {
        /* handle token */
    }
    The downside of this method is that you can't discriminate between the newline and space characters, so you won't be able to handle single line comments. I can think of two good ways to get around this:
    • Implement a function that behaves similarly to scanf's 's' format specifier but also indicates the value of the whitespace characters it eats, then design a finite state machine based on the class of the token you've read.
    • Read an entire line using fgets and break up your input using sscanf, perhaps storing all the useful information of the line in a struct.

    If you decide to try the first method, the function you'd use for reading a token could be as simple as this:

    Code:
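    #include <ctype.h>
    #include <stdio.h>
    
    /* Reads characters from stdin into s until whitespace or EOF.
       Stores at most n characters, so s needs room for n + 1 bytes.
       Returns the whitespace character that ended the token, or EOF. */
    int readtoken(char *s, size_t n)
    {
        int c;
    
        while ((c = getchar()) != EOF && !isspace(c))
            if (n) {
                *s++ = c;
                n--;
            }
        *s = '\0';
        return c;
    }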
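    For the second method (fgets plus sscanf), here is a minimal sketch of the per-line step. It assumes the line layout from the example earlier in the thread: an optional label ending in ':', an instruction, an optional operand, and an optional comment starting with ';'. The struct asm_line, the function parse_line and the 32-character field limit are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 32            /* assumed limit; must match the %32 widths */

struct asm_line {
    char label[FIELD_MAX + 1];      /* empty if the line has no label */
    char mnemonic[FIELD_MAX + 1];   /* empty if the line has no instruction */
    char operand[FIELD_MAX + 1];    /* empty if the instruction has none */
};

/* Splits one source line into its parts. The line is modified in place.
   Returns the number of fields found after the label (0, 1 or 2). */
int parse_line(char *line, struct asm_line *out)
{
    char *rest = line;
    char *p;
    int n;

    out->label[0] = out->mnemonic[0] = out->operand[0] = '\0';

    if ((p = strchr(line, ';')) != NULL)
        *p = '\0';                          /* drop the comment */

    if ((p = strchr(line, ':')) != NULL) {  /* text before ':' is the label */
        *p = '\0';
        sscanf(line, " %32[^\n]", out->label);
        rest = p + 1;
    }

    n = sscanf(rest, "%32s %32s", out->mnemonic, out->operand);
    return n < 0 ? 0 : n;                   /* sscanf gives EOF on a blank line */
}
```

    The caller would read each line with fgets(buf, sizeof buf, fp) and pass buf to parse_line. Cutting the comment off first means a ':' inside a comment is never mistaken for a label.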

  9. #9
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Quote Originally Posted by RagingOrc
    An example of a command parsed could be

    ldc 5

    a more complicated command could be something like...

    Load Accum: ldc 5 ; a label and an instruction

    You could use a regular expression engine. First write out some sample commandlines that might be interpreted by your assembler:
    Code:
    label1: mov ax 4 ; here is a comment
    mov bx 123;here is another comment
    a:push 456 ;...
    Generalize each possible commandline with the different combinations of components that might occur:
    (1) LABEL: CMD ARG1 ARG2 ;COMMENT
    (2) CMD ARG1 ARG2 ;COMMENT
    (3) LABEL: CMD ARG1 ;COMMENT
    ...
    (N)
    If a commandline matches any of these N cases, it is a valid commandline. Otherwise it is invalid.

    Translate each case 1..N into a regular expression and use capture groups to pick out each component you are interested in. Here is a pseudocode outline:

    Code:
    // (Case 1) LABEL: CMD ARG1 ARG2 ;COMMENT
    if Commandline matches regexp ^\s*(\w+):\s*(\w+)\s+(\w+)\s+(\w+)\s*;(.*)$
        Label := capture[1]
        Cmd := capture[2]
        Arg1 := capture[3]
        Arg2 := capture[4]
        Comment := capture[5]
    else 
    // (Case 2) CMD ARG1 ARG2 ;COMMENT
    if Commandline matches regexp ^\s*(\w+)\s+(\w+)\s+(\w+)\s*;(.*)$
        Label := NONE
        Cmd := capture[1]
        Arg1 := capture[2]
        Arg2 := capture[3]
        Comment := capture[4]
    else
    // ...
    else
        print "Invalid commandline: ", Commandline
    end if
    There are several regex libraries that provide these matching abilities and are callable from C. An example is PCRE.
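    As a rough illustration of case 1 in real C, here is a sketch that uses the POSIX <regex.h> interface rather than PCRE (so it assumes a POSIX system). POSIX extended regular expressions do not guarantee \w and \s, so bracket classes stand in for them. The function names match_case1 and copy_group and the 32-byte field size are made up for this example.

```c
#include <regex.h>
#include <string.h>

#define W "[[:alnum:]_]+"       /* stands in for \w+ */
#define S "[[:space:]]"         /* stands in for \s  */

/* Case 1: LABEL: CMD ARG1 ARG2 ;COMMENT */
static const char case1[] =
    "^" S "*(" W "):" S "*(" W ")" S "+(" W ")" S "+(" W ")" S "*;(.*)$";

/* Copies capture group m of line into dst (capacity n), truncating if needed. */
static void copy_group(char *dst, size_t n, const char *line, regmatch_t m)
{
    size_t len = (size_t)(m.rm_eo - m.rm_so);

    if (len >= n)
        len = n - 1;
    memcpy(dst, line + m.rm_so, len);
    dst[len] = '\0';
}

/* Returns 1 and fills the four fields if line matches case 1, else 0. */
int match_case1(const char *line, char label[32], char cmd[32],
                char arg1[32], char arg2[32])
{
    regex_t re;
    regmatch_t m[6];            /* whole match plus 5 capture groups */
    int ok;

    if (regcomp(&re, case1, REG_EXTENDED) != 0)
        return 0;
    ok = regexec(&re, line, 6, m, 0) == 0;
    if (ok) {
        copy_group(label, 32, line, m[1]);
        copy_group(cmd, 32, line, m[2]);
        copy_group(arg1, 32, line, m[3]);
        copy_group(arg2, 32, line, m[4]);
    }
    regfree(&re);
    return ok;
}
```

    A real assembler would compile each pattern once up front instead of on every call, and would try the remaining cases in turn when this one fails.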
