Simple C Parser (-> XHTML)

**SMurf** · 02-13-2013

Hello,

I've knocked up a rough C parser for the purpose of colourizing code into XHTML/CSS, which makes me feel fancy.

However, it doesn't quite handle comments properly. I can't quite work out how to deal with the slashes. Plus I'm sure there are other places that it slips up that don't feature in my simple tests, so here you go:-

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FLAG_OUTPUT	1
#define FLAG_EVEN	2
#define NUMKEYWORDS	32

typedef struct {
    size_t nLen;
    char *sz;
} t_string;

const t_string c_keywords[NUMKEYWORDS] = { 4, "auto", 5, "break", 4, "case", 4, "char", 5, "const", 8, "continue", 7, "default", 2, "do",
                                        6, "double", 4, "else", 4, "enum", 6, "extern", 5, "float", 3, "for", 4, "goto", 2, "if",
                                        3, "int", 4, "long", 8, "register", 6, "return", 5, "short", 6, "signed", 6, "sizeof", 6, "static",
                                        6, "struct", 6, "switch", 7, "typedef", 5, "union", 8, "unsigned", 4, "void", 8, "volatile", 5, "while" };

int main(int argc, char **argv)
{
    char cEscape, cLast, *pBuffer, *pEnd, *pPtr;
    int c;
    unsigned int fFlags, n;

    pBuffer = malloc(BUFSIZ);
    if (!pBuffer)
        return -1;

    pPtr = pBuffer;
    pEnd = pBuffer + ((BUFSIZ - 1) * sizeof(*pBuffer));
    fFlags = FLAG_EVEN;
    cEscape = '\0';
    cLast = '\0';
    printf("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\r\n    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\r\n<head>\r\n<link rel=\"stylesheet\" type=\"text/css\" href=\"source.css\"/>\r\n<title>Source code</title>\r\n</head>\r\n<body>\r\n<div>\r\n");
    while ((c = getchar()) != EOF)
    {
        // HACK: if we ate a comment escape char, but haven't started a comment, put it back!
        if ((cLast == '/' && c != '/' && c != '*') || (cLast == '*' && c != '/'))
            putchar(cLast);

        if (strchr("\t\n\r !\"&()*+,-./:;<=>?@[\\]^`{|}~", c) || cEscape != '\0')
        {
            if (pPtr > pBuffer)
            {
                *pPtr = '\0';
                for (n=0;n<NUMKEYWORDS;n++)
                {
                    if (!strncmp(pBuffer, c_keywords[n].sz, (c_keywords[n].nLen >= (size_t)(pPtr - pBuffer)) ? c_keywords[n].nLen : (pPtr - pBuffer)))
                    {
                        printf("<span class=\"keyword\">%s</span>", pBuffer);
                        break;
                    }

                }

                if (n == NUMKEYWORDS)
                    printf("%s", pBuffer);

                pPtr = pBuffer;
                fFlags |= FLAG_OUTPUT;
            }

            if (c == '\t')
                printf("&nbsp;&nbsp;&nbsp;&nbsp;");
            else if (c == '\n')
            {
                if (!(fFlags & FLAG_OUTPUT))
                    printf("&nbsp;");

                printf((fFlags & FLAG_EVEN) ? "</div>\r\n<div class=\"evenline\">" : "</div>\r\n<div>");
                fFlags &= ~FLAG_OUTPUT;
                fFlags ^= FLAG_EVEN;
                if (cEscape == '/')
                    cEscape = '\0';

            }
            else if (c == ' ' && cLast == c)
                printf("&nbsp;");
            else if (c == '&')
                printf("&amp;");
            else if ((c == '*' || c == '/'))
            {
                if (cLast == '/' && cEscape == '\0')
                {
                    printf("<span class=\"comment\">/");
                    cEscape = (char)c;
                }
                else if (c == '/' && cLast == '*' && cEscape != '\0')
                {
                    printf("*/</span>");
                    cEscape = '\0';
                }

            }
            else if (c == '<')
                printf("&lt;");
            else if (c == '>')
                printf("&gt;");
            else if (c >= ' ')
            {
                if (c == '\"' && cLast != '\\')
                    cEscape = (cEscape == '\0') ? (char)c : '\0';

                putchar(c);
            }

            if (c != '\n')
                fFlags |= FLAG_OUTPUT;

        }
        else if (pPtr < pEnd)
            *(pPtr++) = (char)c;

        cLast = (char)c;
    }

    if (cEscape == '/' || cEscape == '*')
        printf("</span>");

    printf("</div>\r\n</body>\r\n</html>");
    free(pBuffer);
    return 0;
}

Oh, and in case you run it and don't understand why it hangs: it's reading from stdin.

**anduril462** · 02-13-2013

Code:

// HACK: if we ate a comment escape char, but haven't started a comment, put it back!
if ((cLast == '/' && c != '/' && c != '*') || (cLast == '*' && c != '/'))
    putchar(cLast);

I'm pretty sure that doesn't do what you want. Read the documentation for putchar: it prints to stdout. You probably want ungetc. Note that only one character of pushback is guaranteed, so if you need to put back more, you need to develop your own little LIFO stack for putting back chars.
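For reference, a minimal sketch of what pushback with ungetc looks like. The helper name peek_char is made up for illustration and is not part of the original program; it leans on the single character of pushback that the standard guarantees.

```c
#include <stdio.h>

/* Hypothetical helper: return the next character of a stream without
   consuming it. Only one character of pushback is guaranteed, so this
   must not be called twice in a row expecting two characters back. */
static int peek_char(FILE *fp)
{
    int c = getc(fp);

    if (c != EOF)
        ungetc(c, fp);   /* the next getc(fp) returns c again */
    return c;
}
```

With this, a loop that has just read a '/' could call peek_char(stdin) to see whether a '/' or '*' follows before deciding what to print.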

Also, I would not take this general approach of mashing everything into one loop with a bunch of awkward, nested if/else statements and the like. I would recommend a state machine here:

Code:

while ((c = getchar()) != EOF) {
    if (c == '\n') {
        // toggle odd/even row highlighting
    }
    switch (state) {
        case NORMAL:
            if (c == '/')
                state = START_COMMENT;
            else if (c == '"')
                state = IN_STRING;
            ....
            else  // not going into some special state
                print out char normally
            break;
        case START_COMMENT:  // prev char was a '/'
            if (c == '/')
                state = LINE_COMMENT;
            else if (c == '*')
                state = BLOCK_COMMENT;
            ....
            break;
        case LINE_COMMENT:
            if (c == '\n')  // reached end of line, back to normal processing
                state = NORMAL;
            break;
        case BLOCK_COMMENT:  // inside a block comment
            // read until */ -- you may want a separate function to handle this
            break;
        case IN_STRING:
            // read until closing ", watch out for escaped quotes like \" and maybe other escaped chars like \\, \", \t, \n, \042, etc
            break;
        ...
    }
}

You can break the one big loop and case into several smaller state machines if need be.
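A fuller version of that outline might look like the sketch below. It is only one way to fill in the gaps, under some assumptions: the extra states SLASH, BLOCK_STAR and STRING_ESC, the helper put_escaped, and the "string" CSS class are all invented here, and keywords, character literals, and the odd/even line divs are left out to keep it short.

```c
#include <stdio.h>

enum state { NORMAL, SLASH, LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR,
             IN_STRING, STRING_ESC };

/* Write one character, escaping the three that matter in XHTML text. */
static void put_escaped(int c, FILE *out)
{
    switch (c) {
    case '&': fputs("&amp;", out); break;
    case '<': fputs("&lt;", out);  break;
    case '>': fputs("&gt;", out);  break;
    default:  putc(c, out);        break;
    }
}

/* Copy in to out, wrapping comments and string literals in spans. */
static void highlight(FILE *in, FILE *out)
{
    enum state st = NORMAL;
    int c;

    while ((c = getc(in)) != EOF) {
        switch (st) {
        case NORMAL:
            if (c == '/') {
                st = SLASH;                 /* hold the slash for now */
            } else if (c == '"') {
                fputs("<span class=\"string\">\"", out);
                st = IN_STRING;
            } else {
                put_escaped(c, out);
            }
            break;
        case SLASH:                         /* previous char was '/' */
            if (c == '/' || c == '*') {
                fprintf(out, "<span class=\"comment\">/%c", c);
                st = (c == '/') ? LINE_COMMENT : BLOCK_COMMENT;
            } else {
                putc('/', out);             /* just a division sign */
                ungetc(c, in);              /* reprocess c in NORMAL */
                st = NORMAL;
            }
            break;
        case LINE_COMMENT:
            if (c == '\n') {                /* newline ends the comment */
                fputs("</span>\n", out);
                st = NORMAL;
            } else {
                put_escaped(c, out);
            }
            break;
        case BLOCK_COMMENT:
            put_escaped(c, out);
            if (c == '*')
                st = BLOCK_STAR;
            break;
        case BLOCK_STAR:                    /* previous char was '*' */
            put_escaped(c, out);
            if (c == '/') {
                fputs("</span>", out);
                st = NORMAL;
            } else if (c != '*') {
                st = BLOCK_COMMENT;
            }
            break;
        case IN_STRING:
            put_escaped(c, out);
            if (c == '\\') {
                st = STRING_ESC;            /* next char is escaped */
            } else if (c == '"') {
                fputs("</span>", out);
                st = NORMAL;
            }
            break;
        case STRING_ESC:
            put_escaped(c, out);
            st = IN_STRING;
            break;
        }
    }

    if (st == SLASH)
        putc('/', out);                     /* input ended on a lone slash */
    else if (st != NORMAL)
        fputs("</span>", out);              /* close an unterminated span */
}
```

Calling highlight(stdin, stdout) between the page header and footer would slot this into the existing program. Because a held slash is only emitted once the following character is known, there is no need for the "put it back" hack in the main loop.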

**SMurf** · 02-13-2013

Originally Posted by anduril462

I'm pretty sure that doesn't do what you want.

I'm pretty sure it does! When I write "put it back", I mean "output in advance of the current character".
That part of the loop is focused on echoing individual characters and escaping HTML entities. In the case of comment delimiters, it "eats" one character and does something dependent on the next character. If that next character isn't what was expected (it can't see it in advance), it spits the eaten character back out.

switch statement does look pretty sweet though.

**anduril462** · 02-13-2013

Ahh, yeah, I was a bit confused about the whole "put it back" thing. In any case, I would strongly recommend the state machine approach: it's much more robust, and easier to maintain and modify. It also makes it easier to handle the C keywords. You can have a state like IN_WORD that handles any variable or function names, keywords, etc. If that word matches your keyword list, then it's a keyword; otherwise it's some sort of identifier. Note, I don't see you handling pre-processor code.

You could even extend it to support standard C functions, variables and constants like printf, scanf, stdout, NULL, etc., and separate those from user functions/variables. Then you can potentially highlight all three in different colors. Something at that level would probably require (or be easier with) a "second level" of state machine, i.e. if you're in NORMAL and you get a valid identifier/keyword char, then you go into a function that reads until the end of the word (until space or punctuation or EOF), returns a token type, say TOK_KEYWORD, TOK_STD_IDENTIFIER or TOK_USER_IDENTIFIER, and stores the keyword/identifier in a buffer you supply to the function. Then you use the token type to pick a highlight color, and print out the contents of the buffer.
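As a rough sketch of that second-level function: the token names below come from the description above, while read_word, in_list, and the two deliberately short word lists are made up for illustration. A real tool would need all 32 keywords and a much longer list of library names. This version reads from a string rather than stdin to keep it easy to test.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_KEYWORD, TOK_STD_IDENTIFIER, TOK_USER_IDENTIFIER };

/* Sample lists only, not complete. */
static const char *const keywords[]  = { "char", "if", "int", "return", "while" };
static const char *const std_names[] = { "NULL", "printf", "scanf", "stdout" };

/* Return 1 if word appears in list, 0 otherwise. */
static int in_list(const char *word, const char *const *list, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (strcmp(word, list[i]) == 0)
            return 1;
    return 0;
}

/* Copy the word starting at *src into buf (at most cap - 1 characters,
   cap must be at least 1), advance *src past the whole word, and
   return the word's token type. */
static enum token read_word(const char **src, char *buf, size_t cap)
{
    const char *p = *src;
    size_t n = 0;

    while (isalnum((unsigned char)*p) || *p == '_') {
        if (n + 1 < cap)
            buf[n++] = *p;       /* overlong words are truncated */
        p++;
    }
    buf[n] = '\0';
    *src = p;

    if (in_list(buf, keywords, sizeof keywords / sizeof keywords[0]))
        return TOK_KEYWORD;
    if (in_list(buf, std_names, sizeof std_names / sizeof std_names[0]))
        return TOK_STD_IDENTIFIER;
    return TOK_USER_IDENTIFIER;
}
```

The caller would then map the returned token type to a CSS class (for example "keyword" for TOK_KEYWORD) and print the buffer inside a span of that class.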

**phantomotap** · 02-13-2013

However, it doesn't quite handle comments properly.

O_o

This seems like a silly thing to do in C. You are, I assume, planning on serving this to a browser, either locally or over the web, so why not just do the highlighting in Javascript? (Or whatever you may use to serve dynamic pages?) Even if you'd rather bake the source with HTML markup in advance, you'd be better served producing a model that can be dropped into a `code' or `div' in any page so you can get more mileage with view templates.

Also, are you really, really sure you want to build XHTML? Have you read any of the millions of pages that discuss why you should not be using XHTML?

*shrug*

Anyway, a casual glance says that you aren't even trying to catch the terminator for a "//" style comment ('\n').

Start with breaking things up as anduril462 has suggested. Once you have a state for "parsing a comment" it will be trivial to shift a state when you find the terminator.

Soma

**SMurf** · 02-13-2013

I'll admit it's a bit weird. I've been offered virtually unlimited space and bandwidth on a web server. Caveat: no PHP.

I would prefer not to use Javascript, to allow web crawlers to read the page without confusion.

What's wrong with XHTML? Does it cause fires? Searches just seem to turn up weird Americans/Indians/other (photo in the corner) who seem to be paid money(?) to churn out Wikipedia articles in their own words. I just thought it was what all the cool kids use these days. Could switch to HTML 4.01 I suppose, shorter code.

Your glance was waaaaaay too casual; under the "if" for '\n':-

Code:

if (cEscape == '/')
    cEscape = '\0'; // not escaped any more!  Woo!

**phantomotap** · 02-13-2013

I would prefer not to use Javascript, to allow web crawlers to read the page without confusion.

O_o

This has nothing to do with spiders.

Sure, a spider isn't likely going to run Javascript.

However, a spider also isn't going to care about a few extra bits of markup or CSS.

In other words, a spider will see the content, the relevant bits of source, in any event; the Javascript would just do styling.

What's wrong with XHTML?

I've just checked. Those millions of pages are still available.

I just thought it was what all the cool kids use these days.

I don't know about the cool kids, but people who care about their pages rendering correctly and consistently in a wide variety of browsers don't use XHTML.

Your glance was waaaaaay too casual; under the "if" for '\n':

O_o

Let me know when you find the problem I was referencing. I'll wait.

Soma
