Could this be implemented a simpler way?

**TropicalStarfis** · 08-05-2012

Working on a simple C program to process exported .html bookmark files. Specifically to eliminate everything but certain tags and their contents.

I'm actually pretty pleased with what I've come up with so far as it does what it's intended to do.

However, in the spirit of writing efficient code, I'm wondering if this is the best way to go about the business.

Specifically I'm focusing on 3 particular html tags and the code just looks kind of messy despite seeming to work with out any problems.

Code:

void processtags_n_call(void)
{
    const char new_line=10;

    /*various counters with different functions*/
    int href_cntr1=0,/*char matching confirmation counter for <A HREF="HTTP...> tags*/
        h3_cntr1=0,/*char matching confirmation counter for <H3...> tags*/
        dd_cntr1=0;/*char matching confirmation counter for <DD...> tags*/

    while ((cur_char=getc(input_html))!=EOF)/*main loop for filtering out relevant tags for processing*/
    {
        if ((cur_char=='<' && href_cntr1==0) ||
            (cur_char=='<' && h3_cntr1==0)   ||
            (cur_char=='<' && dd_cntr1==0))/*looking for tags matching requirements for conversion, if found, convert tags and scrape*/
        {
            href_cntr1++;
            h3_cntr1++;
            dd_cntr1++;
            continue;
        }
        else if ((cur_char=='A' && href_cntr1==1) ||
                 (cur_char=='H' && h3_cntr1==1)   ||
                 (cur_char=='D' && dd_cntr1==1))
        {
            if (cur_char=='A')
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='H')
            {
                h3_cntr1++;
                href_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='D')
            {
                dd_cntr1++;
                href_cntr1=h3_cntr1=0;
            }
            continue;
        }
        else if ((cur_char==' ' && href_cntr1==2) ||
                 (cur_char=='3' && h3_cntr1==2)   ||
                 (cur_char=='D' && dd_cntr1==2))
        {
            if (cur_char==' ')
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='3')
            {
                h3_cntr1++;
                href_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='D')
            {
                dd_cntr1++;
                href_cntr1=h3_cntr1=0;
            }
            continue;
        }
        else if ((cur_char=='H' && href_cntr1==3) ||
                 (h3_cntr1==3)                    ||
                 (dd_cntr1==3))
        {
            if (h3_cntr1==3)/*if true we have a <H3> tag to process and scrape*/
            {
                scrape_h3();/*make call to the specific scraper and let it do it's thing*/

                h3_cntr1=href_cntr1=dd_cntr1=0;/*zero out our counters*/

                putc(new_line, output_html);/*go to the next line*/

                continue;
            }

            else if (dd_cntr1==3)/*If true we have a <DD> tag to process and scrape*/
            {
                scrape_dd();
                href_cntr1=h3_cntr1=dd_cntr1=0;
                putc(new_line, output_html);

                continue;
            }

            else if (href_cntr1==3)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;

                continue;
            }
        }
        else if (cur_char=='R' && href_cntr1==4)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='E' && href_cntr1==5)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='F' && href_cntr1==6)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='=' && href_cntr1==7)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='"' && href_cntr1==8)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='h' && href_cntr1==9)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='t' && href_cntr1==10)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='t' && href_cntr1==11)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='p' && href_cntr1==12)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char==':' && href_cntr1==13)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='/' && href_cntr1==14)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (cur_char=='/' && href_cntr1==15)
        {
            href_cntr1++;
            h3_cntr1=dd_cntr1=0;

            continue;
        }
        else if (href_cntr1==16) /*If true we have a <a href="http://...> instance to process and scrape*/
        {
            scrape_href();
            href_cntr1=h3_cntr1=dd_cntr1=0;
            putc(new_line, output_html);

            continue;
        }
        else
        {
            href_cntr1=h3_cntr1=dd_cntr1=0;
        }
    }
}

Any input is valued and appreciated.

**Click_here** · 08-05-2012

Code:

else if ((cur_char=='A' && href_cntr1==1) ||
                 (cur_char=='H' && h3_cntr1==1)   ||
                 (cur_char=='D' && dd_cntr1==1))
        {
            if (cur_char=='A')
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='H')
            {
                h3_cntr1++;
                href_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='D')
            {
                dd_cntr1++;
                href_cntr1=h3_cntr1=0;
            }
            continue;
        }

You could just say

Code:

else if (( href_cntr1==1) ||
           ( h3_cntr1==1)   ||
            (dd_cntr1==1))
        {
            if (cur_char=='A')
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='H')
            {
                h3_cntr1++;
                href_cntr1=dd_cntr1=0;
            }
            else if (cur_char=='D')
            {
                dd_cntr1++;
                href_cntr1=h3_cntr1=0;
            }
            continue;
        }

and I suppose you could turn the cur_char into a switch statement

**Adak** · 08-05-2012

First, congrats on your program - you've broken down the problem quite well it seems. Second, your intuition that there is a simpler way to code this, is also my own. If you'll post up an example problem, we can take a look at it, from another perspective entirely. I wouldn't try to improve your code, because it needs a completely different approach to the problem.

Give a sample you know the correct answer to already, and let's see what can be done.

**oogabooga** · 08-06-2012

The same char-by-char algorithm can be made quite a bit shorter (I haven't tested this code) :

Code:

void processtags_n_call(void) {
    int ch;
    char *http = "A HREF=\"http://";
    char *p = NULL;
    enum { TAG_NONE, TAG_SOME, TAG_A, TAG_H3, TAG_DD } tag = NONE;

    while ((ch = getc(input_html)) != EOF) {
        switch (tag) {
        case TAG_NONE:
            if (ch == '<')
                tag = TAG_SOME;
            break;
        case TAG_SOME:
            switch (ch) {
            case 'A': tag = TAG_A; p = http + 1; break;
            case 'H': tag = TAG_H3; break;
            case 'D': tag = TAG_DD; break;
            default:  tag = TAG_NONE;
            }
            break;
        case TAG_DD:
            if (ch == 'D') {
                scrape_dd();
                putc('\n', output_html);
            }
            tag = TAG_NONE;
            break;
        case TAG_H3:
            if (ch == '3') {
                scrape_h3();
                putc('\n', output_html);
            }
            tag = TAG_NONE;
            break;
        case TAG_A:
            if (ch == *p++) {
                if (*p == 0) {
                    scrape_href();
                    putc('\n', output_html);
                    tag = TAG_NONE;
                }
            }
            else
                tag = TAG_NONE;
            break;
        }
    }
}

But overall this is a terrible algorithm since it has to be hand-crafted to the data. Case TAG_A above points the way out of this. You could represent all the tags you're searching for as strings in an array and follow along in them until you hit the nul, meaning you've found that string.

An alternative is to scan the file token-by-token instead of char-by-char. This would give the algorithm a different feel since you'd have whole tokens (A, DD, H3, HREF) to compare. At the expense of writing the tokenizer (which would have to handle quotes properly) the main algorithm would be much simpler.

**Click_here** · 08-06-2012

Ok, so I've had time to have a second look...

I think that you should take advantage of string functions.
*Look for '<'
*Fill up a string buffer while the input is not '>'
*Compare the string with known values using strncmp()

**Adak** · 08-06-2012

^^^ Yes, string functions would be my first choice, also. Use char functions only when no string function will do what you need.

**christop** · 08-06-2012

Assuming the generated HTML is well-formed (not necessarily a safe assumption when it comes to Microsoft), you could try to use an existing XML parser. I have recently used a parser called Expat. This is a lightweight XML parser that scans an XML document from beginning to end, and for each start tag, end tag, and text between tags that it encounters, it calls a custom callback function that you provide.

Edit: oops, hit the Post button too soon. :/

Anyway, for this project you can set up a callback for open tags and look for "a" tags, and then handle the "href" attribute as you wish, and also any text that is contained inside the "a" element.

Then again, expat might be overkill for this project.

Thread: Could this be implemented a simpler way?

Thread Tools

Search Thread

Display

Could this be implemented a simpler way?

Similar Threads

How to make this simpler

virtual serial port- is there simpler way?

Bulky...can you make it simpler

DirectPlay help...or maybe simpler?

A simpler way...