Thread: Could this be implemented a simpler way?

  1. #1
    Registered User
    Join Date
    Aug 2011
    Posts
    4

    Question Could this be implemented a simpler way?

    Working on a simple C program to process exported .html bookmark files. Specifically to eliminate everything but certain tags and their contents.

    I'm actually pretty pleased with what I've come up with so far as it does what it's intended to do.

    However, in the spirit of writing efficient code, I'm wondering if this is the best way to go about the business.

    Specifically I'm focusing on 3 particular html tags and the code just looks kind of messy despite seeming to work with out any problems.

    Code:
    void processtags_n_call(void)
    {
        const char new_line=10;
    
        /*various counters with different functions*/
        int href_cntr1=0,/*char matching confirmation counter for <A HREF="HTTP...> tags*/
            h3_cntr1=0,/*char matching confirmation counter for <H3...> tags*/
            dd_cntr1=0;/*char matching confirmation counter for <DD...> tags*/
    
        while ((cur_char=getc(input_html))!=EOF)/*main loop for filtering out relevant tags for processing*/
        {
            if ((cur_char=='<' && href_cntr1==0) ||
                (cur_char=='<' && h3_cntr1==0)   ||
                (cur_char=='<' && dd_cntr1==0))/*looking for tags matching requirements for conversion, if found, convert tags and scrape*/
            {
                href_cntr1++;
                h3_cntr1++;
                dd_cntr1++;
                continue;
            }
            else if ((cur_char=='A' && href_cntr1==1) ||
                     (cur_char=='H' && h3_cntr1==1)   ||
                     (cur_char=='D' && dd_cntr1==1))
            {
                if (cur_char=='A')
                {
                    href_cntr1++;
                    h3_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='H')
                {
                    h3_cntr1++;
                    href_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='D')
                {
                    dd_cntr1++;
                    href_cntr1=h3_cntr1=0;
                }
                continue;
            }
            else if ((cur_char==' ' && href_cntr1==2) ||
                     (cur_char=='3' && h3_cntr1==2)   ||
                     (cur_char=='D' && dd_cntr1==2))
            {
                if (cur_char==' ')
                {
                    href_cntr1++;
                    h3_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='3')
                {
                    h3_cntr1++;
                    href_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='D')
                {
                    dd_cntr1++;
                    href_cntr1=h3_cntr1=0;
                }
                continue;
            }
            else if ((cur_char=='H' && href_cntr1==3) ||
                     (h3_cntr1==3)                    ||
                     (dd_cntr1==3))
            {
                if (h3_cntr1==3)/*if true we have a <H3> tag to process and scrape*/
                {
                    scrape_h3();/*make call to the specific scraper and let it do it's thing*/
    
                    h3_cntr1=href_cntr1=dd_cntr1=0;/*zero out our counters*/
    
                    putc(new_line, output_html);/*go to the next line*/
    
                    continue;
                }
    
                else if (dd_cntr1==3)/*If true we have a <DD> tag to process and scrape*/
                {
                    scrape_dd();
                    href_cntr1=h3_cntr1=dd_cntr1=0;
                    putc(new_line, output_html);
    
                    continue;
                }
    
                else if (href_cntr1==3)
                {
                    href_cntr1++;
                    h3_cntr1=dd_cntr1=0;
    
                    continue;
                }
            }
            else if (cur_char=='R' && href_cntr1==4)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='E' && href_cntr1==5)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='F' && href_cntr1==6)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='=' && href_cntr1==7)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='"' && href_cntr1==8)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='h' && href_cntr1==9)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='t' && href_cntr1==10)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='t' && href_cntr1==11)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='p' && href_cntr1==12)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char==':' && href_cntr1==13)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='/' && href_cntr1==14)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (cur_char=='/' && href_cntr1==15)
            {
                href_cntr1++;
                h3_cntr1=dd_cntr1=0;
    
                continue;
            }
            else if (href_cntr1==16) /*If true we have a <a href="http://...> instance to process and scrape*/
            {
                scrape_href();
                href_cntr1=h3_cntr1=dd_cntr1=0;
                putc(new_line, output_html);
    
                continue;
            }
            else
            {
                href_cntr1=h3_cntr1=dd_cntr1=0;
            }
        }
    }
    Any input is valued and appreciated.

  2. #2
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,907
    Code:
    else if ((cur_char=='A' && href_cntr1==1) ||
                     (cur_char=='H' && h3_cntr1==1)   ||
                     (cur_char=='D' && dd_cntr1==1))
            {
                if (cur_char=='A')
                {
                    href_cntr1++;
                    h3_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='H')
                {
                    h3_cntr1++;
                    href_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='D')
                {
                    dd_cntr1++;
                    href_cntr1=h3_cntr1=0;
                }
                continue;
            }
    


    You could just say

    Code:
    else if (( href_cntr1==1) ||
               ( h3_cntr1==1)   ||
                (dd_cntr1==1))
            {
                if (cur_char=='A')
                {
                    href_cntr1++;
                    h3_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='H')
                {
                    h3_cntr1++;
                    href_cntr1=dd_cntr1=0;
                }
                else if (cur_char=='D')
                {
                    dd_cntr1++;
                    href_cntr1=h3_cntr1=0;
                }
                continue;
            }
    


    and I suppose you could turn the cur_char into a switch statement


  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    First, congrats on your program - you've broken down the problem quite well it seems. Second, your intuition that there is a simpler way to code this, is also my own. If you'll post up an example problem, we can take a look at it, from another perspective entirely. I wouldn't try to improve your code, because it needs a completely different approach to the problem.

    Give a sample you know the correct answer to already, and let's see what can be done.

  4. #4
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    The same char-by-char algorithm can be made quite a bit shorter (I haven't tested this code) :
    Code:
    void processtags_n_call(void) {
        int ch;
        char *http = "A HREF=\"http://";
        char *p = NULL;
        enum { TAG_NONE, TAG_SOME, TAG_A, TAG_H3, TAG_DD } tag = NONE;
    
        while ((ch = getc(input_html)) != EOF) {
            switch (tag) {
            case TAG_NONE:
                if (ch == '<')
                    tag = TAG_SOME;
                break;
            case TAG_SOME:
                switch (ch) {
                case 'A': tag = TAG_A; p = http + 1; break;
                case 'H': tag = TAG_H3; break;
                case 'D': tag = TAG_DD; break;
                default:  tag = TAG_NONE;
                }
                break;
            case TAG_DD:
                if (ch == 'D') {
                    scrape_dd();
                    putc('\n', output_html);
                }
                tag = TAG_NONE;
                break;
            case TAG_H3:
                if (ch == '3') {
                    scrape_h3();
                    putc('\n', output_html);
                }
                tag = TAG_NONE;
                break;
            case TAG_A:
                if (ch == *p++) {
                    if (*p == 0) {
                        scrape_href();
                        putc('\n', output_html);
                        tag = TAG_NONE;
                    }
                }
                else
                    tag = TAG_NONE;
                break;
            }
        }
    }
    But overall this is a terrible algorithm since it has to be hand-crafted to the data. Case TAG_A above points the way out of this. You could represent all the tags you're searching for as strings in an array and follow along in them until you hit the nul, meaning you've found that string.

    An alternative is to scan the file token-by-token instead of char-by-char. This would give the algorithm a different feel since you'd have whole tokens (A, DD, H3, HREF) to compare. At the expense of writing the tokenizer (which would have to handle quotes properly) the main algorithm would be much simpler.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  5. #5
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,907
    Ok, so I've had time to have a second look...

    I think that you should take advantage of string functions.
    *Look for '<'
    *Fill up a string buffer while the input is not '>'
    *Compare the string with known values using strncmp()

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    ^^^ Yes, string functions would be my first choice, also. Use char functions only when no string function will do what you need.

  7. #7
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    948
    Assuming the generated HTML is well-formed (not necessarily a safe assumption when it comes to Microsoft), you could try to use an existing XML parser. I have recently used a parser called Expat. This is a lightweight XML parser that scans an XML document from beginning to end, and for each start tag, end tag, and text between tags that it encounters, it calls a custom callback function that you provide.

    Edit: oops, hit the Post button too soon. :/

    Anyway, for this project you can set up a callback for open tags and look for "a" tags, and then handle the "href" attribute as you wish, and also any text that is contained inside the "a" element.

    Then again, expat might be overkill for this project.
    Last edited by christop; 08-06-2012 at 08:19 AM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to make this simpler
    By ftball22 in forum C++ Programming
    Replies: 6
    Last Post: 09-20-2010, 09:30 AM
  2. virtual serial port- is there simpler way?
    By stefan63 in forum Windows Programming
    Replies: 1
    Last Post: 11-23-2009, 12:17 PM
  3. Bulky...can you make it simpler
    By freddyvorhees in forum C++ Programming
    Replies: 3
    Last Post: 08-11-2008, 08:35 AM
  4. DirectPlay help...or maybe simpler?
    By Trireme in forum Windows Programming
    Replies: 0
    Last Post: 02-14-2003, 06:55 PM
  5. A simpler way...
    By face_master in forum C++ Programming
    Replies: 5
    Last Post: 11-28-2002, 05:11 AM