Thread: Skipping useless lines?

  1. #1
    Registered User
    Join Date
    Oct 2012
    Posts
    158

    Skipping useless lines?

    Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
            <!-- ******************** Page title ******************** -->
            <title>UC Davis: Summer Sessions 2008</title>
            <!-- ******************** End page title ******************** -->
            <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    
    
    <!-- A Modern CSS -->
    <style type="text/css" media="screen">
            @import url(includes/main.css);
    .style16 {      color: #FFFFFF;
            font-weight: bold;
    }
    #page_content #calendar #disclaimerinfo {
            margin-right: 3em;
    }
    /* If class is cancelled, use this style: */
    .cancel {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 125%; font-weight:bold; color:#FF0000;}
    </style>
    
    
    <style type="text/css" media="print">
            @import url(includes/print.css);
    </style>
    <link href="includes/substyles.css" rel="stylesheet" type="text/css" />
    </head>
    
    
    <body>
    
    
    <div id="page_content">
    
    
    <!--
            This will show only in text-only browsers and allows the user
            to skip the primary navigation and go directly to either
            the second-level navigation or the main content of the page
    -->
            <div id="skipnav">Skip directly to: <a href="#content">Main page content</a></div>
    
    
            <!-- ******************** Header ******************** -->
            <div id="header">
            <!-- The actual logo image is the background of the logo_wrapper DIV.
                     This makes the page more compatible with text-only browsers. -->
                    <div id="logo_wrapper">
    
    
    
    
    
     81104           AAS     052      //    <--start of useful line 
     61754           AAS     050
    What method can I use to process the data in the useful line above?
    Someone mentioned using strncmp, but the entire file is very long and doing it manually would take a long time. Basically, I want to skip the HTML junk above and jump to the line with 81104.
    Last edited by tmac619619; 11-24-2012 at 04:03 PM.

  2. #2
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    I don't agree with that someone. How would you do it? What would you compare against? That implies you already know what the useful data looks like.

    The <!-- ... --> are for comments, so I would eat the comments.
    What is inside < > should be eaten too.

    As a result the useful line remains.

    But I do not see any obvious way to avoid reading all the data. In fact, I think it is a must.
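
    A rough sketch of that idea in C, assuming the whole file has already been read into a NUL-terminated string. strip_markup is a hypothetical helper, not a real HTML parser: it ignores '>' characters inside quoted attributes, script blocks, and entities.

```c
#include <stddef.h>
#include <string.h>

/* Copy src to dst, dropping HTML comments (<!-- ... -->) and tags (<...>).
 * dst must have room for strlen(src) + 1 bytes. Returns the output length.
 * Hypothetical helper for illustration only. */
size_t strip_markup(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (strncmp(src, "<!--", 4) == 0) {
            /* Comment: resume after the closing "-->", or stop if unterminated. */
            const char *end = strstr(src + 4, "-->");
            if (end == NULL)
                break;
            src = end + 3;
        } else if (*src == '<') {
            /* Tag: resume after the closing '>', or stop if unterminated. */
            const char *end = strchr(src, '>');
            if (end == NULL)
                break;
            src = end + 1;
        } else {
            /* Ordinary text: keep it. */
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

    Whatever survives (mostly whitespace and the data lines) can then be split into lines and processed.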

  3. #3
    Registered User
    Join Date
    Oct 2012
    Posts
    158
    I'm not sure how I can use the strncmp method for this...

  4. #4
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    That is what I am saying: I think this is not the correct approach.

    You should write code that strips the comments from the data. Your code should also remove whatever is enclosed in < >.
    In other words, you should just skip them, and do something only when you read useful data.

    For example, this code strips the comments from C code (it is not completely correct, but it works for simple cases)
    Code:
    #include <stdio.h>
    
    int main(void)
    {
        int c;    /* Must be int, not char, so that EOF can be represented */
        int next; /* Lookahead: we need the next character to recognise a comment */
    
        while ((c = getchar()) != EOF)
        {
            if (c != '/') /* Cannot start a comment, so just print what we read */
            {
                putchar(c);
                continue;
            }
    
            next = getchar();
            if (next != '*') /* A lone slash, not the start of a comment */
            {
                putchar(c);
                if (next != EOF)
                    ungetc(next, stdin); /* Re-examine it: it may itself be a slash */
                continue;
            }
    
            /* Inside a comment: eat everything up to the closing star-slash.
               Stop at EOF so that an unterminated comment cannot loop forever. */
            c = getchar();
            while (c != EOF)
            {
                next = getchar();
                if (c == '*' && next == '/')
                    break;
                c = next;
            }
        }
    
        return 0;
    }
    Run
    Code:
    Macintosh-c8bcc88e5669-9:hw2 usi$ ./u
    /*jkdhsdjfh*/ int a; /*fjkdsfhjkds*/
     int a;

  5. #5
    Registered User
    Join Date
    Oct 2012
    Posts
    158
    bump.

    STD, your method seems correct, but I think it's a bit complex for this assignment.

  6. #6
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    Since you seem to have a way of determining whether a line is "useful", why not just write a validation function to tell you if the current line is one of them and discard it if not:
    Code:
    while (fgets(line, sizeof line, in) != NULL) {
        if (is_valid_line(line))
            process(line);
    }
    I'm guessing the number of false positives would be minimal with an appropriately written validation routine.
    My best code is written with the delete key.

  7. #7
    Registered User
    Join Date
    Oct 2012
    Posts
    158
    So basically... almost all of the junk lines start with either a comment or <.

    So that's the pattern I should use to discard them?

  8. #8
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    This is the part where you code something to try out the idea and see if it works.

  9. #9
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    Quote Originally Posted by tmac619619 View Post
    So basically... almost all of the junk lines start with either a comment or <.

    So that's the pattern I should use to discard them?
    I'd probably go with searching for the specific format of your "useful" lines (i.e. number TAB code TAB number), but looking for obviously wrong line starters works too. You just need to consider all of the possible starters, and also make sure to handle invalid lines that get through as false positives.
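
    A minimal sketch of such a format check, assuming each useful line is a number, an alphabetic code, and another number, separated by whitespace, with nothing after them. The name is_valid_line comes from the loop earlier in the thread; the field layout and the 15-character limit on the code are assumptions, not something the data guarantees.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 if line looks like "<number> <CODE> <number>", else 0.
 * Hypothetical validator: field meanings and widths are assumed. */
int is_valid_line(const char *line)
{
    int number, section;
    int end = -1;          /* stays -1 unless the scan reaches %n */
    char code[16];
    const char *p;

    /* %d skips leading whitespace and reads decimal, so "052" becomes 52.
     * The space before %n skips trailing whitespace, including the newline. */
    if (sscanf(line, "%d %15s %d %n", &number, code, &section, &end) != 3)
        return 0;

    /* The code field must be purely alphabetic. */
    for (p = code; *p != '\0'; p++)
        if (!isalpha((unsigned char)*p))
            return 0;

    /* Reject lines that have anything left after the third field. */
    return end >= 0 && line[end] == '\0';
}
```

    Markup lines fail on the very first %d, so most of the junk is rejected cheaply; a line that carries a data row plus trailing markup is rejected too, and would need its tags stripped first.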

  10. #10
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Quote Originally Posted by Prelude View Post
    Since you seem to have a way of determining whether a line is "useful", why not just write a validation function to tell you if the current line is one of them and discard it if not:
    Code:
    while (fgets(line, sizeof line, in) != NULL) {
        if (is_valid_line(line))
            process(line);
    }
    I'm guessing the number of false positives would be minimal with an appropriately written validation routine.
    While I am not against this approach, I would like to warn that data like this may exist:
    Code:
    <!-- I am a comment --> <div class="myClass">USEFUL DATA</div>

  11. #11
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    I'd just skip the input until a known marker string, say <div id="logo_wrapper">, and consume all whitespace including newlines that follow.

    One way to approach it is to write a simple state machine that consumes input until it has processed the marker. An easier way is to use the %n fscanf() conversion pattern, and let it do the heavy work.

    If the scan pattern matches until %n, the number of characters consumed thus far (by this pattern) is saved to the integer parameter.

    The reason this is rarely seen in the wild is that the documentation is inconsistent about whether %n counts as a conversion: the C standard says it does not increment the returned assignment count, but a corrigendum appears to contradict that. Either way, the return value of the function does not tell you whether the value was saved. I avoid the issue by initializing the integer to -1 and then checking whether %n has changed it.

    Code:
    #include <stdio.h>
    
    /* Skip to just past the next "<div id="logo_wrapper">".
     * Returns 0 if success, EOF if end of input.
    */
    int skip_to_marker(FILE *const in)
    {
        int  ignore, length;
    
        while (1) {
    
            /* Skip to next '<'. */
            ignore = fscanf(in, "%*[^<]");
            if (ignore == EOF)
                return EOF;
    
            /* Is it '<div id="logo_wrapper">'? */
            length = -1;
            ignore = fscanf(in, "<div id = \"logo_wrapper\" > %n", &length);
            if (length > 0)
                return 0; /* Found! */
    
            /* Not found. The failed match has already consumed the '<'
             * (and any further characters that matched), so just loop. */
        }
    }
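    Note that if there is more than one possible marker, you can check for each of them, but not by simply repeating the "Is it '<...>'?" paragraph above. A failed match still consumes every character that matched before the mismatch (here at least the '<'); only the first mismatching character is left unread. One option is to read the whole tag into a buffer first and compare it against each marker.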

    Note that space in the scan pattern is special: it matches any amount of whitespace, including none. In other words, the above will accept <divid="logo_wrapper"> as the marker.
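
    To experiment with the pattern without a file, the same scan can be run on a string with sscanf(). This is only a sketch: marker_length is a hypothetical helper that is not part of the function above, and it assumes the same marker.

```c
#include <stdio.h>

/* Returns how many characters of text the marker pattern consumed
 * (the marker plus any whitespace after it), or -1 if text does not
 * start with the marker. Hypothetical helper for illustration. */
int marker_length(const char *text)
{
    int length = -1;   /* stays -1 unless the scan reaches %n */

    /* The pattern assigns nothing except through %n, so the return
     * value carries no useful information; only length does. */
    (void)sscanf(text, "<div id = \"logo_wrapper\" > %n", &length);
    return length;
}
```

    Called on a string that starts with <div id="logo_wrapper">, <div   id = "logo_wrapper" >, or <divid="logo_wrapper">, it returns a positive length. On <div id="header"> it returns -1, because the scan stops at the first mismatching character and never reaches %n.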

    If this approach is too crude for you, do not start writing your own HTML parser; use one of the existing ones instead, and use the document object model or something similar to determine where your interesting data lies.

    For anything more than a quick hack, I'd definitely use an HTML parser library myself.

  12. #12
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    Quote Originally Posted by std10093 View Post
    While I am not against this approach, I would like to warn that data like this may exist:
    Code:
    <!-- I am a comment --> <div class="myClass">USEFUL DATA</div>
    I didn't get the impression from the question that we're looking at any possible valid HTML, but if that's the case then any ad hoc solution less than a full XML/HTML parser is not recommended.

    Last Post: 04-12-2002, 11:59 PM