Skipping useless lines?

**tmac619619** · 11-24-2012

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        <!-- ******************** Page title ******************** -->
        <title>UC Davis: Summer Sessions 2008</title>
        <!-- ******************** End page title ******************** -->
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


<!-- A Modern CSS -->
<style type="text/css" media="screen">
        @import url(includes/main.css);
.style16 {      color: #FFFFFF;
        font-weight: bold;
}
#page_content #calendar #disclaimerinfo {
        margin-right: 3em;
}
/* If class is cancelled, use this style: */
.cancel {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 125%; font-weight:bold; color:#FF0000;}
</style>


<style type="text/css" media="print">
        @import url(includes/print.css);
</style>
<link href="includes/substyles.css" rel="stylesheet" type="text/css" />
</head>


<body>


<div id="page_content">


<!--
        This will show only in text-only browsers and allows the user
        to skip the primary navigation and go directly to either
        the second-level navigation or the main content of the page
-->
        <div id="skipnav">Skip directly to: <a href="#content">Main page content</a></div>


        <!-- ******************** Header ******************** -->
        <div id="header">
        <!-- The actual logo image is the background of the logo_wrapper DIV.
                 This makes the page more compatible with text-only browsers. -->
                <div id="logo_wrapper">





 81104           AAS     052      //    <--start of useful line 
 61754           AAS     050

What method can I use to process the data in the useful line above?
Someone mentioned using strncmp. But the entire file is very long, and doing it manually would take a long time. Basically, I want to skip the HTML junk above and jump straight to the line with 81104.

**std10093** · 11-24-2012

I don't agree with that someone. How would you do it? What would you compare with? That implies you already know what the useful data looks like.

The <!-- --> markers are for comments, so I would eat the comments.
What is inside < > should be eaten too.

As a result the useful line remains.

But I do not see any obvious way to avoid reading all the data. In fact, I think it is a must.
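The "eat the comments and whatever is inside < >" idea can be sketched roughly as below. This is only an illustration, not a real HTML parser: the names `strip_markup` and `skip_tag_or_comment` are made up for this example, and it assumes no '>' appears inside quoted attribute values.

```c
#include <stdio.h>

/* Discard the rest of a tag or comment; the opening '<' has already
 * been read. Comments end at "-->", everything else at the first '>'.
 * (Hypothetical helper, named for this sketch only.) */
static void skip_tag_or_comment(FILE *in)
{
    const char *opener = "!--";
    int c, matched = 0, dashes = 0;

    /* Does the tag start with "!--"? Stop early if '>' shows up. */
    while (matched < 3) {
        c = getc(in);
        if (c == EOF || c == '>')
            return;                 /* short tag such as <p> */
        if (c != opener[matched])
            break;
        matched++;
    }

    if (matched < 3) {
        /* Ordinary tag: discard up to and including '>'. */
        while ((c = getc(in)) != EOF && c != '>')
            ;
        return;
    }

    /* Comment: discard up to and including "-->". */
    while ((c = getc(in)) != EOF) {
        if (c == '-')
            dashes++;
        else if (c == '>' && dashes >= 2)
            return;
        else
            dashes = 0;
    }
}

/* Copy `in` to `out`, dropping comments and tags along the way. */
void strip_markup(FILE *in, FILE *out)
{
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '<')
            skip_tag_or_comment(in);
        else
            putc(c, out);
    }
}
```

Calling `strip_markup(stdin, stdout)` would turn it into a filter. Note that text between tags survives, so on the page above the title and the stylesheet rules would still come through; the result still needs some check for what a useful line looks like.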

**tmac619619** · 11-24-2012

I'm not sure how I can use the strncmp method for this...

**std10093** · 11-24-2012

This is what I am saying:

I think that this is not the correct approach.

You should write code that strips the comments from the data. Your code should also remove whatever is enclosed in < >.
In other words, you should just skip them, and only do something when you read useful data.

For example, this code strips the comments from C source (it is not completely correct, but it works for simple cases):

Code:

#include <stdio.h>

int main(void)
{
    int c;    /* Must be int, not char, so that EOF can be detected */
    int next; /* Lookahead: the character after a '/' decides whether a comment starts */

    while ((c = getchar()) != EOF)
    {
        if (c != '/')   /* Cannot start a comment, so just print it */
        {
            putchar(c);
            continue;
        }

        next = getchar();
        if (next != '*')   /* It turned out not to be a comment */
        {
            putchar(c);
            if (next != EOF)
                ungetc(next, stdin);   /* Look at it again: it may be another '/' */
            continue;
        }

        /* Inside a comment: eat everything up to and including the closing star-slash.
           Stopping at EOF keeps an unterminated comment from looping forever. */
        c = getchar();
        while (c != EOF)
        {
            next = getchar();
            if (c == '*' && next == '/')
                break;
            c = next;
        }
    }

    return 0;
}

Run

Code:

Macintosh-c8bcc88e5669-9:hw2 usi$ ./u
/*jkdhsdjfh*/ int a; /*fjkdsfhjkds*/
 int a;

**tmac619619** · 11-24-2012

bump.

STD, your method seems correct, but I think it's a bit complex for this assignment.

**Prelude** · 11-24-2012

Since you seem to have a way of determining whether a line is "useful", why not just write a validation function to tell you if the current line is one of them and discard it if not:

Code:

while (fgets(line, sizeof line, in) != NULL) {
    if (is_valid_line(line))
        process(line);
}

I'm guessing the number of false positives would be minimal with an appropriately written validation routine.

**tmac619619** · 11-24-2012

So basically... almost all of the junk lines start with either a comment or <.

So that's the pattern I should use to discard them?

**whiteflags** · 11-24-2012

This is the part where you code something to try out the idea and see if it works.

**Prelude** · 11-24-2012

Originally Posted by tmac619619

So basically... almost all of the junk lines start with either a comment or <.

So that's the pattern I should use to discard them?

I'd probably go with searching for the specific format of your "useful" lines (i.e. number TAB code TAB number), but looking for obviously wrong line starters works too. You just need to consider all of the possible starters and also make sure to handle invalid lines that get through as false positives.
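A validation routine along those lines might look something like the sketch below. The field layout (number, alphabetic code, number, nothing else) is an assumption taken from the two sample lines, and the variable names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical validator: accepts lines of the form
 *     <number> <code> <number>
 * such as " 81104           AAS     052".
 * Returns 1 if the line matches, 0 otherwise. */
int is_valid_line(const char *line)
{
    int  first, last;
    char code[16];
    int  end = -1;   /* stays -1 unless the whole pattern matched */

    /* Whitespace in the pattern matches tabs and spaces alike;
     * %n records how much of the line the pattern consumed. */
    if (sscanf(line, " %d %15[A-Za-z] %d %n", &first, code, &last, &end) < 3)
        return 0;
    if (end < 0)
        return 0;

    /* Anything left over after the third field counts as junk. */
    return line[end] == '\0';
}
```

Because the pattern insists on the three fields and nothing after them, lines that start with < or a comment are rejected without listing every possible starter.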

**std10093** · 11-24-2012

Originally Posted by Prelude

Since you seem to have a way of determining whether a line is "useful", why not just write a validation function to tell you if the current line is one of them and discard it if not:

Code:

while (fgets(line, sizeof line, in) != NULL) {
    if (is_valid_line(line))
        process(line);
}

I'm guessing the number of false positives would be minimal with an appropriately written validation routine.

While I am not against this approach, I would like to warn that data like this may exist:

Code:

<!-- I am a comment --> <div class="myClass">USEFUL DATA</div>

**Nominal Animal** · 11-24-2012

I'd just skip the input until a known marker string, say <div id="logo_wrapper">, and consume all whitespace including newlines that follow.

One way to approach it is to write a simple state machine that consumes input until it has processed the marker. An easier way is to use the %n fscanf() conversion pattern, and let it do the heavy work.

If the scan pattern matches until %n, the number of characters consumed thus far (by this pattern) is saved to the integer parameter.

The reason this is rarely seen in the wild is that the standards disagree on whether it is counted as a conversion or not, so the return value of the function does not tell you whether it was saved. I avoid the issue by initializing the integer to -1, then checking whether it has changed due to %n.

Code:

#include <stdio.h>

/* Skip to next "<div id="logo_wrapper">".
 * Returns 0 if success, EOF if end of input.
*/
int skip_to_marker(FILE *const in)
{
    int  ignore, length;

    while (1) {

        /* Skip to next '<'. */
        ignore = fscanf(in, "%*[^<]");
        if (ignore == EOF)
            return EOF;

        /* Is it '<div id="logo_wrapper">'? */
        length = -1;
        ignore = fscanf(in, "<div id = \"logo_wrapper\" > %n", &length);
        if (length > 0)
            return 0; /* Found! */

        /* Not found. The failed match has already consumed the '<'
         * (and anything else that matched), so just loop again. */
    }
}

Note that if there is more than one possible marker, checking for each of them takes a little care. When a literal pattern fails to match, fscanf() has already consumed every character that matched before the mismatch; only the mismatching character is left in the input. A second "Is it '<...>'?" paragraph placed after the first would therefore never see the '<'. To test several alternative markers, read the tag into a buffer first and compare it against each candidate.

Note that space in the scan pattern is special: it matches any amount of whitespace, including none. In other words, the above will accept <divid="logo_wrapper"> as the marker.
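To see the -1 sentinel and the whitespace rule in isolation, the same pattern can be tried on strings with sscanf(). This is just a small sketch; `match_marker` is a name made up for the example.

```c
#include <stdio.h>

/* Returns how many characters of `text` the marker pattern consumed
 * (including trailing whitespace), or -1 if the pattern did not match
 * all the way to %n. Hypothetical helper for illustration. */
int match_marker(const char *text)
{
    int length = -1;   /* sentinel: unchanged unless %n is reached */

    /* The return value is ignored on purpose: the pattern contains no
     * assigning conversions other than %n, so only `length` says
     * whether the whole pattern matched. */
    (void)sscanf(text, "<div id = \"logo_wrapper\" > %n", &length);
    return length;
}
```

With this, both `<div id="logo_wrapper">` and `<divid="logo_wrapper">` yield a positive length, while `<div id="header">` leaves the sentinel at -1.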

If this approach is too crude for you, do not start writing your own HTML parser, but use one of the existing ones instead, and use the document object model or something to determine where your interesting data lies.

For anything more than a quick hack, I'd definitely use an HTML parser library myself.
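For comparison, the state machine alternative mentioned earlier in this post could be sketched as follows. The function name `skip_past` is made up, and the sketch assumes the marker's first character appears nowhere else in the marker (true for a tag containing a single '<'); unlike the fscanf() version it matches the marker text exactly, with no flexible whitespace.

```c
#include <stdio.h>

/* Consume input until `marker` has been read in full.
 * Returns 0 on success, EOF if the input ended first.
 * Assumption: marker[0] occurs only once in `marker`; otherwise a
 * proper string-search automaton would be needed. */
int skip_past(FILE *in, const char *marker)
{
    size_t state = 0;   /* number of marker characters matched so far */
    int    c;

    if (marker[0] == '\0')
        return 0;

    while ((c = getc(in)) != EOF) {
        if (c == (unsigned char)marker[state])
            state++;
        else if (c == (unsigned char)marker[0])
            state = 1;  /* mismatch, but this may start a new match */
        else
            state = 0;

        if (marker[state] == '\0')
            return 0;   /* whole marker seen */
    }
    return EOF;
}
```

After a successful call, any following conversion such as `%d` skips the whitespace and newlines between the marker and the first data line on its own.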

**Prelude** · 11-24-2012

Originally Posted by std10093

While I am not against this approach, I would like to warn that data like this may exist:

Code:

<!-- I am a comment --> <div class="myClass">USEFUL DATA</div>

I didn't get the impression from the question that we're looking at any possible valid HTML, but if that's the case then any ad hoc solution less than a full XML/HTML parser is not recommended.
