Parsing multiple pages of dynamically created Websites

Posted in the C Programming forum, part of the General Programming Boards category.

  1. #1
    Registered User
    Join Date
    May 2009
    Posts
    2

    Parsing multiple pages of dynamically created Websites

    Howdy

    I'm considering writing a simple program which grabs a specific piece of information from a webpage online, well, multiple pages online. The page is dynamically created by polling a database through variables passed in the web address through the "get" method, i.e.

    http://www.thispage.com/index.aspx?variable=1234

    The variable increments by a known amount, which the program can account for. My plan is to loop over the values and, on each pass, build a command string such as "wget http://www.thispage.com/index.aspx?variable=%s" with the current value filled in.

    Then call system(string) on that command. To extract what I'm looking for, build a second string such as "grep value *.aspx* > output_file", call system(secondstring), and then open output_file and parse the single line that grep returned. Finally, unlink the files before the next pass begins, so that a page that failed to download can be detected.

    But I'm wondering if this is the best way to go about it? I'm comfortable with the code behind what I've described, but are there libraries available where I wouldn't have to use Unix for the wget/grep aspect of this?

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    IMO you do not actually make much sense. I would also guess that

    Quote: I'm comfortable with the code behind what I've described
    does not refer to your (obviously non-existent) knowledge of the C programming language. Please be honest, life is much easier that way.

    Anyway, w/r/t options instead of wget: you will need to use the socket API and speak HTTP over a TCP connection yourself, which is far from simple. Just grabbing a page takes something like 50-100 lines of code.

    w/r/t options instead of grep: there is a POSIX regexp library for C (regex.h), but you don't need it. Once you have the page (you don't have to save anything to a file, you can keep it in memory) this would be a reasonably basic parsing exercise, difficult but not impossible for a beginner.

    tip: if this is your area of interest, you will be better off pursuing bash scripting or perl. To do this completely in C without any system() calls will be several hundred lines; in perl, fewer than 50.
    Last edited by MK27; 05-24-2009 at 10:16 PM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  3. #3
    Registered User
    Join Date
    May 2009
    Posts
    2

    Response..

    I'm thick skinned, no harm done in your opinion. Though I was surprised to read your attitude towards new posters.

    Code:
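    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    
    void parseMyFile(void);
    
    int main(void) {
    
    	char http_reference[100];
    	int i = 0;
    
    	while (i < 100)
    	{ // grab and parse 100 pages
    		// Quote the URL so the shell does not treat '?' as a glob character.
    		snprintf(http_reference, sizeof http_reference, "wget 'http://www.somedomain.com/index.aspx?variable=%d'", i * 12);
    		system(http_reference);
    		system("grep mystring *.aspx* > my_output_file");
    		parseMyFile();
    		// Remove the downloaded page (wget names it after the last URL component).
    		snprintf(http_reference, sizeof http_reference, "index.aspx?variable=%d", i * 12);
    		unlink(http_reference);
    		unlink("my_output_file");
    		i++;
    	}
    
    	return 0;
    }
    
    void parseMyFile(void) {
    
    	FILE *mof; // my_output_file
    	int c; // int, not char, so EOF can be told apart from data
    
    	if ((mof = fopen("my_output_file", "r")) == NULL)
    	{
    		exit(EXIT_FAILURE);
    	}
    
    	// Skip ahead to the first 'Z'; stop at end of file if there is none.
    	while ((c = getc(mof)) != EOF && c != 'Z') {	}
    
    	// Do something meaningful with c and the next n characters.
    
    	fclose(mof);
    }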
    I'd like some alternatives or a reference to a C library to simplify this. (And I wouldn't be surprised if there were a few problems with the above code as it's something thrown together as a concept. Trying to offer it as a quick idea to what I'm aiming for.)
    Last edited by Quarlash; 05-24-2009 at 10:59 PM.

  4. #4
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Your skin ain't thick; you're paranoid if you think I'm even a little bit mean. Or else I am turning into exactly the kind of insensitive bastard I despised when I was a new poster. Maybe you thought this was the "Rainbow Brite" forum?

    That does clarify your intent, well done. Take a crack at sockets if you want; they are challenging. Here is a thing I wrote last year when learning; it downloads an image (e.g. a jpg) off the net as raw data and writes it back into a file, if you want some idea of what's involved in such an apparently simple task (after that, find a tutorial). You will notice the HTTP GET request in main() at the bottom*.
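
    For a sense of what such a GET request looks like on the wire, here is a minimal sketch. build_get_request() is a hypothetical helper, not taken from grabimage.c; it only formats the request text, using the placeholder host and path from earlier in the thread. Opening the socket and sending the bytes is a separate step.

```c
#include <stdio.h>

/* Hypothetical helper (not from grabimage.c): format a minimal HTTP GET
 * request into buf. Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t cap, const char *host, const char *path)
{
    int n = snprintf(buf, cap,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n",
                     path, host);

    /* snprintf reports the length it wanted to write; reject truncation. */
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

    Called as build_get_request(req, sizeof req, "www.thispage.com", "/index.aspx?variable=1234"), it fills req with the text that would then be written to the connected socket.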

    But first, work on taking the grep out. That will mean expanding parseMyFile() to open the file, locate the string and do what you will. As I mentioned before, there is a regex.h, if you have some knowledge or experience of regular expressions (grep = "global regular expression print"). That is not necessary, however. It can be done fairly easily by processing one line at a time, in a loop, using general string functions like strcmp() and strstr().
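
    The line-at-a-time approach could look roughly like the sketch below. find_line() is a hypothetical name, and the 1024-byte line buffer is an assumption: a longer line is read in pieces, so a match that straddles two pieces would be missed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan fp line by line and copy the first line that
 * contains needle into out (without its line ending).
 * Returns 1 if a line was found, 0 otherwise. */
int find_line(FILE *fp, const char *needle, char *out, size_t cap)
{
    char line[1024];   /* assumed maximum line length */

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, needle) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending */
            snprintf(out, cap, "%s", line);        /* bounded copy */
            return 1;
        }
    }
    return 0;
}
```

    This does the job of the grep step: open the downloaded page with fopen(), pass the FILE pointer and "mystring" to find_line(), and parse whatever comes back in out.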

    * Here is the "mine.h" if you actually want to compile and try that, but you don't really need to bother -- this is just a clue as to what you are getting yourself into, sans wget.
    Last edited by MK27; 05-25-2009 at 12:00 AM.

  5. #5
    Registered User
    Join Date
    Jul 2009
    Posts
    3
    MK27 very kindly answered the OP's question (a good one, IMO, since I'm trying to solve the same problem):

    "IMO you do not actually make much sense. I would also guess that
    Quote: I'm comfortable with the code behind what I've described
    does not refer to your (obviously non-existent) knowledge of the C programming language. Please be honest, life is much easier that way."

    So naturally, always wanting to learn new stuff from one who has knowledge of the C programming language (KCPL), I took a look at the code that MK27 generously gave us:

    Code:
    void pretrunc (char *string, int chs) {
    	int i;
    	for (i=0;i<strlen(string);i++) string[i]=string[i+chs];
    	string[i]='\0';
    }
    
    struct in_addr parseaddr (char *addr) {
    	struct hostent *info;
    	struct in_addr address, *ptr;
    	
    	if ((strncmp(addr,"http",4))==0) pretrunc(addr,7);
    	...
    }
    Excellent! D'ya suppose there's some way to get rid of pretrunc()? It's called only once in grabimage.c so maybe it doesn't have to be a function at all. Let's give it a shot:

    Code:
    struct in_addr parseaddr (char *addr) {
      int i;
      struct hostent *info;
      struct in_addr address, *ptr;
    
      if ((strncmp(addr,"http",4))==0) {
        for (i=0;i<strlen(addr);i++) addr[i]=addr[i+7];
        addr[i]='\0';
      }
      ...
    }
    Great! We got rid of an unneeded function and the whole thing is easier to read and follow. But wait a sec: isn't there a better way to delete the constant string "http://" than to move all the chars in the URI one-by-one? Let's see:

    Code:
    struct in_addr parseaddr (char *addr) {
      int i;
      struct hostent *info;
      struct in_addr address, *ptr;
    
      if ((strncmp(addr,"http",4))==0) {
        /* move only the tail after "http://" plus its terminating '\0' */
        memmove(addr, &addr[7], strlen(&addr[7]) + 1);
      }
      ...
    }
    Hey, that looks OK! memmove() copies a bunch of chars from place to place in one shot, and without that loop and the pesky call to strlen() every time through. Thanks, MK27, for a great lesson from one who has KCPL.

    I haven't done too much with the togbool() function that is in mine.h:

    Code:
    int togbool (int *boo) {
    	if (*boo==0) *boo=1;
    	else *boo=0;
    	return *boo;
    }
    I guess this is a function that toggles a boolean between TRUE and FALSE. Does this really have to be a function? We're talking about doing an XOR on one bit:

    Code:
      *boo ^= 1;
    Would that work? Or something like that?
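
    One caveat, shown as a small sketch: the XOR form matches togbool() only when the value is already 0 or 1. For any other non-zero value, togbool() yields 0 while ^= 1 merely flips the low bit; logical NOT reproduces togbool() for every input. toggle_xor() and toggle_not() are hypothetical names used only for this comparison.

```c
/* Hypothetical helpers comparing two ways to toggle an int flag. */

/* Flips only the lowest bit: 0 -> 1, 1 -> 0, but 5 -> 4. */
int toggle_xor(int *flag)
{
    *flag ^= 1;
    return *flag;
}

/* Same result as togbool(): 0 -> 1, any non-zero value -> 0. */
int toggle_not(int *flag)
{
    *flag = !*flag;
    return *flag;
}
```

    If the flag is only ever assigned 0 or 1, the two behave identically and the XOR one-liner is fine.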

    Don'cha love to learn new stuff? Thanks a million MK27!

    -- pt

  6. #6
    Registered User linuxdude's Avatar
    Join Date
    Mar 2003
    Location
    Louisiana
    Posts
    926
    Meh. I'd use python. It has a great urllib2 library, and its regex library is good too. This isn't a troll, but I like to use tools that are good for the job. Are you required to use C?

  7. #7
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by pete142
    Don'cha love to learn new stuff? Thanks a million MK27!
    I hope you are not being sarcastic or something, pete. Wow, was I cranky on 5/25/09.

    Anyway, I did try to introduce that with the caveat that it is something I wrote while learning last year. I put stuff like that on-line as a reference mostly for myself, but occasionally it seems pertinent to others -- it's nice to see someone made some use of it! One day in the near future I may go thru those and update them; for example I now never never put strlen in a for() condition, something that you caught. I still believe it is probably a decent demo of how to use the socket API functions, eg, the syntax is correct and it does work. Unfortunately I was obviously not interested in /* adding comments */ much at the time.

    I'm also not using that header anymore; almost certainly I did write "pretrunc" for use there, but made it a generalized function (hence it's in that header). So my expectation was that "pretrunc'ing" a string was something I might do in various programs.

    Time has revealed, however, that I don't really need a library of "generalized functions" like that much, since I now think C functions are better the more they are specialized. Which might explain why there are surprisingly few C libraries for anything, compared to many languages where half the programming can be done by including the right::modules or whatever.

    Thanks btw pete for the x ^= 1 toggle.

    Quote Originally Posted by linuxdude
    Meh. I'd use python. Are you required to use C?
    No, I think I recommended perl to the OP. Or just a bash script, since he/she likes wget.

    Keep in mind tho, the people who wrote python and perl for your convenience used C to do it. So behind your ".url" functions there are probably some hundreds of lines of code much like that.
    Last edited by MK27; 07-01-2009 at 02:23 PM.

  8. #8
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Wow! Guess what I did this afternoon: I rewrote that "grabimage" program and added more comments, since a few different people seemed interested, and I have received PMs about it, etc.

    grabimage.c

    It is much more comprehensible now and doesn't require a non-standard header. Plus I found a buffer overflow in the old one, and my method for determining the end of the HTTP header was flawed.
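
    For reference, an HTTP header block ends at the first blank line, i.e. the byte sequence \r\n\r\n. Below is a minimal sketch of locating it, assuming the response received so far sits in a NUL-terminated buffer. find_body() is a hypothetical name, and this is not necessarily how grabimage.c does it.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the first byte after the header
 * block, or NULL if the terminating blank line has not arrived yet. */
const char *find_body(const char *response)
{
    const char *end = strstr(response, "\r\n\r\n");
    return end != NULL ? end + 4 : NULL;
}
```

    A NULL result means more data must be read from the socket before the header can be parsed.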

  9. #9
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,246
    There are still some things you may want to look into fixing:
    - You assume that HTTP headers are case sensitive. They are not.
    - What happens if no Content-Length header is sent back?
    - while (buffer[0]!='\r') { The first time through this loop, you are accessing undefined memory. The contents of buffer[0] are undefined.
    - You don't check the return values of fopen() and malloc().
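
    The first two points can be handled together, as in the sketch below. It assumes the header block is in a NUL-terminated buffer and that strncasecmp() from POSIX <strings.h> is available; content_length() is a hypothetical name. It matches the field name regardless of case and returns -1 when the header is absent, so the caller can fall back to another way of finding the end of the body.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX, not ISO C) */

/* Hypothetical helper: return the Content-Length value found in a header
 * block, or -1 if no such header is present. */
long content_length(const char *headers)
{
    static const char name[] = "Content-Length:";
    const size_t n = sizeof name - 1;
    const char *line = headers;

    while (line != NULL && *line != '\0') {
        if (line[0] == '\r' && line[1] == '\n')
            break;                                  /* blank line: end of headers */
        if (strncasecmp(line, name, n) == 0)
            return strtol(line + n, NULL, 10);      /* strtol skips leading spaces */
        line = strstr(line, "\r\n");                /* advance to the next line */
        if (line != NULL)
            line += 2;
    }
    return -1;
}
```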

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by bithub
    There are still some things you may want to look into fixing:
    - You assume that HTTP headers are case sensitive. They are not.
    Good point.

    Quote:
    - What happens if no Content-Length header is sent back?
    - You don't check the return values of fopen() and malloc().
    I believe Content-Length is required, but in any case the header prints out so the user will notice this. It's not really intended to be a program anyone would actually use, more just to demonstrate some things that IMO slipped thru the cracks of a lot of socket tutorials. For that, I ain't adding clutter checking fopen() and malloc().

    Quote:
    - while (buffer[0]!='\r') { The first time through this loop, you are accessing undefined memory. The contents of buffer[0] are undefined.
    Yep! Will fix.

  11. #11
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,246
    Quote: I believe Content-Length is required
    No, it is not. If a server is using Chunked Encoding, then the size of the response is embedded in the response data.
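
    To make that concrete: in a chunked body, each chunk is preceded by its size in hexadecimal on a line of its own, and a chunk of size 0 marks the end. The sketch below decodes such a body under loud assumptions: the whole body is already in memory, it is NUL-terminated text with no embedded NUL bytes, and trailers are ignored. decode_chunked() is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the payload of a complete chunked body into out.
 * Returns the decoded length, or -1 on malformed input or if out is too small. */
long decode_chunked(const char *in, char *out, size_t cap)
{
    size_t used = 0;

    for (;;) {
        char *end;
        unsigned long size = strtoul(in, &end, 16);  /* chunk size is hexadecimal */
        if (end == in)
            return -1;                               /* no digits found */

        in = strstr(end, "\r\n");                    /* skips any chunk extensions */
        if (in == NULL)
            return -1;
        in += 2;

        if (size == 0)
            return (long)used;                       /* last chunk */
        if (size > cap - used)
            return -1;                               /* output buffer too small */
        if (strlen(in) < size + 2)
            return -1;                               /* truncated chunk */

        memcpy(out + used, in, size);
        used += size;
        in += size;

        if (in[0] != '\r' || in[1] != '\n')
            return -1;                               /* chunk data must end in CRLF */
        in += 2;
    }
}
```

    For example, the body "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes to the nine bytes "Wikipedia".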

  12. #12
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by bithub
    No, it is not. If a server is using Chunked Encoding, then the size of the response is embedded in the response data.
    Maybe I will add a little note about that then.

    Thanks bithub!
