Finding URLS

This is a discussion on Finding URLS within the C Programming forums, part of the General Programming Boards category.

  1. #1
    Registered User
    Join Date
    Aug 2005
    Posts
    56

    Finding URLS

    I have a program that now is able to find text within quotes, in an html file (special thanks to Hammer). Here is what I ended up with...

    Code:
    #include <stdio.h>

    #define SIZE     500000
    #define MAX_URLS 150
    #define MAX_LEN  512

    enum { NOT_INURL, INURL };

    /* static: these arrays are too large to place safely on the stack */
    static char html[SIZE];
    static char url[MAX_URLS][MAX_LEN];

    int main(void)
    {
      size_t i, n;
      int b = 0, d = 0;          /* b: current string, d: position within it */
      int state = NOT_INURL;
      FILE *fp = fopen("c:\\blah.html", "r");

      if (fp == NULL)
      {
        perror("c:\\blah.html");
        return 1;
      }

      /* fread does not null-terminate, so loop over the byte count instead */
      n = fread(html, sizeof(char), SIZE, fp);
      fclose(fp);

      for (i = 0; i < n && b < MAX_URLS; i++)
      {
        char c = html[i];

        if (c == '"')
        {
          if (state == INURL)    /* closing quote: finish this string */
          {
            url[b][d] = '\0';
            printf("%s\n", url[b]);
            b++;
            d = 0;               /* the next string starts at index 0 again */
          }
          state = !state;
          continue;
        }
        if (state == INURL && d < MAX_LEN - 1)
          url[b][d++] = c;
      }

      printf("There are %d matches\n\n\n\n\n", b);

      return 0;
    }
    However, this method doesn't single out URLs; it picks up every string within quotes. Do any of you have ideas on how I can single out the quoted strings that are URLs? I was thinking of first searching for a quote and then, if the next four characters were "http", continuing to stuff characters into the array. Or I could go through the array after it finishes finding all the quoted text and search for "http", but I don't know the code for throwing out elements of an array and having everything reordered, unless I rewrote the array and deleted the old one, which is inefficient. I was wondering if anybody has any ideas about this dilemma, and I would really appreciate it if someone could throw some code ideas at me.
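
    For the second idea (filtering after all the quoted strings have been collected), a second array is probably not needed. The following is only a sketch, assuming the strings sit in a two-dimensional char array with 512-byte rows as in the program above. The name keep_http is hypothetical and is not part of the original code. It walks the array once with a read index and a write index, copying each row that starts with "http" down to the next free slot.

```c
#include <string.h>

#define MAX_LEN 512

/* Hypothetical helper: keeps only the entries that start with "http",
   moving them to the front of the array. Returns the new count. */
static int keep_http(char url[][MAX_LEN], int count)
{
  int read, write = 0;

  for (read = 0; read < count; read++)
  {
    if (strncmp(url[read], "http", 4) == 0)
    {
      if (write != read)
        strcpy(url[write], url[read]);  /* distinct rows, so no overlap */
      write++;
    }
  }
  return write;
}
```

    Calling keep_http(url, b) after the scanning loop would leave the matching strings in url[0] up to url[n - 1], where n is the return value. Rows past that point still hold stale text and should simply be ignored.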

  2. #2
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,498
    Well, you toggle the state on the known start and end patterns of URLs rather than on quotes.

    > if (c == '\"')
    Basically, change this.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

  3. #3
    Registered User
    Join Date
    Aug 2005
    Posts
    56
    But I still need to use the quotes. I want it to read in http and beyond, and then stop when it hits the second quote. Or I might just not understand what you are saying.

  4. #4
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,498
    I mean your tests for 'in' and 'out' of a URL need to be more sophisticated than comparing a single character, but the method is essentially the same.

  5. #5
    Registered User
    Join Date
    Aug 2005
    Posts
    56
    That's the problem: I do not know any other way to read in the file except character by character. I am wondering if there is any code for something like this...

    If you hit quote, you are in the URL.
    If the first character is 'h', keep going.
    If the second character is 't', keep going (etc. up until "http").

    The problem with that is that as soon as it hits a quote, it starts writing to the array. But if, for example, the second character is not 't', how would I reset the array back to the first index so it can be overwritten?
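
    One way to get that reset, sketched here under a few assumptions: keep writing into the current row as before, and only decide at the closing quote whether to keep it. If the row does not start with "http", the row index is left alone and the column index goes back to 0, so the next quoted string overwrites it. The name collect_urls is hypothetical, and the input is assumed to be a null-terminated string that uses double quotes only.

```c
#include <string.h>

#define MAX_LEN 512

/* Hypothetical helper: copies each quoted string that starts with "http"
   from html into url[], and returns how many were stored. */
static int collect_urls(const char *html, char url[][MAX_LEN], int max_urls)
{
  int b = 0, d = 0;     /* b: current row, d: write position in that row */
  int in_quotes = 0;
  size_t i;

  for (i = 0; html[i] != '\0' && b < max_urls; i++)
  {
    char c = html[i];

    if (c == '"')
    {
      if (in_quotes)
      {
        url[b][d] = '\0';
        if (strncmp(url[b], "http", 4) == 0)
          b++;          /* keep it: move on to the next row */
        d = 0;          /* either way, the next string starts at index 0 */
      }
      in_quotes = !in_quotes;
    }
    else if (in_quotes && d < MAX_LEN - 1)
    {
      url[b][d++] = c;
    }
  }
  return b;
}
```

    Nothing has to be thrown out or reordered this way, because a rejected string never advances the row index in the first place.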

  6. #6
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,498
    That is the whole concept of a state machine, though a grammar would be equally effective.
    You look at the character, and the state you're in and determine what your next state should be.

    Eg.
    Code:
    enum states {
      S_NONE,
      S_TAG,
      S_QUOTED_STRING,  /* to hide say   "this is how a url begins - http://" */
      S_ESCAPE_CHAR,
      S_COMMENT,
    };
    Then you basically have a switch statement with a case for each state:
    Code:
    int ch;                       /* int, not char, so EOF can be told apart */
    enum states state = S_NONE;

    while ( (ch=fgetc(fp)) != EOF ) {
      switch ( state ) {
        case S_COMMENT:
          if ( ch == '\n' ) state = S_NONE; // comment ends at a newline
          break;
        // etc
      }
    }
    Though you'd probably want to make each state a separate function if life is going to get complicated.
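
    As a rough illustration of the same idea, here is a cut-down sketch with only three of the states. It assumes the HTML is already in a null-terminated string, that attribute values use double quotes, and that comments and escape characters can be ignored; the name print_urls is hypothetical. Quotes are only honoured inside a tag, so quoted body text that happens to begin with "http" is skipped.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 512

enum states { S_NONE, S_TAG, S_QUOTED_STRING };

/* Hypothetical helper: prints every quoted attribute value that starts
   with "http" and returns how many were found. */
static int print_urls(const char *html)
{
  enum states state = S_NONE;
  char buf[MAX_LEN];
  int len = 0, found = 0;
  const char *p;

  for (p = html; *p != '\0'; p++)
  {
    switch (state)
    {
      case S_NONE:                       /* plain text between tags */
        if (*p == '<') state = S_TAG;
        break;
      case S_TAG:                        /* inside <...> */
        if (*p == '>') state = S_NONE;
        else if (*p == '"') { state = S_QUOTED_STRING; len = 0; }
        break;
      case S_QUOTED_STRING:              /* inside a quoted attribute value */
        if (*p == '"')
        {
          buf[len] = '\0';
          if (strncmp(buf, "http", 4) == 0) { puts(buf); found++; }
          state = S_TAG;
        }
        else if (len < MAX_LEN - 1)
          buf[len++] = *p;
        break;
    }
  }
  return found;
}
```

    Each case looks only at the current character and the current state, which is why adding S_COMMENT or S_ESCAPE_CHAR later should mean adding cases rather than rewriting the loop.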
