Thread: How Difficult is This?

  1. #16
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Do you have example data files you could upload somewhere, as Adak suggested?

    Your example programs do not show us the problem, only your constrained thinking of what you think the answer should look like.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  2. #17
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    OK, np.

    Here is a master list of URLs:
    swoopshare - Download TestMasterList.txt


    Here is a test search list:
    swoopshare - Download TestSearchList.txt

    Here is a test html page:
    swoopshare - Download TestHTMLPage.htm

    Generally, I need to make sure the base domain isn't duplicated, not just the whole URL.

    One thing I might look for would be:
    powered by ShrinkTheWeb. (Notice that ShrinkTheWeb is inside an <a> tag.)
    Sometimes I'd want the URL, and sometimes just to know the text says 'powered by ShrinkTheWeb'.

    I hope this helps.
    As I mentioned, it's just a long list (sometimes 300K entries) compared against a (usually) shorter list, and I need to remove the dupes before adding the shorter list to the longer one.
    Last edited by MAtkins; 02-10-2011 at 11:58 AM.

  3. #18
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    Right now I'm studying binary trees as a possible solution to the URL duplication problem.
    I wonder whether strstr or a binary tree would be faster.

    I need basic string search capabilities that will work with UTF-8:
    .indexOf(char *Haystack, char *Needle) <- would return the index of Needle within Haystack

    .indexOfCI(char *Haystack, char *Needle) <- same thing but case-insensitive

    .subStr(char *Str, int nStart) <- creates a substring of Str, shaving everything off before the index nStart

    .subStr(char *Str, int nStart, int nLength) <- same thing but specifies the total length of the string returned

    .strReplaceFirst(char *Str, char *strSrch, char *strReplace) <- replaces the first occurrence of strSrch w/ strReplace in string Str

    .strReplace(char *Str, char *strSrch, char *strReplace) <- replaces all occurrences of strSrch w/ strReplace in string Str

    I'm working on these but I have to admit the whole 'pointer' thing is kicking my butt.
    I can't seem to find clear docs on the nomenclature, but I'm getting better.
    I'm really surprised these functions aren't already in a readily available library (or are they?).
    Of course, the only reason I want them in C is that I need them to be FAST.
    Last edited by MAtkins; 02-10-2011 at 12:05 PM.

  4. #19
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    The duplicate checking part is what I'm concentrating on.

    << Big Evil Grin >>

  5. #20
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Given your two data files, I can do the search of the 2nd list in the first list in about 0.25 seconds.
    Code:
    $ time ./a.out 
    597756 calls to comparison to sort 44337 records
    Hits=91, misses=122
    3216 calls to comparison to search 44337 records
    
    real	0m0.219s
    user	0m0.188s
    sys	0m0.020s
    That's a 1.6GHz laptop and no attempt at optimisation.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #21
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I'm getting results similar to yours, Salem, just using a sort and binary search with string arrays. ***

    Unless there's a problem with the actual file sizes that were uploaded, this looks to be an algorithm fix, much more than anything else.

    MAtkins, can you confirm we have the right files? Have you tried using a binary search function to limit the comparisons that have to be made?

    *** On my spiffy new Pelles C IDE and compiler!

    Thanks, Common Tater, for the tip on Pelles C.
    Last edited by Adak; 02-10-2011 at 09:28 PM.

  7. #22
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Adak View Post
    *** On my spiffy new Pelles C IDE and compiler!

    Thanks, Common Tater, for the tip on Pelles C.
    It may not happen all that often, but at least some of the time I actually do know what I'm talking about.

    It's a pretty cool tool... Glad you like it.

  8. #23
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    First a note on the binary search:
    ===============================
    The heart of the program is the binary search algorithm - which is just the old "guess the number I'm thinking of" game you probably played at some time, as a child.

    Guess a number between 1 and 10:
    5 (always start in the middle)

    Too low: New "floor" is 6 (lowest possible valid number)
    8 (6 + 10)/2 is the new "middle"

    Too high: New "ceiling" is 7 (highest possible valid number)
    6 (6 + 7)/2, rounded down to nearest whole number (integer).

    Too low
    Has to be: 7

    Only 3 comparisons had to be made, instead of possibly 9. As the number of items increases, binary search ("binary" because it constantly cuts the possible answer space in half) just gets better and better.

    The only requirement to remember is that the binary search has to be made on sorted data. It won't work otherwise.

    Program note:
    ====================
    The fastest string sorter I know uses the system call:
    system("sort master.txt >master1.txt");

    You may not be able to use that system call. If so, we'll set you up with a fast programmed sorter, no problem.

    Code:
    #include <stdio.h>
    #include <time.h>
    #include <string.h>
    #include <stdlib.h>
    
    #define MSIZE 50000
    #define SSIZE 215
    
    int checkIt(char s[SSIZE][200], char **m, int i, int low, int high);
    
    int main() {
      int i=0, isdup, mcount=0, scount=0, dupcount=0;
      char s[SSIZE][200];
      char **m;
      clock_t start, stop;
      FILE *fp;
    
      printf("\n\n");
      start = clock();
      system("sort master.txt >master1.txt");
      m=malloc(MSIZE * sizeof(char *));
      if(m == NULL) {
        printf("Error allocating memory for m\n");
        return 1;
      }
    
      for(i=0;i<MSIZE; i++) {
        m[i] = malloc(200);
        if(m[i] == NULL) {
          printf("Error allocating memory for m[]\n");
          return 1;
        }
      }
    
      if((fp=fopen("master1.txt","rt")) == NULL) {
        printf("Error opening master file - terminating\n");
        return 1;
      }
    
      while((fgets(m[mcount], 200, fp)) != NULL) {
        ++mcount;
        if(mcount == MSIZE-1)
          break;
      }
      fclose(fp);
      if((fp=fopen("slave.txt", "rt")) == NULL) {
        printf("Error opening slave file -- terminating\n");
        return 1;
      }
      while((fgets(s[scount], 200, fp)) != NULL) {
        ++scount;
        if(scount == SSIZE-1)
          break;
      }
      fclose(fp);
      for(i=0;i<scount; i++) {
        /* search only the mcount lines actually read, not all of MSIZE */
        isdup = checkIt(s, m, i, 0, mcount-1);
        if(isdup >= 0) {
          dupcount++;
          //printf("Check:\n%s\n%s\n", m[isdup], s[i]);
          //getchar();
        }
      }
    
      stop = clock();
      printf("\nElapsed Time: %lf \n",(double)(stop-start)/CLOCKS_PER_SEC);
      printf("%d duplicates were found in the master list\n", dupcount);
    
      for(i=0;i<MSIZE; i++) {
        free(m[i]);
      }
      free(m);
      return 0;
    }
    /* this is the binary search function. It returns the matching index,
       or -1 if the string isn't in the master list. Note that 0 is a
       valid index, so "not found" must be -1, not 0.
    */
    int checkIt(char s[SSIZE][200], char **m, int i, int low, int high) {
      int mid, cmp;
    
      while(low <= high) {
        mid = low + (high - low)/2;   /* avoids (low+high) overflow */
        cmp = strcmp(m[mid], s[i]);   /* compare once, not twice */
        if(cmp > 0)
          high = mid-1;
        else if(cmp < 0)
          low = mid+1;
        else
          return mid;  //duplicate found
      }
      return -1;
    }
    This is rough and untested code. Be sure to test it thoroughly - very thoroughly, before using it. You assume all risks by running it.
    Last edited by Adak; 02-11-2011 at 05:58 AM.

  9. #24
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Why didn't you use qsort() and bsearch(), both in stdlib.h ?

    I bulked up the data to 2M lines in the master file, and 10K lines in the search list.
    Code:
    $ wc -l *large.txt
      2169075 TestMasterList_large.txt
        10905 TestSearchList_large.txt
      2179980 total
    $ gcc -std=c99 -g -Wall -O2 zarniwoop.c
    $ ./a.out TestMasterList_large.txt TestSearchList_large.txt
    Read all files: Elapsed time = 2.211
    
    Sort master list for bsearch: Elapsed time = 3.076
    38724037 calls to comparison to sort 2169075 records
    
    Search update list in master: Elapsed time = 0.010
    Hits=52, misses=10853
    229535 calls to comparison to search 10905 records
    
    Search 10% (always hits) list in master: Elapsed time = 0.684
    Hits=216907, misses=0
    4351053 calls to comparison to search 216907 records
    If the master list were deliberately kept in a sorted state, that would basically halve the time it takes.

    In any event, the actual lookup time is basically unmeasurable noise compared to everything else. The big factor that isn't going away is reading the file in the first place.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #25
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    Boy, I don't know if that's gonna be fast enough.
    It took 0.046 seconds to run through 172984 records against 78036 records.
    SHEESH! (did you hear the sarcasm?) That's scary man!!

    I blinked and it was DONE!
    Thanks!
    Now I've gotta figure out how to make it do what I need (which should be cake).

  11. #26
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    I had that wrong - I named my master file Master1.txt. I need to look more closely at the code.
    So, I found a file w/ 1481807 records & a slave w/ 189526 records.
    Adak's program took 5.616 seconds to run it.

    I couldn't compile zarniwoop so I couldn't test that one.

  12. #27
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    Well, I found out how to compile zarniwoop w/ gcc.
    I ran it from cmd line like:
    zarniwoop 2 Master.txt Slave.txt

    It threw a pile of errors.
    Did I do that wrong?

  13. #28
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    @Salem, I was trying for the most basic code, so if MAtkins wanted to do this in C#, with his own code, he could, and learn something, too.

    I learned binary search in BASIC, and just never got into the habit of using bsearch(). Half the time, I forget it's there.

    For faster input he could use one of the newer SSD drives -- they're screamers! After that (or a large buffer), you are quite right - the IO piper must be paid his due.

    That program is very rough, MAtkins. Be careful that you don't run outside the limits of the array, and that (low + high)/2 doesn't overflow the range of unsigned int on your system.

    What I've done in the past is set up a good-sized array (always a power of two), and then used a BIG loop to empty and refill the array, over and over, as needed, until all the data is handled.

    Of course, you can't always just make a HUGE array and expect to handle every GIANT file of data, in one fill up.

    Glad you like it. Salem's version would be great for you to study. As you can see by his increased data size, it scales quite well.
    Last edited by Adak; 02-11-2011 at 07:09 AM.

  14. #29
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Are you suggesting that in the vast bloat of C#, there is nothing to do a simple sort and search on some user defined data?

    The key to this problem is picking the right algorithms and data structures.
    Not attempting some simplistic "brute force" approach in a faster language.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  15. #30
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    I'm sold on doing it in C. I can pretty easily make a dll that I can access from C#.
    I'm studying the code in both files.
    It's pretty straightforward so far, I think, but really brilliant stuff.

    So, tell me about those SSD drives . . .
    I'm running Win 7 64 bit on an Alienware Aurora - overclocked at I think 3.67 Ghz Quad core - 6 Gig RAM.
    It's got a radiator in it.

    I didn't mention that when I ran your program I was simultaneously running 50 threads scraping the Internet. It ran without a blink.
