Thread: How Difficult is This?

  1. #1
    Registered User
    Join Date
    Feb 2011
    Posts
    96

    How Difficult is This?

    Hi:

    I've been programming VB, C#, php, Javascript, etc. for about 15 years now.
    I'm working in C# with string searches and finding that it's slow as a sleeping hound dog.

    I did some research and found this:
    Fast string search

    It turns out that basic C code blows the doors off of C# (like 20 to 1) with string searches.

    So, how hard would it be for me to learn how to write stuff using strstr in C, given my experience? I don't know jack about pointers, memory leaks, garbage collection, etc.

    About how much time can I expect to study to learn how to do it, all the way through compile and use?

    Also, is there any way to access it from C# without C# turning it into a dog?

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Why not try to implement something like Boyer-Moore in C# then?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by MAtkins View Post
    About how much time can I expect to study to learn how to do it, all the way through compile and use?

    Also, is there any way to access it from C# without C# turning it into a dog?
    So, basically you're asking how long it will take you to learn C... With your experience, you should be able to pour through a good tutorial or book in a couple of days and be writing fairly simple but useful proggies within a week. With the right information, you should be able to write DLLs that can be accessed by C# within a couple of weeks.

    Since C produces native machine code, C# can't turn it into a dog... it's far more likely you C language helpers will simply show C# up for the slow-mo puppy it is...

  4. #4
    'Allo, 'Allo, Allo
    Join Date
    Apr 2008
    Posts
    639
    So you're gonna take the advice of a flawed 8 year-old benchmark? I hope you realise that the Java and C# datasets are twice as large as the C/C++ one. Sifting through less data is always going to be faster.

    His results don't exactly hold true now anyway. My results for the C/C++ tests (AMD 64-bit dual-core 4GB ram / 32-bit program compiled with the VS2005 SP1 compiler):
    Code:
    the alg ids
    cruntime=0,
    naive, = 1
    kmp = 2,
    boyermoore = 3
    C:\Users\Adrian\Downloads\fastsearch\fastsearch\fastsearch>fastsearch
    duration for alg(0) test suite : 78 msecs
    duration for alg(1) test suite : 125 msecs
    duration for alg(2) test suite : 172 msecs
    duration for alg(3) test suite : 62 msecs
    
    C:\Users\Adrian\Downloads\fastsearch\fastsearch\fastsearch>fastsearch
    duration for alg(0) test suite : 78 msecs
    duration for alg(1) test suite : 124 msecs
    duration for alg(2) test suite : 156 msecs
    duration for alg(3) test suite : 63 msecs
    
    C:\Users\Adrian\Downloads\fastsearch\fastsearch\fastsearch>fastsearch
    duration for alg(0) test suite : 63 msecs
    duration for alg(1) test suite : 124 msecs
    duration for alg(2) test suite : 172 msecs
    duration for alg(3) test suite : 62 msecs

  5. #5
    Registered User ssharish2005's Avatar
    Join Date
    Sep 2005
    Location
    Cambridge, UK
    Posts
    1,732
    >So, how hard would it be for me to learn how to write stuff using strstr in C, given my experience?
    Well, it all really depends on you. Is this something you will have to do? If so there is a deadline and the pressure which will make you learn. If you have the enthusiasm for it, you wouldn't take too long at all.

    Why do you have to do this in C, if suppose C# is much better in string handling than C.

    ssharish
    Life is like riding a bicycle. To keep your balance you must keep moving - Einstein

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I would take that string test with a grain of salt.

    Yes, C is faster than C#, at almost everything except coding it, but I have real trouble believing C# is that bad.

    Also, the "test" string was heavily populated by very short strings, like "the" (323 times). That heavily favors strstr() and hurts the more elegant Boyer-Moore and etc., algorithms.

    Why don't you post up a little sample of the data you have to search through, and the strings you need to find within that data, and let's test it.

    Testing YOUR data, with different algorithms, (at least strstr() and B-M), (and here, languages of C# and C), is the only way to know for sure, what the facts are. There's no substitute for actual head to head testing on your data, and on your system (if possible).

  7. #7
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    thanks for the replies:

    I've got a text file with 300K lines having 1 URL per line.
    I want to append it with 40K new URLs - BUT - I don't want to duplicate any of the urls that are already in the original file.
    I can do it 5 different ways but they're all dogs.
    I'm basically looping through the 40K and in that looping through the 300K doing an IndexOf or == for each.
    That would be 40,000 * 300,000 loops total - it takes all day!

    I tried just grabbing the FileText for the 300K and looping through the 40K using an FileText.IndexOf(line) to see if the URL is there.
    Really inefficient I think.

    I've got an HMTL page with 2500 lines of HTML code in it.
    I want to find out if "This Text" can be found in the HTML code.
    I should add that a list of 'This Text' possibilities are stored in a text file.
    So, I end up basically asking is "This Text A' in HTML if not then is 'This Text B' in HTMLif not then is 'This Text C' in HTML, etc.
    I'm using for loops and IndexOf to do it - painfully slow.

    Also, I want to extract a string that represents the first (or sometimes second) entire HTML form within the HTML.

    I want to search for string Needle in string Haystack and get the INDEX of the first character of the Needle.

    I know at least one program that can do stuff like this at least 20 times faster than I can in C#. I've tried as many techniques as I can find including using a MySQL database. All of them are dogs. They take for ever. What the other program can do in 5 seconds, my app takes minutes to do.

    I'd really like to learn better ways of doing this kind of work.
    I'm convinced that *much* better ways to do it exist.
    I don't care if it's in C or C++ or C# or PHP for that matter.

    I did get a c dll to be accessible from my C# app today.
    I'm fighting with strings . . .
    Last edited by MAtkins; 02-10-2011 at 01:07 AM.

  8. #8
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > I've got a text file with 300K lines having 1 URL per line.
    Is it sorted?
    If it were, then you can use a binary search to locate any string (or know it isn't there) in at most 19 comparisons.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  9. #9
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Databases are going to be waaaayyyy too slow. You need to use some internal structure of your own for these problems.

    URL File:
    Read the 300k URLs into a fast insert/access data structure, like a binary tree. Read the 40k URLs, each time doing a lookup of that URL in the tree, and if it misses, insert the URL and append it to the file. GLib has some great data structures (including a balanced binary tree) and can be used with Windows (I'm a Linux guy though, so there may be better tools out there that I don't know of). Or you can roll your own if you want some extra practice. Balanced trees get tricky, but you might not need them in your case.

    HTML:
    Look at a generic XML parser. The only one I know of is Expat. There may bit of a learning curve, but it's pretty speedy and powerful.

    Needle/haystack:
    strstr with a simple pointer subtraction is the easiest way.

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I've got a text file with 300K lines having 1 URL per line.
    I don't see any data posted up, AND I don't see a link on Swoopshare (or any other free file sharing site), either:
    swoopshare - Save files on the web

    So, you're probably lying! But why?

    I want to append it with 40K new URLs - BUT - I don't want to duplicate any of the urls that are already in the original file.

    I can do it 5 different ways but they're all dogs.
    Yes -- but why dogs - inquiring minds want to know!

    Seriously, post up a little sample, and give it a test on your best C# program for time. We'll do the same with some C code, on those same test samples. Then, you'll have some facts to make a decision about.

    The test doesn't have to be on a full size file, just a representative sample that takes maybe 15 to 60 seconds to run on C#, should be fine.

    It's not uncommon for us to beat the pants off such samples, either by using C 's inherent speed, or by using a better algorithm, or both.

    Otherwise my goons will have to escort you right up to Father O'Reilly for confession. I've got a kickback agreement with him. Figure I've cut off 10 years from purgatory this way.


  11. #11
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    Adak: I'm not interested in testing yet. I *know* that what I've got doesn't work. It's not like I've got a 'one case' scenario here. I"ve gotta do it day in and day out.
    Finding 40K of urls in a list or an HTML page w/ 2500 lines is nothing to accomplish. I'm looking for good programming techniques, once I think I've got something I'll be happy to test.

    anduril462: Great stuff. Thanks. How do I learn more about binary trees? Or are there any better ideas for that job?
    Most often, I've got to compare, not just the lines but the root url. For example:
    http://cboard.cprogramming.com/newre...te=1&p=1001614
    I need to not duplicate any URLs that include:
    cboard.cprogramming.com

    I don't think an xml parser will be that great. Generally, I'm looking for text and/or code anywhere within the html. Sometimes I strip the html tags first and then search just the text. The text I'm looking for may be part of the html code itself or text or a combination mixed together. A classic would be 'Powered By <a href='http://www.wordpress.com'>WordPress</a>. I may be looking for the text, the url or both. Currently, my list of search params is about 175 different patterns.

    I've been working with c & strings all day. Pointers & strstr keep returning the actual text (by printing what is at the memory location) or the ASCII value, but not the actual location. I've looked high and low for it and can't find the code I need.
    I'm looking for,
    Idx = (ptr_Mem_Location_of_Needle - ptr_Mem_Location_of_Haystack)
    correct? How do I get the actual memory location from the result of strstr?

  12. #12
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    This works:
    I'm getting a warning about 'assignment of incompatible pointer type' though.
    Can't figure out what a pointer to the location of a char should be I guess.

    Code:
    int main()
    {
    	char str1[]="Testing a string of text";
    	char str2[]="esting";
    	char *ptr1;
    	char *ptr2;	
    	int idx;
    	
    	ptr1 = &str1;           
        ptr2 = strstr(str1,str2);
    	idx = ptr2 - ptr1;
    	printf( "\nptr1:%d ptr2:%d Idx=%d\n", ptr1,ptr2,idx);
    
      return 0;
    }
    This doesn't work. I can't figure out how to get the strings into the indexOf function:
    Code:
    #include <stdio.h>
    #include <string.h>
    int indexOf(char *str1, char *str2)
    {
    	char *ptr1;
    	char *ptr2;
    	int idx;
        ptr1 = &str1;           
        ptr2 = strstr(str1,str2);
    	idx = ptr2 - ptr1;
    	printf( "\nptr1:%d ptr2:%d Idx=%d\n", ptr1,ptr2,idx);
    	return idx;
    }
    int main()
    {
    	char str1[]="Testing a string of text";
    	char str2[]="esting";	
    	int idx;
    	
    	idx = indexOf(str1,str2);
    	printf( "\nIdx=%d\n", idx);
    
      return 0;
    }

  13. #13
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    This works now (I removed the &) but I see 2 potential problems:
    1. How do I use something like this with UTF-8 encoding?
    2. Can I throw a 2500 line (probably more like 500) HTML file at it?
    Code:
    #include <stdio.h>
    #include <string.h>
    int indexOf(char *str1, char *str2)
    {
    	char *ptr1;
    	char *ptr2;
    	int idx;
        ptr1 = str1;           
        ptr2 = strstr(str1,str2);
    	idx = ptr2 - ptr1;
    	if(idx < 0) idx = -1;
    	printf( "\nptr1:%d ptr2:%d Idx=%d\n", ptr1,ptr2,idx);	
    	return idx;
    }
    int main()
    {
    	char str1[]="Testing a string of text";
    	char str2[]="Festing";	
    	int idx;
    	
    	idx = indexOf(str1,str2);
    	printf( "\nIdx=%d\n", idx);
    
      return 0;
    }

  14. #14
    Registered User
    Join Date
    Feb 2011
    Posts
    96
    well, I asked the girl who wrote the 'fast' app how she did it . . .
    Assembly Language through Delphi.
    That will only handly ASCII though won't it?
    I thought I understood that strstr was about as close to Assembly as you could get.
    Yes? No? Ideas?

  15. #15
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Most C compilers these days support inline assembly... so that shouldn't stop you.

    Most of the standard C99 functions are likely to be optimized using either inline or pure assembly. We don't actually know what's "under the hood" but given that it's nearly impossible to match the library's performance with C code, I'm guessing it's a safe assumption.

    And yes... if her code is written for 8bit char it's not going to support Unicode... But that's also true of the C library functions. If you note there are two versions of strstr... strstr() and wcsstr() ... one is in string.h the other in wchar.h... one is for ascii and the other for unicode.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How difficult is parallel programming in C?
    By darsunt in forum C Programming
    Replies: 20
    Last Post: 07-16-2009, 01:42 AM
  2. How difficult would this be - download and read file
    By spadez in forum C Programming
    Replies: 4
    Last Post: 04-12-2009, 02:05 PM
  3. 3D games re difficult to play?
    By manav in forum A Brief History of Cprogramming.com
    Replies: 4
    Last Post: 05-28-2008, 06:50 PM
  4. Difficult time understanding generating terrain (from file)
    By indigo0086 in forum Game Programming
    Replies: 3
    Last Post: 06-07-2007, 11:36 PM
  5. 3D animation (difficult?)
    By maes in forum Game Programming
    Replies: 3
    Last Post: 09-08-2003, 10:44 PM