Thread: Searching String in a file

  1. #1
    Registered User
    Join Date
    Jul 2011
    Posts
    24

    Searching String in a file

    I want to search a string in a file:

    suppose my file has data as follows :
    img01_234
    img01_123
    img01_233
    img01_2342
    img01_233435
    img01_2323
    img01_2334
    img01_235
    .
    .
    .
    and it continues... and its a very big file..wats the most efficient way to search string in a file..? is there any function to do this ??

    I need something dat works really fast ...can anyone suggest something ?

    I am using C, Visual C++ compiler, and Windows 7 OS.

  2. #2
    Registered User claudiu's Avatar
    Join Date
    Feb 2010
    Location
    London, United Kingdom
    Posts
    2,094
    1) Have you even given "dat" a try?

    2) Have you searched this forum and the internet for an answer? This is one of the most common questions ever asked.

    3) If you are storing all your data on the file system you really can't expect too much efficiency. You just read through the file until you find your string. If you don't like it, use a database.
    1. Get rid of gets(). Never ever ever use it again. Replace it with fgets() and use that instead.
    2. Get rid of void main and replace it with int main(void) and return 0 at the end of the function.
    3. Get rid of conio.h and other antiquated DOS crap headers.
    4. Don't cast the return value of malloc, even if you always always always make sure that stdlib.h is included.

  3. #3
    Registered User
    Join Date
    Aug 2010
    Location
    Rochester, NY
    Posts
    196
    Do you control how things are added to the file, or is the list given to you?

    If you generate the list - sort it alphabetically, then perform a binary search on it. It's pretty much the best shot you've got in a situation like that. If it can't be given to you sorted, then simply buffer the whole thing, apply a sorting algorithm to it, such as a quick sort, then do your binary search through those sorted values...though the question then becomes, why do you need it? Because if you need to know where it was originally, not just if it's there or not, then you'll need to map it back to its original location..which then means you're probably looking at the more tedious side of things.

    Either way, just for the sake of file IO time I'd buffer the whole thing - so throw the whole file contents into memory, then read from there. It's definitely much faster than continuously going to disk then checking.

  4. #4
    Registered User
    Join Date
    Jan 2010
    Posts
    412
    I once did something similar to this by using memory mapping and the Boyer–Moore string search algorithm
    It's probably not the fastest way of doing it, but it works.

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Syndacate is very close to a good answer....

    Me, I'd just load the whole file into one continuous buffer (with a single fread() call) and then set strstr() loose on it. Even though it's a rather poor linear search, it will be multiples faster in memory than trying to search the disk file on a line by line basis.

  6. #6
    Registered User
    Join Date
    Jun 2011
    Posts
    35
    I think it would help if we know what you are searching for?

  7. #7
    Registered User
    Join Date
    May 2011
    Location
    Around 8.3 light-minutes from the Sun
    Posts
    1,949
    Quote Originally Posted by Linux Trojan View Post
    I think it would help if we know what you are searching for?
    No. It wouldn't help. We are talking about sorting and searching techniques here; what we are searching for is irrelevant.
    Quote Originally Posted by anduril462 View Post
    Now, please, for the love of all things good and holy, think about what you're doing! Don't just run around willy-nilly, coding like a drunk two-year-old....
    Quote Originally Posted by quzah View Post
    ..... Just don't be surprised when I say you aren't using standard C anymore, and as such,are off in your own little universe that I will completely disregard.
    Warning: Some or all of my posted code may be non-standard and as such should not be used and in no case looked at.

  8. #8
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Syndacate
    apply a sorting algorithm to it, such as a quick sort, then do your binary search through those sorted values
    That would only be useful if many searches will be performed, otherwise a linear search would be better.

    Quote Originally Posted by AndrewHunter
    We are talking about sorting and searching techniques here; what we are searching for is irrelevant.
    That is not quite true, e.g., when the string string is short then say, Boyer-Moore might not be so good as when the search string is long.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  9. #9
    Registered User
    Join Date
    May 2011
    Location
    Around 8.3 light-minutes from the Sun
    Posts
    1,949
    Quote Originally Posted by laserlight View Post
    That is not quite true, e.g., when the string string is short then say, Boyer-Moore might not be so good as when the search string is long.
    I stand corrected. Thank you for pointing that out Laser; I didn't consider that line of thought when I made my reply.
    Quote Originally Posted by anduril462 View Post
    Now, please, for the love of all things good and holy, think about what you're doing! Don't just run around willy-nilly, coding like a drunk two-year-old....
    Quote Originally Posted by quzah View Post
    ..... Just don't be surprised when I say you aren't using standard C anymore, and as such,are off in your own little universe that I will completely disregard.
    Warning: Some or all of my posted code may be non-standard and as such should not be used and in no case looked at.

  10. #10
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Linux Trojan View Post
    I think it would help if we know what you are searching for?
    Re-read message 1... he gave us enough to offer general advice.

  11. #11
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    You sure like to over-complicate things (which I admit is kinda fun sometimes).
    Code:
    while not at end of file
        while fgetc is not first letter
            ;
        fscanf + strcmp
    You aren't going to beat a linear search for this on unsorted data.

    edit - Actually in this case, since it looks like he has one per line, fgets + (sscanf + strcmp) OR (strstr) is all he needs. It's possible my first example might not like it if you searched for "abc" in "ababc", of course strstr instead of strcmp would fix that.


    Quzah.
    Last edited by quzah; 07-31-2011 at 03:24 PM. Reason: not awake enough
    Hope is the first step on the road to disappointment.

  12. #12
    Registered User
    Join Date
    Aug 2010
    Location
    Rochester, NY
    Posts
    196
    Quote Originally Posted by laserlight View Post
    That would only be useful if many searches will be performed, otherwise a linear search would be better.
    Yeah, that's a good point.

    I guess from the OP, it would be good to know the timing in the searches, especially comparatively to the list generation. If he only needs to look for one then it would be a waste to sort, I overlooked that.

    Though I'd say if he needs to do a few searches, caching that file during the program's initialization and sorting it would probably prove beneficial to the performance over the run of the program.

    Kind of hard to say much more without the OP's input :-\.

    Also, another thing that's important: WHO is generating this list? Because if whoever is generating it isn't time restricted and can do a search for you before they generate it, then that's half the battle right there. Also, if you *know* the person generating this list - it would probably be beneficial to write a connection layer so you can snag the list directly from them, so it doesn't have to write out to the disk before it reads. More info from the OP regarding the timing, relationship with the list generator (ie. if it's a proprietary program somewhere else, then he's kind of SOL, but if it's a program he wrote, or it's OSS, or it accepts modules, possibly not so much).

    @ OP:
    More info would REALLY be beneficial in order for people to help you further. Important aspects, I'd highlight are:
    - Relationship with the list generator (program you wrote?)
    - Time restrictions on both your program and the list generator (ie. if the list generator runs at midnight when nobody is around, it can do the sorting prior to dumping to a file, and save yours lots of work)
    - How many searches are going to be required once your program is started (ie. is this just one search, or is it the case that the user might need to search 30x of course of a few minutes?)

    I think that's about it - I feel I'm missing something, but it's not hitting me atm.

  13. #13
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Syndacate View Post
    More info would REALLY be beneficial in order for people to help you further. Important aspects, I'd highlight are:
    - Relationship with the list generator (program you wrote?)
    - Time restrictions on both your program and the list generator (ie. if the list generator runs at midnight when nobody is around, it can do the sorting prior to dumping to a file, and save yours lots of work)
    - How many searches are going to be required once your program is started (ie. is this just one search, or is it the case that the user might need to search 30x of course of a few minutes?)

    I think that's about it - I feel I'm missing something, but it's not hitting me atm.
    I don't think this is necessary --although it would be nice to have. I'd say just accept the question as given and get him started working on his code. Then if he runs into trouble, he can post up and ask specific questions.

    One of the problems in a situation like this is "information overload" where basically we tell the poor guy too much and get him all confused.

    Unless we are told other wise, I'd stay with my original recommendation... snag the whole file into memory and unleash strstr() on it.

  14. #14
    Registered User
    Join Date
    Aug 2010
    Location
    Rochester, NY
    Posts
    196
    Quote Originally Posted by CommonTater View Post
    I don't think this is necessary --although it would be nice to have. I'd say just accept the question as given and get him started working on his code. Then if he runs into trouble, he can post up and ask specific questions.

    One of the problems in a situation like this is "information overload" where basically we tell the poor guy too much and get him all confused.

    Unless we are told other wise, I'd stay with my original recommendation... snag the whole file into memory and unleash strstr() on it.
    Doesn't that method leave you open to false positives?

    I mean if you read in to EOF, and the strings are:
    Code:
    IMG_12425
    IMG_1423
    IMG_1242
    Which enter the buffer passed to fread like this:
    Code:
    IMG_12425\nIMG_1423\nIMG_1242
    Then when you try to search for the last img (1242), by doing:
    Code:
    char *location = strstr(buffer, "IMG_1242")
    And
    Code:
    if (location == buffer) // true
    I mean I suppose you can always yank the string out up until the new line and then strcmp it with strlen to verify it that it's not larger...but I've seen substring methods yield false positives when one string happens to be a substring of another string (the requested string) within a given buffer.

    Personally if I was going to cache it like that I'd use fread to capture it all, then use the string tokenizer to split it at new lines and toss them in a double pointer. While it's more computational at the beginning, I feel it will pay off.

    Though that was kind of a segway. While it's not *necessary* anymore than anything else is, I believe it would help greatly to tell him what his next steps should be. Though I definitely hear where you're coming from in terms of information overload.

  15. #15
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by Syndacate View Post
    Doesn't that method leave you open to false positives?
    Only if he doesn't provide us with enough information about what his file looks like. If it is like what he says it is:
    img01_234
    img01_123
    img01_233
    img01_2342
    img01_233435
    img01_2323
    img01_2334
    img01_235
    Then this...
    Quote Originally Posted by Syndacate View Post
    I mean if you read in to EOF, and the strings are:
    Code:
    IMG_12425
    IMG_1423
    IMG_1242
    ...
    char *location = strstr(buffer, "IMG_1242")
    Is only an issue if you do what you did, instead of this:
    Code:
    foo = strstr( buf, "img_123\n" );
    If you know for exact matches that after the thing to find is a newline, or some kind of whitespace, or something else (anything really) then that's easy. strstr is fine. Otherwise:
    Code:
    if( (location = strstr( buf, findme )) && strlen( findme ) == strlen( location ) )
        ...match...
    else
        ...nada...
    But those are the types of things he needs to think about when we're giving him all the answers--I mean, when he's coming up with what he needs to do.


    Quzah.
    Hope is the first step on the road to disappointment.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Searching a file for a String Occurence
    By Phyxashun in forum C++ Programming
    Replies: 37
    Last Post: 01-10-2009, 10:57 AM
  2. File string searching
    By Dan17 in forum C++ Programming
    Replies: 2
    Last Post: 05-12-2006, 12:49 PM
  3. Searching File for String
    By bcianfrocca in forum C++ Programming
    Replies: 19
    Last Post: 09-08-2005, 03:28 PM
  4. Searching for a string in a file
    By alec in forum C++ Programming
    Replies: 2
    Last Post: 01-17-2003, 02:31 PM
  5. Searching for a String within a file
    By TidusBlue in forum C Programming
    Replies: 5
    Last Post: 02-11-2002, 07:29 PM