Thread: Improve Performance of String comparison

  1. #1
    Registered User
    Join Date
    Mar 2008
    Posts
    7

    Improve Performance of String comparison

    Hi all,

    I am doing compression using Huffman .

    I have a string of length around 1500000. The string consists of 1s and 0s only.

    I have a linked list consisting of Huffman code table of a symbol and a short binary string representing the symbol.

    This is what I do

    Loop through longer length string
    Loop through linked list
    compared if substring of longer length string == string from linked list
    decode
    End
    End

    However, this operation takes around 20min to complete which is way too long to decompress a data.

    Is there a faster way of decompressing algorithm using the same huffman coding?

    Thanks a lot

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    If your STRING is really a binary number, perhaps it would be quicker to translate it to an integer first, then compare integers? That way you can get up to 32 "bytes" compared in one integer.

    There are faster string compares, but they are a bit complex too, and by no means "a magic bullet". On a 32-bit machine, you can achieve about 4x improvement by comparing 4 bytes (one int) at a time. On a 64-bit machine, you get 8x improvment [1]. But that is the string comparison itself, and none of the other code that you run. Whilst I'm sure the string compare is a large proportion of your 20 minutes, I'd expect it to still be quite slow even at 8x faster string compare.

    Also, are you sure that a linked list is the best method here? Perhaps you should consider using some sort of sorted container, so that you can eliminate comparisons that are not necessary - for example a binary search?

    Are you taking the length of the string each time? strlen() is another function that is quite slow.

    Of course, all of my suggestions are based on your brief description of the code - it would help to see the string compare loop in C++ - there may be other things that would help.

    [1] The improvement assumes that the string is "mostly matching" - so comparing strings that mismatch on the first few bytes will be almost identical speed.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Huffman codes are the tree - so use tree to store the codes... You search should be a lot faster
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by vart View Post
    Huffman codes are the tree - so use tree to store the codes... You search should be a lot faster
    I thought that was for the uncompressing side - if you are compressing, you need to compare the "current input" with the list of corresponding compressed values.

    But I'm sure that the "uncompressed->compressed" list (using list in a vague container sense here) can be sorted or otherwise "optimized" to make the number of possible comparisons a lot fewer.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    No, I'm pretty sure the whole point of the tree is to decompress -- you read a 0 and go left, or read a 1 and go right; if you land on a character print it out and start over, otherwise read a 0 and go left or read a 1 and go right.

    It would be hard to use the tree to compress -- easier to use an associative array or similar (or if we assume our symbols are contiguous in the encoding, a plain old array).

  6. #6
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Yeah, just build and use the tree to decompress.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. C++ ini file reader problems
    By guitarist809 in forum C++ Programming
    Replies: 7
    Last Post: 09-04-2008, 06:02 AM
  2. Replies: 8
    Last Post: 04-25-2008, 02:45 PM
  3. Custom String class gives problem with another prog.
    By I BLcK I in forum C++ Programming
    Replies: 1
    Last Post: 12-18-2006, 03:40 AM
  4. creating class, and linking files
    By JCK in forum C++ Programming
    Replies: 12
    Last Post: 12-08-2002, 02:45 PM
  5. Warnings, warnings, warnings?
    By spentdome in forum C Programming
    Replies: 25
    Last Post: 05-27-2002, 06:49 PM