Here is our first article for discussion. It is on Searching Techniques.
Printable View
Here is our first article for discussion. It is on Searching Techniques.
Very good article, good read, however, in the first few lines it talks about hashing, and I personally haven't ever worked with hashing before...is it relatively simple to implement as far as a searching algorithm goes? I would be really interested in learning how to use this sort of thing and also, what data type works best with hashing...if any?
Also, the article mentions binary trees for searching faster, but would a binary tree be faster than just implementing a binary search, or are the two things the same thing in a sense?
well written indeed. my only thought is that the first bit of code may be overly complex for what it is demonstrating. what I mean is that some people may get lost in the syntax, when what is more important is the algorithm behind it.
for example, something like this may get the point across better:
Code:#include<iostream>
int main()
{
int arr[]={1,32,4,6,4,253,6,1,24,10,35};
int size=sizeof(arr)/sizeof(int);
int index=-1;
for(int i=0;i<size;i++)
{
if(arr[i]==10)
{
index=i;
break;
}
}
if(index>=0)
std::cout<<"10 was found at index "<<index<<std::flush;
else
std::cout<<"10 was not found"<<std::flush;
std::cin.get();
return 0;
}
The algorithm for searching structures built with hashing is very simple. You run your hash function on the object being searched for to get an index for that object. You then check the index for the object; if the object is at that index, your done! if the object found at the index is null, the structure does not contain the object. Lastley, if the there is an object at the index, but its not the object your searching for, you have a collision; you then iterate by a means determined by your structure (often sequentially) untill you either find the object (its found at index + iteration) or you find a null entry (the object doesnt exist in the structure)Quote:
Originally Posted by jverkoey
Any data type that you can come up with a hashing function for will work just fine in a hash structure. The key part of any hashing structure is to have a good hash function. A poor hash function will result in too many collisions reducing performance.Quote:
Originally Posted by jverkoey
I can't see why one would be faster than the other on a theoretical level. Perhaps someone else can comment on the difference in performance.Quote:
Originally Posted by jverkoey
I'd have to disagree here. I think anyone persuing this subject should have a strong enough grasp on programming to understand (or figure out) the examples through experimentation both on the computer and with examples on paper as mentioned in the article.Quote:
Originally Posted by major_small
My biggest complaint with the example you posted would be that the it doesnt offer an "as is" usable function because the entire algorithm is coded in main. Algorithms like this should be broken into [re]usable components. The code in the article is usable "as is".
I do agree with your underlying point though, the article could supply a driver program to reduce confusion and give the reader something concrete to exeriment with. (though i did notice that writing a driver program was the intent of Exercise 2)
A binary tree is probably faster than a binary search. With a tree, all you're doing is moving down to the next pointer. With a binary search, you're preforming at least some math each time. I suspect it will always be faster to simply access one more pointer than it will be to find the next spot to jump to using said math.
Even if we shift there, you've still got addition plus a shift. Where as with a tree, you just move to the next pointer. So assuming the comparison test is identical, and you've both got one assignment, the tree should still be faster.Code:middle = ( beginning + end ) >> 1;
If I'm incorrect, feel free to let me know, because I haven't played around calculating notations and the like.
Quzah.
Good article.
Perhaps it could be mentioned that interpolation search will only be O(log log n) if the data is uniformly distributed in the range between low and high, because the interploation used is linear, unlike binary search where O(log n) is the worst case.
Yes, but sequential search isn't a very advanced topic; it's a topic which many newbies are struggling with. I don't think an example like Major Small's suggestion would be out of place.Quote:
Originally Posted by Perspective
If I'm not mistaken, interpolation search can actually be slower than binary search even for extremely huge lists, if the entries in the list are chosen carefully.
Consider the list L with N elements (sorted).
The element l[n] at position n is defined as:
l(n) = N^n
So, the list looks like this:
{N^0, N^1, N^2, ... , N^N}
When searching the list with binary search, the time complexity will be on average O(log2 n), because each iteration halves the number of entries where the wanted element might be.
Now, when using interpolation search, the first iteration will yield (using Prelude's variable names):
key = N^n
low_diff = N^n - 1
range_diff = N^N - 1
count_diff = N - 0 = N
range = N*(N^n - 1) / (N^N - 1) + 0
range < N*(N^n - 1) / N^N = 1/ N^(N-n-1) - 1/N^(N-1) < 1 for all n<N.
So, if the element searched for isn't the last (n<N), the first calculation of range will equal 0 (when truncated) and the new value of low will be 1 instead of 0.
It continues like this, the low boundary will increase by one each iteration, so the complexity of the interpolation search will be O(n).
Note that this is by no means any argument against using interpolation search. This is more of theoretical interest, if it's of any interest at all.
EDIT:
Hmm, actually the list hasn't N elements; it has N+1. The argument still holds though. Remove the first element from the list (N^0).
An idea about a possible application for the article (where hashing is impractical):
You may have a hash table that 'maps' keys to values, but what if you want the inverse operation (find the key of a value)? I suppose you could generate two different hash tables, one for each direction, but if it's more difficult to hash the value than the key (i.e. integer key, structure value), it may be more expedient to simply implement a search for the inverse operation rather than going to the trouble of coming up with an effective hash algorithm.
I don't know what it is, but the very idea of using hashing is very appealing to me. You take a value, any value of any type, and run it through a process that performs arithmetic on the bits and generates a numeric value which is used as an index. It's like magic. Of course, as Perspective mentioned, the main thing to be concerned with is that you use an algorithm that does not generate the same hash value for two unique values (or, at least one that is very unlikely to cause said collisions except extremely infrequently). As far as the difficulty in implementing a search algorithm goes, it's extremely simple. You just use the index returned from your hash function and if a value is there, you've found it. In the very simplest form, a search would have a complexity of O(1) [you could do something like `if (series[hashFunc("value")] != NULL) ...`].Quote:
I personally haven't ever worked with hashing before...is it relatively simple to implement as far as a searching algorithm goes?
Strictly speaking, a binary search tree does not have a better complexity than a binary search. In fact, it has a worse complexity. In the worst possible case, a binary search tree can require the traversal of all elements in a search (O(n)), but a binary search has a much better complexity (O(log n)). This is because the shape of the tree cannot be guaranteed and if many sorted elements are inserted (yielding a tree with repeated right-hand children), it will become essentially a list instead of a tree. Now, an AVL tree (a binary search tree that is reformed as necessary after any insertion/deletion to retain a true tree shape) is much more comparable in complexity to a binary search. It also yields O(log n), so any difference in their application would be constant time, thus virtually identical (theoretically speaking).Quote:
Also, the article mentions binary trees for searching faster, but would a binary tree be faster than just implementing a binary search, or are the two things the same thing in a sense?
Oh, I get it now, I've done this sort of thing before without realizing it. I was writing a .lwo renderer and the tags in the file are all four letter tags, so I made a function that when passed a string of 4 letters would return a unique integer value. *Rummages through code*, here we go:Quote:
Originally Posted by LuckY
And the function just assumes you send the proper size of array. However, as I ran through the list of tags, I came up with quite a duplicate tag every now and then which required changing the algorithm around to work with all of them.....So is there some universal hashing algorithm that works best with certain things? Or is the method I used above pretty much the best way to go?Code:unsigned int ParseTag(char* tag)
{
srand((unsigned int)tag[0]);
return (tag[0] + rand()%128) + (tag[1] + rand()%64) + (tag[2] + rand()%32) + (tag[3] + rand()%16);
}
The argument about hashing - well, the article talks about searching algorithms on common (perhaps sorted) arrays and linear lists. Not about binary trees or hash tables. In that case, skiplists should be discussed too. So perhaps we should have an article about data structures.
Hash functions must be deterministic, jverkoey. They must not contain a random element.
Good article. And I don't think the code is hard to read at all. Rather the functions look very succint and to the point. Very easy to follow.
So what's with having Prelude write the first article....yeah like we are all just gonna jump right into writing articles now. :D Tough act to follow.
;)
Well, as an accident of the way rand( ) is normally written (i.e. not hooked up to a lava lamp or something similar), that piece of code is deterministic, though a more proper hash function would define its own utility functions instead of using a dubiously named function (rand).Quote:
Originally Posted by CornedBee
Anyway, a great article. I much prefer the code in usable functions as they are as opposed to in a small program as major_small suggested. The target audience could rather readily write small test programs that use these functions.
I really liked Prelude's use of problems in the article as well. They were all of an appropriate level, I thought, and all very valuable to the learning process. The code was clear as well.
I'll comment more (in a more helpful manner) later on.
Cheers
Right, Zach, my fault. I didn't look at the argument to srand() and assumed the usual time(NULL).