So I want you guys' opinions on this search algorithm that I just developed. This search algorithm is meant to search a file system for a given file.
Here is how it works:
We begin by indexing all the files on the given file system.
Step 1: Find all files on the file system and put them into a master file list.
Step 2: This step is important and a lot of the work gets done here
I create a map with 256 keys in it, one for each value a byte can hold (I will refrain from saying ASCII characters, because we are also dealing with Unicode... although in most cases we are dealing with a single character per byte). Each key maps to a value which is another map of 256 keys. (So basically we have a 256x256 matrix... but it's really a map of maps, if that makes sense.)
Each key in the second level of maps holds an initially empty list which will be filled in below.
In this step we kind of "hash" our file system list, in a sense. It is based on letter pairs. So let's take a random file, for example: "filename.txt"
The first character pair in that file name is "fi". I go to the "f" key of the main map, and then branch off to the "i" key of its map... just like you would index a matrix. From there I append an entry onto the list at that element with the data: "fi" (the character pair), 1 (the ending index of the character pair in the file name string), and some number (a file ID number).
I do this for every character pair, and for every file in the file system. If that doesn't make sense, here is a quick illustration...
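The indexing described in steps 1 and 2 might look roughly like this in Python. This is a minimal sketch under my own assumptions: `build_index` and the tuple layout are my guesses at the structure, not the original implementation.

```python
def build_index(filenames):
    # 256x256 lookup: index[a][b] holds the list for the byte pair (a, b)
    index = [[[] for _ in range(256)] for _ in range(256)]
    for file_id, name in enumerate(filenames):
        data = name.encode("utf-8")
        for i in range(len(data) - 1):
            a, b = data[i], data[i + 1]
            # store (the character pair, its ending index, the file ID)
            index[a][b].append((chr(a) + chr(b), i + 1, file_id))
    return index

files = ["filename.txt", "fifi.txt"]
index = build_index(files)
print(index[ord("f")][ord("i")])
# [('fi', 1, 0), ('fi', 1, 1), ('fi', 3, 1)]
```

Note that "fifi.txt" contributes two entries for the pair "fi", one per occurrence, which is how repeated pairs stay distinguishable.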
Once this data structure is completed, we are ready to search. (This process obviously takes a considerable amount of time...the indexing process is something that would take place overnight, for example, while users are not using the file system)
Step 3: Searching for a file
Now let's say you search for the string "file"
The search algorithm starts from the pair beginning at the second-to-last character of the search string and indexes into the hash table we created earlier.
So in this case, it would go to map[ "l" ][ "e" ], which returns a list of all files that have the character pair "le" in their names.
It would then go to the character pair "il", and find all files with that character pair in the filename.
The search algorithm then intersects these 2 lists to see where matches occur: a match survives when the file IDs agree and the ending indexes of the two character pairs are adjacent. Surviving matches get carried on into the next iteration.
If not, anything that does not match gets thrown into a "less relevant" list. This process continues until the entire search string has been matched.
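The search loop in step 3 might be sketched like this. Again, this is my own simplified take, not the original code: the 256x256 map is flattened into a dict keyed by the two bytes of each pair, and the "less relevant" bookkeeping only tracks file IDs.

```python
from collections import defaultdict

def build_index(filenames):
    # pair of bytes -> list of (file_id, ending index); a dict stands in
    # for the 256x256 map described in step 2
    index = defaultdict(list)
    for fid, name in enumerate(filenames):
        data = name.encode("utf-8")
        for i in range(len(data) - 1):
            index[data[i:i + 2]].append((fid, i + 1))
    return index

def search(index, filenames, query):
    q = query.encode("utf-8")
    # seed the candidate set with the last character pair
    candidates = set(index[q[-2:]])
    dropped = set()
    # walk backwards through the remaining pairs, requiring that each
    # earlier pair ends one position before the current one
    for i in range(len(q) - 3, -1, -1):
        entries = set(index[q[i:i + 2]])
        survivors = {(fid, end - 1) for fid, end in candidates
                     if (fid, end - 1) in entries}
        dropped |= {fid for fid, _ in candidates} - {fid for fid, _ in survivors}
        candidates = survivors
    matched = sorted({fid for fid, _ in candidates})
    less_relevant = sorted(fid for fid in dropped if fid not in matched)
    return ([filenames[f] for f in matched],
            [filenames[f] for f in less_relevant])

files = ["filename.txt", "pile.c", "notes.md"]
idx = build_index(files)
print(search(idx, files, "file"))
# (['filename.txt'], ['pile.c'])
```

Here "pile.c" survives the "le" and "il" rounds but fails the "fi" round, so it falls into the less relevant list rather than being discarded outright.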
We will eventually end up with something like this:
Searched for string: file
Notice how they are sorted in terms of relevance. Anything that got dropped throughout the process went into the "less relevant" list, but didn't get thrown out altogether. It comes back in the end as a less relevant match. All the matches that endured through the whole process kept bubbling up through the iterations, and they are the first matches in the list.
Now....there are some particulars that I have not described here, such as how the algorithm accounts for files with double occurrences of character pairs, like fifi.txt. I assure you that it does account for stuff like that....but I don't have time to go into it here.
Anyways, hopefully you guys were able to understand my explanation of the algorithm. I want your input....what do you think of it? What flaws and advantages do you see in it? And most especially: what do you think the time complexity of the search is (not the file indexing)?
Something that I forgot to mention is that I did already implement the algorithm...and it works. So far I have only tested it on a file system of 1000 files, and tomorrow I am going to test it on our server which contains about 240000 files...we will see how it performs. I just want your input and thoughts on it. :-)
This has been done before. It's a type of pseudo-hashing. I can't remember the name of the algorithm at the moment.
Congrats on coming up with it yourself. Based on a few seconds of thinking, I would say the complexity is linear in the length of the string being matched and logarithmic in the number of files on the filesystem, given certain assumptions on the distribution of characters in filenames.
If the string being matched has n pairs of characters, an intersection of n lists will have to be performed. Linear time sounds fine here.
But given that there are m files in the filesystem, the average number of files in a list must be linear in m: every file ends up in at least one list, so m/65536 is a lower bound on the average list length. I am not sure why this search could be logarithmic in the number of files.
When you calculate the intersection of two lists, you must at least look at every element in them, so linear time is required.
I don't know, maybe O(n*m).
Sounds like an incomplete trie, or perhaps an incomplete B-Tree. Simply put, it would work pretty much the same as those, but it doesn't scale, because your lookup structures are of fixed size. In other words, you get good performance for a few files (a hundred thousand perhaps) as long as they're evenly distributed. After that, the performance becomes indeed n*m - algorithmically speaking, no better than a linear search through an index.
Consider also that large file indexes don't fit into main memory and must be paged to disk. Linked lists don't page well. That's another scaling problem.
Also, your two separate indices seem somewhat strange. What you have is, basically, a 2d-array used as a lookup map, with the first character of the filename as the first index, the second as the second index. (By the way, what happens to files with names that are only one character long?) This is, essentially, the same as treating the first two characters as a 16-bit value and using it as an index into a 2^16 array.
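The 16-bit equivalence mentioned above can be shown in a couple of lines (illustrative Python; `pair_index` is my own name for it):

```python
def pair_index(a: int, b: int) -> int:
    # pack two bytes into one 16-bit value: the same bucket as map[a][b],
    # but in a flat array of 2**16 entries
    return (a << 8) | b

print(pair_index(ord("f"), ord("i")))  # 26217
```

Either layout gives the same constant-time bucket lookup; the flat array just makes the fixed 2^16 size of the structure explicit.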
From a completely linguistic point of view, your description of the structure is ... inaccurate, and thus hard to understand: keys don't contain values, they index them. A map is a container of key/value pairs that allows fast lookup of the key and the associated value.
Performing intersections isn't trivial: for each element you have to check whether it exists in all the other lists. A naive implementation will be O(k*n^2) where n is the number of elements in each list and k is the number of lists (minus one). You can do a lot better if the lists are sorted, but if not you'll pay about k*n*log(n) to sort all k+1 lists. The table look-ups are constant time, so the complexity will likely be dominated by your intersection algorithm.
I think you'd do better with indexing n-grams in a reverse lookup (huge, but easy to partition into disk pages), or by using a tree structure like a suffix tree. As CornedBee mentions, though, your biggest challenge will be efficiently accessing the index when it becomes too large to fit into memory.
Well I tested the search algorithm today on a large file system.
The file system has about 250,000 files on it.
A search takes approximately 1 minute 50 seconds to complete.
After much thinking and researching...the algorithm is incredibly slow and a linear search does much better.