So I want your guys opinions on this search algorithm that I just developed. This search algorithm is meant to search a file system for a given file.
Here is how it works:
We begin by indexing all the files on the given file system.
Step 1: Find all files on the file system and put them into a master file list.
Step 2: This step is important and a lot of the work gets done here
I create a map with 256 keys in it, one for each value a byte can hold (I will refrain from saying ASCII characters, because we also dealing with unicode...although in most case we are dealing with single characters per byte). Each key contains a value which is another map of 256 keys. (So basically we have a 256x256 matrix...but it's really a map...if that makes sense)
Each key on the 2nd dimension of maps holds an empty list which will be used in the future.
In this step we kind of "hash" our file system list, in a sense. It is based off of letter pairs. So let's take a random file, for example: "filename.txt"
The first character pair in that file name is "fi". I go to the "f" key of the main map, and then I branch off to the "i" key of its map....just like you would a matrix. From there I append a value onto the list of that element with the data: "fi" (the character pair), "1" (the ending index of the character pair in the file name string), and "some number" (a file ID number).
I do this for every character pair, and for every file in the file system. If that doesn't make sense, here is a quick illustration...
Attachment 7520
Once this data structure is completed, we are ready to search. (This process obviously takes a considerable amount of time...the indexing process is something that would take place overnight, for example, while users are not using the file system)
Step 3: Searching for a file
Now let's say you search for the string "file"
The search algorithm starts from the second to last character in the search string and indexes into the hash table we created earlier.
So in this case, it would go to the position map[ "l" ] [ "e" ], which would then return a list of all files that have the character pair "le" in it.
It would then go to the character pair "il", and find all files with that character pair in the filename.
The search algorithm then does an intersection on these 2 lists to see where matches occur (in file IDs, and if the ending indexes of the character pairs are adjacent to each other). If so, then those matches get carried on into the next iteration.
If not, anything that does not match gets thrown into a "less relevant" list. This process continues until the entire search string has been matched.
We will eventually end up with something like this:
Searched for string: file
Found:
file_a.txt
filemyfile.txt
blahfile.txt
bobfil.txt
joeile.txt
someotherfi.txt
blahf.txt
Notice how they are sorted in terms of relevance. Anything that got dropped throughout the process went into the "less relevant" list, but didn't get thrown out altogether. It comes back in the end as a less relevant match. All the matches that endured through the whole process kept bubbling up through the iterations, and they are the first matches in the list.
Now....there are some particulars that I have no described here, such as "How does the algorithm account for files with double occurrences of character pairs like fifi.txt" I assure you that it does account for stuff like that....but I don't got time to go into that.
Anyways, hopefully you guys were able to understand my explanation of the algorithm. I want your input....what do you think of it? What flaws and advantages do you see in it? And most especially: what do you think the time complexity of the search is (not the file indexing)?
[edit]
Something that I forgot to mention is that I did already implement the algorithm...and it works. So far I have only tested it on a file system of 1000 files, and tomorrow I am going to test it on our server which contains about 240000 files...we will see how it performs. I just want your input and thoughts on it. :-)
[/edit]