Thread: One Huge Lib vs. Many Small Libs

  1. #1
    Registered User
    Join Date
    Jun 2011
    Posts
    4

    One Huge Lib vs. Many Small Libs

    Greetings!

    I'm planning to write a dictionary program and would like to know which library configuration would work fastest.

    Which is more efficient and faster: scanning one huge library file for a matching word, or scanning many smaller library files?

    My reasoning is that one huge library file would use more memory to scan, but would be faster. With many smaller libraries I would have to build more complex logic into my program to decide which library to access, to limit unnecessary access. Besides, would looking through multiple smaller libraries be slower anyway, since locating and loading multiple files from the HDD is slower?

    As an example, consider that the library altogether would be about 10 MB.

  2. #2
    and the hat of int overflow Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    10MB would fit comfortably in memory anyway.

    Unless you're stuck using some 20+ year old fossil compiler like TurboC, where the most memory you can have is 640KB.
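    For what it's worth, a minimal sketch of slurping a whole file into memory (the file name is just a placeholder, error handling kept to a minimum):
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("dictionary.dat", "rb");   /* hypothetical file name */
        if (fp == NULL)
            return 1;

        fseek(fp, 0, SEEK_END);          /* find out how big the file is */
        long size = ftell(fp);
        rewind(fp);

        char *buf = malloc(size);        /* 10MB is nothing on a modern machine */
        if (buf != NULL)
            fread(buf, 1, (size_t)size, fp);   /* slurp the whole thing in one go */

        fclose(fp);
        /* ... search buf in memory here ... */
        free(buf);
        return 0;
    }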
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Jun 2011
    Posts
    4
    Well, consider then that the library is 1 terabyte. Which is now better? One huge lib or many smaller libs?

    EDIT: Besides, I think the time spent working out which smaller lib is needed might even exceed the time required to scan one huge file. Is this true?
    Last edited by Rahvasaadik; 06-03-2011 at 11:02 AM.

  4. #4
    and the hat of int overflow Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Is there anything else that you're not telling us, or that you're going to change your mind on?

    For example, that perhaps the files on disk are sorted/indexed in some way, so you don't have to do a time-consuming linear search every time you need to do something.
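    For instance, with fixed-size records you don't scan at all, you just seek straight to the entry you want. A rough sketch (the record size and the helper are assumptions of mine):
    Code:
    #include <stdio.h>

    #define RECORD_SIZE 21   /* 20 characters plus a terminator - an assumption */

    /* fetch record number 'index' from a file of fixed-size records;
       out must have room for RECORD_SIZE bytes */
    int fetch_entry(const char *filename, long index, char *out)
    {
        FILE *fp = fopen(filename, "rb");
        if (fp == NULL)
            return -1;

        fseek(fp, index * RECORD_SIZE, SEEK_SET);   /* jump straight to the record */
        size_t got = fread(out, 1, RECORD_SIZE, fp);
        fclose(fp);

        return got == RECORD_SIZE ? 0 : -1;
    }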
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Wow, first the library was 10MB, now it's 1TB. That's a huge difference (100,000 times), and if you don't actually know which one it's going to be, this is an absurd question. Unless you have one bada$$ super computer, you probably don't have 1TB of memory to keep the whole thing in, so you'll likely be paging large chunks of it, or you'll have to access directly on disk. With that much data, the overhead of opening 1, or even 100 separate files is trivial. The performance of your program will depend a great deal on the data structures and algorithms you use to store the stuff in memory and on disk.

  6. #6
    Registered User
    Join Date
    Jun 2011
    Posts
    4
    Sorry
    Well, how much would permutations of all ASCII characters up to 20 letters long weigh?
    Is paging large chunks slower than accessing directly on disk?

    The algorithm is simple: get an 8-bit binary code from a source file, correlate the code with a string in the library, and write the corresponding string into another file.
    So the algorithm would search the source file either in pairs of bytes, where the first byte indicates the library file (0-255) and the other the entry number in that library (0-255),
    or by 3 bytes, where the first two indicate the library (0-65535) and the third the entry number (0-255),
    or by 3 bytes, where the first indicates the library (0-255) and the other two the entry number (0-65535).

    As you can see, the first method creates symmetrical libraries, where there are 256 libraries with 256 entries in each,
    the second method creates 65536 libraries with 256 entries in each, and the last method creates 256 library files with 65536 entries in each.
    Each word in the library may be up to 20 characters long.

    As you can see, we can have fewer but larger libraries, or more but smaller ones.
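    Roughly what I have in mind for reading the 2-byte codes (just a sketch, the file name is made up and the loading of the libraries is left out):
    Code:
    #include <stdio.h>

    int main(void)
    {
        FILE *in = fopen("compressed.bin", "rb");   /* hypothetical input file */
        if (in == NULL)
            return 1;

        int lib, entry;
        while ((lib = fgetc(in)) != EOF && (entry = fgetc(in)) != EOF)
        {
            /* first byte picks the library file (0-255),
               second byte picks the entry inside it (0-255) */
            printf("library %d, entry %d\n", lib, entry);
        }
        fclose(in);
        return 0;
    }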

    I did not think that the algorithm or the method would have such an impact on performance.
    Last edited by Rahvasaadik; 06-03-2011 at 02:09 PM.

  7. #7
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738
    Quote Originally Posted by Rahvasaadik View Post
    Well, how much would permutations of all ASCII characters up to 20 letters long weigh?
    (256^20 combinations) * (20 letters per word) = total number of bytes ( Too much! )

    EDIT: Actually it's Sum(i = 1, i <= 20)(256^i * i) = 256 + 256^2 * 2 + 256^3 * 3 + ... + 256^20 * 20 ( Waaay too much!! )
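    Just to put a number on it, a quick throwaway program (long double so the sum doesn't overflow):
    Code:
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        long double total = 0.0L;
        for (int i = 1; i <= 20; i++)
            total += powl(256.0L, i) * i;   /* 256^i strings of length i, i bytes each */
        printf("%.3Le bytes\n", total);     /* prints roughly 2.9e+49 */
        return 0;
    }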

    Quote Originally Posted by Rahvasaadik View Post
    Is paging large chunks slower than accessing directly on disk?
    Obviously it is, because each time you read only a portion of the file and not the whole thing. ( But it's your only alternative )
    Last edited by GReaper; 06-03-2011 at 02:59 PM.
    Devoted my life to programming...

  8. #8
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by Rahvasaadik View Post
    Well, how much would permutations of all ASCII characters up to 20 letters long weigh?
    There are only 128 different ASCII values. For just a string of 20 characters, assuming each of the 20 spots can contain any one of the 128 values, you have 128^20 permutations, or approximately 1.39*10^42. That's over one tera-tera-tera-megabyte of data (read: a metric crapload). Now, realize that there is way, way, way more than that, since you have to add 128^19 + 128^18 + ... + 128^1. I'm going to assume you have a data set closer to the MB range.

    Is paging large chunks slower than accessing directly on disk?
    It really depends on how contiguous your file sectors are on the hard drive and how much you can store in memory at once. It also depends on how much of your file you're reading at once. Ideally, you want to leverage locality of reference. There is no way of really knowing how well either will perform without some serious tests, and even then it won't be deterministic. Both involve lots of disk access and will result in lots of thrashing, making both slow. This is all moot if your data set is really that big, since I don't know of a system with that much storage space. It's also moot if it's small, since you can do it all from memory.

    The algorithm is simple: get an 8-bit binary code from a source file, correlate the code with a string in the library, and write the corresponding string into another file.

    So the algorithm would search the source file either in pairs of bytes, where the first byte indicates the library file (0-255) and the other the entry number in that library (0-255),
    or by 3 bytes, where the first two indicate the library (0-65535) and the third the entry number (0-255),
    or by 3 bytes, where the first indicates the library (0-255) and the other two the entry number (0-65535).
    This sounds nothing like "all permutations of ASCII codes up to 20 characters in length". An 8-bit value can only hold 256 different values. The first method uses only 2 bytes for the key, meaning 65,536 keys, while the second and third methods use 3 bytes for the key, meaning a total of 16,777,216 values, or 16MB. The pigeonhole principle states you can't uniquely correlate several tera-tera-tera-megabytes of data to a mere 16,777,216 keys, let alone 65,536 keys. Seems you can only do strings of up to 2 or 3 characters.

    As you can see, the first method creates symmetrical libraries, where there are 256 libraries with 256 entries in each,
    the second method creates 65536 libraries with 256 entries in each, and the last method creates 256 library files with 65536 entries in each.
    Each word in the library may be up to 20 characters long.

    As you can see, we can have fewer but larger libraries, or more but smaller ones.
    Write it one way, run a wide variety of tests, and record your measurements. Write it a second way, test, record. Same for the third method. Pick the one that performs best. There's no real way to know which will work best for you since too much of it depends on file layout on disk, system call speed, hardware configurations, etc. Though, if there really is only 16MB, put it in a single file and load it all into RAM.
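    If it helps, a crude timing harness is enough to start with; lookup_word() below is just a stand-in for whichever version you're measuring:
    Code:
    #include <stdio.h>
    #include <time.h>

    /* lookup_word() is hypothetical - plug in whichever method you're testing */
    int lookup_word(const char *word);

    void time_lookups(const char **words, int count)
    {
        clock_t start = clock();
        for (int i = 0; i < count; i++)
            lookup_word(words[i]);
        clock_t end = clock();

        printf("%d lookups took %.3f seconds\n",
               count, (double)(end - start) / CLOCKS_PER_SEC);
    }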

    I did not think that the algorithm or the method would have such an impact on performance.
    That is probably the most ridiculous statement I've ever heard. That is, unless you can perform a linear search on one tera-tera-tera-megabyte of data faster than I can perform my binary search on that much sorted data. If you could do that, you would have invalidated the entire field of computer science, and everything we've done in the last century could just be thrown away. Hell, this whole thread would no longer be relevant if you had an infinitely fast algorithm. If you don't know what those two algorithms are, how they're different and why one is better, you aren't ready to write your dictionary program yet. Time to bust out some algorithm and data structure books.
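    For reference, here are the two approaches side by side on a sorted array of strings (sketch only):
    Code:
    #include <string.h>

    /* linear search: looks at every entry until it finds a match - O(n) */
    int linear_find(char **words, int n, const char *key)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(words[i], key) == 0)
                return i;
        return -1;
    }

    /* binary search: needs sorted data, halves the range every step - O(log n) */
    int binary_find(char **words, int n, const char *key)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int c = strcmp(key, words[mid]);
            if (c == 0)
                return mid;
            if (c > 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;
    }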

    Also, it seems clear to me you have no real idea of the scope of your program, or what exactly it's supposed to do. You must have a good grasp of your problem and the solution before you start coding. You're also exhibiting premature optimization, which is bad. Step way back, and start thinking about solving the problem on a small scale, get it working that way, then start scaling, and optimize only when you absolutely need to.
    Last edited by anduril462; 06-03-2011 at 03:08 PM.

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    First off, forget about the word "library". Most of us probably thought you were talking about .lib files!

    Quote Originally Posted by Rahvasaadik View Post
    I did not think that the algorithm or the method would have such an impact on performance.
    The algorithm is the most important part.

    You need to take a step back. You're not even asking the right question yet. What is your real overall goal?
    If you just want a dictionary for looking up English words, perhaps for some kind of word game, and you just want lookups to be fast, then there are plenty of fast options. However, everything you are talking about so far is a long way down one tiny little avenue of thinking that has already excluded most of the important and useful things you should actually be thinking about, and it may be that you're asking us to choose between a crap solution and a dreadful one.

    Do not try to force others down the same line of thinking as yourself. I wouldn't ask you which type of screwdriver to use to drive in a nail. Withholding information so that we can perhaps come to the same conclusions on our own just slows things down. Among everyone here, we know the best ways to do things, so you need to go back to the beginning and explain your real goal.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  10. #10
    and the hat of int overflow Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    It seems like some kind of decompression to me.

    1. Get token from input file
    2. Lookup token in dictionary to find what it expands to
    3. Output expanded token to output file.
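    In rough code (2-byte tokens and an in-memory dictionary table are assumptions on my part):
    Code:
    #include <stdio.h>

    /* dictionary[] is assumed to hold the expansion for every 16-bit token */
    extern const char *dictionary[65536];

    void decompress(FILE *in, FILE *out)
    {
        int hi, lo;
        while ((hi = fgetc(in)) != EOF && (lo = fgetc(in)) != EOF)
        {
            unsigned token = ((unsigned)hi << 8) | (unsigned)lo;  /* 1. get token */
            const char *word = dictionary[token];                 /* 2. look it up */
            if (word != NULL)
                fprintf(out, "%s ", word);                        /* 3. write the expansion */
        }
    }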

    Is this it?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  11. #11
    Registered User
    Join Date
    Jun 2011
    Posts
    4
    Oh, thanks! You have been very helpful people!

    Yes, I think I need a dictionary for looking up English words. It will probably have half a million entries, like the Oxford Dictionary.

    I figured I'd write a program that compresses text by as much as 80%, so a book that was 10MB would be only 2MB in the end.
    That 2MB file would be a sequence of library and entry indexes, each either 2 bytes or 3 bytes long, depending on the length of the word being indexed.
    For instance, it does not make much sense to map 4-character words to 3-byte indexes if I have 2-byte indexes available; the latter gives a 50% compression benefit.

    So if you stream your English text file through this program, you end up with a heavily compressed format, which can be converted back to the full text by the same program.
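    The encoding step would be roughly this (just a sketch; in reality the word list wouldn't be searched linearly):
    Code:
    #include <stdio.h>
    #include <string.h>

    /* word_list[] is assumed to be the dictionary, word_count entries long */
    extern const char *word_list[];
    extern int word_count;

    /* write one word's 2-byte index to the output, or return -1 if it isn't listed */
    int emit_word(FILE *out, const char *word)
    {
        for (int i = 0; i < word_count; i++)    /* a real version would not scan linearly */
        {
            if (strcmp(word_list[i], word) == 0)
            {
                fputc((i >> 8) & 0xFF, out);    /* first byte: library index */
                fputc(i & 0xFF, out);           /* second byte: entry index */
                return 0;
            }
        }
        return -1;
    }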
    By the way, I figured out that it is possible to write source code in plain English (forget commenting) and have the program index it with libraries, compressing the source code of any program by possibly as much as 80%.

    So there is probably no real point in doing this, since we have so much storage space and RAM available anyway, but when you have a limited server running and would like to save as much bandwidth as possible....
