How to allocate memory for a string which its size is bigger than available RAM?

**Nyda** · 12-08-2004

Originally Posted by franziss

So you are saying that for a array element of size 8bits, I can store 4 DNA characters each of size 2 bits, and I need bitwise operations to get the character, but what does these means?

Code:

byte (index/4), bits (index%4)*2 and (index%4)*2 +1.

Thats just the exact position in the array of characters. You need to extract data from bytes because c doesn't offer you any means to create an array of elements smaller than one byte. The array of characters/bytes would look like

Code:

bit        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 ...
byte/char  [--------- 0 ---------] [---------- 1 --------] [--------
base       [ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] [ 8 ]

So for your GetBase() to retrieve the 4th base it would have to extract bits 8 and 9, which happens to be located at byte 4/4=1, bits (4 mod 4)*2=0 and (4 mod 4)*2+1=1 within that byte. You will need the bit-position within the byte to extract the base.

You can extract individual bits or pairs of bits from a byte using binary shift operations and bit masks (those are two nice search terms

). It's really not as complicated as it sounds.

This line of code would extract 2 bits from a byte, starting at bit bit_offset. You will have to select the correct byte from the array (identified as explained above), extract the 2-bit representation of the base (as shown below) and then look up the correct base name or character or whichever representation you chose for the bitpair and return it. That should be no more than 4 or 5 lines of code, so it's a lot simpler than it sounds.

Code:

base_id= (char_array_element >> bit_offset) & 3

(3 in binary is 0....011 or in other words the "last" two least significant bits.)

---

Anyways, this will not be sufficient to store the entire gnome, but it might make processing the data easier as you can keep more of the data in memory at a time.

**itsme86** · 12-08-2004

Originally Posted by franziss

So you are saying that for a array element of size 8bits, I can store 4 DNA characters each of size 2 bits, and I need bitwise operations to get the character, but what does these means?

Code:

byte (index/4), bits (index%4)*2 and (index%4)*2 +1.

Picture it in your mind...say you read in 3 bytes. If the bits were laid out in the array so that the first base was stored in the least significant 2 bits, the second base was stored in the next least significant bits, etc., it might look like this:

Code:

10110101 11001101 01000110
\/\/\/\/ \/\/\/\/ \/\/\/\/
3 2 1 0  7 6 5 4  11109 8

That's 12 different bases, or whatever. 2 bits for each one. The first byte would be at index array[0], the next at array[1], and the last at array[2]. So say you had a variable called index that stored which base you wanted. You could retrieve it like this:

Code:

base = (array[index / 4] >> ((index % 4) * 2)) & 3;

And you could store 2 bits at any given index like this:

Code:

array[index / 4] |= base << ((index % 4) * 2);

So if index was 0 it would grab the 2 least significant bits of array[0]. If index was 6 it would grab the pair of bits that was second to most significant of array[1] (6 / 4 = 1, and (6 % 4) * 2 = 4).

Clear as mud, right?

EDIT: Doh! Beaten to the punch.

**anonytmouse** · 12-08-2004

>> If I were to read in 30gb of DNA sequences, I will definitely need paging in and out the DNA sequences from harddisk to memory. Does anyone know of any good books or website that taught this kind of things? <<

One option is to map the file into memory with the mmap function. You map a portion of the file, say 150MB at a time, and you can use it just like memory. When you're done with a block call munmap and move onto the next block.

**eccles** · 12-08-2004

Hi there...
I don't know if this will help or not, but - since there are only 64 possible codons (and even fewer amino acids being coded for, plus the stop codon etc) - couldn't you define each codon sequence, and store them in memory, then read in the entire sequence 3 at a time and store them as pointers to the codon in question?

would that use less memory? you could still represent every possible codon as a char, and just define the bit sequence you wanted for the six bits required.
so you'd have 64 * 128bits = your set of codons, and then a huge array of pointers.
I am not sure how much memory a pointer takes up though so i don't know if this is more efficient or not (I am very new to C). but I guess since they have to hold a memory address then it may require more information to specify a memory location that in would to just put the info in about the bases.

oh well... something for people to attack anyway, and i might learn something from that

**itsme86** · 12-08-2004

Pointers take up 2 bytes in Win16 programs, 4 bytes in every other x86 executable type that I'm aware of. 64-bit code would take up 8 bytes per pointer.

**franziss** · 12-08-2004

Originally Posted by eccles

Hi there...
I don't know if this will help or not, but - since there are only 64 possible codons (and even fewer amino acids being coded for, plus the stop codon etc) - couldn't you define each codon sequence, and store them in memory, then read in the entire sequence 3 at a time and store them as pointers to the codon in question?

I'm not sure about this because I'm building a suffix tree that looks for common pattern. For example, AGTGTC, if AGT coded as a condon, GTC coded as a condon, then I would not be able to detect GTG?

**franziss** · 12-08-2004

Nyda, itsme86 and anonytmouse, thanks for your help!

**VirtualAce** · 12-09-2004

You can allocate as much memory as you want but your code will run slower than crap. Window runs in protected mode with caching enabled. Any memory that cannot be allocated will be 'allocated' on disk via the swap file. However this poses several problems. If you attempt to access a page of memory that has been swapped you will get a page fault error. I believe that Windows handles the locking and unlocking of pages while you are writing/reading from them. But theoretically you can allocate any chunk of memory nearly as large as the amount of RAM you have in your system (minus OS, drivers, etc) plus the size of your hard disk(s). But again this is not recommended.

How it would work is when you attempt to access a portion of your memory or array Windows would compute that the information is not actually in RAM but resides on disk inside of the swap file. Immediately Windows would swap the current RAM page to disk, and read in the data you need into RAM (probably into the same page it just swapped to disk).

| -> denotes a page border

RAM...............|...........................Disk
Buffer[0].........|Buffer[0xFFFF]............................Buffer[0xFFFFFFFFFFF]

So if you access anything over SEG [Buffer+(0xFFFF*size_of_buffer_type)] Windows would immediately unlock the current page in memory, swap this to the correct place in the swap file, lock the memory and then load in data from the correct location in the swap file into the page of RAM where the old data used to be.

What this amounts to is a heck of a lot of locking and unlocking of pages of RAM and depending on page size this can really slow the program down tremendously. I would not recommend using extremely large chunks of RAM...unless you are also writing your own memory management schemes to manage the chunk.

The default paging mechanism of the x86 architecture is an awesome system but it comes with a price.

For more information about the x86 paging mechanism and architecture I would recommend consulting the Intel IA32 System Programmer's Reference Manual as well as the Application Programmer's Reference Manual. They explain it in very vivid detail.

**franziss** · 12-09-2004

Originally Posted by Bubba

What this amounts to is a heck of a lot of locking and unlocking of pages of RAM and depending on page size this can really slow the program down tremendously. I would not recommend using extremely large chunks of RAM...unless you are also writing your own memory management schemes to manage the chunk.

The default paging mechanism of the x86 architecture is an awesome system but it comes with a price.

For more information about the x86 paging mechanism and architecture I would recommend consulting the Intel IA32 System Programmer's Reference Manual as well as the Application Programmer's Reference Manual. They explain it in very vivid detail.

I'm using RedHat Linux 9 with AMD CPU. Will it be the same? About writing my own memory management schemes, does C allow me to do it? For example, if I create my own stacks, queues, linked list with LRU, MRU strategy with my program, does this consider a memory management strategy?

**Salem** · 12-10-2004

Sure you can use the sbrk() function to grab lots of memory directly from the operating system.
But that doesn't buy you much compared to just getting malloc to do the same thing for you.

I think you're worrying about the wrong thing. No matter what you do, you're always going to have a memory problem if you even attempt to store the whole thing in memory. So do yourself a favour and start thinking about the actual problems rather than trying to reinvent memory allocation in some new and interesting way.

**franziss** · 12-10-2004

Hi Salem,

I don't have much choice as part of my project is on memory management strategy. There are some research papers which focus on different memory management; they implement their own management strategy and compared results with other standard strategy such as LRU, MRU, etc.

I emailed them to ask how they implement their strategy, is it using C language? They haven't reply me yet. so I kindda stuck here...

**CornedBee** · 12-10-2004

The memory management is done by the OS, and in Protected Mode, you have no say in it. You can ask the OS to do specific things for you, but in the end you still have to share the computer with other programs, and that's why your options are very limited.

**eccles** · 12-10-2004

I'm not sure about this because I'm building a suffix tree that looks for common pattern. For example, AGTGTC, if AGT coded as a condon, GTC coded as a condon, then I would not be able to detect GTG?

oh! does this mean that you are scanning DNA for information content? (ie the level of entropy in the signal)? or are you scanning for restriction enzyme target sites? (like so you can create plasmids)

forgive me if these are really dumb questions... it's been a few years since i've done molecular biochemistry...

**franziss** · 12-10-2004

Originally Posted by eccles

oh! does this mean that you are scanning DNA for information content? (ie the level of entropy in the signal)? or are you scanning for restriction enzyme target sites? (like so you can create plasmids)

forgive me if these are really dumb questions... it's been a few years since i've done molecular biochemistry...

Nope... These are very intelligent questions...

I'm not working on these fields... I'm working on building suffix tree of DNA genome. This suffix tree will be used for sequence alignment.

And my algorithm use frequent patterns to build the tree, thus, I can't code the sequence to codons.

Thread: How to allocate memory for a string which its size is bigger than available RAM?

Thread Tools

Search Thread

Display

Similar Threads

[Inheritance Hierarchy] User Input on program with constructors. How ?

String issues

Something I'd like to share with all of you: simple frustum culling

Is it necessary to write a specific memory manager ?