Thread: How to allocate memory for a string which its size is bigger than available RAM?

  1. #1
    Registered User
    Join Date
    Dec 2004
    Posts
    163

    How to allocate memory for a string which its size is bigger than available RAM?

    Hi!

    I'm working on a program which read a file and allocate each character of the file into an array, which I used malloc to declare. But what if the file size (eg 1Gb) is bigger than my PC ram(256mb)? Is there any way to be still able to allocate all the characters to the array?

    Thanks for your help!

    sincerely,
    Franziss

  2. #2
    Yes, my avatar is stolen anonytmouse's Avatar
    Join Date
    Dec 2002
    Posts
    2,544
    Welcome to the forums!

    The bad news is that you need to redesign your program. For several reasons, trying to treat 1GB of data as a cstring is simply not going to work on a 32bit machine. Typically, for large files, you read in data and work on it block by block. If you give more details on what you are trying to achieve we may be able to provide more detailed help.
    Last edited by anonytmouse; 12-07-2004 at 11:22 AM.

  3. #3
    Registered User
    Join Date
    Jan 2002
    Location
    Vancouver
    Posts
    2,212
    You could still allocate that amount, but there would be a lot of page file usage going on and it would slow your program down significantly. Try doing what you're trying to do in smaller chunks.

  4. #4
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    Do you think you'll ever manage to allocate that much memory?? You'll have to wait a few years more...

  5. #5
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by anonytmouse
    trying to treat 1GB of data as a cstring is simply not going to work on a 32bit machine.
    Why? 2^31, allowing for a sign bit, is roughly 2 billion. That is, last time I checked, bigger than 1 billion.

    1024 (1k) *= 1024 (1mb) *= 1024 (1gb) = 1073741824 bytes = fits in a 32 bit int.

    Quzah.
    Hope is the first step on the road to disappointment.

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You might run into OS limitations, though. Windows, for example, only offers 2 GB of address range to applications. A single annoyingly-placed 8-byte allocated block can create an out-of-memory situation.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    I am trying to write a program which takes in DNA Genome(which is a string of characters) and build a tree with it. I know of existing program which take in an entire Human genome (which is about 30Gb) and build the tree, but too bad the code is not available for public release yet.

    I am now in the stage of coding the program where I assume everything can be contained in my 256 mb RAM, I am now trying strings which are of small size. And I am working in RedHat Linux ver 9.0, using emacs and gcc

    The next stage will be trying to handle large string size, that is the genome.

    I totally agree with anonytmouse and Brian, I will need paging, work block by block, bring a section of the string and build a part of the tree.

    But how do I use C programming to manage the memory, the paging stuffs? I only know basic C programming skills.

    Thanks for the help guys, I really appreciate it!

  8. #8
    Registered User
    Join Date
    Apr 2004
    Posts
    210
    School has been quite a while and I didn't specialize in biology, as far as I know there are only a few different DNA bases (Adenin, Guanin, Cytosin, Thymin and Uracil, and no, I didn't remember those. Google ).

    That's 5 which means 3 bits would be sufficient to store each base. Accessing 3 bits within a string of bytes isn't as easy as using the bytes directly, but it's not such a big deal either. It would reduce memory usage down to a little more than one third of what it was where you using bytes. (That is assuming we're actually talking about bases and not some kind of other representation and that the data is not already store the way I said).

    Alternatively, if there are any bases that are a lot more common than others, you could use non-linear compressions like huffman trees to save some space. It will make accessing a random base element (pretty much*) impossible though, so I guess that's out of question.

    ([*] You'd have to walk through sequentially from the beginning or use another datastructure to remember indexes (like every 1000th) within your database.)
    Last edited by Nyda; 12-08-2004 at 12:42 AM.
    main() { int O[!0<<~-!0]; (!0<<!0)[O]+= ~0 +~(!0|!0<<!0); printf("a function calling "); }

  9. #9
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    Quote Originally Posted by Nyda
    That's 5 which means 3 bits would be sufficient to store each base. Accessing 3 bits within a string of bytes isn't as easy as using the bytes directly, but it's not such a big deal either. It would reduce memory usage down to a little more than one third of what it was where you using bytes. (That is assuming we're actually talking about bases and not some kind of other representation and that the data is not already store the way I said).
    Thanks Nyda! That's a very interesting idea! I will try to work on that. But how do you read an 32bits character(assuming each character is a long word) and convert it into a 3 bits? Does C allow us to do it?

  10. #10
    Registered User
    Join Date
    Apr 2004
    Posts
    210
    Quote Originally Posted by franziss
    Thanks Nyda! That's a very interesting idea! I will try to work on that. But how do you read an 32bits character(assuming each character is a long word) and convert it into a 3 bits? Does C allow us to do it?
    I don't know how your data looks. If it's a string of characters like

    Code:
    AGCU...
    then each character is 8 bit long. So that sequence would take up 4 bytes of memory. Representing those as a binary instead could look like this:

    Code:
    bin dec rep
    000 0   Adenin
    001 1   Cytosine
    010 2   Thymine
    011 3   Guanine
    100 4   Uracil (???)
    Since you are only talking about DNA in your above post, do you need Uracil at all? Because if you don't you'd get along with 2 bits as you can see above. This would make bitextraction a *lot* easier. That way, you could fit in exactly 4 bases into one byte.

    Now all you need is a basic understanding of
    - how decimals are represented in the binary system
    - c bit-operators
    - strings/arrays

    To access a 2bit base within an array of bytes, you'd have to write a few functions with prototypes like this.

    Code:
    char GetBase(char *sequence, int index);
    void SetBase(char *sequence, int index, char base);
    GetBase(myDNA, 1) would retrieve the first 2 bits of the first byte and convert it to a character. GetBase(myDNA, 6) would retrieve the second 2 bits from the second byte and so on. Your DNA would be stored as bitsequence but you could acces it almost as if it were an array of characters. SetBase would look up the bitwise representation chosen for any base supplied and store it in the array at byte (index/4), bits (index%4)*2 and (index%4)*2 +1.

    Since I'm making a lot of assumptions here I won't go in any more details. For a start, search this board or google for bitwise operations. You could also post more details and some code.
    main() { int O[!0<<~-!0]; (!0<<!0)[O]+= ~0 +~(!0|!0<<!0); printf("a function calling "); }

  11. #11
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Uracil and Thymine have an equivalent function. In DNA, there is no Uracil, it is entirely composed of the other four. Uracil is used instead of Thymine in RNA.
    Thus you have only four bases and two bits will suffice.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  12. #12
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > I know of existing program which take in an entire Human genome (which is about 30Gb) and build the tree
    It probably runs on hardware which is way more capable than a single desktop machine.

    Compressing each letter (8 bits) into two bits will only buy you 4 times more storage - which is nowhere near what you actually need to achieve of storing the whole lot in memory at the same time.

    You need to do some serious design work to get up to speed on all the issues, and do some estimating of the resources your candidate solutions would require.
    For example, I would first look at what it would take to store a single chromosome.

  13. #13
    Registered User
    Join Date
    Apr 2004
    Posts
    210
    Quote Originally Posted by Salem
    For example, I would first look at what it would take to store a single chromosome.
    Which is still about 600-700MB of data for each of those. Compressed by 4 thats ~150MB. Sounds like serious fun :-)
    main() { int O[!0<<~-!0]; (!0<<!0)[O]+= ~0 +~(!0|!0<<!0); printf("a function calling "); }

  14. #14
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > Compressed by 4 thats ~150MB.
    Which is looking just about do-able on a 256MB machine with lots of swap space.

  15. #15
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    Quote Originally Posted by Nyda
    To access a 2bit base within an array of bytes, you'd have to write a few functions with prototypes like this.

    Code:
    char GetBase(char *sequence, int index);
    void SetBase(char *sequence, int index, char base);
    GetBase(myDNA, 1) would retrieve the first 2 bits of the first byte and convert it to a character. GetBase(myDNA, 6) would retrieve the second 2 bits from the second byte and so on. Your DNA would be stored as bitsequence but you could acces it almost as if it were an array of characters. SetBase would look up the bitwise representation chosen for any base supplied and store it in the array at byte (index/4), bits (index%4)*2 and (index%4)*2 +1.
    So you are saying that for a array element of size 8bits, I can store 4 DNA characters each of size 2 bits, and I need bitwise operations to get the character, but what does these means?
    Code:
    byte (index/4), bits (index%4)*2 and (index%4)*2 +1.
    Although this idea does not really compress all 30GB of DNA sequences into a 256mb RAM, it does speed up the time needed to build the tree, because I will be able to read in more DNA sequences at one time to build the tree.

    I am still in my designing phase (hee........), currently I'm studying on memory management (this is a great site http://users.actcom.co.il/~choo/lupg...ix-memory.html)

    If I were to read in 30gb of DNA sequences, I will definitely need paging in and out the DNA sequences from harddisk to memory. Does anyone know of any good books or website that taught this kind of things?

    The existing program that I mentioned runs the 30GB DNA on Linux Pentinum 4 with 2 Gmb of RAM. For more details of this program, you can read its paper "Practical Suffix Tree Construction"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 8
    Last Post: 04-25-2008, 02:45 PM
  2. String issues
    By The_professor in forum C++ Programming
    Replies: 7
    Last Post: 06-12-2007, 09:11 AM
  3. Replies: 11
    Last Post: 03-25-2003, 05:13 PM
  4. Is it necessary to write a specific memory manager ?
    By Morglum in forum Game Programming
    Replies: 18
    Last Post: 07-01-2002, 01:41 PM