Thread: reading a file into a block

  1. #1
    Registered User
    Join Date
    Jul 2007
    Posts
    32

    reading a file into a block

    Hello,
    I've done this:
    Code:
    long size;
    ifstream file (name, ios::ate);    // open and seek to the end
    size = file.tellg();
    char* _buffer = new char [size];
    //memset( _buffer, 0, size );      // without this, strange characters appear
    file.seekg (0, ios::beg);
    file.read (_buffer, size);
    cout << _buffer << endl;
    file.close();
    delete[] _buffer;
    The problem is that when I print it, some strange characters (which are not in the file) appear at the end of _buffer. Why? How can I avoid this? What's wrong?

  2. #2
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    _buffer is a character array, and someone forgot to null terminate it.
    Mainframe assembler programmer by trade. C coder when I can.

  3. #3
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Consider reading it into a string and then you don't have to jack with adding an extra byte and setting the zero in place.
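    For example, a minimal sketch of that idea (test.txt is just a placeholder name, not the original poster's file), reading the whole file straight into a std::string so no manual buffer or terminator is needed:
    Code:
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <iostream>

    int main()
    {
        std::ifstream file("test.txt");
        // Build the string directly from the stream's contents.
        std::string contents((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
        std::cout << contents << std::endl;
        return 0;
    }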
    Mainframe assembler programmer by trade. C coder when I can.

  4. #4
    The larch
    Join Date
    May 2006
    Posts
    3,573
    When you use char*, you need to allocate one extra char and set that to '\0'.
    I might be wrong.
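    A minimal, self-contained sketch of that idea (test.txt is a placeholder): allocate size+1, then write the terminator after the count actually read, which matters on Windows as discussed below:
    Code:
    #include <fstream>
    #include <iostream>
    using namespace std;

    int main()
    {
        ifstream file ("test.txt", ios::ate);   // open and seek to the end
        long size = file.tellg();

        char* buffer = new char [size + 1];     // one extra char for '\0'
        file.seekg (0, ios::beg);
        file.read (buffer, size);
        buffer[file.gcount()] = '\0';           // terminate after what was actually read
        cout << buffer << endl;
        delete[] buffer;
        return 0;
    }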

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Is this on Windows? If so, every newline will be CR + LF in the file, and converted to LF only in the buffer - and that means that your resulting content is shorter than the actual file, so you get "rubbish" behind the file.

    You can easily fix this by using file.gcount() to see how much data ACTUALLY got transferred.

    Also note that even if it isn't windows and newlines are not reducing the size of the file, there is no "zero at the end" in your buffer.

    Finally, I would recommend that you don't use the technique of "read all the file into a buffer" unless you know for sure that the file is small. If the file is LARGE, it may not even fit in memory, never mind that if you are processing the file afterwards, it's less efficient to READ 2GB of file, process 2GB of file and then write 2GB of file back - it'll be more efficient to do, say, 4-16KB at a time.
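    Something like this minimal sketch of the chunked approach (test.txt, the 16KB buffer and the echo-to-cout are just placeholders for real processing):
    Code:
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::ifstream file("test.txt", std::ios::binary);
        const std::streamsize chunk_size = 16 * 1024;     // 16KB per read
        std::vector<char> buffer(chunk_size);

        while (file.read(&buffer[0], chunk_size) || file.gcount() > 0)
        {
            std::streamsize got = file.gcount();          // bytes actually read this pass
            // process buffer[0..got-1] here; as a placeholder, just echo it back out
            std::cout.write(&buffer[0], got);
        }
        return 0;
    }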

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Careful with your statements there, mats.
    We know it varies a lot depending on conditions. If you have the memory, it might certainly be faster to read it all at once, process it and write it back than to read small chunks.
    But it might also depend on the processing, whether it's optimized for lots of data or just small chunks, etc.
    Caching, if I'm not mistaken, and sequential reads (not random) work better if you read a lot of data at once rather than in small chunks.
    We can also discuss the overhead of reading data, which is more noticeable when reading small chunks.

    So, anyway, the point is: it depends. It's not always better to read small chunks, so it's better to be careful when stating so.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    Registered User
    Join Date
    Jul 2007
    Posts
    32
    Quote Originally Posted by matsp View Post
    Is this on Windows? If so, every newline will be CR + LF in the file, and converted to LF only in the buffer - and that means that your resulting content is shorter than the actual file, so you get "rubbish" behind the file.

    You can easily fix this by using file.gcount() to see how much data ACTUALLY got transferred.

    Also note that even if it isn't windows and newlines are not reducing the size of the file, there is no "zero at the end" in your buffer.
    I don't understand how to solve it using char*. tellg() gives me 181 chars, but gcount() (called after the read, not before) gives me 168. I'm on Windows. If I put _buffer[size+1]='\0' it is ok, but strange characters are still included. Should I do something like this?
    Code:
    long size;
    ifstream file (name, ios::ate);
    size = file.tellg();
    char* _buffer = new char [size + 1];   // one more char for '\0'
    file.seekg (0, ios::beg);
    file.read (_buffer, size);
    size = file.gcount();                  // how much was actually read
    _buffer[size] = '\0';
    In this way it seems to work; but at this point the size+1 seems unnecessary. I think there's still something that isn't clear to me...

  8. #8
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Let me guess. There are 13 lines in your file.
    Mainframe assembler programmer by trade. C coder when I can.

  9. #9
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Why bother with a dynamic C array if this is C++?

    Code:
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <iostream>
    
    int main(void)
    {
    	std::ifstream f("c:\\1.txt");
    	std::stringstream str ;
    	str << f.rdbuf();
    	std::string s(str.str());
    	std::cout << s;
    	return 0;
    }
    reads the whole file into C++ string...
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You're not allowed to use tellg() to find out the size of a text file. You either have to walk through the file once to find out its actual length, or you reallocate as you go along. (Hint: use a vector<char>.)
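    A minimal sketch of the reallocate-as-you-go hint with a vector<char> (test.txt and the 4KB chunk size are placeholders):
    Code:
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::ifstream file("test.txt");
        std::vector<char> contents;
        char chunk[4096];

        // Grow the vector as the data arrives instead of trusting tellg() for the size.
        while (file.read(chunk, sizeof chunk) || file.gcount() > 0)
            contents.insert(contents.end(), chunk, chunk + file.gcount());

        if (!contents.empty())
            std::cout.write(&contents[0], contents.size());
        return 0;
    }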
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #11
    Registered User
    Join Date
    Jul 2007
    Posts
    32
    Quote Originally Posted by vart View Post
    Why to bother with dynamic C-array if this is C++?

    Code:
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <iostream>
    
    int main(void)
    {
    	std::ifstream f("c:\\1.txt");
    	std::stringstream str ;
    	str << f.rdbuf();
    	std::string s(str.str());
    	std::cout << s;
    	return 0;
    }
    reads the whole file into C++ string...
    At the moment my file is 14 lines, not 13; but it's a simple test, it could be 10,000 lines too.
    I didn't know that I can't use tellg() with a text file; so if that is RIGHT, I have to find a different way. BUT I haven't read anywhere that tellg() works on binary files only... Using stringstream is ok, but I read that if the file is large it's not a good idea to use a string; is that true?

  12. #12
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Off by 1. I didn't take into account the last line. Pretty damn good guess if you ask me. Actually, it wasn't a guess. 181-168 = 13. That's most likely the Windows line ending scheme coming into play based on how you are reading the file. The method you used for the 168 count is most likely stripping out the redundant \r's.

    Todd
    Mainframe assembler programmer by trade. C coder when I can.

  13. #13
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    We know it varies much on conditions. If you have the memory, it might certainly be faster to read it all at once, process and write back than to read small chunks.
    That is why he suggested 4-16KB at a time (instead of one byte at a time). From what I understand, the overhead cost becomes insignificant as read size increases. For instance, let's assume that every read has an overhead of 1s, and after that, 1 byte/s.

    size (bytes) | time needed to read (s)
    1            | 2
    2            | 3
    3            | 4
    4            | 5
    5            | 6
    ...          | ...

    If you read 1 byte at a time, there is a 100% overhead. If you read 5 bytes at a time, it's a 20% overhead. 100 bytes at a time = 1% overhead. Therefore it's pointless to read, say, 1GB at a time vs 1MB at a time, as the speed gained is negligible. Since matsp suggested 4-16KB, I am assuming that is enough to get close to 100% efficiency on modern hardware. If that is the case, why would you want to take up more memory? (It could be used by other programs while your program is running.)
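    A tiny sketch of that toy model in code (the 1 s overhead and 1 byte/s rate are the assumed numbers above, not measurements):
    Code:
    #include <iostream>

    int main()
    {
        const double overhead_s = 1.0;     // fixed cost per read, from the toy model
        const double bytes_per_s = 1.0;    // transfer rate, from the toy model
        const double total_bytes = 1000.0;

        // The overhead fraction shrinks as the block size grows.
        for (double block = 1; block <= 1000; block *= 10)
        {
            double reads = total_bytes / block;
            double total_time = reads * (overhead_s + block / bytes_per_s);
            double overhead_pct = 100.0 * (reads * overhead_s) / (total_bytes / bytes_per_s);
            std::cout << "block " << block << " B: total " << total_time
                      << " s, overhead " << overhead_pct << "%\n";
        }
        return 0;
    }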

  14. #14
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I just did some experiment on my Linux machine (not rocket science, but shows what I meant) -
    Code:
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=1
    102400000+0 records in
    102400000+0 records out
    102400000 bytes (102 MB) copied, 239.271 s, 428 kB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ rm copy
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=2
    51200000+0 records in
    51200000+0 records out
    102400000 bytes (102 MB) copied, 117.278 s, 873 kB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ rm copy
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=4
    25600000+0 records in
    25600000+0 records out
    102400000 bytes (102 MB) copied, 61.4141 s, 1.7 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=8
    12800000+0 records in
    12800000+0 records out
    102400000 bytes (102 MB) copied, 30.6745 s, 3.3 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=16
    6400000+0 records in
    6400000+0 records out
    102400000 bytes (102 MB) copied, 17.4458 s, 5.9 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=32
    3200000+0 records in
    3200000+0 records out
    102400000 bytes (102 MB) copied, 11.4355 s, 9.0 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=64
    1600000+0 records in
    1600000+0 records out
    102400000 bytes (102 MB) copied, 3.99045 s, 25.7 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=128
    800000+0 records in
    800000+0 records out
    102400000 bytes (102 MB) copied, 2.07903 s, 49.3 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=256
    400000+0 records in
    400000+0 records out
    102400000 bytes (102 MB) copied, 1.18204 s, 86.6 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=512
    200000+0 records in
    200000+0 records out
    102400000 bytes (102 MB) copied, 0.687293 s, 149 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=1024
    100000+0 records in
    100000+0 records out
    102400000 bytes (102 MB) copied, 0.455564 s, 225 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=2048
    50000+0 records in
    50000+0 records out
    102400000 bytes (102 MB) copied, 0.35289 s, 290 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=4096
    25000+0 records in
    25000+0 records out
    102400000 bytes (102 MB) copied, 0.260405 s, 393 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=8192
    12500+0 records in
    12500+0 records out
    102400000 bytes (102 MB) copied, 0.226622 s, 452 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=16364
    6257+1 records in
    6257+1 records out
    102400000 bytes (102 MB) copied, 0.242798 s, 422 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=32768
    3125+0 records in
    3125+0 records out
    102400000 bytes (102 MB) copied, 0.215378 s, 475 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=65536
    1562+1 records in
    1562+1 records out
    102400000 bytes (102 MB) copied, 0.209652 s, 488 MB/s
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ rm copy
    cyberfish@cyberfish-laptop:/tmp/bigfile_test$ dd if=big_file of=copy bs=4194304 //4MB
    24+1 records in
    24+1 records out
    102400000 bytes (102 MB) copied, 0.286021 s, 358 MB/s
    For non-Linux/UNIX people: the dd program, in this case, copies big_file (a 100MB file with random data) to "copy" in block sizes specified by bs. The speed increase is negligible beyond 16KB blocks. (No, I don't have a 400MB/s hard drive; it is probably caching/buffering in effect.)

  15. #15
    Registered User
    Join Date
    Jul 2007
    Posts
    32
    Finally, what do I have to do if I want to use ifstream::read?
    It's still not clear to me...
