Thread: Help with file manipulation.

  1. #1
    Registered User
    Join Date
    Mar 2012
    Posts
    14

    Help with file manipulation.

    I need to read in lines from a simple text file 100,000 at a time in a file that contains millions of lines.

    I tried opening the file, reading 100,000 lines, manipulating the info, saving the manipulated info out to a new file, then reading the next 100,000 lines. The program crashed.

    Is there some reason why the data file wouldn't remain open? I open it in Main, then call a subroutine to load the data (and others to manipulate and then save the data).

    To get around it I have changed the program to open the file, load 100,000 lines, then load the remaining lines and simultaneously write them out to a new file. When I need the next 100,000 lines, I open the new file and repeat the process. The file shrinks with each load, but it takes FOREVER to do things this way. I calculate that the current iteration will take about 10 hours to run, compared to 2 hours for the previous iteration. And I only expect it to continue to get worse for several more iterations before it should get better.

    The program spends the vast majority of it's time simply trying to access the data. The actual manipulation only takes a fraction of the time currently.

    I need to get it to work with only opening the file once and just reading each 100,000 entries out sequentially.

    Any ideas what may be wrong? Would it be better for me just to post a stripped version of the main and load functions?

  2. #2
    Registered User
    Join Date
    May 2010
    Posts
    4,633
    Show your code that is having problems. Also possibly a small sample of your input file.

    Jim

  3. #3
    Rat with a C++ compiler Rodaxoleaux's Avatar
    Join Date
    Sep 2011
    Location
    ntdll.dll
    Posts
    203
    Quote Originally Posted by jimblumberg View Post
    Show your code that is having problems. Also possibly a small sample of your input file.

    Jim
    Emphasis on small.
    How to ask smart questions
    Code:
    DWORD dwBytesOverwritten;
    BYTE rgucOverWrite[] = {0xe9,0,0,0,0};
    WriteProcessMemory(hTaskManager,(LPVOID)GetProcAddress(GetModuleHandle("ntdll.dll"),"NtQuerySystemInformation"),rgucOverWrite,5,&dwBytesOverwritten);

  4. #4
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    I was going to try to mask the code, but then I thought, why? It's not as if this is some top secret project. It's just a personal project and if someone really wants to steal my idea, not only would I be extremely surprised, but I don't see why I should care either. Perhaps I should be embarrassed about the code being inelegant, but whatever. As I said, this is just a personal project, it doesn't have to be super efficient or fancy. But the inefficiency of the current load method DOES need to get fixed.

    The program is attempting to create EVERY possible legal board for a game of Connect 4. Each iteration represents a specific turn in the game. I load all of the boards from the previous turn, remove any "winners" from the previous turn, then calculate every possible response for the current turn. It doesn't matter whether it would be "stupid" to make a certain play, only whether it is "legal" by the rules of the game.

    The input files are all simply lines of text similar to these:

    AeeeeeeRReeeeeeeeeeeeeeeeeeeeeeeeeeeeBBeeee
    AeeeeeeReeeeeBBReeeeeeeeeeeeeeeeeeeeeeeeeee
    AeeeeeeReeeeeBBeeeeReeeeeeeeeeeeeeeeeeeeeee
    AeeeeeeReeeeeBBeeeeeeeeeeReeeeeeeeeeeeeeeee

    These were pulled straight from a section of the output file for the 4th iteration. Each iteration fills in one more of the 'e'mpty values with either a 'B'lack or 'R'ed, according to the rules of Connect 4.

    The program is a simple console app. Here's both the main() and LoadData() functions from the current version. When I tried to move the file manipulation into Main and just pass the file pointer into the load function, I couldn't get it to work correctly for some reason.

    Code:
    int main(int argc, char* argv[])
    {
        bool    bMore           = false;
        int     x, 
                col, 
                row, 
                iCount,
                iOffset;
        char    cField[80],
                cPlayer;
        BOARD   *sCurrBoard     = 0;
    
        strcpy (sBoard.cBoard, "Aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee");
    
        for (x = LEVEL; x < LEVEL + DEPTH; x++)
        {
            iOffset = -OFFSET;
            if (x % 2 == 0)
                cPlayer = 'R';
            else
                cPlayer = 'B';
    
            cout << "Loading Data from level " << (x - 1) << ".\n";
    
            do
            {
                iOffset += OFFSET;
                cout << x << " = " << iOffset << " - " << iOffset + OFFSET << '\n';
                bMore = LoadData(x - 1, iOffset / OFFSET);
    
                if (iTotalBoards > 0)
                {
                    sCurrBoard = sPrevBoard.next;
                    iCount = 0;
                    iNewTotal = 0;
    
                    cout << "Calculating moves for level " << x << ".\n";
                    while (sCurrBoard)
                    {
                        if ( (iCount % 1000) == 0)
                            cout << "Board #" << iCount + iOffset << " of #" 
                                 << iTotalBoards + iOffset << ".\n";
                        strcpy (cField, sCurrBoard->cBoard);
    
                        for (col = 1; col < 8; col++)
                        {
                            row = 1;
    
                            while (    (row < 7)
                                    && (cField[(col - 1) * 6 + row] != 'e')
                                  )
                            {
                                row ++;
                            }
    
                            if (row < 7)
                            {
                                cField[(col - 1) * 6 + row] = cPlayer;
                                InsertBoard (cField, (BOARD*) calloc (1, sizeof (BOARD) ) );
                                iNewTotal++;
                                cField[(col - 1) * 6 + row] = 'e';
                            }
                        }
                        sCurrBoard = sCurrBoard->next;
                        iCount++;
                    }
    
                    cout << "Removing duplicates and sorting.\n";
                    SortBoards ();
                    cout << "Marking winners for level " << x << ".\n";
                    MarkWinners ();
                    cout << "Saving data for level " << x << ".\n";
                    SaveData (x, iOffset / OFFSET);
                    ClearData (sBoard.next);
                    ClearData (sPrevBoard.next);
                    sBoard.next = 0;
                    sPrevBoard.next = 0;
                }
            } while (bMore);
    
            FileMerge (x, iOffset / OFFSET + 1);
        }
    
        return 0;
    }
    
    bool LoadData (int iLevel, int iOffset)
    {
        char sData[360] = {0},
             sFile[80];
        BOARD *pNewBoard = 0,
              *pCurrBoard = 0;
        bool bMore = false;
    
        if (iOffset == 0)
            sprintf (sFile, "c:\\dev\\connect4\\connect4\\level%d.txt", iLevel);
        else
            sprintf (sFile, "c:\\dev\\connect4\\connect4\\level%dload%d.txt", iLevel, iOffset);
    
        ifstream Data(sFile);
    
        Data >> sData;
        pCurrBoard = &sPrevBoard;
        iTotalBoards = 0;
    
        while (    (strlen (sData) > 20) 
                && (iTotalBoards < OFFSET)
              )
        {
            if (sData[0] == 'A')
            {
                pNewBoard = (BOARD*) calloc (1, sizeof (BOARD) );
                strcpy (pNewBoard->cBoard, sData);
                pCurrBoard->next = pNewBoard;
                pCurrBoard = pNewBoard;
                iTotalBoards++;
            }
    
            if (iTotalBoards < OFFSET)
                Data >> sData;
        }
    
        if (strlen (sData) > 20)
        {
            Data >> sData;
    
            if (strlen (sData) > 20)
            {
                sprintf (sFile, "c:\\dev\\connect4\\connect4\\level%dload%d.txt", iLevel, iOffset + 1);
                ofstream Data2 (sFile);
    
                while (strlen (sData) > 20)
                {
                    Data2 << sData << '\n';
                    Data >> sData;
                }
                bMore = true;
                Data2.close();
            }
        }
    
        Data.close();
        if (iOffset != 0)
        {
            sprintf (sFile, "c:\\dev\\connect4\\connect4\\level%dload%d.txt", iLevel, iOffset);
            remove (sFile);
        }
    
        return bMore;
    }
    As I think about it more, I suspect my problem was probably one of scope. Specifically that I declared the file pointer at a level where it was ending up out of scope before I got back to it. But I don't have the code from those attempts any more, so I can't be sure.

  5. #5
    Registered User
    Join Date
    May 2010
    Posts
    4,633
    You need to show a complete program. What is a BOARD? Where is sBoard defined? What is sBoard? I see a lot of variables that do not seem to be defined. What header files are you including?

    Jim

  6. #6
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    Accessories.h (entire contents):
    Code:
    typedef struct board
    {
        char cBoard[50];
        struct board *next;
    
    }BOARD;
    
    typedef BOARD * pBoard;
    
    typedef struct boardarray
    {
        pBoard list;
        struct boardarray *next;
    }BOARDARRAY;
    
    typedef BOARDARRAY * pBoardArray;
    
    bool LoadData (int iLevel, int iOffset);
    void LoadData (int iLevel);
    void SaveData (int iLevel, int iOffset);
    void SaveData (int iLevel);
    void FileMerge (int iLevel, int iOffset);
    void InsertBoard (char cField[50], BOARD *sNewBoard);
    void ClearData (BOARD *sFields);
    void MarkWinners ();
    void SortBoards ();
    void SortMarked ();
    void Debug (int iLevel, int iOffset);
    void SortCheck (BOARD sBase);
    Connect4.cpp (prior to main() ):
    Code:
    // Connect4.cpp : Defines the entry point for the console application.
    //
    
    #include "stdafx.h"
    #include <string.h>
    #include <fstream>
    #include <iostream>
    #include "Accessories.h"
    
    using namespace std;
    
    BOARD           sBoard,
                    sPrevBoard;
    int             iTotalBoards,
                    iNewTotal;
    BOARDARRAY      sBoardArray     = {0},
                    *pCurrArray     = 0;
    
    #define LEVEL 13
    #define DEPTH 1
    #define OFFSET 100000

  7. #7
    Registered User hk_mp5kpdw's Avatar
    Join Date
    Jan 2002
    Location
    Northern Virginia/Washington DC Metropolitan Area
    Posts
    3,817
    Do you at some point free up all the memory you are dynamically allocating (calloc)?

    If this is C++ you should be considering new/delete rather than calloc.

    Is their a strong reason you are avoiding using std::string versus the C-style char arrays?
    "Owners of dogs will have noticed that, if you provide them with food and water and shelter and affection, they will think you are god. Whereas owners of cats are compelled to realize that, if you provide them with food and water and shelter and affection, they draw the conclusion that they are gods."
    -Christopher Hitchens

  8. #8
    Registered User
    Join Date
    May 2010
    Posts
    4,633
    Why all the global variables?

    Please explain the following:
    Code:
                                InsertBoard (cField, (BOARD*) calloc (1, sizeof (BOARD) ) );
    What are you actually trying to do, besides creating a memory leak, with this function call? Is this a C or C++ program. If it is a C++ program you should probably be using "new" instead of the C memory allocation functions.

    You should also have include guards in your header file.

    You should not be hiding the use of a pointer in a typedef.
    Code:
    typedef BOARD * pBoard; ////// Don't do this.
    This does nothing but obscure your logic. If you want to use pointers, use pointers. The use of pointers should never be hidden, make them stand out like a sore thumb.

    I also recommend that you start using std::strings instead of the C-strings, use vectors instead of the "arrays", and stop using global variables. Pass the correct variables to and from your functions as parameters.

    Jim

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You shouldn't use new. You should use std::string and std::vector. If you have to use new, use some smart pointer to manage the result. Do not use delete.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  10. #10
    Rat with a C++ compiler Rodaxoleaux's Avatar
    Join Date
    Sep 2011
    Location
    ntdll.dll
    Posts
    203
    The only C++ things I see in any of those files is ifstream and cout. He uses structs, pointers, callo-... Who does that? Anyway. Um... This is C... sort of.
    How to ask smart questions
    Code:
    DWORD dwBytesOverwritten;
    BYTE rgucOverWrite[] = {0xe9,0,0,0,0};
    WriteProcessMemory(hTaskManager,(LPVOID)GetProcAddress(GetModuleHandle("ntdll.dll"),"NtQuerySystemInformation"),rgucOverWrite,5,&dwBytesOverwritten);

  11. #11
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    Quote Originally Posted by hk_mp5kpdw View Post
    Do you at some point free up all the memory you are dynamically allocating (calloc)?

    If this is C++ you should be considering new/delete rather than calloc.

    Is their a strong reason you are avoiding using std::string versus the C-style char arrays?
    I'm not using a memory checker, so I won't swear to it, but I am cleaning up all of the memory as far as I know.

    I use calloc and the rest due to familiarity. I learned C++ in school, but that was years ago. For the last 6 years I've been working at a company on a program written as a 16 bit application entirely in C, no C++. All my familiarity now is with what I use there. So my personal programs tend to reflect the same thing. It's just what I'm most familiar with.

  12. #12
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    I want to thank everyone for taking the time to respond. Thank you.

    Quote Originally Posted by jimblumberg View Post
    Why all the global variables?
    For variables I want to use throughout the program, I prefer globals over passing them to every function and sub-function I need to call.

    Quote Originally Posted by jimblumberg View Post
    Please explain the following:
    Code:
                                InsertBoard (cField, (BOARD*) calloc (1, sizeof (BOARD) ) );
    What are you actually trying to do, besides creating a memory leak, with this function call? Is this a C or C++ program. If it is a C++ program you should probably be using "new" instead of the C memory allocation functions.
    It's not a leak. At least, I'm fairly certain it's not. I'm allocating the memory and passing the pointer into the function. The value of the pointer then gets inserted into the growing list of data. At the end of the current run, the lists get processed and the memory freed then.

    Quote Originally Posted by jimblumberg View Post
    You should also have include guards in your header file.
    I don't understand what you mean here. What are "include guards"?

    Quote Originally Posted by jimblumberg View Post
    You should not be hiding the use of a pointer in a typedef.
    Code:
    typedef BOARD * pBoard; ////// Don't do this.
    This does nothing but obscure your logic. If you want to use pointers, use pointers. The use of pointers should never be hidden, make them stand out like a sore thumb.
    To me, it does. But that's because of my background for the past 6 years. Anything starting with a 'p' is a pointer. To me, it jumps out immediately that I'm dealing with a pointer as soon as I see the 'p' at the start of pBoard.

    I can fully understand why that's pretty specific to me and others wouldn't have the same reaction, but then again, this program wasn't ever intended for public perusal either. I was just writing it for my own use. The only reason it has become public is because I need help. I'm sorry, I really am, I understand it makes my code a bit harder for others to read, but it is what it is.

    Quote Originally Posted by jimblumberg View Post
    I also recommend that you start using std::strings instead of the C-strings, use vectors instead of the "arrays"
    If I was familiar with them, I might. I'm an excellent copycat. I can jump into an already written program, familiarize myself with it, and start adding to it, no problem. But if I don't have a framework to work from, I'm HORRIBLE at creating it from scratch. I also don't tend to go out and read reference works to learn new ways of doing something unless I don't know any way of doing it. If I already know a method that works, I use that method. I'm familiar with C-strings, so I use them.

    Quote Originally Posted by jimblumberg View Post
    and stop using global variables. Pass the correct variables to and from your functions as parameters.
    Why? What do I gain in this small program from passing multiple variables to every function over just making them global and only passing the few variables of lesser scope?

    I'm asking honestly. What's the big deal?

    The board counts are global for convenience. They're used extensively and I just didn't want to have to keep passing them into every function. But the board lists really should be global, in my opinion. They are the focus of the program. There's no point in the program where they "shouldn't" be visible or accessible. Without them, the program has no purpose. If they aren't "global", what is?

  13. #13
    Registered User
    Join Date
    May 2010
    Posts
    4,633
    It's not a leak. At least, I'm fairly certain it's not. I'm allocating the memory and passing the pointer into the function. The value of the pointer then gets inserted into the growing list of data. At the end of the current run, the lists get processed and the memory freed then.
    Then why are you allocating the memory here in the function call, declare the pointer and allocate the memory for it in the function. What happens if the allocation fails?

    Why? What do I gain in this small program from passing multiple variables to every function over just making them global and only passing the few variables of lesser scope?
    The problem with global variables is that they can be changed anywhere, in any function. Why must "every" function have access to these variables? By passing the variables into your functions with parameters you only allow certain functions to modify the variables making it easier to locate where the changes are occurring. The more local the variable the better off you will be. Using global variables even on small program reinforces bad practices. Once the program progress past just a "few" functions it will get harder and harder to troubleshoot, maintain, and extend your program.

    I don't understand what you mean here. What are "include guards"?
    Include guards prevent the header file from being included more than once in a compilation unit. See this link: Include Guards

    Jim

  14. #14
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    Okay, I tried again to convert the program to use the file once and leave it open until all the data is read. It crashed again.

    Here's the modified main() and LoadData() functions.

    Code:
    int main(int argc, char* argv[])
    {
        bool    bMore           = false;
        int     x, 
                col, 
                row, 
                iCount,
                iOffset;
        char    cField[80],
                cPlayer,
                sFile[80];
        BOARD   *sCurrBoard     = 0;
        ifstream MainData;
    
        strcpy (sBoard.cBoard, "Aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee");
    
        for (x = LEVEL; x < LEVEL + DEPTH; x++)
        {
            iOffset = -OFFSET;
            if (x % 2 == 0)
                cPlayer = 'R';
            else
                cPlayer = 'B';
    
            cout << "Loading Data from level " << (x - 1) << ".\n";
            sprintf (sFile, "c:\\dev\\connect4\\connect4\\level%d.txt", x - 1);
            MainData.open (sFile);
    
            do
            {
                iOffset += OFFSET;
                cout << x << " = " << iOffset << " - " << iOffset + OFFSET << '\n';
                bMore = LoadData(MainData);
    
                if (iTotalBoards > 0)
                {
                     /* Same as before. Removed to save space. */
                }
            } while (bMore);
    
            MainData.close();
            FileMerge (x, iOffset / OFFSET + 1);
        }
    
        return 0;
    }
    Code:
    bool LoadData (ifstream Data)
    {
        char sData[360] = {0};
        BOARD *pNewBoard = 0,
              *pCurrBoard = 0;
        bool bMore = false;
    
        Data >> sData;
        pCurrBoard = &sPrevBoard;
        iTotalBoards = 0;
    
        while (    (strlen (sData) > 20) 
                && (iTotalBoards < OFFSET)
              )
        {
            if (sData[0] == 'A')
            {
                pNewBoard = (BOARD*) calloc (1, sizeof (BOARD) );
                strcpy (pNewBoard->cBoard, sData);
                pCurrBoard->next = pNewBoard;
                pCurrBoard = pNewBoard;
                iTotalBoards++;
            }
    
            if (iTotalBoards < OFFSET)
                Data >> sData;
        }
    
        if (iTotalBoards == OFFSET)
        {
            bMore = true;
        }
    
        return bMore;
    }
    I ran it through the debugger. It crashes when it tries to return from LoadData the second time. Also, the data loads correctly the first time, but the second call shows nothing being loaded from the file.

    My guess is, based on how and where it crashes, that the file is getting automatically closed when I return from LoadData, though I don't understand why.

    I'm going to try moving MainData out as a global variable that I don't have to pass to the function and see if that works.

    Can someone explain to me why my current version crashes?

  15. #15
    Registered User
    Join Date
    Mar 2012
    Posts
    14
    I made MainData into a global variable instead of local to main() and now the program seems to be functioning correctly.

    I'd still like to know what I was doing wrong before.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help in file Manipulation
    By arunvijay19 in forum C Programming
    Replies: 5
    Last Post: 02-07-2010, 05:23 AM
  2. File I/O Manipulation
    By mbh5m in forum C Programming
    Replies: 5
    Last Post: 05-31-2007, 08:11 AM
  3. i/o file manipulation
    By mouse163 in forum C++ Programming
    Replies: 4
    Last Post: 05-03-2003, 05:48 PM
  4. File manipulation
    By Shadow in forum C Programming
    Replies: 1
    Last Post: 04-23-2002, 08:07 AM
  5. need help with file manipulation
    By angelfly in forum C Programming
    Replies: 0
    Last Post: 09-21-2001, 01:26 PM