Thread: Segmentation fault when reading very large text files

  1. #1
    Registered User
    Join Date
    Dec 2009
    Posts
    2

    Hello all,

    I've created a fairly simple C program that searches a text file for a pattern supplied by the user. I have test files that are 1MB (17k lines), 10MB (174k lines), and 100MB (1.74 million lines). The program runs perfectly on the first one (17k lines of text), but on the other two it crashes with a segmentation fault. I'm guessing the size of the files is the problem. Am I requesting too much memory? Could fopen() or fgets() be the problem? Can they not handle this kind of load?

    Here's some code:

    Code:
    // Currently works when running from command line: ./a.out genome1MB.txt aaatcg
    
    #include <math.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    int main(int argc, char *argv[])
    {
    
    	char pattern[100];		//holds the user's DNA pattern
    	int lineCount = 0;		
    	int patternsFound = 0;
    	int lineLocation[500], i;	//array that holds the line numbers of the pattern 
    	int charLocation[500], j;	//array that holds the character locations of the pattern
    	int x = 0;			//used to increment lineLocation array
    	int y = 0;			//used to increment charLocation array
    	char viewPatternLocations;
    
    	if(argc != 3)
    	{
    		printf("Please provide the correct number of command line arguments.\n");
    	}
    	else
    	{		
    		// Save the user's DNA pattern from the command line into a variable:
    		sscanf(argv[2], "%s", pattern);
    
    		FILE *f;
    		char line[500];
    		char *item;
    		int calcCharLocation;
    		
    		f = fopen(argv[1],"r");
    
    		if(!f)
    		{
    			printf("Error: cannot open file.\n");
    			return 1;
    		}
    
    		while(fgets(line,500,f))
    		{
    			lineCount++;
    			//printf("%d: %s", lineCount, line);
    
    			item = strstr(line, pattern);	//assists in calculating character location within each line
    
    			// Search each line in the genome (file) for the user's DNA pattern:
    			if(strstr(line, pattern))
    			{
    				//printf("Found\n\n");
    				patternsFound++;
    				lineLocation[x] = lineCount;	//if pattern is found, save the occurrence's line number into an array
    				x++;				//increment the lineLocation array for the next occurrence
    				calcCharLocation = strlen(line) - strlen(item);	//calculates how far into the line the DNA pattern is
    				charLocation[y] = calcCharLocation;		//save the character location calculation into an array
    				y++;				//increment the charLocation array for the next occurrence
    			}
    			else
    			{
    				//printf("Not found\n\n");
    			}
    
    
    		}
    
    		printf("--------------------\n");
    		printf("Given pattern '%s' found %d times in genome.  Would you like specific locations? ('y' or 'n')> ", 
    						pattern, patternsFound);
    		
    		/* 
    		     Specific pattern locations could potentially dump a lot of lines to the screen.  
    		     Give the user a choice whether to see these locations or not.
    		*/
    
    		scanf("%c", &viewPatternLocations);
    
    		if(viewPatternLocations == 'y')
    		{
    			for(i = 0; i < patternsFound; i++)
    			{
    				printf("In line %d, %d characters in.\n", lineLocation[i], charLocation[i]);
    			}
    		}
    
    		fclose(f);
    	}
    
    	return 0;
    }
    Anyone have any ideas?

    Thanks,
    Steve

  2. #2
    DESTINY BEN10's Avatar
    Join Date
    Jul 2008
    Location
    in front of my computer
    Posts
    804
    Generally the stack size on a machine is 1-2MB, so when you read a file larger than 1MB, a stack overflow could occur. I'm not sure whether this is the exact reason, though.
    HOPE YOU UNDERSTAND.......

  3. #3
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    He's not loading the entire thing into memory. My guess is that he finds more than 500 matches, runs off the end of his arrays, and trashes his memory.
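
    A minimal sketch of the bounds check this diagnosis calls for, assuming the two 500-element arrays from the original post are kept as they are. The helper name record_match and the MAX_MATCHES constant are hypothetical, not part of the posted program:

```c
#include <stddef.h>
#include <string.h>

#define MAX_MATCHES 500  /* same capacity as the arrays in the original post */

/* Hypothetical helper: stores one match only while there is room left.
   Returns 1 if the match was stored, 0 if the arrays are already full. */
static int record_match(int lineLocation[], int charLocation[], int *stored,
                        int lineNumber, const char *line, const char *hit)
{
    if (*stored >= MAX_MATCHES)
        return 0;                               /* writing here would run off the end */

    lineLocation[*stored] = lineNumber;
    charLocation[*stored] = (int)(hit - line);  /* offset of the match within the line */
    (*stored)++;
    return 1;
}
```

    In the read loop, hit would be the pointer returned by strstr(line, pattern). The total match count can keep incrementing separately, so the summary stays accurate even when only the first MAX_MATCHES locations are stored.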


    Quzah.
    Hope is the first step on the road to disappointment.

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Edit:

    Thanks for that correction, laserlight! fgets() stops after n-1 chars (or at the newline), so there is always room for the end-of-string char at the end of whatever it copies into the array.

    I don't see anything obviously wrong, then.
    Last edited by Adak; 12-07-2009 at 01:57 AM.

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Adak
    When you fgets() 500 char's, and put them into a 500 element array, there's no room for the end of string char.
    That depends on what you mean by "fgets() 500 char's". If you pass 500 as the second argument to fgets, at most 499 chars will be read and stored in the destination array.
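
    A small sketch that checks this behaviour, assuming tmpfile() is usable on the system. The helper name fgets_stored_length is hypothetical and exists only for this demonstration:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical demo helper: writes `text` to a temporary file, reads it back
   with fgets(buf, size, f), and returns the length of the string that fgets
   stored in buf, or -1 if the file could not be created or nothing was read. */
static long fgets_stored_length(const char *text, char *buf, int size)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;

    fputs(text, f);
    rewind(f);

    /* fgets reads at most size-1 chars and always appends '\0' */
    long len = fgets(buf, size, f) ? (long)strlen(buf) : -1;

    fclose(f);
    return len;
}
```

    With a 5-element buffer and size 5, an 8-character input line should leave exactly 4 chars plus the terminator in the buffer, which is why passing 500 for a 500-element array is safe.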

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Here's what I'd do: change the variables inside the loop from int to unsigned long, then see how it works. I'm thinking x or some other int variable just overflows and thus goes negative.

    Then when you use that as an index to the array - crash city.

  7. #7
    Registered User slingerland3g's Avatar
    Join Date
    Jan 2008
    Location
    Seattle
    Posts
    603
    Quote Originally Posted by quzah View Post
    He's not loading the entire thing into memory. My guess is that he finds more than 500 matches, runs off the end of his arrays, and trashes his memory.


    Quzah.

    I agree! Check your x and y values here. The use of fgets() is fine.

  8. #8
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    As quzah noted, you're probably going out of bounds somewhere.
    Use a debugger to pinpoint the code location where it is happening.

  9. #9
    Registered User
    Join Date
    Dec 2009
    Posts
    2
    Yup, quzah fixed it, thanks. When I first coded it up I was using such a small test file that I figured there was no way there could be more than 500 matches. Then when I went to use the larger test files, I forgot to allocate more space.
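
    One way to avoid guessing a fixed size at all is to let the storage grow as matches arrive. This is a minimal sketch of that idea, not the fix actually used in the thread; the struct and function names (match, match_list, match_list_push, match_list_free) are hypothetical:

```c
#include <stdlib.h>

/* One recorded occurrence: line number and character offset within the line. */
struct match {
    int line;
    int column;
};

/* Growable list of matches; start with all fields zeroed. */
struct match_list {
    struct match *items;
    size_t count;
    size_t capacity;
};

/* Appends one match, doubling the capacity when the list is full.
   Returns 1 on success, 0 if memory could not be allocated. */
static int match_list_push(struct match_list *list, int line, int column)
{
    if (list->count == list->capacity) {
        size_t new_capacity = list->capacity ? list->capacity * 2 : 16;
        struct match *grown = realloc(list->items, new_capacity * sizeof *grown);
        if (!grown)
            return 0;               /* the old block is still valid */
        list->items = grown;
        list->capacity = new_capacity;
    }
    list->items[list->count].line = line;
    list->items[list->count].column = column;
    list->count++;
    return 1;
}

/* Releases the storage and resets the list to empty. */
static void match_list_free(struct match_list *list)
{
    free(list->items);
    list->items = NULL;
    list->count = 0;
    list->capacity = 0;
}
```

    Declaring the list as struct match_list matches = {0}; and calling match_list_push() in place of the two array writes keeps a single index instead of separate x and y counters.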
