Thread: stripping text from a word doc

  1. #1
    Registered User
    Join Date
    Apr 2002

    stripping text from a word doc

    I have to pull the text out of a Word doc. I know a little C, but I'm not sure I know enough to do this. I do have a C++ DLL that a coworker wrote from my VB version, and it does extract the text. However, it also extracts any formatting characters that happen to be alphanumeric, which won't work for my situation. Does anyone know how to recognize the formatting characters in a Word doc so they can be skipped while the text is pulled out? Here's the code for the DLL I'm using right now. Any ideas or suggestions would be greatly appreciated. Thanks.

    #include <windows.h>
    #include <oleauto.h>   // SysAllocString, SysAllocStringLen
    #include <cstring>     // std::strchr
    #include <fstream>
    #include <iostream>
    #include <string>

    // The returned pointer is a BSTR; the caller owns it and must free it.
    extern "C" LPWSTR __stdcall ConvertDocument(const char* pPath)
    {
        std::ifstream tfile(pPath, std::ios::binary);
        if (!tfile) {
            std::cout << "ERROR: Cannot open file." << std::endl;
            return SysAllocString(L"");
        }

        const std::size_t kMaxChars = 199999;   // same cap as the old fixed buffer
        std::wstring text;
        char ch;
        // Test get() itself so the loop ends at EOF without reusing a stale ch.
        while (text.size() < kMaxChars && tfile.get(ch)) {
            if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
                (ch >= '0' && ch <= '9')) {
                text += static_cast<wchar_t>(ch);   // keep letters and digits
            } else if (ch == 13) {
                text += L"\r\n";                    // expand CR into CR LF
            } else if (ch == ' ' || ch == '\t') {
                text += L' ';
            } else if (ch != '\0' && std::strchr(".?!;(){}[]`:'", ch)) {
                text += L' ';                       // punctuation becomes a space
            }
            // Every other byte (including NUL) is dropped.
        }

        // Build the BSTR from real wide characters; casting the narrow char
        // buffer to LPWSTR, as the old code did, produces garbage.
        return SysAllocStringLen(text.data(), static_cast<UINT>(text.size()));
    }

  2. #2
    Decide which symbols you want to leave out and delete them from the list of accepted chars.
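
    One way to make that list easy to edit is to keep the accepted characters in a single string and test each character against it. This is only a rough sketch: the names kAccepted and keepAccepted are made up for illustration, and the set shown is just letters, digits and a space.

```cpp
#include <string>

// Hypothetical whitelist: delete a character from this string to stop
// accepting it, or add one to start accepting it.
const std::string kAccepted =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789 ";

// Returns only the characters of raw that appear in kAccepted.
std::string keepAccepted(const std::string& raw)
{
    std::string out;
    for (char ch : raw) {
        if (kAccepted.find(ch) != std::string::npos) {
            out += ch;
        }
    }
    return out;
}
```

    With that set, keepAccepted("It was a rainy day in May.") drops the period and keeps everything else.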

    If you know what the formatting tags are you can look for them and edit them out. Say the line started like this:

    <indent>It was a rainy day in May.

    and <indent> was formatting fluff you didn't want to keep. To eliminate it you could search for the opening <, ignore it and every character after it until the > is found, and ignore that too. If <indent> can be either part of the message you want to keep or part of the formatting fluff you want to lose, then you're stuck unless you can figure out some way to tell when to save it and when to ignore it. The same goes for any other formatting symbols, words, or whatever.
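
    The skip-until-> idea could be sketched roughly like this. It assumes the formatting really is delimited by < and > as in the made-up <indent> example, that tags never nest, and that < never shows up in text you want to keep. The name stripTags is hypothetical.

```cpp
#include <string>

// Drops everything from an opening '<' through the next '>' inclusive.
// An unterminated tag swallows the rest of the line.
std::string stripTags(const std::string& line)
{
    std::string out;
    bool inTag = false;
    for (char ch : line) {
        if (inTag) {
            if (ch == '>') {
                inTag = false;   // end of tag; the '>' is dropped too
            }
        } else if (ch == '<') {
            inTag = true;        // start of tag; the '<' is dropped
        } else {
            out += ch;           // ordinary text is kept
        }
    }
    return out;
}
```

    Called on the example line, stripTags("<indent>It was a rainy day in May.") gives back just the sentence. It can't tell a wanted <indent> from an unwanted one, which is the problem described above.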
