Thread: Reading Microsoft Word documents

  1. #1
    Registered User
    Join Date
    Sep 2006
    Posts
    98

    Reading Microsoft Word documents

    How can I read Word documents using C#? I use StreamReader for plain text files, but I don't know how to read Word documents. Does anyone know how to do this?

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    The file format produced by Word (in a .doc file) is a binary format that contains all sorts of "extra data" beyond the basic text that is "the real document".

    I would guess there are libraries available to read it, but I'm not sure.

    The easiest solution is perhaps to save the document as plain text or in a "mostly text" format (RTF, for example).
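
    To tell which kind of file you actually have before choosing an approach, a rough sketch along these lines could inspect the first few bytes. The helper name sniff_format is hypothetical. It assumes that a binary .doc begins with the 8-byte OLE compound-file signature and that an RTF file begins with the characters {\rtf; anything else is reported as "other".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: classify a file from its first n bytes.
   Returns "doc" for the OLE compound-file signature used by binary
   .doc files, "rtf" for data starting with "{\rtf", else "other". */
static const char *sniff_format(const unsigned char *head, size_t n)
{
    static const unsigned char ole_magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    if (n >= 8 && memcmp(head, ole_magic, 8) == 0)
        return "doc";
    if (n >= 5 && memcmp(head, "{\\rtf", 5) == 0)
        return "rtf";
    return "other";
}
```

    To use it, open the file in "rb" mode, fread the first 8 bytes into an unsigned char buffer, and pass the buffer along with the number of bytes actually read.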

    --
    Mats

  3. #3
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    There's a COM Interop object available if you have Word installed. Take a look through Google for "Word interop" or something similar.
    If I did your homework for you, then you might pass your class without learning how to write a program like this. Then you might graduate and get your degree without learning how to write a program like this. You might become a professional programmer without knowing how to write a program like this. Someday you might work on a project with me without knowing how to write a program like this. Then I would have to do you serious bodily harm. - Jack Klein

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    98
    I have Word installed, but it's quite old (Office 2000). Isn't there some other way to read Word documents?

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Depends on what you want to do - what are you trying to achieve? If you can describe what your end goal is, then we can almost certainly describe some way of getting there (or towards that goal) - but "just reading a word document" isn't trivial, because the information in the file is stored in quite a complex manner (for example, if you enable "visible changes" both the previous text and the new text for multiple generations of the document may be kept in the document).

    --
    Mats

  6. #6
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    Another alternative is to use the COM IFilter interface provided by Microsoft Desktop Search. You'll lose all the font information, but you will at least be able to get the words out of the document. It may also be possible to use OpenOffice to programmatically open Word docs.

    In any case, it's not a simple endeavor.

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    98
    Quote: Originally posted by matsp
    Depends on what you want to do - what are you trying to achieve? If you can describe what your end goal is, then we can almost certainly describe some way of getting there (or towards that goal) - but "just reading a word document" isn't trivial, because the information in the file is stored in quite a complex manner (for example, if you enable "visible changes" both the previous text and the new text for multiple generations of the document may be kept in the document).

    --
    Mats
    All I want to do is extract the text from the Word document. Just the text, no formatting.

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Well, if you want to do that, you could try something like this:

    Code:
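    /* Needs <stdio.h> and <ctype.h>. */
    FILE *fin = fopen("something.doc", "rb");
    FILE *fout = fopen("someelse.txt", "w");
    int c;  /* int, not char, so EOF stays distinct from data bytes */
    while (fin && fout && (c = fgetc(fin)) != EOF) {
        if (isascii(c)) fputc(c, fout);  /* drop bytes outside 0-127 */
    }
    if (fin) fclose(fin);
    if (fout) fclose(fout);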
    This isn't meant to be complete, and I'm fairly sure you'll get some "garbage", but you probably can get the actual text out of it.
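
    One way to cut down on the garbage is to keep only longer runs of printable characters, much as the Unix strings utility does. The following is a minimal sketch of that idea, not a real .doc parser: the helper name extract_runs and the MIN_RUN threshold of 4 are assumptions made for illustration. It also assumes the text is stored as 8-bit characters; text stored as 16-bit characters would have a zero byte between letters and would be missed.

```c
#include <stddef.h>
#include <stdio.h>

#define MIN_RUN 4  /* assumed threshold: shorter runs are treated as noise */

/* Hypothetical helper: write every run of at least MIN_RUN printable
   ASCII bytes (0x20-0x7E) found in buf to out, one run per line.
   Returns the number of runs written. */
static size_t extract_runs(const unsigned char *buf, size_t len, FILE *out)
{
    size_t i, start = 0, runs = 0;
    int in_run = 0;

    /* Iterate one step past the end so a run ending at len is flushed. */
    for (i = 0; i <= len; i++) {
        int printable = (i < len) && buf[i] >= 0x20 && buf[i] <= 0x7E;

        if (printable) {
            if (!in_run) {
                start = i;
                in_run = 1;
            }
        } else if (in_run) {
            if (i - start >= MIN_RUN) {
                fwrite(buf + start, 1, i - start, out);
                fputc('\n', out);
                runs++;
            }
            in_run = 0;
        }
    }
    return runs;
}
```

    To use it, read the whole file into a buffer with fread and pass the buffer, its length, and an output stream. Raising MIN_RUN removes more noise at the cost of losing very short words that sit between binary bytes.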

    Alternatively, try the "Word Viewer":
    Microsoft Word Viewer download site

    It allows you to copy text out of a Word document. It's free, and you don't have to write a single line of code (and it will probably do a better job of sorting out what's what in your document too).

    --
    Mats

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    98
    Thanks for the help!
