Thread: HTML to TXT

  1. #1
    Registered User
    Join Date
    Jun 2007
    Posts
    12

    HTML to TXT

    Hello Everybody,
    I have a question regarding the conversion of files from HTML to TXT. I'm fairly new to the C language but understand the basics. I'm a little confused about what to do. I have to write a program that asks the user to input an HTML file and then converts it to TXT.
    Thanks-

  2. #2
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    You may have a question, but you failed to ask it.

  3. #3
    Registered User
    Join Date
    Jun 2007
    Posts
    12

    ?

    huh?

  4. #4
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    You would want to open the HTML file for reading. Read it and output the contents to another file with a .txt extension.

    You can find out how to do that here:
    http://faq.cprogramming.com/cgi-bin/...&id=1043284392
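The plain copy described above (open one file for reading, write its contents to another) could be sketched roughly as follows. The names `copy_stream` and `copy_file` are made up for illustration and are not taken from the FAQ; this copies the file unchanged and does not strip any tags.

```c
#include <assert.h>
#include <stdio.h>

/* Copy every character from `in` to `out` unchanged.
   Returns 0 on success, -1 if a read or write error occurred. */
int copy_stream(FILE *in, FILE *out)
{
    int ch;   /* int, not char, so EOF can be told apart from data */

    assert(in != NULL && out != NULL);
    while ((ch = fgetc(in)) != EOF) {
        if (fputc(ch, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Open `src` for reading and `dst` for writing, then copy src to dst.
   Returns 0 on success, -1 on any failure. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "r");
    FILE *out;
    int status;

    if (in == NULL)
        return -1;
    out = fopen(dst, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    status = copy_stream(in, out);
    fclose(in);
    if (fclose(out) == EOF)
        status = -1;
    return status;
}
```

A call such as `copy_file("page.html", "page.txt")` (hypothetical file names) would then produce a byte-for-byte text copy.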

  5. #5
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    Quote Originally Posted by eobergfell View Post
    huh?
    That's the first question you actually asked.

    Look at your first post. There is no question. You just said you wanted to make a new txt file from an HTML file. Someone concluded that meant you just want a copy of the same file with a different file extension. I think it could mean you want to strip all the tags and create a new txt file with just straight text from the original file.

    First, your initial description of what you wanted to do was too ambiguous. Second, you did not even ask a question, and somehow hoped we would just know what to say.

    Have a look at this: http://www.catb.org/~esr/faqs/smart-questions.html


  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by eobergfell View Post
    Hello Everybody,
    I have a question regarding the conversion of files from HTML to TXT. I'm fairly new to the C language but understand the basics. I'm a little confused about what to do. I have to write a program that asks the user to input an HTML file and then converts it to TXT.
    Thanks-
    Your program will be very similar to ones that count words. You'll need to keep track of the current position in the file, and whether that position is INSIDE an HTML tag (in which case the characters found there do NOT need to be copied to the text file your program will make), or OUTSIDE any HTML tag, in which case they are plain text characters that DO need to be copied into the text file.

    So your program will scan the HTML file and keep an integer flag variable "Inside", equal to 1 (yes, it is currently inside) or 0 (no, it is not currently inside).

    Now all you have to do is work out which border characters always mark the start and end of an HTML tag.

    The more specific your questions are, the better we can answer them. Be sure to post some of your code when you ask, as well. That will help guide the discussion and show that you are working on it.
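As a rough illustration of the flag idea above, here is a minimal sketch that works on a string instead of a file, assuming '<' and '>' are the border characters. The function name `strip_tags_simple` is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `src` to `dst`, skipping everything between '<' and '>'.
   `dst` must have room for at least as many bytes as `src`,
   including the terminating '\0'. */
void strip_tags_simple(const char *src, char *dst)
{
    int inside = 0;   /* 1 while between '<' and '>', 0 otherwise */

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (inside) {
            if (*src == '>')
                inside = 0;      /* tag ended, resume copying */
        } else if (*src == '<') {
            inside = 1;          /* tag started, stop copying */
        } else {
            *dst++ = *src;       /* plain text, copy it */
        }
    }
    *dst = '\0';
}
```

Given `"<p>Hello, <b>world</b>!</p>"` this leaves `"Hello, world!"` in `dst`. One known weakness of a single flag: a '>' that appears inside a quoted attribute value ends the tag too early.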

  7. #7
    Registered User
    Join Date
    Jun 2007
    Posts
    12
    Thanks, mike and Adak. And MacGyver, keep being a sweetheart, okay?

  8. #8
    Registered User
    Join Date
    Jun 2007
    Posts
    12
    Yeah, I have to strip off all the tags <> and make it into a readable plain TXT file.

  9. #9
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    If you want to strip XML (or HTML, same thing basically) properly, you need a finite state machine (never mind, it's just a term) with three states.
    • Outside any tag.
    • Inside a tag.
    • Inside an attribute.

    If you don't have the last state, your code would be fooled by
    Code:
    <tag data=">" />
    Anyway, the logic is quite simple. If you're outside a tag, look for a '<', which would involve changing to mode inside-a-tag. If you're inside a tag, look for a ' or ", which would make the mode inside-an-attribute. Ditto for the closing characters.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  10. #10
    Registered User
    Join Date
    Jun 2007
    Posts
    12
    Code:
    #include<stdio.h>
    int main()
    {
            int a, b;
            int input;
            FILE* outfile;
            char filename[80];
            printf("Enter Filename: ");
            gets(filename);
            outfile = fopen(filename, "w");
            scanf("%d", &input);
    
            fprintf(outfile, "%d\n",input );
            fclose(outfile);
    }
    That's what I've done so far. I understand the logic behind this: "If you're outside a tag, look for a '<', which would involve changing to mode inside-a-tag. If you're inside a tag, look for a ' or ", which would make the mode inside-an-attribute." But I don't know how to write it out.
    Help, anybody?

  11. #11
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Don't use gets(). It has no way to limit how much it reads, so long input overflows the buffer. See the FAQ.

    How are you going to read from the file if you open it in "w" mode? Or are you reading from stdin?

    Regardless, the logic goes something like this.
    Code:
    int mode = outside;
    
    while(get a character) {
        if(mode == outside) {
            if(char == '<') mode = inside;
            else write character
        }
        else if(mode == inside) {
            if(char == '>') mode = outside;
            else if(char == '"' || char == '\'') mode = attribute;
        }
        else {  /* mode = attribute */
            /* ... */
        }
    }
    An enum would be useful here. http://www.cprogramming.com/tutorial/enum.html
    Code:
    enum {
        MODE_OUTSIDE,
        MODE_INTAG,
        MODE_INATTRIBUTE
    } mode = MODE_OUTSIDE;
    But you can use ordinary variables too of course.
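Putting the pseudocode and the enum together, one possible runnable version might look like the sketch below. The attribute branch was left open above, so the `quote` variable (remembering which quote character opened the value) is an assumption of this example, as are the names `enum mode` and `strip_tags`.

```c
#include <assert.h>
#include <stdio.h>

enum mode {
    MODE_OUTSIDE,      /* ordinary text: copy it */
    MODE_INTAG,        /* between '<' and '>': skip it */
    MODE_INATTRIBUTE   /* inside a quoted attribute value: skip it */
};

/* Read characters from `in` and write everything outside tags to `out`. */
void strip_tags(FILE *in, FILE *out)
{
    enum mode mode = MODE_OUTSIDE;
    int quote = 0;   /* assumption: the quote char that opened the value */
    int ch;

    assert(in != NULL && out != NULL);
    while ((ch = fgetc(in)) != EOF) {
        switch (mode) {
        case MODE_OUTSIDE:
            if (ch == '<')
                mode = MODE_INTAG;
            else
                fputc(ch, out);
            break;
        case MODE_INTAG:
            if (ch == '>') {
                mode = MODE_OUTSIDE;
            } else if (ch == '"' || ch == '\'') {
                quote = ch;
                mode = MODE_INATTRIBUTE;
            }
            break;
        case MODE_INATTRIBUTE:
            if (ch == quote)   /* only the matching quote closes the value */
                mode = MODE_INTAG;
            break;
        }
    }
}
```

This is only a sketch: it does not treat comments, script blocks, or character entities such as `&amp;` specially, so real-world HTML would need more states.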

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Using gets() on a C board is sort of like putting a match out on a burn victim: people are very touchy about it. Check the FAQ on how to use fgets(), which is safe, for input from stdin (when needed in your program). You won't get much help beyond "don't use gets()!" until that's done.
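For reference, a small sketch of reading a line with fgets() in place of gets(). The helper name `read_line` is made up here; fgets() keeps the trailing newline, so the helper removes it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line from `in` into `buf` (at most size-1 characters) and
   remove the trailing newline, if any.
   Returns 1 on success, 0 on end of file or error. */
int read_line(FILE *in, char *buf, size_t size)
{
    assert(in != NULL && buf != NULL && size > 0);
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';   /* cut at the newline if present */
    return 1;
}
```

With `char filename[80];` the call would be `read_line(stdin, filename, sizeof filename)`. Unlike gets(), it never writes more than the buffer can hold.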

    You'll definitely want a variable named "InsideTag" (or OutsideTag), and InsideAttrib, (or OutsideAttrib). Please get those added and initialized to out.

    While you're trying to code this, I'd temporarily /* remark out */ the file names, etc., until you get a handle on the logic. It will greatly reduce your frustration, and speed everything up.

  13. #13
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    You'll definitely want a variable named "InsideTag" (or OutsideTag), and InsideAttrib, (or OutsideAttrib). Please get those added and initialized to out.
    It makes more sense to have a single variable to hold three states, since the program can only be in one state at once. But two variables would work as well.

  14. #14
    Registered User
    Join Date
    Jun 2007
    Posts
    12
    Hey everybody, thanks. I'm having trouble using the "enum function".
    Code:
    enum Status{OUTSIDE = ?, INSIDE = ?);
    I don't know what to set OUTSIDE and INSIDE equal to.
    If anyone could help I would really appreciate it. Thanks-

  15. #15
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    The first value in an enum is automatically set to zero. Succeeding values are automatically set to one more than the previous value. But you can override this if you like.
    Code:
    enum numbers {
        minusone = -1,
        zero,
        three = 3,
        four,
        five
    };
    But the point of enums is that the values don't matter. You just have something like
    Code:
    enum {
        MODE_OUTSIDE,
        MODE_INTAG,
        MODE_INATTRIBUTE
    } mode = MODE_OUTSIDE;
    and you always compare the enum to those values.
    Code:
    if(mode == MODE_OUTSIDE) mode = MODE_INTAG;
    if(mode != MODE_INTAG) puts("not in tag");
    Using enums the way they are intended to be used, it should not matter what the values of the enumerations are.
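To see the numbering rule in action, here is a small sketch that prints the values of the `numbers` enum from above. The function `print_numbers` is just an illustrative helper.

```c
#include <assert.h>
#include <stdio.h>

enum numbers {
    minusone = -1,
    zero,          /* one more than the previous value: 0 */
    three = 3,
    four,          /* 4 */
    five           /* 5 */
};

/* Print each enumerator with its numeric value. */
void print_numbers(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "minusone=%d zero=%d three=%d four=%d five=%d\n",
            minusone, zero, three, four, five);
}
```

Calling `print_numbers(stdout)` prints `minusone=-1 zero=0 three=3 four=4 five=5`. For the tag stripper none of this matters: the mode names can be left without explicit values.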
