Thread: Strip HTML code and save to a file

  1. #1
    Registered User
    Join Date
    Mar 2012
    Posts
    5

    Strip HTML code and save to a file

    Hi, I'm making a program which takes a website and a filename as command-line arguments and then saves the page's HTML code to that file.

    I have that working fine, but I'd like to strip the HTML tags as well so it only saves the text of the webpage. I understand that won't work perfectly, but I can't get it to work at all!

    I need to put in something like this, but I'm unsure where:

    Code:
    if (c == '<' || c == '>') {
        in_tag = (c == '<') ? 1 : 0;
    }

    Here's my full program


    Code:
    #include <curl/curl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    //curl write callback: writes each received chunk straight to the file
    size_t write_data(void *ptr, size_t size, size_t nmeb, void *stream) {
        return fwrite(ptr, size, nmeb, stream);
    }

    int main(int argc, char *argv[]) {
        //checks there is the required amount of arguments
        if (argc == 3) {
            char cwd[1024];
            int confirm = -1; //stays invalid if scanf reads nothing

            printf("Saving website \"%s\".\n", argv[1]);
            printf("To file %s\n\n", argv[2]);

            //request save file confirmation from user
            printf("Are these details correct? (1 = Yes, 0 = No)\n\n");
            scanf("%d", &confirm);

            if (confirm == 1) {
                //tells the user where the file has been saved
                if (getcwd(cwd, sizeof (cwd)) != NULL)
                    fprintf(stdout, "Document saved in: \"%s\"\n\n", cwd);

                //opens file for writing (doesn't need to exist)
                FILE *file = fopen(argv[2], "w+");
                if (!file) {
                    perror("File Open");
                    exit(EXIT_FAILURE);
                }
                CURL *handle = curl_easy_init();
                //collecting the html from command line specified argument
                curl_easy_setopt(handle, CURLOPT_URL, argv[1]);
                curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_data);
                curl_easy_setopt(handle, CURLOPT_WRITEDATA, file);
                curl_easy_perform(handle);
                curl_easy_cleanup(handle);
                //flushes and closes the saved file
                fclose(file);
            }//user chooses not to save
            else if (confirm == 0) {
                printf("File not saved\n");
                return 0;
            }//invalid input by user
            else {
                printf("Incorrect input\n");
                return 0;
            }
        } else {
            //showing correct usage of command line argument
            printf("Correct usage:\n\n \"./gethtml http://www.example.com filename.txt\"\n\n");
            return (0);
        }

        return 0;
    }



    That doesn't include an attempt at stripping the HTML, as I've been trying all day and I'm clueless right now. Any help much appreciated.

  2. #2
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    You will need to strip out the tags before you write it to file. Looking at your code, you have a function called write_data that you give as argument to curl_easy_setopt(). If you write one character at a time inside this function you should be able to add your evaluation code there (testing if you are in or out of a tag). You only actually fprintf() or fputc() the character if in_tag is false.

    If you want this feature to be optional you could write two versions of this function, one that strips html tags and one that doesn't.

    Edit: Basically something like this.

    Code:
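        char *c = ptr;
        static int in_tag = 0; /* static: a tag can span two callback calls */
        size_t total = size * nmeb; /* buffer length in bytes */
        size_t i;
    
        for(i = 0; i < total; i++) {
            if(c[i] == '<') {
                in_tag = 1;
            }
            else if(in_tag == 0) {
                if(fputc(c[i], stream) == EOF) {
                    return i; /* short count reports the failed write */
                }
            }
            else if(c[i] == '>') {
                in_tag = 0;
            }
        }
    
        return i; /* i == total here; this may not be correct */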
    You need to look up what curl expects as the return value from your function, since you will technically not be printing all characters. It is probably OK to report back how many characters you have handled from the buffer.
    Last edited by Subsonics; 03-05-2012 at 09:34 AM.

  3. #3
    Registered User
    Join Date
    Mar 2012
    Posts
    5
    Thanks for your reply mate, appreciated! I'm going to have a good read of the CURL documentation then I'll post back in a bit.

  4. #4
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    Most likely it's OK to return the amount you have considered for printing, just as fwrite returns the amount written. The only thing curl could do with that value (as far as I can think of) is determine whether something went wrong when writing to disk, and that will be reported correctly anyway, since the function above returns 'i' early if fputc hits EOF prematurely. From the outside, the function should behave the same way.
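
    To make the return-value point concrete, here is a minimal sketch of a stripping callback, not a definitive implementation. libcurl's documentation for CURLOPT_WRITEFUNCTION says the callback should return the number of bytes it took care of, and that any other value signals an error, so the sketch returns size * nmemb even though fewer characters reach the file. The names strip_state and write_stripped are made up for illustration. The struct is there because curl delivers the page in several chunks, so the "inside a tag" flag has to survive between calls.

```c
#include <stdio.h>

/* Hypothetical state carried between callback invocations. */
struct strip_state {
    FILE *out;   /* destination file */
    int in_tag;  /* non-zero while between '<' and '>' */
};

/* Same signature as a libcurl write callback; userdata is a strip_state. */
size_t write_stripped(void *ptr, size_t size, size_t nmemb, void *userdata) {
    struct strip_state *st = userdata;
    const char *c = ptr;
    size_t total = size * nmemb;  /* bytes in this chunk */
    size_t i;

    for (i = 0; i < total; i++) {
        if (st->in_tag) {
            /* Skip everything inside a tag; '>' ends it. */
            if (c[i] == '>') {
                st->in_tag = 0;
            }
        } else if (c[i] == '<') {
            st->in_tag = 1;
        } else if (fputc(c[i], st->out) == EOF) {
            return 0;  /* differs from total, so the caller sees an error */
        }
    }

    /* Every byte was handled, even though not every byte was written. */
    return total;
}
```

    To wire it into the original program you would declare something like struct strip_state st = { file, 0 }; and then pass write_stripped to CURLOPT_WRITEFUNCTION and &st to CURLOPT_WRITEDATA. This is still the naive approach discussed above: it does not decode entities such as &amp;amp; and it will be fooled by '<' or '>' inside scripts, comments, or attribute values.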
    Last edited by Subsonics; 03-05-2012 at 12:05 PM.

    Last Post: 01-24-2002, 10:10 AM