Thread: html web translator/interpreter

  1. #1
    Registered User
    Join Date
    May 2009
    Posts
    242

    html web translator/interpreter

    What I'd like to do is write a C++ program that does the following:

    1) Opens a file with html code that's the source for a web page.

    2) Creates a new .txt file with some kind of name appendage (I'll probably do it so that if such a file already exists, the program just aborts right here after closing the original file).

    3) Writes text into the new file such that, if you put that text on a web page, what you'll see in your browser is the source text of the original html file.

    This is much less confusing than I'm unfortunately making it sound. Here's an example:

    In our old html file (call it orig.htm), we see <p>Hello, World!</p>

    The new file (call it orig_new.txt), I want to have &lt;p&gt;Hello, World!&lt;/p&gt;

    What I'm wanting to do is post some solutions to html problems on my website, including the code. So, I want the browser to render what the html coder sees in the html file, and I don't want to create the code for the code by hand.

    Where I need help for the moment is in setting the whole thing up. I know the basics about how to open files, read them, and write new ones. And I'll need to review in detail exactly what special symbols will get me by in the translation (just < and > will already get you pretty far), but what all is needed is an issue I'm not asking about at the moment anyway.
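    For the setup in steps 1 and 2, a minimal sketch could look like the following. The helper names make_output_name and file_exists are hypothetical, the "_new.txt" suffix is taken from the orig.htm / orig_new.txt example above, and "exists" is approximated by "can be opened for reading", which is an assumption rather than a strict test.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: "orig.htm" -> "orig_new.txt".
// The "_new" suffix and ".txt" extension follow the example in the post.
std::string make_output_name(const std::string& input_name)
{
    // Only treat a dot as the start of an extension if it comes after
    // the last path separator (so "dir.v2/page" keeps its directory name).
    const std::string::size_type slash = input_name.find_last_of("/\\");
    const std::string::size_type dot = input_name.find_last_of('.');
    const bool has_extension =
        dot != std::string::npos &&
        (slash == std::string::npos || dot > slash);

    const std::string stem =
        has_extension ? input_name.substr(0, dot) : input_name;
    return stem + "_new.txt";
}

// Returns true if a file with this name can be opened for reading.
// Assumption: an existing but unreadable file is reported as missing.
bool file_exists(const std::string& name)
{
    std::ifstream probe(name.c_str());
    return probe.is_open();
}
```

    The program would call file_exists(make_output_name(input)) before opening the output file, and abort if it returns true.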

    My problem is that you're translating "<" not into exactly one character but into FOUR CHARACTERS ("&lt;"). So, you can't just create a big character array for the whole file and make a simple in-place translation each time you come to this character, because the required memory will change.

    One solution would be to read the entire original file into a string. But this seems like greater length than is at least INTENDED for normal string variables. I mean, maybe it's perfectly normal and never creates problems to have strings as long as maybe even 50k characters. The html files I'll be dealing with are normally only a few thousand characters long, and I'll likely even be breaking those up in practice, but I'd like for the program to work even on long html files without getting runtime errors or weird results.

    So, another solution would be to do something like creating a string variable that holds in memory the first 20 (?) characters from the original file. Then this variable will have the flexibility to expand so as to replace an instance of "<" with an instance of "&lt;" without getting into odd memory situations.

    What I'm wondering is how basically to set up the variables that record what I read from the old file before writing the translation to the new file.

    In short: What's a good size for a healthy string variable in this context? Or am I worrying unnecessarily about creating potentially huge (compared to what I've used in C++ up to now) string variables? Or, is solving this problem with string variables going to be harder than I think so that I should just try to find a program that does it and not attempt to write the code myself?

    If such a program already exists (as it presumably does), I'd still like to code this on my own just as exercise unless you guys tell me that it's going to require more advanced skills than I think it will. I really think this should be doable without getting into advanced programming issues.

  2. #2
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Just use std::string and stop worrying about it (for now at least).

    Your multi-tabbed browser probably holds a lot more HTML (and its rendered representations) than your program will.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Guest Sebastiani
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Generally speaking, you have two basic options:

    1) Read the entire file into memory.
    2) Read the file in small chunks and repeat until all chunks are processed.

    The first is usually faster, unless of course the file is so large that the memory gets paged out, in which case it might be much slower.

    As far as the conversion, I'm not sure if html "special" sequences are all one byte wide, but if they are, just loop through each byte of the input and simply output the translated text if you encounter one, otherwise output the input byte. On the other hand, if they can span multiple bytes, you'll just need to build up a "token", process it, then flush the token, etc.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  4. #4
    Registered User
    Join Date
    May 2009
    Posts
    242
    So, a std::string variable into which you read 50k bytes isn't normally a problem? In actuality, my current purpose is just to display code on a webpage, so the files are going to be way smaller, probably normally < 1000 bytes.

    Is there any rough guide as to how big C++ "likes" its std::string variables? I'm thinking what I might do is read it in until arriving at '\n' on my string (putting it roughly in the 50-100 range). I've seen some javascript code that runs like 20k characters before getting to a new line, but no way I'd ever want to show that on a webpage anyway.

    Actually, it probably wouldn't be a bad idea to end each line with a <br />, too, to manually require that the line breaks occur in the right spot--i.e., convert '\n' to "<br />\n"

    I'm not sure about the problem with special sequences either, Sebastiani. If there are multi-byte sequences requiring special treatment, then I almost have to read it in as one big string. But I'm at the moment thinking that I can probably get by with just <, > and &. I'm likely forgetting something, but those 3 cover all problem cases I can think of. So, I'm hoping I don't have to digest your suggestions regarding tokens just yet ...

    If your source file contains a character reference such as &#236; then the interpreter should crank out &amp;#236;, which a browser should then render back as the literal text &#236; (this forum tends to display such a reference as the weird "i" character ì, which merely illustrates the point: it's code for a special character, and a browser renders it that way).
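    A minimal sketch of the translation discussed in this post, assuming <, > and & really are the only characters that need escaping. The name escape_html and the add_line_breaks flag are hypothetical; the flag covers the idea of turning '\n' into "<br />\n". The result is built in a separate string, so the input never has to grow in place.

```cpp
#include <string>

// Returns a copy of `text` with &, < and > replaced by entity references.
// If add_line_breaks is true, each '\n' becomes "<br />\n".
std::string escape_html(const std::string& text, bool add_line_breaks)
{
    std::string result;
    result.reserve(text.size());  // the result is at least this long

    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        switch (c) {
            case '&': result += "&amp;"; break;
            case '<': result += "&lt;";  break;
            case '>': result += "&gt;";  break;
            case '\n':
                if (add_line_breaks) result += "<br />";
                result += '\n';
                break;
            default:
                result += c;
                break;
        }
    }
    return result;
}
```

    Since every input character is examined exactly once, the & inside a freshly written "&lt;" is never escaped a second time, and a reference such as &#236; in the source comes out as &amp;#236; without any token handling.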

    Well, if anyone has any other ideas, they're much appreciated. Otherwise, I'll just try it out and report back upon success or failure.
    Last edited by Aisthesis; 08-02-2009 at 04:38 AM.

  5. #5
    Registered User
    Join Date
    May 2009
    Posts
    242
    p.s.: I seriously doubt speed will be any issue at all. Once I have an executable file, it will do the conversion on the kind of files I'm talking about essentially immediately, I'm pretty sure.

    Quite different if it were ever to be used in some iterative context or multiple times for multiple large files. But just for a one-shot translation of fairly small files, I doubt run time will be noticeable.

  6. #6
    and the Hat of Guessing tabstop
    Join Date
    Nov 2007
    Posts
    14,336
    There is such a thing as std::string::max_size() which is the largest a std::string can be. On my system, that works out to be 1073741820. I would doubt that 50K would be a problem on any actual modern computer (not counting things like wristwatches).
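    To check this on a given system, something like the sketch below could be used. The name fits_in_string is hypothetical. Note that max_size() is only the theoretical upper bound of the implementation; an actual allocation can still fail earlier with std::bad_alloc if memory runs out.

```cpp
#include <string>

// Returns true if a std::string on this implementation could, in
// principle, hold `byte_count` characters.
bool fits_in_string(std::string::size_type byte_count)
{
    return byte_count <= std::string().max_size();
}
```

    Printing std::string().max_size() with std::cout shows the value for the system at hand; it varies between implementations.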

  7. #7
    Registered User
    Join Date
    May 2009
    Posts
    242
    Oh, wow, that's WAY bigger than I thought. It sounds like just reading the whole file into a string is much less problematic than I thought it would be.

    I'll also have to see what I can come up with for limiting the length the program even deals with--like, if the file is bigger than maybe 5k (giving PLENTY of headroom for the practical purpose of this program while giving some security against the unexpected), just cout "File too big" and terminate the program.
