html web translator/interpreter

**Aisthesis** · 08-02-2009

What I'd like to do is write a C++ program that does the following:

1) Opens a file with html code that's the source for a web page.

2) Creates a new .txt file with some kind of name appendage (I'll probably do it so that if such a file already exists, the program just aborts right here after closing the original file).

3) Writes into the text file new text of the sort such that if you write it onto a web page, what you'll see in your browser is the text of the original html file.

This is much less confusing than I'm unfortunately making it sound. Here's an example:

In our old html file (call it orig.htm), we see Hello, World!

The new file (call it orig_new.txt), I want to have Hello, World
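The naming used in this example (orig.htm becoming orig_new.txt) together with the abort-if-it-already-exists rule from step 2 could be sketched roughly as follows. Both `makeOutputName` and `fileExists` are hypothetical helper names made up for illustration, and the "_new.txt" suffix is simply taken from the example above.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: "orig.htm" -> "orig_new.txt".
std::string makeOutputName(const std::string& inputName) {
    // Find the extension dot, ignoring dots that belong to a directory name.
    std::string::size_type dot = inputName.find_last_of('.');
    std::string::size_type slash = inputName.find_last_of("/\\");
    bool hasExt = dot != std::string::npos &&
                  (slash == std::string::npos || dot > slash);
    std::string stem = hasExt ? inputName.substr(0, dot) : inputName;
    return stem + "_new.txt";
}

// Hypothetical helper: true if the file can be opened for reading.
bool fileExists(const std::string& name) {
    std::ifstream probe(name.c_str());
    return probe.good();
}
```

The program would then call `fileExists(makeOutputName(inputName))` and quit before writing anything if that returns true.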

What I'm wanting to do is post some solutions to html problems on my website, including the code. So, I want the browser to render what the html coder sees in the html file, and I don't want to create the code for the code by hand.

Where I need help for the moment is in setting the whole thing up. I know the basics about how to open files, read them, and write new ones. And I'll need to review in detail exactly what special symbols will get me by in the translation (just < and > will already get you pretty far), but what all is needed is an issue I'm not asking about at the moment anyway.

My problem is that you're not translating "<" into exactly one character but into FOUR CHARACTERS ("&lt;"). So, you can't just create a big character array for the whole file and make a simple translation each time you come to this character, because the required memory will change.

One solution would be to read the entire original file into a string. But this seems like greater length than is at least INTENDED for normal string variables. I mean, maybe it's perfectly normal and never creates problems to have strings as long as maybe even 50k characters. The html files I'll be dealing with are normally only a few thousand characters long, and I'll likely even be breaking those up in practice, but I'd like for the program to work even on long html files without getting runtime errors or weird results.

So, another solution would be to do something like creating a string variable that holds in memory the first 20 (?) characters from the original file. Then this variable will have the flexibility to expand so as to replace an instance of "<" with an instance of "&lt;" without getting into odd memory situations.

What I'm wondering is how basically to set up the variables that record what I read from the old file before writing the translation to the new file.

In short: What's a good size for a healthy string variable in this context? Or am I worrying unnecessarily about creating potentially huge (compared to what I've used in C++ up to now) string variables? Or, is solving this problem with string variables going to be harder than I think so that I should just try to find a program that does it and not attempt to write the code myself?

If such a program already exists (as it presumably does), I'd still like to code this on my own just as exercise unless you guys tell me that it's going to require more advanced skills than I think it will. I really think this should be doable without getting into advanced programming issues.

**Salem** · 08-02-2009

Just use std::string and stop worrying about it (for now at least).

Your multi-tabbed browser probably holds a lot more HTML (and its rendered representations) than your program will.
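One common way to pull a whole file into a single std::string is to stream its buffer into a std::ostringstream. This is only a sketch of that idea; `readAll` is a made-up name, not something from the standard library.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads everything left in the stream into one std::string.
std::string readAll(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copies the stream's remaining contents
    return buffer.str();
}
```

With a file, usage would look something like `std::ifstream file("orig.htm"); std::string html = readAll(file);` (with `<fstream>` included), after checking that the file actually opened.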

**Sebastiani** · 08-02-2009

Generally speaking, you have two basic options:

1) Read the entire file into memory.
2) Read the file in small chunks and repeat until all chunks are processed.

The first is usually faster, unless of course the file is so large that the memory gets paged out, in which case it might be much slower.

As far as the conversion, I'm not sure if html "special" sequences are all one byte wide, but if they are, just loop through each byte of the input and simply output the translated text if you encounter one, otherwise output the input byte. On the other hand, if they can span multiple bytes, you'll just need to build up a "token", process it, then flush the token, etc.
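For the one-byte case, the loop described here might look like the sketch below. It assumes the only characters needing translation are '<', '>' and '&', and `translateStream` is a name invented for this example. Nothing is buffered, so the input size doesn't matter.

```cpp
#include <istream>
#include <ostream>
#include <sstream>  // only needed to try this out on in-memory streams

// Copies in to out one byte at a time, translating special characters.
void translateStream(std::istream& in, std::ostream& out) {
    char c;
    while (in.get(c)) {
        switch (c) {
            case '<': out << "&lt;";  break;
            case '>': out << "&gt;";  break;
            case '&': out << "&amp;"; break;
            default:  out.put(c);     break;  // everything else passes through
        }
    }
}
```

Since it takes any std::istream and std::ostream, the same function should work with an std::ifstream / std::ofstream pair or with string streams.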

**Aisthesis** · 08-02-2009

So, a std::string variable into which you read 50k bytes isn't normally a problem? In actuality, my current purpose is just to display code on a webpage, so the files are going to be way smaller, probably normally < 1000 bytes.

Is there any rough guide as to how big C++ "likes" its std::string variables? I'm thinking what I might do is read it in until arriving at '\n' on my string (putting it roughly in the 50-100 range). I've seen some javascript code that runs like 20k characters before getting to a new line, but no way I'd ever want to show that on a webpage anyway.

Actually, it probably wouldn't be a bad idea to end each line with a <br>, too, to manually require that the line breaks occur in the right spot, i.e., convert '\n' to "<br>\n"
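A possible sketch of the line-break idea, assuming the tag to put in front of each newline is `<br>`. The function name `addLineBreaks` is hypothetical. It would have to run after the other characters have been translated, otherwise the inserted tag itself would get escaped.

```cpp
#include <string>

// Inserts "<br>" in front of every newline: "\n" becomes "<br>\n".
std::string addLineBreaks(const std::string& in) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '\n') {
            out += "<br>";
        }
        out += in[i];  // the newline itself is kept
    }
    return out;
}
```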

I'm not sure about the problem with special sequences either, Sebastiani. If there are multi-byte sequences requiring special treatment, then I almost have to read it in as one big string. But I'm at the moment thinking that I can probably get by with just <, > and &. I'm likely forgetting something, but those 3 cover all problem cases I can think of. So, I'm hoping I don't have to digest your suggestions regarding tokens just yet ...
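Restricted to those three characters, an in-memory version could be sketched like this. It builds the result in a second std::string, which simply grows as the longer replacements are appended, so the changing length is not a problem. `escapeHtml` is an invented name, and whether three characters are really enough is an assumption carried over from the post.

```cpp
#include <string>

// Returns a copy of in with '<', '>' and '&' replaced by HTML entities.
std::string escapeHtml(const std::string& in) {
    std::string out;
    out.reserve(in.size());  // just a starting guess; out grows past it as needed
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        switch (in[i]) {
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            case '&': out += "&amp;"; break;
            default:  out += in[i];   break;
        }
    }
    return out;
}
```

Because '&' itself is translated, character codes in the source are escaped too and show up literally in the browser.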

If your source file shows something like &#236; then the interpreter should crank out &amp;#236;, which a browser should then render back as &#236; (that code merely illustrates the point: it's the code for a special character, a weird "i", ì, which is how a browser renders it if it isn't escaped).

Well, if anyone has any other ideas, they're much appreciated. Otherwise, I'll just try it out and report back upon success or failure.

**Aisthesis** · 08-02-2009

p.s.: I seriously doubt speed will be any issue at all. Once I have an executable file, it will do the conversion on the kind of files I'm talking about essentially immediately, I'm pretty sure.

Quite different if it were ever to be used in some iterative context or multiple times for multiple large files. But just for a one-shot translation of fairly small files, I doubt run time will be noticeable.

**tabstop** · 08-02-2009

There is such a thing as std::string::max_size() which is the largest a std::string can be. On my system, that works out to be 1073741820. I would doubt that 50K would be a problem on any actual modern computer (not counting things like wristwatches).
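To see what the limit is on a given system, something like the following could be used. The two function names are made up for the example. Note that max_size() is only a theoretical upper bound; an allocation can still fail well below it if memory runs out.

```cpp
#include <cstddef>
#include <string>

// The implementation's upper bound on std::string length.
std::size_t maxStringSize() {
    return std::string().max_size();
}

// True if a string of n characters is within that bound.
bool fitsInString(std::size_t n) {
    return n <= maxStringSize();
}
```

Printing `maxStringSize()` with std::cout shows the number for the compiler and platform in use; the value differs between systems.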

**Aisthesis** · 08-02-2009

Oh, wow, that's WAY bigger than I thought. It sounds like just reading the whole file into a string is much less problematic than I thought it would be.

I'll also have to see what I can come up with for limiting the length the program even deals with--like, if the file is bigger than maybe 5k (giving PLENTY of headroom for the practical purpose of this program while giving some security against the unexpected), just cout "File too big" and terminate the program.
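A size check along these lines could be sketched with seekg/tellg. Here "5k" is assumed to mean 5 * 1024 bytes, and `streamLength`, `isTooBig` and `kMaxBytes` are names invented for illustration. For a file, the stream should probably be opened in binary mode so the count is in real bytes.

```cpp
#include <istream>
#include <sstream>  // only needed to try this out on in-memory streams

const std::streamoff kMaxBytes = 5 * 1024;  // assumption: "5k" = 5120 bytes

// Returns the stream's total length in bytes, or -1 if it can't be determined.
// The read position is put back where it was.
std::streamoff streamLength(std::istream& in) {
    std::istream::pos_type start = in.tellg();
    in.seekg(0, std::ios::end);
    std::istream::pos_type end = in.tellg();
    in.seekg(start);
    if (!in || end == std::istream::pos_type(-1)) {
        return -1;
    }
    return static_cast<std::streamoff>(end);
}

// Treats an unknown length as too big, to stay on the safe side.
bool isTooBig(std::istream& in) {
    std::streamoff n = streamLength(in);
    return n < 0 || n > kMaxBytes;
}
```

The program would call `isTooBig(file)` right after opening the input, print "File too big" and return if it is true.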
