Thread: Data model for text editor

  1. #31
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by MTK View Post
    It just seems that text is meant to be organized in separate rows because that is the way it is meant to be read, and the \n character is just to make it easy for computers to store.
    Amen brother.
    Mainframe assembler programmer by trade. C coder when I can.

  2. #32
    Registered User
    Join Date
    Apr 2004
    Posts
    210
    Quote Originally Posted by MTK View Post
It moves separately along the y and x axes, so organizing the text as such in memory would make it easier to compute cursor movement (just jump to the next line in the list), while if it was stored as a big array, you would have to iterate backwards and count characters until the previous \n, then iterate forward until the next \n, and then iterate until you get to the right column, checking whether there is a \n in the way.
If you're referring to my post, you can use the pointers to access the rows directly, and you can determine the current length of each row by counting the chars between the pointer target and the next \n (not \0) to verify that the cursor position is valid. The memory address to the right of the cursor would be row_ptr[row] + column.
There is no deeper connection between the cursor and the data. You only need to know where the new letters will end up in memory, and what's behind (to the right of) the cursor. Both are easy enough to access. Using a secondary buffer for changes ensures that you don't have to realloc one large buffer all the time and that you don't clutter your heap with 178.898... mini-mallocs if the file is a little larger than usual.
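Roughly, something like this (just a sketch; row_ptr, row_len and cursor_addr are placeholders, not real code):
Code:
#include <stddef.h>

/* sketch: row length = chars from the row pointer up to the next '\n' (not '\0') */
size_t row_len(char **row_ptr, int row)
{
    size_t len = 0;
    while (row_ptr[row][len] != '\n' && row_ptr[row][len] != '\0')
        len++;
    return len;
}

/* sketch: address just right of the cursor, or NULL if the column is
   past the end of the row (i.e. the cursor position is invalid) */
char *cursor_addr(char **row_ptr, int row, int column)
{
    if ((size_t)column > row_len(row_ptr, row))
        return NULL;
    return row_ptr[row] + column;
}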
    main() { int O[!0<<~-!0]; (!0<<!0)[O]+= ~0 +~(!0|!0<<!0); printf("a function calling "); }

  3. #33
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MTK View Post
I was thinking about the way the cursor moves. It moves separately along the y and x axes, so organizing the text as such in memory would make it easier to compute cursor movement (just jump to the next line in the list), while if it was stored as a big array, you would have to iterate backwards and count characters until the previous \n, then iterate forward until the next \n, and then iterate until you get to the right column, checking whether there is a \n in the way.

    It just seems that text is meant to be organized in separate rows because that is the way it is meant to be read, and the \n character is just to make it easy for computers to store.
    This is WRONG. Particularly the last sentence. The newline HAS NOTHING TO DO with making it easy for the computer to store rows of text. The newline is a formatting convention. The x and y position of the cursor, the appearance of the text on the screen*, etc., HAVE NOTHING TO DO WITH THE NEWLINE.

    As you can see, the text here is organized in rows and columns, but I will tell you right now that the rows are NOT created with newlines!!! The text is wrapped by the display; there is no software (e.g., phpBulletin) that inserted newlines or anything else to make the rows appear.

    With blocks of text, a newline indicates a break that could occur ANYWHERE in a physically displayed row, NOT necessarily at the end.

    If you have a 1500-byte string, you could represent the rows as they will appear on the screen by determining the screen width and keeping an array of pointers at that interval. YOU DO NOT INSERT NEWLINES INTO THE TEXT. NO EDITOR** DOES THIS. ONLY THE USER INSERTS A NEWLINE. You do not have to manually wrap text. The wrap will occur automatically, determined by the width of the display area.

* beyond the user-determined intermittent line breaks, WHICH HAVE NO REGULAR INTERVAL.
    ** except for an email client.
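A rough sketch of the pointer-array idea, assuming a fixed display width and a single text buffer (untested, and the names are just placeholders):
Code:
/* sketch: fill rows[] with a pointer to the start of each *displayed* row.
   A row ends at the display width or just after a user-typed '\n';
   no newlines are ever inserted into the text itself. */
int map_rows(char *text, char **rows, int max_rows, int width)
{
    int n = 0;
    char *p = text;
    while (*p && n < max_rows) {
        rows[n++] = p;
        int i = 0;
        while (i < width && p[i] && p[i] != '\n')
            i++;                            /* scan at most one screen width */
        p += (p[i] == '\n') ? i + 1 : i;    /* skip the '\n' if we hit one */
    }
    return n;                               /* number of displayed rows */
}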
    Last edited by MK27; 09-04-2009 at 03:32 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  4. #34
    Registered User
    Join Date
    Aug 2009
    Posts
    198
I know that, but in an editor the user sees the text in an x-y grid and moves around and edits that way, so storing the text that way in the editor makes editing more efficient, not having to tediously translate between 1 and 2 dimensions.

  5. #35
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MTK View Post
I know that, but in an editor the user sees the text in an x-y grid and moves around and edits that way, so storing the text that way in the editor makes editing more efficient, not having to tediously translate between 1 and 2 dimensions.
All you have to do is take y*width + x and that will give you the character position in the string. So if you are on the 3rd line, 10th column of an 80-character display, your y is 2 (starting at 0) and your x is 9, so you are at 2*80+9 = string[169]. Wow, was that tedious. Of course, this does not take into account the newline, which is where the pointer array would come in. You position the pointers every 80 bytes OR after the next newline, and map the y coordinate to them. So a better algorithm would be to use ptr[2] + x, presuming you have a simple function that can assign such a pointer array to a string.

What would really be tedious is if you spared yourself such an immense complication by using a 2D array, so you could say: look, it's simple, now I am at array[2][9]! True, but if you have to keep shifting the text in that array, you will be performing significantly more operations for every insertion and deletion, because with a string all you have to do is split it at the appropriate point, add or remove the appropriate text, and put it back together. Now picture doing that to the first paragraph of a 20-page document that has been split into 800 separate strings. Of course, you only have to shift the data up to the next incomplete row (a line ending in a newline, like the end of a paragraph) -- so if that was, say, 10 lines (meaning shifting the data in all ten following rows), then after that all you have to do is renumber the 785 remaining array elements*. Hmm.

    In short, I promise that using a 2D array in that manner will mean:
    1. Having to write MORE code, not less
2. A final product that is wasteful of resources, i.e., slow and inefficient


Nyda is talking about the same idea as I am, except that s/he has a slightly different approach to managing the memory. I don't like the secondary buffer idea; allocating 1 KB at a time is plenty for general use. Make it 4 KB even -- big deal.

    * as Dino pointed out, a linked list will spare you that last part, but not the fun with the ten shifted lines. I still cannot comprehend how you think either of those options could possibly be easier, more optimal, or less tedious than just using a text string.
    Last edited by MK27; 09-04-2009 at 09:10 AM.

  6. #36
    Registered User
    Join Date
    Aug 2009
    Posts
    198
    Quote Originally Posted by MK27 View Post
    All you have to do is take y*width + x and that will give you the character position in the string.
    How are you supposed to do that if each line is a different length?

  7. #37
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Here's what I would do.

    1. Each line is in its own buffer.
    2. The document is represented as an array of buffer pointers. This array is dynamic and can grow as more lines are added. I would not bother ever shrinking the array when a line is deleted.
3. The buffers are drawn from buffer pools of fixed sizes. There would be a pool for lengths 16, 32, 64, 128, 256, etc., in that progression.

Line buffers are NOT resized at every insertion/deletion. Only when a buffer falls to half full is it swapped for the next smaller size, and only when it fills up is it swapped for the next larger size. When a buffer pool is depleted, more buffers are allocated for the pool.
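Roughly, the size selection would look like this (just a sketch; the names are placeholders):
Code:
#include <stddef.h>

/* sketch: pick the pool bucket for a line of a given length
   (buckets of 16, 32, 64, 128, 256, ... as above) */
size_t bucket_size(size_t line_len)
{
    size_t size = 16;
    while (size < line_len + 1)     /* +1 for the terminating '\0' */
        size *= 2;
    return size;
}

/* swap buffers only at the thresholds, never on every edit */
int should_shrink(size_t fill, size_t cap) { return cap > 16 && fill + 1 <= cap / 2; }
int should_grow  (size_t fill, size_t cap) { return fill + 1 >= cap; }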
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  8. #38
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MTK View Post
    How are you supposed to do that if each line is a different length?
    Did you read the entire paragraph?!??

    Quote Originally Posted by MK27 View Post
    Of course, this does not take into account the newline, which is where the pointer array would come in. You position the pointers every 80 bytes OR after the next newline, and map the y coordinate to them. So a better algorithm would be to use ptr[2] + x, presuming you have a simple function that can assign such a pointer array to a string.
    The width is a fixed value returned by getmaxyx() (or whatever). Again, Nyda said pretty much the same thing.

Brewbuck's idea is not a bad one, since I believe by "line" he means a string ending in '\n', not a physical line on the screen. But I still don't think it will give you any advantage unless you are dealing with ridiculously large files, which text files almost never are, and it will still require more coding (for nothing).
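E.g., something like this (a sketch only, assuming curses has been initialized and the pointer array has already been built):
Code:
#include <curses.h>

/* sketch: map the cursor (y, x) to an address in the text,
   given a rows[] pointer array built from the single buffer */
char *at_cursor(char **rows, int y, int x)
{
    int maxy, maxx;
    getmaxyx(stdscr, maxy, maxx);   /* maxx = the fixed display width */
    (void)maxy;
    if (x >= maxx)
        return NULL;                /* off the right edge */
    return rows[y] + x;
}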
    Last edited by MK27; 09-04-2009 at 11:23 AM.

  9. #39
    Registered User
    Join Date
    Aug 2009
    Posts
    198
That sure seems like a huge waste of memory, considering the wide variation in line length in the average source code file.

    In the attachment I have a diagram of my idea:

  10. #40
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Something else I haven't seen mentioned is SGI's "rope" data type: rope<T, Alloc>

    It is specifically geared for string manipulation tasks and has some interesting efficiency properties.

  11. #41
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MTK View Post
That sure seems like a huge waste of memory, considering the wide variation in line length in the average source code file.

    In the attachment I have a diagram of my idea:
Right, now think about it this way:
Code:
char string[4096] = "hello\nworld";
char **ptrs = malloc(3 * sizeof(char *));
ptrs[0] = &string[0];    /* row 0: "hello" */
ptrs[1] = &string[6];    /* row 1: "world", just past the '\n' */
ptrs[2] = NULL;          /* spare slot / sentinel */
As mentioned, you need a function to go through the string and assign the pointers (and possibly expand the array)*. So now you can modify the string (easy) and reassign the pointers, also easy.

    BTW: The approach that uses the least memory is, by definition, the single string.

* so maybe go through the string, record the positions where the pointers should go, and then you will know how many (more) you need before the actual assignment. Of course, "string" should be realloc'able as well.
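Something like this sketch, say -- two passes, count then assign (untested, names are placeholders):
Code:
#include <stdlib.h>

/* sketch: count the lines, (re)allocate the pointer array to fit,
   then assign one pointer per line.  Returns the line count. */
size_t assign_line_ptrs(char *text, char ***ptrs_io)
{
    size_t count = 1;                       /* the first line always exists */
    for (char *p = text; *p; p++)
        if (*p == '\n' && p[1])
            count++;                        /* a new line starts after each '\n' */

    char **ptrs = realloc(*ptrs_io, (count + 1) * sizeof *ptrs);
    if (!ptrs)
        return 0;                           /* caller keeps the old array */

    size_t n = 0;
    ptrs[n++] = text;
    for (char *p = text; *p; p++)
        if (*p == '\n' && p[1])
            ptrs[n++] = p + 1;              /* line starts just after the '\n' */
    ptrs[n] = NULL;                         /* sentinel */

    *ptrs_io = ptrs;
    return count;
}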
    Last edited by MK27; 09-04-2009 at 12:18 PM.

  12. #42
    Registered User
    Join Date
    Aug 2009
    Posts
    198
    Quote Originally Posted by MK27 View Post
    So now you can modify the string (easy) and reassign the pointers, also easy.
    But if you have a very, very long file and insert a character near the beginning, you have to shift not just the pointers but the whole text, right?

  13. #43
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MTK View Post
    But if you have a very, very long file and insert a character near the beginning, you have to shift not just the pointers but the whole text, right?
Always do it the same way: split the string at the insertion point*. Concat the insertion onto the first piece, then concat the remaining piece back on and assign the result to the original pointer.

    So with a really big file, this could become a performance hit vs. using an array. However, I think a file that big will be an exception, not the rule. "The rule" will be that all the extra operations necessary with a list or array will be a performance hit vs. the string. In other words, the string is optimal in most cases.

You could easily test this by timing strcat (or just strcpy) on a couple of, say, 500 KB strings (and think: when was the last time you saw a 500 KB text file? That's like, the bible...). But I don't think it will be anything noticeable.

    Code:
char one[500000], two[500000];
memset(two, 'x', sizeof(two) - 1);               /* give the source string some content */
two[sizeof(two) - 1] = '\0';
for (int i = 0; i < 100; i++) strcpy(one, two);  /* copy it a hundred times */
    Now, unless gcc optimizes to recognize that 99 of those might as well be no-ops (which I doubt) I would say it took 0.0 seconds on my computer to strcpy the bible a hundred times.

* by which I mean: if the insertion point is [666], copy from [666] to the end into another string, terminate the original with '\0' at [666], concat the insertion on, then concat the saved copy onto the end (allocating another KB if necessary). The array version is WAY more work to no advantage.
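A sketch of that splice, roughly (minimal error handling, names are placeholders):
Code:
#include <stdlib.h>
#include <string.h>

/* sketch: insert `ins` into `buf` at `pos` by split / concat / rejoin.
   `cap` is the current allocation, grown by another KB when needed. */
char *insert_at(char *buf, size_t *cap, size_t pos, const char *ins)
{
    size_t old_len = strlen(buf), add = strlen(ins);

    if (old_len + add + 1 > *cap) {
        *cap = old_len + add + 1024;        /* allocate another KB */
        buf = realloc(buf, *cap);
        if (!buf)
            return NULL;
    }

    size_t tail_len = old_len - pos;
    char *tail = malloc(tail_len + 1);      /* copy [pos] -> end aside */
    if (!tail)
        return NULL;
    memcpy(tail, buf + pos, tail_len + 1);  /* includes the '\0' */

    buf[pos] = '\0';                        /* split at the insertion point */
    strcat(buf, ins);                       /* first piece + insertion */
    strcat(buf, tail);                      /* ... + the saved tail */
    free(tail);
    return buf;
}
Deletion is the same trick in reverse: copy the tail from past the deleted span, truncate, and concat it back on.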
    Last edited by MK27; 09-04-2009 at 12:38 PM.

  14. #44
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
Unless you are grossly wasting memory, I wouldn't worry about it. Even very large text files are usually only a few megabytes. If your memory wastage is a full 100%, you're still only talking about a few extra megabytes.
