Character Sets

12-15-2008, 03:12 PM
What part of a computer/application handles character sets?

12-15-2008, 04:13 PM
uh, charmap?

12-15-2008, 05:13 PM
No, I mean where the character encodings are held in your computer. For example, when you call printf with something like "(char)65", how does the terminal know to print the letter 'A'? Where is it decided that the number 65 will be used to store the letter 'A'?

12-15-2008, 06:12 PM
In a character map. In the really ancient days (and still during BIOS startup), this was a ROM image containing a bitmap for each of the 256 possible characters, and the hardware would copy the bits onto the screen, one row of pixels at a time for each character.
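To make the ROM idea concrete, here is a rough sketch in Python. The 8x8 bitmap for 'A' is made up for illustration and is not taken from any real ROM; FONT_ROM and render_glyph are hypothetical names.

```python
# Hypothetical character "ROM": one entry per character code.
# Each glyph is 8 rows; each row is one byte, with bit 7 as the leftmost pixel.
# The pixel pattern below is invented for illustration only.
FONT_ROM = {
    65: [0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00],  # 'A'
}

def render_glyph(code):
    """Return the glyph for a character code as 8 strings of '#' and '.'."""
    rows = FONT_ROM[code]
    return [
        "".join("#" if row & (0x80 >> col) else "." for col in range(8))
        for row in rows
    ]

if __name__ == "__main__":
    # The number 65 only becomes an 'A' once it is used as an index into the table.
    for line in render_glyph(ord("A")):
        print(line)
```

The point is that the number itself carries no shape. The encoding (here ASCII, where 'A' is 65) picks the table index, and the table supplies the pixels.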

In a modern system, we have a font rasterizer that either draws a glyph from a template bitmap or draws it from a vectorized form. So, basically, a Z would be drawn as 0,0 -> 1,0; 1,0 -> 0,1; 0,1 -> 1,1, if we assume a capital letter sits in a basic bounding box with sides of 1.0 (some letters DO stick out of the basic bounding box, such as 'g').
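As a toy version of the vector case, the three Z strokes can be scaled from the unit box to a pixel grid and sampled along their length. This is only a sketch: real rasterizers work with curves, hinting and anti-aliasing, and Z_STROKES and rasterize are names made up here. It assumes y grows downwards, as on a screen.

```python
# The Z from the unit bounding box: top bar, diagonal, bottom bar.
# Coordinates are (x, y) with y growing downwards (an assumption of this sketch).
Z_STROKES = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((1.0, 0.0), (0.0, 1.0)),
    ((0.0, 1.0), (1.0, 1.0)),
]

def rasterize(strokes, size):
    """Scale unit-box strokes to a size x size grid and mark the pixels they touch."""
    grid = [["."] * size for _ in range(size)]
    steps = size * 2  # sample densely enough that no pixel along a stroke is skipped
    for (x0, y0), (x1, y1) in strokes:
        for i in range(steps + 1):
            t = i / steps
            x = round((x0 + (x1 - x0) * t) * (size - 1))
            y = round((y0 + (y1 - y0) * t) * (size - 1))
            grid[y][x] = "#"
    return ["".join(row) for row in grid]

if __name__ == "__main__":
    for line in rasterize(Z_STROKES, 5):
        print(line)
```

Because the outline is stored as coordinates rather than pixels, the same three strokes can be drawn at any size by changing the size argument.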

It gets much more interesting when dealing with cursive scripts, such as Arabic, where some of the letters join up with the letters around them [or at least, so I understand, from talking to some of my colleagues who deal with font matters where I work].


12-15-2008, 07:38 PM
So the characters come from fonts (same as charmaps?), which are pictures of letters based on standards such as ASCII and Unicode?

12-16-2008, 03:28 AM
Yes, ultimately the fonts are maps from integer value to the glyph actually drawn.

In reality, it's a lot more complicated. The integers may be remapped before being used in drawing; for example, the WinAPI remaps ANSI strings to UTF-16. Several code points (integers) may be combined and looked up in the font as a single entity. This happens for ligatures (the sequence "fi", for example, is often drawn combined), for some scripts like Arabic, and sometimes for combined characters: Unicode allows letters like Ä to be represented in two forms, either as a single Ä code point (U+00C4) or as an A followed by a combining diaeresis (U+0308). They may be looked up separately in the font, or they could be combined first and then looked up.
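Both the remapping and the two forms of Ä can be seen from Python's standard library. This is a sketch, not what any particular text renderer does internally: it assumes Windows-1252 as the "ANSI" code page (the real one depends on the system locale), and ansi_to_utf16 and code_points are helper names invented here.

```python
import unicodedata

def ansi_to_utf16(raw, codepage="cp1252"):
    """Decode bytes from an assumed ANSI code page and re-encode as UTF-16 (little-endian)."""
    return raw.decode(codepage).encode("utf-16-le")

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

precomposed = "\u00C4"   # Ä as a single code point
decomposed = "A\u0308"   # A followed by a combining diaeresis

if __name__ == "__main__":
    # Remapping: one byte in the assumed code page becomes one 16-bit unit.
    print(ansi_to_utf16(b"\xC4"))

    # Two encodings of the same visible letter compare as different strings...
    print(code_points(precomposed), code_points(decomposed))
    print(precomposed == decomposed)

    # ...until they are normalized to a common form.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
    print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

NFC composes sequences into single code points where one exists, and NFD splits them apart, which is roughly the "combine first, then look up" versus "look up separately" choice described above.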