Thursday, September 15, 2016

Code points are an abstraction

In "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", Joel Spolsky discusses Unicode code points and gives a few examples, such as U+0639 representing the Arabic letter Ain and U+0041 representing the English letter A.

Spolsky doesn't come right out and use the word "abstraction", but the concept of an abstraction is exactly what he's talking about when he writes:
OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F.
Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.
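To make the quoted example concrete, here is a minimal Python sketch that lists the code points of a string in the same U+XXXX notation. The helper name `code_points` is my own, chosen for illustration; it does not come from Spolsky's article. Note that nothing here says how the string is stored.

```python
def code_points(text):
    """Return the Unicode code point of each character in U+XXXX notation."""
    # ord() gives the code point as an integer; format it as at least
    # four uppercase hexadecimal digits, the usual way code points are written.
    return [f"U+{ord(ch):04X}" for ch in text]


print(" ".join(code_points("Hello")))
# prints: U+0048 U+0065 U+006C U+006C U+006F
```

The output matches the five code points in the quote: just numbers, with no mention of bytes.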

The Wikipedia article "Abstraction (software engineering)" contains the following quote attributed to John V. Guttag: “The essence of abstractions is preserving information that is relevant in a given context and forgetting information that is irrelevant in that context.”

In the context of Unicode code points, the relevant information is a number, conventionally written in hexadecimal, such as U+0048, and the character that number represents (H). The information we might want to forget (at least temporarily), because it is irrelevant to a general discussion of Unicode, is how many bytes, and which specific bits, are used to store a code point such as U+0048. Those details depend on the encoding that is chosen.
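As a rough illustration of the detail the abstraction lets us set aside, the sketch below encodes the same code point, U+0048, under three encodings and prints the resulting bytes. The helper name `encoded_hex` is hypothetical, and the big-endian variants are used only so that no byte order mark appears in the output.

```python
def encoded_hex(text, encoding):
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


# One code point (U+0048, "H"), three different byte representations.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:>9}: {encoded_hex('H', encoding)}")
# prints:
#     utf-8: 48
# utf-16-be: 00 48
# utf-32-be: 00 00 00 48
```

The code point is 0048 in every case. Only the number of bytes and their bit patterns change, and that is exactly the information the code point abstraction leaves out.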