ASCII, Unicode, and UTF-8 — a Practical Guide
Text looks simple until you ship it. Then “é” becomes “é”, emoji break your logs, and databases refuse to sort correctly. This article gives you a solid mental model of how text becomes bytes, why ASCII still matters, how Unicode fixes the global text problem, and why UTF-8 is the default encoding you should reach for.
1) Characters, code points, bytes
- Character: the abstract “letter/symbol” humans see (e.g., `A`, `é`, 🙂).
- Code point: a number assigned to a character. In Unicode, `A` is U+0041, `é` is U+00E9, and 🙂 is U+1F642.
- Encoding: a method that turns code points into bytes (binary) and back.
Computers store and transmit bytes, not characters. Encodings are the agreement for mapping between the two.
2) ASCII: the OG mapping (7-bit)
ASCII defines 128 code points (0–127). It fits in 7 bits but is commonly stored as a full byte with the top bit set to 0, which is why every ASCII character has a natural 8-bit binary form.
Examples:
- `A` → decimal 65 → hex `0x41` → binary `01000001`
- `a` → decimal 97 → hex `0x61` → binary `01100001`
- `!` → decimal 33 → hex `0x21` → binary `00100001`
ASCII covers:
- control characters (0–31 and 127): `NUL`, `LF`, `CR`, etc.
- printable characters (32–126): space, digits, punctuation, letters.
ASCII is a subset of most modern encodings (including UTF-8).
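A quick way to see this mapping in action is Python's built-ins (a minimal sketch):

```python
# Round-trip a few ASCII characters through their numeric values.
for ch in "Aa!":
    n = ord(ch)                       # character -> code point (equals the ASCII value here)
    print(ch, n, hex(n), f"{n:08b}")  # e.g., A 65 0x41 01000001
assert chr(65) == "A"                 # and back again
```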
3) Unicode: one space for all writing systems
ASCII wasn’t enough for world languages. Unicode assigns a unique code point to virtually every character in use (over 149k characters and counting). Code points are written like U+XXXX (hex).
A code space that large still needs a way to be serialized to bytes, which is where the Unicode encodings come in:
- UTF-8: variable length (1–4 bytes), ASCII-compatible. Dominant on the web.
- UTF-16: 2 or 4 bytes (uses surrogate pairs). Common in some OS APIs.
- UTF-32: fixed 4 bytes per character. Simple, but space-heavy.
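A short Python sketch makes the size trade-off concrete (using the little-endian codec names to skip the BOM):

```python
# Same three characters, three encodings, three sizes.
s = "Aé🙂"                          # 3 code points: U+0041, U+00E9, U+1F642
print(len(s.encode("utf-8")))       # 7  -> 1 + 2 + 4 bytes
print(len(s.encode("utf-16-le")))   # 8  -> 2 + 2 + 4 bytes (surrogate pair for 🙂)
print(len(s.encode("utf-32-le")))   # 12 -> 4 bytes per code point
```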
4) How UTF-8 works (the short, useful version)
UTF-8 uses prefix bits to indicate how many bytes a character uses:
| Bytes | Pattern (binary) | Payload bits |
|---|---|---|
| 1 | 0xxxxxxx | 7 |
| 2 | 110xxxxx 10xxxxxx | 11 |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
- All ASCII (U+0000–U+007F) encodes as one byte: identical to ASCII bytes.
- Non-ASCII uses 2–4 bytes.
Examples:
- `H` (U+0048): `01001000` (1 byte)
- `é` (U+00E9): code point in binary is `11101001` → UTF-8 `11000011 10101001` (hex `C3 A9`)
- 🙂 (U+1F642): UTF-8 `F0 9F 99 82` (binary `11110000 10011111 10011001 10000010`)
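For illustration, here is a hand-rolled encoder that follows the table's bit patterns directly; in real code you would just call `str.encode("utf-8")`:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point per the table above (illustrative; skips surrogate checks)."""
    if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(0x00E9) == "é".encode("utf-8")     # C3 A9
assert utf8_encode(0x1F642) == "🙂".encode("utf-8")   # F0 9F 99 82
```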
5) Common pitfalls (and how to avoid them)
Mojibake (“é” instead of “é”)
Happens when bytes encoded in one encoding (often UTF-8) are decoded as another (often ISO-8859-1/Windows-1252).
Fix: ensure every boundary (file save, DB column, HTTP header, terminal) agrees on UTF-8.
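You can reproduce the failure in a few lines of Python:

```python
# UTF-8 bytes misread as Latin-1: the classic mojibake.
good = "é".encode("utf-8")      # b'\xc3\xa9'
print(good.decode("latin-1"))   # 'é'  <- wrong decoder, garbled text
print(good.decode("utf-8"))     # 'é'   <- matching decoder, correct text
```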
BOM confusion
A BOM (Byte Order Mark) is optional in UTF-8 (EF BB BF). Some tools add it; others choke on it.
Tip: prefer UTF-8 without BOM unless a consumer explicitly requires it.
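In Python, the `utf-8-sig` codec handles a leading BOM for you (a small sketch):

```python
# 'utf-8-sig' strips a leading BOM on decode (and writes one on encode).
data = b"\xef\xbb\xbfhello"        # file content that starts with a UTF-8 BOM
print(data.decode("utf-8"))        # '\ufeffhello' -> the BOM leaks into the text
print(data.decode("utf-8-sig"))    # 'hello'       -> BOM removed
```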
Surrogates in UTF-16
UTF-16 uses surrogate pairs for code points beyond U+FFFF. If you slice strings by code units, you can split pairs and corrupt text.
Tip: slice by code points (or grapheme clusters), not bytes or 16-bit units.
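Python strings index by code point, so the pitfall is easiest to see by inspecting the UTF-16 code units directly (sketch):

```python
import struct

# 🙂 (U+1F642) needs two UTF-16 code units: a surrogate pair.
units = struct.unpack("<2H", "🙂".encode("utf-16-le"))
print([hex(u) for u in units])   # ['0xd83d', '0xde42'] -> high + low surrogate
# Slicing between them (keeping only the first unit) corrupts the text.
```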
“One char == one byte” assumption
This only holds for ASCII. In UTF-8, a “character” can be 1–4 bytes; a user-perceived character (grapheme) can be multiple code points (e.g., 🇺🇳 is two code points).
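A Python sketch of the gap between those counts:

```python
# One user-perceived character, two code points, eight UTF-8 bytes.
flag = "🇺🇳"                        # U+1F1FA + U+1F1F3 (regional indicators)
print(len(flag))                    # 2 code points
print(len(flag.encode("utf-8")))    # 8 bytes
```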
6) Best practices
- Default to UTF-8 everywhere: files, HTTP, databases, message queues.
- Declare it explicitly:
  - HTTP: `Content-Type: text/html; charset=UTF-8`
  - HTML: `<meta charset="utf-8">`
  - Databases: choose UTF-8 encodings/collations; ensure client/driver settings match.
- Validate inputs; reject illegal UTF-8 sequences.
- Normalize when comparing or hashing Unicode (NFC recommended unless you have a reason otherwise; see the sketch after this list).
- Be careful with “length”:
  - bytes (storage, transport limits),
  - code points (Unicode semantics),
  - grapheme clusters (what users think of as “characters”).
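Here is a small Python sketch of the normalization and validation points, using only the standard library:

```python
import unicodedata

# NFC vs NFD: same rendered text, different code point sequences.
nfc = "é"                      # U+00E9 (precomposed)
nfd = "e\u0301"                # U+0065 + U+0301 (combining acute)
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True

# Validation: strict decoding rejects illegal UTF-8 sequences.
try:
    b"\xc3\x28".decode("utf-8")    # 0x28 is not a valid continuation byte
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```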
7) Practical conversions
Python
```python
# text -> bytes (UTF-8)
b = "Café 🙂".encode("utf-8")   # b'Caf\xc3\xa9 \xf0\x9f\x99\x82'
# bytes -> text
s = b.decode("utf-8")
# binary for each byte
bits = " ".join(f"{byte:08b}" for byte in b)
# safe length concepts
import regex  # pip install regex for grapheme support
codepoints = len(s)                        # number of Python code points
graphemes = len(regex.findall(r"\X", s))   # user-perceived characters
```
JavaScript (UTF-16 code units under the hood)
```javascript
const enc = new TextEncoder();             // UTF-8
const dec = new TextDecoder("utf-8");
const bytes = enc.encode("Café 🙂");       // Uint8Array
const text = dec.decode(bytes);
// Beware: JS .length counts UTF-16 code units, not graphemes.
const codePoints = [..."Café 🙂"].length;  // safer for code points
```
Go
s := "Café 🙂"
b := []byte(s) // UTF-8 bytes
runes := []rune(s) // Unicode code points
for _, r := range s { // range iterates runes
_ = r
}
Java
byte[] bytes = "Café 🙂".getBytes(StandardCharsets.UTF_8);
String s = new String(bytes, StandardCharsets.UTF_8);
int codePoints = s.codePointCount(0, s.length());
8) ASCII quick reference (selected)
- Digits `0`–`9` → 48–57 → `00110000`–`00111001`
- Uppercase `A`–`Z` → 65–90 → `01000001`–`01011010`
- Lowercase `a`–`z` → 97–122 → `01100001`–`01111010`
- Space: 32 → `00100000`
- Newline (LF): 10 → `00001010`
- Carriage return (CR): 13 → `00001101`
9) Debugging checklist
- What is the actual byte sequence? (hex dump via `xxd`, `hexdump`, or `od -An -t x1`)
- What encoding was used to produce those bytes? (it should be UTF-8)
- Where is it declared? (HTTP headers, DB collation, file metadata)
- Where is it decoded? (client/consumer code path)
- Is normalization involved? (NFC vs NFD affects equality)
- Are you slicing by bytes? (you might split a multi-byte UTF-8 sequence)
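If no hex-dump tool is handy, two lines of Python answer the first question (sketch):

```python
# Dump each byte as hex and binary to see exactly what is on the wire.
data = "é".encode("utf-8")
print(" ".join(f"{b:02x}" for b in data))   # c3 a9
print(" ".join(f"{b:08b}" for b in data))   # 11000011 10101001
```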
10) Quick exercises
1. Convert “Hi!” to binary (UTF-8): `H` `01001000`, `i` `01101001`, `!` `00100001` → `01001000 01101001 00100001`
2. Convert `C3 A9` (hex) back to a character: `11000011 10101001` → `é`
3. Encode U+1F642 (🙂) to UTF-8 by pattern: `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` → `F0 9F 99 82`
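You can check all three answers from a Python REPL (sketch):

```python
# Verify the exercise answers.
print(" ".join(f"{b:08b}" for b in "Hi!".encode("utf-8")))  # 01001000 01101001 00100001
print(bytes.fromhex("c3a9").decode("utf-8"))                # é
print("🙂".encode("utf-8").hex(" "))                        # f0 9f 99 82
```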
11) TL;DR
- ASCII is the first 128 Unicode code points and maps 1:1 to UTF-8 bytes.
- Unicode is the universal character set; UTF-8 is the compact, compatible, and ubiquitous way to encode it.
- Always declare and enforce UTF-8 at system boundaries.
- Treat “length” carefully (bytes vs code points vs grapheme clusters).