ASCII, Unicode, and UTF-8 — a Practical Guide
Text looks simple until you ship it. Then “é” becomes “é”, emoji break your logs, and databases refuse to sort correctly. This article gives you a solid mental model of how text becomes bytes, why ASCII still matters, how Unicode fixes the global text problem, and why UTF-8 is the default encoding you should reach for.
1) Characters, code points, bytes
- Character: the abstract “letter/symbol” humans see (e.g., `A`, `é`, 🙂).
- Code point: a number assigned to a character. In Unicode, `A` is U+0041, `é` is U+00E9, and 🙂 is U+1F642.
- Encoding: a method that turns code points into bytes (binary) and back.
Computers store and transmit bytes, not characters. Encodings are the agreement for mapping between the two.
2) ASCII: the OG mapping (7-bit)
ASCII defines 128 code points (0–127). It fits in 7 bits but is commonly stored as a full byte with the top bit set to 0, which is why every ASCII character has a natural 8-bit binary form.
Examples:
- `A` → decimal 65 → hex `0x41` → binary `01000001`
- `a` → decimal 97 → hex `0x61` → binary `01100001`
- `!` → decimal 33 → hex `0x21` → binary `00100001`
ASCII covers:
- control characters (0–31 and 127): `NUL`, `LF`, `CR`, etc.
- printable characters (32–126): space, digits, punctuation, letters.
ASCII is a subset of most modern encodings (including UTF-8).
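A quick way to see this mapping in action is Python's built-ins (a minimal sketch):

```python
# Round-trip a few ASCII characters through their numeric values.
for ch in "Aa!":
    n = ord(ch)                       # character -> code point (equals the ASCII value here)
    print(ch, n, hex(n), f"{n:08b}")  # e.g., A 65 0x41 01000001
assert chr(65) == "A"                 # and back again
```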
3) Unicode: one space for all writing systems
ASCII wasn’t enough for world languages. Unicode assigns a unique code point to virtually every character in use (over 149k characters and counting). Code points are written like U+XXXX (hex).
A code space that large still needs a way to be serialized to bytes, which is where the Unicode encodings come in:
- UTF-8: variable length (1–4 bytes), ASCII-compatible. Dominant on the web.
- UTF-16: 2 or 4 bytes (uses surrogate pairs). Common in some OS APIs.
- UTF-32: fixed 4 bytes per character. Simple, but space-heavy.
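A short Python sketch makes the size trade-off concrete (using the little-endian codec names to skip the BOM):

```python
# Same three characters, three encodings, three sizes.
s = "Aé🙂"                          # 3 code points: U+0041, U+00E9, U+1F642
print(len(s.encode("utf-8")))       # 7  -> 1 + 2 + 4 bytes
print(len(s.encode("utf-16-le")))   # 8  -> 2 + 2 + 4 bytes (surrogate pair for 🙂)
print(len(s.encode("utf-32-le")))   # 12 -> 4 bytes per code point
```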
4) How UTF-8 works (the short, useful version)
UTF-8 uses prefix bits to indicate how many bytes a character uses:
| Bytes | Pattern (binary) | Payload bits |
|---|---|---|
| 1 | 0xxxxxxx | 7 |
| 2 | 110xxxxx 10xxxxxx | 11 |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
- All ASCII (U+0000–U+007F) encodes as one byte: identical to ASCII bytes.
- Non-ASCII uses 2–4 bytes.
Examples:
- `H` (U+0048): `01001000` (1 byte)
- `é` (U+00E9): code point in binary is `11101001` → UTF-8 `11000011 10101001` (hex `C3 A9`)
- 🙂 (U+1F642): UTF-8 `F0 9F 99 82` (binary `11110000 10011111 10011001 10000010`)
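For illustration, here is a hand-rolled encoder that follows the table's bit patterns directly; in real code you would just call `str.encode("utf-8")`:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point per the table above (illustrative; skips surrogate checks)."""
    if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(0x00E9) == "é".encode("utf-8")     # C3 A9
assert utf8_encode(0x1F642) == "🙂".encode("utf-8")   # F0 9F 99 82
```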
5) Common pitfalls (and how to avoid them)
Mojibake (“é” instead of “é”)
Happens when bytes encoded in one encoding (often UTF-8) are decoded as another (often ISO-8859-1/Windows-1252).
Fix: ensure every boundary (file save, DB column, HTTP header, terminal) agrees on UTF-8.
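You can reproduce the failure in a few lines of Python:

```python
# UTF-8 bytes misread as Latin-1: the classic mojibake.
good = "é".encode("utf-8")      # b'\xc3\xa9'
print(good.decode("latin-1"))   # 'é'  <- wrong decoder, garbled text
print(good.decode("utf-8"))     # 'é'   <- matching decoder, correct text
```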
BOM confusion
A BOM (Byte Order Mark) is optional in UTF-8 (EF BB BF). Some tools add it; others choke on it.
Tip: prefer UTF-8 without BOM unless a consumer explicitly requires it.
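In Python, the `utf-8-sig` codec handles a leading BOM for you (a small sketch):

```python
# 'utf-8-sig' strips a leading BOM on decode (and writes one on encode).
data = b"\xef\xbb\xbfhello"        # file content that starts with a UTF-8 BOM
print(data.decode("utf-8"))        # '\ufeffhello' -> the BOM leaks into the text
print(data.decode("utf-8-sig"))    # 'hello'       -> BOM removed
```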
Surrogates in UTF-16
UTF-16 uses surrogate pairs for code points beyond U+FFFF. If you slice strings by code units, you can split pairs and corrupt text.
Tip: slice by code points (or grapheme clusters), not bytes or 16-bit units.
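Python strings index by code point, so the pitfall is easiest to see by inspecting the UTF-16 code units directly (sketch):

```python
import struct

# 🙂 (U+1F642) needs two UTF-16 code units: a surrogate pair.
units = struct.unpack("<2H", "🙂".encode("utf-16-le"))
print([hex(u) for u in units])   # ['0xd83d', '0xde42'] -> high + low surrogate
# Slicing between them (keeping only the first unit) corrupts the text.
```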
“One char == one byte” assumption
This only holds for ASCII. In UTF-8, a “character” can be 1–4 bytes; a user-perceived character (grapheme) can be multiple code points (e.g., 🇺🇳 is two code points).
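A Python sketch of the gap between those counts:

```python
# One user-perceived character, two code points, eight UTF-8 bytes.
flag = "🇺🇳"                        # U+1F1FA + U+1F1F3 (regional indicators)
print(len(flag))                    # 2 code points
print(len(flag.encode("utf-8")))    # 8 bytes
```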
6) Best practices
- Default to UTF-8 everywhere: files, HTTP, databases, message queues.
- Declare it explicitly:
  - HTTP: `Content-Type: text/html; charset=UTF-8`
  - HTML: `<meta charset="utf-8">`
  - Databases: choose UTF-8 encodings/collations; ensure client/driver settings match.
- Validate inputs; reject illegal UTF-8 sequences.
- Normalize when comparing or hashing Unicode (NFC recommended unless you have a reason otherwise; see the sketch after this list).
- Be careful with “length”:
  - bytes (storage, transport limits),
  - code points (Unicode semantics),
  - grapheme clusters (what users think of as “characters”).
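Here is a small Python sketch of the normalization and validation points, using only the standard library:

```python
import unicodedata

# NFC vs NFD: same rendered text, different code point sequences.
nfc = "é"                      # U+00E9 (precomposed)
nfd = "e\u0301"                # U+0065 + U+0301 (combining acute)
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True

# Validation: strict decoding rejects illegal UTF-8 sequences.
try:
    b"\xc3\x28".decode("utf-8")    # 0x28 is not a valid continuation byte
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```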
7) Practical conversions
Python
```python
# text -> bytes (UTF-8)
b = "Café 🙂".encode("utf-8")   # b'Caf\xc3\xa9 \xf0\x9f\x99\x82'
# bytes -> text
s = b.decode("utf-8")
# binary for each byte
bits = " ".join(f"{byte:08b}" for byte in b)
# safe length concepts
import regex  # pip install regex for grapheme support
codepoints = len(s)                        # number of Python code points
graphemes = len(regex.findall(r"\X", s))   # user-perceived characters
```
JavaScript (UTF-16 code units under the hood)
```javascript
const enc = new TextEncoder();             // UTF-8
const dec = new TextDecoder("utf-8");
const bytes = enc.encode("Café 🙂");       // Uint8Array
const text = dec.decode(bytes);
// Beware: JS .length counts UTF-16 code units, not graphemes.
const codePoints = [..."Café 🙂"].length;  // safer for code points
```
Go
s := "Café 🙂"
b := []byte(s) // UTF-8 bytes
runes := []rune(s) // Unicode code points
for _, r := range s { // range iterates runes
_ = r
}
Java
byte[] bytes = "Café 🙂".getBytes(StandardCharsets.UTF_8);
String s = new String(bytes, StandardCharsets.UTF_8);
int codePoints = s.codePointCount(0, s.length());
8) ASCII quick reference (selected)
- Digits `0`–`9` → 48–57 → `00110000`–`00111001`
- Uppercase `A`–`Z` → 65–90 → `01000001`–`01011010`
- Lowercase `a`–`z` → 97–122 → `01100001`–`01111010`
- Space: 32 → `00100000`
- Newline (LF): 10 → `00001010`
- Carriage return (CR): 13 → `00001101`
9) Debugging checklist
- What is the actual byte sequence? (hex dump via `xxd`, `hexdump`, or `od -An -t x1`)
- What encoding was used to produce those bytes? (it should be UTF-8)
- Where is it declared? (HTTP headers, DB collation, file metadata)
- Where is it decoded? (client/consumer code path)
- Is normalization involved? (NFC vs NFD affects equality)
- Are you slicing by bytes? (you might split a multi-byte UTF-8 sequence)
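If no hex-dump tool is handy, two lines of Python answer the first question (sketch):

```python
# Dump each byte as hex and binary to see exactly what is on the wire.
data = "é".encode("utf-8")
print(" ".join(f"{b:02x}" for b in data))   # c3 a9
print(" ".join(f"{b:08b}" for b in data))   # 11000011 10101001
```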
10) Quick exercises
1. Convert “Hi!” to binary (UTF-8): `H` `01001000`, `i` `01101001`, `!` `00100001` → `01001000 01101001 00100001`
2. Convert `C3 A9` (hex) back to a character: `11000011 10101001` → `é`
3. Encode U+1F642 (🙂) to UTF-8 by pattern: `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` → `F0 9F 99 82`
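You can check all three answers from a Python REPL (sketch):

```python
# Verify the exercise answers.
print(" ".join(f"{b:08b}" for b in "Hi!".encode("utf-8")))  # 01001000 01101001 00100001
print(bytes.fromhex("c3a9").decode("utf-8"))                # é
print("🙂".encode("utf-8").hex(" "))                        # f0 9f 99 82
```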
11) TL;DR
- ASCII is the first 128 Unicode code points and maps 1:1 to UTF-8 bytes.
- Unicode is the universal character set; UTF-8 is the compact, compatible, and ubiquitous way to encode it.
- Always declare and enforce UTF-8 at system boundaries.
- Treat “length” carefully (bytes vs code points vs grapheme clusters).