UTF-8 encodes each Unicode code point using one to four bytes. ASCII characters occupy a single byte with the same value they have in ASCII, so pure-ASCII text is already valid UTF-8 and stays compatible with legacy ASCII tooling.
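The one-to-four-byte range can be checked directly. The following is a minimal sketch; the helper name `utf8_length` and the sample characters are chosen here for illustration and are not part of the original text.

```python
# Sample characters, one from each UTF-8 length class (illustrative choice).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ')}")
```

For the ASCII sample, the UTF-8 bytes are identical to the ASCII bytes, which is what makes legacy ASCII tooling keep working on such text.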
Declaring UTF-8
- Place `<meta charset="utf-8">` early in `<head>`.
- Keep HTTP `Content-Type` headers aligned with the bytes actually served.
- Save files without a BOM unless your tooling demands one, so that BOM expectations do not conflict.
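The checklist above can be partly automated. The sketch below is one possible check, not a definitive implementation: the function names `declared_charset` and `check_body` are hypothetical, and the header parsing borrows Python's standard `email.message.Message` as a convenient parameter parser.

```python
import codecs
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def check_body(content_type: str, body: bytes) -> dict:
    """Report whether the body bytes are decodable under the declared charset."""
    charset = declared_charset(content_type)
    has_bom = body.startswith(codecs.BOM_UTF8)
    try:
        # Assumption: fall back to UTF-8 when the header declares nothing.
        body.decode(charset or "utf-8", errors="strict")
        decodes = True
    except (UnicodeDecodeError, LookupError):
        decodes = False
    return {"charset": charset, "has_utf8_bom": has_bom, "decodes": decodes}


print(check_body("text/html; charset=UTF-8", "Caf\u00e9".encode("utf-8")))
print(check_body("text/html; charset=UTF-8", "Caf\u00e9".encode("cp1252")))
```

A successful decode is a necessary check, not proof: single-byte encodings such as ISO-8859-1 accept every byte sequence, so this test is only meaningful when the declared charset is UTF-8 or another encoding that rejects invalid input.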
Legacy encodings
Windows code pages and ISO-8859-* appear in older systems—transcode to UTF-8 when migrating.
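Transcoding is a decode step followed by an encode step. The sketch below assumes the source encoding is already known (here Windows-1252, as an example); the function name `transcode_to_utf8` and the sample string are illustrative.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes strictly, then re-encode the text as UTF-8."""
    # Strict decoding raises instead of silently substituting characters,
    # so a wrong guess about the source encoding is more likely to surface.
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Simulated legacy input: Windows-1252 bytes for a short string.
legacy = "Caf\u00e9 \u20ac5".encode("cp1252")
converted = transcode_to_utf8(legacy, "cp1252")

print(legacy.hex(" "))     # legacy bytes
print(converted.hex(" "))  # UTF-8 bytes for the same text
```

The converted output is longer than the input because characters outside ASCII take more than one byte in UTF-8.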
Detection pitfalls
Mojibake occurs when bytes are interpreted under the wrong character mapping. Fix the declarations rather than chasing symptoms.
Email + feeds
HTML mail often strips or rewrites encodings—test with real providers; RSS readers may be stricter than browsers.
Example — correct declaration early in head
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Encoding demo</title>
</head>
What mojibake looks like
If Café is stored as UTF-8 but read as ISO-8859-1, you may see CafÃ©. Fix the bytes and headers rather than patching individual strings forever.
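The garbled form can be reproduced in a few lines, which helps when diagnosing which wrong mapping was applied. This is a minimal sketch; the variable names are illustrative.

```python
original = "Caf\u00e9"                      # the intended text
utf8_bytes = original.encode("utf-8")       # stored correctly as UTF-8

# Reading the UTF-8 bytes under the wrong mapping produces mojibake:
# the two bytes of the accented letter become two separate characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Reversing the mistake recovers the text only while no bytes were lost.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

The round trip is useful as a diagnostic to confirm the cause. The durable fix remains correcting the stored bytes and the declared encoding.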
Important interview questions and answers
- Q: What is the safest default character encoding for modern HTML?
  A: UTF-8, declared early with `<meta charset="utf-8">` and matched by server `Content-Type` headers.
- Q: When are HTML entities still useful in UTF-8 pages?
  A: For reserved characters (`&`, `<`) and contexts where explicit escaping avoids parser ambiguity.
- Q: What is the key difference between HTML5 parsing and XHTML parsing?
  A: HTML5 recovers from many errors; XHTML (XML) treats many parse errors as fatal.
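To illustrate the entity answer: in a UTF-8 page only the reserved characters need escaping, while non-ASCII text can stay as-is. The sketch below uses Python's standard `html.escape`; the wrapper name `escape_text` is hypothetical.

```python
import html


def escape_text(text: str) -> str:
    """Escape HTML-reserved characters; other characters pass through unchanged."""
    return html.escape(text, quote=True)


# Reserved characters become entities; the accented letter is left alone.
print(escape_text('Tom & Jerry <3 "caf\u00e9"'))
```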
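Pitfall: The charset `<meta>` must appear early in `<head>`, within the first 1024 bytes of the document; otherwise the browser may settle on an encoding before it reaches the declaration.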