The Story of 'Н': More Than Meets the Eye
Lecture 6

Binary and Bytes: 'Н' in the Machine

Transcript

SPEAKER_1: Alright, so last time we touched on the geometric consistency of н. Now I want to go somewhere completely different: what happens to н when it enters a computer?

SPEAKER_2: Right, and this is where the visual trap we opened with in lecture one becomes a technical problem, not just a perceptual one. To a human eye, н and H appear almost identical, but in digital encoding they are distinct. To a computer, they are completely different entities with different addresses in memory.

SPEAKER_1: Different addresses. So what is the machine actually looking at when it stores these letters, their shapes?

SPEAKER_2: Not shapes, no. Computers operate on the base-2 binary system: just zeros and ones. Everything, every character, every image, every instruction, gets reduced to sequences of those two digits. The smallest unit is a bit, a term coined by John Tukey as a contraction of 'binary digit.' Eight bits make a byte, and a byte can represent 256 different values, two to the power of eight.

SPEAKER_1: So where does н fit into that system?

SPEAKER_2: Every character gets assigned a unique numerical code in Unicode, the global standard for text encoding. The Cyrillic capital н sits at code point U+041D. The Latin capital H sits at U+0048. Those are different numbers, which means different binary sequences, which means the machine treats them as entirely unrelated symbols, even though a human might glance at both and see the same shape.

SPEAKER_1: How different are those binary sequences, actually? Is it a small difference or a large one?

SPEAKER_2: Meaningfully different. In UTF-8 encoding, the Cyrillic н requires two bytes, sixteen bits, to store. The Latin H fits in a single byte, eight bits. So н in UTF-8 is the binary sequence 1101 0000 1001 1101, the two bytes 0xD0 and 0x9D. H is just 0100 1000. Different length, different pattern, different everything at the hardware level.
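[The code points and byte patterns above can be checked in a few lines of plain Python, using nothing beyond the standard library:]

```python
# Cyrillic capital Н (U+041D) vs Latin capital H (U+0048) in UTF-8.
cyr = "\u041d"  # Н
lat = "\u0048"  # H

print(hex(ord(cyr)))  # 0x41d
print(hex(ord(lat)))  # 0x48

cyr_bytes = cyr.encode("utf-8")  # two bytes: 0xD0 0x9D
lat_bytes = lat.encode("utf-8")  # one byte:  0x48

print([format(b, "08b") for b in cyr_bytes])  # ['11010000', '10011101']
print([format(b, "08b") for b in lat_bytes])  # ['01001000']
```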
SPEAKER_1: So our listener, someone like Николай who grew up reading Cyrillic, might never think about this, but every time they type н, the machine is doing something more complex than when someone types H.

SPEAKER_2: Exactly. And that complexity is intentional. Unicode was designed to give every character in every script its own unambiguous address. Before Unicode, systems often reused the same byte values for different characters, which led to encoding conflicts. The separation of н and H into distinct code points prevents that chaos.

SPEAKER_1: But here's what I want to push on: if the machine knows they're different, why does it matter that they look the same to humans? Where does that actually cause problems?

SPEAKER_2: This leads to homoglyph attacks, a real security concern. Because Cyrillic and Latin letters can be visually near-identical, a bad actor can register a domain name like 'bаnk.com' where the 'а' is Cyrillic, not Latin. The human eye reads 'bank.com.' The machine routes to a completely different server. It's a phishing technique that exploits the gap between human visual processing and machine-level encoding.

SPEAKER_1: That's genuinely unsettling. So the same visual confusion we talked about in lecture one, the false friend problem, becomes an attack vector.

SPEAKER_2: Precisely. And н specifically is one of the higher-risk characters because its Latin lookalike, H, appears in so many high-value domain names. The internationalized domain name (IDN) system was meant to bring non-Latin scripts online, but it inadvertently enabled these substitutions.

SPEAKER_1: How do browsers and systems defend against that?

SPEAKER_2: Several mechanisms. Browsers like Chrome and Firefox will display a domain's ASCII-compatible encoding, called Punycode, instead of the readable script when a name mixes characters from different writing systems. A domain mixing Cyrillic н with Latin letters then appears as 'xn--something' instead of the deceptive form.
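[The mixed-script check the browsers perform can be sketched roughly as follows. This is only an illustration: real browsers use the Unicode Script property and far more elaborate display policies, while this stdlib-only version approximates each character's script from the first word of its Unicode name:]

```python
import unicodedata

def scripts_in(label: str) -> set:
    # Approximate each letter's script by the first word of its Unicode
    # character name, e.g. 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'.
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def looks_spoofed(label: str) -> bool:
    # Flag labels that mix confusable alphabets in a single name.
    return len(scripts_in(label) & {"LATIN", "CYRILLIC", "GREEK"}) > 1

pure = "bank"        # all Latin
mixed = "b\u0430nk"  # the 'а' is Cyrillic U+0430

print(looks_spoofed(pure))   # False
print(looks_spoofed(mixed))  # True

# The 'xn--' (Punycode) form a browser can fall back to displaying:
print(mixed.encode("idna"))
```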
It's not perfect, but it forces the machine's knowledge of the distinction into the visible interface.

SPEAKER_1: So the defense is essentially making the machine's internal distinction visible to the human.

SPEAKER_2: That's a clean way to put it. The machine always knew н and H were different, U+041D versus U+0048. The challenge is surfacing that knowledge before the human makes a mistake.

SPEAKER_1: And search engines, do they handle this the same way?

SPEAKER_2: Search engines normalize queries, but standard Unicode normalization never folds one script into another, so a word spelled with Cyrillic н and the same word spelled with Latin H will typically return different results or trigger a disambiguation. Google's indexing treats them as separate strings. That's actually useful for Cyrillic-language search, but it also means a typo that crosses scripts can send someone to a completely empty results page.

SPEAKER_1: One thing I want to make sure we cover: the word 'bit' itself. There's a history there that most people don't know.

SPEAKER_2: John Tukey, a statistician rather than a computer engineer, coined it in 1946. He needed a word for the fundamental unit of information and compressed 'binary digit' into 'bit.' It's one of those cases where a single syllable carries enormous conceptual weight. Every character, including н, ultimately reduces to a string of Tukey's bits.

SPEAKER_1: So for everyone following this course, what's the frame that makes the whole digital picture click?

SPEAKER_2: For our listener, the key insight is this: the visual confusion between н and H that we've been tracking across history, phonetics, and geography doesn't disappear in the digital world; it becomes a security vulnerability. But the machine itself is never confused. Unicode U+041D and U+0048 are entirely different entities, stored differently, routed differently, indexed differently. The problem only exists at the human-machine interface, where eyes see shapes and computers see numbers.
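[The normalization point made above can be verified directly: even NFKC, the most aggressive of the standard Unicode normalization forms, leaves Cyrillic Н untouched rather than folding it into Latin H, and case folding stays inside the script as well. A minimal check in Python:]

```python
import unicodedata

cyr = "\u041d"  # Cyrillic Н
lat = "\u0048"  # Latin H

# Compatibility normalization does not cross scripts:
print(unicodedata.normalize("NFKC", cyr) == lat)  # False
print(unicodedata.normalize("NFKC", cyr) == cyr)  # True

# Case folding maps Н -> н (U+043D), never -> h:
print(cyr.casefold())  # 'н'
print(lat.casefold())  # 'h'
```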