cyberme0w's Cat logo MEOW

Published: 2026-03-19

HTML Email

Let me preface this by saying that I am going to be slightly pedantic (or, as the Germans say, a Korinthenkacker) for the next x < 10 minutes. If you’re not in the mood to read a bit of a rant, just skip this one.


Much like any other day, after having some breakfast, I checked my email. I used to do this on my phone, but given Android is unlikely to remain an option, I might as well get used to not using a (smart) phone, so I opened up Neomutt and told it to go fetch.

Other than the usual stuff, which promptly got replied to/archived/deleted, I noticed this email:

Concert Hall
World Water Day Celebration and Concert Highlights! (120K)

I’m not much of a newsletter person, and quite honestly think they are a strictly inferior version of RSS/Atom feeds, but I do subscribe to them on occasion when nothing better is available.

“Ok, Mr. Nothing Burger. You got an email. What’s the issue?” I hear you say. Well, that last number was bothering me a bit. I had a few minutes to spare (also, my cat jumped into my lap, making getting up a criminal offense), so I decided to take a closer look. Mainly, I saw 120K and something in my brain went “yeah no that’s bullshit”.

what in the whitespace

I opened up the email’s html part in Neovim and looked for the <body> tag.

871 <body id="a" ...

Well, that’s concerning. The entire file only has 1052 lines, and the body only starts at line 871. But mind you, not all of those 871 lines are nonsensical <style>-spam. Some of them are just empty lines, including one run of 118 in a row. To give you an idea of how much that is, here are 118 lines, numbered:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118

Yeah.

There were a few other spots like this, but none as big.

It’s just whitespace, but for reference, getting rid of it would have reduced the total lines from 1052 to 815.
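For anyone who wants to hunt for the same kind of padding in their own inbox, here is a minimal sketch that counts blank lines and finds the longest consecutive run of them. The function name `blank_stats` is made up for this example, and "blank" is assumed to mean a line containing nothing but whitespace (which also covers stray carriage returns).

```shell
# blank_stats FILE: print total lines, blank lines, and the longest blank run.
# Hypothetical helper for illustration, not part of any existing tool.
blank_stats() {
    awk '
        # A whitespace-only line extends the current run of blanks.
        /^[[:space:]]*$/ { blank++; run++; if (run > longest) longest = run; next }
        # Any other line ends the run.
        { run = 0 }
        END { printf "total=%d blank=%d longest_run=%d\n", NR, blank, longest }
    ' "$1"
}
```

Something like `blank_stats html` would then report the numbers for the HTML part saved as `html`.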

There are also 161 lines of [data-ogsc] styling which - as best I could gather in the two minutes I looked at this - is related to dark color schemes (do feel free to correct me via plaintext email, though). You know what’s better than having to CSS your way out of a light color scheme? Using plain text and allowing the reader to use their own color scheme. Removing those dropped the line count from 815 to 636.
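A rough way to reproduce that kind of count, sketched under some loud assumptions: the stylesheet has no nested braces inside the `[data-ogsc]` rules, and a rule ends at the first line containing `}`. The function name `count_ogsc_lines` is hypothetical, and this is not necessarily how the number above was obtained.

```shell
# count_ogsc_lines FILE: count lines belonging to rules that mention [data-ogsc].
# Hypothetical helper; assumes flat CSS rules with no nested braces.
count_ogsc_lines() {
    awk '
        # A selector mentioning [data-ogsc] opens a rule we want to count.
        !inside && /\[data-ogsc\]/ { inside = 1 }
        inside { n++ }
        # The first closing brace ends that rule (also handles one-line rules).
        inside && /[}]/ { inside = 0 }
        END { print n + 0 }
    ' "$1"
}
```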

That’s close to half the file (by line count) gone.

As a sidenote, using tidy to clean up the original HTML file made it grow slightly:

/tmp ❱ tidy -w html 2>/dev/null | wc -l
1426

But all that is meaningless anyway, since what we really care about (that is, the content of the newsletter) is somewhere in those 181 <body> lines. So let’s tidy those up instead:

/tmp ❱ sed -n '/<body/,/<\/body>$/p' html > body
/tmp ❱ tidy body > tidy-body

By default, tidy wraps lines at 80 characters, which increased the line count to *sigh* 1134. Disabling line wrapping helps a bit:

/tmp ❱ tidy -w body > tidy-body
/tmp ❱ wc -l tidy-body
509 tidy-body

Neat.

to style or not to style

I had the poor idea of checking how many times they use inline styles...

/tmp ❱ grep 'style=' tidy-body | wc -l
209

...and now I’m questioning my sanity. What even is the point of having ~1000 lines of <style> spam if you’re just going to put style="..." stuff in the <body> anyway?
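One caveat on that number: `grep ... | wc -l` counts lines that contain `style=`, not individual occurrences, so a line with two styled tags only counts once. A small sketch of the difference on a made-up two-line sample (the `sample` variable is purely illustrative, and `grep -o` is assumed to be available, as it is in GNU and BSD grep):

```shell
# Invented sample: the first line has two inline styles, the second has one.
sample='<td style="a"><span style="b">x</span></td>
<p style="c">y</p>'

# Lines containing at least one inline style (same thing grep | wc -l measures).
matching_lines=$(printf '%s\n' "$sample" | grep -c 'style=')

# Individual occurrences: -o prints each match on its own line.
# tr strips the padding some wc implementations add.
occurrences=$(printf '%s\n' "$sample" | grep -o 'style=' | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $occurrences"
# prints "lines: 2, occurrences: 3"
```

So the real number of inline styles is at least 209, and possibly higher.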

At this point, my cat has decided to move on with her day, and so will I, so here’s a speed round.

  1. Open up Neovim.

  2. Remove all HTML tags:

    :%s/<.\{-}>//g

  3. Get rid of blocks of newlines:

    :%s/\n\n*/\r\r/g

  4. Move the table values around to work in plaintext.

  5. Get rid of further nonsense (all the &nbsp; and other non-visible characters I found, as well as contact info, address, etc.; the website and newsletter unsubscription link remained).
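The mechanical parts of the steps above (2, 3, and the &nbsp; bit of 5) could also be sketched as a non-interactive filter. This is an approximation rather than the exact Neovim session: the function name `html_to_text` is made up, tags are assumed not to span multiple lines (as is the case after `tidy -w`), and unlike the substitution in step 3 it only collapses runs of blank lines instead of touching every line break.

```shell
# html_to_text: rough tag stripper; reads HTML on stdin, writes text on stdout.
# Hypothetical helper; assumes every tag opens and closes on the same line.
html_to_text() {
    # <[^>]*> stops at the first ">", like the non-greedy <.\{-}> in Vim.
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/[[:space:]]*$//' |
    awk '
        # Remember that blank lines were seen, but print nothing yet.
        NF == 0 { blank = 1; next }
        # Emit a single blank line between blocks of text, none at the start.
        { if (blank && seen) print ""; blank = 0; seen = 1; print }
    '
}
```

Usage would be along the lines of `html_to_text < tidy-body > plaintext`, with step 4 (the table) and the rest of step 5 still done by hand.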

The result is a 48 (wrapped) line long plaintext email, still containing the entirety of the information, in a fraction of the space:

/tmp ❱ size html plaintext
html: 110473 bytes
plaintext: 2200 bytes

That’s roughly 2% of the initial size. For reference, this entire rant (4.6 KB) would fit inside that original email 24 times.
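The arithmetic behind those two figures, as a quick sketch. The byte counts come from the output above; treating "4.6 KB" as 4600 bytes is an assumption made here for the calculation.

```shell
html_bytes=110473   # size of the HTML part, from the output above
text_bytes=2200     # size of the plaintext version, from the output above
rant_bytes=4600     # assumption: "4.6 KB" taken as 4600 bytes

# awk does the floating-point math that plain shell arithmetic cannot.
report=$(awk -v h="$html_bytes" -v t="$text_bytes" -v r="$rant_bytes" 'BEGIN {
    printf "plaintext is %.2f%% of the HTML\n", 100 * t / h
    printf "the rant fits %.1f times\n", h / r
}')

printf '%s\n' "$report"
# prints:
#   plaintext is 1.99% of the HTML
#   the rant fits 24.0 times
```

With the actual files at hand, the sizes could be read with `wc -c < html` and `wc -c < plaintext` instead of being typed in.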

Ugh.