References:

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

https://unix.stackexchange.com/questions/11602/how-can-i-test-the-encoding-of-a-text-file-is-it-valid-and-what-is-it

$ echo "严" > a.txt

$ file -I a.txt

a.txt: text/plain; charset=utf-8

$ hexdump a.txt

0000000 e4 b8 a5 0a

0000004 # final offset: the file is 4 bytes long

$ ls -alh a.txt

-rw-r--r-- 1 wanlerong wheel 4B 6 18 15:00 a.txt

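file -I reports a file's encoding, but it only reads some of the bytes and makes a "best guess".

A file is essentially just a sequence of bytes. It does not record how it was encoded, so any software that reads it has to guess.

This file is exactly 4 bytes long: e4 b8 a5 0a

If we guess that it is UTF-8 encoded, it decodes to: 严\n (the trailing 0a is the newline added by echo)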
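The "guessing" can be made concrete with a small Python sketch. It decodes the same 4 bytes under two different assumptions; Latin-1 is only picked here as an arbitrary second guess for illustration and is not something the notes above mention.

```python
# The 4 bytes that hexdump showed for a.txt
raw = bytes.fromhex("e4b8a50a")

# Guess 1: the bytes are UTF-8
as_utf8 = raw.decode("utf-8")

# Guess 2: the bytes are Latin-1 (one byte per character).
# This also decodes without error, but gives different text.
as_latin1 = raw.decode("latin-1")

print(len(raw))         # 4
print(repr(as_utf8))    # '严\n'
print(repr(as_latin1))  # 'ä¸¥\n'
```

Both decodes succeed, which is why a tool such as file can only offer a best guess: the bytes themselves do not say which interpretation was intended.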

Unicode is a character set: a map from code point (a number) => symbol.

The code point of 严 is 4E25 in hexadecimal, which is 20005 in decimal.

In binary that is 0100 1110 0010 0101
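The same numbers can be checked in Python, where ord and chr convert between a character and its code point. This is just a quick verification sketch.

```python
code_point = ord("严")

print(hex(code_point))              # 0x4e25
print(code_point)                   # 20005
print(format(code_point, "016b"))   # 0100111000100101
print(code_point.bit_length())      # 15 (the leading 0 carries no information)
print(chr(0x4E25))                  # 严
```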

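The UTF-8 rules: the x bits carry the code point, and the remaining bits indicate how many bytes the character occupies.

If the first byte starts with 0, the character fits in a single byte.

If the first byte starts with n consecutive 1 bits (n = 2, 3 or 4), the character occupies n bytes, and each of the following bytes starts with 10.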

1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
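The prefix rule can be sketched as a small Python function that looks only at the first byte. The name utf8_sequence_length is made up for this example, and the sketch checks prefixes only; it does not fully validate UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, and 11111xxx is never used
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")


print(utf8_sequence_length(0xE4))  # 3, the first byte of 严
print(utf8_sequence_length(0x0A))  # 1, the newline
```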

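Representing 严 takes 15 bits. The 2-byte form has only 11 x bits, while the 3-byte form has 16, so 3 bytes are needed.

Regroup 0100 1110 0010 0101 as 0100 111000 100101 and fill it into the x positions of 1110xxxx 10xxxxxx 10xxxxxx

The result, 11100100 10111000 10100101, is the UTF-8 encoded content.

In hexadecimal that is e4 b8 a5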
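The bit filling above can be reproduced by hand in Python and compared with the built-in encoder. The helper encode_3_byte is a hypothetical name for this sketch; it covers only the 3-byte form and, for brevity, does not reject the surrogate range D800 to DFFF.

```python
def encode_3_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF using the 3-byte UTF-8 form."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not use the 3-byte form")
    b1 = 0b1110_0000 | (code_point >> 12)               # 1110xxxx: top 4 bits
    b2 = 0b1000_0000 | ((code_point >> 6) & 0b11_1111)  # 10xxxxxx: middle 6 bits
    b3 = 0b1000_0000 | (code_point & 0b11_1111)         # 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])


encoded = encode_3_byte(0x4E25)
print(" ".join(f"{b:08b}" for b in encoded))  # 11100100 10111000 10100101
print(encoded.hex(" "))                       # e4 b8 a5
print(encoded == "严".encode("utf-8"))        # True
```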