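UTF-8: the basics every programmer should know, or it will bite you without mercy

UTF stands for Unicode Transformation Format. UTF-8 is the most widely used of the UTF formats and is suited to storage, transmission, web pages and more. Every programmer has heard of it, yet few really understand it and use it well, and plenty have been bitten by it.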

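I. Encoding

UTF-8 is a variable-length encoding of Unicode. The original design used 1 to 6 bytes per character; the current standard (RFC 3629) limits it to 1 to 4 bytes, which covers every code point up to U+10FFFF. The byte layout is:

    Code point range        UTF-8 byte sequence (binary)
    U+0000  - U+007F        0xxxxxxx
    U+0080  - U+07FF        110xxxxx 10xxxxxx
    U+0800  - U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
    U+10000 - U+10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx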

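1. Determining the byte count

How do you tell how many bytes a character occupies?

Look at the lead byte and match it against the encoding format above. For example, the lead byte E5 is 11100101 in binary; the leading bits 1110 mark a 3-byte character. The other lengths follow the same way.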

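    #include <stdint.h>

    /* Return the byte length of the UTF-8 character that starts at str,
     * judged from its lead byte alone. */
    static uint8_t utf8_get_bytes(const char *str)
    {
        if ((str[0] & 0x80) == 0)
            return 1;
        else if ((str[0] & 0xE0) == 0xC0)
            return 2;
        else if ((str[0] & 0xF0) == 0xE0)
            return 3;
        else if ((str[0] & 0xF8) == 0xF0)
            return 4;
        else if ((str[0] & 0xFC) == 0xF8)
            return 5;   /* legacy form, not valid under RFC 3629 */
        else if ((str[0] & 0xFE) == 0xFC)
            return 6;   /* legacy form, not valid under RFC 3629 */
        return 1; /* Invalid lead byte: report 1 byte so the caller can step past it */
    }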

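Note: in practice, handling up to 4 bytes is enough. The 5- and 6-byte forms can be ignored, because RFC 3629 restricts UTF-8 to at most 4 bytes.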
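A common use of the lead-byte rule is counting characters instead of bytes. The sketch below is not from the original article: `utf8_char_count` is a hypothetical helper name, and it assumes the input is well-formed, NUL-terminated UTF-8. Rather than reading each length from the lead byte, it relies on the complementary fact that every continuation byte has the form 10xxxxxx.

```c
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
 * Every character has exactly one lead byte, and all continuation
 * bytes match the bit pattern 10xxxxxx, so we count only the bytes
 * that are NOT continuation bytes. Assumes well-formed input. */
size_t utf8_char_count(const char *str)
{
    size_t count = 0;
    for (; *str != '\0'; str++) {
        if (((unsigned char)*str & 0xC0) != 0x80)
            count++;    /* lead byte or ASCII: a new character starts here */
    }
    return count;
}
```

For the 3-byte sequence E5 B9 B3, for instance, the function reports one character even though `strlen` would report three bytes.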

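2. Encoding conversion

UTF-8, UTF-16 and UTF-32 all encode the same Unicode code points and differ only in how the bits are stored, so converting between them takes little more than masking and shifting. (UTF-16 additionally needs surrogate pairs for code points above U+FFFF.)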
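To make the "masking and shifting" concrete, here is a minimal sketch with two illustrative helpers. Both names (`utf8_decode3`, `utf16_encode`) are assumptions made for this example and do not come from the original article. The first decodes a single well-formed 3-byte UTF-8 sequence; the second encodes one code point as UTF-16, including the surrogate-pair case.

```c
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence (1110zzzz 10yyyyyy 10xxxxxx)
 * into its code point. The caller must pass a well-formed sequence. */
uint32_t utf8_decode3(const unsigned char b[3])
{
    return ((uint32_t)(b[0] & 0x0F) << 12) |   /* zzzz   -> bits 15..12 */
           ((uint32_t)(b[1] & 0x3F) << 6)  |   /* yyyyyy -> bits 11..6  */
            (uint32_t)(b[2] & 0x3F);           /* xxxxxx -> bits 5..0   */
}

/* Encode one code point as UTF-16. Returns the number of 16-bit
 * units written to out (1 or 2), or 0 if cp is not encodable. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogate values are not characters */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode range */
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: low 10 bits  */
    return 2;
}
```

Feeding the bytes E5 B9 B3 to `utf8_decode3` yields 0x5E73, the code point of “平”, which fits in a single UTF-16 unit.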

2.1 UTF-8 to UTF-32

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the UTF-8 character at txt[*i], advance *i past it and
     * return its UTF-32 code point (0 on an invalid sequence). */
    static uint32_t to_utf32(const char *txt, uint32_t *i)
    {
        /* Unicode to UTF-8
         * 00000000 00000000 00000000 0xxxxxxx -> 0xxxxxxx
         * 00000000 00000000 00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
         * 00000000 00000000 zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
         * 00000000 000wwwzz zzzzyyyy yyxxxxxx -> 11110www 10zzzzzz 10yyyyyy 10xxxxxx
         */
        uint32_t result = 0;

        /* Dummy 'i' pointer is required */
        uint32_t i_tmp = 0;
        if (i == NULL) i = &i_tmp;

        /* Normal ASCII */
        if ((txt[*i] & 0x80) == 0) {
            result = txt[*i];
            (*i)++;
        }
        /* Real UTF-8 decode */
        else {
            /* 2 bytes UTF-8 code */
            if ((txt[*i] & 0xE0) == 0xC0) {
                result = (uint32_t)(txt[*i] & 0x1F) << 6;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += (txt[*i] & 0x3F);
                (*i)++;
            }
            /* 3 bytes UTF-8 code */
            else if ((txt[*i] & 0xF0) == 0xE0) {
                result = (uint32_t)(txt[*i] & 0x0F) << 12;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += (uint32_t)(txt[*i] & 0x3F) << 6;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += (txt[*i] & 0x3F);
                (*i)++;
            }
            /* 4 bytes UTF-8 code */
            else if ((txt[*i] & 0xF8) == 0xF0) {
                result = (uint32_t)(txt[*i] & 0x07) << 18;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += (uint32_t)(txt[*i] & 0x3F) << 12;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += (uint32_t)(txt[*i] & 0x3F) << 6;
                (*i)++;
                if ((txt[*i] & 0xC0) != 0x80) return 0; /* Invalid UTF-8 code */
                result += txt[*i] & 0x3F;
                (*i)++;
            } else {
                (*i)++; /* Not a UTF-8 char. Go to the next. */
            }
        }
        return result;
    }

    /* Convert a NUL-terminated UTF-8 string to UTF-32 and return the
     * length of the result. Conversion stops at the terminator or at
     * the first invalid sequence, since both make to_utf32 return 0. */
    uint32_t utf8_to_utf32(const char *u8str, uint32_t *u32str)
    {
        uint32_t i = 0, j = 0;
        for (i = 0, j = 0; (u32str[j] = to_utf32(u8str, &i)) != 0; j++);
        return j;
    }

2.2 UTF-32 to UTF-8

    #include <stdint.h>

    /* Encode letter_uni as UTF-8 at str[*index] and advance *index.
     * Values of 0x110000 and above are silently skipped. */
    static void to_utf8(char *str, uint32_t *index, uint32_t letter_uni)
    {
        uint32_t i = *index;
        if (letter_uni < 0x80) {
            str[i] = (char)letter_uni;
            *index = i + 1;
        } else if (letter_uni < 0x0800) {
            str[i]     = (char)(((letter_uni >> 6) & 0x1F) | 0xC0);
            str[i + 1] = (char)(((letter_uni >> 0) & 0x3F) | 0x80);
            *index = i + 2;
        } else if (letter_uni < 0x010000) {
            str[i]     = (char)(((letter_uni >> 12) & 0x0F) | 0xE0);
            str[i + 1] = (char)(((letter_uni >> 6) & 0x3F) | 0x80);
            str[i + 2] = (char)(((letter_uni >> 0) & 0x3F) | 0x80);
            *index = i + 3;
        } else if (letter_uni < 0x110000) {
            str[i]     = (char)(((letter_uni >> 18) & 0x07) | 0xF0);
            str[i + 1] = (char)(((letter_uni >> 12) & 0x3F) | 0x80);
            str[i + 2] = (char)(((letter_uni >> 6) & 0x3F) | 0x80);
            str[i + 3] = (char)(((letter_uni >> 0) & 0x3F) | 0x80);
            *index = i + 4;
        }
    }

    /* Convert a 0-terminated UTF-32 string to UTF-8. Returns the number
     * of bytes written to u8str, including the terminating NUL. */
    uint32_t utf32_to_utf8(const uint32_t *u32str, char *u8str)
    {
        uint32_t i = 0, j = 0;
        do {
            to_utf8(u8str, &i, u32str[j]);
        } while (u32str[j++] != 0);
        return i;
    }

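Note: if you only need to support code points 0x0000-0xFFFF, you can drop the 4-byte branch of the UTF-8 handling and store code points in a 16-bit type (uint16) instead of a 32-bit one (uint32), which halves the storage needed for the decoded string.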
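A reduced encoder along those lines might look like the sketch below. It is an illustration under stated assumptions rather than the article's own code: `bmp_to_utf8` is a hypothetical name, the code point is passed as `uint16_t` so only 0x0000-0xFFFF can be represented at all, and surrogate values (0xD800-0xDFFF) are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point in the range 0x0000-0xFFFF as UTF-8.
 * Writes 1 to 3 bytes to out and returns how many were written.
 * There is no 4-byte branch because uint16_t cannot hold such values.
 * Surrogate values (0xD800-0xDFFF) are NOT rejected in this sketch. */
size_t bmp_to_utf8(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x0800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

The trade-off is that characters outside this range, such as most emoji, can no longer be represented.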

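II. BOM

What is a BOM?

BOM stands for byte-order mark. It applies mainly to files in a Unicode encoding: a few bytes are inserted at the very start of the file as a signature. For UTF-8 the BOM is EF BB BF.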
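Checking for the signature is a three-byte comparison. The helper below is a minimal sketch; the name `has_utf8_bom` is an assumption for illustration, and the caller is expected to have already read the start of the file into a buffer.

```c
#include <stddef.h>

/* Return 1 if the buffer starts with the UTF-8 BOM (EF BB BF), else 0.
 * Buffers shorter than 3 bytes cannot contain a BOM. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 &&
           buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

The three bytes are simply the UTF-8 encoding of the code point U+FEFF, which is why they never show up as visible text.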

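With respect to the BOM, UTF-8 files come in two flavors: 'UTF-8' (with BOM) and 'UTF-8 without BOM'.

This seemingly trivial difference is argued over endlessly online, and many people report having been burned by it; the debate gets heated.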

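2.1 People who got bitten

There are many more such reports than can be listed here. They all come down to the BOM, for the following reasons:

1. Unfamiliarity with encodings, with no notion of what a BOM is.

2. The BOM consists of non-printing characters. Unless you inspect the start of the file in hex, it is hard to tell the file's encoding at a glance, and once the encoding is misidentified, everything done afterwards is wrong too.

3. Some software or platforms do not support UTF-8 with a BOM, so an extra step (stripping the BOM) may be needed.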
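For the third case, stripping the BOM before handing the data on is straightforward. The following is a sketch under the assumption that the whole file content is already in a writable buffer; `strip_utf8_bom` is a hypothetical name, not an API of any particular library.

```c
#include <stddef.h>
#include <string.h>

/* Remove a leading UTF-8 BOM (EF BB BF) from buf in place.
 * Returns the new length of the data; buffers without a BOM
 * are left untouched. */
size_t strip_utf8_bom(unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        /* Source and destination overlap, so memmove rather than memcpy. */
        memmove(buf, buf + 3, len - 3);
        return len - 3;
    }
    return len;
}
```

Calling it a second time on the same data is harmless, since it only acts when the signature is actually present.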

2.2 A balanced answer

Should a file carry a BOM? One fairly balanced answer.

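2.3 Personal view

Because UTF-8 is a variable-length encoding, a file without a BOM is easily confused with the legacy internal code (a local multi-byte encoding such as GB2312/GBK, which has no concept of a BOM), leading to misparsing and even garbled text. For example, the two bytes C6H and BDH can be read either as the internal code of the Chinese character “平” or as the UTF-8 encoding of “ƽ”.

a. The Chinese character “平”

Note: the internal code of “平” is C6H BDH, binary 11000110 10111101.

b. The character “ƽ”

Note: the UTF-8 encoding of “ƽ” is C6H BDH, binary 11000110 10111101. Stripping the UTF-8 marker bits (110 from the first byte, 10 from the second) leaves 00000001 10111101, which as UTF-16 is 0x01BD (hexadecimal).

Food for thought: if the example bytes are saved to a file without a BOM, how is an editor supposed to interpret them?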
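One heuristic a tool could apply to a BOM-less file is to test whether the bytes form well-formed UTF-8 at all. The sketch below illustrates that idea only; it is not a description of what any particular editor does, and `utf8_is_valid` is a hypothetical name. It accepts 1- to 4-byte sequences and rejects overlong forms, surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8 (1- to 4-byte sequences,
 * no overlong forms, no surrogates, nothing above U+10FFFF), else 0. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t n;                /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */

        if (c < 0x80)                { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;           /* stray continuation or invalid lead byte */

        if (len - i <= n) return 0;              /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return 0;        /* overlong or too large */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate */
        i += n + 1;
    }
    return 1;
}
```

The two bytes C6 BD pass this check, so for such a short input the ambiguity is real. Append another pair of high bytes, say C6 BD B0 B2, and the check fails because B0 is a continuation byte with no lead byte, which hints at a legacy encoding. The heuristic gets more reliable as the text gets longer, but it is never a proof.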

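Summary: UTF-8 files come in two kinds, 'with BOM' and 'without BOM'. If you run into trouble, use a tool such as UltraEdit or WinHex to inspect the start of the file for the BOM and confirm its encoding. If your own working environment only ever uses UTF-8, or other constraints apply, the BOM-less format is acceptable. Otherwise, for better compatibility and fault tolerance, UTF-8 with BOM is strongly recommended.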
