
10.2 Character Encoding in Python

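With the difference and relationship between Unicode and UTF-8 clear, we can turn to how Python handles encoding. Python 3 represents strings with two types, str and bytes.

(1) str: text stored as Unicode.

(2) bytes: the result of converting Unicode into a specific encoding, such as UTF-8 or GBK.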

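Because strings in Python 3 are Unicode by default, encoding problems are comparatively rare. In Python 2, by contrast, the default string holds bytes in some specific encoding converted from Unicode. Since many encodings are possible, encoding problems come up often there and cause users a lot of trouble.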

This book uses Python 3, so to keep things simple the rest of this section covers only Chinese text encoding in Python 3.

Python's default string type can be seen as follows:

In [1]: str1 = "我们"
        print(str1)
        print(type(str1))

我们
<class 'str'>

As the output shows, a string in Python 3 is of type str by default, which means it holds Unicode text.
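To make "a str holds Unicode text" more concrete, here is a minimal sketch that inspects the code points behind the same string. The variable names are illustrative only and do not come from the book.

```python
# A str is a sequence of Unicode code points, not bytes.
text = "我们"

# len() counts characters (code points), not bytes.
char_count = len(text)

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in text]

print(char_count)
print([hex(cp) for cp in code_points])
```

No encoding such as UTF-8 or GBK is involved at this stage. An encoding only comes into play once the str is converted to bytes.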

encode and decode

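How do these default str strings get converted to bytes?

This is where encode and decode come in. encode converts a Unicode str into bytes in another encoding, and decode converts bytes in another encoding back into a Unicode str, as shown in Figure 10-1.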

Figure 10-1 Converting between encodings with encode and decode

Figure 10-1 uses conversion between Unicode and UTF-8 as its example. In code:

In [2]: str1 = "我们"
        str_utf8 = str1.encode('utf-8')
        print(str_utf8)
        print(type(str_utf8))

b'\xe6\x88\x91\xe4\xbb\xac'
<class 'bytes'>

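Here str_utf8 holds UTF-8 encoded bytes. After conversion, each of these Chinese characters (one Unicode character each) becomes three UTF-8 bytes. \xe6 is one of those bytes: its value is 230, which has no printable ASCII character, so the byte is shown in hexadecimal. The three bytes \xe6\x88\x91 represent "我", and the three bytes \xe4\xbb\xac represent "们". The bytes can be decoded back as follows: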

In [3]: str_decode = str1.encode('utf-8').decode('utf-8')
        print(str_decode)
        print(type(str_decode))

我们
<class 'str'>
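The byte breakdown described above can be checked directly. The following is a small sketch, with variable names chosen only for illustration, that lists the byte values and shows which bytes belong to which character.

```python
text = "我们"
data = text.encode("utf-8")

# Iterating over a bytes object yields integers in the range 0..255.
byte_values = list(data)

# Encode each character separately to see its own UTF-8 bytes as hex.
per_char = {ch: ch.encode("utf-8").hex() for ch in text}

print(len(text), len(data))  # number of characters vs. number of bytes
print(byte_values[0])        # the first byte as a decimal integer
print(per_char)
```

Note that three bytes per character holds for these two characters, not for UTF-8 in general: ASCII characters take one byte, and some characters take two or four.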

Calling decode turns the UTF-8 encoded bytes back into a Unicode str. encode can also produce other encodings, such as GBK. To check which encoding a byte string uses, you can try chardet:

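In [4]: import chardet
        str_gbk = "我们".encode('gbk')
        chardet.detect(str_gbk)

{'confidence': 0.8095977270813678, 'encoding': 'TIS-620'}

Note that chardet only makes a statistical guess. For this four-byte input it reports TIS-620 (a Thai encoding) rather than GBK, so on short inputs the result should be treated as a hint, not a fact.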
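When the set of likely encodings is known in advance, one alternative to detection is simply to try each candidate in turn. The sketch below is an assumption-laden illustration, not something from the book: decode_with_candidates is a hypothetical helper, and the candidate list ("utf-8", "gbk") is chosen only for this example.

```python
def decode_with_candidates(data, candidates=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; the candidate order matters.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


gbk_bytes = "我们".encode("gbk")
text, used = decode_with_candidates(gbk_bytes)
print(text, used)
```

A successful decode does not prove the encoding is the right one, since some encodings accept almost any byte sequence. Putting the strictest candidate first (here UTF-8) reduces, but does not eliminate, false matches.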

If you are feeling adventurous, you might ask: can a Unicode str itself be decoded? Here is what happens:

In [5]: str_unicode_decode = "我们".decode()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-0402a0b683b7> in <module>()
----> 1 str_unicode_decode = "我们".decode()

AttributeError: 'str' object has no attribute 'decode'

And can bytes that are already UTF-8 encoded be encoded again? Here is what happens:

In [6]: str_utf8 = "我们".encode('utf-8')
        str_gbk = str_utf8.encode('gbk')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-5d0c32a4bf21> in <module>()
      1 str_utf8 = "我们".encode('utf-8')
----> 2 str_gbk = str_utf8.encode('gbk')

AttributeError: 'bytes' object has no attribute 'encode'

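The answer to both questions is no. In Python 3, a str is already Unicode and has no decode method, and bytes are already encoded and have no encode method. To convert UTF-8 into another non-Unicode encoding such as GBK, first decode the bytes to a Unicode str, then encode that str in the target encoding.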

The following code converts UTF-8 bytes to another non-Unicode encoding:

In [7]: str_utf8 = "我们".encode('utf-8')
        str_gbk = str_utf8.decode('utf-8').encode('gbk')
        print(str_gbk)

b'\xce\xd2\xc3\xc7'
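The decode-then-encode step can be wrapped in a small function. This is a minimal sketch: transcode and its parameter names are hypothetical and not part of the Python standard library.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via a Unicode str."""
    text = data.decode(source_encoding)  # bytes -> str (Unicode)
    return text.encode(target_encoding)  # str -> bytes in the new encoding


utf8_bytes = "我们".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes)
```

The conversion only succeeds if the target encoding can represent every character in the text. Otherwise encode raises UnicodeEncodeError, which a caller may want to handle.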
