Unpacking the UTF-8 bytes of Chinese characters

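I know that in UTF-8 each Chinese character takes three bytes, i.e. 24 bits. For a project, I want to split those 24 bits into three separate 8-bit values (three parts, each less than 256, that can be reversed back into the original character). Example code: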

>>> import struct as st
>>>
>>> string = "Nature期刊"
>>> byte = string.encode(encoding="utf-8")
>>> print(byte)
b'Nature\xe6\x9c\x9f\xe5\x88\x8a'
>>> info = st.unpack(f"<{len(byte)}B", byte)
>>> print(info)
(78, 97, 116, 117, 114, 101, 230, 156, 159, 229, 136, 138)

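As the output shows, each English letter unpacks to one byte and each Chinese character to three bytes, and the English letters map to their ASCII values (< 128). My question: for every Chinese character, do all three bytes, unpacked as in the example above, always fall in the range [128, 256)? (A later step has to tell English and Chinese apart when decoding, so I need a rule for deciding whether a number is an English letter or one third of a Chinese character.) Many thanks.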

LongAfter


Jason990420

Best answer

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the number actually assigned is smaller). By convention, a code point is written using the notation U+265E to mean the character with value 0x265E (9,822 in decimal).

UTF-8 uses the following rules:

If the code point is < 128, it’s represented by the corresponding byte value.

If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
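
The two rules above can be checked directly in Python. The following is a minimal sketch, not part of the original answer: the helper name `classify_utf8_bytes` is hypothetical, and the finer split of the 128 to 255 range into lead bytes (0xC0 and above) and continuation bytes (0x80 to 0xBF) comes from the UTF-8 encoding scheme rather than from the text quoted here. It also shows the reverse step, and that a rarer character such as U+20000 (CJK Extension B) takes four bytes rather than three.

```python
def classify_utf8_bytes(text):
    """Label each UTF-8 byte of `text` (hypothetical helper for illustration)."""
    labels = []
    for b in text.encode("utf-8"):
        if b < 0x80:
            labels.append((b, "ascii"))         # single-byte character
        elif b < 0xC0:
            labels.append((b, "continuation"))  # 0x80-0xBF: inside a sequence
        else:
            labels.append((b, "lead"))          # 0xC0 and above: starts a sequence
    return labels


text = "Nature期刊"
values = tuple(text.encode("utf-8"))  # same numbers as the struct.unpack call

# Every byte belonging to a Chinese character is >= 128.
chinese_bytes = values[6:]
print(all(128 <= b < 256 for b in chinese_bytes))

# The split is reversible: rebuild the bytes object and decode it.
restored = bytes(values).decode("utf-8")
print(restored == text)

# Not every Chinese character is three bytes long.
rare = "\U00020000"  # a CJK Extension B character
print(len(rare.encode("utf-8")))

print(classify_utf8_bytes("a期"))
```

So a value below 128 is always a complete ASCII character, while a value of 128 or more is always part of a multi-byte sequence. Counting bytes in fixed groups of three is fragile for the reason shown; grouping by lead byte, or simply decoding the whole byte string at once, avoids that assumption.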

Commented 1 year ago

LongAfter (original poster)

Thanks for your answer!
