Question about encoding and unpacking Chinese characters

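I know that in UTF-8 each Chinese character corresponds to three bytes, i.e. 24 bits. For a project, I want to split those 24 bits into three separate 8-bit values (three parts, each less than 256, that can be reversed back into the Chinese character). Example code: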
>>> import struct as st
>>>
>>> string = "Nature期刊"
>>> byte = string.encode(encoding="utf-8")
>>> print(byte)
b'Nature\xe6\x9c\x9f\xe5\x88\x8a'
>>> info = st.unpack(f"<{len(byte)}B", byte)
>>> print(info)
(78, 97, 116, 117, 114, 101, 230, 156, 159, 229, 136, 138)
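As shown, each English letter unpacks to one byte and each Chinese character unpacks to three bytes, and the English letters map to their ASCII values (< 128). My question: for every Chinese character, do the three bytes produced by unpacking as in the example always fall in the range [128, 256)? (Later steps need to tell English and Chinese apart when decoding, so I need a criterion for deciding whether a number is an English character or one third of a Chinese character.) Many thanks.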
Jason990420
Best answer

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, the actual number assigned is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).
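To make the code point notation concrete, here is a small sketch that looks up the code point of a character and the UTF-8 bytes it encodes to. The helper name `describe` is made up for this illustration and is not part of any library.

```python
def describe(ch):
    """Return (code point, list of UTF-8 byte values) for one character.

    `describe` is a hypothetical helper used only for this example.
    """
    return ord(ch), list(ch.encode("utf-8"))


for ch in "N期刊":
    code_point, utf8_bytes = describe(ch)
    # Print the code point in the U+XXXX notation used above.
    print(f"{ch!r}: U+{code_point:04X}, {len(utf8_bytes)} byte(s), {utf8_bytes}")
```

For the characters in the question, `期` is U+671F and `刊` is U+520A, and both encode to three bytes.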

UTF-8 uses the following rules:

  • If the code point is < 128, it’s represented by the corresponding byte value.
  • If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
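Applied to the question, the two rules give the criterion directly: a value below 128 is an ASCII character, and a value from 128 to 255 is part of a multi-byte character. A minimal sketch follows, assuming the input is valid UTF-8. The function name `split_into_characters` is hypothetical, and the lead-byte ranges are used only to find the sequence length, not to fully validate the data. Note that a few rare Chinese characters outside the Basic Multilingual Plane (for example U+20000) take four bytes rather than three, so the length is read from the lead byte instead of assuming three.

```python
def split_into_characters(values):
    """Group unpacked byte values into one tuple per character.

    Hypothetical helper for illustration; assumes `values` came from
    valid UTF-8. The first byte of each character gives its length:
    < 0x80 -> 1 byte (ASCII), 0xC0-0xDF -> 2, 0xE0-0xEF -> 3, 0xF0-0xF7 -> 4.
    """
    chunks = []
    i = 0
    while i < len(values):
        b = values[i]
        if b < 0x80:
            length = 1
        elif 0xC0 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF7:
            length = 4
        else:
            # 0x80-0xBF can only appear inside a sequence, never at its start.
            raise ValueError(f"unexpected byte {b} at index {i}")
        chunks.append(tuple(values[i:i + length]))
        i += length
    return chunks


values = tuple("Nature期刊".encode("utf-8"))
chunks = split_into_characters(values)

# Restoring the text does not need struct: bytes() accepts the integers directly.
restored = bytes(values).decode("utf-8")
print(chunks)
print(restored)
```

If only the whole string needs to be recovered, `bytes(values).decode("utf-8")` is enough; the grouping step is useful when each character has to be handled separately.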
5 months ago
LongAfter (original poster) 5 months ago