Unicode Flex Character Classes

From: Hans Aberg
Subject: Unicode Flex Character Classes
Date: Tue, 11 Jan 2005 20:20:25 +0100
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

I have hacked together some functions that generate Flex-style regular
expressions for the Unicode encodings UTF-8 and UTF-32 (big- and
little-endian). They are written in Haskell, using Hugs
<http://haskell.org/>. The input to each of these functions is a pair of
Haskell Ints x y (the same as a C 32-bit int), designating a closed interval
[x-y] of Unicode numbers. The function converts this interval to a regular
expression matching that Unicode character class as it appears in the
function's own Unicode encoding. None of these functions treats the excluded
Unicode numbers U+D800 to U+DFFF and U+FFFE to U+FFFF differently from other
Unicode numbers. One can thus generate regular expressions for these numbers
in order to produce error messages. In addition, I have supplied functions
that generate regular expressions for the overlong UTF-8 multibyte
sequences, which are illegal in UTF-8, again with the intent of letting one
produce error messages for such sequences. (More details below.)
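
As a rough illustration of what an overlong multibyte sequence is, the
following sketch encodes a Unicode number in a chosen number of UTF-8 bytes
and compares that with the shortest possible length. It assumes the original
1 to 6 byte UTF-8 scheme. The names utf8Length, encodeWith and isOverlong
are hypothetical and are not taken from Unicode.hs, whose internals are not
shown in this message.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))

-- Fewest bytes that can hold n in the original 1..6 byte UTF-8 scheme.
utf8Length :: Int -> Int
utf8Length n
  | n < 0x80      = 1
  | n < 0x800     = 2
  | n < 0x10000   = 3
  | n < 0x200000  = 4
  | n < 0x4000000 = 5
  | otherwise     = 6

-- Encode n in exactly k bytes (1 <= k <= 6), assuming n fits in k bytes.
-- (Hypothetical helper, not the code from Unicode.hs.)
encodeWith :: Int -> Int -> [Int]
encodeWith 1 n = [n]
encodeWith k n = lead : rest
  where
    -- Tag bits of the lead byte for 2, 3, 4, 5 and 6 byte sequences.
    tag  = [0xC0, 0xE0, 0xF0, 0xF8, 0xFC] !! (k - 2)
    lead = tag .|. (n `shiftR` (6 * (k - 1)))
    -- Each continuation byte carries six payload bits under the tag 0x80.
    rest = [0x80 .|. ((n `shiftR` (6 * i)) .&. 0x3F) | i <- [k - 2, k - 3 .. 0]]

-- A k-byte encoding of n is overlong when fewer bytes would have sufficed.
isOverlong :: Int -> Int -> Bool
isOverlong k n = k > utf8Length n
```

For example, encodeWith 1 0x2F is the single byte 0x2F, while
encodeWith 2 0x2F is the byte pair 0xC0 0xAF: the same number, but in an
overlong form that a conforming UTF-8 decoder must reject.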

One can use the functions to generate regular expressions to plug into Flex,
or one might integrate them into a future Unicode version of Flex. Dwelling
a bit on the latter: it is possible to let UTF-8 and UTF-32 character
classes exist side by side. This would be useful if one wants to write a
lexer that can switch between encodings, say via start conditions. One
cannot, though, mix different encodings within the same character class.
Therefore, a Unicode extension of Flex might look as follows:

The .l files are assumed to be in UTF-8. The standard character classes
[x-y] are extended to generate UTF-8 regular expressions. Writing \u........
translates into a UTF-8 regular expression, \U........ into a big-endian
UTF-32 regular expression, and \L........ into a low-endian (little-endian)
UTF-32 regular expression. If big- (resp. low-) endian numbers are used in
a [x-y] character class, all numbers in that class must be big- (resp.
low-) endian, and the character class is translated into a big- (resp.
low-) endian UTF-32 regular expression.
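
The no-mixing rule could be checked along the following lines. This is only
a sketch of one reading of the proposal: it assumes that a plain character
in a class counts as UTF-8 because the .l file is UTF-8. The types Encoding
and Member and the function classEncoding are hypothetical and exist
neither in Flex nor in Unicode.hs.

```haskell
-- The three encodings the proposed escapes can select.
data Encoding = UTF8 | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- A member of a character class: a literal character from the .l file,
-- or an escape letter ('u', 'U' or 'L') with its Unicode number.
data Member = Literal Char | Escape Char Int
  deriving Show

-- Encoding implied by one member (assumption: literals count as UTF-8).
memberEncoding :: Member -> Maybe Encoding
memberEncoding (Literal _)    = Just UTF8
memberEncoding (Escape 'u' _) = Just UTF8
memberEncoding (Escape 'U' _) = Just UTF32BE
memberEncoding (Escape 'L' _) = Just UTF32LE
memberEncoding (Escape _ _)   = Nothing

-- A class is accepted only if all of its members agree on one encoding.
classEncoding :: [Member] -> Either String Encoding
classEncoding ms = case mapM memberEncoding ms of
  Nothing -> Left "unknown escape letter"
  Just [] -> Left "empty character class"
  Just (e : es)
    | all (== e) es -> Right e
    | otherwise     -> Left "mixed encodings in one character class"
```

With these definitions, a class built from two \U escapes is classified as
UTF32BE, while a class combining a literal character with a \U escape is
rejected.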

Returning to the functions in the attached file Unicode.hs, all are of type
Int -> Int -> Reg:
urToRegU8 :: Int -> Int -> Reg -- Unicode to UTF-8 reg ex.
urToRegU32 :: Int -> Int -> Reg -- Unicode to bigendian UTF-32 reg ex.
urToRegU32L :: Int -> Int -> Reg -- Unicode to lowendian UTF-32 reg ex.

urToRegU8B1 :: Int -> Int -> Reg -- Unicode to 1-byte UTF-8 reg ex.
urToRegU8B2 :: Int -> Int -> Reg -- Unicode to 2-byte UTF-8 reg ex.
urToRegU8B3 :: Int -> Int -> Reg -- Unicode to 3-byte UTF-8 reg ex.
urToRegU8B4 :: Int -> Int -> Reg -- Unicode to 4-byte UTF-8 reg ex.
urToRegU8B5 :: Int -> Int -> Reg -- Unicode to 5-byte UTF-8 reg ex.
urToRegU8B6 :: Int -> Int -> Reg -- Unicode to 6-byte UTF-8 reg ex.

Just compute, for example:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
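
One way such expressions can be derived is sketched below. It is not the
code from Unicode.hs: the names rangeSeqs, render, utf32BE and utf32LE are
hypothetical, and the Reg type is replaced by a plain list of alternatives,
each a sequence of byte ranges. The idea is to compare the byte strings of
the two endpoints from left to right, split off the partial blocks at either
end, and cover everything in between with full ranges.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (toUpper)
import Data.List (intercalate)
import Numeric (showHex)

-- One regex position: an inclusive byte range.
type ByteRange = (Int, Int)

-- All byte strings from lo to hi (equal length, lo <= hi), as alternatives
-- of per-position ranges. The first argument gives the permitted range at
-- each position; only the limits of the trailing positions are consulted.
rangeSeqs :: [ByteRange] -> [Int] -> [Int] -> [[ByteRange]]
rangeSeqs (_ : ls) (a : as) (b : bs)
  | a == b    = map ((a, a) :) (rangeSeqs ls as bs)
  | otherwise = low ++ mid ++ high
  where
    loFull = as == map fst ls   -- the block starting with a is complete
    hiFull = bs == map snd ls   -- the block starting with b is complete
    a'     = if loFull then a else a + 1
    b'     = if hiFull then b else b - 1
    low    = if loFull then [] else map ((a, a) :) (rangeSeqs ls as (map snd ls))
    mid    = [(a', b') : ls | a' <= b']
    high   = if hiFull then [] else map ((b, b) :) (rangeSeqs ls (map fst ls) bs)
rangeSeqs _ _ _ = [[]]

-- Print a byte the way the session above does: \0 or \xHH.
showByte :: Int -> String
showByte 0 = "\\0"
showByte n = "\\x" ++ replicate (2 - length h) '0' ++ h
  where h = map toUpper (showHex n "")

showRange :: ByteRange -> String
showRange (a, b)
  | a == b    = showByte a
  | otherwise = "[" ++ showByte a ++ "-" ++ showByte b ++ "]"

render :: [[ByteRange]] -> String
render = intercalate "|" . map (concatMap showRange)

-- The four bytes of n, most significant first.
bytesBE :: Int -> [Int]
bytesBE n = [(n `shiftR` s) .&. 0xFF | s <- [24, 16, 8, 0]]

utf32BE :: Int -> Int -> [[ByteRange]]
utf32BE x y = rangeSeqs (replicate 4 (0, 0xFF)) (bytesBE x) (bytesBE y)

-- Little-endian: the same alternatives with the byte order reversed.
utf32LE :: Int -> Int -> [[ByteRange]]
utf32LE x y = map reverse (utf32BE x y)
```

Here render (utf32BE 0x00010000 0x001FFFFF) gives
\0[\x01-\x1F][\0-\xFF][\0-\xFF], the same as the urToRegU32 line above. The
sketch's utf32LE puts the least significant byte first and so gives
[\0-\xFF][\0-\xFF][\x01-\x1F]\0, which is not the byte order printed on the
urToRegU32L line. The same rangeSeqs also covers one UTF-8 length at a time
if the trailing positions are limited to (0x80, 0xBF): for the three-byte
endpoints E0 A0 80 and EF BF BF it yields
\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEF][\x80-\xBF][\x80-\xBF], which matches
the same strings as the three-byte part of the urToRegU8 output, without the
factoring of the common final byte.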

Hans Aberg


