author | Matthias Andree <matthias.andree@gmx.de> | 2010-05-27 10:25:15 +0200 |
---|---|---|
committer | Matthias Andree <matthias.andree@gmx.de> | 2010-05-27 10:25:15 +0200 |
commit | bab5fb8c711763e3731c80223481ea3fa64f8525 (patch) | |
tree | 19083a974c10956fe4a5787ac01f06e18eeded48 | |
parent | 107b6cd9074838c4c82ad0f92349c667b7c41fa8 (diff) | |
Add document on IMAP-Unicode for mailbox names.
-rw-r--r-- | Mailbox-Names-UTF7.txt | 234 |
1 files changed, 234 insertions, 0 deletions
diff --git a/Mailbox-Names-UTF7.txt b/Mailbox-Names-UTF7.txt
new file mode 100644
index 00000000..80f0b961
--- /dev/null
+++ b/Mailbox-Names-UTF7.txt
@@ -0,0 +1,234 @@
+IMAP4r1 Mailbox Names vs. Unicode
+=================================
+:author: Matthias_Andree_(ed.)_and_Mark_Crispin
+:email: matthias.andree@gmx.de
+:author initials: MA and MC
+:revision: 0.2
+:revdate: 2010-05-26
+:toc:
+:data-uri:
+:icons:
+:numbered:
+
+''''
+
+.Acknowledgment
+****
+This article would not have been possible without the
+substantial contributions from Mark Crispin.
+— Matthias Andree, editor
+****
+
+.Abstract
+****
+IMAP4rev1 is a widely used Internet Standards Track Protocol for remote
+email access. Its adoption in international environments posed
+problems in the construction and interpretation of mailbox names; in
+particular, it raised the question of whether IMAP4rev1 contained
+contradictory information.
+
+This article describes the problem, and shows that IMAP4rev1 is
+consistent with respect to mailbox names. We document how the evolution
+of Unicode character sets and transformation formats made the
+interpretation of the IMAP4rev1 standard difficult, and how it is to
+be interpreted properly.
+
+Finally, we show that UTF-7, which is used in IMAP4rev1 to encode
+mailbox names, does not impose artificial restrictions on the Unicode
+character set.
+****
+
+== IMAP Mailbox Names in RFC-3501
+
+In May 2010, some confusion arose on the getmail mailing list around a bug
+report to Debian that complained getmail4 wouldn't allow non-ASCII characters
+in an IMAP folder name http://bugs.debian.org/513116[Debian Bug#513116], and
+the interpretation of support of international mailbox names
+vs. http://tools.ietf.org/html/rfc3501[RFC-3501]. It seemed at first
+glance that IMAP4rev1 was limited to the Basic Multilingual Plane of
+Unicode.
+
+=== Problem statement
+
+Notably, RFC-3501 mandates that mailbox names are 7-bit; however, clients are
+supposed to accept 8-bit data and interpret it as UTF-8. This is apparently
+contradictory or extraneous, because 7-bit ASCII data need not be encoded.
+
+Let us look at the IMAP4rev1 standard:
+
+[quote, Mark Crispin, RFC3501]
+____
+5.1. Mailbox Naming
+
+Mailbox names are 7-bit. Client implementations MUST NOT attempt to
+create 8-bit mailbox names, and SHOULD interpret any 8-bit mailbox names
+returned by LIST or LSUB as UTF-8. Server implementations SHOULD
+prohibit the creation of 8-bit mailbox names, and SHOULD NOT return
+8-bit mailbox names in LIST or LSUB. See section 5.1.3 for more
+information on how to represent non-ASCII mailbox names. [...]
+____
+
+[quote, Mark Crispin, RFC3501]
+____
+5.1.3. Mailbox International Naming Convention
+
+By convention, international mailbox names in IMAP4rev1 are specified
+using a modified version of the UTF-7 encoding described in [UTF-7].
+Modified UTF-7 may also be usable in servers that implement an earlier
+version of this protocol. [...]
+____
+
+This appears to be contradictory, because UTF-7 is not UTF-8. However, a UTF-7
+mailbox name is not an 8-bit mailbox name, hence the clause "interpret any
+8-bit mailbox names ... as UTF-8" does not apply. Mark writes:
+
+=== Clarification
+_by Mark Crispin_
+
+8-bit octets are prohibited in mailbox names. Clients MUST use 7-bit
+names, and servers MUST reject CREATE commands that contain 8-bit
+octets.
+
+However, clients MUST also interpret any 8-bit names in a list of
+mailbox names (from LIST or LSUB) as UTF-8.
+
+To understand the history here, we must go back to the 1990s, when
+people (in spite of being told not to do so) were writing IMAP2 clients
+and servers which used ISO-8859-1 and Shift-JIS mailbox names. At that
+time, it was by no means certain that UTF-8 would become the standard
+Internet character set; I played an important role in making that
+happen, but that was still a few years in the future.
+
+The adoption of UTF-8 offered a chance to exterminate non-UTF-8 8-bit
+mailbox names, and in 1996 the current rules were adopted. The
+transition to IMAP4 (which required substantial changes to any IMAP2
+servers) provided an opportunity to exterminate these non-interoperable
+names once and for all.
+
+The modified UTF-7 was a temporary expedient to allow non-ASCII mailbox
+names while remaining within the 7-bit framework. Had punycode existed at
+the time, it would have been a much better choice than UTF-7. But
+punycode did not exist until several years later, with IDN. In fact,
+punycode was created because people learned the problems of UTF-7 from
+IMAP.
+
+The intent was always to move to a UTF-8 only environment and leave
+behind UTF-7. When that happens, clients will start encountering UTF-8
+names. It is therefore necessary to tell clients that, even though they
+are not permitted to send them, they need to be written to handle them
+so they work properly when the restriction is relaxed in the future.
+
+=== Recommendations
+_by Mark Crispin_
+
+*Options for server implementors*
+
+From the perspective of a server implementor, you have one of two choices
+of how to implement MUTF-7:
+footnote:[editor's note: Modified UTF-7 as specified by the ensemble of RFC-2152 and RFC-3501]
+
+[horizontal]
+[S1]:: Ignore it; just forbid 8-bit octets in the CREATE command.
+[S2]:: Convert mailbox names in commands from MUTF-7 to UTF-8. When doing a
+LIST or LSUB, convert mailbox names from UTF-8 to MUTF-7 before sending
+them to the client.
+
+Servers of type [S1] were far more common in the 1990s. [S2] is more
+common today. However, a client neither knows, nor cares, which type of
+server it is talking to, because the rules make both servers interoperate
+the same.
+
+*Options for client implementors*
+
+[horizontal]
+[C1]:: Ignore it; you're an ASCII client.
+[C2]:: Convert mailbox names from UTF-8 to MUTF-7 when sending a command.
+When receiving a listing of mailboxes, convert MUTF-7 to UTF-8.
+
+This all works, and works well. The routines to do the conversions are
+quite straightforward. The only thing that you can't do well is mix
+wildcards with strings with non-ASCII names; and that is primarily a
+curiosity since no clients do that with ASCII names.
+
+== Unicode, UCS-2, UTF-16, and UTF-7
+
+.Incomplete specification:
+WARNING: This section and its subsections are not normative references,
+ and are insufficient to implement UCS-2, UTF-16 or UTF-7 based
+ software.
+
+=== UCS-2 and UTF-16
+_by Mark Crispin_
+
+RFC-3501 uses http://tools.ietf.org/html/rfc2152[RFC-2152] by reference.
+Some of the confusion on the getmail list arose from the fact that
+RFC-2152 talks about UCS-2 representation, which is limited to the Basic
+Multilingual Plane (BMP) range U+0000 to U+FFFF.
+
+However, RFC-2152 also (page 5) refers to the handling of surrogate
+pairs, which are defined in UTF-16 but not UCS-2.
+
+The correct interpretation is that the wording in RFC-2152 was written
+at a time when "UCS-2" was interpreted as a synonym for "16-bit value"
+as opposed to "BMP-only codepoints". This happens frequently in older
+standards. Since UTF-7 is deprecated, nobody has done the work to
+update RFC-2152 to clarify this point.
+
+Using surrogate pairs extends the capability of 16-bit words beyond the
+BMP range.
+
+The 0x0000 to 0xFFFF range comprises so-called surrogates, two character
+ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF) of 1024 characters (2^10^)
+each. These ranges are technically removed from the BMP (thus there is
+no such thing as U+D800); and hence the BMP only contains 63,488
+(65,536 minus 2,048) possible codepoints.
+
+Both the UTF-7 and UTF-16 transformations leverage these ranges to map
+Unicode code points in the range from U+010000 to U+10FFFF (which is the
+highest Unicode code point) to a pair of UCS-2 characters in the
+surrogate ranges.
+
+This happens by first subtracting 0x10000, which maps the input into the
+range 0x0 to 0xFFFFF, representable in 20 bits. The most significant
+10-bit portion is mapped into the range 0xD800…0xDBFF, the least
+significant 10-bit portion into the range 0xDC00…0xDFFF, and these two
+16-bit values are used in this order. UTF-7 does a further step of
+encoding in modified BASE64.
+
+Thus, UTF-7 and UTF-16 both deal with ``16-bit values'' and use the same
+surrogate pair mechanism to access non-BMP codepoints. Although not
+strictly accurate (the two are technically independent encodings of
+Unicode), it may be helpful to think of UTF-7 as a further encoding of
+UTF-16.
+
+=== UTF-7
+
+UTF-7 is a 7-bit representation of Unicode that makes use of character set
+shifting. A character that is directly representable represents itself. Other
+characters are subjected to a modified BASE64-encoding (that omits the padding
+"=" characters at the end of a group) which is preceded by a "+" character
+and trailed by a "-" character, which is discarded, or any other character
+not in the modified BASE64 set, which remains in the stream.
+
+As a special case, the sequence "\+-" is a shorthand to represent
+the "+" character itself.
+
+The modified BASE64 character set uses the characters A-Z, a-z, digits 0-9,
+and the characters "+" and "/", omitting "=" to avoid collisions with
+RFC-2047 encoding.
+
+=== Modified UTF-7
+
+This works similarly to UTF-7, but mandates that printable ASCII characters
+0x20...0x7E except 0x26 (the ampersand "&") represent themselves, and uses yet
+another BASE64 alphabet consisting of the upper- and lowercase letters, the
+digits, and the characters "+" and ",", with some further rules specified in
+RFC-3501.
+
+== Conclusions
+
+IMAP clients that want to support international mailbox names should send
+modified UTF-7, and be prepared to handle modified UTF-7 (if no 8-bit data
+is found) and UTF-8 (if 8-bit data is found).
+
+Modified UTF-7 as per RFC-3501 is not limited to the Unicode Basic
+Multilingual Plane, but maps the entire Unicode range.
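
The conversions described in the document can be sketched in a few lines of Python. This is an illustrative sketch only, not code from fetchmail or getmail: the function names (`surrogate_pair`, `mutf7_encode`, `mutf7_decode`, `interpret_listed_name`) are hypothetical, and it assumes the RFC-3501 section 5.1.3 rules that "&" is the shift character and "&-" stands for a literal ampersand. It does no validation beyond what the standard-library codecs perform.

```python
import base64


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit surrogate values."""
    v = cp - 0x10000                      # now in the range 0x0..0xFFFFF (20 bits)
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def mutf7_encode(name: str) -> str:
    """Encode a Unicode mailbox name as modified UTF-7 (sketch)."""
    out, pending = [], []                 # pending: run that needs BASE64

    def flush() -> None:
        if pending:
            # UTF-16BE produces surrogate pairs for non-BMP code points.
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            # Modified BASE64: no "=" padding, "," instead of "/".
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")              # literal ampersand
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)                # printable ASCII represents itself
        else:
            pending.append(ch)
    flush()
    return "".join(out)


def mutf7_decode(encoded: str) -> str:
    """Decode a modified UTF-7 mailbox name to Unicode (sketch)."""
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] != "&":
            out.append(encoded[i])
            i += 1
            continue
        end = encoded.index("-", i)       # ValueError if the shift is unterminated
        chunk = encoded[i + 1:end]
        if not chunk:
            out.append("&")               # "&-" is a literal ampersand
        else:
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)  # restore the omitted padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


def interpret_listed_name(raw: bytes) -> str:
    """Client-side rule: 8-bit data is UTF-8, 7-bit data is modified UTF-7."""
    if any(b >= 0x80 for b in raw):
        return raw.decode("utf-8")
    return mutf7_decode(raw.decode("ascii"))
```

With these definitions, the example name from RFC-3501, `~peter/mail/台北/日本語`, encodes to `~peter/mail/&U,BTFw-/&ZeVnLIqe-`, and a name containing a code point outside the BMP survives an encode/decode round trip because the UTF-16 step supplies the surrogate pair.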