
Aw man. I was using "WTF-8" to mean "Double UTF-8", as I described most recently at [1]. Double UTF-8 is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8.

[1] http://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-...

It was such a perfect abbreviation, but now I probably shouldn't use it, as it would be confused with Simon Sapin's WTF-8, which people would actually use on purpose.
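For a concrete picture of the "double UTF-8" round trip described above, here is a minimal Python sketch (the string and the choice of Windows-1252 are just illustrative); ftfy's fix_text should undo exactly this kind of damage:

    >>> s = 'café'
    >>> s.encode('utf-8').decode('windows-1252').encode('utf-8').decode('utf-8')
    'cafÃ©'
    >>> import ftfy
    >>> ftfy.fix_text('cafÃ©')
    'café'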


> ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain("ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])


Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.


Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.


I love this.

    The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
    "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE"
    in this memo are to be interpreted as described in [RFC2119].


You really want to call this WTF(8)? Is it April 1st today? Am I the only one that thought this article is about a new funny project that is called "what the fuck" encoding, like when somebody announced he had written a to_nil gem https://github.com/mrThe/to_nil ;) Sorry but I can't stop laughing.


This is intentional. I wish we didn't have to do stuff like this, but we do and that's the "what the fuck". All because the Unicode Committee in 1989 really wanted 16 bits to be enough for everybody, and of course it wasn't.


The mistake is older than that. Wide character encodings in general are just hopelessly flawed.

WinNT, Java and a lot more software use wide character encodings UCS2/UTF-16(/UTF-32?). And it was added to C89/C++ (wchar_t). WinNT actually predates the Unicode standard by a year or so. http://en.wikipedia.org/wiki/Wide_character , http://en.wikipedia.org/wiki/Windows_NT#Development

Converting between UTF-8 and UTF-16 is wasteful, though often necessary.

> wide characters are a hugely flawed idea [parent post]

I know. Back in the early nineties they thought otherwise and were proud that they used it, in hindsight. But nowadays UTF-8 is usually the better choice (except for perhaps some Asian and exotic later-added languages that may require more space with UTF-8) - I am not saying UTF-16 would be a better choice then, there are certain other encodings for special cases.

And as the linked article explains, UTF-16 is a huge mess of complication with back-dated validation rules that had to be added because it stopped being a wide-character encoding when the new code points were added. UTF-16, when implemented correctly, is actually significantly more complicated to get right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used. I don't know anything that uses it in practice, though surely something does.

Again: wide characters are a hugely flawed idea.

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not allow you to make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32 bit char or with two? If it's Unicode the answer is both)
    * Variation selectors (see also Han unification)
    * Bidi, RTL and LTR embedding chars
And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that utf-16 and more so utf-8 break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.


What's your storage requirement that's not adequately solved by the existing encoding schemes?


What are you suggesting, store strings in UTF8 and then "normalize" them into this bizarre format whenever you load/save them purely so that offsets correspond to grapheme clusters? Doesn't seem worth the overhead to my eyes.

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.

NFG enables O(N) algorithms for grapheme level operations.

The overhead is entirely wasted on code that does no character level operations.

For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.

i think linux/mac systems default to UCS-4, certainly the libc implementations of wcs* do.

i agree it's a flawed idea though. 4 billion characters seems like enough for now, but i'd guess UTF-32 will need extending to 64 too... and really how about decoupling the size from the data entirely? it works well enough in the general case of /every type of data we know about/ that i'm pretty sure this specialised use case is not very special.

The Unixish C runtimes of the world use a four-byte wchar_t. I'm not aware of anything in "Linux" that actually stores or operates on 4-byte character strings. Obviously some software somewhere must, but the overwhelming bulk of text processing on your linux box is done in UTF-8.

That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.

We don't even have 4 billion characters possible now. The Unicode range is only 0-10FFFF, and UTF-16 can't represent any more than that. So UTF-32 is restricted to that range as well, despite what 32 bits would allow, never mind 64.

But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's about 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.

> But we don't seem to be running out

The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.
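The Hangul mechanism referred to here is the standard's algorithmic composition: every syllable is computed from leading/vowel/trailing jamo indices rather than assigned by hand. A small Python sketch of that formula (the indices are just an example):

    S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

    def compose_hangul(l_index, v_index, t_index):
        # Unicode Hangul syllable composition: AC00 + (L*21 + V)*28 + T
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

    print(compose_hangul(18, 0, 4))             # HIEUH + A + NIEUN -> 한
    print(hex(ord(compose_hangul(18, 0, 4))))   # 0xd55c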

My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.

I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.


NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes. Enables fast grapheme-based manipulation of strings in Perl 6. Though such negative-numbered codepoints could only be used for private use in data interchange between third parties if UTF-32 was used, because neither UTF-8 (even pre-2003) nor UTF-16 could encode them.


Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is pretty much useless. That's why C11 added char16_t and char32_t.


I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on Unix-like systems is? I know I thought I had my code carefully basing whether it was UTF-16 or UTF-32 on the size of wchar_t, only to discover that one of the supposedly portable libraries I used had UTF-16 no matter how large wchar_t was.


Oh ok it's intentional. Thx for explaining the choice of the name. Not only because of the name itself but also by explaining the reason behind the choice, you managed to get my attention. I will try to find out more about this problem, because I guess that as a developer this might have some impact on my work sooner or later and therefore I should at least be aware of it.

to_nil is actually a pretty important function! Completely trivial, obviously, but it demonstrates that there's a canonical way to map every value in Ruby to nil. This is essentially the defining feature of nil, in a sense.

With typing the interest here would be more clear, of course, since it would be more apparent that nil inhabits every type.


The main motivator for this was Servo's DOM, although it ended up getting deployed first in Rust to deal with Windows paths. We haven't determined whether we'll need to use WTF-8 throughout Servo—it may depend on how document.write() is used in the wild.

So we're going to see this on web sites. Oh, joy.

It's time for browsers to start saying no to really bad HTML. When a browser detects a major error, it should put an error bar across the top of the page, with something like "This page may display improperly due to errors in the page source (click for details)". Start doing that for serious errors such as Javascript code aborts, security errors, and malformed UTF-8. Then extend that to pages where the character encoding is ambiguous, and stop trying to guess character encoding.

The HTML5 spec formally defines consistent handling for many errors. That's OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.

No. This is an internal implementation detail, not to be used on the Web.

As to draconian error handling, that's what XHTML is about and why it failed. Just define a somewhat sensible behavior for every input, no matter how ugly.


Yes, that bug is the best place to start. We've future proofed the architecture for Windows, but there is no direct work on it that I'm aware of.


What does the DOM do when it receives a surrogate half from Javascript? I thought that the DOM APIs (e.g. createTextNode, innerHTML setter, setAttribute, HTMLInputElement.value setter, document.write) would all strip out the lone surrogate code units?


In current browsers they'll happily pass around lone surrogates. Nothing special happens to them (vs. any other UTF-16 code-unit) till they reach the layout layer (where they obviously cannot be drawn).


I found this through https://news.ycombinator.com/item?id=9609955 -- I find it fascinating the solutions that people come up with to deal with other people's problems without damaging correct code. Rust uses WTF-8 to interact with Windows' UCS2/UTF-16 hybrid, and from a quick look I'm hopeful that Rust's story around handling Unicode should be much nicer than (say) Python or Java.


Have you looked at Python 3 yet? I'm using Python 3 in production for an internationalized website and my experience has been that it handles Unicode pretty well.

Not that great of a read. Stuff like:

> I have been told multiple times now that my point of view is incorrect and I don't understand beginners, or that the "text model" has been changed and my request makes no sense.

"The text model has changed" is a perfectly legitimate reason to reject ideas consistent with the previous text model and inconsistent with the current model. Keeping a coherent, consistent model of your text is a pretty important part of curating a language. One of Python's greatest strengths is that they don't just pile on random features, and keeping old crufty features from previous versions would amount to the same thing. To dismiss this reasoning is extremely shortsighted.


Many people who prefer Python3's way of handling Unicode are aware of these arguments. It isn't a position based on ignorance.


Hey, never meant to imply otherwise. In fact, even people who have issues with the py3 way often agree that it's still better than 2's.


Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse. Good examples for that are paths and anything that relates to local IO when your locale is C.

> Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse.

Maybe this has been your experience, but it hasn't been mine. Using Python 3 was the single best decision I've made in developing a multilingual website (we support English/German/Spanish). There's not a ton of local IO, but I've upgraded all my personal projects to Python 3.

Your complaint, and the complaint of the OP, seems to be basically, "It's different and I have to change my code, therefore it's bad."

My complaint is not that I have to change my code. My complaint is that Python 3 is an attempt at breaking as little compatibility with Python 2 as possible while making Unicode "easy" to use. They failed to accomplish both goals.

Now we have a Python 3 that's incompatible with Python 2 but provides almost no significant benefit, solves none of the large well known problems and introduces quite a few new problems.


I have to disagree, I think using Unicode in Python 3 is currently easier than in any language I've used. It certainly isn't perfect, but it's better than the alternatives. I certainly have spent very little time struggling with it.


That is not quite true, in the sense that more of the standard library has been made unicode-aware, and implicit conversions between unicode and bytestrings have been removed. So if you're working in either domain you get a coherent view, the trouble being when you're interacting with systems or concepts which straddle the divide or (even worse) may be in either domain depending on the platform. Filesystem paths is the latter, it's text on OSX and Windows — though possibly ill-formed in Windows — but it's bag-o-bytes in most unices. There Python 2 is just "better" in that issues will probably fly under the radar if you don't prod things too much.

There is no coherent view at all. Bytes still have methods like .upper() that make no sense at all in that context, while unicode strings with these methods are broken because these are locale dependent operations and there is no appropriate API. You can also index, slice and iterate over strings, all operations that you really shouldn't do unless you really know what you are doing. The API in no way indicates that doing any of these things is a problem.

Python 2's handling of paths is not good because there is no good abstraction over different operating systems, treating them as byte strings is a sane lowest common denominator though.

Python 3 pretends that paths can be represented as unicode strings on all OSes, that's not true. That is held up with a very leaky abstraction and means that Python code that treats paths as unicode strings and not as paths-that-happen-to-be-unicode-but-really-arent is broken. Most people aren't aware of that at all and it's definitely surprising.
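The leak being described comes from PEP 383's surrogateescape mechanism: bytes in a file name that don't decode get smuggled into str as lone surrogates. A rough Python sketch of what that looks like on a POSIX system with a UTF-8 locale (the file name is made up):

    import os

    raw = b'report-\xff.txt'          # legal on a Unix filesystem, not valid UTF-8
    name = os.fsdecode(raw)           # decodes with errors='surrogateescape'
    print(repr(name))                 # 'report-\udcff.txt' -- a str holding a lone surrogate
    print(os.fsencode(name) == raw)   # True: it round-trips back to the original bytes
    # but name.encode('utf-8') raises UnicodeEncodeError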

On top of that, implicit coercions have been replaced with implicit broken guessing of encodings, for example when opening files.

When you say "strings" are you referring to strings or bytes? Why shouldn't you slice or index them? It seems like those operations make sense in either case but I'm sure I'm missing something.

On the guessing encodings when opening files, that's not really a problem. The caller should specify the encoding manually ideally. If you don't know the encoding of the file, how can you decode it? You could still open it as raw bytes if required.

I used strings to mean both. Byte strings can be sliced and indexed no problem because a byte as such is something you may actually want to deal with.

Slicing or indexing into unicode strings is a problem because it's not clear what unicode strings are strings of. You can look at unicode strings from different perspectives and see a sequence of codepoints or a sequence of characters, both can be reasonable depending on what you want to do. Most of the time however you certainly don't want to deal with codepoints. Python however only gives you a codepoint-level perspective.

Guessing encodings when opening files is a problem precisely because - as you mentioned - the caller should specify the encoding, not just sometimes but always. Guessing an encoding based on the locale or the content of the file should be the exception and something the caller does explicitly.

It slices by codepoints? That's just silly, so we've gone through this whole unicode everywhere process so we can stop thinking about the underlying implementation details but the api forces you to have to deal with them anyway.

Fortunately it's not something I deal with often but thanks for the info, will stop me getting caught out later.


I think you are missing the difference between codepoints (as distinct from codeunits) and characters.

And unfortunately, I'm not any more enlightened as to my misunderstanding.

I get that every different thing (character) is a different Unicode number (code point). To store / transmit these you need some standard (encoding) for writing them down as a sequence of bytes (code units, well depending on the encoding each code unit is made up of different numbers of bytes).

How is any of that in conflict with my original points? Or is some of my above understanding incorrect.

I know you have a policy of not replying to people so maybe someone else could step in and clear up my confusion.


Codepoints and characters are not equivalent. A character can consist of one or more codepoints. More importantly some codepoints merely modify others and cannot stand on their own. That means if you slice or index into a unicode string, you might get an "invalid" unicode string back. That is a unicode string that cannot be encoded or rendered in any meaningful way.

Right, ok. I recall something about this - ü can be represented either by a single code point or by the letter 'u' followed by the combining modifier.
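A quick Python sketch of the two representations, and of how code-point slicing can separate the base letter from its combining mark:

    >>> import unicodedata
    >>> single = '\u00fc'        # ü as one precomposed code point
    >>> combined = 'u\u0308'     # 'u' followed by COMBINING DIAERESIS
    >>> single == combined
    False
    >>> unicodedata.normalize('NFC', combined) == single
    True
    >>> len(single), len(combined)
    (1, 2)
    >>> combined[:1]             # slicing by code point strands the base letter
    'u'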

As the user of unicode I don't really care about that. If I slice characters I expect a slice of characters. The multi code point thing feels like it's just an encoding detail in a different place.

I guess you need some operations to get to those details if you need them. Man, what was the drive behind adding that extra complication to life?!

Thanks for explaining. That was the piece I was missing.

bytes.upper is the Right Thing when you are dealing with ASCII-based formats. It also has the advantage of breaking in less random ways than unicode.upper.

And I mean, I can't really think of any cross-locale requirements fulfilled by unicode.upper (maybe case-insensitive matching, but then you also want to do lots of other filtering).

> There Python 2 is just "better" in that issues will probably fly under the radar if you don't prod things too much.

Ah yeah, the JavaScript solution.


Well, Python 3's unicode support is much more complete. As a little example, case conversions now cover the whole unicode range. This holds pretty consistently - Python 2's `unicode` was incomplete.
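One small illustration of the difference (Python 3 shown; Python 2's u'straße'.upper() leaves the ß unchanged):

    >>> 'straße'.upper()     # Python 3 applies the full Unicode case mapping
    'STRASSE'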

> It is unclear whether unpaired surrogate byte sequences are supposed to be well-formed in CESU-8.

According to the Unicode Technical Report #26 that defines CESU-8[1], CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 prior to converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.

[1] http://www.unicode.org/reports/tr26/

From the article:

> UTF-16 is designed to represent any Unicode text, but it can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is all gibberish to me. Can someone explain this in layman's terms?

People used to think 16 bits would be enough for anyone. It wasn't, so UTF-16 was designed as a variable-length, backwards-compatible replacement for UCS-2.

Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of 16-bit code units. The numeric values of these code units denote codepoints that lie themselves within the BMP. While these values can be represented in UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our encoding schemes to be equivalent, the Unicode code space contains a hole where these so-called surrogates lie.

Because not everyone gets Unicode right, real-world data may contain unpaired surrogates, and WTF-8 is an extension of UTF-8 that handles such data gracefully.
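A quick Python sketch of that hole, using U+1F600 as the example character:

    >>> s = '\U0001f600'                  # a code point outside the BMP
    >>> s.encode('utf-16-be').hex()       # UTF-16: the surrogate pair D83D DE00
    'd83dde00'
    >>> s.encode('utf-8').hex()           # UTF-8: a single 4-byte sequence
    'f09f9880'
    >>> # '\ud800'.encode('utf-16-be') raises UnicodeEncodeError, because a lone
    >>> # surrogate is not a Unicode scalar value and has no valid encoding.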

I understand that for efficiency we want this to be as fast as possible. Simple compression can take care of the wastefulness of using excessive space to encode text - so it really just leaves efficiency.

If I was to make a first attempt at a variable length, but well defined backwards compatible encoding scheme, I would use something like the number of bits up to (and including) the first 0 bit as defining the number of bytes used for this character. So,

> 0xxxxxxx, 1 byte
> 10xxxxxx, 2 bytes
> 110xxxxx, 3 bytes

We would never run out of codepoints, and legacy applications can simply ignore codepoints they don't understand. We would only waste 1 bit per byte, which seems reasonable given just how many problems encodings usually present. Why wouldn't this work, apart from already existing applications that do not know how to do this.

That's roughly how UTF-8 works, with some tweaks to make it self-synchronizing. (That is, you can jump to the middle of a stream and find the next code point by looking at no more than 4 bytes.)

As to running out of code points, we're limited by UTF-16 (up to U+10FFFF). Both UTF-32 and UTF-8 unchanged could go up to 32 bits.
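For anyone who wants to see the byte patterns, here is a small Python sketch of the UTF-8 framing just described (only the length rule, not a full validator):

    def utf8_sequence_length(first_byte):
        # the count of leading 1 bits gives the sequence length; continuation
        # bytes all start with 10, which is what makes UTF-8 self-synchronizing
        if first_byte < 0x80:
            return 1          # 0xxxxxxx
        if first_byte >> 5 == 0b110:
            return 2          # 110xxxxx 10xxxxxx
        if first_byte >> 4 == 0b1110:
            return 3          # 1110xxxx 10xxxxxx 10xxxxxx
        if first_byte >> 3 == 0b11110:
            return 4          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        raise ValueError('continuation or invalid leading byte')

    data = 'a\xe9\u20ac\U0001f600'.encode('utf-8')
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        print(data[i:i + n].hex())        # 61, c3a9, e282ac, f09f9880
        i += n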


Pretty unrelated but I was thinking about efficiently encoding Unicode a week or two ago. I think there might be some value in a fixed length encoding but UTF-32 seems a bit wasteful. With Unicode requiring 21 (20.09) bits per code point, packing three code points into 64 bits seems an obvious idea. But would it be worth the hassle, for example as an internal encoding in an operating system? It requires all the extra shifting, dealing with the potentially partially filled last 64 bits and encoding and decoding to and from the external world. Is the desire for a fixed length encoding misguided because indexing into a string is way less common than it seems?
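To make the packing idea concrete, a minimal Python sketch (the names and layout are arbitrary, purely illustrative):

    def pack3(cps):
        # three 21-bit code points per 64-bit word, one bit left over
        assert len(cps) <= 3 and all(cp <= 0x10FFFF for cp in cps)
        word = 0
        for i, cp in enumerate(cps):
            word |= cp << (21 * i)
        return word

    def unpack3(word, count=3):
        return [(word >> (21 * i)) & 0x1FFFFF for i in range(count)]

    w = pack3([ord('a'), 0x20AC, 0x1F600])
    print([hex(cp) for cp in unpack3(w)])   # ['0x61', '0x20ac', '0x1f600']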

When you use an encoding based on integral bytes, you can use the hardware-accelerated and often parallelized "memcpy" bulk byte moving hardware features to manipulate your strings.

But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.

I think you'd lose half of the already-small benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can divide strings appropriate to the use. Sometimes that's code points, but more often it's probably characters or bytes.

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.


Yep. For example, this allows the Rust standard library to convert &str (UTF-8) to &std::ffi::OsStr (WTF-8 on Windows) without converting or even copying data.


An interesting possible application for this is JSON parsers. If JSON strings contain unpaired surrogate code points, they could either throw an error or encode as WTF-8. I bet some JSON parsers think they are converting to UTF-8, but are actually converting to GUTF-8.
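Python's json module, for instance, accepts an escaped lone surrogate; the 'surrogatepass' handler below just stands in for the GUTF-8/WTF-8-style bytes such a parser effectively emits:

    >>> import json
    >>> s = json.loads('"\\ud83d"')                # an escaped lone high surrogate
    >>> len(s), hex(ord(s))
    (1, '0xd83d')
    >>> s.encode('utf-8', 'surrogatepass').hex()
    'eda0bd'
    >>> # s.encode('utf-8') raises UnicodeEncodeError ("surrogates not allowed")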

The name is unserious but the project is very serious, its author has responded to a few comments and linked to a presentation of his on the subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding you have to interact with legacy systems, for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed) but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards.

Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16, they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons).

WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.

WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. It was initially created for Servo (which may need it to have a UTF8 internal representation yet properly interact with javascript), but turned out to first be a boon to Rust's OS/filesystem APIs on Windows.

[0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf

> WTF8 exists solely as an internal encoding (in-memory representation)

Today.

Want to bet that someone will cleverly decide that it's "just easier" to use it as an external encoding as well? This kind of cat always gets out of the bag eventually.


Better WTF8 than invalid UCS2-plus-surrogates. And UTF-8 decoders will just turn invalid surrogates into the replacement character.


I thought he was tackling the other problem, which is that you often find web pages that have both UTF-8 codepoints and single bytes encoded as ISO-latin-1 or Windows-1252

The nature of unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.


Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
                ^
                |
                v
      [grapheme clusters] <-> [characters]
                ^                  ^
                |                  |
                v                  v
           [glyphs]          [codepoints] <-> [code units] <-> [bytes]


So basically it goes wrong when someone assumes that any two of the above are "the same thing". It's often implicit.

That's certainly one important source of errors. An obvious example would be treating UTF-32 as a fixed-width encoding, which is bad because you might end up cutting grapheme clusters in half, and you can easily forget about normalization if you think about it that way.

So, it's possible to make mistakes when converting between representations, eg getting endianness wrong.

Some issues are more subtle: In principle, the decision of what should be considered a single character may depend on the language, nevermind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.

Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points wasn't enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 was assumed to be fixed-width, but no longer could be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn't been assigned yet and allocated them to a "Unicode within Unicode" coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 "code units" (this is our name for each two-byte unit in UTF-16). And for some more terminology, "big code points" are called "supplementary code points", and "small code points" are called "BMP code points."

The weird thing about this scheme is that we bothered to make the "2 small code points" (known as a "surrogate" pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn't break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don't pair properly.

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.

This becomes especially complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the 2 code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It's easier to convert from UTF-16, because you don't need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you'd need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
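To see the two shapes side by side for one supplementary character (using Python's 'surrogatepass' handler to fake the surrogate-pair-based form):

    >>> cp = 0x1F600
    >>> chr(cp).encode('utf-8').hex()        # UTF-8: the canonical 4-byte form
    'f09f9880'
    >>> hi = 0xD800 + ((cp - 0x10000) >> 10)
    >>> lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
    >>> (chr(hi) + chr(lo)).encode('utf-8', 'surrogatepass').hex()   # CESU-8/GUTF-8 style
    'eda0bdedb880'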

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it's ill-formed (has unpaired surrogate code points). It's pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don't enforce wellformedness.

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn't generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points ("Unicode scalar values", which make up "Unicode text"), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.
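A tiny Python sketch of that rule, treating the input as raw UTF-16 code units (just an illustration of the idea, not Simon Sapin's reference algorithm):

    def utf16_code_units_to_wtf8(units):
        out = bytearray()
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # well-formed surrogate pair -> one supplementary code point, 4 bytes
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                out += chr(cp).encode('utf-8')
                i += 2
            else:
                # BMP code point, or a lone surrogate encoded directly (the WTF-8 extension)
                out += chr(u).encode('utf-8', 'surrogatepass')
                i += 1
        return bytes(out)

    print(utf16_code_units_to_wtf8([0xD83D, 0xDE00]).hex())   # f09f9880 (same as UTF-8)
    print(utf16_code_units_to_wtf8([0xD800, 0x0041]).hex())   # eda08041 (lone surrogate kept)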

By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be more clear to say: "the resulting sequence will not represent the surrogate code points." It might be by some fluke that the user actually intends the UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is that the input contains unpaired surrogate code points. That is the case where the UTF-16 will actually end up being ill-formed.


The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.

hmmm... wait... UCS-2 is just a broken UTF-16?!?!

I thought it was a distinct encoding and all related problems were largely imaginary provided you /just/ handle things right...

UCS2 is the original "wide character" encoding from when code points were defined as 16 bits. When codepoints were extended to 21 bits, UTF-16 was created as a variable-width encoding compatible with UCS2 (so UCS2-encoded data is valid UTF-16).

Sadly systems which had previously opted for fixed-width UCS2, exposed that detail as part of a binary layer and wouldn't break compatibility couldn't keep their internal storage at 16 bit code units and move the external API to 32.

What they did instead was keep their API exposing 16 bit code units and declare it was UTF16, except most of them didn't bother validating anything so they're actually exposing UCS2-with-surrogates (not even surrogate pairs since they don't validate the data). And that's how you find lone surrogates traveling through the stars without their mate and shit's all fucked up.

The given history of UTF-16 and UTF-8 is a bit muddled.

> UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units.

This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the version of the standard that introduced surrogate code points. UCS-2 was the 16-bit encoding that predated it, and UTF-16 was designed as a replacement for UCS-2 in order to handle supplementary characters properly.

> UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Not really true either. UTF-8 became part of the Unicode standard with Unicode 2.0, and so incorporated surrogate code point handling. UTF-8 was originally created in 1992, long before Unicode 2.0, and at the time was based on UCS. I'm not really sure it's relevant to talk about UTF-8 prior to its inclusion in the Unicode standard, but even so, encoding the code point range D800-DFFF was not allowed, for the same reason it was actually not allowed in UCS-2, which is that this code point range was unallocated (it was in fact part of the Special Zone, which I am unable to find an actual definition for in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-cover). The distinction is that it was not considered "ill-formed" to encode those code points, so it was perfectly legal to receive UCS-2 that encoded those values, process it, and re-transmit it (as it's legal to process and retransmit text streams that represent characters unknown to the process; the assumption is the process that originally encoded them understood the characters). So technically yes, UTF-8 changed from its original definition based on UCS to one that explicitly considered encoding D800-DFFF as ill-formed, but UTF-8 as it has existed in the Unicode Standard has always considered it ill-formed.

> Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is a bit of an odd parenthetical. Regardless of encoding, it's never legal to emit a text stream that contains surrogate code points, as these points have been explicitly reserved for the use of UTF-16. The UTF-8 and UTF-32 encodings explicitly consider attempts to encode these code points as ill-formed, but there's no reason to ever allow it in the first place as it's a violation of the Unicode conformance rules to do so. Because there is no process that could possibly have encoded those code points in the first place while conforming to the Unicode standard, there is no reason for any process to attempt to interpret those code points when consuming a Unicode encoding. Allowing them would just be a potential security hazard (which is the same rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It has nothing to do with simplicity.


Source: https://news.ycombinator.com/item?id=9611710
