[dns-operations] Emoji "Female" symbol fails to resolve at Google's 8.8.8.8 & 8.8.4.4

Wed Jul 12 13:56:32 UTC 2017

On Wed, Jul 12, 2017 at 02:31:02PM +0100, James Stevens wrote:
> So you could argue that a resolver that is "fully IDNA2008 compliant" might
> block an emoji domain name?

Yes, it will.  Anything that does IDNA2008 will not look up or
register labels containing characters that are not letters or digits,
according to the Unicode properties of those characters.  There are a
bunch of other
rules too (NFC, stable under case folding -- which is why IDNA2008 has
no capitals -- and so on).  I don't know of any browsers today that
implement only 2008, however.  Most do something approximating UTS#46.
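
To make the rules above concrete, here is a rough sketch in Python.
It is not an IDNA2008 implementation: the real derived property is
defined in RFC 5892 and has exceptions and contextual rules that this
ignores.  The function name roughly_idna2008_ok is made up for the
illustration, and it only checks the three things mentioned above
(letters/digits by General_Category, NFC, stable under case folding),
plus combining marks and the hyphen.

```python
import unicodedata


def roughly_idna2008_ok(label: str) -> bool:
    """Crude approximation of the checks described above.

    Assumption: this is a sketch, not the RFC 5892 derived property.
    """
    # Must already be in Normalization Form C.
    if unicodedata.normalize("NFC", label) != label:
        return False
    # Must be stable under case folding (so no capitals).
    if label.casefold() != label:
        return False
    for ch in label:
        if ch == "-":
            continue
        cat = unicodedata.category(ch)
        # Letters (L*), combining marks (M*) and decimal digits (Nd) only.
        if not (cat[0] in ("L", "M") or cat == "Nd"):
            return False
    return True


if __name__ == "__main__":
    samples = {
        "ascii": "example",
        "capital": "Example",
        "female sign U+2640": "\u2640",
        "hieroglyph U+13000": "\U00013000",
    }
    for name, label in samples.items():
        cats = [unicodedata.category(ch) for ch in label]
        print(name, cats, roughly_idna2008_ok(label))
```

The female sign comes out with General_Category So (a symbol) and is
rejected, while the hieroglyph is Lo (a letter) and passes.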

> /Personally/, I think emoji domain names are a bit silly, but I don't think
> it's necessary to ban or block them.

It's got nothing to do with "banning" or "blocking".  The Unicode
Technical Committee sets the properties on emojis, and they're
symbols.  For good reasons, Unicode itself (and Mark Davis in
particular) recommended to IDNABIS that symbols not be used in domain
names.  This is consistent with their stance on symbols in
identifiers, which they have encoded in UAX #31 and UTS #39.  (Why
emojis are permitted or even encouraged by UTS #46 is anyone's guess.)

> I can sympathise with ICANN not wanting them in the ROOT, but they seem
> pretty harmless at the second level - especially as Egyptian Hieroglyphs are
> allowed.

That's because they're letters.  Hieroglyphs, in particular, do not
have presentation modes that are entirely up to the implementer, and
they don't have the coloration business, and they have normalization
forms (though IIRC the hieroglyphs are sufficiently clean that they
don't need normalization).  Emojis, on the other hand, have none of
those properties, so you can't tell whether a face with a dark skin
tone is a distinct identifier from the yellow one (for a simple and
obviously troublesome example).
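
A small Python sketch of the skin-tone problem, for illustration only.
The helper name same_after_normalization is invented here.  It shows
that none of the four Unicode normalization forms treats a
thumbs-up with a skin-tone modifier as the same string as the plain
one, whereas a composed and a decomposed "e with acute" do come out
equal.  Whether the two emoji sequences *ought* to be one identifier
is exactly the question normalization does not answer.

```python
import unicodedata

THUMBS_UP = "\U0001F44D"                 # plain thumbs-up
THUMBS_UP_MEDIUM = "\U0001F44D\U0001F3FD"  # same, plus a skin-tone modifier

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_after_normalization(a: str, b: str) -> bool:
    """True if any normalization form maps both strings to the same result."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


if __name__ == "__main__":
    # The modifier is a separate code point appended to the base emoji.
    print([hex(ord(ch)) for ch in THUMBS_UP_MEDIUM])
    # Normalization does not unify the two emoji sequences.
    print(same_after_normalization(THUMBS_UP, THUMBS_UP_MEDIUM))
    # For comparison: decomposed vs. precomposed e-acute are unified.
    print(same_after_normalization("e\u0301", "\u00e9"))
```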

> However, it would be easy for browsers to warn users if the URL is made from
> a mixture of different character sets - we did this in a registry system to
> trigger a warning for such applications - so I'm not sure why they haven't
> done that.

Because there hasn't been a consistent standard, and there isn't one
because it's very, very hard to do this algorithmically.  Some
languages are written in
more than one script (which is what I think you mean by "character
set" -- it's all one coded character set), and in some local contexts
mixed script identifiers might be sane too.
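
As a sketch of why the naive check misfires, here is a deliberately
crude mixed-script detector in Python.  The standard library's
unicodedata module does not expose the Unicode Script property, so
this guesses the script from the first word of the character name.
That is an assumption made purely for the illustration, not how a
real implementation should do it, and the function names are invented.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """Guess a script from the first word of the Unicode character name.

    Assumption: a stand-in for the real Script property, which the
    Python standard library does not provide.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(label: str) -> set:
    """Collect the guessed scripts of the alphabetic characters in a label."""
    return {script_guess(ch) for ch in label if ch.isalpha()}


def looks_mixed(label: str) -> bool:
    """Naive rule: more than one guessed script means 'mixed'."""
    return len(scripts_in(label)) > 1


if __name__ == "__main__":
    spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
    japanese = "\u65e5\u672c\u306e\u30c9\u30e1\u30a4\u30f3"  # kanji + hiragana + katakana
    for label in ("paypal", spoof, japanese):
        print(sorted(scripts_in(label)), looks_mixed(label))
```

The spoofed label is flagged, which is what you want, but so is the
perfectly ordinary Japanese one, since Japanese is routinely written
in several scripts at once.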

Internationalization is really hard.  The IETF has a miserable time at
it, partly because there are a lot of engineers who don't use anything
other than Latin.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com