“Hello, World!” with Emoji (Stan edition)

Brian Ward sorted out our byte-level string escaping in Stan in this stanc3 pull request and the corresponding docs pull request.

In the next release of Stan (2.28), we’ll be able to include UTF-8 in our strings and write “Hello, World!” without words.

transformed data {
  print("🙋🏼 🌎❗");
}

I’m afraid we still don’t allow Unicode identifiers. So if you want identifiers like α or ℵ or 平均数 then you’ll have to use Julia.

4 thoughts on ““Hello, World!” with Emoji (Stan edition)”

  1. So what’s the problem with making Unicode available? Not that I need them, but just curious.

    Is it a security risk? Or does it need additional code to make them available?

  2. Personally, I think Unicode identifiers in code are a bad idea. There are lots of difficulties in getting multilingual display right, and, in principle, Unicode text doesn’t include all the information you need, e.g. what language you think you are writing. If you use “i平均数” as an integer, that’s probably an average of some sort, and it may need to be displayed differently (with different glyphs) depending on whether the bloke who wrote it thought it was for consumption in China, Taiwan, or Japan. And it might mean something different as well.

    Handling Unicode text as data can be hard because the “correct” internal representation depends on what you want to do with it. Some folks like UTF-8 as an internal representation. It makes file reading/writing transparent and fast and doesn’t bloat your file size. It’s fine if you are not going to be parsing that text, counting or doing hash-table lookups on substrings (“words”), and the like. If you want to parse your data, search for substrings, or use hash tables, you might prefer a flat 32-bit internal representation. (Although even that has some craziness with certain special codes.)
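The trade-off described above is easy to see in a few lines of Python (a minimal sketch, not Stan code): UTF-8 is compact but variable-width, so you can’t index code points by byte offset, while a flat 32-bit encoding like UTF-32 quadruples the size of ASCII text but gives fixed-width indexing.

```python
# Compare UTF-8 (compact, variable-width) with a flat 32-bit
# representation (bloated, but fixed-width) for the same string.
s = "Hello, 🌎!"                # 9 code points; the emoji is U+1F30E

utf8 = s.encode("utf-8")       # ASCII chars take 1 byte; the emoji takes 4
utf32 = s.encode("utf-32-le")  # every code point takes exactly 4 bytes

print(len(s), len(utf8), len(utf32))   # 9 12 36

# In UTF-8, byte offset != code-point index: the emoji (code point 7)
# starts at byte 7 here only because everything before it is ASCII,
# and it spans 4 bytes.
assert utf8[7:11].decode("utf-8") == "🌎"

# In UTF-32-LE, code point i always occupies bytes 4*i .. 4*i+4.
assert utf32[4 * 7:4 * 8].decode("utf-32-le") == "🌎"
```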

  3. I actually had no idea that Stan had character strings at all. I still don’t think we have labeled vectors (for example, if you fit a multilevel model with varying intercepts for the 50 states, I don’t think Stan lets you tag these intercepts with state names), so I’m not really clear on what the point of the Unicode thing is in practice. I assume it must add some useful functionality but I don’t know what it is.

  4. @Andrew: This is just for comments or providing output, so not very useful at all. It was just a side effect of fixing an internal bug. We don’t have any plans to label vectors as there’s not a good place in Stan to output those labels and it’s really inefficient to access by label compared to accessing by index.

    @Rahul: Partly it’s what @David J. Littleboy said—we’re worried users will mess up encodings.

    @David J. Littleboy: Agreed. I worked in natural language processing for 20+ years and saw just about every mistake someone can make with encodings. You’re right that the glyphs don’t determine language. If I write “voyage”, nobody can tell whether it’s meant to be French or English, so they won’t know how to pronounce it. But I don’t see how that’s more of a problem outside of ASCII. Technically, Unicode defines code points, and each code point has one or more glyphs that can be used to render it (the character ‘a’ doesn’t determine font, for example). The internal encoding issue is tricky. UTF-8 is ideal for languages that are mostly ASCII, like English, German, French, or Spanish. But it’s not as compact as UTF-16 for languages like Hindi or Chinese. Java does a really good job of providing an API on top of this mess. Code points aren’t even enough for search. You need something like the International Components for Unicode (ICU) package to standardize. Otherwise, you can’t search for the character ä, because there’s more than one way to write it as a sequence of Unicode code points (the two obvious ones are the precomposed a-with-umlaut character, or an ‘a’ followed by the combining umlaut character). Language is a mess.
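Both points above, the encoding-compactness trade-off and the two spellings of ä, can be demonstrated with Python’s standard `unicodedata` module (a small sketch; a full solution needs something like ICU, as the comment says):

```python
import unicodedata

# Two ways to write "ä": one precomposed code point, or
# an 'a' followed by the combining diaeresis (umlaut) character.
precomposed = "\u00e4"   # ä as a single code point
combining = "a\u0308"    # 'a' + combining diaeresis

# A naive substring search treats them as different strings.
text = "Ba\u0308r"                 # "Bär" written with the combining form
assert precomposed != combining
assert "\u00e4" not in text        # naive search misses it

# Normalizing to NFC (composed form) makes them comparable.
assert unicodedata.normalize("NFC", combining) == precomposed
assert "\u00e4" in unicodedata.normalize("NFC", text)

# Compactness depends on the language: UTF-8 wins for mostly-ASCII
# text, UTF-16 wins for e.g. Devanagari (3 bytes/char in UTF-8).
english, hindi = "hello", "नमस्ते"
assert len(english.encode("utf-8")) < len(english.encode("utf-16-le"))
assert len(hindi.encode("utf-8")) > len(hindi.encode("utf-16-le"))
```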
