Le Menu Dit: a translation app

This post is by Phil Price.

“Le Menu Dit” is an iPhone app that some friends and I wrote, which translates restaurant menus from English into French. (The name is French for “The Menu Says.”) The friends are Nathan Addy and another excellent programmer who would like to remain nameless for now.

Here’s how the app works from the user’s perspective: you take a photo of a printed (as opposed to handwritten) English-language menu, and the app translates it into French. C’est formidable! We don’t yet have a version that goes the other way, which might be of much more interest to readers of this blog, but the whole problem is interesting and has enough of a statistical component that it seems worth writing up here. Of course, if you or your French-speaking friends want to buy the app, that would be good too! We will soon be releasing versions that go from English into Italian and from English into Chinese, and a bit farther down the road, probably around May, we will finally have one that goes from French into English. See how we do it, below the fold.

[Screenshot of Le Menu Dit results: lines of a menu are shown alongside their French translations.]

The toughest competition for this app is Google Translate, which (1) is famous, (2) is free, and (3) does real-time translations in a totally cool and remarkable way that you should see for yourself. But Google Translate is pretty frustrating to use on a menu — try it and see — and aside from ease-of-use issues it doesn’t actually do a great job at the menu translation either. Don’t get me wrong, it’s an incredible program, almost magical…but for menus, ours is better even though it’s less magical.

The steps are:
1. The user takes a photo of a page of the menu, and crops it to one column of text. It’s unfortunate that we require this user intervention, and this is an obvious place for improvement someday. But for now this is it.

2. We pre-process the image to remove shadows and glare and to straighten it so the text lines are horizontal (we do this locally, not globally, so if the menu has a fold or is floppy, we still end up with horizontal lines). I (Phil) developed the algorithm in R, and Nathan implemented it using Leptonica, an open-source image-processing library. Using that library required a few compromises that led to results not quite as good as those from my R algorithm, so this is an area we could revisit if it’s worth putting more resources into it.
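To give a flavor of the straightening step, here is a minimal sketch (in Swift, and not the Leptonica-based code the app actually uses) of one standard way to estimate local skew: shear a horizontal band of the binarized image through candidate angles and keep the angle whose row-projection profile is the most sharply peaked. The function name and parameters here are illustrative only.

```swift
import Foundation

// Sketch only: estimate the skew of one horizontal band of a binarized page
// (true = ink pixel) by shearing it through candidate angles and keeping the
// angle whose row-projection profile has the highest variance; sharp peaks
// mean the text lines are level after correction. This illustrates the general
// idea, not the app's Leptonica-based implementation.
func estimateSkewDegrees(band: [[Bool]],
                         maxAngle: Double = 5.0,
                         step: Double = 0.25) -> Double {
    let rows = band.count
    let cols = band.first?.count ?? 0
    guard rows > 0, cols > 0 else { return 0 }

    func projectionVariance(angle: Double) -> Double {
        let slope = tan(angle * .pi / 180)
        var counts = [Double](repeating: 0, count: rows)
        for r in 0..<rows {
            for c in 0..<cols where band[r][c] {
                // Shear each ink pixel vertically according to the candidate angle.
                let shifted = Int((Double(r) + slope * Double(c)).rounded())
                if shifted >= 0 && shifted < rows { counts[shifted] += 1 }
            }
        }
        let mean = counts.reduce(0, +) / Double(rows)
        return counts.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / Double(rows)
    }

    var best = (angle: 0.0, score: -Double.infinity)
    var angle = -maxAngle
    while angle <= maxAngle {
        let variance = projectionVariance(angle: angle)
        if variance > best.score { best = (angle, variance) }
        angle += step
    }
    return best.angle   // rotate/shear the band by -best.angle to level its lines
}
```

Because the estimate is made band-by-band down the page rather than once for the whole image, a folded or floppy menu can still come out with level lines.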

3. We send the image to an open-source optical character recognition (OCR) program called Tesseract. Tesseract does a decent but not great job. This is an area where we are way behind Google Translate: they do fantastic OCR really quickly, whereas we do decent OCR rather slowly. One of the problems is that the program hasn’t been “trained” on a lot of fonts that are common on menus: fancy italics and display fonts. We have tried training Tesseract on more fonts and seen very little improvement; we don’t know why. We also have problems with ampersands, bullets, and ellipses…you know, dot dot dot like these…
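One cheap mitigation for the ampersand/bullet/ellipsis problems is a cleanup pass on the OCR output before any matching. The sketch below is hypothetical; the particular substitutions are assumptions for illustration, not our actual rules.

```swift
import Foundation

// Hypothetical post-OCR cleanup for the problem characters mentioned above.
// The specific substitutions are illustrative assumptions, not the app's rules.
func normalizeOCRLine(_ raw: String) -> String {
    var s = raw
    // Collapse the Unicode ellipsis character and runs of periods used as dot leaders.
    s = s.replacingOccurrences(of: "\u{2026}", with: " ")
    s = s.replacingOccurrences(of: "\\.{2,}", with: " ", options: .regularExpression)
    // Drop bullets and similar glyphs that OCR often mangles.
    for junk in ["•", "●", "◦", "*"] {
        s = s.replacingOccurrences(of: junk, with: " ")
    }
    // Spell out ampersands so "mac & cheese" can match a phrase like "mac and cheese".
    s = s.replacingOccurrences(of: "&", with: " and ")
    // Squeeze repeated whitespace and trim.
    s = s.replacingOccurrences(of: "\\s+", with: " ", options: .regularExpression)
    return s.trimmingCharacters(in: .whitespaces)
}

normalizeOCRLine("Soup of the day.....  mac & cheese •")   // "Soup of the day mac and cheese"
```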

4. We use our own “error-tolerant phrase-matching” algorithm to do the translation. We take the OCR output and convert it to translated text: “sp1cy chiokon over nce” gets matched to “spicy chicken over rice,” which is a phrase in our dictionary, so we replace it with its translation. I wrote the algorithm in Swift, which is a new language Apple created (and which I really like), and we thought we could drop that right into the app — that’s the main point of the language, after all — but we needed more control over when to release memory, so the Mystery Programmer translated my code into C, pretty much line by line. This is another area where we could improve, not so much on accuracy as on speed, and it is the most statistical of the tasks, so I’ll describe it a bit here.

The first thing to understand is that we don’t try to solve the “real” translation problem, which is really hard: we don’t parse the sentence to try to virtually diagram it in order to figure out the subject, verb, which adjectives modify which nouns, etc., and then apply language grammar rules to translate everything. Instead, we just try to match phrases. This is why we only do menus: the restricted corpus means we have a good chance of matching multi-word phrases. We get the right order for “spicy chicken” when translated into French simply because “spicy chicken” is in our phrase dictionary (which contains about 38K phrases, including single words) so we simply replace it with “poulet épicé” or whatever.
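Here is a toy version of that exact-match case. The dictionary entries below are made-up examples; the real dictionary holds tens of thousands of phrase-translation pairs.

```swift
// Toy illustration of the exact-match case. These entries are made-up examples;
// the real dictionary holds tens of thousands of phrase-translation pairs.
let phraseDictionary: [String: String] = [
    "spicy chicken": "poulet épicé",
    "over rice": "sur du riz",
    "chicken": "poulet",
]

// Look up a phrase; nil means it is not in the dictionary.
func translateExact(_ phrase: String) -> String? {
    return phraseDictionary[phrase.lowercased()]
}

translateExact("Spicy Chicken")   // Optional("poulet épicé")
```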

There are two complications. One is that a phrase like “sp1cy chiokon over nce” can be broken up into shorter phrases in many ways: [sp1cy chiokon][over nce], [sp1cy chiokon over][nce], [sp1cy][chiokon over nce], and so on. The other issue, of course, is that there are errors in the OCR, so we need to know when to say two phrases match well enough and when they don’t. We have “chicken marsala” in our phrase dictionary. We have “moussaka” in our dictionary. Suppose the OCR output is “chicken mosaka.” We could match the OCR output to [chicken marsala], which is a two-word phrase that we do have in our dictionary; or [chicken][moussaka], which are two individual words that we have in the dictionary; or [chicken][mosaka], which has one word that is in our dictionary and one that isn’t, in which case we would translate the “chicken” but pass “mosaka” through unaltered. Which of these should we choose?

Our current way of handling the multiple ways of partitioning the phrase is simply to try all of them, score them all using an approach described below, and take the one with the highest score. We try all possible ways of partitioning any phrase of 5 words or fewer. For longer phrases, we break them into the initial 5 words plus whatever comes later. If the solution we accept for the first five words ends with an orphan word or a two-word phrase, we stick that word or words onto the start of the next set of words. So for “Our special spicy chicken over rice” we would first do the five-word phrase “Our special spicy chicken over”, for which the best partition might be [Our special][spicy chicken][over], so we would go ahead and take the first two phrases and then be left with “over rice”, which it turns out we also have in the dictionary. This works really well but not perfectly.
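A minimal sketch of the enumeration step for a run of five or fewer words; the function is mine for illustration, but it produces exactly the kind of candidate partitions described above, which are then scored.

```swift
// Sketch: enumerate every way to split a short run of words (five or fewer)
// into consecutive phrases, so that each candidate partition can be scored.
func allPartitions(of words: [String]) -> [[[String]]] {
    if words.isEmpty { return [[]] }          // one partition: the empty one
    var result: [[[String]]] = []
    for headLength in 1...words.count {
        let head = Array(words.prefix(headLength))
        for rest in allPartitions(of: Array(words.dropFirst(headLength))) {
            result.append([head] + rest)
        }
    }
    return result
}

// allPartitions(of: ["spicy", "chicken", "over", "rice"]) yields 8 candidates,
// including [["spicy", "chicken"], ["over", "rice"]]
// and [["spicy"], ["chicken", "over", "rice"]].
```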

Our scoring system takes four things into account:
(1) How many words are in the phrase?
(2) How many errors are there if we accept a given match to a given partition? If the OCR input is “chicken mosaka”, that matches [chicken marsala] with 3 errors. (Note that we have to be able to compare a 6-letter word with a 7-letter word, so we can’t just check whether the first characters agree, then the second characters, and so on: “hannburger” and “hamburger” agree with 2 errors, not 8. See the edit-distance sketch just after this list.)
(3) How common is the phrase? “chicken marsala” gets a -3 on our logarithmic (base-10) scale of relative frequency, in which the most common phrases get a 0. We generated our phrase dictionary by screen-scraping 21,000 menus from OpenTable, a restaurant reservations website, and then had them all translated into French (and Italian and Chinese).
(4) How “translatable” is the phrase (or the word)? “Hot” is not very translatable, because it could mean either “spicy” or “high-temperature.” Similarly, “with hot” is not very translatable. Other two-word phrases like “hot sauce” or “hot soup” or “hot dog” are a lot better. At the moment, almost everything in our phrase dictionary is given a default “translatability” score: only about 600 out of our 37,000 phrases have a different value.
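For item (2), the error count has to come from an alignment rather than a position-by-position comparison. Below is a minimal sketch of the standard Levenshtein (edit) distance; our actual matcher is tuned for speed and error tolerance, so treat this purely as an illustration of the counting.

```swift
// Standard Levenshtein (edit) distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn one string into the other.
func editDistance(_ a: String, _ b: String) -> Int {
    let x = Array(a), y = Array(b)
    if x.isEmpty { return y.count }
    if y.isEmpty { return x.count }
    var prev = Array(0...y.count)                 // row for the empty prefix of x
    for i in 1...x.count {
        var curr = [Int](repeating: 0, count: y.count + 1)
        curr[0] = i
        for j in 1...y.count {
            let subCost = (x[i - 1] == y[j - 1]) ? 0 : 1
            curr[j] = min(prev[j] + 1,            // deletion
                          curr[j - 1] + 1,        // insertion
                          prev[j - 1] + subCost)  // substitution (or match)
        }
        prev = curr
    }
    return prev[y.count]
}

editDistance("hannburger", "hamburger")   // 2, as in the example above
```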

Figuring out how to use the four sources of information seems like a great place to apply Bayes’s rule. I did start out thinking about it that way, and found it helpful, but in the end I made up a scoring system that seemed to make sense, based on a linear combination of the four individual factors, and then tinkered with the weights by hand.
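To make the shape of such a score concrete, here is a hedged sketch; the struct, the weights, and the scales are placeholders invented for illustration, since all I will claim is that the real score is a hand-tuned linear combination of the four factors.

```swift
// Sketch of a hand-tuned linear score over the four factors listed above.
// The struct and the weights are invented for illustration.
struct MatchCandidate {
    var wordCount: Int           // (1) words in the matched phrase
    var editErrors: Int          // (2) character errors vs. the OCR text
    var logFrequency: Double     // (3) log10 relative frequency; 0 = most common
    var translatability: Double  // (4) higher = less ambiguous to translate
}

func score(_ c: MatchCandidate) -> Double {
    // Made-up weights; the real ones were tinkered with by hand.
    let wWords = 1.0, wErrors = -1.5, wFreq = 0.5, wTrans = 1.0
    // logFrequency is at most 0, so rarer phrases pull the score down.
    return wWords * Double(c.wordCount) + wErrors * Double(c.editErrors) +
        wFreq * c.logFrequency + wTrans * c.translatability
}

// score(MatchCandidate(wordCount: 2, editErrors: 3, logFrequency: -3, translatability: 1.0))
```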

The translation procedure works well, actually…really quite well…but it’s a bit slow. It can take more than 30 seconds to translate a page of a menu on an iPhone 6, and noticeably longer on an iPhone 5. The main problem is that in order to cope with OCR problems, we have to be very tolerant of errors, accepting matches even if quite a few characters are wrong. We really want to match “ch1cken somdwoch” to “chicken sandwich.” At the moment, in order to do that we calculate the number of errors we are willing to tolerate for a string of a given length, and then search every string in the phrase dictionary until we either find that it matches well enough for consideration or we have encountered that number of errors. (Well, actually we first check to see if we have the string with 0 errors. If not, we try with 1 error. And so on. This does mean doing a lot of redundant searches, but it is much better than searching for the string with five errors when in fact we have it with 0 or 1.) Anyway, an obvious improvement would be a stopping rule so we reject a phrase if _either_ the total number of errors exceeds our threshold _or_ we encounter a given number of errors within a moving window of a given length; so, even if we’re willing to accept 5 errors in a long string, if we hit three in a row we would reject the match…something like that.
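Here is a toy version of that proposed stopping rule, applied for simplicity to a position-by-position comparison (a real matcher would also have to handle insertions and deletions). The window size and thresholds are placeholders.

```swift
// Toy version of the proposed stopping rule, applied to a simple
// position-by-position comparison (a real matcher also has to handle
// insertions and deletions). Window size and thresholds are placeholders.
func matchesWithinBudget(_ ocr: String, _ candidate: String,
                         maxTotalErrors: Int,
                         windowSize: Int = 4,
                         maxWindowErrors: Int = 2) -> Bool {
    let a = Array(ocr), b = Array(candidate)
    guard abs(a.count - b.count) <= maxTotalErrors else { return false }
    var totalErrors = abs(a.count - b.count)   // count the length difference as errors
    var recent: [Bool] = []                    // sliding window of mismatch flags
    for i in 0..<min(a.count, b.count) {
        let mismatch = a[i] != b[i]
        if mismatch { totalErrors += 1 }
        recent.append(mismatch)
        if recent.count > windowSize { recent.removeFirst() }
        if totalErrors > maxTotalErrors { return false }                   // global budget blown
        if recent.filter({ $0 }).count > maxWindowErrors { return false }  // local burst of errors
    }
    return true
}

// true: 4 scattered mismatches, never more than 2 in any 4-character window.
matchesWithinBudget("ch1cken somdwoch", "chicken sandwich", maxTotalErrors: 5)
```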

Also, we aren’t thrilled with the way we present the results:
1. We do the translation and present the results line-by-line, not item-by-item. We should be able to combine several types of information to do a pretty good job of recognizing where one multi-line item ends and the next begins, but we haven’t put time into this. The amount of space above and below a given line, whether it starts with a capital letter or a CAPITALIZED WORD, whether the first few words are in bold or are larger than what follows, and so on: we could come up with a way to combine all of this information to separate menu items (see the sketch after this list).
2. To facilitate ordering, it’s important to show both the original text and the translation together: otherwise, if someone decides they want the “Vivaneau”, they have no way of knowing that they should tell the waiter they want the “Snapper.” But the fact that we can’t currently recognize when one item ends and the next begins, plus the fact that the optical character recognition can be fairly poor, means that we don’t want to simply display the raw OCR output followed by its translation into French; instead, we show a horizontal slice of the photograph that contains a line of text, and follow that with the translation of the text. But when we tried showing just those two components, some menus had severe legibility problems unless people zoomed in on the results…and once zoomed in, they had to scroll around to read the translation, which was very inconvenient. Ugh. So for now we show all three: a slice of the photo, the OCR output, and the translation. This makes the result a bit busy, and is another obvious thing to improve.
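Here is a rough sketch of what the item-segmentation idea in point 1 could look like; the features, weights, and threshold are placeholders, since we have not implemented or tuned any of this.

```swift
// Rough sketch of a line-level "does a new item start here?" score, combining
// the layout cues mentioned in point 1. Features, weights, and the threshold
// are placeholders; nothing here is implemented or tuned.
struct LineFeatures {
    var gapAboveRelative: Double   // vertical gap above this line / median gap
    var startsCapitalized: Bool    // first word begins with a capital letter
    var startsAllCaps: Bool        // first word is ALL CAPS
    var relativeHeight: Double     // line height / median height (proxy for bold or large type)
}

func newItemScore(_ f: LineFeatures) -> Double {
    var score = 0.0
    score += 2.0 * (f.gapAboveRelative - 1.0)   // extra whitespace above suggests a new item
    score += f.startsCapitalized ? 0.5 : -0.5
    score += f.startsAllCaps ? 1.0 : 0.0
    score += 1.5 * (f.relativeHeight - 1.0)     // a taller first line suggests an item name
    return score
}

// Declare a new item when newItemScore(...) exceeds some threshold, say 0.5.
```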

Overall, we are very proud of the app, although we do not expect it to sell very well or to make us any money.

Le Menu Dit is available to try for free on the App Store. You can use it for three sessions without paying — a “session” counts as a launch of the app during which you do at least one translation, so if you launch it and use it to translate every page of a multi-page menu, for example, that counts as one session. The app rents for $0.99 per week, or you can buy unlimited uses (and why wouldn’t you?) for $2.99. (Actually, the prices are set by Apple; we just choose the price tier. If you do not use the U.S. App Store, the rental price will be Apple’s lowest non-free price, and the purchase price will be their third-lowest non-free price.)

We include a few test menus so people can try the app if they don’t have access to a printed English-language menu. (These do not count against your free uses.) We deliberately chose real-world, imperfect photos of menus, complete with glare and skew, to give a fair impression of the results people can expect.

Interestingly, you can’t really test the app (or use it) with a photo of a menu on a computer screen: the moiré effect messes up the photo. Actually, it is possible to play around with the angle of the phone to the screen, and the distance between phone and screen, in order to get it to work, but it’s a fairly frustrating experience.

Comments welcome, either here or via the feedback page.

This post is by Phil Price.

5 thoughts on “Le Menu Dit: a translation app”

  1. Random thought: Can you use the GPS info to increase accuracy? E.g., if you can cross-correlate the device GPS with something like the Yelp street info for a hotel, you could get a head start on accessing a menu without OCR? At least to arbitrate the unsure words from OCR.

    Alternatively, as you grow in use, you can cross-correlate multiple OCR results from the same geofenced location to boost the accuracy of badly captured words?

  2. Perhaps you could leverage the other information on the page, and adjust your scoring based on other menu items. E.g., ‘chicken mausaka’ would become ‘chicken masala’ if ‘vindaloo’ appears elsewhere on the page, or ‘chicken moussaka’ if ‘hummus’ appears on the page. Obviously you don’t want to make it one huge optimization, but maybe classify all the sure bets first, then go back to resolve the uncertain ones.

    • Rahul’s idea and this one have a few things in common, so I’m replying to both comments here.

      To address Rahul’s issue first:
      One thing I didn’t mention in the writeup is that our entire app lives on the phone; it doesn’t require or use an internet connection (except to send us feedback, and of course to download or buy the app in the first place). Even now, lots of foreign travelers don’t have a data plan, so we really want to emphasize doing as well as we can without external data. That said, of course in principle we could use additional information, and yes, one source could be: find out where the phone is, find out if there’s a restaurant there, look up its website, see if it has a menu, and use that to inform the translation. If we had a lot more development time and money, and if it were worth doing, we could even have an external server do all of that (or, indeed, do all of the translating), which would speed things up and let people always get the best results without having to upgrade the app. If we went this route we wouldn’t try to have an extremely intelligent system; we would at least start by seeing whether the restaurant is in OpenTable or one of the other restaurant aggregators, since they use a standardized URL format that makes it easy to find the menu if there is one. Of course we (probably) wouldn’t just use the online menu and translate that, since a lot of those are not kept up to date, but we could certainly read the menu and use it to adjust the probabilities in our phrase-matching. Which is how this connects to Nick’s idea.

      So…
      Yes, it’s a good idea to use information about the other items on the menu (or the items that we know are typically sold at the restaurant, if we’re using info as discussed in the paragraph above). This is something we have thought about, and it would not be hard for us to implement a probably-good-enough version: instead of having a single “frequency” score for each word or phrase (see item (3) in the list of things we take into account), we could create ten to twenty genres of food, and each phrase could have a different frequency score in each. Genres could be things like “Steakhouse”, “Seafood”, “Chinese”, and so on. We could do a first pass through the menu with a simplified algorithm (maybe just matching 1- and 2-word phrases), use that to pick a genre, and then run the same algorithm we use now, but with the frequencies appropriate to that genre.
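      Just to make that concrete, here is a bare-bones sketch of what a per-genre frequency table could look like; the genres, phrases, and numbers are all made up for illustration, not app code.

      ```swift
      // Bare-bones sketch of per-genre phrase frequencies (made-up examples).
      enum Genre { case steakhouse, seafood, chinese, general }

      let genreFrequency: [String: [Genre: Double]] = [
          "chicken marsala": [.general: -3.0, .steakhouse: -2.5, .seafood: -4.0, .chinese: -6.0],
          "moussaka":        [.general: -4.0, .steakhouse: -5.0, .seafood: -5.0, .chinese: -7.0],
      ]

      // Fall back to the general score when a phrase has no entry for the chosen genre.
      func frequency(of phrase: String, in genre: Genre) -> Double {
          let table = genreFrequency[phrase] ?? [:]
          return table[genre] ?? table[.general] ?? -5.0
      }
      ```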

      But we are probably not going to do this. It’s not a lot of work but it’s not trivial either; our current solution is pretty good so any improvement would necessarily be fairly small; and from a user perspective it would be much better to put our effort into a better display of results than into a small improvement in translation quality. If the app catches on (or, rather, if the apps catch on, since we will release a few more for other languages) then we might eventually get around to this.

      Thanks for the suggestions. They are good ideas.

  3. This looks like an interesting theoretical exercise, but I can see very little practical use for it. Someone whose native language is French who visits the US (I presume that these are American menus) will typically have a minimal understanding of English words anyway. But a bigger problem is that the food itself is culturally specific. That is, you have to ask why someone would want the translation, and what they would hope to gain from it.

    Take, for example, “cinnamon stuffed french toast cinnamon raisin bread with mascarpone jam filling”. I’m presuming that this has been reformatted from something like:
    (Name of dish: Bold, 16pt type) Cinnamon stuffed French toast
    (Description: Normal, 12pt type) Cinnamon raisin bread with mascarpone jam filling

    Here’s the translation:
    cannelle pain perdu farci cannelle raisin pain avec mascarpone confiture remplissage
    That is just horrible. It’s basically gibberish to a French speaker. It’s the kind of thing that, when the equivalent appears on a menu in Paris or Beijing translated into English, we post images of it on Facebook for people to point and laugh at. To a French speaker, it looks like a text message, or something that someone who was bleeding to death might write in a last attempt to say who killed them. It might as well be written in ALL CAPS.

    But why is that? Aren’t we getting quite good at machine translation?

    Well, one issue is that English (particularly modern American English in a retail food service setting) has the interesting property that you hardly need any articles, participles, or prepositions to get your message across. Other languages simply don’t work like that. Sure, you can take all those parts of speech out of a French sentence, but you will cause major discomfort in the reader’s mind. (Also, “Remplissage” means “filling” in the sense of the procedure by which you fill a bottle of water. The reader might, or might not, work that out. Again, this probably arises because the software has no context. At least it didn’t translate “filling” as “plombage”, which is the kind of filling you get at the dentist.)

    A reasonable translation might be:
    (Bold, 16pt type) Pain perdu à la cannelle, farci
    (Normal, 12pt type) Pain perdu de brioche à la cannelle et aux raisins secs, farci de fromage mascarpone et de confiture
    See those short (one-, two-, and three-letter) words? They are really important. Without them, a French speaker is almost as uneasy reading the sentence as if they had had to look up the words in a dictionary. (I’ll assume that the app works quicker than looking words up in a dictionary, although I do wonder about how big the overlap is between “French speakers who don’t know enough English vocabulary to order breakfast” and “French speakers who are happy cropping a photo of text into one column in a restaurant during the three minutes that typically elapse between sitting down and having someone standing there expecting you to have decided what to eat”). We have a long way to go before automatic translation systems can fill in the words that weren’t on the page because they weren’t necessary in the national, linguistic, and commercial context in which the original sentence was written.

    Oh, and I haven’t even started on what most French people would make of the idea of cinnamon French toast, period, never mind the version filled with cream cheese and jam. :-))
