Simplified lookup - too simple?

Hi,
Referring to the simplified lookup (http://cal10n.qos.ch/manual.html#simplifiedLookup), I think this has been simplified too far.
Firstly, I think your simplification by removing the default properties file is justified because it makes it easier to test the robustness of the application, which is a Good Thing. The decision may be controversial because it means that translations have to be done to completion or not at all; partially complete translations will fail. But it is consistent with Cal10n's stated objective of supporting verifiably self-consistent resource bundles.
However, there is another issue that I am unhappy about. Java's Locale allows language, country and region to be specified. You have removed the region (or not documented it). Although the region is not often used, I think it is necessary and should be supported.
Rick
-- Big Bee Consultants Limited : Registered in England & Wales No. 6397941 Registered Office: 71 The Hundred, Romsey, Hampshire, SO51 8BZ

Hi Rick,
Thank you for your comments.
As you stated, the simplification is driven by verification purposes. Partially incomplete translations will fail only if keys are missing; otherwise, verification for such translations will succeed.
As for region support, it could easily be added. If someone comes with a concrete use case, the issue can be rectified. I just don't see anyone doing a translation by region. In Switzerland, for example, we have 4 national languages, German, French, Italian and Romansh, which are spoken in different linguistic regions. Romansh is spoken by a very small minority. Bundles for each of the 3 major regions can be specified as de_CH, fr_CH, it_CH (ignoring Romansh).
I am actually considering removing the parent-child relationship between language and language+country. It just does not make sense to do a different translation for en_UK and en_US. You would do one translation for English "en" serving all English speaking countries. Similarly, it does not make sense to do a French translation for France and another for Switzerland as the language is the same (French). Doing a Swiss version of a German web-site could make sense because Swiss-German is considerably different than the German spoken in Germany. Actually, the Swiss-German dialects spoken in different regions of Switzerland have notable differences as well. But given that Swiss-German is *not* a written language, a web-site for Swiss-German customers would be written in German. As far as I can tell, for most practical purposes, it's the language that matters, not the country and (almost) certainly not the region.
For translation purposes, "en_UK" and "en_US" can just map to "en". This does not mean that en_UK and en_US should be reduced to "en", because the respective currency or date format conventions are likely to be different. In CAL10N the emphasis is on translations.
Translating applications to different languages is an expensive and cumbersome process. Hopefully, CAL10N can alleviate some of the pain with a more streamlined process. My next goal is to eliminate the native2ascii converter from the translation process.
-- Ceki Gülcü Logback: The reliable, generic, fast and flexible logging framework for Java. http://logback.qos.ch

2009/9/3 Ceki Gulcu <ceki@qos.ch>
<snipped> I am actually considering removing the parent-child relationship between language and language+country. It just does not make sense to do a different translation for en_UK and en_US. You would do one translation for English "en" serving all English speaking countries. Similarly, it does not make sense to do a French translation for France and another for Switzerland as the language is the same (French). Doing a Swiss version of a German web-site could make sense because Swiss-German is considerably different than the German spoken in Germany. Actually, the Swiss-German spoken in different regions of Switzerland have notable differences as well. But given that Swiss-German is *not* a written language, a web-site for Swiss-German customers would be written in German. As far as I can tell, for most practical purposes, it's the language that matters, not the country and (almost) certainly not the region.
For translation purposes, "en_UK" and "en_US" can just map to "en". This does not mean that en_UK and en_US should be reduced to "en" because the respective currency or date format conventions are likely to be different. In CAL10N the emphasis is on translations.
... Translating applications to different languages is an expensive and cumbersome process.
I understand and agree with your emphasis on translation, and the difference between translation and other localisation issues. But I don't agree that this leads us away from Java's Locale as the basis for Cal10n. I like things to be kept simple - however, on the principle of least surprise, maintaining the existing well-known model keeps Cal10n more approachable to potential users. In the case of English, UK and US share almost all locale settings except for date format, but the grammar and spellings are a little different so the translation is a little different.
It is clear that organisations *do* feel they need to support finer-grained translation than just "English" or "French". For example, Mozilla and Open Office are both available with en-UK language packs as well as en-US. Cal10n *must not* cut off such usage, IMHO.
Therefore I recommend you keep the parent-child relationship between language and language+country. As I said before, I think it would be wise to retain the region level as well, although I have a less-strong view on that (I can't remember ever seeing a serious use of it).
Hopefully, CAL10N can alleviate some of the pain with a more streamlined process. My next goal is to eliminate the native2ascii converter from the translation process.
Wahey!!! One UTF-8 to rule them all! (Or maybe UTF-8 or UTF-16 using the Unicode BOM to indicate which, with UTF-8 being the default.) Native2ascii was one of Sun's early short-sighted mistakes.
I hope you can appreciate my well-intended frank views. :)
Rick
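
For reference, the parent-child relationship debated above is the fallback chain that the JDK itself applies when it resolves a resource bundle. The following is a minimal sketch, assuming Java 6 or later (where ResourceBundle.Control is available). The class name and the base name "colors" are placeholders chosen for illustration, and the sketch shows the JDK's default behaviour, not CAL10N's lookup. It uses Locale.UK, i.e. en_GB, because GB is the ISO 3166 country code for the United Kingdom, whereas the thread writes en_UK.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

// Illustrative only: shows the JDK's default fallback chain, not CAL10N's lookup.
class CandidateLocalesDemo {

    // Returns the locales the JDK would try for a bundle, most specific first.
    static List<Locale> candidates(Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        // "colors" is a placeholder base name; it must be non-null.
        return control.getCandidateLocales("colors", locale);
    }

    public static void main(String[] args) {
        for (Locale l : candidates(Locale.UK)) {
            // Locale.ROOT has an empty string representation.
            System.out.println(l.toString().isEmpty() ? "(root)" : l.toString());
        }
    }
}
```

With this chain, a key missing from the language+country bundle is looked up in the language bundle and then in the root bundle. Removing the parent-child relationship would amount to keeping only the first entry.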

On Sep 3, 2009, at 6:24 AM, Rick Beton wrote:
It is clear that organisations do feel they need to support finer-grained translation than just "English" or "French". For example, Mozilla and Open Office are both available with en-UK language packs as well as en-US. Cal10n must not cut off such usage, IMHO.
LOL. You must be in the UK! (See my prior message on spelling differences).

Rick Beton wrote:
It is clear that organisations /do/ feel they need to support finer-grained translation than just "English" or "French". For example, Mozilla and Open Office are both available with en-UK language packs as well as en-US. Cal10n /must not/ cut off such usage, IMHO.
Point well taken.
Therefore I recommend you keep the parent-child relationship between language and language+country. As I said before, I think it would be wise to retain the region level as well, although I have a less-strong view on that (I can't remember ever seeing a serious use of it).
OK.
Hopefully, CAL10N can alleviate some of the pain with a more streamlined process. My next goal is to eliminate the native2ascii converter from the translation process.
Wahey!!! One UTF-8 to rule them all! (Or maybe UTF-8 or UTF-16 using the Unicode BOM to indicate which, with UTF-8 being the default.) Native2ascii was one of Sun's early short-sighted mistakes.
The user will be able to specify the encoding used for the property file for a given locale. This means that you will be able to encode the property file for the "en_US" locale in the US-ASCII charset, the file for the "fr" locale in ISO-8859-1, the file for "el" (Greek) in ISO-8859-7, the file for "he" (Hebrew) in ISO-8859-8, and the "ja" file in whatever is appropriate for Japanese. The @LocaleNames annotation had to be expanded into something more verbose to cope with this need. So, you'd say:
    @BaseName("colors")
    @LocaleData({
      @Locale("en_UK"),
      @Locale(value="fr", charset="ISO_8859-1"),
      @Locale(value="el", charset="ISO_8859-7")
    })
    public enum Colors {
      RED, BLUE, GREEN;
    }
You could also opt for "UTF-8" for everything...
I hope you can appreciate my well-intended frank views. :)
Indeed I do. Thank you.
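
The idea of reading a property file in a charset chosen per locale can be sketched with the plain JDK. This is an assumption-laden illustration, not CAL10N's loading code: the class and method names are invented, and it relies on Properties.load(Reader), which was added in Java 6. The sample value is the Greek word for "red", written with Unicode escapes so that the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Properties;

// Illustrative sketch only; CAL10N's own loading code may differ.
class CharsetPropertiesDemo {

    // Decodes the stream with the named charset, then parses key=value pairs.
    static Properties load(InputStream in, String charsetName) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            props.load(reader); // the Reader overload exists since Java 6
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String red = "\u03ba\u03cc\u03ba\u03ba\u03b9\u03bd\u03bf";
        // Simulate a properties file saved by a translator in ISO-8859-7.
        byte[] file = ("RED=" + red + "\n").getBytes(Charset.forName("ISO-8859-7"));
        Properties p = load(new ByteArrayInputStream(file), "ISO-8859-7");
        System.out.println(red.equals(p.getProperty("RED")));
    }
}
```

No native2ascii step is involved: the bytes stay in the translator's charset on disk, and the charset name supplied by the developer drives the decoding.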

On Sep 3, 2009, at 4:33 AM, Ceki Gulcu wrote:
Hi Rick,
Thank you for your comments.
As you stated, the simplification is driven by verification purposes. Partially incomplete translations will fail only if keys are missing; otherwise, verification for such translations will succeed.
As for region support, it could easily be added. If someone comes with a concrete use case, the issue can be rectified. I just don't see anyone doing a translation by region. In Switzerland, for example, we have 4 national languages, German, French, Italian and Romansh, which are spoken in different linguistic regions. Romansh is spoken by a very small minority. Bundles for each of the 3 major regions can be specified as de_CH, fr_CH, it_CH (ignoring Romansh).
Nonsense. English in Louisiana is very different than in California.
I am actually considering removing the parent-child relationship between language and language+country. It just does not make sense to do a different translation for en_UK and en_US.
That is even more nonsensical. US English vs UK English vs Australian English are not even close. The same is true for Spanish in Spain vs Mexico. The folks who thought up locales put this stuff in for good reason. You're free to not support it but then this project won't have much of a following.
You would do one translation for English "en" serving all English speaking countries. Similarly, it does not make sense to do a French translation for France and another for Switzerland as the language is the same (French). Doing a Swiss version of a German web-site could make sense because Swiss-German is considerably different than the German spoken in Germany. Actually, the Swiss-German spoken in different regions of Switzerland have notable differences as well. But given that Swiss-German is *not* a written language, a web-site for Swiss-German customers would be written in German. As far as I can tell, for most practical purposes, it's the language that matters, not the country and (almost) certainly not the region.
For translation purposes, "en_UK" and "en_US" can just map to "en". This does not mean that en_UK and en_US should be reduced to "en" because the respective currency or date format conventions are likely to be different. In CAL10N the emphasis is on translations.
There are lots of words that are spelled differently. For example, it is Organization in the U.S and Organisation in the UK. (See http://en.wikipedia.org/wiki/American_and_British_English_spelling_differenc... to be thoroughly confused).
Translating applications to different languages is an expensive and cumbersome process. Hopefully, CAL10N can alleviate some of the pain with a more streamlined process. My next goal is to eliminate the native2ascii converter from the translation process.
This is why you should be using XML instead of property files. XML allows the encoding to be specified in the document. Ironically, Java property files can be specified in XML, but I don't believe resource bundles support them.

On Sep 3, 2009, at 3:57 AM, Rick Beton wrote:
Hi,
Referring to the simplified lookup (http://cal10n.qos.ch/manual.html#simplifiedLookup ), I think this has been simplified too far.
Firstly, I think your simplification by removing the default properties file is justified because it makes it easier to test the robustness of the application, which is a Good Thing. The decision may be controversial because it means that translations have to be done to completion or not at all; partially complete translations will fail. But it is consistent with Cal10n's stated objective of supporting verifiably self-consistent resource bundles.
However, there is another issue that I am unhappy about. Java's Locale allows language, country and region to be specified. You have removed the region (or not documented it). Although the region is not often used, I think it is necessary and should be supported.
Actually, a locale supports language, country, region and variant. If an I18n component doesn't support all of those it isn't a viable solution. Ralph

2009/9/3 Ralph Goers <ralph.goers@dslextreme.com>
Actually, a locale supports language, country, region and variant. If an I18n component doesn't support all of those it isn't a viable solution.
Yes, I made a mistake - see http://java.sun.com/javase/6/docs/api/java/util/Locale.html It's actually language, country, variant - there is no region component. The documentation for Locale cites target machine type as a possible use-case for the variant part.
Rick
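
The three components named in this correction can be read back from a java.util.Locale directly. A small sketch follows; the class name and the "WIN" variant are illustrative assumptions (a machine-type style variant of the kind the Locale documentation mentions). It uses the three-argument constructor that existed at the time of this thread; recent JDKs deprecate the Locale constructors in favour of factory methods, but the constructor still works.

```java
import java.util.Locale;

// Illustrative only: shows that a Locale carries language, country and variant.
class LocaleParts {

    // Formats the three components of a locale; absent parts are empty strings.
    static String describe(Locale l) {
        return "language=" + l.getLanguage()
             + ", country=" + l.getCountry()
             + ", variant=" + l.getVariant();
    }

    public static void main(String[] args) {
        // "WIN" is a made-up variant used purely as an example.
        System.out.println(describe(new Locale("en", "US", "WIN")));
        // A language+country locale simply has an empty variant.
        System.out.println(describe(new Locale("fr", "CH")));
    }
}
```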

Rick Beton wrote:
Yes, I made a mistake - see http://java.sun.com/javase/6/docs/api/java/util/Locale.html It's actually language, country, variant - there is no region component. The documentation for Locale cites target machine type as a possible use-case for the variant part.
Let's reconsider this issue when it presents itself in the field, OK?

On Sep 3, 2009, at 6:37 AM, Rick Beton wrote:
2009/9/3 Ralph Goers <ralph.goers@dslextreme.com>
Actually, a locale supports language, country, region and variant. If an I18n component doesn't support all of those it isn't a viable solution.
Yes, I made a mistake - see http://java.sun.com/javase/6/docs/api/java/util/Locale.html It's actually language, country, variant - there is no region component. The documentation for Locale cites target machine type as a possible use-case for the variant part.
Thanks, I should have realized (another difference in spelling) that you meant variant when you said region. Ralph

Hello all,
CAL10N just added the functionality to load resource bundles in the encoding chosen by the user. Say bye bye to native2ascii.
In order to specify the encoding for the resource bundle corresponding to a given locale, the @LocaleNames annotation had to be changed. Previously, one wrote:
    @BaseName("colors")
    @LocaleNames({"en_UK", "fr", "tr_TR", "el_GR"})
    public enum Colors {
      BLUE, RED, YELLOW;
    }
To express the same information, now (in 0.7-SNAPSHOT) one has to write:
    @BaseName("colors")
    @LocaleData({
      @Locale("en_UK"),
      @Locale("fr_FR"),
      @Locale("tr_TR"),
      @Locale("el_GR")
    })
    public enum Colors {
      BLUE, RED, GREEN;
    }
As you can see, the @LocaleData annotation includes multiple @Locale annotations. This is more verbose than what we had previously but allows us to write:
    @BaseName("colors")
    @LocaleData(
      defaultCharset="UTF8",
      value = {
        @Locale("en_UK"),
        @Locale("fr_FR"),
        @Locale(value="tr_TR", charset="ISO8859_3"),
        @Locale(value="el_GR", charset="ISO8859_7")
      }
    )
    public enum Colors {
      BLUE, RED, GREEN;
    }
It would have been preferable to write
    @BaseName("colors")
    @Locale("en_UK")
    @Locale(value="fr_FR", charset="ISO8859_1") // compiler error
    @Locale(value="tr_TR", charset="ISO8859_3") // compiler error
    @Locale(value="el_GR", charset="ISO8859_7") // compiler error
    public enum Fruit {
      APPLE, ORANGE;
    }
but the compiler forbids multiple instances of the same annotation.
Can you think of a more elegant approach which still allows the user to designate the charset for a given locale?
Cheers,
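
At the time of this message the compiler did reject repeated annotations, as stated. Since Java 8, an annotation type can be marked @Repeatable, which permits the flat form shown in the last example. The sketch below assumes Java 8 or later and uses hypothetical annotation names (LocaleSpec and LocaleSpecs) rather than CAL10N's own; it says nothing about whether CAL10N adopted this approach.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation, NOT part of CAL10N: one locale plus an optional charset.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@Repeatable(LocaleSpecs.class)
@interface LocaleSpec {
    String value();
    String charset() default "UTF-8";
}

// Container annotation; the compiler wraps repeated @LocaleSpec entries in it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface LocaleSpecs {
    LocaleSpec[] value();
}

@LocaleSpec("en_UK")
@LocaleSpec(value = "fr_FR", charset = "ISO8859_1")
@LocaleSpec(value = "tr_TR", charset = "ISO8859_3")
@LocaleSpec(value = "el_GR", charset = "ISO8859_7")
enum Fruit { APPLE, ORANGE }

class RepeatableDemo {

    // Lists each declared locale with its charset, in declaration order.
    static String describe(Class<?> type) {
        StringBuilder sb = new StringBuilder();
        for (LocaleSpec spec : type.getAnnotationsByType(LocaleSpec.class)) {
            sb.append(spec.value()).append('=').append(spec.charset()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Fruit.class));
    }
}
```

The container annotation still exists, so this is essentially the @LocaleData/@Locale pair with the nesting hidden by the compiler.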

2009/9/4 Ceki Gulcu <ceki@qos.ch>
<snipped>
Can you think of a more elegant approach which still allows the user to designate the charset for a given locale?
Hmm - it has become rather messy. Can I suggest an alternative tack: require the properties files to be always UTF-8 (or UTF-16 with a BOM). Then the original simple syntax is viable. KISS principle.
As far as producing UTF-8 files is concerned, I imagine a spreadsheet 'compiler' that will take a CSV, ODS or XLS file and extract the necessary separate properties files bundle. The spreadsheet would be a simple single sheet containing the keys in the first column and any number of translation strings in the following columns, each of which has the locale name as its header (e.g. "en_UK").
Rick

If the bundles are produced in different countries, it may be more convenient to allow the producers to use the charset most convenient to them. If, however, a single charset could be imposed, say UTF-8, we could simplify the expression to:
    @LocaleNames(charset="UTF-8", values={"en_UK", "fr", "tr_TR", "el_GR"})
instead of
    @LocaleData(
      defaultCharset="UTF8",
      value = {
        @Locale("en_UK"),
        @Locale("fr_FR"),
        @Locale(value="tr_TR", charset="ISO8859_3"),
        @Locale("el_GR")
      }
    )
However, between an inelegant syntax targeted at developers and imposing a charset on translators, the former seems the lesser weevil.
If I understand you correctly, the spreadsheet would be used to transform a resource bundle from one encoding to another? It would read in one encoding and write in another. Right?

2009/9/4 Ceki Gulcu <ceki@qos.ch>
<snipped> If I understand you correctly, the spreadsheet would be used to transform a resource bundle from one encoding to another? It would read in one encoding and write in another. Right?
No, sorry I didn't explain it more clearly.
The spreadsheet would be 'the' source file for all the languages in a bundle. Each language has its own column. The first column lists the keys so that, reading across each row, the key is followed by one or more language strings, each in a different language and each being a translation of the others.
Using the spreadsheet, the hypothetical compiler would simply extract the keys and in turn each language column would produce a corresponding properties file. The properties file encoding can be whatever you want (so choose UTF-8 or stick with nasty old 'native2ascii' ASCII files). Or output to XML.
IIUC, spreadsheets allow the complexities of character encoding to be hidden from the user. In my limited experience of working with translation services, this is an approach they like.
Rick
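
The spreadsheet 'compiler' described above could be sketched roughly as follows. Everything here is hypothetical: the thread describes no existing tool, the class and method names are invented, and the parser assumes a very simple CSV export (first row is the header, first column holds the keys, no quoted fields and no commas inside cells). A real tool would use a proper CSV or spreadsheet library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical "spreadsheet compiler" sketch; not an existing CAL10N tool.
class SheetToBundles {

    // Splits a simple CSV sheet into one Properties object per locale column.
    static Map<String, Properties> compile(String csv) {
        String[] rows = csv.split("\\r?\\n");
        String[] header = rows[0].split(",", -1); // header[0] titles the key column
        Map<String, Properties> bundles = new LinkedHashMap<>();
        for (int col = 1; col < header.length; col++) {
            bundles.put(header[col].trim(), new Properties());
        }
        for (int r = 1; r < rows.length; r++) {
            if (rows[r].trim().isEmpty()) {
                continue; // skip blank rows
            }
            String[] cells = rows[r].split(",", -1);
            String key = cells[0].trim();
            for (int col = 1; col < header.length && col < cells.length; col++) {
                String text = cells[col].trim();
                if (!text.isEmpty()) { // an empty cell means "not translated yet"
                    bundles.get(header[col].trim()).setProperty(key, text);
                }
            }
        }
        return bundles;
    }

    public static void main(String[] args) {
        String sheet = "key,en_UK,fr\nRED,red,rouge\nBLUE,blue,bleu\n";
        Map<String, Properties> bundles = compile(sheet);
        System.out.println(bundles.keySet());
        System.out.println(bundles.get("fr").getProperty("RED"));
    }
}
```

Each resulting Properties object could then be written out in whatever encoding is wanted, for example with Properties.store(Writer, String) on a UTF-8 writer, or with storeToXML.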

On Sep 4, 2009, at 3:35 AM, Ceki Gulcu wrote:
As you can see, the @LocaleData annotation includes multiple @Locale annotations. This is more verbose than what we had previously but allows us to write:
    @BaseName("colors")
    @LocaleData(
      defaultCharset="UTF8",
      value = {
        @Locale("en_UK"),
        @Locale("fr_FR"),
        @Locale(value="tr_TR", charset="ISO8859_3"),
        @Locale(value="el_GR", charset="ISO8859_7")
      }
    )
    public enum Colors {
      BLUE, RED, GREEN;
    }
It would have been preferable to write
    @BaseName("colors")
    @Locale("en_UK")
    @Locale(value="fr_FR", charset="ISO8859_1") // compiler error
    @Locale(value="tr_TR", charset="ISO8859_3") // compiler error
    @Locale(value="el_GR", charset="ISO8859_7") // compiler error
    public enum Fruit {
      APPLE, ORANGE;
    }
This is fragile as it will break if someone uses a different charset than what is now hardcoded in the software. This information should be obtained from the file itself.
but the compiler forbids multiple instances of the same annotation.
Can you think of a more elegant approach which still allows the user to designate the charset for a given locale?
Yes. At the risk of repeating myself, build your framework using XML files and let the XML parser use the encoding specified in the file. Property files were not designed for this. If you insist on using ResourceBundles then require Java 6 and use XML property files and ResourceBundle.Control as described at http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.Control.html in Example 2.
Ralph
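
The point about XML carrying its own encoding can be illustrated with the JDK's XML property format alone, without a ResourceBundle. This is a minimal sketch with invented class and method names; it uses Properties.storeToXML and Properties.loadFromXML (both available since Java 5) and ISO-8859-1 purely as an example charset. The accented character is written as a Unicode escape so the source file's encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: round-trips properties through the JDK's XML format.
class XmlPropertiesDemo {

    // Writes the properties as XML; the encoding is recorded in the XML declaration.
    static byte[] toXml(Properties props, String encoding) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            props.storeToXML(out, null, encoding);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the XML back; no charset is passed, the parser takes it from the document.
    static Properties fromXml(byte[] xml) {
        try {
            Properties props = new Properties();
            props.loadFromXML(new ByteArrayInputStream(xml));
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties original = new Properties();
        original.setProperty("CAFE", "caf\u00e9");
        Properties copy = fromXml(toXml(original, "ISO-8859-1"));
        System.out.println(original.equals(copy));
    }
}
```

The reader never has to be told the charset, which is the property the annotation-based approach lacks. The trade-off raised elsewhere in the thread, that XML is more verbose than key=value files, still applies.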

Ralph Goers wrote:
This is fragile as it will break if someone uses a different charset than what is now hardcoded in the software. This information should be obtained from the file itself.
True. However, once the encoding for a given translation is established, it won't change frequently. So it should be more or less acceptable to hard code it in the enum. This does not mean that your suggestion to place the encoding information in the file is not a good one. It's not so easy to accomplish however. To correctly interpret the encoding information contained in the file, you have to open it with the correct charset... the chicken or the egg?
but the compiler forbids multiple instances of the same annotation.
Can you think of a more elegant approach which still allows the user to designate the charset for a given locale?
Yes. At the risk of repeating myself, build your framework using XML files and let the XML parser use the encoding specified in the file. Property files were not designed for this. If you insist on using ResourceBundles then require Java 6 and use XML property files and ResourceBundle.Control as described at http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.Control.html in Example 2.
XML files are necessarily more verbose than property files. Also note that a large number of shops do their localization using property files. I will now go read ResourceBundle.Control.

2009/9/4 Ceki Gulcu <ceki@qos.ch>
<snipped> I will now go read ResourceBundle.Control.
I noticed that ResourceBundle.Control is only in JDK1.6 - are you planning for Cal10n to support JDK1.5?
Rick

Hello Rick,
At this time, CAL10N targets JDK 1.5. Ralph's comment about ResourceBundle.Control is valuable and could potentially justify targeting JDK 1.6 in future versions of CAL10N.
Cheers,
-- Ceki Gülcü