questions about "Language" info packets

nut.txt says: | "Language" | ISO 639 and ISO 3166 for language/country code
Does "ISO 639" mean ISO 639-1 or ISO 639-2? Are both codes required or allowed? If yes, in what format?
| something like "eng" (US English)
When using a three-letter code from ISO 639-2, should a nut writer use the bibliographic or the terminology code? Are two-letter codes allowed at all?
| can be 0 if unknown
Does this mean that there is no Language entry, or that it is an empty string, or that it is a string containing a zero byte, or that the string is "0"?
| and "multi" if several languages
ISO 639-2 already has "mul" for multiple languages. Does this mean that both "mul" and "multi" are allowed?
Regards, Clemens

Hi On Tue, Feb 13, 2007 at 12:38:35PM +0100, Clemens Ladisch wrote:
nut.txt says: | "Language" | ISO 639 and ISO 3166 for language/country code
Does "ISO 639" mean ISO 639-1 or ISO 639-2? Are both codes required or allowed? If yes, in what format?
that is a very good question, as the example below is a ISO 639-2 code i think its clear that ISO 639-2 is allowed furthermore there is a link http://www.loc.gov/standards/iso639-2/englangn.html pointing to 639-2 but none to 639-1 so id say 639-1 is not allowed also all 639-1 codes have a code in 639-2 while many 639-2 codes do not have one in 639-1 comments are of course welcome ...
| something like "eng" (US English)
When using a three-letter code from ISO 639-2, should a nut writer use the bibliographic or the terminology code?
that is also a very good question, i think none of us was aware that there are 2 different codes for some languages (that is one based on the native word for the language and one based on the english word) but luckily the majority of the languages has just 1 code
Are two-letter codes allowed at all?
id say no
| can be 0 if unknown
Does this mean that there is no Language entry, or that it is an empty string, or that it is a string containing a zero byte, or that the string is "0"?
hmm ISO 639-2 contains a "und" for undetermined and nothing in our spec forbids its use so iam tempted to say that "und" must/should be used if unknown and applications must treat a empty string like "und"
| and "multi" if several languages
ISO 639-2 already has "mul" for multiple languages. Does this mean that both "mul" and "multi" are allowed?
id handle this like above: "mul" must/should be used if multiple languages but demuxers must treat "multi" like "mul" opinions?, comments? [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB The greatest way to live with honor in this world is to be what we pretend to be. -- Socrates

Michael Niedermayer wrote:
opinions?, comments?
Fine by me. Who is going to update the spec with such clarification? Maybe it is better to produce an addendum/errata lu -- Luca Barbato Gentoo/linux Gentoo/PPC http://dev.gentoo.org/~lu_zero

Hi On Tue, Feb 13, 2007 at 03:38:28PM +0100, Luca Barbato wrote:
Michael Niedermayer wrote:
opinions?, comments?
Fine by me.
Who is going to update the spec with such clarification?
well i guess i will after we agree on what to do exactly
Maybe it is better to produce an addendum/errata
maybe adding a history chapter to the spec would do which contains such non trivial changes? [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB it is not once nor twice but times without number that the same ideas make their appearance in the world. -- Aristotle

Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 12:38:35PM +0100, Clemens Ladisch wrote:
nut.txt says: | "Language" | ISO 639 and ISO 3166 for language/country code
Does "ISO 639" mean ISO 639-1 or ISO 639-2? Are both codes required or allowed? If yes, in what format?
that is a very good question, as the example below is a ISO 639-2 code i think its clear that ISO 639-2 is allowed
furthermore there is a link http://www.loc.gov/standards/iso639-2/englangn.html pointing to 639-2 but none to 639-1 so id say 639-1 is not allowed also all 639-1 codes have a code in 639-2 while many 639-2 codes do not have one in 639-1 comments are of course welcome ...
| something like "eng" (US English)
When using a three-letter code from ISO 639-2, should a nut writer use the bibliographic or the terminology code?
that is also a very good question, i think none of us was aware that there are 2 different codes for some languages (that is one based on the native word for the language and one based on the english word) but luckily the majority of the languages has just 1 code
And we Germans are out of luck and cannot use nut? ;-) If the language code were just used as a code, it wouldn't matter which one is to be used, but there are certain players that just display the raw code instead of converting it to a language name, so I think it makes sense to let the encoder choose which one to use.
Are two-letter codes allowed at all?
id say no
So ISO 3166 is out, too?
| can be 0 if unknown
Does this mean that there is no Language entry, or that it is an empty string, or that it is a string containing a zero byte, or that the string is "0"?
hmm ISO 639-2 contains a "und" for undetermined and nothing in our spec forbids its use so iam tempted to say that "und" must/should be used if unknown and applications must treat a empty string like "und"
| and "multi" if several languages
ISO 639-2 already has "mul" for multiple languages. Does this mean that both "mul" and "multi" are allowed?
id handle this like above: "mul" must/should be used if multiple languages but demuxers must treat "multi" like "mul"
OK. Proposed new description:
"Language" An ISO 639-2 (three-letter) language code, e.g. "eng" for English (see <http://www.loc.gov/standards/iso639-2/php/code_list.php>). All codes defined in ISO 639-2 are allowed, including "und" (Undetermined), "mul" (Multiple languages) and the bibliographic/terminology variants. For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Regards, Clemens
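A minimal C sketch of the mapping this proposal asks demuxers to perform, for illustration only. The function name `nut_normalize_language` is hypothetical and does not come from nut.txt or from any NUT implementation; it only shows that the rule amounts to two special cases and a pass-through.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of any NUT library: maps the two legacy
 * spellings from the old spec text onto their ISO 639-2 equivalents and
 * returns every other value unchanged. */
static const char *nut_normalize_language(const char *lang)
{
    assert(lang != NULL);
    if (lang[0] == '\0')
        return "und";           /* empty string: undetermined */
    if (strcmp(lang, "multi") == 0)
        return "mul";           /* old spelling for multiple languages */
    return lang;                /* anything else passes through as-is */
}
```

A demuxer following the proposal would call this once on each "Language" value it reads. Note that the function does not check whether the value is an assigned ISO 639-2 code.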

On Tue, Feb 13, 2007 at 04:26:11PM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 12:38:35PM +0100, Clemens Ladisch wrote:
nut.txt says: | "Language" | ISO 639 and ISO 3166 for language/country code
Does "ISO 639" mean ISO 639-1 or ISO 639-2? Are both codes required or allowed? If yes, in what format?
that is a very good question, as the example below is a ISO 639-2 code i think its clear that ISO 639-2 is allowed
furthermore there is a link http://www.loc.gov/standards/iso639-2/englangn.html pointing to 639-2 but none to 639-1 so id say 639-1 is not allowed also all 639-1 codes have a code in 639-2 while many 639-2 codes do not have one in 639-1 comments are of course welcome ...
| something like "eng" (US English)
When using a three-letter code from ISO 639-2, should a nut writer use the bibliographic or the terminology code?
that is also a very good question, i think none of us was aware that there are 2 different codes for some languages (that is one based on the native word for the language and one based on the english word) but luckily the majority of the languages has just 1 code
And we Germans are out of luck and cannot use nut? ;-)
Huh??
If the language code were just used as a code, it wouldn't matter which one is to be used, but there are certain players that just display the raw code instead of converting it to a language name, so I think it makes sense to let the encoder choose which one to use.
If this isn't acceptable to the user then the user should choose a player with more "user friendly" display. Existing legacy devices won't play nut files anyway so it's something of a non-issue.
Are two-letter codes allowed at all?
id say no
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
OK. Proposed new description:
"Language" An ISO 639-2 (three-letter) language code, e.g. "eng" for English (see <http://www.loc.gov/standards/iso639-2/php/code_list.php>). All codes defined in ISO 639-2 are allowed, including "und" (Undetermined), "mul" (Multiple languages) and the bibliographic/ terminology variants. For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Historical reasons?? There are no such files, and this is a draft (albeit frozen) spec. I don't see any way that translating "multi" to "mul" and "" to "und" would improve functionality over just treating them as an unexpected value. If there's cruft in the spec that can be removed without really hurting anything, I'd like to remove it. Rich

Hi On Tue, Feb 13, 2007 at 03:36:34PM -0500, Rich Felker wrote: [...]
Are two-letter codes allowed at all?
id say no
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
ISO 3166 is about country codes and about half of the codes from the 2 letter codespace seem to be used or reserved in some way ...
OK. Proposed new description:
"Language" An ISO 639-2 (three-letter) language code, e.g. "eng" for English (see <http://www.loc.gov/standards/iso639-2/php/code_list.php>). All codes defined in ISO 639-2 are allowed, including "und" (Undetermined), "mul" (Multiple languages) and the bibliographic/ terminology variants. For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Historical reasons?? There are no such files, and this is a draft (albeit frozen) spec. I don't see any way that translating "multi" to "mul" and "" to "und" would improve functionality over just treating them as an unexpected value. If there's cruft in the spec that can be removed without really hurting anything, I'd like to remove it.
well then lets add a "a muxer MUST ignore unknown language and country codes instead of treating them as an error" [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws. -- Plato

On Tue, Feb 13, 2007 at 10:44:57PM +0100, Michael Niedermayer wrote:
OK. Proposed new description:
"Language" An ISO 639-2 (three-letter) language code, e.g. "eng" for English (see <http://www.loc.gov/standards/iso639-2/php/code_list.php>). All codes defined in ISO 639-2 are allowed, including "und" (Undetermined), "mul" (Multiple languages) and the bibliographic/ terminology variants. For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Historical reasons?? There are no such files, and this is a draft (albeit frozen) spec. I don't see any way that translating "multi" to "mul" and "" to "und" would improve functionality over just treating them as an unexpected value. If there's cruft in the spec that can be removed without really hurting anything, I'd like to remove it.
well then lets add a "a muxer MUST ignore unknown language and country codes instead of treating them as an error"
Certainly. It's almost essential from a practical standpoint anyway, since (I suppose... am I wrong?) language codes could be added to 639-2 after your implementation was released, making your implementation suddenly become non-compliant if you rejected them. Anyway from a usability standpoint, I think the important feature is that a piece of software, when searching for a given (known) language, is able to find such a stream if one exists. This doesn't require any semantic interpretation of the codes, just an agreement on which codes will be used. Rich
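The point about matching without semantic interpretation can be sketched in C as a plain string comparison. This is an illustration under assumptions: `find_stream_by_language` and the array-of-strings layout are hypothetical and do not reflect how any real demuxer stores its info packets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first stream whose
 * "Language" value equals `wanted`, or -1 if there is none. Codes are
 * compared as opaque strings, so a code the player has never heard of
 * simply fails to match instead of causing an error. */
static int find_stream_by_language(const char *const *langs, size_t count,
                                   const char *wanted)
{
    assert(langs != NULL && wanted != NULL);
    for (size_t i = 0; i < count; i++) {
        /* A NULL entry stands for a stream without a Language value. */
        if (langs[i] != NULL && strcmp(langs[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

Because nothing here consults a table of known codes, a code added to ISO 639-2 after the software was released would still be found when searched for by its exact string. The sketch also shows why muxers need to agree on one spelling: a search for "deu" will not find a stream tagged "ger".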

Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 03:36:34PM -0500, Rich Felker wrote: [...]
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
ISO 3166 is about country codes and about half of the codes from the 2 letter codespace seem to be used or reserved in some way ...
But should three-letter country codes be allowed? In that case, how should the entire language string be formatted? Something like "lll-ccc" where both "lll" and "-ccc" are optional?
For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Historical reasons?? There are no such files, and this is a draft (albeit frozen) spec.
Well, I interpreted "frozen" to mean that no incompatible changes could be made at all ...
I don't see any way that translating "multi" to "mul" and "" to "und" would improve functionality over just treating them as an unexpected value. If there's cruft in the spec that can be removed without really hurting anything, I'd like to remove it.
well then lets add a "a muxer MUST ignore unknown language and country codes instead of treating them as an error"
Agreed. Regards, Clemens

Hi On Wed, Feb 14, 2007 at 09:32:48AM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 03:36:34PM -0500, Rich Felker wrote: [...]
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
ISO 3166 is about country codes and about half of the codes from the 2 letter codespace seem to be used or reserved in some way ...
But should three-letter country codes be allowed?
i dont know, but i would prefer if either 2 or 3 letter codes would be used but not both ...
In that case, how should the entire language string be formatted? Something like "lll-ccc" where both "lll" and "-ccc" are optional?
that is a good question, i also thought of lll-ccc when i was working on that part of the spec; seems it was never explicitly written in the spec though :( "-ccc" though seems invalid to me, it rather should be "und-ccc" or "mul-ccc"
For historical reasons, demuxers MUST treat "multi" like "mul" and "" (the empty string) like "und".
Historical reasons?? There are no such files, and this is a draft (albeit frozen) spec.
Well, I interpreted "frozen" to mean that no incompatible changes could be made at all ...
well, "" will be interpreted as unknown anyway based on the ignore-unknown-strings rule. "multi" would be incorrectly interpreted as unknown, but there is AFAIK no muxer which generates such files (if anyone knows of a muxer which does, please say so!), and the other way around (old demuxer, new muxer) is no problem, as mul/und are part of the language codes, so they theoretically must be supported. i certainly dont like changing a frozen spec, but in this case it really seems like the simplest, as there shouldnt be any adverse effects. we should of course add a note in the spec about "multi"/"" but not make their support mandatory IMHO. also note, if anyone thinks this is unprofessional: ISO completely changed the motion compensation specification in MPEG4 after the spec was a full international standard IIRC ... they changed it to what their reference sw did because many seem to have based their codecs on the reference sw instead of the written spec ...
I don't see any way that translating "multi" to "mul" and "" to "und" would improve functionality over just treating them as an unexpected value. If there's cruft in the spec that can be removed without really hurting anything, I'd like to remove it.
well then lets add a "a muxer MUST ignore unknown language and country codes instead of treating them as an error"
Agreed.
added [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB It is dangerous to be right in matters on which the established authorities are wrong. -- Voltaire

Michael Niedermayer wrote:
On Wed, Feb 14, 2007 at 09:32:48AM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 03:36:34PM -0500, Rich Felker wrote: [...]
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
ISO 3166 is about country codes and about half of the codes from the 2 letter codespace seem to be used or reserved in some way ...
But should three-letter country codes be allowed?
i dont know, but i would prefer if either 2 or 3 letter codes would be used but not both ...
Each 3-letter code has a corresponding 2-letter code, and I just noticed that ISO doesn't publish the 3-letter codes for free, so I think country codes should use the 2-letter form.
In that case, how should the entire language string be formatted? Something like "lll-ccc" where both "lll" and "-ccc" are optional?
that is a good question, i also thought of lll-ccc when i was working on that part of the spec; seems it was never explicitly written in the spec though :( "-ccc" though seems invalid to me, it rather should be "und-ccc" or "mul-ccc"
New proposal:
"Language" An ISO 639-2 (three-letter) language code, optionally followed by an ISO 3166-1 two-letter country code that is separated from the language code by a hyphen. All codes defined in ISO 639-2 are allowed, including "und" (Undetermined), "mul" (Multiple languages) and the bibliographic/terminology variants.
see http://www.loc.gov/standards/iso639-2/php/code_list.php and http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list...
a demuxer MUST ignore unknown language and country codes instead of treating them as an error
Regards, Clemens
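The "lll" or "lll-CC" format in this proposal could be parsed roughly as follows. This is a sketch under stated assumptions, not the reference behaviour: `struct nut_language`, `all_alpha` and `nut_parse_language` are hypothetical names, and falling back to "und" for a malformed string is one possible reading of "ignore instead of treating them as an error", not something the proposal spells out.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct nut_language {
    char lang[4];     /* three-letter ISO 639-2 code, NUL-terminated */
    char country[3];  /* two-letter ISO 3166-1 code, or "" if absent */
};

/* Returns 1 if the first n bytes of s are all letters. The loop stops at
 * a NUL byte, because NUL is not a letter, so it never reads past the
 * end of a shorter string. */
static int all_alpha(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!isalpha((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Parses "lll" or "lll-CC". Returns 1 on success. Returns 0 and stores
 * "und" with no country if the string has any other shape (assumption:
 * the caller then carries on instead of failing). Only the shape is
 * checked; whether the codes are actually assigned is not. */
static int nut_parse_language(const char *s, struct nut_language *out)
{
    assert(s != NULL && out != NULL);
    strcpy(out->lang, "und");
    out->country[0] = '\0';

    size_t len = strlen(s);
    if (len != 3 && len != 6)
        return 0;
    if (!all_alpha(s, 3))
        return 0;
    if (len == 6 && (s[3] != '-' || !all_alpha(s + 4, 2)))
        return 0;

    memcpy(out->lang, s, 3);
    out->lang[3] = '\0';
    if (len == 6) {
        memcpy(out->country, s + 4, 2);
        out->country[2] = '\0';
    }
    return 1;
}
```

With this shape check, "deu-AT" and "eng" parse, while a bare "-AT" is rejected, which matches the view earlier in the thread that a country code without a language code is invalid.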

Hi On Wed, Feb 14, 2007 at 04:27:55PM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Wed, Feb 14, 2007 at 09:32:48AM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 03:36:34PM -0500, Rich Felker wrote: [...]
So ISO 3166 is out, too?
I'm against 2-letter codes. The number of languages is way too large for these codes to be remotely sufficient.
ISO 3166 is about country codes and about half of the codes from the 2 letter codespace seem to be used or reserved in some way ...
But should three-letter country codes be allowed?
i dont know, but i would prefer if either 2 or 3 letter codes would be used but not both ...
Each 3-letter code has a corresponding 2-letter code, and I just noticed that ISO doesn't publish the 3-letter codes for free, so I think country codes should use the 2-letter form.
iam not arguing against that but http://www.davros.org/misc/iso3166.html amongst others lists the 3 letter codes [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB The worst form of inequality is to try to make unequal things equal. -- Aristotle

Hi On Tue, Feb 13, 2007 at 04:26:11PM +0100, Clemens Ladisch wrote:
Michael Niedermayer wrote:
On Tue, Feb 13, 2007 at 12:38:35PM +0100, Clemens Ladisch wrote:
nut.txt says: | "Language" | ISO 639 and ISO 3166 for language/country code
Does "ISO 639" mean ISO 639-1 or ISO 639-2? Are both codes required or allowed? If yes, in what format?
that is a very good question, as the example below is a ISO 639-2 code i think its clear that ISO 639-2 is allowed
furthermore there is a link http://www.loc.gov/standards/iso639-2/englangn.html pointing to 639-2 but none to 639-1 so id say 639-1 is not allowed also all 639-1 codes have a code in 639-2 while many 639-2 codes do not have one in 639-1 comments are of course welcome ...
| something like "eng" (US English)
When using a three-letter code from ISO 639-2, should a nut writer use the bibliographic or the terminology code?
that is also a very good question, i think none of us was aware that there are 2 different codes for some languages (that is one based on the native word for the language and one based on the english word) but luckily the majority of the languages has just 1 code
And we Germans are out of luck and cannot use nut? ;-)
If the language code were just used as a code, it wouldn't matter which one is to be used, but there are certain players that just display the raw code instead of converting it to a language name, so I think it makes sense to let the encoder choose which one to use.
hmm i understand both "deu" and "ger" equally good/bad
Are two-letter codes allowed at all?
id say no
So ISO 3166 is out, too?
ISO 3166 has 2 and 3 letter codes too but i wasnt speaking about that ... [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB No snowflake in an avalanche ever feels responsible. -- Voltaire
participants (4): Clemens Ladisch, Luca Barbato, Michael Niedermayer, Rich Felker