[MPlayer-dev-eng] [PATCH] ftp charset selection support

Sat Apr 19 21:18:42 CEST 2008

Hi,

On Sunday 13 April 2008 17:12, Alban Bedel wrote:
[...]
> Yeah, the encoding used for slave commands should be
> configurable. The problem is the default, I dunno if it would be
> too wise to have different defaults depending on the fd type. It
> might lead to quite some confusion.

I dunno either. Having both may be confusing; having just a
single one may be inconvenient.

> > > * playlist: some format/transport protocol might specify it,
> > >             fallback on local fs encoding?
> >
> > Perhaps in case of transport protocol it will be reasonable to
> > use server's encoding as fallback, and for local playlist use
> > local encoding.
>
> That's what I was thinking, although again we probably want to
> have the "local encoding" configurable.

Ok, I think this would be easy to implement.

> > Also, ENCA seems to be a good choice as fallback for this and
> > previous issue. Currently it is used in subtitle routines and
> > saves me a lot of time. Only Russian subtitles I own are
> > present in 5 different encodings and it would be a pain to
> > deal with them if not for ENCA. Of course this approach is not
> > ideal and may fail, but this is better than nothing and seems
> > to work perfectly in most cases.
>
> The problem is that most paths are relatively short, which makes
> guessing a lot harder.

Well, I ran a small test: I created some files containing a single
word each, converted them to different encodings and had ENCA
guess the encoding of those files. 2/3 of the tests passed, which
is still better than nothing.

Moreover, there is another trick: the listing of the root
directory may be fetched (if the server allows listing
directories, of course) and ENCA may be run against this listing.
However, it is possible that the root directory listing doesn't
contain enough characters, or contains English characters only,
so this procedure would have to be done recursively; as a
drawback, this may lead to a perceptible lag.
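
The "enough characters" check could be as simple as counting the
bytes with the high bit set, since pure ASCII names tell a detector
nothing about the legacy charset. A minimal sketch, not taken from
the patch: the function names and the threshold of 32 bytes are
made up for illustration.

```c
#include <stddef.h>

/* Hypothetical threshold: how many non-ASCII bytes we want to see
 * before trusting a charset guess. The value is an assumption. */
#define MIN_SAMPLE_BYTES 32

/* Count the bytes outside the 7-bit ASCII range in a listing. */
static size_t count_non_ascii(const char *listing, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if ((unsigned char)listing[i] >= 0x80)
            n++;
    return n;
}

/* Returns 1 if the listing carries enough non-ASCII bytes to be
 * worth passing to a detector, 0 if more directories are needed. */
static int sample_is_sufficient(const char *listing, size_t len)
{
    return count_non_ascii(listing, len) >= MIN_SAMPLE_BYTES;
}
```

The recursion would stop as soon as sample_is_sufficient() returns
1, which bounds the lag on servers with many non-ASCII names.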

Furthermore, this kind of autodetection may lead to playback of
the wrong file. Some encodings share many of the same characters,
but at different codes. For example, it is possible to create a
filename in the cp1251 charset, convert it into koi8r, and then
write the result out again as if it were cp1251 -- hence we will
have two files, both of which may be recoded into the same
filename. However, the probability of this case is very low.
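
The underlying cause is that the same bytes are valid text in both
charsets, so a byte-level check cannot tell them apart. A small
sketch with plain iconv (not MPlayer code; the helper names are
made up, and it assumes the system iconv knows "CP1251" and
"KOI8-R"):

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Decode the NUL-terminated string `in` from `charset` to UTF-8.
 * `out` receives a NUL-terminated result.
 * Returns 0 on success, -1 on any failure. */
static int decode_to_utf8(const char *charset, const char *in,
                          char *out, size_t outsz)
{
    iconv_t cd;
    char *ip = (char *)in;
    char *op = out;
    size_t il = strlen(in);
    size_t ol, r;

    if (outsz == 0)
        return -1;
    ol = outsz - 1;                 /* keep room for the NUL */
    cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return -1;
    r = iconv(cd, &ip, &il, &op, &ol);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;
    *op = '\0';
    return 0;
}

/* Returns 1 if `name` decodes cleanly as both cp1251 and koi8r
 * and the two results differ, i.e. the guess really matters. */
static int is_ambiguous(const char *name)
{
    char a[256], b[256];

    return decode_to_utf8("CP1251", name, a, sizeof(a)) == 0 &&
           decode_to_utf8("KOI8-R", name, b, sizeof(b)) == 0 &&
           strcmp(a, b) != 0;
}
```

For the bytes F2 E5 F1 F2 both decodings succeed: as cp1251 they
are four lowercase Cyrillic letters, as koi8r four different
uppercase ones.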

> > > but AFAIK back and forth conversion is not reliable with all
> > > encodings.
> >
> > At least it seems to be reliable for cyrillic and CJK
> > charsets, I can't tell about the rest.
>
> AFAIK CJK are the problem, because with the Han unification
> different chars in the legacy encodings give the same unicode
> char.

Hmm, I'm surprised by this. I have worked only with Japanese
charsets, so I haven't encountered this problem. What is more
interesting is why Unicode doesn't handle these chars as separate
characters. AFAIK UTF-8 is supposed to be able to represent any
written character on Earth by design.

> > And why not introduce new command line option? Inventing new
> > URL schemes seems not good from my standpoint, but this may be
> > implemented optionally. And do not forget about ENCA. We can
> > use the same approach as in subcp:
> > enca[:fallback language[:fallback charset]].
>
> That's what I meant. To be more precise I thought about adding
> an extra parameter to open_stream() to indicate the filename
> encoding.

Seconded.
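
For reference, splitting an option value of the quoted
enca[:fallback language[:fallback charset]] form could look roughly
like this. It is only a sketch: the struct, the function name and
the field sizes are hypothetical, and it is an assumption that the
new option would accept the same syntax as -subcp.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for a parsed charset option. */
struct charset_spec {
    int  use_enca;      /* 1 if autodetection was requested */
    char language[16];  /* ENCA language hint, may be empty */
    char charset[32];   /* explicit or fallback charset, may be empty */
};

/* Parse "enca[:language[:charset]]" or a plain charset name.
 * Returns 0 on success, -1 if a field is empty or too long. */
static int parse_charset_spec(const char *arg, struct charset_spec *out)
{
    const char *p, *q;
    size_t n;

    memset(out, 0, sizeof(*out));
    if (strncmp(arg, "enca", 4) != 0 ||
        (arg[4] != '\0' && arg[4] != ':')) {
        /* Not the enca form: treat the whole value as a charset. */
        n = strlen(arg);
        if (n == 0 || n >= sizeof(out->charset))
            return -1;
        memcpy(out->charset, arg, n + 1);
        return 0;
    }
    out->use_enca = 1;
    if (arg[4] == '\0')
        return 0;                   /* bare "enca" */
    p = arg + 5;                    /* start of the language field */
    q = strchr(p, ':');
    n = q ? (size_t)(q - p) : strlen(p);
    if (n >= sizeof(out->language))
        return -1;
    memcpy(out->language, p, n);
    out->language[n] = '\0';
    if (!q)
        return 0;                   /* no fallback charset given */
    p = q + 1;
    n = strlen(p);
    if (n >= sizeof(out->charset))
        return -1;
    memcpy(out->charset, p, n + 1);
    return 0;
}
```

With this, "enca:ru:cp1251" yields use_enca = 1, language "ru" and
charset "cp1251", while a plain "koi8-r" is taken as an explicit
charset with no detection.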

> > > Don't use declarations in middle of code blocks.
> >
> > Will it be enough to place the declaration just after the "{"
> > following the "if" statement?
>
> Sure.

Fixed.

> > > This is leaking memory.
> >
> > Why?! This is realloc(), not malloc(). It will reuse (or
> > free() and malloc()) memory of p->filename. At the end,
> > close_f() will free this memory inside m_struct_free() call in
> > the same way as it will be freed in the case of original
> > p->filename pointer. From the point of m_option_free() it
> > should not be any difference between handling the old and the
> > new pointer.
>
> The iconv context is not freed AFAICT.

Fixed.

The first patch contains the mp_recode_to() implementation. It no
longer uses a static buffer and can grow the buffer as needed. A
static buffer was acceptable for text messages, but a playback
failure due to insufficient buffer size is not acceptable. Now
each caller must manage its own buffer and iconv_t handle.
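
To illustrate the idea without opening the attachment, here is a
simplified sketch of such a grow-as-needed conversion loop. It is
not the code from the patch: recode_grow() and its signature are
made up for this mail, and only standard iconv calls are used.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Recode the NUL-terminated string `in` through the already opened
 * handle `cd` into *buf, enlarging *buf with realloc() whenever
 * iconv() reports E2BIG. The caller owns both `cd` and *buf and is
 * responsible for iconv_close() and free().
 * Returns *buf (NUL-terminated) on success, NULL on error.
 * Stateful target encodings would also need a final flush call. */
static char *recode_grow(iconv_t cd, const char *in,
                         char **buf, size_t *bufsz)
{
    char *ip = (char *)in;
    size_t il = strlen(in);
    size_t used = 0;

    iconv(cd, NULL, NULL, NULL, NULL);   /* reset conversion state */
    for (;;) {
        char *tmp;
        size_t newsz;

        if (*bufsz > used + 1) {
            char *op = *buf + used;
            size_t ol = *bufsz - used - 1;   /* keep room for NUL */
            size_t r = iconv(cd, &ip, &il, &op, &ol);

            used = *bufsz - 1 - ol;
            if (r != (size_t)-1)
                break;                       /* all input consumed */
            if (errno != E2BIG)
                return NULL;                 /* invalid input etc. */
        }
        /* Output buffer missing or full: double it and retry. */
        newsz = *bufsz ? *bufsz * 2 : 16;
        tmp = realloc(*buf, newsz);
        if (!tmp)
            return NULL;
        *buf = tmp;
        *bufsz = newsz;
    }
    (*buf)[used] = '\0';
    return *buf;
}
```

A caller starts with buf = NULL and bufsz = 0, and can reuse the
same buffer and handle for every filename it has to recode.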

The second patch is the -ftp-charset implementation itself, with
the appropriate documentation.

With best regards,
Andrew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ftp-charset-recode.patch
Type: text/x-diff
Size: 4532 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20080419/1fe584c4/attachment.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ftp-charset.patch
Type: text/x-diff
Size: 2912 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20080419/1fe584c4/attachment-0001.patch>