[MPlayer-dev-eng] [PATCH] ftp charset selection support

Andrew Savchenko Bircoph at list.ru
Sat Apr 12 20:29:14 CEST 2008


Hi,

On Saturday 12 April 2008 14:30, Alban Bedel wrote:
> Andrew Savchenko <Bircoph at list.ru> wrote:
> > Hello,
> >
> > Currently there are a lot of ftp servers with different
> > charsets and it is a problem to work with these hosts if local
> > encoding differs, only for Russian there are over 10 different
> > encoding and 3 of them are widely used. So I made a patch
> > which allows to select remote server's encoding and recode URL
> > into it.
>
> Just out of interest, how is this problem handled in HTML? Is it
> needed to write the document in the encoding used by the FTP
> server one link to?

Yes, it is required to be written in ftp server's encoding. I have 
an ffsearch server on my host and encountered this problem long 
time ago...

Originally ftp supported only ASCII charset (rfc959) and there was 
no problems, but afterwards file names in national encodings began 
to occur, and all this mess begins. Well, UTF-8 will be a fix 
after all, but this wouldn't be soon...

> Generally I'm OK with the idea of this patch, but we might want
> to also take a look at the URL/filename encoding problem in
> general. The problem have 2 side, source encoding and
> destination encoding. As source we currently have:
>
> * command line: the terminal encoding will do
> * slave command: terminal encoding or fs encoding?

IMHO if mplayer reads slave commands from stdin, it should use 
terminal encoding. If it reads them from file it should use FS 
encoding. Maybe it will be better to introduce some input_charset 
internal variable to make input-encoding-dependent functions to be 
independent of type of input: where we use -slave from file or 
normal stdin, or from command line. Perhaps it is worthy to make 
this variable configurable by the user.

> * playlist: some format/transport protocol might specify it,
>             fallback on local fs encoding?

Perhaps in case of transport protocol it will be reasonable to use 
server's encoding as fallback, and for local playlist use local 
encoding.

Also, ENCA seems to be a good choice as fallback for this and 
previous issue. Currently it is used in subtitle routines and 
saves me a lot of time. Only Russian subtitles I own are present 
in 5 different encodings and it would be a pain to deal with them 
if not ENCA. Of course this approach in not ideal and may fail, 
but this is better than nothing and seems to work perfectly in the 
most of cases.

> On the destination side only the stream itself can tell what is
> needed. Using a common format, like UTF-8, would be tempting;

Are you talking about network streams or about an internal streams 
inside mplayer? Most likely that network streams are the subject. 
I dunno exactly, but I'm not sure that all type of streams support 
encoding information and all network servers set it properly.

> but AFAIK back and forth conversion is not reliable with all
> encoding.

At least it seems to be reliable for cyrillic and CJK charsets, I 
can't tell about the rest.

> The best would then probably be to simply pass the 
> encoding along with the URL/filename when opening the stream.
> Suggestions welcome :)

And why not introduce new command line option? Inventing new URL 
schemes seems not good from my standpoint, but this may be 
implemented optionally. And do not forget about ENCA. We can use 
the same approach as in subcp: 
enca[:fallback language[:fallbach charset]].

> > +#ifdef USE_ICONV
> > +#include <iconv.h>
> > +#endif /* USE_ICONV */
> > +
> > +/* maximum message length of mp_msg */
> > +#define MSGSIZE_MAX 3072
> > +
> > +const char* filename_recode_common(iconv_t inv_msgiconv,
> > +    const char* filename,const char *tocode, const char
> > *fromcode); +
>
> This is going to break if iconv is unavailable. 

Will fix it.

> The function 
> name could also be better. This function is not specific to
> filenames and _common say very little. mp_recode_to()?

Yes, it is a good choice. And what is more this function becomes a 
general i/o routine rather than mp_msg specific stuff. I think it 
is better to move this function somewhere else, but where exactly? 
I'm in doubt in finding appropriate file to place this function.

> I would suggest making this function static and adding a wrapper
> that hide iconv. That would allow mp_msg to keep using a static
> iconv context and make things simpler for other uses.

Isn't filename_recode() is an appropriate wrapper to mp_recode_to() 
(using new proposed function name)? It handles iconv context for 
mp_recode_to() currently. I have no objections against making 
these functions static, so it will be done. Also I want to extract 
storage buffer for recoded_name from mp_recode_to() in order to 
make this function thread safe.

> > -/* maximum message length of mp_msg */
> > -#define MSGSIZE_MAX 3072
> > -
>
> Why do you move this define out of mp_msg.c?

It is a remnant from earlier code changes. Will be removed.

> > +/* defined in stream_ftp.c */
> > +#ifdef USE_ICONV
> > +extern char *ftp_charset;
> > +#endif /* USE_ICONV */
>
> No need to put extern declarations under #ifdef.

Ok, but I'm surprised by this language feature ;-).

> > +#ifdef USE_ICONV
> > +#include <iconv.h>
> > +#include <string.h>
>
> The need for iconv should be abstracted away.
>
> > +#ifdef USE_ICONV
>
> Get rid of the #ifdef.

ok

> > +  //recode filename if needed
> > +  if (ftp_charset) {
> > +    mp_msg(MSGT_OPEN, MSGL_V, "[ftp] recoding filename %s
> > from %s to %s\n", +            p->filename,
> > get_term_charset(), ftp_charset); +
> > +    iconv_t iconv_filename = (iconv_t)(-1);
> > +    /* There is no guarantee that filename_recode_common
> > wouldn't +     * be called during playback. Thus buffer must
> > be copied to +     * p->filename with appropriate memory
> > reallocation. */ +    char *buf =
> > filename_recode_common(iconv_filename, +           
> > p->filename, ftp_charset, get_term_charset());
>
> Don't use declarations in middle of code blocks.

Will be enough to place declaration just after "{" clause after 
"if" statement?

> > +    size_t len = strlen(buf) + 1;
> > +    p->filename = realloc(p->filename, len);
> > +    memcpy(p->filename, buf, len);
> > +  }
> > +#endif /* USE_ICONV */
>
> This is leaking memory.

Why?! This is realloc(), not malloc(). It will reuse (or free() and 
malloc()) memory of p->filename. At the end, close_f() will free 
this memory inside m_struct_free() call in the same way as it will 
be freed in the case of original p->filename pointer. From the 
point of m_option_free() it should not be any difference between 
handling the old and the new pointer.

> > Currently user must specify encoding on his/her own, but in
> > several cases it is possible to autodetect encoding:
> > 1) Negotiation with server using LANG feature (rfc2640) if
> > server supports it.
> > 2) Alternatively try to server's guess encoding by
> > descendingly parsing directories in URL (starting from /) in
> > order to identify encoding using ENCA.
> >
> > This will be implemented later, I hope.
>
> Patch welcome :)

I'll work on patches later. Today I feel myself too tired...
It would be great to hear from you some words about (possible) 
memory leak and mp_recode_to() issues before the new patch will be 
made.

Well, I have look over some RFC's concerning ftp, and some real 
servers... I have no idea what ftp server supports LANG extension. 
Even my favourite vsftpd doesn't do this. Anyway I can use 
reference code from lftp. 

Also I have found another interesting feature which will be 
usefull: ff server supports FEAT and OPTS UTF8 ON, the last 
command can be used to established utf8 filename transmission for 
sure; vsftpd do this by default.

With best regards,
Andrew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20080412/d4702ac7/attachment.pgp>


More information about the MPlayer-dev-eng mailing list