Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
Warning: Cannot modify header information - headers already sent by (output started at /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code:102) in /var/www/iplanru/data/www/intesco.ru/d59ed/index.php(1) : eval()'d code(2) : eval()'d code on line 4
KR.pm 0000666 00000004026 15077514243 0005434 0 ustar 00 package Encode::KR;
BEGIN {
if ( ord("A") == 193 ) {
die "Encode::KR not supported on EBCDIC\n";
}
}
use strict;
use warnings;
use Encode;
our $VERSION = do { my @r = ( q$Revision: 2.2 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
use Encode::KR::2022_KR;
1;
__END__
=head1 NAME
Encode::KR - Korean Encodings
=head1 SYNOPSIS
use Encode qw/encode decode/;
$euc_kr = encode("euc-kr", $utf8); # loads Encode::KR implicitly
$utf8 = decode("euc-kr", $euc_kr); # ditto
=head1 DESCRIPTION
This module implements Korean charset encodings. Encodings supported
are as follows.
Canonical Alias Description
--------------------------------------------------------------------
euc-kr /\beuc.*kr$/i EUC (Extended Unix Character)
/\bkr.*euc$/i
ksc5601-raw Korean standard code set (as is)
cp949 /(?:x-)?uhc$/i
/(?:x-)?windows-949$/i
/\bks_c_5601-1987$/i
Code Page 949 (EUC-KR + 8,822
(additional Hangul syllables)
MacKorean EUC-KR + Apple Vendor Mappings
johab JOHAB A supplementary encoding defined in
Annex 3 of KS X 1001:1998
iso-2022-kr iso-2022-kr [RFC1557]
--------------------------------------------------------------------
To find how to use this module in detail, see L.
=head1 BUGS
When you see C on mails and web pages, they really
mean "cp949" encodings. To fix that, the following aliases are set;
qr/(?:x-)?uhc$/i => '"cp949"'
qr/(?:x-)?windows-949$/i => '"cp949"'
qr/ks_c_5601-1987$/i => '"cp949"'
The ASCII region (0x00-0x7f) is preserved for all encodings, even
though this conflicts with mappings by the Unicode Consortium. See
L
to find out why it is implemented that way.
=head1 SEE ALSO
L
=cut
Byte.pm 0000666 00000004615 15077514243 0006027 0 ustar 00 package Encode::Byte;
use strict;
use warnings;
use Encode;
our $VERSION = do { my @r = ( q$Revision: 2.3 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
1;
__END__
=head1 NAME
Encode::Byte - Single Byte Encodings
=head1 SYNOPSIS
use Encode qw/encode decode/;
$greek = encode("iso-8859-7", $utf8); # loads Encode::Byte implicitly
$utf8 = decode("iso-8859-7", $greek); # ditto
=head1 ABSTRACT
This module implements various single byte encodings. For most cases it uses
\x80-\xff (upper half) to map non-ASCII characters. Encodings
supported are as follows.
Canonical Alias Description
--------------------------------------------------------------------
# ISO 8859 series
(iso-8859-1 is in built-in)
iso-8859-2 latin2 [ISO]
iso-8859-3 latin3 [ISO]
iso-8859-4 latin4 [ISO]
iso-8859-5 [ISO]
iso-8859-6 [ISO]
iso-8859-7 [ISO]
iso-8859-8 [ISO]
iso-8859-9 latin5 [ISO]
iso-8859-10 latin6 [ISO]
iso-8859-11
(iso-8859-12 is nonexistent)
iso-8859-13 latin7 [ISO]
iso-8859-14 latin8 [ISO]
iso-8859-15 latin9 [ISO]
iso-8859-16 latin10 [ISO]
# Cyrillic
koi8-f
koi8-r cp878 [RFC1489]
koi8-u [RFC2319]
# Vietnamese
viscii
# all cp* are also available as ibm-*, ms-*, and windows-*
# also see L
cp424
cp437
cp737
cp775
cp850
cp852
cp855
cp856
cp857
cp860
cp861
cp862
cp863
cp864
cp865
cp866
cp869
cp874
cp1006
cp1250 WinLatin2
cp1251 WinCyrillic
cp1252 WinLatin1
cp1253 WinGreek
cp1254 WinTurkish
cp1255 WinHebrew
cp1256 WinArabic
cp1257 WinBaltic
cp1258 WinVietnamese
# Macintosh
# Also see L
MacArabic
MacCentralEurRoman
MacCroatian
MacCyrillic
MacFarsi
MacGreek
MacHebrew
MacIcelandic
MacRoman
MacRomanian
MacRumanian
MacSami
MacThai
MacTurkish
MacUkrainian
# More vendor encodings
AdobeStandardEncoding
nextstep
hp-roman8
=head1 DESCRIPTION
To find how to use this module in detail, see L.
=head1 SEE ALSO
L
=cut
Guess.pm 0000666 00000023461 15077514243 0006212 0 ustar 00 package Encode::Guess;
use strict;
use warnings;
use Encode qw(:fallbacks find_encoding);
our $VERSION = do { my @r = ( q$Revision: 2.3 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
my $Canon = 'Guess';
sub DEBUG () { 0 }
our %DEF_SUSPECTS = map { $_ => find_encoding($_) } qw(ascii utf8);
$Encode::Encoding{$Canon} = bless {
Name => $Canon,
Suspects => {%DEF_SUSPECTS},
} => __PACKAGE__;
use base qw(Encode::Encoding);
sub needs_lines { 1 }
sub perlio_ok { 0 }
our @EXPORT = qw(guess_encoding);
our $NoUTFAutoGuess = 0;
our $UTF8_BOM = pack( "C3", 0xef, 0xbb, 0xbf );
sub import { # Exporter not used so we do it on our own
my $callpkg = caller;
for my $item (@EXPORT) {
no strict 'refs';
*{"$callpkg\::$item"} = \&{"$item"};
}
set_suspects(@_);
}
sub set_suspects {
my $class = shift;
my $self = ref($class) ? $class : $Encode::Encoding{$Canon};
$self->{Suspects} = {%DEF_SUSPECTS};
$self->add_suspects(@_);
}
sub add_suspects {
my $class = shift;
my $self = ref($class) ? $class : $Encode::Encoding{$Canon};
for my $c (@_) {
my $e = find_encoding($c) or die "Unknown encoding: $c";
$self->{Suspects}{ $e->name } = $e;
DEBUG and warn "Added: ", $e->name;
}
}
sub decode($$;$) {
my ( $obj, $octet, $chk ) = @_;
my $guessed = guess( $obj, $octet );
unless ( ref($guessed) ) {
require Carp;
Carp::croak($guessed);
}
my $utf8 = $guessed->decode( $octet, $chk );
$_[1] = $octet if $chk;
return $utf8;
}
sub guess_encoding {
guess( $Encode::Encoding{$Canon}, @_ );
}
sub guess {
my $class = shift;
my $obj = ref($class) ? $class : $Encode::Encoding{$Canon};
my $octet = shift;
# sanity check
return "Empty string, empty guess" unless defined $octet and length $octet;
# cheat 0: utf8 flag;
if ( Encode::is_utf8($octet) ) {
return find_encoding('utf8') unless $NoUTFAutoGuess;
Encode::_utf8_off($octet);
}
# cheat 1: BOM
use Encode::Unicode;
unless ($NoUTFAutoGuess) {
my $BOM = pack( 'C3', unpack( "C3", $octet ) );
return find_encoding('utf8')
if ( defined $BOM and $BOM eq $UTF8_BOM );
$BOM = unpack( 'N', $octet );
return find_encoding('UTF-32')
if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe0000 ) );
$BOM = unpack( 'n', $octet );
return find_encoding('UTF-16')
if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe ) );
if ( $octet =~ /\x00/o )
{ # if \x00 found, we assume UTF-(16|32)(BE|LE)
my $utf;
my ( $be, $le ) = ( 0, 0 );
if ( $octet =~ /\x00\x00/o ) { # UTF-32(BE|LE) assumed
$utf = "UTF-32";
for my $char ( unpack( 'N*', $octet ) ) {
$char & 0x0000ffff and $be++;
$char & 0xffff0000 and $le++;
}
}
else { # UTF-16(BE|LE) assumed
$utf = "UTF-16";
for my $char ( unpack( 'n*', $octet ) ) {
$char & 0x00ff and $be++;
$char & 0xff00 and $le++;
}
}
DEBUG and warn "$utf, be == $be, le == $le";
$be == $le
and return
"Encodings ambiguous between $utf BE and LE ($be, $le)";
$utf .= ( $be > $le ) ? 'BE' : 'LE';
return find_encoding($utf);
}
}
my %try = %{ $obj->{Suspects} };
for my $c (@_) {
my $e = find_encoding($c) or die "Unknown encoding: $c";
$try{ $e->name } = $e;
DEBUG and warn "Added: ", $e->name;
}
my $nline = 1;
for my $line ( split /\r\n?|\n/, $octet ) {
# cheat 2 -- \e in the string
if ( $line =~ /\e/o ) {
my @keys = keys %try;
delete @try{qw/utf8 ascii/};
for my $k (@keys) {
ref( $try{$k} ) eq 'Encode::XS' and delete $try{$k};
}
}
my %ok = %try;
# warn join(",", keys %try);
for my $k ( keys %try ) {
my $scratch = $line;
$try{$k}->decode( $scratch, FB_QUIET );
if ( $scratch eq '' ) {
DEBUG and warn sprintf( "%4d:%-24s ok\n", $nline, $k );
}
else {
use bytes ();
DEBUG
and warn sprintf( "%4d:%-24s not ok; %d bytes left\n",
$nline, $k, bytes::length($scratch) );
delete $ok{$k};
}
}
%ok or return "No appropriate encodings found!";
if ( scalar( keys(%ok) ) == 1 ) {
my ($retval) = values(%ok);
return $retval;
}
%try = %ok;
$nline++;
}
$try{ascii}
or return "Encodings too ambiguous: ", join( " or ", keys %try );
return $try{ascii};
}
1;
__END__
=head1 NAME
Encode::Guess -- Guesses encoding from data
=head1 SYNOPSIS
# if you are sure $data won't contain anything bogus
use Encode;
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
my $utf8 = decode("Guess", $data);
my $data = encode("Guess", $utf8); # this doesn't work!
# more elaborate way
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
ref($enc) or die "Can't guess: $enc"; # trap error this way
$utf8 = $enc->decode($data);
# or
$utf8 = decode($enc->name, $data)
=head1 ABSTRACT
Encode::Guess enables you to guess in what encoding a given data is
encoded, or at least tries to.
=head1 DESCRIPTION
By default, it checks only ascii, utf8 and UTF-16/32 with BOM.
use Encode::Guess; # ascii/utf8/BOMed UTF
To use it more practically, you have to give the names of encodings to
check (I as follows). The name of suspects can either be
canonical names or aliases.
CAVEAT: Unlike UTF-(16|32), BOM in utf8 is NOT AUTOMATICALLY STRIPPED.
# tries all major Japanese Encodings as well
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
value, no heuristics will be applied to UTF8/16/32, and the result
will be limited to the suspects and C.
=over 4
=item Encode::Guess->set_suspects
You can also change the internal suspects list via C
method.
use Encode::Guess;
Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);
=item Encode::Guess->add_suspects
Or you can use C method. The difference is that
C flushes the current suspects list while
C adds.
use Encode::Guess;
Encode::Guess->add_suspects(qw/euc-jp shiftjis 7bit-jis/);
# now the suspects are euc-jp,shiftjis,7bit-jis, AND
# euc-kr,euc-cn, and big5-eten
Encode::Guess->add_suspects(qw/euc-kr euc-cn big5-eten/);
=item Encode::decode("Guess" ...)
When you are content with suspects list, you can now
my $utf8 = Encode::decode("Guess", $data);
=item Encode::Guess->guess($data)
But it will croak if:
=over
=item *
Two or more suspects remain
=item *
No suspects left
=back
So you should instead try this;
my $decoder = Encode::Guess->guess($data);
On success, $decoder is an object that is documented in
L. So you can now do this;
my $utf8 = $decoder->decode($data);
On failure, $decoder now contains an error message so the whole thing
would be as follows;
my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
=item guess_encoding($data, [, I])
You can also try C function which is exported by
default. It takes $data to check and it also takes the list of
suspects by option. The optional suspect list is I to
the internal suspects list.
my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
# check only ascii and utf8
my $decoder = guess_encoding($data);
=back
=head1 CAVEATS
=over 4
=item *
Because of the algorithm used, ISO-8859 series and other single-byte
encodings do not work well unless either one of ISO-8859 is the only
one suspect (besides ascii and utf8).
use Encode::Guess;
# perhaps ok
my $decoder = guess_encoding($data, 'latin1');
# definitely NOT ok
my $decoder = guess_encoding($data, qw/latin1 greek/);
The reason is that Encode::Guess guesses encoding by trial and error.
It first splits $data into lines and tries to decode the line for each
suspect. It keeps it going until all but one encoding is eliminated
out of suspects list. ISO-8859 series is just too successful for most
cases (because it fills almost all code points in \x00-\xff).
=item *
Do not mix national standard encodings and the corresponding vendor
encodings.
# a very bad idea
my $decoder
= guess_encoding($data, qw/shiftjis MacJapanese cp932/);
The reason is that vendor encoding is usually a superset of national
standard so it becomes too ambiguous for most cases.
=item *
On the other hand, mixing various national standard encodings
automagically works unless $data is too short to allow for guessing.
# This is ok if $data is long enough
my $decoder =
guess_encoding($data, qw/euc-cn
euc-jp shiftjis 7bit-jis
euc-kr
big5-eten/);
=item *
DO NOT PUT TOO MANY SUSPECTS! Don't you try something like this!
my $decoder = guess_encoding($data,
Encode->encodings(":all"));
=back
It is, after all, just a guess. You should alway be explicit when it
comes to encodings. But there are some, especially Japanese,
environment that guess-coding is a must. Use this module with care.
=head1 TO DO
Encode::Guess does not work on EBCDIC platforms.
=head1 SEE ALSO
L, L
=cut
CN.pm 0000666 00000004150 15077514243 0005416 0 ustar 00 package Encode::CN;
BEGIN {
if ( ord("A") == 193 ) {
die "Encode::CN not supported on EBCDIC\n";
}
}
use strict;
use warnings;
use Encode;
our $VERSION = do { my @r = ( q$Revision: 2.2 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
# Relocated from Encode.pm
use Encode::CN::HZ;
# use Encode::CN::2022_CN;
1;
__END__
=head1 NAME
Encode::CN - China-based Chinese Encodings
=head1 SYNOPSIS
use Encode qw/encode decode/;
$euc_cn = encode("euc-cn", $utf8); # loads Encode::CN implicitly
$utf8 = decode("euc-cn", $euc_cn); # ditto
=head1 DESCRIPTION
This module implements China-based Chinese charset encodings.
Encodings supported are as follows.
Canonical Alias Description
--------------------------------------------------------------------
euc-cn /\beuc.*cn$/i EUC (Extended Unix Character)
/\bcn.*euc$/i
/\bGB[-_ ]?2312(?:\D.*$|$)/i (see below)
gb2312-raw The raw (low-bit) GB2312 character map
gb12345-raw Traditional chinese counterpart to
GB2312 (raw)
iso-ir-165 GB2312 + GB6345 + GB8565 + additions
MacChineseSimp GB2312 + Apple Additions
cp936 Code Page 936, also known as GBK
(Extended GuoBiao)
hz 7-bit escaped GB2312 encoding
--------------------------------------------------------------------
To find how to use this module in detail, see L.
=head1 NOTES
Due to size concerns, C (an extension to C) is distributed
separately on CPAN, under the name L. That module
also contains extra Taiwan-based encodings.
=head1 BUGS
When you see C on mails and web pages, they really
mean C encodings. To fix that, C is aliased to C.
Use C when you really mean it.
The ASCII region (0x00-0x7f) is preserved for all encodings, even though
this conflicts with mappings by the Unicode Consortium. See
L
to find out why it is implemented that way.
=head1 SEE ALSO
L
=cut
Symbol.pm 0000666 00000001517 15077514243 0006367 0 ustar 00 package Encode::Symbol;
use strict;
use warnings;
use Encode;
our $VERSION = do { my @r = ( q$Revision: 2.2 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
1;
__END__
=head1 NAME
Encode::Symbol - Symbol Encodings
=head1 SYNOPSIS
use Encode qw/encode decode/;
$symbol = encode("symbol", $utf8); # loads Encode::Symbol implicitly
$utf8 = decode("", $symbol); # ditto
=head1 ABSTRACT
This module implements symbol and dingbats encodings. Encodings
supported are as follows.
Canonical Alias Description
--------------------------------------------------------------------
symbol
dingbats
AdobeZDingbat
AdobeSymbol
MacDingbats
=head1 DESCRIPTION
To find out how to use this module in detail, see L.
=head1 SEE ALSO
L
=cut
Encoder.pm 0000666 00000014214 15077514243 0006477 0 ustar 00 #
# $Id: Encoder.pm,v 2.1 2006/05/03 18:24:10 dankogai Exp $
#
package Encode::Encoder;
use strict;
use warnings;
our $VERSION = do { my @r = ( q$Revision: 2.1 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT_OK = qw ( encoder );
our $AUTOLOAD;
sub DEBUG () { 0 }
use Encode qw(encode decode find_encoding from_to);
use Carp;
sub new {
my ( $class, $data, $encname ) = @_;
unless ($encname) {
$encname = Encode::is_utf8($data) ? 'utf8' : '';
}
else {
my $obj = find_encoding($encname)
or croak __PACKAGE__, ": unknown encoding: $encname";
$encname = $obj->name;
}
my $self = {
data => $data,
encoding => $encname,
};
bless $self => $class;
}
sub encoder { __PACKAGE__->new(@_) }
sub data {
my ( $self, $data ) = @_;
if ( defined $data ) {
$self->{data} = $data;
return $data;
}
else {
return $self->{data};
}
}
sub encoding {
my ( $self, $encname ) = @_;
if ($encname) {
my $obj = find_encoding($encname)
or confess __PACKAGE__, ": unknown encoding: $encname";
$self->{encoding} = $obj->name;
return $self;
}
else {
return $self->{encoding};
}
}
sub bytes {
my ( $self, $encname ) = @_;
$encname ||= $self->{encoding};
my $obj = find_encoding($encname)
or confess __PACKAGE__, ": unknown encoding: $encname";
$self->{data} = $obj->decode( $self->{data}, 1 );
$self->{encoding} = '';
return $self;
}
sub DESTROY { # defined so it won't autoload.
DEBUG and warn shift;
}
sub AUTOLOAD {
my $self = shift;
my $type = ref($self)
or confess "$self is not an object";
my $myname = $AUTOLOAD;
$myname =~ s/.*://; # strip fully-qualified portion
my $obj = find_encoding($myname)
or confess __PACKAGE__, ": unknown encoding: $myname";
DEBUG and warn $self->{encoding}, " => ", $obj->name;
if ( $self->{encoding} ) {
from_to( $self->{data}, $self->{encoding}, $obj->name, 1 );
}
else {
$self->{data} = $obj->encode( $self->{data}, 1 );
}
$self->{encoding} = $obj->name;
return $self;
}
use overload
q("") => sub { $_[0]->{data} },
q(0+) => sub { use bytes(); bytes::length( $_[0]->{data} ) },
fallback => 1,
;
1;
__END__
=head1 NAME
Encode::Encoder -- Object Oriented Encoder
=head1 SYNOPSIS
use Encode::Encoder;
# Encode::encode("ISO-8859-1", $data);
Encode::Encoder->new($data)->iso_8859_1; # OOP way
# shortcut
use Encode::Encoder qw(encoder);
encoder($data)->iso_8859_1;
# you can stack them!
encoder($data)->iso_8859_1->base64; # provided base64() is defined
# you can use it as a decoder as well
encoder($base64)->bytes('base64')->latin1;
# stringified
print encoder($data)->utf8->latin1; # prints the string in latin1
# numified
encoder("\x{abcd}\x{ef}g")->utf8 == 6; # true. bytes::length($data)
=head1 ABSTRACT
B allows you to use Encode in an object-oriented
style. This is not only more intuitive than a functional approach,
but also handier when you want to stack encodings. Suppose you want
your UTF-8 string converted to Latin1 then Base64: you can simply say
my $base64 = encoder($utf8)->latin1->base64;
instead of
my $latin1 = encode("latin1", $utf8);
my $base64 = encode_base64($utf8);
or the lazier and more convoluted
my $base64 = encode_base64(encode("latin1", $utf8));
=head1 Description
Here is how to use this module.
=over 4
=item *
There are at least two instance variables stored in a hash reference,
{data} and {encoding}.
=item *
When there is no method, it takes the method name as the name of the
encoding and encodes the instance I with I. If successful,
the instance I is set accordingly.
=item *
You can retrieve the result via -Edata but usually you don't have to
because the stringify operator ("") is overridden to do exactly that.
=back
=head2 Predefined Methods
This module predefines the methods below:
=over 4
=item $e = Encode::Encoder-Enew([$data, $encoding]);
returns an encoder object. Its data is initialized with $data if
present, and its encoding is set to $encoding if present.
When $encoding is omitted, it defaults to utf8 if $data is already in
utf8 or "" (empty string) otherwise.
=item encoder()
is an alias of Encode::Encoder-Enew(). This one is exported on demand.
=item $e-Edata([$data])
When $data is present, sets the instance data to $data and returns the
object itself. Otherwise, the current instance data is returned.
=item $e-Eencoding([$encoding])
When $encoding is present, sets the instance encoding to $encoding and
returns the object itself. Otherwise, the current instance encoding is
returned.
=item $e-Ebytes([$encoding])
decodes instance data from $encoding, or the instance encoding if
omitted. If the conversion is successful, the instance encoding
will be set to "".
The name I was deliberately picked to avoid namespace tainting
-- this module may be used as a base class so method names that appear
in Encode::Encoding are avoided.
=back
=head2 Example: base64 transcoder
This module is designed to work with L.
To make the Base64 transcoder example above really work, you could
write a module like this:
package Encode::Base64;
use base 'Encode::Encoding';
__PACKAGE__->Define('base64');
use MIME::Base64;
sub encode{
my ($obj, $data) = @_;
return encode_base64($data);
}
sub decode{
my ($obj, $data) = @_;
return decode_base64($data);
}
1;
__END__
And your caller module would be something like this:
use Encode::Encoder;
use Encode::Base64;
# now you can really do the following
encoder($data)->iso_8859_1->base64;
encoder($base64)->bytes('base64')->latin1;
=head2 Operator Overloading
This module overloads two operators, stringify ("") and numify (0+).
Stringify dumps the data inside the object.
Numify returns the number of bytes in the instance data.
They come in handy when you want to print or find the size of data.
=head1 SEE ALSO
L,
L
=cut
Unicode.pm 0000666 00000021363 15077514243 0006511 0 ustar 00 package Encode::Unicode;
use strict;
use warnings;
no warnings 'redefine';
our $VERSION = do { my @r = ( q$Revision: 2.6 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
#
# Object Generator 8 transcoders all at once!
#
require Encode;
our %BOM_Unknown = map { $_ => 1 } qw(UTF-16 UTF-32);
for my $name (
qw(UTF-16 UTF-16BE UTF-16LE
UTF-32 UTF-32BE UTF-32LE
UCS-2BE UCS-2LE)
)
{
my ( $size, $endian, $ucs2, $mask );
$name =~ /^(\w+)-(\d+)(\w*)$/o;
if ( $ucs2 = ( $1 eq 'UCS' ) ) {
$size = 2;
}
else {
$size = $2 / 8;
}
$endian = ( $3 eq 'BE' ) ? 'n' : ( $3 eq 'LE' ) ? 'v' : '';
$size == 4 and $endian = uc($endian);
$Encode::Encoding{$name} = bless {
Name => $name,
size => $size,
endian => $endian,
ucs2 => $ucs2,
} => __PACKAGE__;
}
use base qw(Encode::Encoding);
sub renew {
my $self = shift;
$BOM_Unknown{ $self->name } or return $self;
my $clone = bless {%$self} => ref($self);
$clone->{renewed}++; # so the caller knows it is renewed.
return $clone;
}
# There used to be a perl implemntation of (en|de)code but with
# XS version is ripe, perl version is zapped for optimal speed
*decode = \&decode_xs;
*encode = \&encode_xs;
1;
__END__
=head1 NAME
Encode::Unicode -- Various Unicode Transformation Formats
=cut
=head1 SYNOPSIS
use Encode qw/encode decode/;
$ucs2 = encode("UCS-2BE", $utf8);
$utf8 = decode("UCS-2BE", $ucs2);
=head1 ABSTRACT
This module implements all Character Encoding Schemes of Unicode that
are officially documented by Unicode Consortium (except, of course,
for UTF-8, which is a native format in perl).
=over 4
=item L says:
I A character encoding form plus byte
serialization. There are Seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE) and
UTF-32LE (UCS-4LE), and UTF-7.
Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, It is not part of
Unicode's Character Encoding Scheme. It is separately implemented in
Encode::Unicode::UTF7. For details see L.
=item Quick Reference
Decodes from ord(N) Encodes chr(N) to...
octet/char BOM S.P d800-dfff ord > 0xffff \x{1abcd} ==
---------------+-----------------+------------------------------
UCS-2BE 2 N N is bogus Not Available
UCS-2LE 2 N N bogus Not Available
UTF-16 2/4 Y Y is S.P S.P BE/LE
UTF-16BE 2/4 N Y S.P S.P 0xd82a,0xdfcd
UTF-16LE 2/4 N Y S.P S.P 0x2ad8,0xcddf
UTF-32 4 Y - is bogus As is BE/LE
UTF-32BE 4 N - bogus As is 0x0001abcd
UTF-32LE 4 N - bogus As is 0xcdab0100
UTF-8 1-4 - - bogus >= 4 octets \xf0\x9a\af\8d
---------------+-----------------+------------------------------
=back
=head1 Size, Endianness, and BOM
You can categorize these CES by 3 criteria: size of each character,
endianness, and Byte Order Mark.
=head2 by size
UCS-2 is a fixed-length encoding with each character taking 16 bits.
It B support I. When a surrogate pair
is encountered during decode(), its place is filled with \x{FFFD}
if I is 0, or the routine croaks if I is 1. When a
character whose ord value is larger than 0xFFFF is encountered,
its place is filled with \x{FFFD} if I is 0, or the routine
croaks if I is 1.
UTF-16 is almost the same as UCS-2 but it supports I.
When it encounters a high surrogate (0xD800-0xDBFF), it fetches the
following low surrogate (0xDC00-0xDFFF) and Cs them to
form a character. Bogus surrogates result in death. When \x{10000}
or above is encountered during encode(), it Cs them and
pushes the surrogate pair to the output stream.
UTF-32 (UCS-4) is a fixed-length encoding with each character taking 32 bits.
Since it is 32-bit, there is no need for I.
=head2 by endianness
The first (and now failed) goal of Unicode was to map all character
repertoires into a fixed-length integer so that programmers are happy.
Since each character is either a I or I in C, you have to
pay attention to the endianness of each platform when you pass data
to one another.
Anything marked as BE is Big Endian (or network byte order) and LE is
Little Endian (aka VAX byte order). For anything not marked either
BE or LE, a character called Byte Order Mark (BOM) indicating the
endianness is prepended to the string.
CAVEAT: Though BOM in utf8 (\xEF\xBB\xBF) is valid, it is meaningless
and as of this writing Encode suite just leave it as is (\x{FeFF}).
=over 4
=item BOM as integer when fetched in network byte order
16 32 bits/char
-------------------------
BE 0xFeFF 0x0000FeFF
LE 0xFFFe 0xFFFe0000
-------------------------
=back
This modules handles the BOM as follows.
=over 4
=item *
When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
=item *
When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says. If no BOM is found, the routine dies.
=item *
When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended. So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line
or each line, not file, will have a BOM prepended.
=item *
C is an exception. Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.
=back
=head1 Surrogate Pairs
To say the least, surrogate pairs were the biggest mistake of the
Unicode Consortium. But according to the late Douglas Adams in I Trilogy, C. Their mistake was not of this
magnitude so let's forgive them.
(I don't dare make any comparison with Unicode Consortium and the
Vogons here ;) Or, comparing Encode to Babel Fish is completely
appropriate -- if you can only stick this into your ear :)
Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
character repertoires. But they already made UCS-2 16-bit. What
do we do?
Back then, the range 0xD800-0xDFFF was not allocated. Let's split
that range in half and use the first half to represent the C and the second half to represent the C. That way, you can represent 1024 * 1024 =
1048576 more characters. Now we can store character ranges up to
\x{10ffff} even with 16-bit encodings. This pair of half-character is
now called a I and UTF-16 is the name of the encoding
that embraces them.
Here is a formula to ensurrogate a Unicode character \x{10000} and
above;
$hi = ($uni - 0x10000) / 0x400 + 0xD800;
$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
And to desurrogate;
$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
perl does not prohibit the use of characters within this range. To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I.
(*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
integer support!
=head1 Error Checking
Unlike most encodings which accept various ways to handle errors,
Unicode encodings simply croaks.
% perl -MEncode -e'$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \
-e'Encode::from_to($_, "utf16","shift_jis", 0); print'
UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
% perl -MEncode -e'$a = "BOM missing"' \
-e' Encode::from_to($a, "utf16", "shift_jis", 0); print'
UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
Unlike other encodings where mappings are not one-to-one against
Unicode, UTFs are supposed to map 100% against one another. So Encode
is more strict on UTFs.
Consider that "division by zero" of Encode :)
=head1 SEE ALSO
L, L, L,
L,
RFC 2781 L,
The whole Unicode standard L
Ch. 15, pp. 403 of C
by Larry Wall, Tom Christiansen, Jon Orwant;
O'Reilly & Associates; ISBN 0-596-00027-8
=cut
KR/2022_KR.pm 0000666 00000003552 15077514243 0006420 0 ustar 00 package Encode::KR::2022_KR;
use strict;
use warnings;
our $VERSION = do { my @r = ( q$Revision: 2.2 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use Encode qw(:fallbacks);
use base qw(Encode::Encoding);
__PACKAGE__->Define('iso-2022-kr');
sub needs_lines { 1 }
sub perlio_ok {
return 0; # for the time being
}
sub decode {
my ( $obj, $str, $chk ) = @_;
my $res = $str;
my $residue = iso_euc( \$res );
# This is for PerlIO
$_[1] = $residue if $chk;
return Encode::decode( 'euc-kr', $res, FB_PERLQQ );
}
sub encode {
my ( $obj, $utf8, $chk ) = @_;
# empty the input string in the stack so perlio is ok
$_[1] = '' if $chk;
my $octet = Encode::encode( 'euc-kr', $utf8, FB_PERLQQ );
euc_iso( \$octet );
return $octet;
}
use Encode::CJKConstants qw(:all);
# ISO<->EUC
sub iso_euc {
my $r_str = shift;
$$r_str =~ s/$RE{'2022_KR'}//gox; # remove the designator
$$r_str =~ s{ # replace characters in GL
\x0e # between SO(\x0e) and SI(\x0f)
([^\x0f]*) # with characters in GR
\x0f
}
{
my $out= $1;
$out =~ tr/\x21-\x7e/\xa1-\xfe/;
$out;
}geox;
my ($residue) = ( $$r_str =~ s/(\e.*)$//so );
return $residue;
}
sub euc_iso {
no warnings qw(uninitialized);
my $r_str = shift;
substr( $$r_str, 0, 0 ) =
$ESC{'2022_KR'}; # put the designator at the beg.
$$r_str =~
s{ # move KS X 1001 characters in GR to GL
($RE{EUC_C}+) # and enclose them with SO and SI
}{
my $str = $1;
$str =~ tr/\xA1-\xFE/\x21-\x7E/;
"\x0e" . $str . "\x0f";
}geox;
$$r_str;
}
1;
__END__
=head1 NAME
Encode::KR::2022_KR -- internally used by Encode::KR
=cut
TW.pm 0000666 00000004542 15077514243 0005455 0 ustar 00 package Encode::TW;
BEGIN {
if ( ord("A") == 193 ) {
die "Encode::TW not supported on EBCDIC\n";
}
}
use strict;
use warnings;
use Encode;
our $VERSION = do { my @r = ( q$Revision: 2.2 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
1;
__END__
=head1 NAME
Encode::TW - Taiwan-based Chinese Encodings
=head1 SYNOPSIS
use Encode qw/encode decode/;
$big5 = encode("big5", $utf8); # loads Encode::TW implicitly
$utf8 = decode("big5", $big5); # ditto
=head1 DESCRIPTION
This module implements tradition Chinese charset encodings as used
in Taiwan and Hong Kong.
Encodings supported are as follows.
Canonical Alias Description
--------------------------------------------------------------------
big5-eten /\bbig-?5$/i Big5 encoding (with ETen extensions)
/\bbig5-?et(en)?$/i
/\btca-?big5$/i
big5-hkscs /\bbig5-?hk(scs)?$/i
/\bhk(scs)?-?big5$/i
Big5 + Cantonese characters in Hong Kong
MacChineseTrad Big5 + Apple Vendor Mappings
cp950 Code Page 950
= Big5 + Microsoft vendor mappings
--------------------------------------------------------------------
To find out how to use this module in detail, see L.
=head1 NOTES
Due to size concerns, C (Extended Unix Character), C
(Chinese Character Code for Information Interchange), C
(CMEX's Big5+) and C (CMEX's Big5e) are distributed separately
on CPAN, under the name L. That module also contains
extra China-based encodings.
=head1 BUGS
Since the original C encoding (1984) is not supported anywhere
(glibc and DOS-based systems uses C to mean C; Microsoft
uses C to mean C), a conscious decision was made to alias
C to C, which is the de facto superset of the original
big5.
The C encoding files are not complete. For common C
manipulation, please use C in L, which contains
planes 1-7.
The ASCII region (0x00-0x7f) is preserved for all encodings, even
though this conflicts with mappings by the Unicode Consortium. See
L
to find out why it is implemented that way.
=head1 SEE ALSO
L
=cut
Encoding.pm 0000666 00000021607 15077514243 0006652 0 ustar 00 package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
use warnings;
our $VERSION = do { my @r = ( q$Revision: 2.5 $ =~ /\d+/g ); sprintf "%d." . "%02d" x $#r, @r };
require Encode;
sub DEBUG { 0 }
sub Define {
my $obj = shift;
my $canonical = shift;
$obj = bless { Name => $canonical }, $obj unless ref $obj;
# warn "$canonical => $obj\n";
Encode::define_encoding( $obj, $canonical, @_ );
}
sub name { return shift->{'Name'} }
sub mime_name{
require Encode::MIME::Name;
return Encode::MIME::Name::get_mime_name(shift->name);
}
# sub renew { return $_[0] }
sub renew {
my $self = shift;
my $clone = bless {%$self} => ref($self);
$clone->{renewed}++; # so the caller can see it
DEBUG and warn $clone->{renewed};
return $clone;
}
sub renewed { return $_[0]->{renewed} || 0 }
*new_sequence = \&renew;
sub needs_lines { 0 }
sub perlio_ok {
eval { require PerlIO::encoding };
return $@ ? 0 : 1;
}
# (Temporary|legacy) methods
sub toUnicode { shift->decode(@_) }
sub fromUnicode { shift->encode(@_) }
#
# Needs to be overloaded or just croak
#
sub encode {
require Carp;
my $obj = shift;
my $class = ref($obj) ? ref($obj) : $obj;
Carp::croak( $class . "->encode() not defined!" );
}
sub decode {
require Carp;
my $obj = shift;
my $class = ref($obj) ? ref($obj) : $obj;
Carp::croak( $class . "->encode() not defined!" );
}
sub DESTROY { }
1;
__END__
=head1 NAME
Encode::Encoding - Encode Implementation Base Class
=head1 SYNOPSIS
package Encode::MyEncoding;
use base qw(Encode::Encoding);
__PACKAGE__->Define(qw(myCanonical myAlias));
=head1 DESCRIPTION
As mentioned in L, encodings are (in the current
implementation at least) defined as objects. The mapping of encoding
name to object is via the C<%Encode::Encoding> hash. Though you can
directly manipulate this hash, it is strongly encouraged to use this
base class module and add encode() and decode() methods.
=head2 Methods you should implement
You are strongly encouraged to implement methods below, at least
either encode() or decode().
=over 4
=item -Eencode($string [,$check])
MUST return the octet sequence representing I<$string>.
=over 2
=item *
If I<$check> is true, it SHOULD modify I<$string> in place to remove
the converted part (i.e. the whole string unless there is an error).
If perlio_ok() is true, SHOULD becomes MUST.
=item *
If an error occurs, it SHOULD return the octet sequence for the
fragment of string that has been converted and modify $string in-place
to remove the converted part leaving it starting with the problem
fragment. If perlio_ok() is true, SHOULD becomes MUST.
=item *
If I<$check> is is false then C MUST make a "best effort" to
convert the string - for example, by using a replacement character.
=back
=item -Edecode($octets [,$check])
MUST return the string that I<$octets> represents.
=over 2
=item *
If I<$check> is true, it SHOULD modify I<$octets> in place to remove
the converted part (i.e. the whole sequence unless there is an
error). If perlio_ok() is true, SHOULD becomes MUST.
=item *
If an error occurs, it SHOULD return the fragment of string that has
been converted and modify $octets in-place to remove the converted
part leaving it starting with the problem fragment. If perlio_ok() is
true, SHOULD becomes MUST.
=item *
If I<$check> is false then C should make a "best effort" to
convert the string - for example by using Unicode's "\x{FFFD}" as a
replacement character.
=back
=back
If you want your encoding to work with L pragma, you should
also implement the method below.
=over 4
=item -Ecat_decode($destination, $octets, $offset, $terminator [,$check])
MUST decode I<$octets> with I<$offset> and concatenate it to I<$destination>.
Decoding will terminate when $terminator (a string) appears in output.
I<$offset> will be modified to the last $octets position at end of decode.
Returns true if $terminator appears output, else returns false.
=back
=head2 Other methods defined in Encode::Encodings
You do not have to override methods shown below unless you have to.
=over 4
=item -Ename
Predefined As:
sub name { return shift->{'Name'} }
MUST return the string representing the canonical name of the encoding.
=item -Emime_name
Predefined As:
sub mime_name{
require Encode::MIME::Name;
return Encode::MIME::Name::get_mime_name(shift->name);
}
MUST return the string representing the IANA charset name of the encoding.
=item -Erenew
Predefined As:
sub renew {
my $self = shift;
my $clone = bless { %$self } => ref($self);
$clone->{renewed}++;
return $clone;
}
This method reconstructs the encoding object if necessary. If you need
to store the state during encoding, this is where you clone your object.
PerlIO ALWAYS calls this method to make sure it has its own private
encoding object.
=item -Erenewed
Predefined As:
sub renewed { $_[0]->{renewed} || 0 }
Tells whether the object is renewed (and how many times). Some
modules emit C