UTF and ANSI reports


Starting with version 10, vpxPrint accepts ANSI and UTF files.

 

ANSI files cannot represent complex characters such as those found in Asian character sets. To print such characters you must use UTF-format files.

The UTF file formats are discussed briefly below; a more complete description can be found on the http://unicode.org site.

 

From the vpxPrint point of view:

vpxPrint is fully compatible with the ANSI file formats and supports UTF-8 and UTF-16 encodings.

Because vpxPrint supports both ANSI and UTF file formats, it must know the exact format of the stream before processing it.

If the stream contains a BOM (see below), there is no doubt: the file is UTF-encoded. UTF-8 files, however, may or may not have a BOM.

How to detect whether a stream without a BOM is ANSI or UTF-8:

One approach is to assume that the file is UTF-8 and decode it as such; if an error is encountered while decoding a character, the file is probably *not* UTF-8.

This is not 100% failsafe, however: an ANSI file can happen to contain only byte sequences that are also valid UTF-8.

Another problem is that the full content of the file must be examined before processing, which adds needless overhead.
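
The trial-decoding approach described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code vpxPrint uses; the function name `looks_like_utf8` is hypothetical, and Windows code page 1252 stands in for "ANSI" as an assumption.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is valid UTF-8."""
    try:
        data.decode("utf-8")   # strict decoding: raises on the first invalid sequence
    except UnicodeDecodeError:
        return False
    return True


# Genuine UTF-8 text decodes cleanly.
utf8_text = "日本語".encode("utf-8")

# ANSI (cp1252) text with an accented letter is usually rejected:
# the single byte 0xE9 is not a complete UTF-8 sequence.
ansi_text = "café".encode("cp1252")

# Not failsafe: these two cp1252 characters produce the bytes C3 A9,
# which happen to be the valid UTF-8 encoding of "é".
ambiguous = "Ã©".encode("cp1252")

print(looks_like_utf8(utf8_text), looks_like_utf8(ansi_text), looks_like_utf8(ambiguous))
```

The third case shows why the test can only say "probably": the whole stream has to be scanned, and a clean decode still does not prove that the author intended UTF-8.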

 

If your file is UTF-encoded and does not contain a BOM:

You must insert a <utf-8> tag at the beginning of your file; otherwise the file is processed as an ANSI file.

 


 

Unicode

 

 

A "Unicode Transformation Format" (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term "UCS transformation format" for UTF; the two terms are merely synonyms for the same concept.

 

Each UTF is reversible, so every UTF supports lossless round-tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round-tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the noncharacters (including U+FFFE and U+FFFF).

 

The first version of Unicode was a 16-bit encoding, but starting with Unicode 2.0 (July 1996) it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
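
As a small illustration of the code unit counts and of round-tripping, the following Python sketch uses only the standard codecs. The helper names `code_unit_counts` and `round_trips` are hypothetical, and the big-endian codec variants are chosen so that Python does not prepend a BOM to the output.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def code_unit_counts(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units (bytes)
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }


def round_trips(text: str) -> bool:
    """Check that encoding and then decoding gives back the same text."""
    return all(text.encode(enc).decode(enc) == text for enc in ENCODINGS)


# U+0041, U+00E9, U+20AC and U+1F600 cover the 1-, 2-, 3- and 4-byte UTF-8 cases.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_unit_counts(ch), round_trips(ch))
```

U+0041 needs a single code unit in every form, whereas U+1F600 needs four bytes in UTF-8, two 16-bit units in UTF-16, and one 32-bit unit in UTF-32.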

 

UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems.

 

UTF-8 is the byte-oriented encoding form of Unicode.

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
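
The surrogate mechanism can be illustrated with the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogates` below is hypothetical; the sketch is for explanation only.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```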

Byte Order Mark (BOM) FAQ

What is a BOM?

A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

Where is a BOM useful?

A BOM is useful at the beginning of files that are typed as text but whose byte order (big-endian or little-endian) is not known. It can also serve as a hint that the file is in Unicode rather than in a legacy encoding, and it acts as a signature for the specific encoding form used.

What does 'endian' mean?

Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte-oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the data stream as UTF-8.
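
The byte-reversal check can be sketched as follows for a UTF-16 stream. The function name `needs_byte_swap` and its `native_order` parameter are hypothetical; `native_order` is the byte order in which the receiving system reads 16-bit units.

```python
def needs_byte_swap(first_two_bytes: bytes, native_order: str) -> bool:
    """Decide from a leading UTF-16 BOM whether the stream must be byte-reversed.

    native_order is "big" or "little".
    """
    unit = int.from_bytes(first_two_bytes[:2], native_order)
    if unit == 0xFEFF:      # BOM read correctly: same byte order as the sender
        return False
    if unit == 0xFFFE:      # noncharacter: the BOM was read in the wrong order
        return True
    raise ValueError("stream does not start with a UTF-16 BOM")


# A big-endian sender writes FE FF; a little-endian reader sees 0xFFFE.
print(needs_byte_swap(b"\xfe\xff", "big"))
print(needs_byte_swap(b"\xfe\xff", "little"))
```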

A BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes           Encoding Form

00 00 FE FF     UTF-32, big-endian

FF FE 00 00     UTF-32, little-endian

FE FF           UTF-16, big-endian

FF FE           UTF-16, little-endian

EF BB BF        UTF-8
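
A lookup based on the table above might look like the following Python sketch. It relies on the BOM constants in the standard `codecs` module; the function name `detect_bom` is hypothetical, and this is not a description of how vpxPrint itself performs the check. The UTF-32 signatures are tested first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```python
import codecs
from typing import Optional

# Longest signatures first, so FF FE 00 00 is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding form named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom(b"hello"))
```

One caveat of this ordering: a UTF-16 little-endian stream whose first character after the BOM is U+0000 would be reported as UTF-32, which is rare in text files but worth knowing.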