Package org.eclipse.swt.internal
Class Converter
java.lang.Object
org.eclipse.swt.internal.Converter
About this class:
#################
This class implements the conversions between Unicode characters
and the platform-supported representation for characters.
Note that Unicode characters that cannot be found in the platform
encoding will be converted to an arbitrary platform-specific character.
This class is tested via: org.eclipse.swt.tests.gtk.Test_GtkTextEncoding
About JNI & string conversion:
#############################
- Regular JNI String conversion usually uses a modified UTF-8, see: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
- In JNI, (*env)->GetStringUTFChars(..) is normally used to convert a Java String into a C string.
See: http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
Modified UTF-8 works well with C system functions because it contains no embedded nulls
and is null terminated.
However, modified UTF-8 only uses sequences of up to 3 bytes (and not up to 4 as in regular UTF-8). Characters
that require 4 bytes in regular UTF-8 (e.g. emojis) are written as two 3-byte surrogates instead,
so they are not translated properly from Java to C.
To work around this issue, we manually convert the Java string to a byte array on the Java side and then pass it to C.
See: http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
Note:
Java uses UTF-16 Wide characters internally to represent a string.
C uses UTF-8 Multibyte characters (null terminated) to represent a string.
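The 3 vs 4 byte difference described above can be observed in plain Java. The following is a minimal sketch, not code from this class: the class and method names are hypothetical, and it assumes that DataOutputStream.writeUTF (which is documented to emit modified UTF-8 behind a 2-byte length prefix) is an acceptable stand-in for what JNI produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of SWT.
class ModifiedUtf8Demo {

    /** Standard UTF-8 bytes, produced on the Java side. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 bytes via DataOutputStream.writeUTF, with the 2-byte length prefix stripped. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = out.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair in UTF-16
        System.out.println(standardUtf8(emoji).length); // 4: one 4-byte sequence
        System.out.println(modifiedUtf8(emoji).length); // 6: two 3-byte surrogates
    }
}
```

A C function expecting standard UTF-8 would see the 6-byte form as invalid, which is why the byte array is built on the Java side.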
About encoding on Linux/Gtk & its relevance to SWT:
####################################################
UTF-8 and UTF-16 = variable length encodings (UTF-32 is fixed length).
UTF-8 = minimum is 8 bits. The original design allowed up to 6 bytes, but standard UTF-8 (RFC 3629) uses at most 4 bytes. Gtk & most of the web use this.
UTF-16 = minimum is 16 bits, maximum is 32 bits (a surrogate pair). Java strings are stored this way.
UTF-16 can be
Big Endian : 65 = 00000000 01000001 # Human friendly, reads left to right.
Little Endian : 65 = 01000001 00000000 # Intel x86 and also AMD64 / x86-64 series of processors use the little-endian [1]
# i.e, we in SWT often have to deal with UTF-16 LE
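The byte orders above can be checked directly from Java. This is a small sketch for illustration only: the class name and the hex helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of SWT.
class EndiannessDemo {

    /** Formats bytes as space-separated upper-case hex, e.g. "00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Java's plain UTF_16 encoder writes big-endian with a byte order mark in front.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```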
Some terminology:
- "Code point" is the numerical value of a Unicode character.
- All of UTF-* have the same letter to code-point mapping,
but UTF-8/16/32 have different "back-ends" (byte representations).
Illustration:
(char) = (code point) = (back end).
A = 65 = 01000001 UTF-8
= 00000000 01000001 UTF-16 BE
= 01000001 00000000 UTF-16 LE
- Byte Order Marks (BOM) are a few bytes at the start of a *file* indicating which endianness is used [2].
Problem: Gtk/webkit often don't give us BOMs.
(further reading: [3])
- We can reliably encode characters to a backend (A -> UTF-8/16), but the other way round is
guesswork, since byte order marks are often missing and UTF-16 bytes are often also valid UTF-8.
(see Converter.byteToStringViaHeuristic for details).
We could improve our heuristic by using something like http://jchardet.sourceforge.net/.
- Glib has some conversion functions:
g_utf16_to_utf8
g_utf8_to_utf16
- So does Java (e.g. null-terminated UTF-8):
("myString" + '\0').getBytes(StandardCharsets.UTF_8)
- I suggest using Java functions where possible to avoid memory leaks.
(Yes, they happen and are a big pain in the ass to find: https://bugs.eclipse.org/bugs/show_bug.cgi?id=533995)
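To see why decoding is guesswork, note that the same bytes can decode without error under more than one charset. The sketch below is illustrative only (the class and method names are hypothetical): it encodes "Hi" as UTF-16 LE and then reads the bytes back both as UTF-16 LE and as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of SWT.
class AmbiguousBytesDemo {

    /** The bytes 48 00 69 00, i.e. "Hi" encoded as UTF-16 LE without a BOM. */
    static byte[] sampleBytes() {
        return "Hi".getBytes(StandardCharsets.UTF_16LE);
    }

    static String readAsUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    static String readAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleBytes();
        System.out.println(readAsUtf16Le(raw).length()); // 2: "Hi"
        // Every byte here is below 0x80, so the data is also valid UTF-8,
        // but it decodes to 4 characters: 'H', NUL, 'i', NUL.
        System.out.println(readAsUtf8(raw).length());    // 4
    }
}
```

Without a BOM or out-of-band information, nothing in the bytes themselves says which reading was intended.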
Learning about encoding:
#########################
I suggest the following 3 videos to understand ASCII/UTF-8/UTF-16[LE|BE]/UTF-32 encoding:
Overview: https://www.youtube.com/watch?v=MijmeoH9LT4
Details:
Part-1: https://www.youtube.com/watch?v=B1Sf1IhA0j4
Part-2: https://www.youtube.com/watch?v=-oYfv794R9s
Part-3: https://www.youtube.com/watch?v=vLBtrd9Ar28
Also read all of this:
http://kunststube.net/encoding/
and this:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
And lastly, a good UTF-8 reference: https://en.wikipedia.org/wiki/UTF-8#Description
You should now be a master of encoding. I wish you luck on your journey.
[1] https://en.wikipedia.org/wiki/Endianness
[2] https://en.wikipedia.org/wiki/Byte_order_mark
[3] BOM's: http://unicode.org/faq/utf_bom.html#BOM
-
Field Summary
Fields
Modifier and Type       Field             Description
static final byte[]     EmptyByteArray
static final char[]     EmptyCharArray
static final byte[]     NullByteArray
-
Constructor Summary
Constructors
Constructor     Description
Converter()
-
Method Summary
Modifier and Type    Method
    Description
static String    byteToStringViaHeuristic(byte[] bytes)
    Given a byte array with unknown encoding, try to decode it via (relatively simple) heuristic.
static String    cCharPtrToJavaString(int cCharPtr, boolean freecCharPtr)
    This method takes a 'C' pointer (char *) or (gchar *), reads characters up to the terminating symbol '\0' and converts it into a Java String.
static byte[]    javaStringToCString(String string)
    Given a Java String, convert it to a regular null-terminated C string, to be used when calling a native C function.
static char[]    mbcsToWcs(byte[] buffer)
    Convert a "C" multibyte UTF-8 string byte array into a Java UTF-16 wide character array.
static char    mbcsToWcs(char ch)
    Convert a C UTF-8 multibyte character into a Java UTF-16 wide character.
static char    wcsToMbcs(char ch)
    Convert a Java UTF-16 wide character into a single C UTF-8 multibyte character that you can pass to a native function.
static byte[]    wcsToMbcs(char[] chars, boolean terminate)
    Convert a Java UTF-16 wide character array into a C UTF-8 multibyte byte array.
static byte[]    wcsToMbcs(String string, boolean terminate)
    Convert a Java UTF-16 wide character string into a C UTF-8 multibyte byte array.
-
Field Details
-
NullByteArray
public static final byte[] NullByteArray
-
EmptyByteArray
public static final byte[] EmptyByteArray
-
EmptyCharArray
public static final char[] EmptyCharArray
-
-
Constructor Details
-
Converter
public Converter()
-
-
Method Details
-
mbcsToWcs
public static char[] mbcsToWcs(byte[] buffer)
Convert a "C" multibyte UTF-8 string byte array into a Java UTF-16 wide character array.
Parameters:
buffer - byte buffer with C bytes representing a string.
Returns:
char array representing the string. Usually used for String construction like: new String(mbcsToWcs(..))
-
wcsToMbcs
public static byte[] wcsToMbcs(String string, boolean terminate)
Convert a Java UTF-16 wide character string into a C UTF-8 multibyte byte array. This algorithm stops when it finds the first NULL character, i.e. if your Java String has embedded NULL characters, then the returned string will only go up to the first NULL character.
Parameters:
string - a regular Java String
terminate - if true the byte buffer should be terminated with a null character.
Returns:
byte array that can be passed to a native function.
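The documented contract (stop at the first NULL, optionally append a terminator) can be sketched in plain Java as follows. This is an assumption-laden illustration, not the SWT implementation: the class and the helper toUtf8UpToFirstNull are hypothetical names, and the real method may differ in details.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the SWT implementation of wcsToMbcs.
class NullTruncationSketch {

    /**
     * Encodes the part of the string before the first embedded NULL as
     * standard UTF-8, optionally followed by a single terminating 0 byte.
     */
    static byte[] toUtf8UpToFirstNull(String string, boolean terminate) {
        int end = string.indexOf('\0');
        String head = end < 0 ? string : string.substring(0, end);
        byte[] utf8 = head.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads the extra slot with 0, which serves as the terminator.
        return terminate ? Arrays.copyOf(utf8, utf8.length + 1) : utf8;
    }

    public static void main(String[] args) {
        // "cd" is dropped because it follows the embedded NULL.
        System.out.println(Arrays.toString(toUtf8UpToFirstNull("ab\0cd", true)));  // [97, 98, 0]
        System.out.println(Arrays.toString(toUtf8UpToFirstNull("ab\0cd", false))); // [97, 98]
    }
}
```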
-
javaStringToCString
public static byte[] javaStringToCString(String string)
Given a Java String, convert it to a regular null-terminated C string, to be used when calling a native C function.
Parameters:
string - A Java string.
Returns:
a pointer to a C String. In C, this would be a 'char *'
-
cCharPtrToJavaString
public static String cCharPtrToJavaString(int cCharPtr, boolean freecCharPtr)
This method takes a 'C' pointer (char *) or (gchar *), reads characters up to the terminating symbol '\0' and converts it into a Java String. Note: In SWT we don't use JNI's native String functions because of the 3 vs 4 byte issue explained in the class description. Instead we pass a character pointer from C to Java and convert it to a String in Java manually.
Parameters:
cCharPtr - A char * or a gchar *, which is freed afterwards if freecCharPtr is true.
freecCharPtr - "true" means free up memory pointed to by cCharPtr. CAREFUL! If this string is part of a struct (e.g. GError), and a specialized free function (like g_error_free(..)) is called on the whole struct, then you should not free up individual struct members with this function, as otherwise you can get unpredictable behavior.
Returns:
a Java String object.
-
wcsToMbcs
public static byte[] wcsToMbcs(char[] chars, boolean terminate)
Convert a Java UTF-16 wide character array into a C UTF-8 multibyte byte array. This algorithm stops when it finds the first NULL character, i.e. if your character array has embedded NULL characters, then the returned string will only go up to the first NULL character.
Parameters:
chars - a Java UTF-16 wide character array
terminate - if true the byte buffer should be terminated with a null character.
Returns:
byte array that can be passed to a native function.
-
wcsToMbcs
public static char wcsToMbcs(char ch)
Convert a Java UTF-16 wide character into a single C UTF-8 multibyte character that you can pass to a native function.
Parameters:
ch - Java UTF-16 wide character.
Returns:
C UTF-8 multibyte character.
-
mbcsToWcs
public static char mbcsToWcs(char ch)
Convert a C UTF-8 multibyte character into a Java UTF-16 wide character.
Parameters:
ch - C multibyte UTF-8 character
Returns:
Java UTF-16 wide character
-
byteToStringViaHeuristic
public static String byteToStringViaHeuristic(byte[] bytes)
Given a byte array with unknown encoding, try to decode it via (relatively simple) heuristic. This is useful when we're not provided the encoding by the OS/library.
The current implementation only supports standard Java charsets but can be extended as needed. This method could be improved by using http://jchardet.sourceforge.net/
Run time is O(a * n), where a is a constant that varies depending on the size of input n (roughly 1-20).
Parameters:
bytes - raw bits from the OS.
Returns:
String based on the most pop
-
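One simple way such a heuristic could work is to try a list of charsets with a strict decoder and keep the first one that accepts the bytes. The sketch below shows that idea only. It is not the algorithm used by byteToStringViaHeuristic, and the class name, method names and candidate order are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the SWT heuristic.
class DecodeHeuristicSketch {

    // Assumed candidate order. ISO-8859-1 accepts any byte sequence, so it acts as the fallback.
    static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.ISO_8859_1
    };

    /** Decodes strictly, returning null instead of substituting replacement characters. */
    static String decodeStrictOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Returns the result of the first candidate charset that decodes without error. */
    static String guessDecode(byte[] bytes) {
        for (Charset charset : CANDIDATES) {
            String decoded = decodeStrictOrNull(bytes, charset);
            if (decoded != null) {
                return decoded;
            }
        }
        // Not expected to be reached, because ISO-8859-1 maps every byte.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(guessDecode(utf8).equals("h\u00E9llo")); // true: valid UTF-8 wins
        byte[] latin1 = { (byte) 0xE9 }; // a lone 0xE9 is malformed UTF-8 and too short for UTF-16
        System.out.println(guessDecode(latin1).equals("\u00E9"));   // true: falls back to ISO-8859-1
    }
}
```

A first-match strategy like this will misread UTF-16 text that happens to be valid UTF-8, which is the ambiguity the class description warns about.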