Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian), or in UTF-8 with a byte order mark (BOM), are detected correctly. For UTF-8 files without a BOM, the charset should also be detected, provided the buffer is large enough.
A 4 KB byte buffer is used to guess the encoding.
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
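The class description says that UTF-8 files without a BOM are recognized when the buffer is large enough, and that 8-bit content falls back to a default charset. The following is a minimal sketch of such a heuristic using only the JDK. `EncodingGuessSketch` and `guess` are hypothetical names, and the logic is an assumption about how this kind of detection can work, not the actual CharsetToolkit implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingGuessSketch {

    /**
     * Guesses a charset for a buffer without a BOM (hypothetical helper).
     * Pure 7-bit data gives US-ASCII (or the default if enforce8Bit is set),
     * well-formed multi-byte sequences give UTF-8, anything else the default.
     * Simplified: overlong and surrogate encodings are not rejected.
     */
    static Charset guess(byte[] buf, Charset defaultCharset, boolean enforce8Bit) {
        int multiByteSequences = 0;
        int i = 0;
        while (i < buf.length) {
            int b = buf[i] & 0xFF;
            if (b < 0x80) {            // plain 7-bit byte
                i++;
                continue;
            }
            int extra;                 // continuation bytes expected after the lead byte
            if (b >= 0xC2 && b <= 0xDF) extra = 1;
            else if (b >= 0xE0 && b <= 0xEF) extra = 2;
            else if (b >= 0xF0 && b <= 0xF4) extra = 3;
            else return defaultCharset; // not a valid UTF-8 lead byte
            if (i + extra >= buf.length) {
                // Sequence cut off by the end of the buffer: decide on what was seen so far.
                return multiByteSequences > 0 ? StandardCharsets.UTF_8 : defaultCharset;
            }
            for (int k = 1; k <= extra; k++) {
                if ((buf[i + k] & 0xC0) != 0x80) {
                    return defaultCharset; // continuation byte must look like 10xxxxxx
                }
            }
            multiByteSequences++;
            i += extra + 1;
        }
        if (multiByteSequences > 0) {
            return StandardCharsets.UTF_8;
        }
        // Only 7-bit bytes were seen.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        byte[] sample = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(sample, StandardCharsets.ISO_8859_1, false));
    }
}
```

A larger buffer makes it more likely that at least one multi-byte sequence is seen, which is why the buffer size matters for files without a BOM.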
| Constructor and description |
|---|
| CharsetToolkit(File file): Constructor of the CharsetToolkit utility class. |
| Type Params | Return Type | Name and description |
|---|---|---|
| | static Charset[] | getAvailableCharsets(): Retrieves all the Charsets available on the platform, including the default charset. |
| | Charset | getCharset() |
| | Charset | getDefaultCharset(): Retrieves the default Charset. |
| | static Charset | getDefaultSystemCharset(): Retrieves the default charset of the system. |
| | boolean | getEnforce8Bit(): Gets the enforce8Bit flag, used when a US-ASCII encoding should never be returned. |
| | BufferedReader | getReader(): Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the CharsetToolkit constructor, using the discovered charset, or the default charset if an 8-bit Charset is encountered. |
| | boolean | hasUTF16BEBom(): Has a byte order mark for UTF-16 Big Endian (utf-16 and ucs-2). |
| | boolean | hasUTF16LEBom(): Has a byte order mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
| | boolean | hasUTF8Bom(): Has a byte order mark for UTF-8 (used by Microsoft's Notepad and other editors). |
| | void | setDefaultCharset(Charset defaultCharset): Defines the default Charset used when the buffer represents an 8-bit Charset. |
| | void | setEnforce8Bit(boolean enforce): If US-ASCII is recognized, returns the default encoding rather than US-ASCII. |
CharsetToolkit(File file)
Constructor of the CharsetToolkit utility class.
Parameters: file - the file whose encoding we want to know.

getAvailableCharsets()
Retrieves all the Charsets available on the platform, including the default charset.
Returns: an array of Charsets.

getDefaultCharset()
Retrieves the default Charset.

getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns: the default Charset.

getEnforce8Bit()
Gets the enforce8Bit flag, used when a US-ASCII encoding should never be returned.
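For getAvailableCharsets() and getDefaultSystemCharset() described above, the same kind of information can be obtained from the standard JDK API. This is a minimal sketch only: `AvailableCharsetsSketch` and its method names are hypothetical, and using `Charset.availableCharsets()` and `Charset.defaultCharset()` is an assumption, not a statement of how CharsetToolkit is implemented.

```java
import java.nio.charset.Charset;
import java.util.Collection;

class AvailableCharsetsSketch {

    /** All charsets supported by the running JVM, as an array. */
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    /** The default charset of the running JVM. */
    static Charset defaultSystemCharset() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultSystemCharset());
        System.out.println("Number of available charsets: " + availableCharsets().length);
    }
}
```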
getReader()
Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the CharsetToolkit constructor, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
Returns: a BufferedReader.

hasUTF16BEBom()
Has a byte order mark for UTF-16 Big Endian (utf-16 and ucs-2).

hasUTF16LEBom()
Has a byte order mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

hasUTF8Bom()
Has a byte order mark for UTF-8 (used by Microsoft's Notepad and other editors).
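The three BOM checks above come down to comparing the first bytes of the buffer with well-known signatures: EF BB BF for UTF-8, FE FF for UTF-16 big endian, and FF FE for UTF-16 little endian. A minimal sketch follows, assuming the buffer is available as a byte array. `BomSketch` and its methods are hypothetical names, not the actual CharsetToolkit code.

```java
class BomSketch {

    /** UTF-8 signature: EF BB BF. */
    static boolean hasUtf8Bom(byte[] buf) {
        return startsWith(buf, 0xEF, 0xBB, 0xBF);
    }

    /** UTF-16 little-endian signature: FF FE. */
    static boolean hasUtf16LeBom(byte[] buf) {
        return startsWith(buf, 0xFF, 0xFE);
    }

    /** UTF-16 big-endian signature: FE FF. */
    static boolean hasUtf16BeBom(byte[] buf) {
        return startsWith(buf, 0xFE, 0xFF);
    }

    /** True if buf begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] buf, int... prefix) {
        if (buf.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((buf[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println("UTF-8 BOM present: " + hasUtf8Bom(sample));
    }
}
```

Masking each byte with `0xFF` matters because Java bytes are signed, so a raw comparison against values such as `0xEF` would never match.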
setDefaultCharset(Charset defaultCharset)
Defines the default Charset used when the buffer represents an 8-bit Charset.
Parameters: defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, returns the default encoding rather than US-ASCII.
The file may contain no special characters in the range 128-255 and still be, or later become, a file encoded with the default charset rather than US-ASCII.
Parameters: enforce - true to return the default encoding instead of US-ASCII.

Copyright © 2003-2017 The Apache Software Foundation. All rights reserved.