Unicode in TRichView

Introduction

Unicode is a universal standard that assigns a unique number (code) to every character: letters, digits, symbols, and even emojis. Thanks to this, text can be interpreted consistently across different devices and countries. Unicode includes characters from almost all written languages in the world.

UTF-8 and UTF-16 are ways of storing (encoding) Unicode characters in memory or files:

UTF-8 uses from 1 to 4 bytes per character. Single-byte UTF-8 characters are identical to ASCII, so plain English text looks the same in UTF-8 and in ANSI code pages. This encoding is very popular on the internet because it saves space for common text (such as English). UTF-8 is the string encoding used in Lazarus.

UTF-16 uses 2 bytes for the most common characters (the Basic Multilingual Plane) and 4 bytes (a surrogate pair) for all others, such as emojis. It is often used internally in programs and systems (for example, in Windows). UTF-16 is the string encoding used in Delphi 2009 and later, as well as for the WideString and UnicodeString types.
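
The size difference between the two encodings is easy to check. The following is a minimal Python sketch (not TRichView code); the helper name encoded_sizes and the sample characters are chosen only for illustration.

```python
# Byte sizes of single characters in UTF-8 and UTF-16 (little-endian, no BOM).

def encoded_sizes(ch):
    """Return (UTF-8 size, UTF-16 size) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# One character from each UTF-8 length class: 1, 2, 3 and 4 bytes.
for ch in ["A", "é", "€", "😀"]:
    utf8_size, utf16_size = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

The emoji lies outside the Basic Multilingual Plane, so it takes 4 bytes in both encodings; the Latin letter takes 1 byte in UTF-8 but 2 bytes in UTF-16.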

Unicode Text in TRichView

All strings in TRichView are Unicode strings.

Import and Export

Text Files

LoadText and LoadTextFromStream load text files in the specified encoding (ANSI or Unicode).

LoadTextW and LoadTextFromStreamW load Unicode (UTF-16) text files.

Note: you can test whether a file contains Unicode (UTF-16) text with the function

function RV_TestFileUnicode(const FileName: TRVUnicodeString): TRVUnicodeTestResult

defined in RVUni.pas.

Return values

rvutNo: the file is not Unicode (odd size);

rvutYes: the file is most likely Unicode (UTF-16): it has an even size, and either a Unicode byte-order mark at the start or #0 characters in the text (the first 500 bytes are checked);

rvutProbably: the file may contain Unicode (even size, but no other evidence);

rvutEmpty: the file is empty;

rvutError: an error occurred while opening the file.
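
The checks listed above can be sketched as follows. This is a Python illustration of the described rules only, not the code from RVUni.pas; the names classify_bytes and guess_utf16_file are hypothetical, and the results are returned as plain strings.

```python
import codecs

PROBE_SIZE = 500  # only the start of the data is inspected, as described above

def classify_bytes(data):
    """Apply the size / byte-order mark / #0 rules to raw file contents."""
    if not data:
        return "rvutEmpty"
    if len(data) % 2 != 0:
        return "rvutNo"  # UTF-16 text always has an even byte count
    head = data[:PROBE_SIZE]
    has_bom = head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    if has_bom or b"\x00" in head:
        return "rvutYes"
    return "rvutProbably"

def guess_utf16_file(file_name):
    """Classify a file on disk; any I/O failure is reported as rvutError."""
    try:
        with open(file_name, "rb") as f:
            return classify_bytes(f.read())
    except OSError:
        return "rvutError"
```

A test of this kind is a heuristic: a short ANSI file with an even size yields rvutProbably, so the caller still has to decide how to treat that result.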

You can also use the WinAPI function IsTextUnicode, which performs more advanced tests.

SaveText saves a text file in the specified encoding.

SaveTextW saves a Unicode (UTF-16) text file.

RTF (Rich Text Format) and DocX files

RTF and DocX files can contain Unicode text.

HTML

HTML saving methods treat the UTF-8 encoding specially.

In UTF-8 encoding, all characters that are not HTML control characters are saved as-is.

If the encoding is not UTF-8, Unicode characters are saved as numeric codes (&#NNNN;); no characters are lost, but the file size increases. Therefore, it is strongly recommended to save HTML in UTF-8 format.
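
The effect on file size can be illustrated with a short Python sketch. It is not TRichView's HTML export: the helper name html_bytes and the sample text are hypothetical, HTML control characters (<, >, &) are ignored, and only characters that the target encoding cannot represent are replaced here, which is an assumption about the behavior.

```python
import html

def html_bytes(s, encoding):
    """Encode s; characters the encoding cannot represent become &#NNNN; codes."""
    return s.encode(encoding, errors="xmlcharrefreplace")

text = "Цена: 10 €"

utf8 = html_bytes(text, "utf-8")         # every character is stored as-is
latin1 = html_bytes(text, "iso-8859-1")  # Cyrillic letters and the euro sign become codes

print(len(utf8), utf8)
print(len(latin1), latin1)

# Decoding the codes restores the original text, so nothing is lost.
restored = html.unescape(latin1.decode("iso-8859-1"))
print(restored == text)
```

For this sample, the UTF-8 output takes 16 bytes and the ISO-8859-1 output takes 40 bytes, because each replaced character costs 7 bytes.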

When loading HTML, TRichView uses either the encoding specified in HTML code or the encoding specified in the HTML load method parameter. The HTML text is converted from this encoding to its internal representation (that is, UTF-16).

Selection, Search and The Clipboard

GetSelTextA returns the selection as an ANSI string. Unicode text is converted according to the Style.DefCodePage property.

GetSelTextW returns the selection as a Unicode string.

Text searching methods have versions for ANSI and for Unicode strings: TRichView.SearchTextA and TRichView.SearchTextW. However, SearchTextA simply converts the string to Unicode (using Style.DefCodePage) and calls SearchTextW.

CopyTextA copies the selection as ANSI text (VCL and LCL for Windows). Unicode strings are converted according to the Style.DefCodePage property.

CopyTextW copies the selection as Unicode text.

Copy and CopyDef copy Unicode text (see rvoAutoCopyUnicodeText in Options).
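
Converting Unicode text to ANSI, as the "A" methods above do, depends on the code page and can lose characters. The Python sketch below only illustrates that general effect; the helper name to_ansi is hypothetical, and the "?" replacement character is an assumption that may differ from what TRichView produces.

```python
def to_ansi(text, code_page):
    """Convert text to single-byte ANSI bytes; unmappable characters become '?'."""
    return text.encode(code_page, errors="replace")

text = "Привет"

cyrillic = to_ansi(text, "cp1251")  # Cyrillic code page: one byte per letter
western = to_ansi(text, "cp1252")   # Western code page: no Cyrillic letters available

print(cyrillic.decode("cp1251"))  # round-trips to the original text
print(western)                    # every letter was replaced
```

This is why the code page used for the conversion should match the language of the text, and why the Unicode ("W") versions are the safer choice when they are available.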

Editing Operations

Paste pastes Unicode text from the Clipboard (if it is available).

PasteTextA pastes ANSI text (VCL and LCL for Windows); PasteTextW pastes Unicode text.

InsertTextFromFile inserts text from a file in the specified encoding.

InsertOEMTextFromFile inserts text from a file in DOS encoding (Windows).

InsertTextFromFileW inserts Unicode text (UTF-16) from a file.

InsertText and InsertStringTag insert a Unicode (UTF-16) string in Delphi/C++Builder 2009+, a Unicode (UTF-8) string in Lazarus, and an ANSI string in older versions of Delphi/C++Builder.

InsertTextA and InsertStringATag insert an ANSI string.

InsertTextW and InsertStringWTag insert a Unicode (UTF-16) string.

RVF (RichView Format)

In current versions of TRichView, all text in RVF files is saved in Unicode encoding.

In older versions of TRichView, some text could be saved in ANSI encoding. When such a file is read, this text is automatically converted to Unicode, and rvfwConvToUnicode is added to RVFWarnings.

See also...

Example of how to load UTF-8 files.