RTF Import/Export and Non-ANSI Characters

Post by **Michel** » Fri Nov 11, 2005 5:49 pm

Hi Sergey,

Maybe I'm doing something wrong, but I was able to reproduce the problem with your pre-compiled RVActionTest demo 1.9.8 too, so here goes...

Summary: Exporting to an RTF file and re-importing it causes all non-ANSI characters to lose their character set and start to look like upper-ASCII ones from the Default charset.

I know my terminology is imprecise, but I hope you understand what I'm talking about. Basically αβγ become áâã.
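The effect can be reproduced outside TRichView. The following is a minimal Python sketch (not RichView code, and the helper name `read_as` is made up for this illustration): the same three bytes read as Greek, Cyrillic, or Western text depending only on which code page the reader assumes.

```python
# The three bytes a non-Unicode editor stores for "αβγ" in the Greek charset.
RAW = bytes([0xE1, 0xE2, 0xE3])


def read_as(raw: bytes, codec: str) -> str:
    """Interpret raw ANSI bytes under the given Windows code page."""
    return raw.decode(codec)


greek = read_as(RAW, "cp1253")     # GREEK_CHARSET (161)
russian = read_as(RAW, "cp1251")   # RUSSIAN_CHARSET (204)
western = read_as(RAW, "cp1252")   # ANSI_CHARSET (0)

print(greek, russian, western)
```

If the charset attached to the text is lost on import, the bytes themselves are unchanged; only their interpretation falls back to the Western code page.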

In detail:
Experiment One:
In RichViewEdit (further referred to as RV), which is not using Unicode (just playing with Charset), type or otherwise insert Greek letter "Alpha" α. Export to RTF file X.rtf or whatever.
A) Import from that file into WordPad or Word 97 -> all is well, we see the Alpha character.
B) Import from that file into RV with RV.RTFReadProperties.TextStyleMode at its default value rvrsUseClosest -> all is well, we see the Alpha character. Except of course that text styles aren't preserved, so this mode is not really acceptable for a general-purpose word-processing-like case.
C) Import from that file into RV with RV.RTFReadProperties.TextStyleMode set to rvrsAddIfNeeded (which is the mode I'd like to use to preserve styles) -> in place of the Greek Alpha we get á.
D) Perform <C> with UnicodeMode set to rvruOnlyUnicode -> we see the Alpha character. Except that I don't want to deal with any Unicode, so this mode is not acceptable to me either.

Experiment Two:
In Word 97, type the letter "Alpha". Save the document as Y.rtf or some such.
E) Import from that file into WordPad or Word 97 -> all is well, we see the Alpha character.
F) Import from that file into RV -> all is well, we see the Alpha character, and I think this is regardless of the RTFReadProperties.TextStyleMode.

I have some ideas as to what's going on, but I know absolutely nothing about the RTF format, so your guess is probably 10 times better than mine, so I won't annoy you with my superficially-empirical conclusions from the above.

What do you think?

Michel

Post by **Michel** » Sun Nov 13, 2005 2:45 pm

Sergey,

I experimented some more and ended up with a rather strange result. I now have a document in TDBRichViewEdit that when exported to RTF and then re-imported loses charset info for half of the text!

To the best of my knowledge, my TDBRichViewEdit is entirely non-Unicode. I got some Greek and Russian text into it by various means (typing some, getting some via a Symbol Picker, converting some "upper ANSI" characters to the desired charset), but it shouldn't matter, right?

When exporting, my RTFOptions are rvrtfDuplicateUnicode + rvrtfSaveEMFAsWMF + rvrtfSaveBitmapDefault (only these 3), and then I call SaveRTF(FileName, false).

When importing, I basically call:
RTFReadProperties->TextStyleMode = rvrsAddIfNeeded;
RTFReadProperties->ParaStyleMode = rvrsAddIfNeeded;
InsertRTFFromFileEd(FileName);

As I mentioned in the first post, importing that same RTF file into Word 97 produces correct results: all Greek and Russian text is where it belongs.
Opening it with RVActionTest produces the same results as when I import it in my own RV: 2 lines are imported as Greek/Russian respectively, and 2 other lines become "Westernized" (ANSI gibberish).

Here's that RTF file:

{\rtf1\ansi\ansicpg0\uc1\deff0\deflang0\deflangfe0
{\fonttbl
{\f0\fnil\fcharset1 Arial;}
{\f1\fnil\fcharset1 Times New Roman;}
{\f2\fnil\fcharset1 Arial Black;}
{\f3\fnil\fcharset1 Bookman Old Style;}
{\f4\fnil\fcharset1 Comic Sans MS;}
{\f5\fnil\fcharset204 Arial;}
{\f6\fnil\fcharset161 Arial;}
{\f7\fnil\fcharset0 Arial Black;}
{\f8\fnil\fcharset0 Bookman Old Style;}
{\f9\fnil\fcharset0 Comic Sans MS;}
}

{\colortbl;\red0\green0\blue0;}

\uc1
\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f0\fs20 a \'e1\'e2\'e3\'e4\'e5
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 a \plain \f6\lang1032\fs20\chcbpat8\cf1 \'e1\'e2\'e3\'e4\'e5
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f0\fs20 a \plain \f5\lang1049\fs20\chcbpat8\cf1 \'e1\'e2\'e3\'e4\'e5
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f0\fs20 a \'e1\'e2\'e3\'e4\'e5
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 Some \plain \f6\lang1032\fs20 \'e1\'e2\'e3\'e4\'e5\'e6\'e7\'e8\'f6\'f7\'f8\'f9\plain \f0\fs20 text.
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 Some \plain \f5\lang1049\fs20 \'c0\'c1\'c2\'c3\'c4\'c5\'c6\'c7\'c8\'ca\'cb\'cc\'cd\plain \f0\fs20 text.
\par \pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \par}

In the above, I bravely inserted carriage returns between fonts to make it more readable, and I removed all but one color from the color table.
The lines are supposed to look like this:
a бвгде
a αβγδε
a бвгде
a бвгде
Some αβγδεζηθφχψω text.
Some АБВГДЕЖЗИКЛМН text.
The 2nd and 3rd lines are imported correctly in RV. The last 2 aren't.

My observations:
A) When I added \chcbpat8\cf1 to the "Some Greek text" line (5th line), it started importing correctly.
B) When I deleted the {\colortbl...} block completely, all lines started importing as ANSI/Western into RV (but all were still importing just fine into Word 97).
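For reference, here is a rough Python sketch of how a reader can be expected to pick the code page for each `\'xx` escape: look up the current `\fN` in the font table and map its `\fcharsetN` to a code page. This is not how TRichView or Word is implemented. The function names are hypothetical, the charset table covers only the values seen in this thread, and treating `\plain` as "back to the default code page" is a simplification.

```python
import re

# Windows charset id (RTF \fcharsetN) -> Python codec. Only values from this thread.
CHARSET_TO_CODEC = {0: "cp1252", 161: "cp1253", 204: "cp1251", 238: "cp1250"}

FONT_RE = re.compile(r"\{\\f(\d+)\\fnil\\fcharset(\d+) [^;]*;\}")
# Either a \'xx hex escape, a control word with optional numeric argument
# (a single trailing space is its delimiter), or a run of plain text.
TOKEN_RE = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|([^\\{}\r\n]+)")


def parse_font_table(rtf: str) -> dict:
    """Return {font number: charset id} for entries shaped like the ones above."""
    return {int(f): int(cs) for f, cs in FONT_RE.findall(rtf)}


def decode_paragraph(body: str, fonts: dict, default_codec: str = "cp1252") -> str:
    """Decode text and \\'xx escapes, switching code page on each \\fN."""
    out = []
    codec = default_codec
    for m in TOKEN_RE.finditer(body):
        hex_byte, word, arg, text = m.groups()
        if hex_byte:
            out.append(bytes([int(hex_byte, 16)]).decode(codec, errors="replace"))
        elif word == "f" and arg is not None:
            # Unknown or default charset (fcharset1) falls back to default_codec.
            codec = CHARSET_TO_CODEC.get(fonts.get(int(arg)), default_codec)
        elif word == "plain":
            codec = default_codec  # simplification: reset to the default font
        elif text:
            out.append(text)
    return "".join(out)


fonts = parse_font_table(
    r"{\fonttbl{\f0\fnil\fcharset1 Arial;}"
    r"{\f5\fnil\fcharset204 Arial;}{\f6\fnil\fcharset161 Arial;}}"
)
greek = decode_paragraph(r"a \plain \f6\lang1032\fs20 \'e1\'e2\'e3\'e4\'e5", fonts)
russian = decode_paragraph(r"a \plain \f5\lang1049\fs20 \'e1\'e2\'e3\'e4\'e5", fonts)
fallback = decode_paragraph(r"\plain \f0\fs20 a \'e1\'e2\'e3\'e4\'e5", fonts)
print(greek, russian, fallback, sep="\n")
```

Under these assumptions the `\f6` and `\f5` runs decode as Greek and Russian, while a run in `\f0` (fcharset1, the default charset) comes out in whatever the default code page is. That is the "Westernized" result described above when the font's charset is not applied.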

Hope this helps to pin it down! Thanks,

Michel

Post by **Sergey Tkachenko** » Thu Nov 17, 2005 8:28 pm

I assume that you have the source code.

Open RVRTFProps.pas, find the line

Code:

          or (FontStyle.Charset=DEFAULT_CHARSET)) // <-- not very good solution

and delete the "or (FontStyle.Charset=DEFAULT_CHARSET)" part.

Honestly, I do not remember why I added this condition; now I think that it does more harm than good.

Post by **Michel** » Fri Nov 18, 2005 1:43 pm

Hi Sergey,

Thank you very much for a quick fix, but no, actually, I am still developing with a Trial version, sorry!

Now that I know where the problem is, I can simply set it aside until I have bought a license from you.

Actually, I came across another problem, very similar to the above, so maybe it is caused by either the very same line of code or something very similar (one would hope). This time, it's with import of Unicode text.
I have concluded that export to Unicode works properly: the resulting file is binary-identical to one produced by Win2k Notepad. Importing this file into RV produces "????????" where non-ANSI text was (I tried again with Greek and Russian just to be thorough).
As always, the RVEdit I am importing into is completely non-Unicode. The problem can also be reproduced with your pre-built RVActionTest Demo 1.9.8 but not with the RVEditDemo (the one that displays some Unicode text – Chinese and such). I am using InsertTextFromFileW(). I have also tried preceding the above call with setting TextStyleMode to rvrsAddIfNeeded or rvrsUseClosest (having observed how it affected RTF import), but I don't think it made any difference in this case.

I hope there's a way to persuade InsertTextFromFileW() to import into non-Unicode RVEdits. If not, perhaps you can suggest a workaround along the lines of "import into a hidden Unicode-enabled RV control, then convert that to non-Unicode (not sure about this step – save as non-Unicode RVF?), and finally import that into the final destination (the original RVEdit)". I hope there's an easy fix though.

Thank you once again,

Michel

Post by **Sergey Tkachenko** » Fri Nov 18, 2005 4:46 pm

The '?' character is inserted in place of characters that cannot be converted from Unicode to ANSI.

The conversion is performed using the language of the current text style (the code page is determined by its charset). If the language is unspecified (TextStyle.Charset=DEFAULT_CHARSET), then RVStyle.DefCodePage is used. By default, it is CP_ACP, which means the system default code page (it is specified in the Control Panel). The component maps each Unicode character in the source file to the most similar character available in the charset of the destination text style. All characters that cannot be mapped are inserted as '?'.

Only one code page is used when loading a text file. So, if RichViewEdit is not Unicode, you can, for example, correctly import Greek characters but not Russian, or vice versa. You can do this by assigning a style with the proper charset to RichViewEdit.CurTextStyleNo before the insertion.

The Editor1 demo has one Unicode text style, and it automatically switches to it before inserting Unicode text. This is implemented using the RVStyle.DefUnicodeStyle property. However, using this property is strongly discouraged in programs where the collection of styles may change (moreover, it is recommended to use only all-Unicode or all-ANSI style collections, to avoid confusing users).
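The one-code-page limitation can be illustrated with a short Python sketch. It is only an analogy for the conversion described above, not TRichView code. The helper name `to_ansi` is invented here, and unlike the Windows conversion, Python's `errors="replace"` does no "most similar character" mapping; it simply substitutes '?'.

```python
def to_ansi(text: str, codec: str) -> bytes:
    """Convert Unicode text to a single ANSI code page; unmappable chars become '?'."""
    return text.encode(codec, errors="replace")


mixed = "αβγ абв"  # Greek and Russian in one Unicode string

as_greek = to_ansi(mixed, "cp1253")    # Greek survives, Russian is lost
as_russian = to_ansi(mixed, "cp1251")  # Russian survives, Greek is lost

print(as_greek)
print(as_russian)
```

Whichever code page is chosen, the characters of the other script cannot be represented and collapse to question marks, which is why the charset of the current text style has to be set before the insertion.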

Post by **Michel** » Sun Nov 20, 2005 2:40 pm

Oh, OK, I see... Too bad. But thank you for a detailed explanation!
Michel

Post by **lkessler** » Sun Nov 20, 2005 9:42 pm

Sergey:

I am doing a very similar thing to what Michel is doing, and I am having a related problem:

I am using RichViewEdit, not using Unicode, and just playing with Charset. My Default Charset is Windows-1252 (Western).

I had to make the Czech characters available. I can reproduce my steps in your Actiontest, so I'll describe it through that:

1. Open Actiontest. I have an RVF file to test the Czech characters. You can get it from: http://www.lkessler.com/eechars.rvf so load that file into Actiontest.

The file contains the following: (Note you cannot just copy and paste this from this message, because it will be Unicode and will not show the problem):

The letters with diacritical marks in Czech are
Vowels: Á, á, É, é, E, e, Í, í, Ó, ó, Ú, ú, Ů, ů, Ý, and ý
Consonants: Č, č, Ď, ď, Ň, ň, Ř, ř, Š, š, Ť, ť, Ž, and ž
Character Set: Central European. Code Page 1250 (Windows-1250).
e.g.: Vavřinec Čada

If you've loaded the RVF file and look at the "Font ... (Advanced)" dialog box, you'll see the Script is "Eastern European", and all characters will display properly.

2. Now export this file to HTML and also export it to RTF. Open the HTML in Internet Explorer and open the RTF in Microsoft Word. Neither displays correctly.

3. So then I made the change that you suggested earlier:

Sergey Tkachenko wrote: Open RVRTFProps.pas, find the line
Code:
          or (FontStyle.Charset=DEFAULT_CHARSET)) // <-- not very good solution
and delete the "or (FontStyle.Charset=DEFAULT_CHARSET)" part.

Honestly, I do not remember why I added this condition; now I think that it does more harm than good.

I repeated step 2. I found that the HTML now displayed correctly in Internet Explorer, so that part is now fixed! But, the RTF still does not load correctly into Microsoft Word (I use Office 2000, if that makes a difference).

The actual RTF generated is:

{\rtf1\ansi\ansicpg0\uc1\deff0\deflang1024\deflangfe1024{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset238 Arial;}}{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;\red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green0\blue128;\red128\green0\blue0;\red128\green128\blue0;\red128\green128\blue128;\red192\green192\blue192;}
\uc0
\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 The letters with diacritical marks in Czech are
\par\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 Vowels: \'c1, \'e1, \'c9, \'e9, E, e, \'cd, \'ed, \'d3, \'f3, \'da, \'fa, \'d9, \'f9, \'dd, and \'fd
\par\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 Consonants: \'c8, \'e8, \'cf, \'ef, \'d2, \'f2, \'d8, \'f8, \'8a, \'9a, \'8d, \'9d, \'8e, and \'9e
\par\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 Character Set: Central European. Code Page 1250 (Windows-1250).
\par\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 e.g.: Vav\'f8inec \'c8ada
\par\pard\fi0\li0\ql\ri0\sb0\sa0\itap0 \plain \f1\lang1024\fs20 \par}

Q: So my question, Sergey: Do you know how to fix it so that Richview can produce RTF for this (or any) character set that can be loaded correctly by Word?

Thanks,

Louis

Post by **Michel** » Mon Nov 21, 2005 2:26 pm

I don't want to interrupt, but I'm confused on several counts:
A) I thought that the source-code fix that Sergey proposed would fix the import from RTF, and Louis is suggesting that this fixed export for him? Note that I am as yet fixless, so I can't experiment on my side.
B) I performed the following experiment:
1. Downloaded Louis's RVF file and opened it in RVActionTest 1.9.0.1. It ended with "Čada" as it should.
2. Copied-and-Pasted all text into my test app (just to know exactly how I'm exporting things). It was still "Čada". Non-Unicode, Central European charset.
3. Exported to RTF and HTML (via SaveHTML()).
4. Opened the RTF in Word 97. It is still "Čada".
5. Opened the HTML in Opera 8.5 (which I highly recommend) and in IE 6.0. Both produced the same results. Initially, it was "Èada" with code page Auto-selected (since none is explicitly specified in the HTML file), but with code page manually set to Windows-1250, I got "Čada" again.
As a final test, I went back to steps "1" and "2" and exported RTF and HTML directly from RVActionTest and then re-did the remaining steps. Same exact results.

Louis, is there something we're doing differently? And Sergey, would your fix affect what Louis is doing, i.e., export?

Thanks,

Michel

Post by **Guest** » Mon Nov 21, 2005 3:12 pm

Michel,

I think I see the difference. If you were able to simply paste the text in and it was showing as "Central European" charset, then the default character set on your version of Windows is probably Central European.

In my case, my default Windows charset is "Western".

When I modified the one line of code as Sergey suggested, it was in the routine TRVRTFReaderProperties.FindStyleNo. The name would suggest it would fix your problem.

But the name of that routine makes me wonder how one of my export problems (to HTML) got fixed. All I can think of was that maybe compiling the RVRTFProps routine (which I had never done before) caused some change in behavior.

Louis

Post by **Michel** » Mon Nov 21, 2005 3:39 pm

Hi Louis,

I think the reason I got the right charset is different. I think this is because the Copy-and-Paste operation was performed between 2 RV-based editors, hence both were using the RVF clipboard format, hence everything would work properly. My Windows charset is Western, BTW.

Remaining just as puzzled as before,

Michel

Post by **Sergey Tkachenko** » Mon Nov 21, 2005 8:28 pm

To lkessler
It's strange, because this RTF is shown correctly in my copy of MS Word.
As you can see, all the text is formatted with font 1 (the \f1 RTF keyword).
This font is defined as:
{\f1\fnil\fcharset238 Arial;}
238 is EASTEUROPE_CHARSET.
And the default Windows charset on my computer is RUSSIAN_CHARSET, not EASTEUROPE_CHARSET, but that does not prevent this RTF from displaying correctly.

As for HTML:
If you save in UTF-8 encoding, all text should be exported correctly, even if the document contains text in multiple charsets.
If you save an ANSI HTML file, only one charset is possible. TRichView uses TextStyles[0].Charset to specify the language of the HTML document. I have not checked specifically, but I think that for this document it is DEFAULT_CHARSET. In this case, TRichView does not write any information about the HTML encoding, which allows browsers to autodetect it. If they detect it wrongly, the user can always set it explicitly (in IE: the View | Encoding menu).

This correction for RTF loading should not affect the results.
It can matter in only one case: if you reload an RTF file saved by TRichView before exporting.
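The "Èada" result reported earlier in the thread is what one would expect when an ANSI HTML file with no declared encoding is read as Western. A small Python sketch of the byte-level effect (an illustration only; it does not call TRichView or a browser):

```python
name = "Vavřinec Čada"

# ANSI export: one byte per character in the Central European code page.
ansi_bytes = name.encode("cp1250")

# A browser that guesses "Western" reads the same bytes under Windows-1252.
guessed_western = ansi_bytes.decode("cp1252")

# Reading with the right code page, or exporting as UTF-8, restores the text.
read_correctly = ansi_bytes.decode("cp1250")
utf8_roundtrip = name.encode("utf-8").decode("utf-8")

print(guessed_western)
print(read_correctly)
print(utf8_roundtrip)
```

The bytes 0xC8 and 0xF8 are "Č" and "ř" in Windows-1250 but "È" and "ø" in Windows-1252, so the file itself is intact and only the assumed encoding differs.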

Post by **Sergey Tkachenko** » Mon Nov 21, 2005 8:30 pm

To Michel

Yes, this correction affects only RTF loading/pasting.
And your results of exporting this document to HTML and RTF are exactly the same as mine.

Post by **Sergey Tkachenko** » Mon Nov 21, 2005 8:34 pm

This document has the text charset explicitly defined as EASTEUROPE_CHARSET. As I explained, this charset information is retained in the RTF file and is lost in the HTML file (only TextStyles[0].Charset matters when saving non-Unicode HTML).

This charset information may also be lost, because of the bug in TRichView code corrected above, if the text was loaded or pasted as RTF.

Post by **lkessler** » Mon Nov 21, 2005 10:39 pm

Sergey and Michel,

You did help me to figure out what my problem with reading the RTF was.

Once you told me the RTF was correct, I looked for other reasons. What I found was that I had another language enabled (Hebrew) in the Microsoft Office Language Tools settings. After unchecking that language, the RTF read correctly and the Czech characters displayed correctly.

So my problems hopefully are solved.

I must say that I have learned a lot about language settings and code pages in the last few days.

Thank you for your help.

Louis