  1. #1
    Join Date
    Mar 2008

    Unanswered: Extracting Unicode characters from RTF

    Hi All,
    I have come across a difficult problem with extracting Unicode characters from RTF strings.
    A detailed description of the problem is below. If anyone could help, it would be much appreciated. I've tried to make the problem as clear as possible, but if any clarification is needed, please let me know.

    -Convert RTF2-formatted text containing foreign (Unicode) characters to plain text.

    -We are using Stephen Lebans' RTF2 control to display and edit text.
    -RTF2 fields cannot be displayed appropriately on reports, so unformatted text must be stored in the database.
    -The RTF2 parser cannot handle Unicode (our overseas clients, specifically in Romania, use Unicode characters), so the rtf2.PlainText method often returns strings containing ???
    -I have built a simple parser to convert the hex values in rtf2.RTFText to characters.
    -Given a character table, I can add functionality to generate the appropriate characters depending on the RTF character set defined in .RTFText.

    -Where can I find a character table for the Character Sets specified in .RTFText (specifically fcharset238)?
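
One possible answer, offered as a sketch rather than a confirmed solution: the \fcharsetN values are Windows character set identifiers, and each corresponds to a Windows code page (fcharset238 is the Eastern European set, Windows-1250; fcharset0 is ANSI, assumed here to mean Windows-1252). Python ships codecs for these code pages, so the character table does not have to be typed in by hand. The dictionary and the function name decode_hex_escape below are illustrative names, not part of the RTF2 control.

```python
# Assumed mapping from RTF \fcharsetN values to Python codec names.
# Only a few common character sets are listed; extend as needed.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western European assumed)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian (Cyrillic)
    238: "cp1250",  # Eastern European, used for Romanian here
}


def decode_hex_escape(hex_value, fcharset):
    """Decode the XX part of an RTF \\'XX escape using the font's charset."""
    codec = FCHARSET_TO_CODEC[fcharset]
    return bytes.fromhex(hex_value).decode(codec)


for charset in (0, 238):
    ch = decode_hex_escape("BA", charset)
    # Print the code point instead of the glyph to avoid console encoding issues.
    print(f"fcharset{charset}: U+{ord(ch):04X}")
```

The same byte decodes to a different code point under each character set, which is why the lookup has to be keyed by fcharset and not by hex value alone.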

    Technical/Testing info:
    These are the 2 relevant fonts:
    F1: {\f1\fnil\fcharset0 MS Sans Serif;}
    F2: {\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}

    *Testing in MS Word showed that the actual font (Sans Serif, Arial, etc.) made no difference to the character presented, so fcharset is most likely the issue.

    -Pressing ";" usually generates "ş" (hereafter referred to as "s").
    -However, in the VB6 code window it generates "º" (this probably isn't important).
    -Copying and pasting from/into the VB6 code window alternates between the two characters.

    -In RTF format, non-ASCII characters are partly identified by "\'XX", with XX being their hex value. E.g. the RTF string "xxx\'BAxxx" corresponds to "xxxşxxx".
    -In RTF format, non-ASCII characters are also partly identified by the specified font.

    -So the actual character displayed depends on both the hex value and the font (character set) specified in the RTF.
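
A minimal sketch of that dependency, assuming the font table entries look like the two fonts listed above and that the text fragment contains only plain text, \fN font switches, \'XX escapes and simple control words. The names font_codecs and decode_fragment are hypothetical, and the fcharset-to-code-page mapping (0 to Windows-1252, 238 to Windows-1250) is an assumption. This is not a full RTF parser: groups, \uN escapes and nested destinations are ignored.

```python
import re

# Assumed mapping from \fcharsetN to a Python codec (only the two sets seen here).
FCHARSET_TO_CODEC = {0: "cp1252", 238: "cp1250"}

# Matches a font switch, a hex escape, any other control word, or plain text.
TOKEN = re.compile(r"\\f(\d+) ?|\\'([0-9a-fA-F]{2})|\\[a-z]+-?\d* ?|([^\\{}]+)")


def font_codecs(font_table):
    """Map each font number to a codec, based on its \\fcharsetN value."""
    fonts = {}
    for num, charset in re.findall(r"\\f(\d+)[^{}]*?\\fcharset(\d+)", font_table):
        fonts[int(num)] = FCHARSET_TO_CODEC.get(int(charset), "cp1252")
    return fonts


def decode_fragment(fragment, fonts):
    """Decode \\'XX escapes using the codec of the most recently selected font."""
    codec = "cp1252"  # assumed default until a \fN switch is seen
    out = []
    for match in TOKEN.finditer(fragment):
        font, hex_value, text = match.groups()
        if font is not None:
            codec = fonts.get(int(font), codec)
        elif hex_value is not None:
            out.append(bytes.fromhex(hex_value).decode(codec))
        elif text is not None:
            out.append(text)
        # Other control words (e.g. \par) are skipped in this sketch.
    return "".join(out)


table = (r"{\fonttbl{\f1\fnil\fcharset0 MS Sans Serif;}"
         r"{\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}}")
fonts = font_codecs(table)
for font in (1, 2):
    decoded = decode_fragment("\\f%d xxx\\'BAxxx" % font, fonts)
    print("f%d: U+%04X" % (font, ord(decoded[3])))
```

Tracking the current font while scanning is the key design choice: the hex escape carries no character set information of its own, so the decoder has to remember which \fN was last selected.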

    Below is a table indicating my observations for a character. Hex Value and Font are the inputs.

    Hex Value || Font || Character Displayed || Unicode for Character Displayed
    BA || F1 || º || 00BA
    BA || F2 || ş || 015F

  2. #2
    Join Date
    Mar 2008
    I should have mentioned that testing was carried out with the input language set to Romanian.
