Results 1 to 4 of 4
  1. #1
    Join Date
    Dec 2006
    Posts
    3

    Arrow Unanswered: parsing XML data from CLOB

    Hello!

    My name is Sebastian and I want to read&store an XML-file content passed to a PLSQL routine via a CLOB parameter. I am using Oracle 10g.

    Code:
     PROCEDURE process_request(
          i_xml              IN   CLOB
          )
       IS
          l_parser   DBMS_XMLPARSER.parser;
          l_doc      DBMS_XMLDOM.domdocument;
       BEGIN
          l_parser            := DBMS_XMLPARSER.newparser();
          DBMS_XMLPARSER.setvalidationmode(l_parser, FALSE);
          DBMS_XMLPARSER.setpreservewhitespace(l_parser, FALSE);
          DBMS_XMLPARSER.parseclob(l_parser, i_xml);
          l_doc               := DBMS_XMLPARSER.getdocument(l_parser);
          process_inventory_file
                     (DBMS_XMLDOM.makenode(DBMS_XMLDOM.getdocumentelement(l_doc)));
        [...]
    Ok, so that's the procedure getting a CLOB parameter which contains an XML-file, which could look like this:
    Code:
    <?xml version="1.0" encoding="UTF-8"?>
    <MyFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <SomeNode>
        <Common>
          <UserName>Hans</UserName>
          <RequestName>Request_1</RequestName>      
        </Common>
        <Request>
            <Boat>
                <Type>sailing</KeyType>
                <Key>DE0005752000</Key>
                <Currency>USD</Currency>
                <Market>Market Hälämülü Ökuz</Market>
            </Boat>              
        </Request>
      </SomeNode>
    </MyFile>
    As you can see, the XML file itself is marked being encoded as "UTF-8" - I have tested and ensured it is a valid encoded UTF-8 file.

    Our database internal encoding is WE8MSWIN1252.

    Additional info: Our application uses a OpenFile dialog to select the wanted xml-file, stores it in a CLOB and then calls the PLSQL-procedure.

    And here comes the problem:

    Code:
    ORA-31011: XML parsing failed
    ORA-19202: Error occurred in XML processing
    LPX-00283: document encoding is UTF-8-based but default input encoding is not
    As far as I know, this happens because while storing the file-data as string in a CLOB, when passed to the PLSQL procedure oracle will think of it as a CLOB being encoded in database internal encoding (WE8MSWIN1252).
    Therefor, the BOM of our UTF-8 encoded file "" a.k.a. "EF BB BF" is not recognized since it is interpreted as characters from the internal db charset.

    This leads the parser to not being able to interpret the given XML file producing the error message above.

    If the XML however is encoded in "ISO-8859-1" which seems to equal the database internal encoding, the import works well.

    So this seems to be a very strict limitation of what kind of XML files depending on their encodings can be imported this way. Basically only those which have the same encoding as the database itself, right?

    Alternatives could be reading the XML directly from file or passing the XML as BLOB which is both not doable without great effort at my place.

    Could you please share all your comments, ideas and knowledge to this topic and approve or contradict where necessary?

    Is there a way of loading XML using a CLOB without the XML being corrupted?

    Best regards,

    Sebastian

  2. #2
    Join Date
    Aug 2003
    Location
    Where the Surf Meets the Turf @Del Mar, CA
    Posts
    7,776
    Provided Answers: 1
    If I were in your position, I'd create a new DB which utilized UTF-8 character set & try the same operation against it.
    If it works (& I suspect it will succeed), then you'll have at least 1 possible solution.
    You can lead some folks to knowledge, but you can not make them think.
    The average person thinks he's above average!
    For most folks, they don't know, what they don't know.
    Good judgement comes from experience. Experience comes from bad judgement.

  3. #3
    Join Date
    Dec 2006
    Posts
    3
    Doing that (creating new DB with UTF-8 character set) would only result in being able to import all UTF-8 encoded XML-files, but would not help if the files have any other encoding, right? please correct me if I am wrong here!

    Additionally, changing the encoding of the database is not even an option, since this is customer specific.

    So regardless of the encoding used by the database, and regardless of the encoding of the XML files, this import is meant to work -> is this even possible?
    Aren't there character sets which can not be converted to other charactersets without loosing information? I am thinking of character set subset relations here and those who are not subset of some charset. Or is it possible, in general to convert from whatever character set to whatever character set I want to?

  4. #4
    Join Date
    Aug 2003
    Location
    Where the Surf Meets the Turf @Del Mar, CA
    Posts
    7,776
    Provided Answers: 1
    I am NOT a "globalization" expert by any stretch of the imagination.
    I believe that UTF-8 is a "superset" that can handle any & all other "national" character sets.


    If you are stuck using WE8MSWIN1252, then any characters not supported by WE8MSWIN1252 will get mis-handled or rejected.

    IMO, either the database is properly configured to properly handle the data coming into it, or it mangles characters it does not recognize.

    I see your choices as follows:
    1) Change the database to handle all incoming characters
    2) Change the incoming data to conform only what the DB will handle
    3) Live with mangled data.
    You can lead some folks to knowledge, but you can not make them think.
    The average person thinks he's above average!
    For most folks, they don't know, what they don't know.
    Good judgement comes from experience. Experience comes from bad judgement.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •