Results 1 to 7 of 7
  1. #1
    Join Date
    Aug 2011
    Posts
    5

    Unanswered: Remove Illigal Characaters from SQL Server 2005 Database?

    I am using SQL server 2005 and one of my field in house_no as nvarchar, which contains hindi, numeric and both combination values like -
    * 1
    * 2/1
    * 31/1सी
    * 3 मैन्ट भवन
    * हट मैन्ट भवन
    * 12ूूू
    * 14ु
    * ैुौ
    Now I want to delete the all illegal charterers like ू or ैुौ and many more from string and preserving rest. I have used replace() function but it is not working on hindi characters. Please suggest me the solution. My database is big nearly about 200GB and I have to replace all such characters.

  2. #2
    Join Date
    Mar 2007
    Location
    Holmestrand, Norway
    Posts
    332
    I have no experience with this particular issue, and don't have a SQL Server at hand, so just a thought from me: Have tyou tried to search for the binary representation of those characters? According to Books Online the replace function allows both string and binary values.
    Ole Kristian Velstadbråten Bangås - Virinco - MSSQL.no - Facebook - Twitter

  3. #3
    Join Date
    Aug 2011
    Posts
    5
    I have tried to convert these illegal character to ASCII but each hindi character return 63. eg. - ascii('ॆ')=63 and also ascii('प')=63. So I can't use it. Any other suggestion.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Please describe what you mean by "illegal" characters. I'm struggling with that concept.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  5. #5
    Join Date
    Aug 2011
    Posts
    5
    Ok, In hindi language consonants are like म and combination of consonant म and vowel like ै combined to makes मै and both म and मै sounds different. Now due to data-entry mistakes few entry contains ै,ु,ौ ,ू or combination of these. A alone vowel like ै,ु,ौ ,ू are useless and I call them illegal characters. Now my aim is to delete such characters or highlight such entries. For example "मैैन्ट भवन" should be "मैन्ट भवन". If its not possible then at list I want a query which can identify such entries.

  6. #6
    Join Date
    Mar 2007
    Location
    Holmestrand, Norway
    Posts
    332
    So, in other words, Pat are right. We're strictly not speaking of illegal characters, but characters which does not make any sense in the text string. In other words, what you're looking for is some kind of string search mechanism.

    I do not know Hindi myself, but I see you're a bit familiar with the latin alphabet as well. So, let's say that you (incorrectly) would search for any vowels that precede a vowel:
    Code:
    patindex('%[aeiouy][aeiouy]%','My search string')+1
    The plus 1 to get the location of the second vowel, not the first one. Similarly, we could search for a vowel that is not preceded with a vowel:
    Code:
    patindex('%[^aeiouy][aeiouy]%','My search string')+1
    I think this is the tool you have to use, but I guess it will be difficult for me to help you with the logic here.

    I hope this was of some help for you. If course, patindex is only used to find the location, I would recommend you to use the stuff command to delete the "invalid" characters, and replace them with an empty stirng.
    Ole Kristian Velstadbråten Bangås - Virinco - MSSQL.no - Facebook - Twitter

  7. #7
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    The explaination from amit1902 helps me a lot.

    The first problem you are facing is purely technical. The Unicode definition for characters like N'ू' has no sort order and no Unicode weight assigned to that character. While Hindi is a single character set, different languages use and sort the vowels differently so there is no common way to compare and manage them using vanilla Unicode functions.

    The Replace() function can deal with Unicode, but it is finicky about how it does so. In the case of characters that have no Unicode weight, all of the Microsoft character manipulation ignores those characters (which is the correct behavior as defined by the ISO). There are a number of ways to work around this problem, but none of the workarounds that I've seen are generic enough to be useful.

    See Replace function does not work with valid unicode characters above 65500 | Microsoft Connect for more dicussion. A Transact-SQL workaround is given, but it is both ugly and relatively slow.

    To solve this kind of problem (many lahk of values to filter for a specific group of characters) efficently, I've always resorted to a customized CLR routine. The fundamental problem is that the definition of the charcters provided by Unicode is weak, so every generic solution is going to be painful.

    I wish that I had a better answer and would love to hear that someone else had one, but I think that this is the route that you must travel.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •