Results 1 to 7 of 7
  1. #1
    Join Date
    May 2012
    Posts
    8

    Unanswered: Split string with a special character

    Hello first of all sorry for my english. I have a problem, i need separate the text from a text in some fields in MySql.

    give examples of some records from my database to give a better idea of what I do ... thanks


    @TWCHack11 Hackers present - for live video stream go to http://t.co/jNDyKnb #SemWeb #AppsForScience #HackForScience #LinkedData

    @mhausenblas @DublinCore: CfP: IEEE IS Special Issue on Linked Open Gov. Data http://bit.ly/kIWxf8 #linkeddata

    @johanoomen: W3C Library Linked Data Incubator Group Final report and call for comments: http://bit.ly/mpWzz8 #linkeddata #opendata



    USER || Descriptions || URL || TAG
    --------------------------------------------------------------------------------------------------------------------------
    @TWCHack11 || Hackers present - for live video stream go to || http://t.co/jNDyKnb || #SemWeb #AppsForScience #HackForScience #LinkedData

    @mhausenblas @DublinCore || CfP: IEEE IS Special Issue on Linked Open Gov. Data || http://bit.ly/kIWxf8 || #linkeddata

    @johanoomen || W3C Library Linked Data Incubator Group Final report and call for comments: || http://bit.ly/mpWzz8 || #linkeddata #opendata

  2. #2
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Your data were too little to fully undersand what do you want.

    For example: Are there those data? If present, how to extract to the four columns?
    (different order of each element, mixed each element, some informations lacked, some special characters(like "@", "http:", or "#') not present, ...

    Note 1: Those are not complete list of examples. Just some ideas came in my mind...

    Note 2: If your date were enough normalized, please express clearly which formats were expected (and which formats were not present).


    @TWCHack11 Hackers present - for live video stream go to ttp://t.co/jNDyKn #SemWeb #AppsForScience HackForScience #LinkedData

    mhausenblas @DublinCore: CfP: IEEE IS Special Issue on Linked Open Gov. Data http://bit.ly/kIWxf8 #linkeddata

    @johanoomen: W3C Library Linked Data Incubator Group Final report and call for comments: http://bit.ly/mpWzz8 #linkeddata #opendata

    @TWCHack11 Hackers present - for live video stream go to (#SemWeb #AppsForScience HackForScience #LinkedData) http://t.co/jNDyKnb

    CfP: IEEE IS Special Issue on Linked Open Gov. Data @mhausenblas @DublinCore http://bit.ly/kIWxf8 #linkeddata

    #linkeddata #opendata was @johanoomen: W3C Library Linked Data Incubator Group Final report and call for comments: http://bit.ly/mpWzz8

    @johanoomen: W3C Library Linked Data Incubator Group Final report #linkeddata and call for comments: http://bit.ly/mpWzz8 #opendata

    http://t.co/jNDyKnb; @TWCHack11 Hackers present - for live video stream go to #SemWeb #AppsForScience HackForScience #LinkedData

    #linkeddata @mhausenblas http://bit.ly/kIWxf8

    http://bit.ly/mpWzz8 #linkeddata #opendata: @johanoomen W3C Library Linked Data Incubator Group Final report and call for comments

    so on...
    Last edited by tonkuma; 05-05-12 at 14:24.

  3. #3
    Join Date
    May 2012
    Posts
    8
    Hi, the examples I gave, it's like they are in each row

    Some considerations are:

    - USERS: always will be the words that start with @, there may be a single string or a maximum of three strings
    - URL: they are always strings that begin with "http" to the first space
    - TAGS: are the words that start with #, there may be a single string or some strings
    - DESCRIPTIONS: is the string that is between a user and a url, for example:

    @ pdjaramillo http://url.com description of this row, in this case descriptions is "description of this row"

    I need to extract all strings that start with @ as USER, all strings that begin with "http" as the URL, all strings that start with # as TAGS. The description is not very important but if I could extract it would be much better.

    The examples I gaverepresent a single field:

    - @TWCHack11 Hackers present - for live video stream go to http://t.co/jNDyKnb #SemWeb #AppsForScience #HackForScience #LinkedData

    - @mhausenblas @DublinCore: CfP: IEEE IS Special Issue on Linked Open Gov. Data http://bit.ly/kIWxf8 #linkeddata

    All rows are similar

  4. #4
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    I thought that the query might be able to make shorter than my example, if I knew regular-expressions.

    I assumed:
    (1) all strings start by "@".
    (2) tags(start with "#") are last in string, if they exist.
    (3) a string "@pdjaramillo http://url.com description of this row" is new format other than three examples in OP,
    such that descriptions are last(not between user and url).

    If those assumptions are not applicable, please show the exceptional examples.

    Example 1: (not tested)
    Code:
    SELECT string
         , REPLACE(users , ':' , '') AS user
         , REPLACE(
              REPLACE(
                 REPLACE(
                    string
                  , users , ''
                 )
               , url || ' ' , ''
              )
            , tag , ''
           ) AS descriptions
         , url
         , tag
     FROM (SELECT string
                , LEFT(
                     string
                   , LOCATE(
                        ' '
                      , string
                      , GREATEST(2 , second_at , third_at)
                     )
                  ) AS users
                , SUBSTR( string , url , LOCATE(' ' , string , url) - url ) AS url
                , CASE tag
                  WHEN 0 THEN
                       ''
                  ELSE SUBSTR(string , tag)
                  END  AS tag
            FROM (SELECT string
                       , LOCATE('@' , string , 2) AS second_at
                       , LOCATE('@' , string , LOCATE('@' , string , 2) + 1)
                                                  AS third_at
                       , LOCATE('http' , string)  AS url
                       , LOCATE('#' , string)     AS tag
                   FROM  sample_data
                 ) s
          ) s
    ;

  5. #5
    Join Date
    May 2012
    Posts
    8
    hi, thanks for the reply, assumptions 2 and 3 are correct, in some cases there are no TAGS, in other cases there are no URL, and in some other cases, there are no DESCRIPTIONS, there is no defined standard, only the USERS always begins with @.

    Each record is taken from a Tweet from Twitter, and is a database that collects many of them, I need to make a data mining of these records and why should I remove each piece of text.

  6. #6
    Join Date
    May 2012
    Posts
    8
    thanks for your ideas, helped me to get the result I wanted. I put the code if there are any suggestions, thanks again.


    Code:
    SELECT  text,
    --        LOCATE('@', text) AS USER_uno,
    --        LOCATE('@', text, LOCATE('@', text, 1)+1) AS USER_dos,
    --        LOCATE('@', text, LOCATE('@', text, LOCATE('@', text, 1)+1)+1) AS USER_tres,
            SUBSTRING(text,
                      LOCATE('@', text),
                      LOCATE(' ', text, LOCATE('@', text))-LOCATE('@', text)-1
                     ) AS USER1,
            SUBSTRING(text,
                      LOCATE('@', text, LOCATE('@', text, 1)+1),
                      LOCATE(' ', text, LOCATE('@', text, LOCATE('@', text, 1)+1))-LOCATE('@', text, LOCATE('@', text, 1)+1)-1
                     ) AS USER2,
            SUBSTRING(text,
                      LOCATE('@', text, LOCATE('@', text, LOCATE('@', text, 1)+1)+1),
                      LOCATE(' ', text, LOCATE('@', text, LOCATE('@', text, LOCATE('@', text, 1)+1)+1))-LOCATE('@', text, LOCATE('@', text, LOCATE('@', text, 1)+1)+1)-1
                     ) AS USER3,
    
    -- Funciones para retornar solo la URL
    --        LOCATE('http', text) AS URL, -- Determino el inicio de la cadena que contiene la URL
    --        LOCATE(' ', text, LOCATE('http', text))-2 AS ENDURL, -- Determino el final de la cadena que contiene la URL
            SUBSTRING(text, LOCATE('http',text), LOCATE(' ', text , LOCATE('http', text))-LOCATE('http', text)) AS URLSTRING,
    -- Funciones para retornar una cadena de tags
    --        LOCATE('#', text, 1) AS TAG_1,
    --        LOCATE('#', text, LOCATE('#', text, 1)+1) AS TAG_2,
    --        LOCATE(' ', text, LOCATE('#', text, LOCATE('#', text, 1)+1)) AS FIN_TAG,
            SUBSTRING(text,
                      LOCATE('#', text, 1),
                      (LOCATE(' ', text, LOCATE('#', text, LOCATE('#', text, 1)+1))-LOCATE('#', text, 1))) AS TAGS
    FROM tweets
    LIMIT 100
    Last edited by pdjaramillo; 05-08-12 at 13:23. Reason: update the code

  7. #7
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    SUBSTR() is a synonym for SUBSTRING().
    in "MySQL 5.6 Reference Manual"

    I like shorter/compact code, so I like to choose shoter name(e.g. SUBSTR) in synonyms.
    But, some other people may like explanatory names(e.g. SUBSTRING) than contracted ones(e.g. SUBSTR).

    However, I like explanatory names for user defined object names(e.g. table names, column names, user-defined function names, so on ...).

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •