Results 1 to 11 of 11
  1. #1
    Join Date
    Jul 2012
    Posts
    8

    Unanswered: Help please.. Can't find any on google

    To keep it short...

    I have over 400 word documents on 400 different people. Each word document has the same data, i.e: name, age, etc....

    I need to collect the different data into a database. I tried the Import from text tool in access.. but because the delimiters are not standard across the different variables.. I am stuck with half filled tables..

    I was hoping to figure out a way of making access/excel extract data from the word documents by searching by prespecified words...

    Any help much appreciated.. Ta

  2. #2
    Join Date
    Mar 2009
    Posts
    5,442
    Provided Answers: 14
    If every Word document has the same structure (i.e. every piece of information is always in the same place or there is a mean such as a title, a tag, ... that can identyfy it), you can either export the documents in text format and write a parser to extract the information from the text files, or you can instanciate Word to read each document and, using the Word Object Model, you can extract the information directly from the Word document files.
    Have a nice day!

  3. #3
    Join Date
    Jul 2012
    Posts
    8

    Exclamation

    Thanks for showing interest!!

    Basically:
    I have around 300 radiology reports on 300 different patients. The data in each report is pretty standard in all, but I inherited all these reports in word documents which were not formated into fields or delimited or had coma identifiers extra..
    Just as an example:

    Name: XXXXX
    DOB: XXXXXX
    Address: XXXXXX

    LVEDD=XXXX
    LVESS=KKKK

    TECHNIQUE:
    1. YYYYYY
    2. HHHHHHh

    So what I basically want to do, is try and parse/extract the data for the different patients and insert it in a database for later processing.

    It will take me ages to input manually and I was hoping to come up with some way to automate the whole process...

    Many thanks for all the help...
    PS: am still a medical student, and am a little bit challeneged when it comes to automating processes... but all the help I get from u guyz much appreciated!!!

  4. #4
    Join Date
    Mar 2009
    Posts
    5,442
    Provided Answers: 14
    Here's the path to a possible solution.

    1. In a new Access database, I created a table: [Tbl_Data], like this:
    Code:
    Sub Create_tbl_data()
    
         Const c_SQLCreate = "CREATE TABLE tbl_Data ( " & _
                                     "[Name] TEXT(50) , " & _
                                     "[DOB] TEXT(50) , " & _
                                     "[Address] TEXT(50) , " & _
                                     "[LVEDD] TEXT(50) , " & _
                                     "[LVESS] TEXT(50) , " & _
                                     "[Tech_1] TEXT(50) , " & _
                                     "[Tech_2] TEXT(50)  );"
        CurrentDb.Execute c_SQLCreate, dbFailOnError
    
    End Sub
    2. I added a reference to MS-Word to my project (in the VBA Editor, open the 'Tools' menu and select 'References...'). For Access 2003 the reference to be added is listed as: 'Microsoft Word 11.0 Object Library'.

    3. In the same database, I created a module: 'Mod_MSWordParser' with the following code:
    Code:
    Option Compare Database
    Option Explicit
    
    Function ConvertDocToTxt(ByVal DocName As String) As String
    
        Dim appWord As Word.Application
        Dim strDocName As String
        
        If Len(Dir(DocName)) > 0 Then ' Check if the document exists.
            strDocName = Replace(DocName, ".doc", ".txt")
            Set appWord = New Word.Application
            With appWord
                .Documents.Open DocName
                .ActiveDocument.SaveAs strDocName, wdFormatText
                .ActiveDocument.Close
                .Quit
            End With
            Set appWord = Nothing
        Else
            MsgBox "The document: " & DocName & vbNewLine & "was not found.", vbInformation, "ConvertDocToTxt: Not found"
        End If
        ConvertDocToTxt = strDocName
        
    End Function
    
    Function GetNullString(ByVal Count As Long) As String
    
        Dim i As Long
        
        For i = 0 To Count
            GetNullString = GetNullString & "|"
        Next i
        
    End Function
    
    Sub ParseDocument(ByVal DocName As String)
    
        Const c_Name As Long = 0
        Const c_DOB As Long = 1
        Const c_Address As Long = 2
        Const c_LVEDD As Long = 3
        Const c_LVESS As Long = 4
        Const c_Tech_1 As Long = 5
        Const c_Tech_2 As Long = 6
        Const c_Max As Long = 6
        Const c_Tags As String = "Name:|DOB:|Address:|LVEDD=|LVESS="
        Const c_SQL As String = "INSERT INTO Tbl_Data ( Name, DOB, Address, LVEDD, LVESS, Tech_1, Tech_2) " & _
                                "VALUES ( '@0', '@1', '@2', '@3', '@4', '@5', '@6');"
        
        Dim strDocName As String
        Dim strBuffer As String
        Dim strSQL As String
        Dim varData As Variant
        Dim varTags As Variant
        Dim intHandle As Integer
        Dim booTechnique As Boolean
        Dim i As Long
        
        If Right(DocName, 4) = ".doc" Then
            strDocName = ConvertDocToTxt(DocName)
        Else
            strDocName = DocName
        End If
        If Len(Dir(DocName)) > 0 Then ' Check if the document exists.
            varTags = Split(c_Tags, "|")
            intHandle = FreeFile
            Open strDocName For Input As #intHandle
            Do Until EOF(intHandle)
                Line Input #intHandle, strBuffer
                strBuffer = Trim(strBuffer)
                If Len(strBuffer) > 0 Then
                    For i = 0 To UBound(varTags)
                    
                        ' Case when data and tag are on the same line.
                        '
    
                        If InStr(strBuffer, varTags(i)) = 1 Then
                            If IsArray(varData) = False Then varData = Split(GetNullString(c_Max), "|")
                            varData(i) = Trim(Mid(strBuffer, Len(varTags(i)) + 1))
                            booTechnique = False
                        End If
                    Next i
                    If booTechnique = False Then
                        If Left(strBuffer, 10) = "TECHNIQUE:" Then booTechnique = True
                    Else
                    
                        ' Case when data and tag are on different lines.
                        '
                        If IsArray(varData) = False Then varData = Split(GetNullString(c_Max))
                        If Left(strBuffer, 2) = "1." Then
                            varData(c_Tech_1) = Trim(Mid(strBuffer, 3))
                        ElseIf Left(strBuffer, 2) = "2." Then
                            varData(c_Tech_2) = Trim(Mid(strBuffer, 3))
                        End If
                    End If
                End If
            Loop
            Close #intHandle
            If IsArray(varData) = True Then
                strSQL = c_SQL
                For i = 0 To c_Max
                    strSQL = Replace(strSQL, "@" & CStr(i), varData(i))
                Next i
                CurrentDb.Execute strSQL, dbFailOnError
            Else
                MsgBox "The document: " & DocName & vbNewLine & "contains no data.", vbInformation, "ParseDocument: No data"
            End If
        Else
            MsgBox "The document: " & DocName & vbNewLine & "was not found.", vbInformation, "ParseDocument: File Not found"
        End If
                
    End Sub
    4. I copied your last post and pasted it in a new MS-Word document that I saved in MS-Word format (.doc) under the name: 'U:\Access\Thanks for showing interest.doc'

    5. From the immediate window of the VBA editor of Access, I typed:
    Code:
    parsedocument "U:\Access\Thanks for showing interest.doc"
    6. Here's the contents of the table Tbl_Data at the end of the procedure call:
    Code:
    Name	DOB	Address	LVEDD	LVESS	Tech_1	Tech_2
    ---------------------------------------------------------
    XXXXX	XXXXXX	XXXXXX	XXXX	KKKK	YYYYYY	HHHHHHh
    7. You can easily add additional data fields that the procedure ParseDocument will be able to handle by:

    a) Adding their tags (i.e. mean for identifying them) to the c_Tags constant.

    b) Declaring additional constants similar to c_Min, c_DOB, etc.

    c) Changing the value of c_Max.

    d) Modifying the constant c_SQL accordingly.

    e) Adding the corresponding columns to the table [Tbl_Data]


    Warnings:

    1. There is no true error handling in the solution. You'll probably want to add someting more robust than what I wrote.

    2. Opening and closing MS-Word takes quite a long time. You could use a public instance of the application instead of creating a new one for each file you want to process.

    3. The input table [Tbl_Data] should be considered as a buffer for the imported data, not their final destination.

    4. This is only a sketch, not a complete solution.
    Have a nice day!

  5. #5
    Join Date
    Jul 2012
    Posts
    8
    THANK U SO MUCH SINNDHO...

    this is so helpful..

    but I couldnt get it to work smoothly with the files I have..

    I attached a sample file, can u take a quick look at it, and give me some pointers on how to extract the data?? its doesnt need to be to access.. excel will just be fine..

    what i need is a table similar to the one u just posted in ur last post... the code should then be able to populate from the documents i already have..

    Any help much appreciated.. thank u so much.
    Attached Files Attached Files

  6. #6
    Join Date
    Mar 2009
    Posts
    5,442
    Provided Answers: 14
    As explained in my last post, if the data structure is different from what you provided in your first post, you need to adapt the declaration section of the prodecure 'ParseDocument', namely the constants. In your sample; there is a "tag" that was not mentioned early: 'NHI'. To take it into account, here's the modified header of the procedure:
    Code:
        Const c_Min As Long = 0
        Const c_DOB As Long = 1
        Const c_Address As Long = 2
        Const c_NHI As Long = 3
        Const c_LVEDD As Long = 4
        Const c_LVESS As Long = 5
        Const c_Tech_1 As Long = 6
        Const c_Tech_2 As Long = 7
        Const c_Max As Long = 7
        Const c_Tags As String = "Name:|DOB:|Address:|NHI:|LVEDD=|LVESS="
        Const c_SQL As String = "INSERT INTO Tbl_Data ( Name, DOB, Address, NHI, LVEDD, LVESS, Tech_1, Tech_2) " & _
                                "VALUES ( '@0', '@1', '@2', '@3', '@4', '@5', '@6', '@7');"
    Of course, an additional column named [NHI] must be created into the table [Tbl_Data].

    I also noticed that the converted document has Tab (Chr(9)) characters, so I added one line in the input part of the procedure to get rid of them:
    Code:
                Line Input #intHandle, strBuffer
                strBuffer = Replace(strBuffer, Chr(9), Chr(32))
                strBuffer = Trim(strBuffer)
    Have a nice day!

  7. #7
    Join Date
    Jul 2012
    Posts
    8
    THANK U SO MUCH!!! IT WORKS BRILLIANTLY U ARE A LEGEND....
    I am now editing the code to try and get all the information i need... this is amazing...

    thank u for all ur help.. ur so kind.

  8. #8
    Join Date
    Jul 2012
    Posts
    8
    Just one last quick one question ... how can i make the code handle multiple documents at the same time ??

  9. #9
    Join Date
    Mar 2009
    Posts
    5,442
    Provided Answers: 14
    Quote Originally Posted by Mayseed View Post
    Just one last quick one question ... how can i make the code handle multiple documents at the same time ??
    Let's imagine that all the documents are stored in the same folder. You can use a piece of code such as:
    Code:
    Sub ProcessFolder(ByVal FolderName As String)
    
        Dim strFileName As String
        
        If Left(FolderName, 1) <> "\" Then FolderName = FolderName & "\"
        strFileName = Dir(FolderName & "*.doc")
        Do While Len(strFileName) > 0
            strFileName = FolderName & strFileName
            ParseDocument strFileName
            strFileName = Dir
        Loop
        
    End Sub
    If the doc files are in the folder 'C:\Documents' you can then invoke this procedure with:
    Code:
    ProcessFolder "C:\Documents"
    Have a nice day!

  10. #10
    Join Date
    Jul 2012
    Posts
    8

    Smile

    Thank you so much.
    So i have edited the code.. to try and add another field LA, and look for "LA area".. the LA area value is quoted in the middle of the word document. my try was like this:
    Code:
       
        Const c_Name As Long = 0
        Const c_DOB As Long = 1
        Const c_Address As Long = 2
        Const c_EDV As Long = 3
        Const c_LAA As Long = 4
        Const c_ESV As Long = 5
        Const c_Tech_1 As Long = 6
        Const c_Tech_2 As Long = 7
        Const c_Max As Long = 7
        Const c_Tags As String = "Name:|DOB:|Address:|EDV =|LA area - |ESV ="
        Const c_SQL As String = "INSERT INTO Tbl_Data ( Name, DOB, Address, EDV, LA, ESV, Tech_1, Tech_2) " & _
                                "VALUES ( '@0', '@1', '@2', '@3', '@4', '@5', '@6', '@7');"
    and

    Code:
        
    
     Const c_SQLCreate = "CREATE TABLE tbl_Data ( " & _
                                     "[Name] TEXT(50) , " & _
                                     "[DOB] TEXT(50) , " & _
                                     "[Address] TEXT(50) , " & _
                                     "[EDV] TEXT(50) , " & _
                                     "[LA] TEXT(50) , " & _
                                     "[ESV] TEXT(50) , " & _
                                     "[Tech_1] TEXT(50) , " & _
                                     "[Tech_2] TEXT(50)  );"
        CurrentDb.Execute c_SQLCreate, dbFailOnError


    But dont seem to be able to make it work...
    The technique field fails to populate as well.. what it needs to do is get the different statements after it into different fields.. so there are usually 4 or 5 fields that refer to the technique paragraph..

    Any ideas where am going wrong??

    THANK U FOR UR HELP...

  11. #11
    Join Date
    Mar 2009
    Posts
    5,442
    Provided Answers: 14
    The case of LA/RA data is more complex. The For... Next loop commented with:
    Code:
    ' Case when data and tag are on the same line.
    handles cases when the "Tag" identifying the data is at the beginning of the processed line, which is what:
    Code:
    If InStr(strBuffer, varTags(i)) = 1 Then
    actually means, and also when the remaining part of the line (i.e. the "non-tag" part) is the data in a whole (i.e. until the next line break).

    If the "Tag" can be found anywhere in a line of text, another, more complex technique will be needed.

    In the particular cases you describe ('LA area -' and 'RA area -') it would be possible to find a rather simple solution because the whole piece of information (i.e. "Tag" + data) seems to be enclosed into parentheses which could be used as 'Begin' [(] and 'End' [)] markers. Unfortunately, this would cause problems with other cases where something between parentheses is part of the data:
    EDV = 279 cc (normal 115-198 cc)
    ESV = 166 cc (normal 30-75 cc)
    Stroke volume = 113 cc (normal 76-132 cc)
    LVEF = 40 % (normal 50-75%)
    LV mass = 149 gm (normal 108-184 gm)
    A mix of both could be tried, with these rules:
    1. A "Tag" is at the beginning of a line or after an opening parenthese [(].
    2. The piece of data identified by such a "Tag" ends at the end of a line or before a closing parenthese [)].

    In VBA, we would have something like:
    Code:
    Sub ParseDocument(ByVal DocName As String)
    
        Const c_Name As Long = 0
        Const c_DOB As Long = 1
        Const c_Address As Long = 2
        Const c_EDV As Long = 3
        Const c_LAA As Long = 4
        Const c_ESV As Long = 5
        Const c_Tech_1 As Long = 6
        Const c_Tech_2 As Long = 7
        Const c_Max As Long = 7
        Const c_Tags As String = "Name:|DOB:|Address:|EDV =|LA area - |ESV ="
        Const c_SQL As String = "INSERT INTO Tbl_Data ( Name, DOB, Address, EDV, LA, ESV, Tech_1, Tech_2) " & _
                                "VALUES ( '@0', '@1', '@2', '@3', '@4', '@5', '@6', '@7');"
    
        Dim strDocName As String
        Dim strBuffer As String
        Dim strSQL As String
        Dim varData As Variant
        Dim varTags As Variant
        Dim varLine As Variant
        Dim intHandle As Integer
        Dim booTechnique As Boolean
        Dim i As Long
        Dim j As Long
        Dim x As Long
        
        If Right(DocName, 4) = ".doc" Then
            strDocName = ConvertDocToTxt(DocName)
        Else
            strDocName = DocName
        End If
        If Len(Dir(DocName)) > 0 Then ' Check if the document exists.
            varTags = Split(c_Tags, "|")
            intHandle = FreeFile
            Open strDocName For Input As #intHandle
            Do Until EOF(intHandle)
                Line Input #intHandle, strBuffer
                strBuffer = Replace(strBuffer, Chr(9), Chr(32))
                strBuffer = Trim(strBuffer)
                If Len(strBuffer) > 0 Then
                    varLine = Split(strBuffer, ")")
                    For j = 0 To UBound(varLine)
                        strBuffer = varLine(j)
                        If Len(strBuffer) > 0 Then 
                            For i = 0 To UBound(varTags)
                            
                                ' Case when data and tag are on the same line.
                                '
            
                                x = InStr(strBuffer, varTags(i))
                                If x = 1 Then
                                    If IsArray(varData) = False Then varData = Split(GetNullString(c_Max), "|")
                                    varData(i) = Trim(Mid(strBuffer, Len(varTags(i)) + 1))
                                    booTechnique = False
                                ElseIf x > 1 Then 
                                    If Mid(strBuffer, x - 1, 1) = "(" Then
                                        varData(i) = Trim(Mid(strBuffer, x + Len(varTags(i))))
                                        booTechnique = False
                                    End If 
                                End If
                            Next i
                            If booTechnique = False Then
                                If Left(strBuffer, 10) = "TECHNIQUE:" Then booTechnique = True
                            Else
                            
                                ' Case when data and tag are on different lines.
                                '
                                If IsArray(varData) = False Then varData = Split(GetNullString(c_Max))
                                If Left(strBuffer, 2) = "1." Then
                                    varData(c_Tech_1) = Trim(Mid(strBuffer, 3))
                                ElseIf Left(strBuffer, 2) = "2." Then
                                    varData(c_Tech_2) = Trim(Mid(strBuffer, 3))
                                End If
                            End If
                        End If
                    Next j
                End If
            Loop
            Close #intHandle
            If IsArray(varData) = True Then
                strSQL = c_SQL
                For i = 0 To c_Max
                    strSQL = Replace(strSQL, "@" & CStr(i), varData(i))
                Next i
                CurrentDb.Execute strSQL, dbFailOnError
            Else
                MsgBox "The document: " & DocName & vbNewLine & "contains no data.", vbInformation, "ParseDocument: No data"
            End If
        Else
            MsgBox "The document: " & DocName & vbNewLine & "was not found.", vbInformation, "ParseDocument: File Not found"
        End If
                
    End Sub
    However, another kind of data would probably require other rules and the whole procedure would quickly become a true mess (it already is). A more viable, sustainable solution would consist in writing a true parser, but this is beyond the scope of this post.

    Note: If you need to frequently change the structure of a table (Tbl_Data here), you can use:
    Code:
        Const c_SQLCreate = "CREATE TABLE tbl_Data ( " & _
                                     "[Name] TEXT(50) , " & _
                                     "[DOB] TEXT(50) , " & _
                                     "[Address] TEXT(50) , " & _
                                     "[EDV] TEXT(50) , " & _
                                     "[LA] TEXT(50) , " & _
                                     "[ESV] TEXT(50) , " & _
                                     "[Tech_1] TEXT(50) , " & _
                                     "[Tech_2] TEXT(50)  );"
                                     
        If Exists("tbl_Data") Then CurrentDb.Execute "DROP TABLE Tbl_Data", dbFailOnError
        CurrentDb.Execute c_SQLCreate, dbFailOnError
    With:
    Code:
    Function Exists(ByVal ObjectName As String) As Boolean
    
        If DCount("*", "MSysObjects", "[name] = '" & ObjectName & "'") > 0 Then Exists = True
        
    End Function
    Have a nice day!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •