If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Data Access, Manipulation & Batch Languages > Unix Shell Scripts > Formatting text file data to import into database

Closed Thread
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 01-23-04, 14:00
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
Question Formatting text file data to import into database

I have a text file and I need to do some things that would take too long to do by hand. Can I make scripts to do the following operations?

I need to add an break between any number and letter when they are directly next to each other. Example = "bob20", break between b and 2.

I also need to Move first "10 number" sets with "-" charaters bet first two sets of three numbers to end of entry, first number listed first, second listed second.

In addition, lets say hypotheticaly we have a database where there are 10 fields, each seperated by a comma, what should be entered when there is no data for that field yet? Just blank space? Example ,,,bob,,,,20,,,,

Last edited by healdem; 05-08-12 at 12:08. Reason: Photo removed at request of OP
  #2 (permalink)  
Old 01-25-04, 09:37
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Hi,
please give an example from your file.
I think the output you want, is a line like CSV format from Excel.
Tell me which part of the original line is to be converted to which array in you database file, for example.

This part I understand :
I need to add an break between any number and letter when they are directly next to each other. Example = "bob20", break between b and 2.

This part I dont understand :
I also need to Move first "10 number" sets with "-" charaters bet first two sets of three numbers to end of entry, first number listed first, second listed second.
__________________
Greetings from germany
Peter F.
  #3 (permalink)  
Old 01-25-04, 10:00
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
Smile Examples

Yes CSV would be good as its pretty universal (MS Access, MySQL, Winfax, Outlook etc....) can all use it.

This is the format I want to end up with


Format=
Company name,st address,city,state,zip,po P O Box,city,state,zip,phone1,phone2,fax1,fax2,contact 1 (name and email)contact2,contact3,contact4,contact5,contact6 ,contact7,contact8,contact9,ccontact10,notes

therefore this;
Boaen Mech. Contr. Inc.Ph. 912-233-3208Everette Boaen
1223 W. 52nd St. (31405)Fax 912-233-3458Eddie Shuman
PO Box 22054Douglas Smith
Savannah GA 31403


would end up like this;
Boaen Mech. Contr. Inc.,1223 W. 52nd St.,Savannah, GA,31405,PO Box 22054,Savannah, GA, 31403,912-233-3208,,912-233-3458 ,,Everette Boaen,Eddie Shuman,Douglas Smith,,,,,,,,,

That I could import into the many programs I have that can use this. In adition I want to import it into MySQL and post it to the company website to make it easier for employees to access that kind of data. Eventually Id like to link the address book with the accounting database and the purchase order database then you could have a complete dossier of each company that would be easy to access and easy to analyze. Hopefully, eventually, with a way to create a graph of which areas we do most business in, who we have the most orders with, what time of year we sell the most, etc....

The original file was in word and was arranged in three columns. One for the address info, one for contact numbers, and the third was for the contact names. I took the columns out, deleted the space holders (used in the hardcopy for people to pen in changes) and got it arranged like you see above.

Now I just need to do the first thing you understand (bob20 to bob then 20 on the next line), and also move the phone and fax numbers to the end. the rest I guess will have to be by hand because each entry is variable. But I thought this could cut alot of time out and Id learn something too


Since the post I investigated writing a batch file to do the job, but could find a way to do it.

Thanks for your interest, I hope you can help,
Jeff Ernest


PS; LOL sorry about that pic, didnt know it was sooo big, hehe
  #4 (permalink)  
Old 01-26-04, 17:30
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Hi,
I can not help you out by this problem, because the input are not structured in any way.
Sorry
__________________
Greetings from germany
Peter F.
  #5 (permalink)  
Old 01-26-04, 17:33
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
bob20

Is there at least a way to seperate the letter/number problem bob20=bob (next line) 20???
  #6 (permalink)  
Old 01-27-04, 13:33
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Hi,
try this : every time when the characters change from numeric to a-z or A-Z or in the other way, a new word will be created.

old_string="bob20peter30Germany4711Dortmund3333"
new_string=`echo $old_string | awk ' BEGIN { FS="@"}
{
dummy=0
word=""
char1=substr($1,1,1)
if (char1 >= "0" )
{
num=1
}
else
{
num=0
}
lang=length($1)
while ( lang != i )
{
i=i+1
if ( substr($1,i,1) >= "0" )
{
if ( num==1 )
{
word=word""substr($1,i,1)
}
else
{
print word
word=substr($1,i,1)
num=1
}
}
else
{
if ( num==0 )
{
word=word""substr($1,i,1)
}
else
{
print word
word=substr($1,i,1)
num=0
}
}
}
print word
}'`
for i in $new_string
do
echo $i
done


And this is the output :

bob
20
peter
30
Germany
4711
Dortmund
3333
__________________
Greetings from germany
Peter F.

Last edited by fla5do; 01-27-04 at 13:59.
  #7 (permalink)  
Old 01-29-04, 09:07
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
Hmmm

OK I know this is a real n00b question, but is that awk or what and how do I run it?
  #8 (permalink)  
Old 01-29-04, 09:50
Damian Ibbotson Damian Ibbotson is offline
Padawan
 
Join Date: Jun 2002
Location: UK
Posts: 525
I think the way that you have extracted your data from Word has made this exercise pretty futile. I'm not saying that it can't be done but without seeing all your data, there is little chance that we could pick up all the natties that you have introduced.

I've just had a play now, and notice that 52nd Street would become 52,nd Street.

It would be simpler if you could retain the concept of Word's columns somehow. I haven't used it in a while but if you save as txt file, won't the columns be TAB separated? That would be a lot easier to deal with.

PS I hope you are a manager and not an analyst 'cos that spec sucks! ;-)
  #9 (permalink)  
Old 01-29-04, 10:37
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
Wink Yes it sucks

Well,
Its for a company migrating from (*looks down in embarresment and mutters*) HARDCOPIES, *pauses for shock value*, so yes the whole operation sucks. This is the first in a series of changes to convert the company. The problem is that the head honcho had it setup in word without columns just manually spacing over, so it sucks big time.

He had "_____" (underscores) as placemarkers to manually write in changes post-facto. That kinda helped as I could just "replace" the underscore with a carriage return, thereby deleting the useless underscores and formating a bit better all-in-one move.

You can see the originals here> http://www.airionassociates.com/~addresses

That is a page I made after exporting the docs to html and making that direcotry page. It works ok, but its a bitch to update when I think of how easy a db would be in relation to what there is now.


So, anyway, how do I run that script above in *nix . IS it an awk command or what??

Thanks
Jeff

file contents in format:-

name,st address,city,state,zip,po box,city,state,zip,ph1,ph2,fax1,fax2,ct1 (name and email)ct2,ct3,ct4,ct5,ct6,ct7,ct8,ct9,ct10,notes

Last edited by healdem; 05-08-12 at 15:50. Reason: file attachment removed at request of OP
  #10 (permalink)  
Old 01-29-04, 16:55
Damian Ibbotson Damian Ibbotson is offline
Padawan
 
Join Date: Jun 2002
Location: UK
Posts: 525
Okay... as you indicated, the data is too variable to perform all of what you are asking in one go. I suggest you use vi for most of this (it's a really good, interactive editor if your not familiar with it).

To start you off, run this at the command line to split your fields onto individual lines:

Code:
sed 's/\([0-9]\)\([^0-9)-]\{1,\}\)/\1\
\2/g' ab1.txt > ab1.new
It will create a new file called 'ab1.new' which you can then set to work on with vi.

Damian
  #11 (permalink)  
Old 01-29-04, 19:59
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
sed code issue

I tried the command in OSX and it worked however I had some problems.

1) it added a "^m" or "\2" to the end of each line
2) added "\xa0^m" in the blank lines
3) words in contact with numbers were deleted, not sent to a new line

thus;

Erickson Assoc. Inc.
912-527-9500
Eric Erickson
#1 Erickson Drive (31405)
912-527-9510
Hogan Pullin
PO Box 23529
Clate RalstonBill Morgan
Savannah GA 31403Brad Harris
Roger Ervin


became;

Erickson Assoc. Inc.^M
912-527-9500\2
Eric Erickson^M
#1\231405)^M
912-527-9510\2
Hogan Pullin^M
PO Box 23529\2
Clate RalstonBill Morgan^M
Savannah GA 31403\2
Roger Ervin^M


Is it the version of sed? the OS? or the code?
All your help is greatly appreciated.

Thanks
Jeff
  #12 (permalink)  
Old 01-29-04, 20:36
Damian Ibbotson Damian Ibbotson is offline
Padawan
 
Join Date: Jun 2002
Location: UK
Posts: 525
Did you copy the command exactly as I gave it (it is *supposed* to be over 2 lines) ?

The \2 at the end of the lines would indicate a syntax error. The sed command I gave you uses \1 and \2 to represent the portions of the regular expression in parentheses \([0-9]\) is \1 and \([^0-9)-]\{1,\}\) is \2.

It looks like you've entered:

sed 's/\([0-9]\)\([^0-9)-]\{1,\}\)/\1\\2/g' ab1.txt > ab1.new

This would mean that \2 is now \\2. If this were the case, the special meaning of \2 would have been escaped and I guess that is why you now see the literal '\2' rather than the buffered value that it should represent.

As for the ^M characters, they look like a port from DOS. Try...

flip -u yourFile

or, if that doesn't work...

flip -ub yourFile

'flip' just removes different operating systems' control characters.

HTH
  #13 (permalink)  
Old 01-29-04, 22:53
ahktopus ahktopus is offline
Registered User
 
Join Date: Jan 2004
Posts: 7
Thumbs up sucess

I switched to bash (was using tcsh), and it all worked like a charm, thanks for saving me hours worth of editing by hand !!!!!!!!!!!!!!

Next thing I plan on doing is learning sed!!!!!! LOL

IM not too bad with VI but dont know about awk and sed and others.

Again thanks
  #14 (permalink)  
Old 05-09-12, 23:17
Pat Phelan Pat Phelan is online now
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 12,614
Moderators: See the discussion in the moderators forum before making any further changes to this thread.

-PatP
__________________
In theory, theory and practice are identical. In practice, theory and practice are unrelated.
Closed Thread

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On