Sorry jayroc2k, but I think my solution is easier and more readable than yours.
If you work with UNIX, awk is an essential tool to know.
how to count a specific letter in a text file
Assume that the script file is named 'count_char' and is executable (chmod +x count_char)
Code:
01 #!/usr/bin/awk -f
02 #Usage: count_char char=<char_to_count> input_file(s)
03 NR==1 { char =substr(char,1,1) }
04 { count += gsub(char,"_") }
05 END { print count }
01
#!/usr/bin/awk -f
If the first line of a script begins with the two characters '#!', the remainder of the line specifies an interpreter for the program. The interpreter for this script is awk (perhaps nawk or gawk on your system).
When you execute the script:
count_char args...
the system runs:
/usr/bin/awk -f count_char args...
02
#Usage: count_char char=<char_to_count> input_file(s)
The comment line shows the calling syntax for this script (if the file is not in the PATH, specify the full or relative path of the file).
For example, if you want to count the letter 'a' in the file 'article.txt', you can use any of:
count_char char="a" article.txt
./tools/count_char char=a article.txt
a_count=`count_char char=a article.txt`
The assignment 'char=a' defines and initializes the variable 'char' that will be used in the awk script.
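An assignment written among the file names is processed only when awk reaches it in the argument list, just before it opens the file that follows. That is why the script reads 'char' in an 'NR==1' rule and not in a BEGIN rule. Here is a small sketch of the difference; the function names show_cmdline and show_dash_v are made up for this example, and -v is the standard option to use when the value must already be set in BEGIN.

```shell
# Assignment given as an operand: still empty in BEGIN, set once the first record is read.
show_cmdline() {
    printf 'x\n' | awk 'BEGIN { print "BEGIN: [" char "]" }
                        NR==1 { print "NR==1: [" char "]" }' char=a -
}

# Assignment given with -v: already set in BEGIN.
show_dash_v() {
    awk -v char=a 'BEGIN { print "BEGIN: [" char "]" }'
}

show_cmdline
show_dash_v
```

The first function prints an empty value for BEGIN and 'a' for NR==1; the second prints 'a' in BEGIN.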
03
NR==1 { char =substr(char,1,1) }
An awk script consists of a series of "rules". Each rule specifies one pattern to search for, and one action to perform when that pattern is found.
Syntactically, a rule consists of a pattern followed by an action.
The action is enclosed in curly braces to separate it from the pattern.
Rules are usually separated by newlines. Therefore, an awk program looks like this:
PATTERN { ACTION }
PATTERN { ACTION }
...
If the PATTERN is omitted, the action applies to every line of the input files (see line 04).
If the ACTION is omitted, the selected record is printed.
The special pattern END specifies the action to execute when the last line of the last input file has been processed (see line 05).
Line 03 is a rule whose action is executed when the first line of the first input file is read. The variable NR is the number of input records awk has processed since the beginning of the program's execution, so 'NR==1' selects the first record.
The variable 'char' contains the letter to count.
We keep only its first character.
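To see the three kinds of rule side by side, here is a minimal sketch with made-up sample data (the function name rules_demo is only for illustration). It has a pattern without an action, an action without a pattern, and an END rule.

```shell
# Three rules in one program:
#   /b/          pattern only, so every matching record is printed
#   { n++ }      action only, so it runs for every record
#   END { ... }  runs once, after the last record
rules_demo() {
    printf 'abc\nxyz\nbob\n' | awk '
        /b/
        { n++ }
        END { print n " records" }
    '
}

rules_demo
```

The records containing 'b' are printed as they are read, and the END rule then reports how many records were seen in total.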
04
{ count += gsub(char,"_") }
There is no PATTERN specified, so the ACTION is executed for all input records.
The 'gsub(char,"_")' function call replaces every occurrence of 'char' with "_" in the input record and returns the number of substitutions made.
The number of substitutions (which is the character count in the record) is accumulated in the 'count' variable.
'count += gsub()' is the same as 'count = count + gsub()'.
The 'count' variable is initialized to zero the first time it is used.
05
END { print count }
When all input records have been processed, the number of times the letter (variable 'char') appears in the input file(s) is printed (variable 'count').
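One thing to keep in mind: gsub() treats its first argument as a regular expression, so a character such as '.' matches every character and not only a dot. If you need to count such characters literally, one possible variant (a sketch, not part of the script above) walks through the record with index() and substr(). The function names count_regex and count_literal are made up for this example, and 'count + 0' is only there so that 0 is printed when nothing was counted.

```shell
# Same rules as the script: 'char' is used as a regular expression.
count_regex() {    # usage: count_regex <char> < input
    awk 'NR==1 { char = substr(char,1,1) }
         { count += gsub(char,"_") }
         END { print count + 0 }' char="$1" -
}

# Variant: look for the character literally, so '.' only matches a real dot.
count_literal() {  # usage: count_literal <char> < input
    awk 'NR==1 { char = substr(char,1,1) }
         { s = $0
           while ((i = index(s, char)) > 0) { count++; s = substr(s, i + 1) } }
         END { print count + 0 }' char="$1" -
}

printf 'a.b.c\n' | count_regex .     # counts all 5 characters
printf 'a.b.c\n' | count_literal .   # counts the 2 dots
```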
how to count a specific word in a text file
Assume that the script file is named 'count_word' and is executable (chmod +x count_word)
Code:
01 #!/usr/bin/awk -f
02 #Usage: count_word word=<word_to_count> input_file(s)
03 NR==1 { re_word = "(^|[^[:alpha:]])" word "([^[:alpha:]]|$)" }
04 { count += gsub(re_word,"_") }
05 END { print count }
02
#Usage: count_word word=<word_to_count> input_file(s)
Comment line that specifies the script usage.
For example, if you want to count the word 'the' in the file 'article.txt', you can do:
count_word word="the" article.txt
the_count=`count_word word=the article.txt`
The assignment 'word="the"' defines and initializes the variable 'word' that will be used in the awk script.
03
NR==1 { re_word = "(^|[^[:alpha:]])" word "([^[:alpha:]]|$)" }
We assume that a word is delimited by non-alphabetic characters (alphabetic characters are A to Z, upper and lower case).
When the first record is read, we initialize the 're_word' variable, which will be used as a pattern to select words. The pattern is a regular expression:
[:alpha:] => alphabetic character
[^[:alpha:]] => non-alphabetic character ('^' means any character *except*)
^ => beginning of record
(^|[^[:alpha:]]) => beginning of record or non-alphabetic character
$ => end of record
([^[:alpha:]]|$) => non-alphabetic character or end of record
"(^|[^[:alpha:]])" word "([^[:alpha:]]|$)" => searched word delimited by non-alphabetic characters (or beginning or end of record).
04
{ count += gsub(re_word,"_") }
All the occurrences of the word are replaced by "_" and the number of words is accumulated in the 'count' variable.
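A caveat with this pattern: each match also consumes the delimiter on either side of the word, so when two occurrences are separated by a single character (as in "the the"), that delimiter is no longer available for the second match and it is not counted. A possible alternative, shown here only as a sketch, is to split each record on non-letters and compare the pieces with the word. The function names are made up for this example, and the letter class is written as A-Za-z, following the assumption above that alphabetic characters are A to Z in upper and lower case.

```shell
# Same idea as the script's rule, with the letter class written as A-Za-z.
count_word_gsub() {   # usage: count_word_gsub <word> < input
    awk 'NR==1 { re_word = "(^|[^A-Za-z])" word "([^A-Za-z]|$)" }
         { count += gsub(re_word,"_") }
         END { print count + 0 }' word="$1" -
}

# Alternative: split the record on runs of non-letters and compare each piece.
count_word_split() {  # usage: count_word_split <word> < input
    awk '{ n = split($0, w, /[^A-Za-z]+/)
           for (i = 1; i <= n; i++) if (w[i] == word) count++ }
         END { print count + 0 }' word="$1" -
}

printf 'the the cat, the\n' | count_word_gsub the
printf 'the the cat, the\n' | count_word_split the
```

On this sample line the split version finds all three occurrences, while the gsub version misses the second "the".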
how to count sentences in a text file
Assume that the script file is named 'count_sentances' and is executable (chmod +x count_sentances)
Code:
01 #!/usr/bin/awk -f
02 #Usage: count_sentances input_file(s)
03 { count += gsub(/.([.!?:;]+|$)/,"_") }
04 END { print count }
02
#Usage: count_sentances input_file(s)
Comment line that specifies the script usage.
For example, if you want to count the sentences in the file 'article.txt', you can do:
count_sentances article.txt
s_count=`count_sentances article.txt`
03
{ count += gsub(/.([.!?:;]+|$)/,"_") }
A sentence is a sequence of characters ending in . ! ? : ; or at the end of the record.
[.!?:;] => ending character
[.!?:;]+ => one or more consecutive ending characters
([.!?:;]+|$) => one or more consecutive ending characters or end of record
. => any single character
.([.!?:;]+|$) => a character followed by the end of a sentence
/.([.!?:;]+|$)/ => regular expression
If you consider that a sentence may be split over several records, you can simplify the regular expression:
/.[.!?:;]+/
The sentences are replaced by "_" and the number of sentences is accumulated in the 'count' variable, which is printed by line 04.
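To compare the two regular expressions, here is a small sketch on made-up input where one sentence continues on the next line. The function names count_sentences_eor and count_sentences_punct are hypothetical, and 'count + 0' just makes sure a number is printed.

```shell
# End of record also ends a sentence (the rule used in the script).
count_sentences_eor() {
    awk '{ count += gsub(/.([.!?:;]+|$)/,"_") } END { print count + 0 }'
}

# Only punctuation ends a sentence (the simplified expression).
count_sentences_punct() {
    awk '{ count += gsub(/.[.!?:;]+/,"_") } END { print count + 0 }'
}

# "How are you?" is split over two records here.
printf 'Hello. How are\nyou? Fine!\n' | count_sentences_eor     # prints 4
printf 'Hello. How are\nyou? Fine!\n' | count_sentences_punct   # prints 3
```

The first version counts the line break after "How are" as the end of a sentence, so it is the better choice only when every sentence ends on its own line.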
Sorry for my very bad English.
