Analysis of Phone Call Detail Records
In a very practical sense, many large files are databases. Many databases are really flat files with fixed-length fields, and such a file can be viewed as a rectangular j array via mapping. That is, you map the file directly, give the record length (the file need not contain ASCII line-end characters), and then define fields as lists of column indices and operate on them.
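As a minimal sketch of that idea, assuming a toy 24-byte image held in memory rather than a mapped file (the names raw, tbl, Name and Secs are made up for illustration and are not part of the MCI script), a flat run of bytes can be cut into fixed-length records and then addressed by column:

```j
NB. Toy stand-in for a flat file: 3 records of 8 bytes, no line-end characters.
raw  =: 'AAA00120BBB00045AAA00300'
tbl  =: _8 ]\ raw            NB. cut into rows of 8 bytes: a 3-by-8 character table

NB. Hypothetical field definitions: each field is a list of column indices.
Name =: 0 1 2
Secs =: 3 4 5 6 7

Name {"1 tbl                 NB. the Name columns of every record
, 0 ". Secs {"1 tbl          NB. the Secs columns converted to numbers: 120 45 300
```

A mapped file behaves like tbl here, except that the bytes stay on disk until they are touched.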
An example that I presented at the 2000 j user conference involved Call Detail Records (CDRs) from a then-extant long distance carrier, MCI. I was working at a small messaging company that sent a fairly large number of faxes. The file I used to demonstrate the j script contained 564,218 records, the detail for one month's MCI phone bill.
MCI provided the layout of the 404 byte CDR format. The data structure is shown in the session below:
iMg5:~/Documents/jstuff/mci jkt$ ja mci.ijs
Mapped name of 113997.001 is cdf
$cdf NB. this gives the shape of a mapped phone bill database
564218 404
NB. The following is a function that summarizes the 564218 calls
mci_summary =: 3 : 0
((5 6 0{BI, Call_date), To_pid) mci_summary y
:
key =. x {"1 y
ot =. 1 60 10000%"1~ (#/.~key),.key+//. 0". (Bill_duration, BI, Cost){"1 y
/:~(~.key),.' ',.10 10j1 10j2 10j2 ": ot,. 100*%/"1 ]_1 _2{"1 ot
)
NB. Here are the global field (column) definitions -
BI NB. a column/field known to be all blanks
20
Call_date NB. The columns containing the date of calls
120 121 122 123 124 125 126 127
To_pid NB. A 3 character name for the Product ID
211 212 213
Bill_duration NB. Call duration in seconds
163 164 165 166 167 168
Cost NB. Cost of the call in 1e_4 $ units
218 219 220 221 222 223 224 225
mci_summary cdf
08 INA          1       0.2      0.01      3.05
08 INE         97      75.5      2.39      3.16
09 ALA       3613    2881.0    265.05      9.20
09 CAN       8715   10836.1    775.84      7.16
09 CAR        126     111.6     30.08     26.95
09 EDL         19      25.4      3.61     14.22
09 HAW       1107    1462.6    114.62      7.84
09 INA      20959   20996.7    625.97      2.98
09 INE     527143  539187.3  16501.82      3.06
09 INT       1777    2103.0    463.28     22.03
09 MEX        139     142.9     27.15     19.00
09 PUE        472     650.6     58.60      9.01
09 VIR         50      51.5      4.64      9.01
NB. The summary report columns are
NB. 0 - month billed (notice almost all calls were in September)
NB. 1 - product/destination code (ALA - Alaska, CAN - Canada, etc.)
NB. 2 - number of calls
NB. 3 - total minutes aggregated by date/product
NB. 4 - cost of calls aggregated by date/product
NB. 5 - average cost (cents/minute) for calls by date/product
timex =: 6!:2 , 7!:2@]
timex 'mci_summary cdf'
1.4244 8.70376e7
The above shows that the report was generated in 1.4 seconds. This compares very favorably with just counting the lines in the file:
iMg5:~/Documents/jstuff/mci jkt$ time wc -cl 113997.001
  564218 227944072 113997.001

real    0m1.557s
user    0m0.852s
sys     0m0.363s
iMg5:~/Documents/jstuff/mci jkt$
In fact, aggregating the data and generating the report is quicker than the OS utility that merely counts the lines - I like this. The second timex figure shows that j used about 87 megabytes (not a big load on my 1.5G iMac) to process the 227,944,072 byte file.
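For readers unfamiliar with the timex verb used in the session, here is a small sketch of what it returns. The sentence being timed is an arbitrary example chosen for illustration, and the numbers it yields will differ from machine to machine:

```j
NB. 6!:2 reports the seconds needed to execute a sentence given as a string;
NB. 7!:2 reports the bytes of space needed to execute it.
timex =: 6!:2 , 7!:2@]       NB. same definition as in the session
timex '+/ i. 1000000'        NB. a 2-element result: seconds, bytes
```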
The thing is that benchmarks which read a line at a time, as in the example that started your questions, are just "not done that way in j". The real advantage is terse programs that subsume detail much more than programs in most other languages.
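The whole-array style that mci_summary relies on can be sketched on a toy table. This is only an illustration under assumed data: prod and secs are made-up values, not taken from the MCI file, and the report is far simpler than the real one.

```j
NB. Toy data (made up): one product code and one duration in seconds per call.
prod =: 4 3 $ 'INACANINACAN'
secs =: 120 45 300 15

NB. No loop over records: key (/.) groups whole columns by product at once.
calls =: #/.~ prod               NB. number of calls per distinct product
mins  =: 60 %~ prod +//. secs    NB. total minutes per distinct product

NB. Stitch the distinct keys to the formatted numbers, as mci_summary does.
(~. prod) ,. ' ' ,. 6 8j1 ": calls ,. mins
```

Each product appears once in the result, in order of first occurrence, followed by its call count and total minutes.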
