Analysis of Phone Call Detail Records

In a very practical sense, many large files are databases. Many databases are really flat files with fixed-length fields, and such a file can be viewed as a rectangular j array via mapping. That is, you map the file directly, give the record length (the file need not contain ASCII line-end characters), and then define fields as column indices and operate on them.
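
As a minimal sketch of the idea, the session below uses a tiny in-memory table rather than a mapped file. The data and the field names (Pid, Secs) are hypothetical and chosen only for illustration; the point is that a field is nothing more than a list of column indices applied to every record.

    NB. hypothetical toy data: three 10-byte records with no line-end characters
    raw  =: '09INE00120', '09CAN00045', '09INE00300'
    recs =: 3 10 $ raw        NB. view the flat bytes as a 3 by 10 character table
    Pid  =: 2 3 4             NB. hypothetical field: product code columns
    Secs =: 5 6 7 8 9         NB. hypothetical field: duration columns
    Pid {"1 recs              NB. select the product field from every record
INE
CAN
INE
    , 0 ". Secs {"1 recs      NB. convert the duration field to numbers
120 45 300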

An example that I presented at the 2000 j user conference involved Call Detail Records (CDRs) from MCI, a long distance carrier that existed at the time. I was working at a small messaging company that sent a fairly large number of faxes. The file I used to demonstrate the j script contained 564,218 records constituting the detail for one month's MCI phone bill.

MCI provided the layout of the 404 byte CDR format. The data structure

iMg5:~/Documents/jstuff/mci jkt$ ja mci.ijs

Mapped name of 113997.001 is  cdf

    $cdf    NB. this gives the shape of a mapped phone bill database
564218 404

NB. The following is a function that summarizes the 564218 calls

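    mci_summary =: 3 : 0
NB. monadic case: default key is two date columns, a blank column, and the product ID
((5 6 0{BI, Call_date), To_pid) mci_summary y
:
NB. dyadic case: x is the list of key columns, y is the mapped table
key =.  x {"1 y
NB. per key: call count, total seconds, total cost; then scaled to calls, minutes, dollars
ot =. 1 60 10000%"1~ (#/.~key),.key+//. 0". (Bill_duration, BI, Cost){"1 y
NB. sorted report: key, formatted totals, and cents per minute
/:~(~.key),.' ',.10 10j1 10j2 10j2 ": ot,. 100*%/"1 ]_1 _2{"1 ot
)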

NB. Here are the global field (column) definitions -

    BI             NB. a column/field known to be all blanks
20

    Call_date      NB. The columns containing the date of calls
120 121 122 123 124 125 126 127

    To_pid         NB. A 3 character name for the Product ID
211 212 213

    Bill_duration  NB. Call duration in seconds
163 164 165 166 167 168

    Cost           NB. Cost of the call in 1e_4 $ units
218 219 220 221 222 223 224 225
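
The aggregation in mci_summary rests on the key adverb /. which applies a verb to the groups of items that share a key. The small session below is only a sketch with hypothetical data (the names k and secs are made up here), showing the three pieces the function combines: the distinct keys, the count per key, and a sum per key.

    k    =: 3 3 $ 'INECANINE'   NB. hypothetical key table, one row per call
    secs =: 120 45 300          NB. hypothetical durations, one per call
    ~. k                        NB. distinct keys in order of first appearance
INE
CAN
    #/.~ k                      NB. number of calls for each distinct key
2 1
    k +//. secs                 NB. total seconds for each distinct key
420 45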

    mci_summary cdf
08 INA          1       0.2      0.01      3.05
08 INE         97      75.5      2.39      3.16
09 ALA       3613    2881.0    265.05      9.20
09 CAN       8715   10836.1    775.84      7.16
09 CAR        126     111.6     30.08     26.95
09 EDL         19      25.4      3.61     14.22
09 HAW       1107    1462.6    114.62      7.84
09 INA      20959   20996.7    625.97      2.98
09 INE     527143  539187.3  16501.82      3.06
09 INT       1777    2103.0    463.28     22.03
09 MEX        139     142.9     27.15     19.00
09 PUE        472     650.6     58.60      9.01
09 VIR         50      51.5      4.64      9.01

NB. The summary report columns are
NB. 0 - month billed (notice almost all calls were in September)
NB. 1 - product/destination code (ALA - Alaska, CAN - Canada, etc.)
NB. 2 - number of calls
NB. 3 - total minutes aggregated by month/product
NB. 4 - cost of calls (dollars) aggregated by month/product
NB. 5 - average cost (cents/minute) for calls by month/product
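
The unit conversions behind columns 3 to 5 can be checked on a small scale. This is a sketch with hypothetical totals (the name t is made up here): each row holds a call count, total seconds, and total cost in 1e_4 $ units, and the same scaling and ratio used in mci_summary turn them into minutes, dollars, and cents per minute.

    t  =: 2 3 $ 2 600 15000 1 120 5000   NB. hypothetical rows: calls, seconds, cost
    ot =: 1 60 10000 %"1~ t              NB. scale to calls, minutes, dollars
    100 * %/"1 ] _1 _2 {"1 ot            NB. dollars % minutes, as cents per minute
15 25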

    timex =: 6!:2 , 7!:2@]

    timex 'mci_summary cdf'
1.4244 8.70376e7

The timing above shows that the report was generated in 1.4 seconds (timex returns the elapsed time in seconds followed by the space used in bytes). This compares very favorably with just counting the lines in the file:

iMg5:~/Documents/jstuff/mci jkt$ time wc -cl 113997.001

real 0m1.557s
user 0m0.852s
sys 0m0.363s
iMg5:~/Documents/jstuff/mci jkt$

In fact, aggregating the data and generating the report is quicker than the OS utility that merely counts lines - I like this. You can see that j used 87 megabytes (8.70376e7 bytes, not a big load on my 1.5G iMac) to process the 227,944,072 byte file (564218 records of 404 bytes each).

The thing is that benchmarks which read a line at a time, as in the example that started your questions, are just "not done that way in j". The real advantage is terse programs that subsume far more detail than programs in most other programming languages.



JoeyTuttle/PhoneCallRecords (last edited 2010-02-14 18:20:29 by DevonMcCormick)