Here's my experience so far attempting to use JDB on the Netflix Challenge dataset.
First, I load JDB, then create my folder, database, and table for Netflix data:
load '~addons/data/jdb/jdb.ijs' hf=: Open_jdb_ 'C:\Data\Netflix' hd=: Create__hf 'nfbase' ht=: Create__hd 'mcrd';0 : 0 movid int custno int rating byte rdate int )
Notice the "in-line" table definition beginning with "0 : 0" above. There are four fields: the movie ID number, the customer number, the customer rating of the movie, and the rating date.
Now, I attempt to load the Netflix data into the database but, first, a word about this data and how I currently work with it. Since there's too much data to handle all at once - about 100 million records - I've broken it into 100 numbered variables which I've written to file. Each variable is a four-row table of integers, as seen here in variable "murd0" retrieved from file:
VDIR
\Data\Netflix\VarsDir\
VDIR unfileVar_WS_ 'murd0' NB. Movie, User, Rating, Date
+-+----------------------------------------+
|1|+----------------------+---------+-----+|
| ||\Data\Netflix\VarsDir\|MURD0.DAT|murd0||
| |+----------------------+---------+-----+|
+-+----------------------------------------+
$murd0
4 982878
5{."1 murd0
1 1 1 1 1
1596531 1366860 1181550 1227322 2413320
5 4 3 4 4
20040123 20040126 20040201 20040206 20040206As you can see, this is a matrix of integers with a little less than one million columns. In the table definition, I attempted to save some space by defining the rating as a byte since it can only have the values 1-5.
I'm attempting to load this data from these 100 variables into the database in the way I usually deal with this data - using an adverb that applies a verb to specified variables on file:
NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
'dd varnm'=. y. NB. Vars dir, var names.
rc=. dd unfileVar_WS_ varnm
if. >{.rc do. rc=. 1;u. ".varnm
[4!:55 <varnm
end.
rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)I have a list "MVN" of the variable names "murd0" through "murd99", so after defining the database update verb like this:
insRecs=: 3 : 0
Insert__hd 'mcrd';<(<'012345'{~>2{y) 2}y=. <"1 y
)As you can see, I convert the rating to a character on-the-fly. So, I should be able to do this to load all the records into the database:
rc=. insRecs getVarInfo&.>(<VDIR);&.>MVN
However, after processing about 30 of these million column variables, I get this error:
|bad view: assert | 'bad view' assert 0[free fh,mh,0
Here's what the table directory looks like at this point:
dir 'C:\Data\Netflix\nfbase\mcrd\*.*'
+--+-------------------+---------+---+------+
|c0|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c1|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c2|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c3|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c4|2008 10 19 21 15 40|34191553 |rw-|-----a|
+--+-------------------+---------+---+------+
|c5|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|d0|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|d1|2008 10 19 21 9 24 |5284 |rw-|-----a|
+--+-------------------+---------+---+------+
|d2|2008 10 19 21 9 24 |1296 |rw-|-----a|
+--+-------------------+---------+---+------+
+/,>2{"1 dir 'C:\Data\Netflix\nfbase\mcrd\*.*'
854772293There are some records loaded:
Reads__hd 'from mcrd where custno=25' +-----+------+------+--------+ |movid|custno|rating|rdate | +-----+------+------+--------+ |3427 |25 |2 |20050831| |4432 |25 |5 |20050831| |4656 |25 |3 |20050831| |4996 |25 |2 |20050831| |5317 |25 |4 |20050831| |6037 |25 |4 |20050831| |6786 |25 |5 |20050831| +-----+------+------+--------+
but not all of them:
0 0$''[VDIR unfileVar_WS_ 'murd99'
5{."1 murd99
17622 17622 17622 17622 17622
816549 359127 921587 2483371 1265459
3 4 2 4 5
20011216 20011218 20011218 20011221 20011222
NB. "17622" is first movie number in last var:
Reads__hd 'from mcrd where movid=17622'
+-----+------+------+-----+
|movid|custno|rating|rdate|
+-----+------+------+-----+
+-----+------+------+-----+The following appears to be the last variable from file loaded:
0 0$''[VDIR unfileVar_WS_ 'murd37'
5{."1 murd37
6722 6722 6722 6722 6722
1539813 302850 1337655 1003230 2012582
1 4 5 5 5
20030417 20030420 20030420 20030424 20030425
$&.>Reads__hd 'from mcrd where movid=6722'
+------+------+------+------+
|5 |6 |6 |5 |
+------+------+------+------+
|1105 1|1105 1|1105 1|1105 1|
+------+------+------+------+
So, I was able to get a little over one-third of the data loaded. I suspect I'm running into system-related limits to paging or other memory space though I have a 4GB pagefile defined on this machine.
I got the following message from JDB Browser upon attempting to open "Data" tab:
Query: movid,custno,rating,rdate from mcrd |out of memory: getwhere | I.active No data
See further JDB/Netflix -- OlegKobchenko 2008-10-22 06:09:46
