Differences between revisions 4 and 5
 ⇤ ← Revision 4 as of 2006-11-21 18:45:21 → Size: 7053 Editor: BrianSchott Comment: ← Revision 5 as of 2008-12-08 10:45:53 → ⇥ Size: 7053 Editor: anonymous Comment: converted to 1.6 markup No differences found!

Statistical cluster analysis can be done on small data sets using a hierarchical approach which starts with all observations as single clusters, and combining two "close" clusters iteratively at each step. The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. Only one (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. Nonhierarchical cluster analysis can handle large data sets, but I have not much experience with such. Please let me know if you have any recommendations for this material.

```NB. clusterbasic.ijs
NB. 12/17/5
NB. revising clusoverall.ijs
NB. revising cluster.ijs
NB. also starting from scratch

NB. usage t =. <method> cluster <data>
NB. eg:   t =.     WM   cluster sharma
NB. then: t              show sharma    NB. shows cluster history
NB.   or:  sharma show~3{."1]_3{t       NB. shows history for 3 clusters
NB.  and:  R2                           NB. show Rsquared history
NB.  and:  nTies                        NB. show ties     history
NB.  and:  r                            NB. show cluster pointer history

NB. methods available:
NB. WM:Ward's
NB. CM:Centroid

show =: [ {&.> [: < ]    NB. to see cluster results starting at maxclus
maxclus =: 10
NB. sets the final number of clusters for which detailed statistics
NB. can be sent to an output file or printed

init y
method =. x
dists =: dists__method currclus
while. <:#currclus
do.
nTies =: nTies ,<:#ndx =. (locmins # i.@#) lfm dists
curr =: >{:tmp =. ,tmp0 =. ndx&{"1 lfm"2 (<"0 ,: cp) dists
if. 1&<#ndx       NB. optionally deal with ties
do.
dists2 =. (>{:tmp0) dist__method currclus
ndx2 =. tmp0 FAR +/"1 dists2
curr =: >{:tmp =. ndx2{"1 tmp0
end.
cluskey =: ;(>{: tmp) combine <"0 i. #clusptrs
clusptrs =: cluskey <@;/. pclusptrs =. clusptrs
currclus =:  cluskey <@;/.  pcurrclus =: currclus
if. maxclus>:#currclus
do.
R =. Rs >each  currclus
R2 =: R2,-.R
r =: r, clusptrs
end.
dists =: curr distsch curr  dist__method pcurrclus
end.
r
)

clusptrs =: <"0 i. N =: #obs =. ;/y
clusno =:  N # N
pts =:  ({:"1 y) </. obs
currclus =:  clusptrs </. obs
nTies =: i. 0
SS =: +/diag SSCP&:> obs
R2 =: i. 0
NB. allr_cluster_ =: i. 10 0
r =: 0 0\$a:
)

t =. x
new =. 0 t} y
fix =. (_1{ t)} NB. this _1 must coordinate with combine
dists =. new&fix &.|: new&fix dists
drop =. (<<<0{ t)&{ NB. this 0 must coordinate with combine
drop"1 drop dists
)

length =: sum &. (*:"_)
Edstnc =: length@:-
Edstnc =: ss@:- NB. more efficient
nrmlz =: ] % length
cells =: +/~ @:(#&>) NB. table of cluster sizes

NB. routines for strictly lower triangular matrices to lists and back

expand =: #^:_1
cslt =: &.(lfm :. mfl) NB. create slt from full matrix
slt =: >/~@i. NB. strictly lower triangular matrix
lfm =: slt@# #&, ]  NB. list from matrix
fsl =: <:@fms { los@>: NB. find string length needed
fms =: ndxle >:@i. 1: NB. find matrix size
los =: +/\@ i. NB. list of sizes
ndxle =: (<:los)~  NB. index <:  list of sizes
mfl =: (],])@fms@# \$,@slt@fms@#expand fsl@#{.]  NB. slt from list

NB. routines for describing cluster homogeneity, etc.

indices =: ]each cslt@(CP&:(i.@#)/~)
spr =: (-`+)/@:(+/@ diag @SSCP&>)each cslt@(ap&:>each/~)
ns =:(,&#)each/~
ap =: <@,,;
diag =: (<0 1)&|:

NB. cluster distances

centroid =: ctr each@(> each)

NB. ways to deal with ties
FAR =: locmax@]
FAR =: locclose
FAR =: sas

min =: <.
max =: >.
locmins =: = min/
v_locmins =: ]<: >:@%&100@[ * min/@]
locmin =: i. min/
locmax =: i. max/
min_numb =: ({.,#)@(#~ locmins)
mins =: &v_locmins

t =. x( >@{:@[{<./&>@])clusptrs NB. min ID for each potential cluster record
tt =. {.@:/:@:(/:~"1) t NB. index of smallest of sorted of each cluster's 2 IDs
)

t =. x
locmin y
)

incrQ =: 2&(</\)&(,&0)
frst0 =: i.&0

transpose =: |:
sel =:[: ; [ {.&:>each ([: - [) {each ]

NB. measuring changes in clustering coefficients

dif =: 2: -~/\]
pd =: dif % }:

NB. measuring sums of squares

ss =: sum @: *:
sum =: +/
sqrt =: %:
n =: #
avg =: sum%n
ctr  =: avg"2 NB. centroid
mnc =: ] -"1 ctr
mp =: sum . *
sscp =: transpose mp ]
transpose =: |:
SSCP =: sscp@mnc
df =: <: @: n
df =: +/@:(#&>)-1:  NB. added 2/4/2
dfw =: +/@:>@:(#each) - #
dfw =: +/@:(#&>) - #  NB. added 2/4/2
SSCPw =: sum@:>@:(SSCP each)
SSw =: SSCPw%dfw
SSt =: SSCPt%dfw   NB. added 2/4/2
SSb =: SSCPb%dfw   NB. added 2/4/2
SSCPt =: SSCP@;@,.
SSCPb =: SSCPt - SSCPw
Rs =: sum@:((<0 1)&|:@SSCPb)%sum@:((<0 1)&|:@SSCPt)
det =: -/ . *

CP =: {@(,&<)  NB. Catalog of dyadic values
cp =: CP&(i.@#)~  NB. Catalog of a array's indices
freqcount=: (\: {:"1)@(~. ,. #/.~)

combine =: (nit <@:# foc)`({:@[)`]}
foc =: {.@[{.@>@{] NB. first atom of object to be copied
nit =: ({:@[#@>@{]) NB. number of atoms in target object

stddev =: ss@mnc sqrt @ % df
variance =: ss@mnc % df
std =: mnc %"1 stddev
S =: SSCP % df
R =: SSCP@std % df

sharma =: transpose 2 6\$5 6 15 16 25 30 5 6 14 15 20 19

NB. **************************************************
NB. **************************************************

coclass 'method'

create =: verb define
0
)

destroy =: codestroy

cocurrent 'base'
WM =: conew 'method'
CL =: conew 'method'
CM =: conew 'method'
AL =: conew 'method'
SL =: conew 'method'

dists__WM =:  3 :'> -:@:-:@:Edstnc__  each/~ centroid__  y'  NB. **HALF** of distance squared

t=. x
Nj =. n__"2 &>@(> each) y
NkNl =.  t{Nj
Nkl =. +/|: NkNl
NjNkl =. NkNl +/ Nj
NjNm =. Nkl +/ Nj
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
((+/"2 NjNkl*DjDkl)-Nj*/~Dkl)%NjNm
)

dists__CL =:  3 :'completelinkage__ Edstnc__"1/&:> each/~y'

t=. x
>./"2 t{ dists__
)

dists__SL =:  3 :'singlelinkage__ Edstnc__"1/&:> each/~y'

t=. x
<./"2 t{ dists__
)

dists__CM =:  3 :'> Edstnc__ each/~ centroid__ y'

t=. x
Nj =. n__"2 &>@(> each) y
NkNl =.  t{Nj
Nkl =. +/|: NkNl
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
DklNkNl =. */Dkl ,|:NkNl
((+/"2 NkNl*DjDkl)% Nkl)- (DklNkNl%*:Nkl)
)

dists__AL =:  3 :'> Edstnc__"1 each/~ centroid__ y'