Statistical cluster analysis can be done on small data sets using a hierarchical approach which starts with all observations as single clusters, and combining two "close" clusters iteratively at each step. The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. Only one (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. Nonhierarchical cluster analysis can handle large data sets, but I have not much experience with such. Please let me know if you have any recommendations for this material.
NB. clusterbasic.ijs
NB. 12/17/5
NB. revising clusoverall.ijs
NB. revising cluster.ijs
NB. also starting from scratch
NB. usage t =. <method> cluster <data>
NB. eg: t =. WM cluster sharma
NB. then: t show sharma NB. shows cluster history
NB. or: sharma show~3{."1]_3{t NB. shows history for 3 clusters
NB. and: R2 NB. show Rsquared history
NB. and: nTies NB. show ties history
NB. and: r NB. show cluster pointer history
NB. methods available:
NB. WM:Ward's
NB. CL:Complete linkage
NB. CM:Centroid
NB. AL:Average linkage
NB. SL:Single linkage (nearest neighbor)
show =: [ {&.> [: < ] NB. to see cluster results starting at maxclus
maxclus =: 10
NB. sets the final number of clusters for which detailed statistics
NB. can be sent to an output file or printed
cluster =: dyad define
init y
method =. x
dists =: dists__method currclus
while. <:#currclus
do.
nTies =: nTies ,<:#ndx =. (locmins # i.@#) lfm dists
curr =: >{:tmp =. ,tmp0 =. ndx&{"1 lfm"2 (<"0 ,: cp) dists
if. 1&<#ndx NB. optionally deal with ties
do.
dists2 =. (>{:tmp0) dist__method currclus
ndx2 =. tmp0 FAR +/"1 dists2
curr =: >{:tmp =. ndx2{"1 tmp0
end.
cluskey =: ;(>{: tmp) combine <"0 i. #clusptrs
clusptrs =: cluskey <@;/. pclusptrs =. clusptrs
currclus =: cluskey <@;/. pcurrclus =: currclus
if. maxclus>:#currclus
do.
R =. Rs >each currclus
R2 =: R2,-.R
r =: r, clusptrs
end.
dists =: curr distsch curr dist__method pcurrclus
end.
r
)
init =: monad define
clusptrs =: <"0 i. N =: #obs =. ;/y
clusno =: N # N
pts =: ({:"1 y) </. obs
currclus =: clusptrs </. obs
nTies =: i. 0
SS =: +/diag SSCP&:> obs
R2 =: i. 0
NB. allr_cluster_ =: i. 10 0
r =: 0 0$a:
)
distsch =: dyad define
t =. x
new =. 0 t} y
fix =. (_1{ t)} NB. this _1 must coordinate with combine
dists =. new&fix &.|: new&fix dists
drop =. (<<<0{ t)&{ NB. this 0 must coordinate with combine
drop"1 drop dists
)
length =: sum &. (*:"_)
Edstnc =: length@:-
Edstnc =: ss@:- NB. more efficient
nrmlz =: ] % length
cells =: +/~ @:(#&>) NB. table of cluster sizes
NB. routines for strictly lower triangular matrices to lists and back
expand =: #^:_1
cslt =: &.(lfm :. mfl) NB. create slt from full matrix
slt =: >/~@i. NB. strictly lower triangular matrix
lfm =: slt@# #&, ] NB. list from matrix
fsl =: <:@fms { los@>: NB. find string length needed
fms =: ndxle >:@i. 1: NB. find matrix size
los =: +/\@ i. NB. list of sizes
ndxle =: (<:los)~ NB. index <: list of sizes
mfl =: (],])@fms@# $,@slt@fms@#expand fsl@#{.] NB. slt from list
NB. routines for describing cluster homogeneity, etc.
indices =: ]each cslt@(CP&:(i.@#)/~)
spr =: (-`+)/@:(+/@ diag @SSCP&>)each cslt@(ap&:>each/~)
ns =:(,&#)each/~
ap =: <@,,;
diag =: (<0 1)&|:
NB. cluster distances
singlelinkage =: <./@,&>
completelinkage =: >./@,&>
averagelinkage =: avg@,&>
centroid =: ctr each@(> each)
NB. ways to deal with ties
FAR =: locmax@]
FAR =: locclose
FAR =: sas
min =: <.
max =: >.
locmins =: = min/
v_locmins =: ]<: >:@%&100@[ * min/@]
locmin =: i. min/
locmax =: i. max/
min_numb =: ({.,#)@(#~ locmins)
mins =: &v_locmins
sas =: dyad define
t =. x( >@{:@[{<./&>@])clusptrs NB. min ID for each potential cluster record
tt =. {.@:/:@:(/:~"1) t NB. index of smallest of sorted of each cluster's 2 IDs
)
locclose =: dyad define
t =. x
locmin y
)
incrQ =: 2&(</\)&(,&0)
frst0 =: i.&0
transpose =: |:
sel =:[: ; [ {.&:>each ([: - [) {each ]
NB. measuring changes in clustering coefficients
dif =: 2: -~/\]
pd =: dif % }:
NB. measuring sums of squares
ss =: sum @: *:
sum =: +/
sqrt =: %:
n =: #
avg =: sum%n
ctr =: avg"2 NB. centroid
mnc =: ] -"1 ctr
mp =: sum . *
sscp =: transpose mp ]
transpose =: |:
SSCP =: sscp@mnc
df =: <: @: n
df =: +/@:(#&>)-1: NB. added 2/4/2
dfw =: +/@:>@:(#each) - #
dfw =: +/@:(#&>) - # NB. added 2/4/2
SSCPw =: sum@:>@:(SSCP each)
SSw =: SSCPw%dfw
SSt =: SSCPt%dfw NB. added 2/4/2
SSb =: SSCPb%dfw NB. added 2/4/2
SSCPt =: SSCP@;@,.
SSCPb =: SSCPt - SSCPw
Rs =: sum@:((<0 1)&|:@SSCPb)%sum@:((<0 1)&|:@SSCPt)
det =: -/ . *
CP =: {@(,&<) NB. Catalog of dyadic values
cp =: CP&(i.@#)~ NB. Catalog of a array's indices
freqcount=: (\: {:"1)@(~. ,. #/.~)
combine =: (nit <@:# foc)`({:@[)`]}
foc =: {.@[{.@>@{] NB. first atom of object to be copied
nit =: ({:@[#@>@{]) NB. number of atoms in target object
stddev =: ss@mnc sqrt @ % df
variance =: ss@mnc % df
std =: mnc %"1 stddev
S =: SSCP % df
R =: SSCP@std % df
NB. load <path,'iris.data'
NB. load <path,'wine.data'
NB. load <path,'glass.data'
sharma =: transpose 2 6$5 6 15 16 25 30 5 6 14 15 20 19
NB. **************************************************
NB. **************************************************
coclass 'method'
create =: verb define
0
)
destroy =: codestroy
cocurrent 'base'
WM =: conew 'method'
CL =: conew 'method'
CM =: conew 'method'
AL =: conew 'method'
SL =: conew 'method'
dists__WM =: 3 :'> -:@:-:@:Edstnc__ each/~ centroid__ y' NB. **HALF** of distance squared
dist__WM =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =. t{Nj
Nkl =. +/|: NkNl
NjNkl =. NkNl +/ Nj
NjNm =. Nkl +/ Nj
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
((+/"2 NjNkl*DjDkl)-Nj*/~Dkl)%NjNm
)
dists__CL =: 3 :'completelinkage__ Edstnc__"1/&:> each/~y'
dist__CL =: dyad define
t=. x
>./"2 t{ dists__
)
dists__SL =: 3 :'singlelinkage__ Edstnc__"1/&:> each/~y'
dist__SL =: dyad define
t=. x
<./"2 t{ dists__
)
dists__CM =: 3 :'> Edstnc__ each/~ centroid__ y'
dist__CM =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =. t{Nj
Nkl =. +/|: NkNl
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
DklNkNl =. */Dkl ,|:NkNl
((+/"2 NkNl*DjDkl)% Nkl)- (DklNkNl%*:Nkl)
)
dists__AL =: 3 :'> Edstnc__"1 each/~ centroid__ y'
dist__AL =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =. t{Nj
Nkl =. +/|: NkNl
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
(+/"2 NkNl*DjDkl)% Nkl
)Contributed by BrianSchott
