Here are some of my musings on matters statistical.
Median Inconstancy
from Devon McCormick <devonmcc@gmail.com>
to J-programming forum <programming@jsoftware.com>
date Fri, Oct 16, 2009 at 2:21 PM
subject "median" considered inaccurate?
Members of the forum -
while looking up some statistical definitions, I came across this example http://www2.le.ac.uk/offices/ssds/sd/ld/resources/numeracy/variability in which the calculation of the median disagrees with the result of the one listed as "m0=: median=: <.@-:@# { /:~" in "MathStats" on the J wiki.
I was actually looking at the definition of quartiles when I noticed this.
For the series
   #scrs=. 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
20
   m0=: <.@-:@# { /:~
   m0 scrs
62
   median scrs NB. my own definition
61
   median
-:@(+/)@((<. , >.)@midpt { /:~)
   midpt
-:@<:@#
Also, this site's answers disagree with Excel and with my own quartile function, applied to "scrs" above, but I think the site is correct:
   NB. Quartiles 1-3 according to Excel: 52.75 61 70.25
   NB. According to http://www2.le.ac.uk/offices/ssds/sd/ld/resources/numeracy/variability: 52.5 61 70.5
   0 1 2 quartile&><scrs
52 60 70
My "quartile" disagrees with my "median": the middle quartile should be the same as the median.
quartile=: 4 : 'x{4 ntilebps y'
ntilebps=: 4 : 0
NB.* ntilebps: return breakpoint values of x-tiles of y; e.g. 4 ntilebps y
NB. -> quartiles; 0-based so "1st" quartile is 0{4 ntilebps y.
y=. /:~y
wh=. 0 1#:(i.x)*x%~#y NB. Where partition points are exactly
'n f'=. |:wh NB. whole and fractional part of partitions
1|.+/"1 ((1-f),.f)*(n+/_1 0){y NB. "1|." moves top quantile to end.
)
Anyone care to weigh in on this?
Regards,
Devon
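For an even count, the disagreement above comes down to which sorted elements each verb picks. The session below is a small sketch that reuses scrs from the post; the name s is introduced here only for illustration.

```j
   scrs=. 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
   s=. /:~ scrs                 NB. sorted copy (s is a name used only in this sketch)
   (<.@-:@#) s                  NB. m0 picks the single index <.-:20
10
   10 { s
62
   (<. , >.) -:@<:@# s          NB. median picks the two indices around 9.5
9 10
   -: +/ 9 10 { s               NB. and averages them
61
   +/ 1 0 * 9 10 { s            NB. ntilebps weights the same pair 1 0 when the fraction is 0
60
```

So with 20 values the three verbs return the upper middle element, the mean of the two middle elements, and the lower middle element respectively.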
This received the following thoughtful replies:
from Sherlock, Ric <R.G.Sherlock@massey.ac.nz>
to Programming JForum <programming@jsoftware.com>
date Fri, Oct 16, 2009 at 3:33 PM
subject Re: [Jprogramming] "median" considered inaccurate?
My understanding is that for an even number of values there is no universally accepted correct value for the median. The average of the two middle values (your definition, which is the same as that from stats/base/univariate) is often used though.
from Don Guinn <donguinn@gmail.com>
to Programming forum <programming@jsoftware.com>
date Fri, Oct 16, 2009 at 3:47 PM
subject Re: [Jprogramming] "median" considered inaccurate?
Looked up the definition of "median" and it appears that there are several of them. And, according to http://en.wikipedia.org/wiki/Median, median and quartiles can be messy with badly skewed data. Best I can tell this is a measurement that should be used with care. I wrote a quick verb which gives the same answers as the site you referenced and it does strange things, depending on the data. If the count of the set is odd, which group should have the extra number? What if the data is really skewed?
   qr=.([:([:(+/%#)]{~[:(<:,:])[:>.0.25 0.5 0.75"_*#)]/:]) NB. Needs cleaning up.
   qr scrs
52.5 61 70.5
   qr i.4
0.5 1.5 2.5
   qr i.5
1.5 2.5 3.5
   qr i.12
2.5 5.5 8.5
   qr i.11
2.5 5.5 8.5
   qr i.13
3.5 6.5 9.5
   -~/0 2{qr scrs
18
   qr 1 1 1 1 1 2 3 4
1 1 2.5
My reply:
Don - I like yours better than the one I have now, though I'll probably generalize it into an "Ntiler".
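One possible shape for such a generalization is sketched below. The name ntiler and its explicit form are assumptions made for illustration, not code from the thread. It applies the rule qr appears to use (average the element at the ceiling of each breakpoint position with the one before it) to x groups instead of four.

```j
NB. ntiler: hypothetical generalization of qr; x is the number of groups.
ntiler=: 4 : 0
  s=. /:~ y                       NB. sort ascending
  k=. >. (}. i. x) * x %~ # y     NB. ceiling of each interior breakpoint position
  -: (k { s) + (<: k) { s         NB. mean of the element at k and the one before it
)

   scrs=. 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
   4 ntiler scrs
52.5 61 70.5
   2 ntiler scrs
61
   4 ntiler i.11
2.5 5.5 8.5
```

With x equal to 2 the single breakpoint matches median for scrs. For odd counts, though, this rule still averages two neighbours (the middle value of qr i.5 above is 2.5, whereas the middle element of i.5 is 2), so it would need a special case to agree with median in general. Like qr, the sketch assumes y has enough elements for every index to be valid.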
Part of the problem is that there are multiple correct answers if we define quartile numbers as those which divide the set as evenly as possible into four groups, e.g.
   quartileCt=: 4 : '+/"1 (y>:/~x,_) *. y< /~__,x' NB. Count elements/quartile
   NB. All these different answers work correctly:
   (52.75 61 70.25) quartileCt scrs NB. Excel
5 5 5 5
   (52.5 61 70.5) quartileCt scrs NB. web site
5 5 5 5
   (52.1 61.1 70.1) quartileCt scrs NB. another answer...
5 5 5 5
One way to test, as you suggest, is to look at the behavior when we have an odd number of elements, i.e. "odd" with respect to four:
   NB. Two different ways of counting number of elements/quartile:
   quartileCt=: 4 : '+/"1 (y>:/~x,_) *. y< /~__,x'
   quartileCt2=: 4 : '+/"1 (y> /~x,_) *. y<:/~__,x'
   NB. Two different quartilers:
   test0=: 1 : '(3{.4 ntilebps y) u y' NB. Mine
   test1=: 1 : '(qr y) u y' NB. Don's
   NB. Both work OK for even and odd cases counted one way...
   quartileCt test0&>0 1 2 3 4}.&.><scrs
5 5 5 5
4 5 5 5
4 5 4 5
4 4 4 5
4 4 4 4
   quartileCt test1&>0 1 2 3 4}.&.><scrs
5 5 5 5
5 5 5 4
5 4 5 4
5 4 4 4
4 4 4 4
   NB. Mine falls down for the even cases counted the other way:
   quartileCt2 test0&>0 1 2 3 4}.&.><scrs
4 5 5 6
4 5 5 5
4 4 5 5
4 4 4 5
3 4 4 5
   NB. but Don's works well under this counting method as well:
   quartileCt2 test1&>0 1 2 3 4}.&.><scrs
5 5 5 5
5 5 5 4
5 4 5 4
5 4 4 4
4 4 4 4
Thanks for your suggestions.
Regards,
Devon
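The two counting verbs differ only in which end of each interval is closed: quartileCt counts elements in (lower, upper], while quartileCt2 counts elements in [lower, upper). A breakpoint can therefore change the counts only when it coincides with a data value. The short check below is a sketch that uses only scrs and the breakpoints quoted above.

```j
   scrs=. 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
   +/ scrs e. 52 60 70          NB. scores that sit exactly on the ntilebps breakpoints
3
   +/ scrs e. 52.5 61 70.5      NB. scores that sit exactly on the qr breakpoints
0
   +/ scrs <: 52                NB. first group when the upper end is closed (quartileCt)
5
   +/ scrs < 52                 NB. first group when the upper end is open (quartileCt2)
4
```

This agrees with the first entries of the tables above for the full series: 5 under quartileCt and 4 under quartileCt2 with the ntilebps breakpoints. The qr breakpoints for scrs fall between data values, so both conventions give the same counts there. That is a property of this data set rather than a guarantee: qr 1 1 1 1 1 2 3 4 returns breakpoints that are data values.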
