Here are some of my musings on statistical matters.

Median Inconstancy

from    Devon McCormick <devonmcc@gmail.com>
to      J-programming forum <programming@jsoftware.com>
date    Fri, Oct 16, 2009 at 2:21 PM
subject "median" considered inaccurate?

Members of the forum -

While looking up some statistical definitions, I came across this example http://www2.le.ac.uk/offices/ssds/sd/ld/resources/numeracy/variability in which the calculation of the median disagrees with the result of the verb listed as "m0=: median=: <.@-:@# { /:~" in "MathStats" on the J wiki.

I was actually looking at the definition of quartiles when I noticed this.

For the series

   #scrs=. 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
20
   m0=: <.@-:@# { /:~
   m0 scrs
62
   median scrs  NB. my own definition
61
   median
-:@(+/)@((<. , >.)@midpt { /:~)
   midpt
-:@<:@#

Also, this site's answers disagree with Excel and with my own quartile function, applied to "scrs" above, but I think the site is correct:

   NB. Quartiles 1-3 according to Excel:
   52.75 61 70.25

   NB. According to http://www2.le.ac.uk/offices/ssds/sd/ld/resources/numeracy/variability:
   52.5 61 70.5

   0 1 2 quartile&><scrs
52 60 70

My "quartile" disagrees with my "median": the middle quartile should be the same as the median.

quartile=: 4 : 'x{4 ntilebps y'
ntilebps=: 4 : 0
NB.* ntilebps: return breakpoint values of x-tiles of y; e.g. 4 ntilebps y
NB.  -> quartiles; 0-based so "1st" quartile is 0{4 ntilebps y.
   y=. /:~y
   wh=. 0 1#:(i.x)*x%~#y           NB. Where partition points are exactly
   'n f'=. |:wh                    NB. whole and fractional part of partitions
   1|.+/"1 ((1-f),.f)*(n+/_1 0){y  NB. "1|." moves top quantile to end.
)

Anyone care to weigh in on this?

Regards,

Devon
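
A note on the mismatch above between "quartile" and "median": one possible alternative, sketched here under the assumption that each breakpoint should be interpolated linearly at position p*(n-1) of the sorted data. The verb name qlin is hypothetical; it is not part of the original thread or of any J library. On "scrs" it reproduces the Excel figures quoted above, and its middle value agrees with "median".

```j
scrs=: 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80

qlin=: 3 : 0
NB. qlin: quartiles 1-3 of y by linear interpolation (illustrative sketch only).
   s=. /:~ y                          NB. sort ascending
   p=. 0.25 0.5 0.75 * <: # s         NB. fractional 0-based positions
   lo=. <. p                          NB. index at or below each position
   f=. p - lo                         NB. fractional part of each position
   ((1-f) * lo { s) + f * (>. p) { s  NB. weighted mix of the two neighbors
)

   qlin scrs
52.75 61 70.25
```

This is only one of several conventions; it does not reproduce the web site's 52.5 61 70.5.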

This received the following thoughtful replies:

from    Sherlock, Ric <R.G.Sherlock@massey.ac.nz>
to      Programming JForum <programming@jsoftware.com>
date    Fri, Oct 16, 2009 at 3:33 PM
subject Re: [Jprogramming] "median" considered inaccurate?

My understanding is that for an even number of values there is no universally accepted correct value for the median. The average of the two middle values (your definition, the same as that from stats/base/univariate) is often used, though.
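
To make the ambiguity concrete for the 20-element series "scrs" from the original post, the session below picks out the two middle values and their average. The names s and mid are introduced here for illustration only. The upper middle value is what "m0" returns, and the average is what "median" returns.

```j
scrs=: 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
   s=: /:~ scrs
   mid=: (<. , >.) -: <: # s   NB. indices of the two middle values
   mid
9 10
   mid { s                     NB. lower and upper middle values
60 62
   -: +/ mid { s               NB. their average
61
   (<. , >.) -: <: # }. s      NB. with 19 values the two indices coincide
9 9
```

With an odd count the two indices are the same, so the candidate definitions agree.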

from    Don Guinn <donguinn@gmail.com>
to      Programming forum <programming@jsoftware.com>
date    Fri, Oct 16, 2009 at 3:47 PM
subject Re: [Jprogramming] "median" considered inaccurate?

I looked up the definition of "median" and it appears that there are several. And, according to http://en.wikipedia.org/wiki/Median median and quartiles can be messy with badly skewed data. As best I can tell, this is a measurement that should be used with care.

I wrote a quick verb which gives the same answers as the site you referenced and it does strange things, depending on the data. If the count of the set is odd, which group should have the extra number? What if the data is really skewed?

  qr=.([:([:(+/%#)]{~[:(<:,:])[:>.0.25 0.5 0.75"_*#)]/:]) NB. Needs cleaning up.
  qr scrs
52.5 61 70.5
  qr i.4
0.5 1.5 2.5
  qr i.5
1.5 2.5 3.5
  qr i.12
2.5 5.5 8.5
  qr i.11
2.5 5.5 8.5
  qr i.13
3.5 6.5 9.5
  -~/0 2{qr scrs
18
  qr 1 1 1 1 1 2 3 4
1 1 2.5
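
Since "qr" is dense, here is one reading of it unpacked into steps on "scrs". The names s and k are introduced for illustration and do not appear in Don's verb.

```j
scrs=: 43 48 50 50 52 53 56 58 59 60 62 65 66 68 70 71 74 76 78 80
   s=: /:~ scrs                  NB. sort ascending
   k=: >. 0.25 0.5 0.75 * # s    NB. ceiling of n%4, n%2 and 3n%4
   k
5 10 15
   (<: ,: ]) k                   NB. each index paired with its predecessor
4  9 14
5 10 15
   ((<: ,: ]) k) { s             NB. the values at those indices
52 60 70
53 62 71
   (+/ % #) ((<: ,: ]) k) { s    NB. average each pair
52.5 61 70.5
```

Read this way, the behavior on odd counts follows from the ceiling: for "i.5", k is 2 3 4, so the middle result averages items 2 and 3 to give 2.5 rather than the single middle item 2.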

My reply:

Don - I like yours better than the one I have now, though I'll probably generalize it into an "Ntiler".

Part of the problem is that there are multiple correct answers if we define quartile numbers as those which divide the set as evenly as possible into four groups, e.g.

   quartileCt=:  4 : '+/"1 (y>:/~x,_) *. y< /~__,x'  NB. Count elements/quartile
NB. All these different answers work correctly:
   (52.75 61 70.25) quartileCt scrs  NB. Excel
5 5 5 5
   (52.5 61 70.5) quartileCt scrs    NB. web site
5 5 5 5
   (52.1 61.1 70.1) quartileCt scrs  NB. another answer...
5 5 5 5

One way to test, as you suggest, is to look at the behavior when we have an odd number of elements, i.e. "odd" with respect to four (a count that is not a multiple of four):

NB. Two different ways of counting number of elements/quartile:
   quartileCt=:  4 : '+/"1 (y>:/~x,_) *. y< /~__,x'
   quartileCt2=: 4 : '+/"1 (y> /~x,_) *. y<:/~__,x'
NB. Two different quartilers:
   test0=: 1 : '(3{.4 ntilebps y) u y'  NB. Mine
   test1=: 1 : '(qr y) u y'             NB. Don's
  
NB. Both work OK for even and odd cases counted one way...
   quartileCt test0&>0 1 2 3 4}.&.><scrs
5 5 5 5
4 5 5 5
4 5 4 5
4 4 4 5
4 4 4 4
   quartileCt test1&>0 1 2 3 4}.&.><scrs
5 5 5 5
5 5 5 4
5 4 5 4
5 4 4 4
4 4 4 4
  
NB. Mine falls down for the "even" cases (counts of 20 and 16, multiples of four) counted the other way:
   quartileCt2 test0&>0 1 2 3 4}.&.><scrs
4 5 5 6
4 5 5 5
4 4 5 5
4 4 4 5
3 4 4 5
NB. but Don's works well under this counting method as well:
   quartileCt2 test1&>0 1 2 3 4}.&.><scrs
5 5 5 5
5 5 5 4
5 4 5 4
5 4 4 4
4 4 4 4

Thanks for your suggestions.

Regards,

Devon

DevonMcCormick/Statistics (last edited 2009-10-16 23:40:52 by DevonMcCormick)