September 5, 2018

Announcements

Intro to Data

We will use the lego R package in this class. It contains information about every Lego set manufactured from 1970 to 2015, a total of 6172 sets.

devtools::install_github("seankross/lego")
library(lego)
data(legosets)

Types of Variables

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal
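
As a rough illustration of how these four kinds of variables might be stored in R, the sketch below builds a small made-up data frame and inspects the class of each column. The name `toy_sets` and all of its values are hypothetical and are not part of `legosets`.

```r
# Hypothetical toy data (not taken from legosets)
toy_sets <- data.frame(
  price  = c(9.99, 59.99, 149.99),                  # numerical, continuous
  pieces = c(13L, 898L, 2262L),                     # numerical, discrete
  theme  = factor(c("City", "Creator", "City")),    # regular categorical
  size   = factor(c("small", "medium", "large"),
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)                   # ordinal
)

# First class of each column: numeric, integer, factor, ordered
var_classes <- sapply(toy_sets, function(col) class(col)[1])
print(var_classes)

# Ordered factors support comparisons between levels
size_compare <- toy_sets$size[1] < toy_sets$size[3]
print(size_compare)
```

Character columns such as `Theme` in `legosets` are categorical in meaning even though R stores them as `chr`; converting them with `factor()` is one way to make that explicit.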

Data Types in R

Types of Variables

str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...

Qualitative Variables

Descriptive statistics:

  • Contingency Tables
  • Proportional Tables

Plot types:

  • Bar plot
  • Mosaic plot

Contingency Tables

table(legosets$Availability, useNA='ifany')
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   695                     2                  1795 
##           Promotional Promotional (Airline)                Retail 
##                   141                    12                  3120 
##      Retail - limited               Unknown 
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##                        
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##                        
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##                        
##                         Plastic box Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive                  1     412              0    6    0
##   LEGOLAND exclusive              0       0              0    0    0
##   Not specified                   6      24              0    0    1
##   Promotional                     2      90              0    0    1
##   Promotional (Airline)           0       0              0    0    0
##   Retail                          0       4             18    0   33
##   Retail - limited                1      86              0    0    5
##   Unknown                         0       3              0    0    0

Proportional Tables

prop.table(table(legosets$Availability))
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1126053143          0.0003240441          0.2908295528 
##           Promotional Promotional (Airline)                Retail 
##          0.0228451069          0.0019442644          0.5055087492 
##      Retail - limited               Unknown 
##          0.0652948801          0.0006480881
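
For a two-way table, `prop.table()` also accepts a `margin` argument, so proportions can be taken within rows or within columns instead of over the whole table. The sketch below uses four counts copied from the Availability by Packaging table above (LEGO exclusive and Retail, Blister pack and Box); the object name `small_tab` is just for illustration.

```r
# Four cells copied from the contingency table shown earlier
small_tab <- as.table(matrix(
  c(45, 147,
    53, 2575),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Availability = c("LEGO exclusive", "Retail"),
    Packaging    = c("Blister pack", "Box")
  )
))

overall_props <- prop.table(small_tab)              # each cell / grand total
row_props     <- prop.table(small_tab, margin = 1)  # each cell / its row total
col_props     <- prop.table(small_tab, margin = 2)  # each cell / its column total

print(round(row_props, 3))
```

Row proportions answer "within each availability category, how are sets packaged?", while column proportions answer the reverse question.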

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Mosaic Plot

library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Quantitative Variables

Descriptive statistics:

  • Mean
  • Median
  • Quartiles
  • Variance: \(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\)
  • Standard deviation: \(s=\sqrt{s^2}\)

Plot types:

  • Dot plots
  • Histograms
  • Density plots
  • Box plots
  • Scatterplots

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976
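
To connect `var()` and `sd()` to the variance formula, the sketch below computes both by hand on the first ten `Pieces` values shown in the `str()` output and checks that the results agree with the built-in functions. The variable names are illustrative only.

```r
# First ten Pieces values as printed by str(legosets)
x <- c(2262, 2464, 1158, 898, 13, 39, 32, 105, 13, 11)

n     <- length(x)
x_bar <- mean(x)

# Sample variance: sum of squared deviations divided by n - 1
s2_manual <- sum((x - x_bar)^2) / (n - 1)
s_manual  <- sqrt(s2_manual)

# Both comparisons should report TRUE
print(all.equal(s2_manual, var(x)))
print(all.equal(s_manual, sd(x)))
```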


fivenum(legosets$Pieces, na.rm=TRUE)
## [1]    0.0   30.0   82.0  256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25

The summary Function

summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
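
Note that `summary()` reports a third quartile of 256.2 (256.25 before rounding) while `fivenum()` reports 256.5. The two functions use different definitions: `fivenum()` returns Tukey's hinges, whereas `summary()` and `IQR()` rely on `quantile()`, whose default is the type 7 interpolation rule. A minimal sketch on a four-value vector makes the difference visible:

```r
x <- c(1, 2, 3, 4)

hinges <- fivenum(x)                         # min, lower hinge, median, upper hinge, max
quarts <- unname(quantile(x, c(0.25, 0.75))) # default type = 7

print(hinges)         # 1.0 1.5 2.5 3.5 4.0
print(quarts)         # 1.75 3.25

hinge_spread <- hinges[4] - hinges[2]        # 2
iqr_default  <- IQR(x)                       # 1.5
```

For large data sets the two definitions usually give nearly identical values, as they do for `Pieces`.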

The psych Package

library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360
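
If the psych package is not available, grouped summaries of this kind can be approximated in base R with `tapply()`. The sketch below is not how `describeBy()` is implemented; it simply computes the count and mean per group on a small hypothetical data frame (`toy`, with made-up rows).

```r
# Hypothetical toy data with one missing value
toy <- data.frame(
  pieces = c(13, 39, 898, 2262, NA, 105),
  availability = c("Retail", "Retail", "LEGO exclusive",
                   "Retail - limited", "Retail", "LEGO exclusive")
)

# Non-missing count and mean of pieces within each availability group
group_n    <- tapply(toy$pieces, toy$availability, function(v) sum(!is.na(v)))
group_mean <- tapply(toy$pieces, toy$availability, mean, na.rm = TRUE)

print(cbind(n = group_n, mean = group_mean))
```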

Robust Statistics

The median and IQR are more robust to skewness and outliers than the mean and SD, because a few extreme values barely move them. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
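
A small made-up example shows what robustness means here: replacing one value with an extreme outlier changes the mean and SD substantially but leaves the median and IQR untouched. The vectors below are hypothetical.

```r
clean   <- c(1, 2, 3, 4, 5)
outlier <- c(1, 2, 3, 4, 500)   # same data with one extreme value

stats <- rbind(
  clean   = c(mean = mean(clean),   median = median(clean),
              sd = sd(clean),       IQR = IQR(clean)),
  outlier = c(mean = mean(outlier), median = median(outlier),
              sd = sd(outlier),     IQR = IQR(outlier))
)
print(round(stats, 2))
```

The mean jumps from 3 to 102, while the median stays at 3 and the IQR stays at 2.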

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

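# Widen the left margin so the long Availability labels fit; par() returns the old settings
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
# Restore the original graphical parameters
par(par.orig)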

Histograms

hist(legosets$Pieces)
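
For a strongly right-skewed variable such as `Pieces` (mean 215, median 82), the default histogram tends to pile most observations into the first few bars. Two common adjustments are to request more bins with `breaks` and to plot on a log scale. The sketch below uses simulated right-skewed data (`pieces_sim`, generated with `rlnorm()`) as a stand-in, so it runs without the lego package; the parameter values are arbitrary assumptions.

```r
set.seed(1)
# Simulated right-skewed counts, shifted so every value is at least 1
pieces_sim <- round(rlnorm(500, meanlog = 4.4, sdlog = 1.2)) + 1

# More bins on the raw scale (breaks is a suggestion, not an exact count)
h_raw <- hist(pieces_sim, breaks = 30,
              main = "Simulated pieces", xlab = "Pieces")

# Log scale spreads out the small values and compresses the long tail
h_log <- hist(log10(pieces_sim), breaks = 30,
              main = "Simulated pieces (log10)", xlab = "log10(Pieces)")
```

With the real data, the same calls would use `legosets$Pieces`. Sets with 0 pieces would need to be dropped or offset before taking logs.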