# Exploring probability distributions for bivariate temporal granularities
### Sayani Gupta · @Sayani07 · <https://sayanigupta-ysc.netlify.com/>
### Young Statisticians Conference October 2, 2019

---

# Electricity smart meter technology (~ 40 billion half-hourly observations)

- Source: Department of the Environment and Energy, Australia

- Frequency: half-hourly interval meter readings (kWh)

- Time span: 2012 to 2014

- Spread: 14K (approx.) households based in Newcastle, New South Wales, and parts of Sydney

???
Smart meters record electricity usage (in kWh) every 30 minutes and send this information to the electricity retailer for billing.

**Consumers** can save a considerable amount on their electricity bills by
- Switching on their hot water heater or doing laundry when energy is cheaper, or when their solar system is generating surplus energy
- Switching off appliances during peak demand
- Checking usage and comparing it with similar homes

**Retailers** can reduce costs and increase efficiency by
- Lowering metering and connection fees
- Drawing insights into when a customer is home, or sleeping, or even which appliances they are using, based on usage figures
- Rewarding customers for mindful usage

Just to give you some perspective, I have this data from the Department of the Environment and Energy, Australia, which provides interval meter readings every 30 minutes from 2012 to 2014. So the finest temporal unit here is the half hour, whereas the coarsest temporal unit is the year. The data is available for 14K customers located in different local government areas. It is spread across both time and space, and hence is spatio-temporal data.

---

## Visualize the raw data from 2012 to 2014 for 50 households

---

## Visualize the periodicities in half-hourly energy usage for 1 household from 2012 to 2014

---

background-position: center
background-size: contain

???
Well, there can be numerous ways to analyse this data! But I was interested in answering one question: given this huge volume and spread, how can one explore this data systematically?

---

## **Problem** : How do we systematically explore large quantities of temporal data across different deconstructions of time (half-hour, day, type of day, year) to find regular patterns or anomalies in behaviour?

## **Solution** : Visualize probability distributions over different time granularities.

???

Exploratory data analysis was developed by **John Tukey** as a way of _*systematically*_ using the tools of statistics on a problem before hypotheses about the data were developed. It encourages breaking the big problem into pieces and focusing on subsets. So the reduced goal that I set for myself is to look at time only and to provide ... . The smart meter example is the one that motivated me for this problem, but the idea is to provide the same for any temporal data that follows a hierarchy.

The key terms are deconstructing time and visualizing distributions. In the next couple of slides, we will talk about the strengths and challenges of each of these.

---

# Deconstructing time: Arrangement
.pull-left[
#### **Granularities** 
abstractions of time based on calendar
 
 
 - **Linear** days, weeks, months, years 
 
 - **Circular** day-of-week, month-of-year or hour-of-day 
 
 - **Aperiodic** day-of-month, week-of-month 
]

.pull-right[
<img src="images/linear-time.png" width="260%" height="150%" style="display: block; margin: auto;" />
 
<img src="images/circular.png" width="100%" style="display: block; margin: auto;" />

]

---

# Deconstructing time: Order

**Single-order-up** second-of-minute, hour-of-day 
 
**Multiple-order-up** second-of-hour, hour-of-week

---

# Computation of granularities

`$z$` : index of a tsibble 
 
`$x$`, `$y$` : two units in the hierarchy table with `$order(x) < order(y)$` 
 
`$f(x, y)$` : accessor function for computing the granularity 
 
`$c(x, y)$` : a constant which relates x and y

#### **Single-order-up**
`$$f(x, y) = \lfloor z/c(z,x) \rfloor\mod c(x,y)$$` where `$y = x+1$`

#### **Multiple-order-up**

`$$f(x, y) = \sum_{i=0}^{order(y) - order(x) - 1} c(x, x+i)\,(f(x+i, x+i+1) - 1)$$`
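
As a rough illustration of the two formulas, here is a minimal Python sketch (not the gravitas implementation). It assumes a hypothetical three-unit hierarchy (hour, day, week), an index `z` counted in hours from zero, and zero-based granularity values, so the `- 1` offset in the multiple-order-up formula is not needed. The names `UNITS`, `STEP`, `c`, `single_order_up` and `multiple_order_up` are made up for this example.

```python
# Hypothetical hierarchy for illustration: units ordered from finest to coarsest.
UNITS = ["hour", "day", "week"]
# STEP[u] = number of u units in the next coarser unit:
# c(hour, day) = 24, c(day, week) = 7.
STEP = {"hour": 24, "day": 7}


def c(x, y):
    """Constant relating x to a coarser-or-equal unit y; c(x, x) = 1."""
    i, j = UNITS.index(x), UNITS.index(y)
    size = 1
    for unit in UNITS[i:j]:
        size *= STEP[unit]
    return size


def single_order_up(z, x):
    """Zero-based f(x, x+1) for an index z counted in the finest unit."""
    y = UNITS[UNITS.index(x) + 1]
    return (z // c(UNITS[0], x)) % c(x, y)


def multiple_order_up(z, x, y):
    """Zero-based f(x, y) assembled from the single-order-up pieces."""
    i, j = UNITS.index(x), UNITS.index(y)
    return sum(c(x, UNITS[k]) * single_order_up(z, UNITS[k]) for k in range(i, j))


z = 200  # 200 hours after the origin
print(single_order_up(z, "hour"))            # hour_day  -> 8
print(single_order_up(z, "day"))             # day_week  -> 1
print(multiple_order_up(z, "hour", "week"))  # hour_week -> 32
```

For this hierarchy the multiple-order-up result agrees with the direct computation `z % 168`, which is a quick sanity check on the decomposition.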

---

## Interaction of bivariate granularities

**Harmonies** : pairs of granularities that aid exploratory data analysis  
**Clashes**   : pairs leading to structurally empty sets
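
To make the distinction concrete, here is a small Python sketch that counts structurally empty combinations for two pairs of granularities. It is only an illustration under simple assumptions (hourly index `z` starting at zero, zero-based values, eight full weeks of data) and is not how gravitas decides harmonies; the helper `empty_cells` is hypothetical.

```python
from itertools import product


def hour_day(z):
    return z % 24


def day_week(z):
    return (z // 24) % 7


def hour_week(z):
    return z % 168


def empty_cells(g1, n1, g2, n2, index=range(24 * 7 * 8)):
    """Return the (g1, g2) combinations that never occur over the index."""
    observed = {(g1(z), g2(z)) for z in index}
    return [cell for cell in product(range(n1), range(n2)) if cell not in observed]


# day_week x hour_day: every one of the 7 * 24 combinations occurs.
print(len(empty_cells(day_week, 7, hour_day, 24)))    # 0
# hour_week x hour_day: each hour_week fixes hour_day, so most cells are empty.
print(len(empty_cells(hour_week, 168, hour_day, 24)))  # 3864
```

The first pair fills every cell and so behaves like a harmony; the second leaves 168 × 24 − 168 cells empty by construction, which is the kind of pair the slide calls a clash.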

---

## Visualizing probability distributions

Breaking down the big problem: two granularities at a time.
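
A minimal Python sketch of what "two granularities at a time" can mean in practice: group the measured variable by a pair of granularities and summarise each group by its quartiles, which is the information a box plot would draw. The readings below are synthetic and purely illustrative, the index `z` is assumed to be hours counted from zero, and `summarise` is a hypothetical helper, not a gravitas function.

```python
from collections import defaultdict
from statistics import quantiles


def summarise(readings):
    """Quartiles of the value for each (day_week, hour_day) combination.

    readings: iterable of (z, value), with z an hourly index starting at 0.
    """
    groups = defaultdict(list)
    for z, value in readings:
        cell = ((z // 24) % 7, z % 24)  # (day_week, hour_day), zero-based
        groups[cell].append(value)
    # quantiles() needs at least two observations per cell.
    return {cell: quantiles(values, n=4) for cell, values in groups.items()}


# Synthetic data: four weeks of hourly values with a daily shape and a slow drift.
readings = [(z, 0.1 + 0.01 * (z % 24) + 0.001 * (z // 168)) for z in range(24 * 7 * 4)]
summary = summarise(readings)
print(len(summary))     # number of (day_week, hour_day) cells
print(summary[(0, 0)])  # lower quartile, median, upper quartile for the first cell
```

Each cell here holds only four observations, which also hints at why it is worth checking that every combination has enough data before trusting its distribution summary.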

#### Types of statistical distribution plots

---

# R package: **gravitas**

.left[
- Compute any granularity? `create_gran`

- Exhaustive list of granularities to explore? `search_gran`
]

.pull-left[
### Interaction

- Check if bivariate granularities are harmonies/clashes? `is.harmony`

- List of harmonies to explore? `harmony`
]

.pull-right[
- Best probability distribution plot for harmonies? `granplot`

- Sufficient observations? `gran_obs`
]

---

## An example : Electricity smart meter data

Data source : [Department of the Environment and Energy, Australia](https://data.gov.au/dataset/4e21dea3-9b87-4610-94c7-15a8a77907ef)

```
#> # A tsibble: 1,450,232 x 8 [30m] <UTC>
#> # Key: customer_id [50]
#> customer_id reading_datetime general_supply_…
#> <chr> <dttm> <dbl>
#> 1 10006414 2012-02-10 08:00:00 0.141
#> 2 10006414 2012-02-10 08:30:00 0.088
#> 3 10006414 2012-02-10 09:00:00 0.078
#> 4 10006414 2012-02-10 09:30:00 0.151
#> # … with 1.45e+06 more rows, and 5 more variables:
#> # event_key <dbl>, controlled_load_kwh <dbl>,
#> # gross_generation_kwh <dbl>,
#> # net_generation_kwh <dbl>, other_kwh <dbl>
```

---

### Set of possible temporal granularities

<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> </th>
 <th style="text-align:left;"> granularity </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 1 </td>
 <td style="text-align:left;"> hour_day </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 2 </td>
 <td style="text-align:left;"> hour_week </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 3 </td>
 <td style="text-align:left;"> hour_month </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 4 </td>
 <td style="text-align:left;"> day_week </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 5 </td>
 <td style="text-align:left;"> day_month </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 6 </td>
 <td style="text-align:left;"> week_month </td>
 </tr>
</tbody>
</table>

### So there are 156 pairs of granularities to look at. -----> <large> WHAT??? </large>

---

## Set of harmonies

### <large> Good news! Only 13 out of 156 are harmonies </large>

<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> </th>
 <th style="text-align:left;"> facet_variable </th>
 <th style="text-align:left;"> x_variable </th>
 <th style="text-align:right;"> facet_levels </th>
 <th style="text-align:right;"> x_levels </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 1 </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:right;"> 7 </td>
 <td style="text-align:right;"> 24 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 2 </td>
 <td style="text-align:left;"> day_month </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:right;"> 31 </td>
 <td style="text-align:right;"> 24 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 3 </td>
 <td style="text-align:left;"> week_month </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:right;"> 5 </td>
 <td style="text-align:right;"> 24 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 4 </td>
 <td style="text-align:left;"> day_month </td>
 <td style="text-align:left;"> hour_week </td>
 <td style="text-align:right;"> 31 </td>
 <td style="text-align:right;"> 168 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 5 </td>
 <td style="text-align:left;"> week_month </td>
 <td style="text-align:left;"> hour_week </td>
 <td style="text-align:right;"> 5 </td>
 <td style="text-align:right;"> 168 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 6 </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:left;"> hour_month </td>
 <td style="text-align:right;"> 7 </td>
 <td style="text-align:right;"> 744 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 7 </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:right;"> 24 </td>
 <td style="text-align:right;"> 7 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 8 </td>
 <td style="text-align:left;"> day_month </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:right;"> 31 </td>
 <td style="text-align:right;"> 7 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 9 </td>
 <td style="text-align:left;"> week_month </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:right;"> 5 </td>
 <td style="text-align:right;"> 7 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 10 </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:left;"> day_month </td>
 <td style="text-align:right;"> 24 </td>
 <td style="text-align:right;"> 31 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 11 </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:left;"> day_month </td>
 <td style="text-align:right;"> 7 </td>
 <td style="text-align:right;"> 31 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 12 </td>
 <td style="text-align:left;"> hour_day </td>
 <td style="text-align:left;"> week_month </td>
 <td style="text-align:right;"> 24 </td>
 <td style="text-align:right;"> 5 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 13 </td>
 <td style="text-align:left;"> day_week </td>
 <td style="text-align:left;"> week_month </td>
 <td style="text-align:right;"> 7 </td>
 <td style="text-align:right;"> 5 </td>
 </tr>
</tbody>
</table>

---

## Visualize Harmonies

---

## Another example: Cricket data of Indian Premier League

Data source: [Cricsheet](http://cricsheet.org/), [Kaggle](https://www.kaggle.com/josephgpinto/ipl-data-analysis/data)

```
#> Observations: 136,598
#> Variables: 38
#> $ season <dbl> 2008, 2008, 2008, 2008, 200…
#> $ match_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ inning <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ over <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, …
#> $ ball <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, …
#> $ winner <chr> "Kolkata Knight Riders", "K…
#> $ total_runs <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 4, …
#> $ batting_team <chr> "Kolkata Knight Riders", "K…
#> $ bowling_team <chr> "Royal Challengers Bangalor…
#> $ batsman <chr> "SC Ganguly", "BB McCullum"…
#> $ non_striker <chr> "BB McCullum", "SC Ganguly"…
#> $ bowler <chr> "P Kumar", "P Kumar", "P Ku…
#> $ is_super_over <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ wide_runs <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ bye_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ legbye_runs <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ noball_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ penalty_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ batsman_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, …
#> $ extra_runs <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, …
#> $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ dismissal_kind <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ fielder <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ city <chr> "Bangalore", "Bangalore", "…
#> $ date <date> 2008-04-18, 2008-04-18, 20…
#> $ team1 <chr> "Kolkata Knight Riders", "K…
#> $ team2 <chr> "Royal Challengers Bangalor…
#> $ toss_winner <chr> "Royal Challengers Bangalor…
#> $ toss_decision <chr> "field", "field", "field", …
#> $ result <chr> "normal", "normal", "normal…
#> $ dl_applied <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ win_by_runs <dbl> 140, 140, 140, 140, 140, 14…
#> $ win_by_wickets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ player_of_match <chr> "BB McCullum", "BB McCullum…
#> $ venue <chr> "M Chinnaswamy Stadium", "M…
#> $ umpire1 <chr> "Asad Rauf", "Asad Rauf", "…
#> $ umpire2 <chr> "RE Koertzen", "RE Koertzen…
#> $ umpire3 <lgl> NA, NA, NA, NA, NA, NA, NA,…
```

---

## Difference in strategy between two top teams

---

.pull-left[
### More Information

Package : https://github.com/Sayani07/gravitas  
Slides: https://sayanigupta-ysc.netlify.com/

### With special thanks to

#### Supervisors: Professor Rob J Hyndman & Professor Dianne Cook

#### Slides created with Rmarkdown, knitr, xaringan, xaringanthemer

**Monash University**
]

.pull-right[
### NUMBATS
 
<img src="images/Numbats.png" width="100%" style="display: block; margin: auto;" />
]