class: center, middle, inverse, title-slide

# Exploring probability distributions for bivariate temporal granularities
###
Sayani Gupta
@Sayani07
https://sayanigupta-ysc.netlify.com/
###
Young Statisticians Conference
October 2, 2019

---

# Electricity smart meter technology (~ 40 billion half-hourly observations)

<!-- .pull-left[ -->
<!-- .center-left[ -->

- Source: Department of the Environment and Energy, Australia <br> <br>
- Frequency: Half-hourly (interval meter reading, kWh) <br> <br>
- Time span: 2012 to 2014 <br> <br>
- Spread: 14K (approx.) households based in Newcastle, New South Wales, and parts of Sydney <br> <br>

???

Smart meters record electricity usage (in kWh) every 30 minutes and send this information to the electricity retailer for billing.

**Consumers** can save a considerable amount on their electricity bill by

- Switching on their hot water heater or doing laundry when energy is cheaper, or when their solar system is generating surplus energy
- Switching off appliances during peak demand
- Checking usage and comparing it with similar homes

**Retailers** can reduce costs and increase efficiency by

- Lowering metering and connection fees
- Drawing insights into when a customer is home, or sleeping, or even what appliances they are using, based on usage figures
- Rewarding customers for mindful usage

To give you some perspective, I have this data from the Department of the Environment and Energy, Australia, which provides interval meter readings every 30 minutes from 2012 to 2014. So the finest temporal unit here is the half hour, whereas the coarsest temporal unit is the year. The data is available for 14K customers located in different local government areas. So this data is spread across both time and space, and hence is spatio-temporal data.
---

<!-- class: hide-slide-number -->

## Visualize the raw data from 2012 to 2014 for 50 households

<img src="images/smart_allcust.gif" style="display: block; margin: auto;" />

---

<!-- class: hide-slide-number -->

## Visualize the periodicities in half-hourly energy usage for 1 household from 2012 to 2014

<img src="figure/motivation5-1.svg" style="display: block; margin: auto;" />

---
background-image: url("images/problem.png")
background-position: center
background-size: contain

???

Well, there can be numerous ways to analyse this data! But I was interested in answering the question: given this huge volume and spread, how can one explore this data systematically?

---
class: center,middle

## **Problem** : <span style="color:#FFDAB9"> How do we systematically explore large quantities of temporal data across different deconstructions of time (half-hour, day, type of day, year) to find regular patterns or anomalies in behaviour?

## **Solution** : <span style="color:#FFDAB9"> Visualize probability distributions over different time granularities.

???

Exploratory data analysis was developed by **John Tukey** as a way of _*systematically*_ using the tools of statistics on a problem before hypotheses about the data were developed. It encourages breaking the big problem into pieces and focusing on subsets. So the reduced goal that I set for myself is to look at time only and to provide ... . The smart meter example is the one that motivated this problem; however, the idea is to provide the same for any temporal data following a hierarchy. The key terms are deconstructing time and visualizing distributions. In the next couple of slides, we will talk about the strengths and challenges of each of these.
---

# Deconstructing time: Arrangement

.pull-left[
#### **Granularities**: abstractions of time based on the calendar
<br> <br>
- <i> **Linear**</i> days, weeks, months, years <br>
- <i> **Circular** </i> day-of-week, month-of-year or hour-of-day <br>
- <i> **Aperiodic** </i> day-of-month, week-of-month
]

.pull-right[
<img src="images/linear-time.png" width="260%" height="150%" style="display: block; margin: auto;" />
<br>
<img src="images/circular.png" width="100%" style="display: block; margin: auto;" />
]

---

# Deconstructing time: Order

<i>**Single-order-up**</i> second-of-minute, hour-of-day <br>
<i>**Multiple-order-up**</i> second-of-hour, hour-of-week

<img src="images/calendar_new.jpg" width="100%" height="380" style="display: block; margin: auto;" />

---

# Computation of granularities

`\(z\)` : index of a tsibble <br>
`\(x\)`, `\(y\)` : two units in the hierarchy table with `\(order(x) < order(y)\)` <br>
`\(f(x, y)\)` : accessor function for computing the granularity <br>
`\(c(x, y)\)` : a constant relating `\(x\)` and `\(y\)`: the number of `\(x\)` units in one `\(y\)` unit <br>

#### **Single-order-up**

`$$f(x, y) = \lfloor z/c(z,x) \rfloor\mod c(x,y)$$`
where `\(y = x+1\)`, i.e. `\(y\)` is the unit immediately above `\(x\)` in the hierarchy

#### **Multiple-order-up**

`\begin{split} f(x,y) & = \sum_{i=0}^{order(y) - order(x) - 1} c(x, x+i)(f(x +i, x+i+1) - 1)\\ \end{split}`

---

## Interaction of bivariate granularities

**Harmonies** : pairs of granularities that aid exploratory data analysis

**Clashes** : pairs leading to structurally empty sets

<img src="images/clash.png" width="100%" style="display: block; margin: auto;" />

---

## Visualizing probability distributions

Breaking down the big problem: two granularities at a time.

#### Types of statistical distribution plots

<img src="figure/allplot-1.svg" style="display: block; margin: auto;" />

---

# R package: **gravitas**

.center[
### Computation ---
.left[
- Compute any granularity? <span style="color:Red">`create_gran` <br> <br>
- Exhaustive list of granularities to explore?
<span style="color:Red"> `search_gran` <br>
]
]

.pull-left[
### Interaction ---
- Check if bivariate granularities are harmonies/clashes? <span style="color:Red"> `is.harmony` <br> <br>
- List of harmonies to explore? <span style="color:Red"> `harmony` <br>
]

.pull-right[
### Visualization ---
- Best probability distribution plot for harmonies? <span style="color:Red"> `granplot` <br> <br>
- Sufficient observations? <span style="color:Red"> `gran_obs`
]

---

<!-- class: center,middle -->
<!-- # <span style="color:MediumVioletRed"> Package gravitas </span> -->
<!-- ## granularity visualization of time series data -->

## An example : Electricity smart meter data

<small><i>Data source</i></small> : [<small><i>Department of the Environment and Energy, Australia</i></small>](https://data.gov.au/dataset/4e21dea3-9b87-4610-94c7-15a8a77907ef)

```
#> # A tsibble: 1,450,232 x 8 [30m] <UTC>
#> # Key:       customer_id [50]
#>   customer_id reading_datetime    general_supply_…
#>   <chr>       <dttm>                         <dbl>
#> 1 10006414    2012-02-10 08:00:00            0.141
#> 2 10006414    2012-02-10 08:30:00            0.088
#> 3 10006414    2012-02-10 09:00:00            0.078
#> 4 10006414    2012-02-10 09:30:00            0.151
#> # … with 1.45e+06 more rows, and 5 more variables:
#> #   event_key <dbl>, controlled_load_kwh <dbl>,
#> #   gross_generation_kwh <dbl>,
#> #   net_generation_kwh <dbl>, other_kwh <dbl>
```

---

### Set of possible temporal granularities

<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> hour_day </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> hour_week </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> hour_month </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> day_week </td> </tr> <tr> <td style="text-align:left;"> 5
</td> <td style="text-align:left;"> day_month </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> week_month </td> </tr> </tbody> </table>

### So there are 156 pairs of granularities to look at. <span style="color:Red"> -----> <large> WHAT??? </large>

---

## Set of harmonies

### <span style="color:Orange"><large> Good news! Only 13 out of 156 are harmonies </large>

<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> facet_variable </th> <th style="text-align:left;"> x_variable </th> <th style="text-align:right;"> facet_levels </th> <th style="text-align:right;"> x_levels </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> hour_week </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 168 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> hour_week </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 168 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> hour_month </td> <td
style="text-align:right;"> 7 </td> <td style="text-align:right;"> 744 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> day_month </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> week_month </td> <td style="text-align:left;"> day_week </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> day_month </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> day_month </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> hour_day </td> <td style="text-align:left;"> week_month </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> day_week </td> <td style="text-align:left;"> week_month </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table>

---

## Visualize Harmonies

<img src="figure/granplotoverlay3-1.svg" style="display: block; margin: auto;" />

---

## Another example: Cricket data of Indian Premier League

<small><i>Data source</i></small>: [<small><i>Cricsheet</i></small>](http://cricsheet.org/) , [<small><i>Kaggle</i></small>](https://www.kaggle.com/josephgpinto/ipl-data-analysis/data)

```
#> Observations: 136,598
#> Variables: 38
#> $ season           <dbl> 2008, 2008, 2008, 2008, 200…
#> $ match_id         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ inning           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ over             <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, …
#> $ ball             <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, …
#> $ winner           <chr> "Kolkata Knight Riders", "K…
#> $ total_runs       <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 4, …
#> $ batting_team     <chr> "Kolkata Knight Riders", "K…
#> $ bowling_team     <chr> "Royal Challengers Bangalor…
#> $ batsman          <chr> "SC Ganguly", "BB McCullum"…
#> $ non_striker      <chr> "BB McCullum", "SC Ganguly"…
#> $ bowler           <chr> "P Kumar", "P Kumar", "P Ku…
#> $ is_super_over    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ wide_runs        <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ bye_runs         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ legbye_runs      <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ noball_runs      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ penalty_runs     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ batsman_runs     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, …
#> $ extra_runs       <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, …
#> $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ dismissal_kind   <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ fielder          <chr> NA, NA, NA, NA, NA, NA, NA,…
#> $ city             <chr> "Bangalore", "Bangalore", "…
#> $ date             <date> 2008-04-18, 2008-04-18, 20…
#> $ team1            <chr> "Kolkata Knight Riders", "K…
#> $ team2            <chr> "Royal Challengers Bangalor…
#> $ toss_winner      <chr> "Royal Challengers Bangalor…
#> $ toss_decision    <chr> "field", "field", "field", …
#> $ result           <chr> "normal", "normal", "normal…
#> $ dl_applied       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ win_by_runs      <dbl> 140, 140, 140, 140, 140, 14…
#> $ win_by_wickets   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ player_of_match  <chr> "BB McCullum", "BB McCullum…
#> $ venue            <chr> "M Chinnaswamy Stadium", "M…
#> $ umpire1          <chr> "Asad Rauf", "Asad Rauf", "…
#> $ umpire2          <chr> "RE Koertzen", "RE Koertzen…
#> $ umpire3          <lgl> NA, NA, NA, NA, NA, NA, NA,…
```

---

## Difference in strategy between two top teams

<img
src="images/cricketex.gif" style="display: block; margin: auto;" />

---
class: center, middle

### More Information

Package : https://github.com/Sayani07/gravitas

Slides: https://sayanigupta-ysc.netlify.com/

### With special thanks to

.pull-left[
#### <span style="color: White"> Supervisors <span style="color: Crimson"> Professor Rob J Hyndman <span style="color: White"> & <span style="color: Crimson"> Professor Dianne Cook <br>
#### <span style="color: White"> Slides created with <span style="color: Crimson"><i> Rmarkdown, knitr, xaringan, xaringanthemer</i> <br>
**<span style="color: Crimson"> Monash University**
]

.pull-right[
### <span style="color: White"> NUMBATS <br>
<img src="images/Numbats.png" width="100%" style="display: block; margin: auto;" />
]
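
---

## Appendix: computing granularities by hand

The single-order-up and multiple-order-up formulas from the "Computation of granularities" slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the gravitas implementation: it is written in plain Python so that it does not rely on any gravitas function signature, and the `HIERARCHY` table and the helper names `c`, `single_order_up` and `multiple_order_up` are hypothetical. It also assumes that the index `z` counts half hours from the start of a week and that granularity values are zero-based, so the `- 1` offset in the slide's multiple-order-up sum is not needed here.

```python
# Hypothetical hierarchy table for half-hourly data (an assumption for this sketch).
# Each entry is (unit, number of this unit in one of the next unit up).
# Months are left out because their length is not constant.
HIERARCHY = [("hhour", 2), ("hour", 24), ("day", 7), ("week", None)]
UNITS = [unit for unit, _ in HIERARCHY]


def c(x, y):
    """Constant relating x and y: how many x units make up one y unit."""
    i, j = UNITS.index(x), UNITS.index(y)
    if i > j:
        raise ValueError("x must not be above y in the hierarchy")
    size = 1
    for _, step in HIERARCHY[i:j]:
        size *= step
    return size


def single_order_up(z, x, y):
    """f(x, y) = floor(z / c(z, x)) mod c(x, y), for y one level above x."""
    if UNITS.index(y) - UNITS.index(x) != 1:
        raise ValueError("y must be exactly one level above x")
    # z is measured in the lowest unit of the hierarchy, so c(z, x) = c(UNITS[0], x).
    return (z // c(UNITS[0], x)) % c(x, y)


def multiple_order_up(z, x, y):
    """Sum the single-order-up values between x and y, each weighted by c(x, x+i)."""
    i, j = UNITS.index(x), UNITS.index(y)
    return sum(
        c(x, UNITS[k]) * single_order_up(z, UNITS[k], UNITS[k + 1])
        for k in range(i, j)
    )


z = 100  # the 100th half hour after the start of a week (zero-based)
hour_day = single_order_up(z, "hour", "day")
day_week = single_order_up(z, "day", "week")
hour_week = multiple_order_up(z, "hour", "week")
print(hour_day, day_week, hour_week)  # prints: 2 2 50
```

Half hour 100 falls in hour 2 of day 2, which is hour 2 + 24 × 2 = 50 of the week. Building `hour_week` from the two single-order-up values gives the same answer as taking the hour index modulo 168 directly.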