The hyfer package

Eric Keen @ Hyfe


The hyfer package provides utilities for interacting with data collected by Hyfe cough detection apps (www.hyfe.ai). This package was designed to be used by Hyfe analysts and external research partners alike.

hyfer in a nutshell

Put simply, hyfer processes raw Hyfe data, which you download in a standard format referred to as a hyfe_data object, into a polished format for tables and plots. We refer to that post-processed, analysis-ready data simply as a hyfe object.


The following chunk of code shows you the whole game; use it as a template for starting your own analysis. The rest of the vignette explains each bit of this code, demonstrates other hyfer functions, and provides plot examples.

# Install hyfer
library(devtools) 
devtools::install_github('hyfe-ai/hyfer', force=TRUE, quiet=TRUE)
library(hyfer)

# Other dependencies
library(dplyr)
library(lubridate)
library(ggplot2)

# Bring in your hyfe_data object (here we use sample data)
data(hyfe_data)

# Process data for all users together
ho <- process_hyfe_data(hyfe_data)

# ... or process users separately
ho_by_user <- process_hyfe_data(hyfe_data, by_user = TRUE)

# Summarize data
hyfe_summarize(ho_by_user)

# Now ready for plotting, etc. 

Setup

Install hyfer

The hyfer package is maintained on GitHub and can be installed as follows:

library(devtools) 
devtools::install_github('hyfe-ai/hyfer', quiet=TRUE)
library(hyfer)

Hyfe data have been formatted for use with the tidyverse family of packages, particularly dplyr, lubridate, and ggplot2.

library(dplyr)
library(lubridate)
library(ggplot2)

Getting Hyfe data

This package assumes (1) you already have some Hyfe data locally on your computer, and (2) those data are structured in a standardized way, as a hyfe_data object (see next section).

Hyfe’s research collaborators can download data for their respective research cohorts from the Hyfe Research Dashboard. Hyfe’s internal analysts download data directly using hyferdrive, a private company package.

Both the dashboard and hyferdrive deliver data structured in exactly the same way, allowing both groups to utilize the functions offered in hyfer.

To get started with hyfer, begin by using a sample dataset that comes built into the package:

data(hyfe_data)

This sample dataset contains Hyfe data for two “super-users” of the Hyfe Cough Tracker app.

Structure of a hyfe_data object

All downloaded Hyfe data are provided in a standardized data format: a hyfe_data object. A hyfe_data object is simply a list with 6 standard slots.

names(hyfe_data)
#> [1] "id_key"          "sessions"        "sounds"          "locations"      
#> [5] "labels"          "cohort_settings"

A detailed description of each slot is provided below.

hyfe_data$id_key

The id_key slot provides the unique identifiers for each user represented in the data.

hyfe_data$id_key %>% head()
#>                            uid name                  email
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 <NA> navarra+73@hyfeapp.com
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2 <NA> navarra+12@hyfeapp.com
#>                    alias cohort_id
#> 1 navarra+73@hyfeapp.com   Navarra
#> 2 navarra+12@hyfeapp.com   Navarra

hyfe_data$sessions

The sessions slot provides details for each session of user activity for all users in the data.

hyfe_data$sessions %>% names()
#>  [1] "uid"         "start"       "stop"        "duration"    "session_id" 
#>  [6] "device_info" "name"        "email"       "alias"       "cohort_id"

hyfe_data$sessions %>% head()
#>                            uid      start       stop duration
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611696415 1611696415        0
#> 2 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611696454 1611696454        0
#> 3 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611699418 1611727018    27600
#> 4 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611728433 1611728433        0
#> 5 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611783683 1611813084    29401
#> 6 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611817248 1611817248        0
#>                             session_id
#> 1 b22421d1-6d80-4fd6-86c7-b4b2f070ecac
#> 2 6fc2ecbf-15e4-4160-ac33-3a2ce797b461
#> 3 5175dc5b-4470-4f51-b307-6cc2ded3d85e
#> 4 b3148b85-6176-4756-8de6-b94e4f8d6015
#> 5 c73c2d70-5e52-4fae-aa50-dbc8a8551ad3
#> 6 0905a507-49ab-4748-909e-92025c0fe87e
#>                                                                                                                          device_info
#> 1 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#> 2 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#> 3 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#> 4 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#> 5 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#> 6 {"id": "552b2002-d40c-4e6e-a6f4-7cda2074195b", "model": "LM-X430", "vendor": "LGE", "os_version": "28", "app_version": "a1.21.15"}
#>   name                  email                  alias cohort_id
#> 1 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra
#> 2 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra
#> 3 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra
#> 4 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra
#> 5 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra
#> 6 <NA> navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra

The start and stop times of each session of user activity are provided as numeric timestamps, as are all other date/time fields in the hyfe_data object. Though not easy to read, timestamps are an unambiguous, timezone-agnostic representation of date/time: they count the seconds since midnight UTC on January 1, 1970 (the Unix epoch).
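
For example, a raw timestamp can be converted to a readable date/time in base R (here using a session start time from the table above):

```r
# Convert a raw Hyfe timestamp (seconds since the Unix epoch)
# to a readable date/time in UTC:
ts <- 1611696415  # a session start time from the sessions table
as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
#> [1] "2021-01-26 21:26:55 UTC"
```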

hyfe_data$sounds

The sounds slot provides details for each explosive sound detected for all users in the data.

hyfe_data$sounds %>% names()
#>  [1] "uid"                "timestamp"          "prediction_score"  
#>  [4] "is_cough"           "onboarding_cough"   "loudness"          
#>  [7] "snr"                "loudness_threshold" "snr_threshold"     
#> [10] "highpass_frequency" "peak_start_offset"  "sound_id"          
#> [13] "session_id"         "url_peak"           "url_parent"        
#> [16] "name"               "email"              "alias"             
#> [19] "cohort_id"

hyfe_data$sounds %>% head()
#>                            uid  timestamp prediction_score is_cough
#> 1 9D7SChvklVa7zya0LdU6YVOi9QV2 1626794485      0.004492915    FALSE
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2 1626794936      0.005674792    FALSE
#> 3 9D7SChvklVa7zya0LdU6YVOi9QV2 1626794951      0.039426319    FALSE
#> 4 9D7SChvklVa7zya0LdU6YVOi9QV2 1626795054      0.806312203    FALSE
#> 5 9D7SChvklVa7zya0LdU6YVOi9QV2 1626795366      0.006854793    FALSE
#> 6 9D7SChvklVa7zya0LdU6YVOi9QV2 1626795462      0.020625576    FALSE
#>   onboarding_cough loudness      snr loudness_threshold snr_threshold
#> 1            FALSE 63.20647 27.22247                 58            18
#> 2            FALSE 59.17874 18.53862                 58            18
#> 3            FALSE 59.71023 28.82942                 58            18
#> 4            FALSE 66.05753 42.07871                 58            18
#> 5            FALSE 58.36067 33.21019                 58            18
#> 6            FALSE 67.97848 27.63222                 58            18
#>   highpass_frequency peak_start_offset                         sound_id
#> 1               0.35              3.14 7230f77e5db83ab4a8e888785bacbb35
#> 2               0.35              2.54 64f8f4da93b535bb8448f1752affd2e4
#> 3               0.35             18.18 4ae37ee879613a55a85481fc12ebef13
#> 4               0.35              0.82 1682bd40ea7535c3827f798eb74b92a0
#> 5               0.35             12.26 941483bb38733f278bf5cec5afc4a031
#> 6               0.35             17.18 630c34020351368788b6a0e071d64eab
#>   session_id                                                        url_peak
#> 1       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626794482637-recording-1.wav
#> 2       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626794933684-recording-1.wav
#> 3       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626794933684-recording-2.wav
#> 4       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626795053960-recording-1.wav
#> 5       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626795354670-recording-1.wav
#> 6       <NA> user/9D7SChvklVa7zya0LdU6YVOi9QV2/1626795444868-recording-1.wav
#>                                                      url_parent name
#> 1 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626794482637.m4a <NA>
#> 2 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626794933684.m4a <NA>
#> 3 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626794933684.m4a <NA>
#> 4 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626795053960.m4a <NA>
#> 5 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626795354670.m4a <NA>
#> 6 samples/9D7SChvklVa7zya0LdU6YVOi9QV2/sample-1626795444868.m4a <NA>
#>                    email                  alias cohort_id
#> 1 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
#> 2 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
#> 3 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
#> 4 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
#> 5 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
#> 6 navarra+12@hyfeapp.com navarra+12@hyfeapp.com   Navarra
  • The column prediction_score contains the probability that the explosive sound is a cough, based on Hyfe’s cough classification algorithms.

  • The column is_cough is a boolean (TRUE / FALSE) indicating whether the prediction score is above Hyfe’s cough prediction threshold of 0.85.

  • The column onboarding_cough is a boolean indicating whether the sound was collected while the user was onboarding (following instructions upon login to cough into the app). Since these are elicited coughs, it may be useful in certain analyses to ignore coughs for which onboarding_cough == TRUE.
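
Putting these columns together, a common pre-processing step is to keep only confirmed, non-elicited coughs. A minimal sketch using dplyr and the built-in sample data:

```r
library(dplyr)
library(hyfer)

# Load the built-in sample data
data(hyfe_data)

# Keep only sounds classified as coughs, excluding elicited
# onboarding coughs:
confirmed_coughs <- hyfe_data$sounds %>%
  filter(is_cough, !onboarding_cough)
```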

hyfe_data$locations

The locations slot provides details for each location fix for all users in the data.

hyfe_data$locations %>% names()
#>  [1] "uid"            "timestamp"      "longitude"      "latitude"      
#>  [5] "resolution"     "location_id"    "location_index" "app_version"   
#>  [9] "device_info"    "name"           "email"          "alias"         
#> [13] "cohort_id"

hyfe_data$locations %>% head()
#>  [1] uid            timestamp      longitude      latitude       resolution    
#>  [6] location_id    location_index app_version    device_info    name          
#> [11] email          alias          cohort_id     
#> <0 rows> (or 0-length row.names)

Note that some studies, such as the one from which this sample data were drawn, have location data services disabled.

hyfe_data$labels

This slot is currently empty. It is a placeholder: in the future, manually labelled sounds associated with a dataset will be included here in the hyfe_data object.

hyfe_data$cohort_settings

hyfe_data$cohort_settings %>% names()
#> [1] "cohort_id"             "timezone"              "is_virtual"           
#> [4] "h3_zoom_level"         "snr_threshold"         "loudness_threshold"   
#> [7] "location_data_enabled"

hyfe_data$cohort_settings
#>   cohort_id      timezone is_virtual h3_zoom_level snr_threshold
#> 1   Navarra Europe/Madrid       TRUE            15            18
#>   loudness_threshold location_data_enabled
#> 1                 58                  true

The cohort_settings slot will only be populated if the hyfe_data object is for a research cohort. Otherwise this slot will be NULL. Critically, cohort_settings contains the timezone used to determine local time in the function format_hyfe_time().

Processing Hyfe data

Once you download Hyfe data, the first step is to process it.

ho <- process_hyfe_data(hyfe_data,
                        verbose=TRUE)

This returns a standard hyfe object (ho for short), a named list with the original hyfe_data slots plus several new ones. These hyfe objects are formatted to make subsequent plots and analyses as simple as possible. The standard hyfe object structure is explored in detail in the next section.

By default, the process_hyfe_data() function lumps all user data together before summarizing, even if multiple users are present. To summarize each user separately, use the argument by_user:

ho_by_user <- process_hyfe_data(hyfe_data,
                        by_user = TRUE,
                        verbose=TRUE)

If you want to work with data from only a single user in a hyfe_data object containing data from multiple users, use the function filter_to_user() before processing. Your workflow would look like:

# Look at your ID options
hyfe_data$id_key

# Filter data to the first ID in that list
hyfe_data_1 <- filter_to_user(uid = hyfe_data$id_key$uid[1],
                     hyfe_data)

# Now process the data into a hyfe object
ho <- process_hyfe_data(hyfe_data_1)

As explained above, the argument uid refers to the anonymous identifier assigned to each user.

Structure of a hyfe object

Once a hyfe_data object is processed, it becomes a hyfe object. The structure and formatting of the hyfe object is designed to accommodate plotting and analysis.

ho %>% names
#>  [1] "id_key"          "sessions"        "sounds"          "locations"      
#>  [5] "labels"          "cohort_settings" "coughs"          "hours"          
#>  [9] "days"            "weeks"

The first several slots contain the raw data from the hyfe_data object, and those data are unchanged.

The coughs slot contains all explosive sounds classified as coughs, with various new date/time variables to streamline plotting and analysis.

ho$coughs %>% head
#>                            uid  timestamp prediction_score is_cough
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611700789        0.9998544     TRUE
#> 2 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611696403        0.6972452     TRUE
#> 3 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611700789        0.9998734     TRUE
#> 4 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611700790        0.9996773     TRUE
#> 5 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611721840        0.9906024     TRUE
#> 6 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1612070577        0.9858828     TRUE
#>   onboarding_cough loudness      snr loudness_threshold snr_threshold
#> 1            FALSE 71.69626 41.40092                 NA            NA
#> 2             TRUE 72.07907 29.02771                 NA            NA
#> 3            FALSE 76.74911 46.97339                 NA            NA
#> 4            FALSE 71.62092 39.48697                 NA            NA
#> 5            FALSE 71.43337 19.44805                 NA            NA
#> 6            FALSE 71.21107 40.48111                 NA            NA
#>   highpass_frequency peak_start_offset                         sound_id
#> 1                 NA             17.10 532bc8adcaf23eb491be98a58ecfc353
#> 2                 NA              0.50 a1c8b4213cab3c09ba3812a2d645ba5f
#> 3                 NA             16.30 da6d5dc3a20a3711a76ed619945a16b3
#> 4                 NA             17.62 f4121dcccabc3513a1768c2a4a988090
#> 5                 NA             10.98 4ff2d4513d693f4d8adbb148478d17ec
#> 6                 NA             14.22 e0bd3e48873b38759693df07d60ed307
#>   session_id                                                        url_peak
#> 1       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1611700772764-recording-2.wav
#> 2       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1611696402848-recording-1.wav
#> 3       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1611700772764-recording-1.wav
#> 4       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1611700772764-recording-3.wav
#> 5       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1611721829192-recording-1.wav
#> 6       <NA> user/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/1612070563377-recording-1.wav
#>                                                         url_parent name
#> 1    samples/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1611700772764.m4a <NA>
#> 2 onboarding/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1611696402848.m4a <NA>
#> 3    samples/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1611700772764.m4a <NA>
#> 4    samples/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1611700772764.m4a <NA>
#> 5    samples/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1611721829192.m4a <NA>
#> 6    samples/5Ue2PKP6KMUUbQcVIIjWu8rglIU2/sample-1612070563377.m4a <NA>
#>                    email                  alias cohort_id           date_time
#> 1 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-26 22:39:49
#> 2 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-26 21:26:43
#> 3 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-26 22:39:49
#> 4 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-26 22:39:50
#> 5 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-27 04:30:40
#> 6 navarra+73@hyfeapp.com navarra+73@hyfeapp.com   Navarra 2021-01-31 05:22:57
#>    tz       date date_floor date_ceiling year week yday hour study_week
#> 1 UTC 2021-01-26 1611619200   1611705600 2021    4   26   22         12
#> 2 UTC 2021-01-26 1611619200   1611705600 2021    4   26   21         12
#> 3 UTC 2021-01-26 1611619200   1611705600 2021    4   26   22         12
#> 4 UTC 2021-01-26 1611619200   1611705600 2021    4   26   22         12
#> 5 UTC 2021-01-27 1611705600   1611792000 2021    4   27    4         12
#> 6 UTC 2021-01-31 1612051200   1612137600 2021    5   31    5         13
#>   study_day study_hour frac_week frac_day frac_hour
#> 1        82       1953  11.61919 81.33436  1952.025
#> 2        82       1951  11.61194 81.28360  1950.806
#> 3        82       1953  11.61919 81.33436  1952.025
#> 4        82       1953  11.61920 81.33437  1952.025
#> 5        82       1958  11.65400 81.57801  1957.872
#> 6        86       2055  12.23062 85.61432  2054.744

Note that the final columns in the coughs table provide date and time information. The date_time column is an attempt to convert the UTC timestamp into local time, according to the timezone specified in ho$cohort_settings. These date/time fields are generated by the helper function format_hyfe_time(). See further details below.
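
Several of these fields follow directly from the timestamp. Here is a rough sketch of the kind of conversion format_hyfe_time() performs, reconstructed with lubridate (the helper's exact internals may differ):

```r
library(lubridate)

# Derive a few of the date/time fields by hand from a raw timestamp.
# This conversion stays in UTC, matching the tz column above.
ts <- 1611700789
dt <- as_datetime(ts, tz = "UTC")
dt        # 2021-01-26 22:39:49 UTC
date(dt)  # calendar date: 2021-01-26
week(dt)  # week of year: 4
yday(dt)  # day of year: 26
hour(dt)  # hour of day: 22
```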

The hours, days, and weeks slots hold summary timetables of session activity, peak/cough detections, and cough rates for the entire dataset.

ho$hours %>% head
#>    timestamp           date_time            tz       date date_floor
#> 1 1604617200 2020-11-06 00:00:00 Europe/Madrid 2020-11-06 1604617200
#> 2 1604620800 2020-11-06 01:00:00 Europe/Madrid 2020-11-06 1604617200
#> 3 1604624400 2020-11-06 02:00:00 Europe/Madrid 2020-11-06 1604617200
#> 4 1604628000 2020-11-06 03:00:00 Europe/Madrid 2020-11-06 1604617200
#> 5 1604631600 2020-11-06 04:00:00 Europe/Madrid 2020-11-06 1604617200
#> 6 1604635200 2020-11-06 05:00:00 Europe/Madrid 2020-11-06 1604617200
#>   date_ceiling year week yday hour study_week study_day study_hour   frac_week
#> 1   1604617200 2020   45  311    0          0         0          0 0.000000000
#> 2   1604703600 2020   45  311    1          1         1          1 0.005952381
#> 3   1604703600 2020   45  311    2          1         1          2 0.011904762
#> 4   1604703600 2020   45  311    3          1         1          3 0.017857143
#> 5   1604703600 2020   45  311    4          1         1          4 0.023809524
#> 6   1604703600 2020   45  311    5          1         1          5 0.029761905
#>     frac_day frac_hour n_uid session_seconds session_hours session_days peaks
#> 1 0.00000000         0     0               0             0            0     0
#> 2 0.04166667         1     0               0             0            0     0
#> 3 0.08333333         2     0               0             0            0     0
#> 4 0.12500000         3     0               0             0            0     0
#> 5 0.16666667         4     0               0             0            0     0
#> 6 0.20833333         5     0               0             0            0     0
#>   coughs cough_rate session_seconds_tot session_hours_tot session_days_tot
#> 1      0        NaN                   0                 0                0
#> 2      0        NaN                   0                 0                0
#> 3      0        NaN                   0                 0                0
#> 4      0        NaN                   0                 0                0
#> 5      0        NaN                   0                 0                0
#> 6      0        NaN                   0                 0                0
#>   peaks_tot coughs_tot
#> 1         0          0
#> 2         0          0
#> 3         0          0
#> 4         0          0
#> 5         0          0
#> 6         0          0
ho$days %>% as.data.frame %>% head
#>         date            tz date_floor date_ceiling year week yday study_week
#> 1 2020-11-06 Europe/Madrid 1604617200   1604617200 2020   45  311          0
#> 2 2020-11-07 Europe/Madrid 1604703600   1604703600 2020   45  312          1
#> 3 2020-11-08 Europe/Madrid 1604790000   1604790000 2020   45  313          1
#> 4 2020-11-09 Europe/Madrid 1604876400   1604876400 2020   45  314          1
#> 5 2020-11-10 Europe/Madrid 1604962800   1604962800 2020   45  315          1
#> 6 2020-11-11 Europe/Madrid 1605049200   1605049200 2020   46  316          1
#>   study_day n_uid session_seconds session_hours session_days peaks coughs
#> 1         0     1           30060      8.350000    0.3479167    24     24
#> 2         1     1           86244     23.956667    0.9981944    58     58
#> 3         2     1           86400     24.000000    1.0000000   140    140
#> 4         3     1           86400     24.000000    1.0000000    91     91
#> 5         4     1           34310      9.530556    0.3971065    23     23
#> 6         5     1           20285      5.634722    0.2347801     5      5
#>   cough_rate session_seconds_tot session_hours_tot session_days_tot peaks_tot
#> 1   68.98204               30060           8.35000        0.3479167        24
#> 2   58.10491              116304          32.30667        1.3461111        82
#> 3  140.00000              202704          56.30667        2.3461111       222
#> 4   91.00000              289104          80.30667        3.3461111       313
#> 5   57.91897              323414          89.83722        3.7432176       336
#> 6   21.29652              343699          95.47194        3.9779977       341
#>   coughs_tot
#> 1         24
#> 2         82
#> 3        222
#> 4        313
#> 5        336
#> 6        341
ho$weeks %>% as.data.frame %>% head
#>   week            tz date_floor date_ceiling year study_week n_uid
#> 1   45 Europe/Madrid 1604617200   1605049200 2020          0     1
#> 2   46 Europe/Madrid 1605049200   1605654000 2020          1     1
#> 3   47 Europe/Madrid 1605654000   1606258800 2020          2     1
#> 4   48 Europe/Madrid 1606258800   1606863600 2020          3     1
#> 5   49 Europe/Madrid 1606863600   1607468400 2020          4     1
#> 6   50 Europe/Madrid 1607468400   1608073200 2020          5     1
#>   session_seconds session_hours session_days peaks coughs cough_rate
#> 1          323414      89.83722     3.743218   336    336   628.3364
#> 2          394832     109.67556     4.569815   269    228   349.2483
#> 3          389373     108.15917     4.506632  4313    369   573.1553
#> 4          322348      89.54111     3.730880  4072    140   262.6726
#> 5          476167     132.26861     5.511192  2576    306   388.6636
#> 6          604163     167.82306     6.992627  5618    331   331.3490
#>   session_seconds_tot session_hours_tot session_days_tot peaks_tot coughs_tot
#> 1              323414          89.83722         3.743218       336        336
#> 2              718246         199.51278         8.313032       605        564
#> 3             1107619         307.67194        12.819664      4918        933
#> 4             1429967         397.21306        16.550544      8990       1073
#> 5             1906134         529.48167        22.061736     11566       1379
#> 6             2510297         697.30472        29.054363     17184       1710

Note that when processed with by_user = TRUE, the slot names are slightly different:

ho_by_user %>% names
#> [1] "id_key"          "sessions"        "sounds"          "locations"      
#> [5] "labels"          "cohort_settings" "coughs"          "user_summaries"

The user_summaries slot is itself a list; each of its slots pertains to a single user. Each user has a list of 4 tables: hours, days, weeks, and id_key.

# Data structure for first user 
ho_by_user$user_summaries[[1]] %>% names
#> [1] "hours"  "days"   "weeks"  "id_key"

Visualizing Hyfe data

Custom plotting functions are under construction. For now, ggplot2 works well with hyfe objects (see the examples in the Overview above).

Cumulative plots

plot_total(ho)

By default, plot_total() plots sessions. But you can specify detections of explosive sounds …

plot_total(ho, type='sounds')

… and of coughs:

plot_total(ho, type='coughs')

The default time unit for this function is days, but you can specify hours …

plot_total(ho, type='coughs', unit='hours')

… and for weeks as well:

plot_total(ho, type='coughs', unit='weeks')

This function, like the other plotting functions in hyfer, allows you to return the dataset underlying the plot in addition to (or instead of) producing the plot.

plot_total(ho,
           unit='weeks', 
           type='coughs', 
           print_plot = FALSE, 
           return_data = TRUE)
#> $data
#> # A tibble: 43 × 2
#>    x                       y
#>    <dttm>              <dbl>
#>  1 2020-11-05 23:00:00   336
#>  2 2020-11-10 23:00:00   564
#>  3 2020-11-17 23:00:00   933
#>  4 2020-11-24 23:00:00  1073
#>  5 2020-12-01 23:00:00  1379
#>  6 2020-12-08 23:00:00  1710
#>  7 2020-12-15 23:00:00  2005
#>  8 2020-12-22 23:00:00  2059
#>  9 2020-12-31 23:00:00  2247
#> 10 2020-12-29 23:00:00  2247
#> # … with 33 more rows

This function also allows you to return the ggplot object so that you can add to it before printing the plot:

plot_total(ho, 
           unit='days',
           type='sessions',
           print_plot=FALSE, 
           return_plot=TRUE)$plot +
  ggplot2::labs(title='Monitoring time (person-days)')

These plot functions also accept ho objects that are processed with by_user = TRUE:

plot_total(ho_by_user, unit='hours', type='coughs')

Note that the users’ individual datasets are pooled in order to make a single aggregate plot.

But when using ho_by_user, you can also plot each user as a separate series on the same plot:

plot_total(ho_by_user, unit='hours', type='coughs', by_user = TRUE)

Time series of counts

For all users together:

plot_timeseries(ho)

For each user separately:

plot_timeseries(ho_by_user, by_user=TRUE)

This plotting function has the same optional inputs as plot_total(), with the addition of an option for overlaying a running mean:

plot_timeseries(ho, type='coughs', unit = 'days', running_mean=7)

The unit of the running_mean argument is the same as that of the unit argument. Note that this option is honored only when your Hyfe data were processed with by_user = FALSE.

Time series of cough rate

For all users together:

plot_cough_rate(ho, unit='days', running_mean = 14)

For each user separately:

plot_cough_rate(ho_by_user, by_user=TRUE)

User trajectories

To overlay users such that all of their time series begin at the origin, use plot_trajectory(). This can be helpful when studying the evolution of cough in the days following enrollment or hospitalization, or when studying user retention in the days since signing up with the app. Note that this function only accepts Hyfe data that were processed with by_user = TRUE.

For each user separately:

plot_trajectory(ho_by_user, type='rate', unit = 'days')

For all user trajectories pooled together:

plot_trajectory(ho_by_user, type='coughs', unit = 'days', pool_users = TRUE)

Circadian patterns

For all users together:

plot_circadian(ho)

For each user separately:

plot_circadian(ho_by_user, by_user=TRUE)

Diagnostic plots

Diagnostic plots can be helpful for data review and technical troubleshooting. To quickly explore an entire cohort dataset at once, use plot_cohort_diagnostic():

plot_cohort_diagnostic(ho)

In this plot, each user has a row of data. The grey bars indicate session activity and red dots indicate cough detections.

To produce a diagnostic plot for a single user, use plot_user_diagnostic():

# Look at your ID options
hyfe_data$id_key
#>                            uid name                  email
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 <NA> navarra+73@hyfeapp.com
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2 <NA> navarra+12@hyfeapp.com
#>                    alias cohort_id
#> 1 navarra+73@hyfeapp.com   Navarra
#> 2 navarra+12@hyfeapp.com   Navarra

# Filter data to the first ID in that list
hyfe_data_1 <- filter_to_user(uid = hyfe_data$id_key$uid[1],
                     hyfe_data)

# Now process the data into a hyfe object for a single user
ho1 <- process_hyfe_data(hyfe_data_1)

plot_user_diagnostic(ho1)

The grey bars indicate session activity and the red dots indicate cough detections.

Analytical tools

Summarize Hyfe data

To get summary metrics for a hyfe object, use the function hyfe_summarize():

hyfe_summarize(ho)
#> $overall
#>   users  seconds   hours    days     years sounds coughs hourly_n hourly_rate
#> 1     2 20693233 5748.12 239.505 0.6561781 853382   7148     5407    1.287206
#>   hourly_var hourly_sd hourly_max daily_n daily_rate daily_var daily_sd
#> 1   15.51397  3.938778         62     259   30.14035   608.292 24.66358
#>   daily_max
#> 1       140
#> 
#> $users
#> NULL

Since the ho object was processed by aggregating all users together (note that the users slot in the output is NULL), the cough rates reported should be treated with caution: they will be biased toward users with (1) a lot of monitoring time and (2) a lot of coughs.

To summarize Hyfe data from multiple users in a balanced way, in which each user is weighted equally, use a hyfe object processed with by_user = TRUE:

hyfe_summarize(ho_by_user)
#> $overall
#>   users  seconds   hours     days     years sounds coughs hourly_n hourly_rate
#> 1     2 21069973 5852.77 243.8654 0.6681245 869312   7362     2933    1.043988
#>   hourly_var hourly_sd hourly_max daily_n daily_rate daily_var daily_sd
#> 1  0.1292396 0.3594991   1.298192   149.5   21.23788  189.6924 13.77289
#>   daily_max
#> 1  30.97678
#> 
#> $users
#>                            uid name                  email
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 <NA> navarra+73@hyfeapp.com
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2 <NA> navarra+12@hyfeapp.com
#>                    alias cohort_id users  seconds     hours      days
#> 1 navarra+73@hyfeapp.com   Navarra     1  1678967  466.3797  19.43249
#> 2 navarra+12@hyfeapp.com   Navarra     1 19391006 5386.3906 224.43294
#>        years sounds coughs hourly_n hourly_rate hourly_var hourly_sd hourly_max
#> 1 0.05323969  37946    343      464   0.7897837   7.182032  2.679931   23.98667
#> 2 0.61488477 831366   7019     5402   1.2981923  15.008396  3.874067   65.00000
#>   daily_n daily_rate daily_var daily_sd daily_max
#> 1      44   11.49898  257.4542 16.04538  74.29333
#> 2     255   30.97678  584.6859 24.18028 147.00000

In the users slot of the output, you now have a row summarizing each user. That user table is then used to build the overall slot. The mean rates (i.e., hourly_rate and daily_rate) are the averages of each user’s mean rates, and, importantly, the variability metrics (hourly_var, hourly_sd, daily_var, daily_sd) now pertain to the variability among users.

The hyfe_summarize() function uses sample-size cutoffs to ensure that rates are not swung to extremes by insufficient monitoring. For example, an hour of the day with 1 cough detection but only 1 minute of monitoring would produce an hourly cough rate estimate of 60 coughs per hour. Such scenarios should be avoided.

The default cutoffs are: at least 30 minutes of monitoring must occur within an hour-long window of the day in order for that hour to contribute to the estimation of the hourly cough rate; and at least 4 hours of monitoring must occur within a day in order for that day to count toward the daily cough rate.
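For clarity, you can pass those defaults explicitly. The argument units here are an assumption (cutoff_hourly in minutes of monitoring per hour, cutoff_daily in hours of monitoring per day, consistent with the stringent example further down); check the function documentation to confirm.

```r
# Equivalent to the described defaults (values assumed for illustration:
# 30 minutes of monitoring per hour, 4 hours of monitoring per day)
hyfe_summarize(ho_by_user,
               cutoff_hourly = 30,
               cutoff_daily = 4)
```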

You may adjust those defaults using the function arguments. For example, here is a much more stringent set of requirements, which may improve the accuracy of rate estimates but severely reduces sample size:

hyfe_summarize(ho_by_user,
               cutoff_hourly = 59,
               cutoff_daily = 23.9)
#> $overall
#>   users  seconds   hours     days     years sounds coughs hourly_n hourly_rate
#> 1     2 21069973 5852.77 243.8654 0.6681245 869312   7362   2841.5   0.9205046
#>   hourly_var hourly_sd hourly_max daily_n daily_rate daily_var daily_sd
#> 1  0.2615673 0.5114365   1.282145      81   26.19617  54.00036 7.348494
#>   daily_max
#> 1  31.39234
#> 
#> $users
#>                            uid name                  email
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 <NA> navarra+73@hyfeapp.com
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2 <NA> navarra+12@hyfeapp.com
#>                    alias cohort_id users  seconds     hours      days
#> 1 navarra+73@hyfeapp.com   Navarra     1  1678967  466.3797  19.43249
#> 2 navarra+12@hyfeapp.com   Navarra     1 19391006 5386.3906 224.43294
#>        years sounds coughs hourly_n hourly_rate hourly_var hourly_sd hourly_max
#> 1 0.05323969  37946    343      408   0.5588644   3.550132  1.884179         19
#> 2 0.61488477 831366   7019     5275   1.2821448  14.806009  3.847858         65
#>   daily_n daily_rate daily_var daily_sd daily_max
#> 1       1   21.00000        NA       NA        21
#> 2     161   31.39234  600.7896 24.51101       147

Note that the cumulative counts are unaffected by these cutoffs; only the rates are.

Cough rate distributions

To get details and summaries about the distribution of cough rates in your data, use the function cough_rate_distribution().

cough_rates <- cough_rate_distribution(ho_by_user, 
                                       min_session = 0.5)

This function can take both aggregated data (ho) and user-separated data (ho_by_user), but it does best with the latter. It returns metrics about hourly cough rates based on an hour-by-hour analysis. Similar to the cutoff arguments in hyfe_summarize(), the argument min_session lets you define the minimum amount of monitoring required during a single hour in order for that hour to be included in the cough rate estimation. For example, sometimes an hour of day contains only a few minutes of monitoring for a user; that makes for a poor estimate of that hour’s cough rate. The default min_session is 0.5 hours, i.e., 30 minutes of monitoring within an hour.

The function returns a list with four slots:

cough_rates %>% names()
#> [1] "overall" "users"   "rates"   "details"

The slot $overall returns a one-row summary of the entire dataset:

cough_rates$overall
#> # A tibble: 1 × 7
#>   mean_of_mean sd_of_mean mean_of_variance sd_of_variance n_hours_tot
#>          <dbl>      <dbl>            <dbl>          <dbl>       <int>
#> 1         1.04      0.359             11.1           5.53        5866
#> # … with 2 more variables: n_hours_mean <dbl>, n_uid <int>

These metrics are based on the mean/variance for each individual user, i.e., mean_of_mean is the average of mean cough rates across users. When using a hyfe object prepared with by_user=TRUE, this means that each user is weighted equally in the summary statistics. When using a hyfe object in which all user data are aggregated together, users will be weighted according to their session time.

The slot $users returns a summary for every user contained in the data:

cough_rates$users
#> # A tibble: 2 × 5
#>   uid                          rate_mean rate_variance n_hours n_uid
#>   <chr>                            <dbl>         <dbl>   <int> <int>
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2     0.790          7.18     464     1
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2     1.30          15.0     5402     1
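As a quick consistency check of the weighting described above, the overall mean_of_mean should simply be the unweighted average of the per-user means in the $users slot:

```r
# Each user is weighted equally: the average of per-user mean rates
# should match cough_rates$overall$mean_of_mean
mean(cough_rates$users$rate_mean)
```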

The slot $rates returns a numeric vector of hourly cough rates that satisfy the minimum monitoring threshold:

cough_rates$rates %>% head(100)
#>   [1]  3.000000  0.000000  0.000000  0.000000  0.000000  1.000000  0.000000
#>   [8]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [15]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [22]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [29]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  2.000000
#>  [36]  2.756508  2.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [43]  0.000000  2.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [50]  0.000000  0.000000  3.000000  0.000000  0.000000  0.000000  0.000000
#>  [57]  0.000000  0.000000  4.572396  0.000000  0.000000  0.000000  0.000000
#>  [64]  0.000000  0.000000  0.000000  3.835908  0.000000  0.000000  0.000000
#>  [71]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [78]  0.000000  0.000000  0.000000 23.986674  0.000000  0.000000  0.000000
#>  [85]  0.000000  0.000000  0.000000  1.000000  0.000000  0.000000  0.000000
#>  [92]  0.000000  0.000000  0.000000  0.000000  0.000000  1.000000  0.000000
#>  [99]  0.000000  0.000000
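Because $rates is a plain numeric vector, base R summaries apply to it directly; for example:

```r
# Distribution summary of the hourly cough rates
summary(cough_rates$rates)

# Upper quantiles, useful for spotting high-rate hours
quantile(cough_rates$rates, probs = c(0.5, 0.9, 0.99))
```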

The slot $details returns a dataframe with all details you might need to analyze these rates (essentially the hours table from a hyfe object):

cough_rates$details %>% head()
#>                            uid  timestamp           date_time            tz
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611700741 2021-01-26 23:39:01 Europe/Madrid
#> 2 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611704341 2021-01-27 00:39:01 Europe/Madrid
#> 3 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611707941 2021-01-27 01:39:01 Europe/Madrid
#> 4 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611711541 2021-01-27 02:39:01 Europe/Madrid
#> 5 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611715141 2021-01-27 03:39:01 Europe/Madrid
#> 6 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 1611718741 2021-01-27 04:39:01 Europe/Madrid
#>         date date_floor date_ceiling year week yday hour study_week study_day
#> 1 2021-01-26 1611615600   1611702000 2021    4   26   23         12        82
#> 2 2021-01-27 1611702000   1611788400 2021    4   27    0         12        82
#> 3 2021-01-27 1611702000   1611788400 2021    4   27    1         12        82
#> 4 2021-01-27 1611702000   1611788400 2021    4   27    2         12        82
#> 5 2021-01-27 1611702000   1611788400 2021    4   27    3         12        82
#> 6 2021-01-27 1611702000   1611788400 2021    4   27    4         12        82
#>   study_hour frac_week frac_day frac_hour n_uid session_hours session_days
#> 1       1952  11.61905 81.33333      1952     1             1   0.04166667
#> 2       1953  11.62500 81.37500      1953     1             1   0.04166667
#> 3       1954  11.63095 81.41667      1954     1             1   0.04166667
#> 4       1955  11.63690 81.45833      1955     1             1   0.04166667
#> 5       1956  11.64286 81.50000      1956     1             1   0.04166667
#> 6       1957  11.64881 81.54167      1957     1             1   0.04166667
#>   peaks coughs cough_rate
#> 1    10      3          3
#> 2     0      0          0
#> 3     0      0          0
#> 4     0      0          0
#> 5     0      0          0
#> 6     1      1          1

This function should make it straightforward to plot cough rate histograms…

ggplot(cough_rates$details, 
       aes(x=cough_rate)) + 
  geom_histogram() + 
  xlab('Coughs per hour') + ylab('Count') + 
  facet_wrap(~uid)

… or scatterplots of the relationship between cough rate mean and cough rate variance among users:

ggplot(cough_rates$users, 
       aes(x=rate_mean, y=rate_variance)) + 
  geom_point() + 
  xlab('Mean cough rate (coughs per hour)') + ylab('Variance in cough rate')

Simulating coughs

To generate a fake timeseries of coughs, use the function simulate_cougher().

demo <- simulate_cougher(rate_mean = 3)

This function returns a dataframe with hourly cough counts based on the mean cough rate you provide:

par(mar=c(4.2,4.2,.5,.5))
plot(coughs ~ date_time, data=demo, cex=.5, pch=16)

By default, this function returns a month-long timeseries of coughs simulated using a negative binomial distribution in which the variance is predicted based on the mean using a regression Hyfe has developed using a 600-participant dataset from northern Spain. You can also specify your own variance using the rate_variance argument. (There are many other arguments as well – see the function’s documentation for more).
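For example, using the rate_variance argument mentioned above (the values here are arbitrary, chosen only for illustration):

```r
# Simulate a cougher with a user-specified variance instead of the
# variance predicted from Hyfe's mean~variance regression
demo_var <- simulate_cougher(rate_mean = 3,
                             rate_variance = 12)
```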

Monitoring time requirements

For a visual assessment of how much monitoring is needed in order to accurately estimate a user’s overall mean cough rate, use the function plot_cough_rate_error().

# Generate a cough time series
demo <- simulate_cougher(rate_mean = 3, random_seed = 124)

# Plot cough rate error
demo_error <- plot_cough_rate_error(demo)

Fit a variance~mean regression model

For a cohort of users, you can use fit_model_to_cough() to fit a model of the relationship between the mean and variance of cough rate. Knowing this relationship allows you to simulate realistic time series for any cough rate.

To do this, you would need a hyfe object, processed with by_user = TRUE, with many users. Here is what your code would look like:

fit_model_to_cough(ho_by_user,
                   cutoff_hours = 50,
                   cutoff_hourly = .5,
                   toplot=TRUE)

See the function documentation to understand the details of these arguments and what is returned. This function returns various details in a list (p-values, R-squared, model objects, etc.). For now, note that you can feed the model coefficients directly into another function, predict_cough_variance(), which returns a variance prediction based upon a user-specified cough rate and the model coefficients:

predict_cough_variance(cough_rate,
                       b1,
                       intercept)
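As a sketch of how these two functions might chain together. Note that the coefficient slot names b1 and intercept below are illustrative assumptions, not confirmed names; inspect the fitted list with names() first.

```r
# Fit the variance~mean model, then predict the variance expected
# at a mean rate of 5 coughs per hour.
# NOTE: fit$b1 and fit$intercept are assumed slot names (illustration only).
fit <- fit_model_to_cough(ho_by_user,
                          cutoff_hours = 50,
                          cutoff_hourly = .5,
                          toplot = FALSE)

predict_cough_variance(cough_rate = 5,
                       b1 = fit$b1,
                       intercept = fit$intercept)
```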

Cough bouts

The cough_bouts() function allows you to “pool” coughs that occur very close in time. This option can be useful in a medical context as well as in clinical validation experiments.

data(ho)
coughs <- ho$coughs
bouts <- cough_bouts(coughs,
                     bout_window = 2,
                     bout_limit = Inf,
                     verbose=FALSE)

In the code above, the inputs specify that coughs occurring within 2 seconds of each other should be pooled into a single cough bout, and that there is no limit to the number of coughs that can occur in a single bout.

Compare the number of coughs to the number of bouts:

coughs %>% nrow
#> [1] 15572
bouts %>% nrow
#> [1] 14527

Examine how many coughs were contained in these bouts:

bouts$n_coughs %>% table 
#> .
#>     1     2     3     4     5     6     7     8    10 
#> 13722   637   133    14    12     6     1     1     1

Most bouts contain a single cough, but some contain two or three, and a few contain more than that.
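The n_coughs column makes such summaries easy; for example, the share of multi-cough bouts (roughly 5% in this sample):

```r
# Proportion of bouts containing more than one cough
mean(bouts$n_coughs > 1)

# Largest number of coughs pooled into a single bout
max(bouts$n_coughs)
```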

Evaluate Hyfe performance

Synchronize detections & labels

Comparing Hyfe performance to a groundtruth, such as a set of labeled detections, requires that the two sets of events are synchronized. Even if Hyfe’s system time differs from the labeler’s clock by a second or two, that offset can complicate and confuse the performance evaluation process. Use the function synchronize() to find the offset correction needed for Hyfe detections to be synchronized to a set of reference/label times.

Say you did a “field test” in which you coughed into an MP3 recorder and a Hyfe phone a few dozen times, and you want to see how well Hyfe performed at detecting all of those coughs. A friend reviews the MP3 file and labels each sound according to Hyfe’s 4-tier labeling system. Your table of labels looks like this:

#>        times labels
#> 1 1638636126      3
#> 2 1638636133      3
#> 3 1638636147      3
#> 4 1638636153      2
#> 5 1638636163      3
#> 6 1638636178      2

You download your Hyfe data and use the ho$sounds slot to find the timestamp and prediction for the sounds in your test. A simplified table of your detections may look like this:

#>        times predictions
#> 1 1638661299        TRUE
#> 2 1638661306        TRUE
#> 3 1638661320        TRUE
#> 4 1638661326       FALSE
#> 5 1638661336        TRUE
#> 6 1638661351       FALSE

Let’s synchronize these detections to your labels. (These are fake data that we generated for this example. The true time offset is 6 hours, 59 minutes, and 33 seconds ahead of the labels, or a total of 25173 seconds). The function should return the same offset:

synchronize(reference_times = reference$times,
            reference_labels = reference$labels,
            hyfe_times = detections$times,
            hyfe_predictions = detections$predictions)

#> [1] -25173

The offset is negative because adding this number to the Hyfe detection timestamps makes them match the label timestamps.
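Applying the correction is then a single addition:

```r
# Store the returned offset, then shift the Hyfe timestamps so they
# line up with the label timestamps
offset <- synchronize(reference_times = reference$times,
                      reference_labels = reference$labels,
                      hyfe_times = detections$times,
                      hyfe_predictions = detections$predictions)

detections$times <- detections$times + offset
```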

Background functions in hyfer

The function process_hyfe_data() relies on several background functions to do its work. Those functions can also be called directly if you need them:

format_hyfe_time()

This function takes a set of timestamps and creates a dataframe of various date/time variables that will be useful in subsequent hyfer functions.

format_hyfe_time(c(1626851363, 1626951363))
#>    timestamp           date_time  tz       date date_floor date_ceiling year
#> 1 1626851363 2021-07-21 07:09:23 UTC 2021-07-21 1626825600   1626912000 2021
#> 2 1626951363 2021-07-22 10:56:03 UTC 2021-07-22 1626912000   1626998400 2021
#>   week yday hour study_week study_day study_hour frac_week frac_day frac_hour
#> 1   29  202    7          0         0          0 0.0000000 0.000000   0.00000
#> 2   29  203   10          1         2         28 0.1653439 1.157407  27.77778

When a timezone is not provided, as above, the function assumes that times are in UTC. You can specify that explicitly if you wish:

format_hyfe_time(c(1626851363, 1626951363), 'UTC')
#>    timestamp           date_time  tz       date date_floor date_ceiling year
#> 1 1626851363 2021-07-21 07:09:23 UTC 2021-07-21 1626825600   1626912000 2021
#> 2 1626951363 2021-07-22 10:56:03 UTC 2021-07-22 1626912000   1626998400 2021
#>   week yday hour study_week study_day study_hour frac_week frac_day frac_hour
#> 1   29  202    7          0         0          0 0.0000000 0.000000   0.00000
#> 2   29  203   10          1         2         28 0.1653439 1.157407  27.77778

This function accepts any timezone listed in R’s built-in collection of timezones (see OlsonNames()).

format_hyfe_time(c(1626851363, 1626951363), 'Africa/Kampala')
#>    timestamp           date_time             tz       date date_floor
#> 1 1626851363 2021-07-21 10:09:23 Africa/Kampala 2021-07-21 1626814800
#> 2 1626951363 2021-07-22 13:56:03 Africa/Kampala 2021-07-22 1626901200
#>   date_ceiling year week yday hour study_week study_day study_hour frac_week
#> 1   1626901200 2021   29  202   10          0         0          0 0.0000000
#> 2   1626987600 2021   29  203   13          1         2         28 0.1653439
#>   frac_day frac_hour
#> 1 0.000000   0.00000
#> 2 1.157407  27.77778
format_hyfe_time(c(1626851363, 1626951363), 'America/Chicago')
#>    timestamp           date_time              tz       date date_floor
#> 1 1626851363 2021-07-21 02:09:23 America/Chicago 2021-07-21 1626843600
#> 2 1626951363 2021-07-22 05:56:03 America/Chicago 2021-07-22 1626930000
#>   date_ceiling year week yday hour study_week study_day study_hour frac_week
#> 1   1626930000 2021   29  202    2          0         0          0 0.0000000
#> 2   1627016400 2021   29  203    5          1         2         28 0.1653439
#>   frac_day frac_hour
#> 1 0.000000   0.00000
#> 2 1.157407  27.77778


expand_sessions()

Most analyses of Hyfe data hinge upon detailed knowledge of when Hyfe was actively listening for coughs, and when it wasn’t. To determine the duration of monitoring on an hourly or daily basis, use the expand_sessions() function.

This function returns a list with two slots: timetable and series. By default, series is returned as a NULL object since it is usually only needed for troubleshooting and can be time-consuming to prepare. The timetable is a dataframe in which monitoring activity is detailed for each individual user in the dataset on an hourly or daily basis.

To create an hourly time table:

hyfe_time <- expand_sessions(hyfe_data, 
                             unit='hour',
                             verbose=TRUE)

hyfe_time$timetable %>% nrow
#> [1] 13920

hyfe_time$timetable %>% head
#>    timestamp           date_time            tz       date date_floor
#> 1 1604617200 2020-11-06 00:00:00 Europe/Madrid 2020-11-06 1604617200
#> 2 1604620800 2020-11-06 01:00:00 Europe/Madrid 2020-11-06 1604617200
#> 3 1604624400 2020-11-06 02:00:00 Europe/Madrid 2020-11-06 1604617200
#> 4 1604628000 2020-11-06 03:00:00 Europe/Madrid 2020-11-06 1604617200
#> 5 1604631600 2020-11-06 04:00:00 Europe/Madrid 2020-11-06 1604617200
#> 6 1604635200 2020-11-06 05:00:00 Europe/Madrid 2020-11-06 1604617200
#>   date_ceiling year week yday hour study_week study_day study_hour   frac_week
#> 1   1604617200 2020   45  311    0          0         0          0 0.000000000
#> 2   1604703600 2020   45  311    1          1         1          1 0.005952381
#> 3   1604703600 2020   45  311    2          1         1          2 0.011904762
#> 4   1604703600 2020   45  311    3          1         1          3 0.017857143
#> 5   1604703600 2020   45  311    4          1         1          4 0.023809524
#> 6   1604703600 2020   45  311    5          1         1          5 0.029761905
#>     frac_day frac_hour                          uid session_time
#> 1 0.00000000         0 9D7SChvklVa7zya0LdU6YVOi9QV2            0
#> 2 0.04166667         1 9D7SChvklVa7zya0LdU6YVOi9QV2            0
#> 3 0.08333333         2 9D7SChvklVa7zya0LdU6YVOi9QV2            0
#> 4 0.12500000         3 9D7SChvklVa7zya0LdU6YVOi9QV2            0
#> 5 0.16666667         4 9D7SChvklVa7zya0LdU6YVOi9QV2            0
#> 6 0.20833333         5 9D7SChvklVa7zya0LdU6YVOi9QV2            0

You can then, for example, summarize session activity for the two users in the sample dataset:

hyfe_time$timetable %>% 
  group_by(uid) %>% 
  summarize(hours_monitored = sum(session_time) / 3600,
            days_monitored = sum(session_time) / 86400)
#> # A tibble: 2 × 3
#>   uid                          hours_monitored days_monitored
#>   <chr>                                  <dbl>          <dbl>
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2            466.           19.4
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2           5282.          220.

To create a daily time table:

hyfe_time <- expand_sessions(hyfe_data, 
                             unit='day')

hyfe_time$timetable %>% 
  group_by(uid) %>% 
  summarize(hours_monitored = sum(session_time) / 3600,
            days_monitored = sum(session_time) / 86400)
#> # A tibble: 2 × 3
#>   uid                          hours_monitored days_monitored
#>   <chr>                                  <dbl>          <dbl>
#> 1 5Ue2PKP6KMUUbQcVIIjWu8rglIU2            466.           19.4
#> 2 9D7SChvklVa7zya0LdU6YVOi9QV2           5283.          220.

Instead of a summary of session activity, the series slot contains a continuous second-by-second time series:

hyfe_time <- expand_sessions(hyfe_data,
                             create_table = FALSE,
                             create_series = TRUE,
                             inactive_value = 0)

In this time series, every row is a second between the floor_date and ceiling_date of the study, and every column is a user (uid). Seconds in which the user is active are represented with a “1”. Inactive seconds are given the value of the inactive_value argument, which defaults to “0”.

hyfe_time$series %>% head
#>    timestamp 9D7SChvklVa7zya0LdU6YVOi9QV2 5Ue2PKP6KMUUbQcVIIjWu8rglIU2
#> 1 1604617200                            0                            0
#> 2 1604617201                            0                            0
#> 3 1604617202                            0                            0
#> 4 1604617203                            0                            0
#> 5 1604617204                            0                            0
#> 6 1604617205                            0                            0

Confirm that the same total monitoring duration, in days, is found using the series approach:

hyfe_time$series %>% select(2,3) %>% apply(2,sum) / 86400
#> 9D7SChvklVa7zya0LdU6YVOi9QV2 5Ue2PKP6KMUUbQcVIIjWu8rglIU2 
#>                    220.07252                     19.43249

Note that this series feature is only useful in certain circumstances, and it can create enormous objects that slow everything down. However, it can be particularly valuable during troubleshooting if a phone seems to be acting up.

Tip: Changing inactive_value to NA may make it easier to plot session activity as lines on plots.

hyfe_time <- expand_sessions(hyfe_data,
                             create_table = FALSE,
                             create_series = TRUE,
                             inactive_value = NA)

# Setup plot
par(mar=c(4.2,4.2,.5,.5))
plot(1, type='n', 
     xlim=range(hyfe_time$series$timestamp), 
     ylim=c(0,3),
     xlab='Timestamp',
     ylab='User')

# Add user 1
lines(x = hyfe_time$series$timestamp,
      y = hyfe_time$series[,2])

# Add user 2
lines(x = hyfe_time$series$timestamp,
      y = hyfe_time$series[,3] + 1)

hyfe_timetables()

To create hourly/daily/weekly summaries of session activity, peak/cough detections, and cough rates, use the function hyfe_timetables().

hyfe_tables <- hyfe_timetables(hyfe_data,
                               verbose=TRUE)

This function is essentially a wrapper that calls both expand_sessions() and format_hyfe_time(). Note: This function lumps all users together.

This function returns a named list:

names(hyfe_tables)
#> [1] "hours" "days"  "weeks"

Hourly summary table:

hyfe_tables$hours %>% as.data.frame %>% head
#>    timestamp           date_time            tz       date date_floor
#> 1 1604617200 2020-11-06 00:00:00 Europe/Madrid 2020-11-06 1604617200
#> 2 1604620800 2020-11-06 01:00:00 Europe/Madrid 2020-11-06 1604617200
#> 3 1604624400 2020-11-06 02:00:00 Europe/Madrid 2020-11-06 1604617200
#> 4 1604628000 2020-11-06 03:00:00 Europe/Madrid 2020-11-06 1604617200
#> 5 1604631600 2020-11-06 04:00:00 Europe/Madrid 2020-11-06 1604617200
#> 6 1604635200 2020-11-06 05:00:00 Europe/Madrid 2020-11-06 1604617200
#>   date_ceiling year week yday hour study_week study_day study_hour   frac_week
#> 1   1604617200 2020   45  311    0          0         0          0 0.000000000
#> 2   1604703600 2020   45  311    1          1         1          1 0.005952381
#> 3   1604703600 2020   45  311    2          1         1          2 0.011904762
#> 4   1604703600 2020   45  311    3          1         1          3 0.017857143
#> 5   1604703600 2020   45  311    4          1         1          4 0.023809524
#> 6   1604703600 2020   45  311    5          1         1          5 0.029761905
#>     frac_day frac_hour n_uid session_seconds session_hours session_days peaks
#> 1 0.00000000         0     0               0             0            0     0
#> 2 0.04166667         1     0               0             0            0     0
#> 3 0.08333333         2     0               0             0            0     0
#> 4 0.12500000         3     0               0             0            0     0
#> 5 0.16666667         4     0               0             0            0     0
#> 6 0.20833333         5     0               0             0            0     0
#>   coughs cough_rate session_seconds_tot session_hours_tot session_days_tot
#> 1      0        NaN                   0                 0                0
#> 2      0        NaN                   0                 0                0
#> 3      0        NaN                   0                 0                0
#> 4      0        NaN                   0                 0                0
#> 5      0        NaN                   0                 0                0
#> 6      0        NaN                   0                 0                0
#>   peaks_tot coughs_tot
#> 1         0          0
#> 2         0          0
#> 3         0          0
#> 4         0          0
#> 5         0          0
#> 6         0          0

Daily summary table:

hyfe_tables$days %>% as.data.frame %>% head
#>         date            tz date_floor date_ceiling year week yday study_week
#> 1 2020-11-06 Europe/Madrid 1604617200   1604617200 2020   45  311          0
#> 2 2020-11-07 Europe/Madrid 1604703600   1604703600 2020   45  312          1
#> 3 2020-11-08 Europe/Madrid 1604790000   1604790000 2020   45  313          1
#> 4 2020-11-09 Europe/Madrid 1604876400   1604876400 2020   45  314          1
#> 5 2020-11-10 Europe/Madrid 1604962800   1604962800 2020   45  315          1
#> 6 2020-11-11 Europe/Madrid 1605049200   1605049200 2020   46  316          1
#>   study_day n_uid session_seconds session_hours session_days peaks coughs
#> 1         0     1           30060      8.350000    0.3479167    24     24
#> 2         1     1           86244     23.956667    0.9981944    58     58
#> 3         2     1           86400     24.000000    1.0000000   140    140
#> 4         3     1           86400     24.000000    1.0000000    91     91
#> 5         4     1           34310      9.530556    0.3971065    23     23
#> 6         5     1           20285      5.634722    0.2347801     5      5
#>   cough_rate session_seconds_tot session_hours_tot session_days_tot peaks_tot
#> 1   68.98204               30060           8.35000        0.3479167        24
#> 2   58.10491              116304          32.30667        1.3461111        82
#> 3  140.00000              202704          56.30667        2.3461111       222
#> 4   91.00000              289104          80.30667        3.3461111       313
#> 5   57.91897              323414          89.83722        3.7432176       336
#> 6   21.29652              343699          95.47194        3.9779977       341
#>   coughs_tot
#> 1         24
#> 2         82
#> 3        222
#> 4        313
#> 5        336
#> 6        341

Weekly summary table:

hyfe_tables$weeks %>% as.data.frame %>% head
#>   week            tz date_floor date_ceiling year study_week n_uid
#> 1   45 Europe/Madrid 1604617200   1605049200 2020          0     1
#> 2   46 Europe/Madrid 1605049200   1605654000 2020          1     1
#> 3   47 Europe/Madrid 1605654000   1606258800 2020          2     1
#> 4   48 Europe/Madrid 1606258800   1606863600 2020          3     1
#> 5   49 Europe/Madrid 1606863600   1607468400 2020          4     1
#> 6   50 Europe/Madrid 1607468400   1608073200 2020          5     1
#>   session_seconds session_hours session_days peaks coughs cough_rate
#> 1          323414      89.83722     3.743218   336    336   628.3364
#> 2          394832     109.67556     4.569815   269    228   349.2483
#> 3          389373     108.15917     4.506632  4313    369   573.1553
#> 4          322348      89.54111     3.730880  4072    140   262.6726
#> 5          476167     132.26861     5.511192  2576    306   388.6636
#> 6          604163     167.82306     6.992627  5618    331   331.3490
#>   session_seconds_tot session_hours_tot session_days_tot peaks_tot coughs_tot
#> 1              323414          89.83722         3.743218       336        336
#> 2              718246         199.51278         8.313032       605        564
#> 3             1107619         307.67194        12.819664      4918        933
#> 4             1429967         397.21306        16.550544      8990       1073
#> 5             1906134         529.48167        22.061736     11566       1379
#> 6             2510297         697.30472        29.054363     17184       1710