Accepts record-level data from a data frame, validates it against the expected type of content of each column, generates a collection of time series plots for visual inspection, and saves a report to disk.
Usage
daiquiri_report(
  df,
  field_types,
  override_column_names = FALSE,
  na = c("", "NA", "NULL"),
  dataset_description = NULL,
  aggregation_timeunit = "day",
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = NULL,
  show_progress = TRUE,
  log_directory = NULL
)
Arguments
- df
A data frame. Rectangular data can be read from file using read_data(). See Details.
- field_types
A field_types() object specifying the names and types of fields (columns) in the supplied df. See also field_types_available.
- override_column_names
If FALSE, column names in the supplied df must match the names specified in field_types exactly. If TRUE, column names in the supplied df will be replaced with the names specified in field_types. The specification must therefore contain the columns in the correct order. Default = FALSE.
- na
Vector containing strings that should be interpreted as missing values. Default = c("", "NA", "NULL").
- dataset_description
Short description of the dataset being checked. This will appear on the report. If blank, the name of the data frame object will be used.
- aggregation_timeunit
Unit of time to aggregate over. Specify one of "day", "week", "month", "quarter", "year". The "week" option is Monday-based. Default = "day".
- report_title
Title to appear on the report.
- save_directory
String specifying the directory in which to save the report. Default is the current directory.
- save_filename
String specifying the filename for the report, excluding any file extension. If no filename is supplied, one will be automatically generated with the format daiquiri_report_YYMMDD_HHMMSS.
- show_progress
Print progress to console. Default = TRUE.
- log_directory
String specifying the directory in which to save the log file. If no directory is supplied, progress is not logged.
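To make the "Monday-based" week convention concrete, the following base R sketch maps a few dates to the Monday that starts their week. It only illustrates the convention and is not taken from the package's internals; the dates and the variable names (dates, weekday, week_start) are hypothetical.

```r
# Hypothetical dates: a Monday, a Wednesday, a Sunday, and the following Monday
dates <- as.Date(c("2023-03-13", "2023-03-15", "2023-03-19", "2023-03-20"))

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday
weekday <- as.integer(format(dates, "%u"))

# Step back to the Monday that starts each date's week
week_start <- dates - (weekday - 1)

# The first three dates fall in the week starting 2023-03-13,
# the last one starts a new week on 2023-03-20
data.frame(date = dates, week_start = week_start)
```

Under this convention a Sunday belongs to the week that began on the preceding Monday, which is worth keeping in mind when comparing weekly plots against Sunday-based calendars.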
Value
A list containing information relating to the supplied parameters as well as the resulting daiquiri_source_data and daiquiri_aggregated_data objects.
Details
In order for the package to detect any non-conformant values in numeric or datetime fields, these should be present in the data frame in their raw character format. Rectangular data from a text file will automatically be read in as character type if you use the read_data() function. Data frame columns that are not of class character will still be processed according to the field_types specified.
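The reason raw character data matters can be seen in a small base R sketch. It is an illustration under assumed values, not the package's own validation logic: the vector dose_raw and the helper variables are hypothetical, and na_strings simply mirrors the default of the na argument.

```r
# Hypothetical raw values, kept as character as they would appear in a text file
dose_raw <- c("10", "20.5", "unknown", "", "NULL")

# Strings to treat as missing (mirrors the default of the na argument)
na_strings <- c("", "NA", "NULL")
is_missing <- dose_raw %in% na_strings

# A value is nonconformant here if it is not missing but cannot be parsed as a number
parsed <- suppressWarnings(as.numeric(dose_raw))
is_nonconformant <- !is_missing & is.na(parsed)

# With the character values available, missing and nonconformant can be told apart
data.frame(dose_raw, is_missing, is_nonconformant)

# After coercion alone the distinction is lost: all three problem values are just NA
sum(is.na(parsed))
```

If the column had been converted to numeric before being passed in, "unknown" would be indistinguishable from a genuinely missing value, so it could no longer be counted as nonconformant.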
Examples
# \donttest{
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)
# create a report in the current directory
daiq_obj <- daiquiri_report(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL"),
  dataset_description = "Example data provided with package",
  aggregation_timeunit = "day",
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = "example_data_report",
  show_progress = TRUE,
  log_directory = NULL
)
#> field_types supplied:
#> PrescriptionID <uniqueidentifier>
#> PrescriptionDate <timepoint> options: includes_time
#> AdmissionDate <datetime>
#> Drug <freetext>
#> Dose <numeric>
#> DoseUnit <categorical>
#> PatientID <ignore>
#> Location <categorical> options: aggregate_by_each_category
#>
#> Checking column names against field_types...
#> Importing source data [Example data provided with package]...
#> Checking data against field_types...
#> Selecting relevant warnings...
#> Identifying nonconformant values...
#> Checking and removing missing timepoints...
#> Checking for duplicates...
#> Sorting data...
#> Loading into source_data structure...
#> PrescriptionID
#> PrescriptionDate
#> AdmissionDate
#> Drug
#> Dose
#> DoseUnit
#> PatientID
#> Location
#> Finished
#> Aggregating [] by [day]...
#> Aggregating overall dataset...
#> Aggregating each data_field in turn...
#> 1: PrescriptionID
#> Preparing...
#> Aggregating character field...
#> By n
#> By missing_n
#> By missing_perc
#> By min_length
#> By max_length
#> By mean_length
#> Finished
#> 2: PrescriptionDate
#> Preparing...
#> Aggregating double field...
#> By n
#> By midnight_n
#> By midnight_perc
#> Finished
#> 3: AdmissionDate
#> Preparing...
#> Aggregating double field...
#> By n
#> By missing_n
#> By missing_perc
#> By nonconformant_n
#> By nonconformant_perc
#> By min
#> By max
#> Finished
#> 4: Drug
#> Preparing...
#> Aggregating character field...
#> By n
#> By missing_n
#> By missing_perc
#> Finished
#> 5: Dose
#> Preparing...
#> Aggregating double field...
#> By n
#> By missing_n
#> By missing_perc
#> By nonconformant_n
#> By nonconformant_perc
#> By min
#> By max
#> By mean
#> By median
#> Finished
#> 6: DoseUnit
#> Preparing...
#> Aggregating character field...
#> By n
#> By missing_n
#> By missing_perc
#> By distinct
#> Finished
#> 7: Location
#> Preparing...
#> Aggregating character field...
#> By n
#> By missing_n
#> By missing_perc
#> By distinct
#> By subcat_n
#> 2 categories found
#> 1: SITE1
#> 2: SITE2
#> By subcat_perc
#> 2 categories found
#> 1: SITE1
#> 2: SITE2
#> Finished
#> Aggregating calculated fields...
#> [DUPLICATES]:
#> Preparing...
#> Aggregating integer field...
#> By sum
#> By nonzero_perc
#> Finished
#> [ALL_FIELDS_COMBINED]:
#> Finished
#> Generating html report...
#>
#> processing file: report_htmldoc.Rmd
#> output file: report_htmldoc.knit.md
#>
#> Output created: example_data_report.html
#> Report saved to: ./example_data_report.html
file.remove("./example_data_report.html")
#> [1] TRUE
# }