Title: | Non-Parametric Sampling with Parallel Monte Carlo |
---|---|
Description: | An implementation of a non-parametric statistical model using a parallelised Monte Carlo sampling scheme. The method implemented in this package allows non-parametric inference to be regularized for small sample sizes, while also being more accurate than approximations such as variational Bayes. The concentration parameter is an effective sample size parameter, determining the faith we have in the model versus the data. When the concentration is low, the samples are close to the exact Bayesian logistic regression method; when the concentration is high, the samples are close to the simplified variational Bayes logistic regression. The method is described in full in the paper Lyddon, Walker, and Holmes (2018), "Nonparametric learning from Bayesian models with randomized objective functions" <arXiv:1806.11544>. |
Authors: | Simon Lyddon [aut], Miguel Morin [aut], James Robinson [aut, cre], Matt Craddock [ctb], The Alan Turing Institute [cph] |
Maintainer: | James Robinson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3 |
Built: | 2024-10-12 06:15:04 UTC |
Source: | https://github.com/alan-turing-institute/posteriorbootstrap |
The PosteriorBootstrap package provides two categories of functions. The first category returns or loads the system files that ship with the package: get_stan_file, get_german_credit_file, get_german_credit_dataset. The second category performs statistical sampling: draw_stick_breaks and draw_logit_samples (for adaptive non-parametric learning of the logistic regression model).
Please see the vignette for sample usage and performance metrics.
Maintainer: James Robinson [email protected]
Authors:
Simon Lyddon [email protected]
Miguel Morin [email protected]
Other contributors:
The Alan Turing Institute [email protected] [copyright holder]
Useful links:
https://github.com/alan-turing-institute/PosteriorBootstrap/
Report bugs at https://github.com/alan-turing-institute/PosteriorBootstrap/issues
draw_logit_samples returns samples of the parameter of interest in a logistic regression.
draw_logit_samples( x, y, concentration, n_bootstrap = 100, posterior_sample = NULL, gamma_mean = NULL, gamma_vcov = NULL, threshold = 1e-08, num_cores = 1, show_progress = FALSE )
x |
The features of the data. |
y |
The outcomes of the data (either |
concentration |
The parameter |
n_bootstrap |
The number of bootstrap samples required. |
posterior_sample |
The function can take samples from the posterior to
generate non-parametric-learning samples, or it can take NULL and the
posterior is assumed normal N( |
gamma_mean |
In case |
gamma_vcov |
In case |
threshold |
The threshold of stick remaining below which the function stops looking for more stick-breaks. It corresponds to epsilon in the paper, at the bottom of page 5 and in algorithm 2 on page 12. |
num_cores |
Number of processor cores for the parallel run of the
algorithm. See |
show_progress |
Boolean whether to show the progress of the algorithm in a progress bar. |
This function implements the non-parametric-learning algorithm, which is algorithm 2 on page 12 of the paper. It uses a mixture of Dirichlet processes and stick-breaking to find the number of posterior samples, and logistic regression to find the randomized parameter of interest. For examples, see the vignette.
A matrix of bootstrap samples for the parameter of interest.
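The idea of obtaining each sample by fitting a logistic regression to a randomly reweighted objective can be illustrated in base R. The sketch below is a simplified stand-in, not the package's algorithm 2: it reweights only the observed data with Dirichlet weights and ignores pseudo-samples from a posterior entirely. The simulated data, the 0/1 outcome coding, and all variable names are assumptions made for illustration.

```r
# Sketch only: a weighted-likelihood bootstrap for logistic regression.
# This is NOT the package's algorithm 2 (no posterior pseudo-samples,
# no concentration parameter); data and names are hypothetical.
set.seed(1)
n <- 200
x <- cbind(1, rnorm(n))                       # constant term plus one feature
p <- as.vector(plogis(x %*% c(-0.5, 1)))      # true success probabilities
y <- rbinom(n, 1, p)                          # outcomes coded 0/1 in this sketch

n_bootstrap <- 50
samples <- t(replicate(n_bootstrap, {
  w <- rexp(n)                                # normalised Exp(1) draws are Dirichlet(1, ..., 1)
  w <- w / sum(w) * n                         # rescale so the weights sum to n
  # quasibinomial avoids warnings about non-integer weights
  fit <- glm.fit(x, y, weights = w, family = quasibinomial())
  fit$coefficients
}))

dim(samples)                                  # one row per bootstrap sample
```

As with the value described above, the result is a matrix with one row per bootstrap sample and one column per coefficient.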
draw_stick_breaks returns a vector with the breaks of a stick of length 1.
draw_stick_breaks( concentration = 1, min_stick_breaks = 100, threshold = 1e-08, seed = NULL )
concentration |
The parameter |
min_stick_breaks |
The minimal number of stick-breaks. |
threshold |
The threshold of stick remaining below which the function stops looking for more stick-breaks. It corresponds to epsilon in the paper, at the bottom of page 5 and in algorithm 2 on page 12. |
seed |
A seed to start the sampling. |
This function implements the stick-breaking process for non-parametric
learning described in section 2 of the supplementary material. The name
"stick-breaking" comes from a stick of unit length that we need to break into
a number of items. This code implements algorithm 2 and the stick-breaking
function calculates the parameter T in algorithm 1, which is the only
difference between the two algorithms. The code uses the Beta distribution as
that distribution is part of the definition of the stick-breaking process.
The function draws from the beta distribution, e.g. b_1, b_2, b_3, ..., and computes the stick breaks as b_1, (1-b_1)*b_2, (1-b_1)*(1-b_2)*b_3, ... . The length remaining in the stick at each step is 1-b_1, (1-b_1)*(1-b_2), (1-b_1)*(1-b_2)*(1-b_3), ... so the latter converges to zero.
A vector of stick-breaks summing to one.
draw_stick_breaks(1)
draw_stick_breaks(1, min_stick_breaks = 10)
draw_stick_breaks(1, min_stick_breaks = 10, threshold = 1e-8)
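The stick-breaking recursion described for draw_stick_breaks can be sketched in base R as follows. This is an illustration, not the package's source code. It assumes that the break fractions are drawn from Beta(1, concentration), that draws are made in batches of min_stick_breaks, and that the result is rescaled to sum to one; the function name stick_breaks_sketch is hypothetical.

```r
# Sketch of a stick-breaking process; NOT the package's implementation.
# Assumption: each break fraction b_k is drawn from Beta(1, concentration).
stick_breaks_sketch <- function(concentration = 1, min_stick_breaks = 100,
                                threshold = 1e-8, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  breaks <- numeric(0)
  remaining <- 1                      # length of stick not yet broken off
  # Keep drawing until there are enough breaks and little stick remains.
  while (length(breaks) < min_stick_breaks || remaining > threshold) {
    b <- stats::rbeta(min_stick_breaks, 1, concentration)
    # Stick left before each new break:
    # remaining, remaining * (1 - b_1), remaining * (1 - b_1) * (1 - b_2), ...
    left_before <- remaining * cumprod(c(1, 1 - b[-length(b)]))
    breaks <- c(breaks, left_before * b)
    remaining <- remaining * prod(1 - b)
  }
  # Assumption: rescale so that the breaks sum to exactly one.
  breaks / sum(breaks)
}

w <- stick_breaks_sketch(concentration = 1, min_stick_breaks = 10, seed = 42)
length(w)   # a multiple of 10, because draws are made in batches
sum(w)      # one, up to floating-point error
```

Drawing in batches keeps the loop vectorised; the stopping rule mirrors the role of threshold described above, since the loop ends only once the remaining stick length falls to or below it.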
Get a file from extdata by name
get_file(name)
name |
The filename that is requested |
The requested file
f <- get_file('bayes_logit.stan')
writeLines(readLines(f))
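For readers unfamiliar with how files under a package's extdata directory are usually located, the following base-R sketch shows one common pattern built on system.file. It is a hypothetical helper, not the package's own get_file; the function name and the error message are assumptions.

```r
# Hypothetical helper, not the package's source: look up a file in a
# package's extdata directory and fail loudly if it is missing.
get_extdata_sketch <- function(name, package) {
  # system.file returns "" when the file (or the package) is not found
  path <- system.file("extdata", name, package = package)
  if (!nzchar(path)) {
    stop("File '", name, "' not found in extdata of package '", package, "'")
  }
  path
}

# Usage, assuming PosteriorBootstrap is installed:
# f <- get_extdata_sketch("bayes_logit.stan", "PosteriorBootstrap")
# writeLines(readLines(f))
```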
Load and pre-process the dataset that ships with the package
get_german_credit_dataset( scale = TRUE, add_constant_term = TRUE, download_destination = NULL )
scale |
Whether to scale the features to have mean 0 and variance 1. |
add_constant_term |
Whether to add a constant term as the first feature. |
download_destination |
Provide a filepath if you want to download the dataset from source. Note that although the original dataset has 20 features (some of them qualitative), the numeric dataset has 24 features. |
A list with fields x for features and y for outcomes.
german <- get_german_credit_dataset()
head(german$y)
head(german$x)
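The effect of the scale and add_constant_term options can be illustrated on toy data in base R. This is a sketch under assumptions: the matrix below is simulated rather than taken from the German credit file, and the package may compute the scaling differently (for example with a different variance convention).

```r
# Sketch only: toy data standing in for the German credit features.
set.seed(1)
raw_x <- matrix(rnorm(5 * 3, mean = 10, sd = 4), nrow = 5)

x_scaled <- scale(raw_x)          # centre each column to mean 0, scale to variance 1
x_final <- cbind(1, x_scaled)     # constant term as the first feature

colMeans(x_scaled)                # approximately 0 for every column
apply(x_scaled, 2, var)           # 1 for every column (sample variance)
dim(x_final)                      # one more column than raw_x
```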
The file contains a local copy of the German Statlog credit dataset with 1,000 observations and 24 features.
Data page: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Original files: http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/
We use the file 'german.data-numeric', which has 24 covariates instead of the 20 in the original data (as some are qualitative).
get_german_credit_file()
A file with the plain-text raw data for the German Statlog credit dataset that ships with this package (extension .dat).
f <- get_german_credit_file()
writeLines(readLines(f, n=5))
Get the Stan file with Bayesian Logistic Regression
get_stan_file()
An RStan file with the model for variational Bayes that ships with this package (extension .stan).
f <- get_stan_file()
writeLines(readLines(f))