Title: | Synthetic Population Generator |
---|---|
Description: | Generates high-entropy integer synthetic populations from marginal and (optionally) seed data using quasirandom sampling, in arbitrary dimensionality (Smith, Lovelace and Birkin (2017) <doi:10.18564/jasss.3550>). The package also provides an implementation of the Iterative Proportional Fitting (IPF) algorithm (Zaloznik (2011) <doi:10.13140/2.1.2480.9923>). |
Authors: | Andrew Smith [aut, cre], Steven Johnson [ctb] (Sobol sequence generator implementation), Massachusetts Institute of Technology [cph] (Sobol sequence generator implementation), John Burkhardt [ctb, cph] (C++ implementation of incomplete gamma function), G Bhattacharjee [ctb] (Original FORTRAN implementation of incomplete gamma function) |
Maintainer: | Andrew Smith <[email protected]> |
License: | MIT + file LICENCE |
Version: | 2.3.2 |
Built: | 2024-11-22 06:08:12 UTC |
Source: | https://github.com/virgesmith/humanleague |
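This function converts a multidimensional array of state occupation counts into a table listing individuals and their characteristics.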
flatten(stateOccupancies, categoryNames)
stateOccupancies | an arbitrary-dimension array of (integer) state occupation counts. |
categoryNames | a string vector of unique column names. |
a DataFrame with columns corresponding to category values and rows corresponding to individuals.
gender = c(51,49)
age = c(17,27,35,21)
states = qis(list(1,2), list(gender,age))$result
table = flatten(states, c("Gender","Age"))
print(nrow(table[table$Gender==1,])) # 51
print(nrow(table[table$Age==2,])) # 27
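To make the conversion concrete, here is a minimal base-R sketch of the same idea. It is not the package's implementation: `flatten_sketch` is a hypothetical name, and it assumes category values are 1-based indices, as the example above suggests.

```r
# Hypothetical base-R sketch of flattening a count array into one row per individual.
flatten_sketch <- function(counts, category_names) {
  stopifnot(length(category_names) == length(dim(counts)))
  # Enumerate every state in column-major order, matching as.vector(counts).
  states <- expand.grid(lapply(dim(counts), seq_len))
  names(states) <- category_names
  # Repeat each state once per individual occupying it.
  rows <- rep(seq_len(nrow(states)), times = as.vector(counts))
  individuals <- states[rows, , drop = FALSE]
  rownames(individuals) <- NULL
  individuals
}

# A 2 x 2 array of counts: 6 individuals in total.
counts <- array(c(2, 0, 1, 3), dim = c(2, 2))
individuals <- flatten_sketch(counts, c("Gender", "Age"))
print(nrow(individuals))              # 6
print(sum(individuals$Gender == 1))   # 3
print(sum(individuals$Age == 2))      # 4
```

Each row of the output is one individual, so row counts filtered by a category value recover the corresponding marginal total.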
R package for synthesising populations from aggregate and (optionally) seed data
See README.md for detailed information and examples.
The package contains algorithms that use a number of different microsynthesis techniques:
- Iterative Proportional Fitting (IPF), à la the mipfp package
- Quasirandom Integer Sampling (QIS) (no seed population)
- Quasirandom Integer Sampling of IPF (QISI): a combination of the two techniques whereby IPF solutions are used to sample an integer population.
The latter provides a bridge between deterministic reweighting and combinatorial optimisation, offering advantages of both techniques:
- generates high-entropy integral populations
- can be used to generate multiple populations for sensitivity analysis
- is less sensitive than IPF to convergence issues when the seed contains a large number of empty cells
- has a relatively fast computation time, though running time is linear in population
The algorithms:
- support arbitrary dimensionality* for both the marginals and the seed.
- produce statistical data to ascertain the likelihood/degeneracy of the population (where appropriate).
[* excluding the legacy functions retained for backward compatibility with version 1.0.1]
The package also contains the following utility functions:
- a Sobol sequence generator
- functionality to convert fractional to nearest-integer marginals (in 1D). This can also be achieved in multiple dimensions by using the QISI algorithm.
- functionality to 'flatten' a population into a table: this converts a multidimensional array containing the population count for each state into a table listing individuals and their characteristics.
This function will generate the closest integer array to the fractional population provided, preserving the sums in every dimension.
integerise(population)
population | a numeric array containing the fractional population to integerise. |
an integer array of the same dimensions as population, with the same sums in every dimension.
prob2IntFreq(c(0.1,0.2,0.3,0.4), 11)
C++ multidimensional IPF implementation
ipf(seed, indices, marginals)
seed | an n-dimensional array of seed values |
indices | a List of 1-d arrays specifying the dimension indices of each marginal as they apply to the seed values |
marginals | a List of arrays containing marginal data. The sum of elements in each array must be identical |
an object containing:
- a flag indicating if the solution converged
- the population matrix
- the total population
- the number of iterations required
- the maximum error between the generated population and the marginals
ageByGender = array(c(1,2,5,3,4,3,4,5,1,2), dim=c(5,2))
ethnicityByGender = array(c(4,6,5,6,4,5), dim=c(3,2))
seed = array(rep(1,30), dim=c(5,2,3))
result = ipf(seed, list(c(1,2), c(3,2)), list(ageByGender, ethnicityByGender))
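For readers unfamiliar with IPF, the following is a minimal base-R sketch of the algorithm, using the same inputs as the example above. It is an illustration only, not the package's C++ implementation: the function name `ipf_sketch`, the tolerance, the iteration cap and the names of the returned list elements are all assumptions made for this sketch.

```r
# Hypothetical base-R IPF sketch: rescale the seed to match each marginal in turn.
ipf_sketch <- function(seed, indices, marginals, tol = 1e-8, max_iter = 100L) {
  x <- seed
  converged <- FALSE
  max_error <- Inf
  for (iter in seq_len(max_iter)) {
    for (k in seq_along(marginals)) {
      # Current sums over the dimensions this marginal applies to.
      current <- apply(x, indices[[k]], sum)
      # Scaling factors; cells with a zero current sum stay at zero.
      ratio <- ifelse(current > 0, marginals[[k]] / current, 0)
      x <- sweep(x, indices[[k]], ratio, "*")
    }
    # Largest absolute discrepancy against any marginal after a full pass.
    max_error <- max(vapply(seq_along(marginals), function(k) {
      max(abs(apply(x, indices[[k]], sum) - marginals[[k]]))
    }, numeric(1)))
    if (max_error < tol) {
      converged <- TRUE
      break
    }
  }
  list(conv = converged, result = x, pop = sum(x),
       iterations = iter, maxError = max_error)
}

ageByGender <- array(c(1,2,5,3,4,3,4,5,1,2), dim = c(5,2))
ethnicityByGender <- array(c(4,6,5,6,4,5), dim = c(3,2))
seed <- array(rep(1, 30), dim = c(5,2,3))
res <- ipf_sketch(seed, list(c(1,2), c(3,2)),
                  list(ageByGender, ethnicityByGender))
print(res$conv)
print(res$pop)
```

The result is generally fractional; the QISI algorithm described in this manual is the package's route from such a solution to an integer population.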
This function will generate the closest integer vector to the probabilities scaled to the population.
prob2IntFreq(pIn, pop)
pIn | a numeric vector of state occupation probabilities. Must sum to unity (to within double precision epsilon) |
pop | the total population |
an integer vector of frequencies that sum to pop, and the RMS difference from the original values.
prob2IntFreq(c(0.1,0.2,0.3,0.4), 11)
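One common way to obtain a nearest-integer vector with a fixed total is the largest-remainder method, sketched below in base R. This is an assumption about how such a function could work, not a description of the package's algorithm; `prob2int_sketch`, its tolerance, and the list element names `freq` and `rmse` are hypothetical.

```r
# Hypothetical largest-remainder sketch: integer frequencies that sum to pop.
prob2int_sketch <- function(p, pop) {
  # Looser tolerance than the package documents; chosen for illustration.
  stopifnot(abs(sum(p) - 1) < 1e-8)
  target <- p * pop
  freq <- floor(target)
  # Number of individuals still to allocate after rounding down.
  shortfall <- as.integer(round(pop - sum(freq)))
  if (shortfall > 0) {
    # Give one extra individual to the states with the largest remainders.
    top <- order(target - freq, decreasing = TRUE)[seq_len(shortfall)]
    freq[top] <- freq[top] + 1
  }
  list(freq = as.integer(freq),
       rmse = sqrt(mean((freq - target)^2)))
}

res <- prob2int_sketch(c(0.1, 0.2, 0.3, 0.4), 11)
print(res$freq)       # 1 2 3 5
print(sum(res$freq))  # 11
```

Here the scaled targets are 1.1, 2.2, 3.3 and 4.4, so rounding down allocates 10 individuals and the remaining one goes to the last state, which has the largest remainder.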
C++ multidimensional Quasirandom Integer Sampling implementation
qis(indices, marginals, skips = 0L)
indices | a List of 1-d arrays specifying the dimension indices of each marginal |
marginals | a List of arrays containing marginal data. The sum of elements in each array must be identical |
skips | (optional, default 0) number of Sobol points to skip before sampling |
an object containing:
- a flag indicating if the solution converged
- the population matrix
- the expected state occupancy matrix
- the total population
- chi-square and p-value
ageByGender = array(c(1,2,5,3,4,3,4,5,1,2), dim=c(5,2))
ethnicityByGender = array(c(4,6,5,6,4,5), dim=c(3,2))
result = qis(list(c(1,2), c(3,2)), list(ageByGender, ethnicityByGender))
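A sampled population should reproduce every marginal exactly when summed over the remaining dimensions. The base-R sketch below shows one way to check this; `marginal_of` and `matches_marginals` are hypothetical helpers, not part of the package, and the demonstration uses a hand-built array rather than real `qis` output.

```r
# Sum a population array over all dimensions not listed in idx.
marginal_of <- function(population, idx) apply(population, idx, sum)

# TRUE if the population reproduces every marginal at the given indices.
matches_marginals <- function(population, indices, marginals) {
  all(mapply(function(idx, m) all(marginal_of(population, idx) == m),
             indices, marginals))
}

# Hand-built 2 x 3 x 2 population, with marginals derived from it.
pop <- array(1:12, dim = c(2, 3, 2))
idx <- list(c(1, 2), c(3, 2))
margs <- list(marginal_of(pop, c(1, 2)), marginal_of(pop, c(3, 2)))
print(matches_marginals(pop, idx, margs))   # TRUE

# Adding one individual to a single state breaks both marginals.
bad <- pop
bad[1, 1, 1] <- bad[1, 1, 1] + 1
print(matches_marginals(bad, idx, margs))   # FALSE

# With the package installed, and assuming the population array is returned
# in $result (as in the flatten example), the same check would read:
# matches_marginals(result$result, list(c(1,2), c(3,2)),
#                   list(ageByGender, ethnicityByGender))
```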
C++ QIS-IPF implementation
qisi(seed, indices, marginals, skips = 0L)
seed | an n-dimensional array of seed values |
indices | a List of 1-d arrays specifying the dimension indices of each marginal |
marginals | a List of arrays containing marginal data. The sum of elements in each array must be identical |
skips | (optional, default 0) number of Sobol points to skip before sampling |
an object containing:
- a flag indicating if the solution converged
- the population matrix
- the expected state occupancy matrix
- the total population
- chi-square and p-value
ageByGender = array(c(1,2,5,3,4,3,4,5,1,2), dim=c(5,2))
ethnicityByGender = array(c(4,6,5,6,4,5), dim=c(3,2))
seed = array(rep(1,30), dim=c(5,2,3))
result = qisi(seed, list(c(1,2), c(3,2)), list(ageByGender, ethnicityByGender))
Generate Sobol' quasirandom sequence
sobolSequence(dim, n, skip = 0L)
dim | number of dimensions |
n | number of variates to sample |
skip | number of variates to skip (the actual number skipped will be the largest power of 2 less than skip) |
an n-by-dim matrix of uniform probabilities in (0,1).
sobolSequence(2, 1000, 1000) # will skip 512 numbers!
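The skip rule can be illustrated with a short base-R sketch. It is a literal reading of the description above ("largest power of 2 less than the requested skip") and not the package's code; `effective_skip` is a hypothetical helper, and the treatment of requests of 0 or 1 is an assumption.

```r
# Hypothetical helper: largest power of 2 strictly less than the requested skip.
effective_skip <- function(skip) {
  # Assumption: no power of 2 is less than 1, so nothing is skipped.
  if (skip <= 1) return(0)
  p <- 1
  while (p * 2 < skip) p <- p * 2
  p
}

print(effective_skip(1000))  # 512
print(effective_skip(513))   # 512
print(effective_skip(512))   # 256
```

Under this reading, a request that is itself a power of 2 is rounded down to the next lower power; whether the package behaves that way at exact powers is not stated in this manual.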
Entry point to enable running unit tests within R (e.g. in testthat)
unitTest()
a List containing the number of tests run, the number of failures, and any error messages.
unitTest()