Package 'binr' reference manual

Title:	Cut Numeric Values into Evenly Distributed Groups
Description:	Implementation of algorithms for cutting numerical values exhibiting a potentially highly skewed distribution into evenly distributed groups (bins). This functionality can be applied for binning discrete values, such as counts, as well as for discretization of continuous values, for example, during generation of features used in machine learning algorithms.
Authors:	Sergei Izrailev
Maintainer:	Sergei Izrailev <[email protected]>
License:	Apache License (== 2.0)
Version:	1.2
Built:	2025-03-11 04:14:58 UTC
Source:	https://github.com/jabiru/binr

Cut Numeric Values Into Evenly Distributed Groups (bins).

Description

Package binr (pronounced as "binner") provides algorithms for cutting numerical values exhibiting a potentially highly skewed distribution into evenly distributed groups (bins). This functionality can be applied for binning discrete values, such as counts, as well as for discretization of continuous values, for example, during generation of features used in machine learning algorithms.

Maintainer

Sergei Izrailev

Copyright

License

Apache License, Version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0

URL

http://github.com/jabiru/binr

Installation from github

devtools::install_github("jabiru/binr")

Author(s)

Sergei Izrailev

Cut Numeric Values Into Evenly Distributed Groups (Bins)

Description

bins - Cuts points in vector x into evenly distributed groups (bins). bins takes 3 separate approaches to generating the cuts, picks the one resulting in the least mean square deviation from the ideal cut - length(x) / target.bins points in each bin - and then merges small bins unless exact.groups is TRUE The 3 approaches are:

Use quantiles, and increase the number of even cuts up to max.breaks until the number of groups reaches the desired number. See bins.quantiles.
Start with a single bin with all the data in it and perform bin splits until either the desired number of bins is reached or there's no reduction in error (the latter is ignored if exact.groups is TRUE). See bins.split.
Start with length(table(x)) bins, each containing exactly one distinct value and merge bins until the desired number of bins is reached. If exact.groups is FALSE, continue merging until there's no further reduction in error. See bins.merge.

For each of these approaches, apply redistribution of points among existing bins until there's no further decrease in error. See bins.move.

bins.getvals - Extracts cut points from the object retured by bins. The cut points are always between the values in x and weighed such that the cut point splits the area under the line from (lo, n1) to (hi, n2) in half.

bins.merr - Partitioning the data into bins using splitting, merging and moving optimizes this error function, which is the mean squared error of point counts in the bins relative to the optimal number of points per bin.

Usage

bins(x, ...)

## Default S3 method:
bins(x, target.bins, max.breaks = NA, exact.groups = F,
  verbose = F, errthresh = 0.1, minpts = NA)

## S3 method for class 'data.frame'
bins(df, ...)

## S3 method for class 'binr'
predict(obj, data, labels = FALSE, ...)

bins.getvals(lst, minpt = -Inf, maxpt = Inf)

bins.merr(binct, target.bins)
bins(x, ...)

## Default S3 method:
bins(x, target.bins, max.breaks = NA, exact.groups = F,
  verbose = F, errthresh = 0.1, minpts = NA)

## S3 method for class 'data.frame'
bins(df, ...)

## S3 method for class 'binr'
predict(obj, data, labels = FALSE, ...)

bins.getvals(lst, minpt = -Inf, maxpt = Inf)

bins.merr(binct, target.bins)

Arguments

`x`	Vector of numbers
`target.bins`	Number of groups desired; this is also the max number of groups.
`max.breaks`	Used for initial cut. If `exact.groups` is `FALSE`, bins are merged until there's no bins with fewer than `length(x) / max.breaks` points. In `bins`, one of `max.breaks` and `minpts` must be supplied.
`exact.groups`	if TRUE, the result will have exactly the number of target.bins; if FALSE, the result may contain fewer than target.bins bins
`verbose`	Indicates verbose output.
`errthresh`	If the error is below the provided value, stops after the first rough estimate of the bins.
`minpts`	Minimum number of points in a bin. In `bins`, one of `max.breaks` and `minpts` must be supplied.
`lst`	The list returned by the `bins` function.
`minpt`	The value replacing the lower bound of the cut points.
`maxpt`	The value replacing the upper bound of the cut points.
`binct`	The number of points falling into the bins.

Details

The gains are computed using incremental analytical expresions derived for moving a value from one bin to the next, splitting a bin into two or merging two bins.

Value

A list containing the following items (not all of them may be present):

binlo - The index into xval yielding the "low" value falling into the bin.
binhi - The index into xval yielding the "high" value falling into the bin.
binct - The number of points falling into the bin.
xtbl - The result of a call to table(x).
xval - The sorted unique values of the data points x. Essentially, a numeric version of names(xtbl).
changed - Flag indicating whether the bins have been modified by the function.
err - Mean square root error between the resulting counts and ideal bins.
imax - For the move, merge and split operations, the index of the bin with the maximum gain.
iside - For the move operation, the side of the move: 0 = left, 1 = right.
gain - Error gain obtained as the result of the function call.

bins.getvals returns a vector of cut points extracted from the lst object. The actual low and high values in each bin, as well as the counts of values in each bin are placed in attributes binlo, binhi and binct, respectively.

Examples

## Not run: 
   # Seriously skewed x:
   x <- floor(exp(rnorm(200000 * 1.3)))
   cuts <- bins(x, target.bins = 10, minpts = 2000)
   cuts$breaks <- bins.getvals(cuts)
   cuts$binct
   #   [0, 0]    [1, 1]    [2, 2]    [3, 3]    [4, 4]    [5, 5]    [6, 7]   [8, 10]
   # 129868     66611     28039     13757      7595      4550      4623      2791
   #   [11, 199]
   # 2166

   # Centered x:
   x <- rep(c(1:10,20,31:40), c(rep(1, 10), 100, rep(1,10)))
   cuts <- bins(x, target.bins = 3, minpts = 10)
   cuts$binct
   # [1, 10] [20, 20] [31, 40]
   #      10      100       10

## End(Not run)
## Not run: 
   # Seriously skewed x:
   x <- floor(exp(rnorm(200000 * 1.3)))
   cuts <- bins(x, target.bins = 10, minpts = 2000)
   cuts$breaks <- bins.getvals(cuts)
   cuts$binct
   #   [0, 0]    [1, 1]    [2, 2]    [3, 3]    [4, 4]    [5, 5]    [6, 7]   [8, 10]
   # 129868     66611     28039     13757      7595      4550      4623      2791
   #   [11, 199]
   # 2166

   # Centered x:
   x <- rep(c(1:10,20,31:40), c(rep(1, 10), 100, rep(1,10)))
   cuts <- bins(x, target.bins = 3, minpts = 10)
   cuts$binct
   # [1, 10] [20, 20] [31, 40]
   #      10      100       10

## End(Not run)

Greedy binning algorithm.

Description

bins.greedy - Wrapper around bins.greedy.impl. Goes over the sorted values of x left to right and fills the bins with the values until they are about the right size.

bins.greedy.impl - Implementation of a single-pass binning algorithm that examines sorted data left to right and builds bins of the target size. The bins.greedy wrapper around this function provides a less involved interface. This is not symmetric wrt direction: symmetric distributions may not have symmetric bins if there are multiple points with the same values. If a single value accounts for more than thresh * binsz points, it will be placed in a new bin.

Usage

bins.greedy(x, nbins, minpts = floor(0.5 * length(x)/nbins), thresh = 0.8, naive = FALSE)

bins.greedy.impl(xval, xtbl, xstp, binsz, nbins, thresh, verbose = F)
bins.greedy(x, nbins, minpts = floor(0.5 * length(x)/nbins), thresh = 0.8, naive = FALSE)

bins.greedy.impl(xval, xtbl, xstp, binsz, nbins, thresh, verbose = F)

Arguments

`x`	Vector of numbers.
`nbins`	Target number of bins.
`minpts`	Minimum number of points in a bin. Only used if `naive = FALSE`.
`thresh`	Threshold fraction of bin size for the greedy algorithm. Suppose there's `n < binsz` points in the current bin already. Also suppose that the next value V is represented by `m` points, and `m + n > binsz`. Then the algorithm will check if `m > thresh * binsz`, and if so, will place the value V into a new bin. If `m` is below the threshold, the points having value V are added to the current bin.
`naive`	When `TRUE`, simply calls `bins.greedy.impl` with data derived from `x`. Otherwise, makes an extra step of marking the values that by themselves take a whole bin to force the algorithm to place these values in a bin separately.
`xval`	Sorted unique values of the data set x. This should be the numeric version of `names(xtbl)`.
`xtbl`	Result of a call to `table(x)`.
`xstp`	Stopping points; if `xstp[i] == TRUE`, the `i`-th value can't be merged to the `(i-1)`-th one. `xstp[1]` value is ignored.
`binsz`	Target bin size, i.e., the number of points falling into each bin; for example, `floor(length(x) / nbins)`
`verbose`	When `TRUE`, prints the number of points falling into the bins.

Value

A list with the following items:

binlo - The "low" value falling into the bin.
binhi - The "high" value falling into the bin.
binct - The number of points falling into the bin.
xtbl - The result of a call to table(x).
xval - The sorted unique values of the data points x. Essentially, a numeric version of names(xtbl).

Algorithms minimizing the binning error function `bins.merr`.

Description

bins.move - Compute the best move of a value from one bin to its neighbor

bins.split - Split a bin into two bins optimally.

bins.merge - Merges the two bins yielding the largest gain in error reduction.

bins.move.iter - Apply bins.move until there's no change. Can only reduce the error.

bins.split.iter Iterate to repeatedly apply bins.split.

bins.merge.iter Iterate to repeatedly apply bins.merge.

Usage

bins.move(xval, xtbl, binlo, binhi, binct, target.bins, verbose = F)

bins.split(xval, xtbl, binlo, binhi, binct, target.bins, force = F, verbose = F)

bins.merge(xval, xtbl, binlo, binhi, binct, target.bins, force = F, verbose = F)

bins.move.iter(lst, target.bins, verbose = F)

bins.split.iter(lst, target.bins, exact.groups = F, verbose = F)

bins.merge.iter(lst, target.bins, exact.groups = F, verbose = F)
bins.move(xval, xtbl, binlo, binhi, binct, target.bins, verbose = F)

bins.split(xval, xtbl, binlo, binhi, binct, target.bins, force = F, verbose = F)

bins.merge(xval, xtbl, binlo, binhi, binct, target.bins, force = F, verbose = F)

bins.move.iter(lst, target.bins, verbose = F)

bins.split.iter(lst, target.bins, exact.groups = F, verbose = F)

bins.merge.iter(lst, target.bins, exact.groups = F, verbose = F)

Arguments

`xval`	Sorted unique values of the data set x. This should be the numeric version of `names(xtbl)`.
`xtbl`	Result of a call to `table(x)`.
`binlo`	The "low" value falling into the bin.
`binhi`	The "high" value falling into the bin.
`binct`	The number of points falling into the bin.
`target.bins`	Number of bins desired; this is also the max number of bins.
`verbose`	When `TRUE`, prints resulting `binct`.
`force`	When `TRUE`, splits or merges bins regardless of whether the best gain is positive.
`lst`	List containing `xval, xtbl, binlo, binhi, binct`.
`exact.groups`	If `FALSE`, run until either the target.bins is reached or there's no more splits or merges that reduce the error. Otherwise (`TRUE`), run until the target.bins is reached, even if that increases the error.

Value

A list containing the following items (not all of them may be present):

binlo - The "low" value falling into the bin.
binhi - The "high" value falling into the bin.
binct - The number of points falling into the bin.
xtbl - The result of a call to table(x).
xval - The sorted unique values of the data points x. Essentially, a numeric version of names(xtbl).
changed - Flag indicating whether the bins have been modified by the function.
err - Mean square root error between the resulting counts and ideal bins.
imax - For the move, merge and split operations, the index of the bin with the maximum gain.
iside - For the move operation, the side of the move: 0 = left, 1 = right.
gain - Error gain obtained as the result of the function call.

Quantile-based binning

Description

Cuts the data set x into roughly equal groups using quantiles.

Usage

bins.quantiles(x, target.bins, max.breaks, verbose = FALSE)
bins.quantiles(x, target.bins, max.breaks, verbose = FALSE)

Arguments

`x`	A numeric vector to be cut in bins.
`target.bins`	Target number of bins, which may not be reached if the number of unique values is smaller than the specified value.
`max.breaks`	Maximum number of quantiles; must be at least as large as `target.bins`.
`verbose`	Indicates verbose output.

Details

Because the number of unique values may be smaller than target.bins, the function gradually increases the number of quantiles up to max.breaks or until the target.bins number of bins is reached.

Package 'binr'

Help Index

Cut Numeric Values Into Evenly Distributed Groups (bins).

Description

Maintainer

Copyright

License

URL

Installation from github

Author(s)

See Also

Cut Numeric Values Into Evenly Distributed Groups (Bins)

Description

Usage

Arguments

Details

Value

See Also

Examples

Greedy binning algorithm.

Description

Usage

Arguments

Value

See Also

Algorithms minimizing the binning error function `bins.merr`.

Description

Usage

Arguments

Value

See Also

Quantile-based binning

Description

Usage

Arguments

Details

See Also

Package 'binr'

Help Index

Cut Numeric Values Into Evenly Distributed Groups (bins).

Description

Maintainer

Copyright

License

URL

Installation from github

Author(s)

See Also

Cut Numeric Values Into Evenly Distributed Groups (Bins)

Description

Usage

Arguments

Details

Value

See Also

Examples

Greedy binning algorithm.

Description

Usage

Arguments

Value

See Also

Algorithms minimizing the binning error function bins.merr.

Description

Usage

Arguments

Value

See Also

Quantile-based binning

Description

Usage

Arguments

Details

See Also

Algorithms minimizing the binning error function `bins.merr`.