Feature discretization
class: scorecardbundle.feature_discretization.ChiMerge.ChiMerge
ChiMerge is a discretization algorithm introduced by Randy Kerber in "ChiMerge: Discretization of Numeric Attributes". It can transform a numerical feature into a categorical feature, or reduce the number of intervals in an ordinal feature, based on the feature's distribution and the target classes' relative frequencies in each interval. As a result, it keeps intervals that are statistically significantly different and merges similar ones.
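As a rough illustration of the comparison described above, the following sketch computes the Pearson chi-squared statistic for two adjacent intervals and compares it with a critical value. It is not the library's implementation: `chi2_adjacent` is a hypothetical helper, skipping cells with zero expected count is a simplification, and the threshold 2.706 assumes 1 degree of freedom (two intervals, two classes) at a confidence level of 0.9.

```python
import numpy as np

def chi2_adjacent(counts):
    """Pearson chi-squared statistic for an (intervals x classes) count table.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    row_totals = counts.sum(axis=1, keepdims=True)
    col_totals = counts.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / counts.sum()
    # Simplification: cells with zero expected count contribute nothing.
    mask = expected > 0
    return float(((counts - expected)[mask] ** 2 / expected[mask]).sum())

# Rows are two adjacent intervals, columns are class counts (non-event, event).
similar = chi2_adjacent([[40, 10], [38, 12]])
different = chi2_adjacent([[45, 5], [20, 30]])

# Assumed critical value: chi-squared, 1 degree of freedom, confidence 0.9.
threshold = 2.706
print(similar < threshold)    # pair with similar class frequencies: merge candidate
print(different > threshold)  # pair with different class frequencies: kept apart
```

A pair whose statistic falls below the threshold has class frequencies that are not significantly different, which is what makes it a candidate for merging.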
Parameters
m: integer, optional(default=2)
The number of adjacent intervals to compare during the chi-squared test.
confidence_level: float, optional(default=0.9)
The confidence level to determine the threshold for intervals to
be considered as different during the chi-square test.
max_intervals: int, optional(default=None)
Specify the maximum number of intervals the discretized array will have.
Sometimes (like when training a scorecard model) fewer intervals are
preferred. If you do not need this option, set it to None.
min_intervals: int, optional(default=2)
Specify the minimum number of intervals the discretized array will have.
If you do not need this option, set it to 2.
initial_intervals: int, optional(default=100)
The original ChiMerge algorithm starts by putting each unique value
in its own interval and merging through a loop. This can be time-consuming
when the sample size is large.
Setting the initial_intervals option to a value other than None (like 10 or 100)
makes the algorithm start with the specified number of intervals (the
initial intervals are generated using quantiles). This can greatly shorten
the run time. If you do not need this option, set it to None.
delimiter: string, optional(default='~')
The returned array will be an array of intervals. Each interval is
represented by a string (e.g. '1~2'), which takes the form
lower+delimiter+upper. This parameter controls the symbol that
connects the lower and upper boundaries.
decimal: int, optional(default=None)
Control the number of decimals of boundaries.
Default is None.
output_dataframe: boolean, optional(default=False)
Whether to output np.array or pd.DataFrame
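To make the effect of initial_intervals concrete, here is a minimal sketch of quantile-based starting boundaries. The helper `initial_boundaries` is hypothetical and the exact quantile scheme used by the library is an assumption; the point is only that merging can start from a handful of intervals instead of one interval per unique value.

```python
import numpy as np

def initial_boundaries(x, initial_intervals=10):
    """Upper boundaries of quantile-based starting intervals (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Evenly spaced quantile levels; level 0 is dropped so each value is an upper bound.
    levels = np.linspace(0, 100, initial_intervals + 1)[1:]
    # Heavily tied data can produce duplicate quantiles, so deduplicate.
    return np.unique(np.percentile(x, levels))

x = np.arange(1, 1001)  # 1000 distinct values
bounds = initial_boundaries(x, initial_intervals=10)

# Number of starting intervals: one per unique value vs. quantile-based.
print(len(np.unique(x)), len(bounds))
```

With fewer starting intervals there are fewer adjacent pairs to test in each pass of the merging loop, which is where the run-time saving comes from.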
Attributes
boundaries_: dict
A dictionary that maps feature name to its merged boundaries.
fit_sample_size_: int
The sample size of the fitted data.
transform_sample_size_: int
The sample size of the transformed data.
num_of_x_: int
The number of features.
columns_: iterable
An array or list of feature names.
Methods
fit(X, y):
fit the ChiMerge algorithm to the feature.
transform(X):
transform the feature using the fitted ChiMerge.
fit_transform(X, y):
fit the ChiMerge algorithm to the feature and transform it.
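The transform step can be pictured as mapping each value to the string label of the interval it falls in. The sketch below assumes right-closed intervals, open-ended outer intervals (-inf and inf) and a plain float rendering of the boundaries; `to_interval_labels` is a hypothetical helper, not part of the package, and the labels produced by the library may be formatted differently.

```python
import numpy as np

def to_interval_labels(x, boundaries, delimiter='~', decimal=None):
    """Map values to 'lower<delimiter>upper' strings for right-closed intervals."""
    edges = np.sort(np.asarray(boundaries, dtype=float))
    if decimal is not None:
        edges = np.round(edges, decimal)
    # Assumption of this sketch: the outer intervals are open-ended.
    full = np.concatenate(([-np.inf], edges, [np.inf]))
    # side='left' places a value equal to a boundary in the interval that boundary closes.
    idx = np.searchsorted(edges, np.asarray(x, dtype=float), side='left')
    return np.array([f"{float(full[i])}{delimiter}{float(full[i + 1])}" for i in idx])

labels = to_interval_labels([1, 2, 3, 7], boundaries=[2, 5])
print(labels)
```

Because the label is built as lower+delimiter+upper, any downstream code that parses the labels needs to be given the same delimiter.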
function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.plot_event_dist()
Visualize the feature's event rate distribution to facilitate explainability evaluation.
Parameters
x: numpy.ndarray or pandas.DataFrame, shape (number of examples,)
The feature to be visualized.
y: numpy.ndarray or pandas.DataFrame, shape (number of examples,)
The dependent variable.
delimiter: string, optional(default='~')
Each interval is represented by a string (e.g. '1~2'),
which takes the form lower+delimiter+upper. This parameter
controls the symbol that connects the lower and upper boundaries.
title: Python string. Optional.
The title of the plot. Default is ''.
x_label: Python string. Optional.
The label of the feature. Default is ''.
y_label: Python string. Optional.
The label of the dependent variable. Default is ''.
x_rotation: int. Optional.
The degree of rotation of x-axis ticks. Default is 60.
xticks: Python list of strings. Optional.
The tick labels on x-axis. Default is the unique values
of x (in the format of Python string).
figure_height: int. Optional.
The height of the figure. Default is 4.
figure_width: int. Optional.
The width of the figure. Default is 6.
data_table: boolean. Optional.
Whether or not to include a data table in the plot.
Default is True.
table_vpos: float. Optional.
Only used when parameter 'data_table' is True.
The vertical position of the data table below the plot.
table_vpos should be a negative float. Default is None,
which means table_vpos will be determined automatically
according to the number of intervals in the feature.
table_hpos: float. Optional.
Only used when parameter 'data_table' is True.
The horizontal position of the data table below the plot.
Default is 0.01. Normally there is no need to change
this parameter.
save: boolean. Optional.
Whether or not the figure is saved to a local path.
Default is False.
path: Python string. Optional.
Only used when parameter 'save' is True.
The local path where the figure will be saved.
Default is ''.
file_name: Python string. Optional.
Only used when parameter 'save' is True.
The file will be named as f'{path}featuredist_{file_name}.png'
Return
f1_ax1: matplotlib.axes._subplots.AxesSubplot
The matplotlib axes object of the plot is returned.
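One practical consequence of the documented naming pattern f'{path}featuredist_{file_name}.png' is that path is concatenated directly with the file name, so a directory path appears to need a trailing separator. The short sketch below only reproduces that string pattern; `figure_file` is a hypothetical helper and does not call the plotting function.

```python
import os

def figure_file(path, file_name):
    # Mirrors the documented pattern f'{path}featuredist_{file_name}.png'.
    return f'{path}featuredist_{file_name}.png'

# Without a trailing separator the directory name fuses with the file name.
print(figure_file('plots', 'age'))

# os.path.join with an empty last component appends the platform's separator.
print(figure_file(os.path.join('plots', ''), 'age'))
```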
function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.feature_stat()
Compute the input feature's sample distribution, including the sample sizes, event sizes and event proportions of each feature value.
Parameters
x: numpy.array, shape (number of examples,)
The discretized feature array. Each value represents a right-closed interval
of the input feature, e.g. '1~8'.
y: numpy.array, shape (number of examples,)
The binary dependent variable, where 1 represents the target event (positive class).
delimiter: python string. Default is '~'
The symbol that separates the boundaries of an interval in array x.
Return
res: pandas.DataFrame, shape (number of intervals in the feature, 4)
The feature distribution table.
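The quantities in this table can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the library's code: `interval_stats` is a hypothetical helper, it returns tuples instead of a DataFrame, and the column layout of the real table is not specified here. It also shows one reason a delimiter is needed: the intervals are ordered by their numeric lower boundary, which has to be parsed out of the label.

```python
from collections import Counter

def interval_stats(x, y, delimiter='~'):
    """Per-interval sample size, event size and event proportion (a sketch)."""
    sizes = Counter(x)
    events = Counter(xi for xi, yi in zip(x, y) if yi == 1)
    # Sort by the numeric lower boundary; plain string sorting would
    # place '10.0~inf' before '2.0~10.0'.
    ordered = sorted(sizes, key=lambda s: float(s.split(delimiter)[0]))
    return [(s, sizes[s], events[s], events[s] / sizes[s]) for s in ordered]

x = ['10.0~inf', '-inf~2.0', '2.0~10.0', '2.0~10.0', '-inf~2.0', '10.0~inf']
y = [1, 0, 1, 0, 0, 1]
stats = interval_stats(x, y)
for row in stats:
    print(row)
```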
function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.feature_stat_str()
Compute the input feature's sample distribution in string format for printing. The distribution table returned (in string format) contains the sample sizes, event sizes and event proportions of each feature value.
Parameters
x: numpy.array, shape (number of examples,)
The discretized feature array. Each value represents a right-closed interval
of the input feature, e.g. '1~8'.
y: numpy.array, shape (number of examples,)
The binary dependent variable, where 1 represents the target event (positive class).
delimiter: python string. Default is '~'
The symbol that separates the boundaries of an interval in array x.
n_lines: integer. Default is 40.
The number of '- ' used. This controls the length of horizontal lines in the table.
width: integer. Default is 20.
This controls the width of each column.
Return
table_string: python string
The feature distribution table in string format
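How n_lines and width might interact can be shown with a small formatting sketch. The layout below (a rule above and below the header and at the end, left-justified cells) and the column names are assumptions made for illustration; `stats_table` is a hypothetical helper and the string produced by the library may look different.

```python
def stats_table(rows, header, n_lines=40, width=20):
    """Render rows as a fixed-width text table (layout is an assumption)."""
    rule = '- ' * n_lines  # horizontal line made of n_lines repetitions of '- '

    def fmt(cells):
        # Pad every cell to the same column width.
        return ''.join(str(c).ljust(width) for c in cells)

    lines = [rule, fmt(header), rule] + [fmt(r) for r in rows] + [rule]
    return '\n'.join(lines)

table_string = stats_table(
    rows=[('-inf~2.0', 2, 0, 0.0), ('2.0~10.0', 2, 1, 0.5)],
    header=('interval', 'sample_size', 'event_size', 'event_rate'),  # hypothetical names
    n_lines=40,
    width=20,
)
print(table_string)
```

In this sketch the default values line up: 40 repetitions of the two-character '- ' give an 80-character rule, the same length as four columns of width 20.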