Feature discretization

class: scorecardbundle.feature_discretization.ChiMerge.ChiMerge

ChiMerge is a discretization algorithm introduced by Randy Kerber in "ChiMerge: Discretization of Numeric Attributes". It can transform a numerical feature into a categorical feature, or reduce the number of intervals in an ordinal feature, based on the feature's distribution and the target classes' relative frequencies in each interval. As a result, it keeps intervals that are statistically significantly different and merges similar ones.
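The merge decision described above can be illustrated with a minimal, standard-library sketch of the chi-squared statistic for two adjacent intervals. This is not the library's implementation: the function name `chi_square`, the example counts, and the simplified handling of empty cells are assumptions made for illustration. The threshold 2.706 is the chi-squared critical value for 1 degree of freedom at a confidence level of 0.9, which corresponds to two intervals and a binary target.

```python
def chi_square(interval_a, interval_b):
    """Chi-squared statistic for two adjacent intervals.

    Each argument is a list of class counts, e.g. [non_events, events].
    Hypothetical helper for illustration only.
    """
    rows = [interval_a, interval_b]
    n_classes = len(interval_a)
    total = sum(interval_a) + sum(interval_b)
    col_totals = [interval_a[j] + interval_b[j] for j in range(n_classes)]
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(n_classes):
            expected = row_total * col_totals[j] / total
            if expected > 0:  # simplification: skip cells with zero expectation
                stat += (row[j] - expected) ** 2 / expected
    return stat


# Critical value for 1 degree of freedom at confidence level 0.9
THRESHOLD_90_DF1 = 2.706

# Similar event rates (20% vs 24%): statistic below threshold, so merge
similar = chi_square([40, 10], [38, 12])
# Very different event rates (10% vs 60%): statistic above threshold, so keep
different = chi_square([45, 5], [20, 30])

print(similar < THRESHOLD_90_DF1, different > THRESHOLD_90_DF1)
```

A pair of intervals whose statistic falls below the threshold is treated as similar and becomes a candidate for merging; a higher confidence level raises the threshold and therefore leads to more merging.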

Parameters

m: integer, optional(default=2)
    The number of adjacent intervals to compare during the chi-squared test.

confidence_level: float, optional(default=0.9)
    The confidence level to determine the threshold for intervals to
    be considered as different during the chi-square test.

max_intervals: int, optional(default=None)
    Specify the maximum number of intervals the discretized array will have.
    Sometimes (like when training a scorecard model) fewer intervals are
    preferred. If you do not need this option, just set it to None.

min_intervals: int, optional(default=2)
    Specify the minimum number of intervals the discretized array will have.
    If you do not need this option, just set it to 2.

initial_intervals: int, optional(default=100)
    The original ChiMerge algorithm starts by putting each unique value
    in its own interval and then merging in a loop. This can be
    time-consuming when the sample size is large.
    Setting the initial_intervals option to a value other than None
    (like 10 or 100) makes the algorithm start with the specified number
    of intervals (the initial intervals are generated using quantiles).
    This can greatly shorten the run time. If you do not need this option,
    just set it to None.
 
delimiter: string, optional(default='~')
    The returned array will be an array of intervals. Each interval is
    represented by a string (e.g. '1~2'), which takes the form
    lower+delimiter+upper. This parameter controls the symbol that
    connects the lower and upper boundaries.

decimal: int, optional(default=None)
    Controls the number of decimal places in the boundaries.
    Default is None.

output_dataframe: boolean, optional(default=False)
    Whether to output a pd.DataFrame instead of a np.array.
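To make the effect of initial_intervals concrete, here is a rough sketch of how quantile-based starting cut points reduce the number of intervals the merge loop has to process. It uses only the Python standard library; the helper name `initial_cut_points` and the choice of the "inclusive" quantile method are assumptions for illustration, and the library may compute its quantiles differently.

```python
import statistics


def initial_cut_points(values, initial_intervals):
    """Return sorted, de-duplicated starting cut points.

    Hypothetical helper: with None, every unique value is a starting point
    (as in the original ChiMerge); otherwise quantiles are used.
    """
    if initial_intervals is None:
        return sorted(set(values))
    cuts = statistics.quantiles(values, n=initial_intervals, method="inclusive")
    # Heavily tied data can yield repeated quantiles, so de-duplicate
    return sorted(set(cuts))


values = list(range(1, 1001))          # 1000 unique values
all_unique = initial_cut_points(values, None)
cuts = initial_cut_points(values, 10)  # 9 cut points define 10 intervals

print(len(all_unique), len(cuts))
```

Starting from 10 intervals instead of 1000 means far fewer adjacent pairs to test and merge, which is where the run-time saving comes from.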

Attributes

boundaries_: dict
    A dictionary that maps feature name to its merged boundaries.

fit_sample_size_: int
    The sample size of the fitted data.

transform_sample_size_: int
    The sample size of the transformed data.

num_of_x_: int
    The number of features.

columns_: iterable
    An array or list of feature names.

Methods

fit(X, y):
    Fit the ChiMerge algorithm to the feature.

transform(X):
    Transform the feature using the fitted ChiMerge.

fit_transform(X, y):
    Fit the ChiMerge algorithm to the feature and transform it.
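As a conceptual illustration of the transform step, the sketch below maps raw values to interval labels of the form lower+delimiter+upper, given a list of merged boundaries. It does not call the library: the helper `to_interval_labels`, the padding of the extremes with -inf and inf, and the right-closed convention are assumptions made for this example.

```python
import bisect


def to_interval_labels(values, boundaries, delimiter="~", decimal=None):
    """Map each value to a 'lower~upper' label for right-closed intervals.

    Hypothetical helper for illustration only.
    """
    inf = float("inf")

    def fmt(bound):
        # Round finite boundaries when a number of decimals is requested
        if decimal is not None and bound not in (-inf, inf):
            return str(round(bound, decimal))
        return str(bound)

    # Assumption: the outermost intervals are open-ended
    edges = [-inf] + sorted(boundaries)
    if edges[-1] != inf:
        edges.append(inf)

    labels = []
    for value in values:
        # bisect_left finds the first edge >= value, i.e. the upper bound
        # of the right-closed interval (lower, upper]
        i = max(1, bisect.bisect_left(edges, value))
        labels.append(fmt(edges[i - 1]) + delimiter + fmt(edges[i]))
    return labels


labels = to_interval_labels([0.5, 2.0, 2.1, 9.0], boundaries=[2.0, 5.0])
print(labels)
```

Note that the value 2.0 lands in the interval whose upper boundary is 2.0, which is what a right-closed interval means.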

function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.plot_event_dist()

Visualizes a feature's event rate distribution to help evaluate its explainability.

Parameters

x: numpy.ndarray or pandas.DataFrame, shape (number of examples,)
    The feature to be visualized.

y: numpy.ndarray or pandas.DataFrame, shape (number of examples,)
    The dependent variable.

delimiter: string, optional(default='~')
    Each interval is represented by a string (e.g. '1~2'),
    which takes the form lower+delimiter+upper. This parameter
    controls the symbol that connects the lower and upper boundaries.

title: Python string. Optional.
    The title of the plot. Default is ''.

x_label: Python string. Optional.
    The label of the feature. Default is ''.

y_label: Python string. Optional.
    The label of the dependent variable. Default is ''.

x_rotation: int. Optional.
    The degree of rotation of x-axis ticks. Default is 60.

xticks: Python list of strings. Optional.
    The tick labels on x-axis. Default is the unique values
    of x (in the format of Python string).

figure_height: int. Optional.
    The height of the figure. Default is 4.

figure_width: int. Optional.
    The width of the figure. Default is 6.

data_table: boolean. Optional.
    Whether or not to include data table in the plot.
    Default is True.

table_vpos: float. Optional.
    Only used when parameter 'data_table' is True.
    The vertical position of the data table below the plot.
    table_vpos should be a negative float. Default is None,
    which means table_vpos will be determined automatically
    according to the number of intervals in the feature.

table_hpos: float. Optional.
    Only used when parameter 'data_table' is True.
    The horizontal position of the data table below the plot.
    Default is 0.01. Normally there is no need to change
    this parameter.

save: boolean. Optional.
    Whether or not the figure is saved to a local position.
    Default is False.

path: Python string. Optional.
    Only used when parameter 'save' is True.
    The local path where the figure will be saved.
    Default is ''.

file_name: Python string. Optional.
    Only used when parameter 'save' is True.
    The file will be named as f'{path}featuredist_{file_name}.png'

Return

f1_ax1: matplotlib.axes._subplots.AxesSubplot
    The figure object is returned.
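One practical consequence of the documented naming pattern f'{path}featuredist_{file_name}.png' is that path is concatenated directly with the file name, so it presumably needs to end with a path separator. The small sketch below only reproduces that string pattern; the helper `figure_file_path` is hypothetical and is not part of the library.

```python
def figure_file_path(path, file_name):
    """Reproduce the documented pattern f'{path}featuredist_{file_name}.png'.

    Hypothetical helper for illustration only.
    """
    return f"{path}featuredist_{file_name}.png"


# With a trailing separator the file lands inside the directory
with_sep = figure_file_path("plots/", "age")
# Without it, the directory name is fused into the file name
without_sep = figure_file_path("plots", "age")

print(with_sep)
print(without_sep)
```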

function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.feature_stat()

Compute the input feature's sample distribution, including the sample sizes, event sizes and event proportions of each feature value.

Parameters

x: numpy.array, shape (number of examples,)
    The discretized feature array. Each value represents a right-closed interval
    of the input feature, e.g. '1~8'.

y: numpy.array, shape (number of examples,)
    The binary dependent variable, where 1 represents the target event (positive class).

delimiter: python string. Default is '~'
    The symbol that separates the boundaries of an interval in array x.

Return

res: pandas.DataFrame, shape (number of intervals in the feature, 4)
    The feature distribution table.
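The kind of table described here can be approximated with a short standard-library sketch: for each interval, count the samples, count the events, and divide. The function name `feature_stat_sketch`, the four column meanings (interval, sample size, event size, event rate), and the sorting by lower boundary are assumptions for illustration; the actual function returns a pandas.DataFrame whose exact column names are not stated here.

```python
from collections import defaultdict


def feature_stat_sketch(x, y, delimiter="~"):
    """Return rows of (interval, sample_size, event_size, event_rate).

    Hypothetical helper for illustration only.
    """
    sizes = defaultdict(int)
    events = defaultdict(int)
    for interval, target in zip(x, y):
        sizes[interval] += 1
        events[interval] += int(target == 1)  # 1 marks the target event

    def lower_bound(interval):
        # Parse the lower boundary so intervals sort numerically
        return float(interval.split(delimiter)[0])

    rows = []
    for interval in sorted(sizes, key=lower_bound):
        n, e = sizes[interval], events[interval]
        rows.append((interval, n, e, e / n))
    return rows


x = ["1~8", "-inf~1", "1~8", "8~inf", "1~8", "-inf~1"]
y = [1, 0, 0, 1, 1, 0]
rows = feature_stat_sketch(x, y)
for row in rows:
    print(row)
```

Splitting on the delimiter is also why the delimiter should not be a character that can appear inside a boundary, such as the minus sign of a negative number.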

function: scorecardbundle.feature_discretization.FeatureIntervalAdjustment.feature_stat_str()

Compute the input feature's sample distribution in string format for printing. The distribution table returned (in string format) contains the sample sizes, event sizes and event proportions of each feature value.

Parameters

x: numpy.array, shape (number of examples,)
    The discretized feature array. Each value represents a right-closed interval
    of the input feature, e.g. '1~8'.

y: numpy.array, shape (number of examples,)
    The binary dependent variable, where 1 represents the target event (positive class).

delimiter: python string. Default is '~'
    The symbol that separates the boundaries of an interval in array x.

n_lines: integer. Default is 40.
    The number of '- ' used. This controls the length of horizontal lines in the table.

width: integer. Default is 20.
    This controls the width of each column.

Return

table_string: python string
    The feature distribution table in string format
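The roles of n_lines and width can be illustrated with a minimal formatting sketch: a horizontal rule built by repeating '- ' n_lines times, and cells padded to a fixed width. The function `feature_stat_str_sketch`, the header names, and the exact layout are assumptions for illustration and may differ from what the library prints.

```python
def feature_stat_str_sketch(rows, n_lines=40, width=20):
    """Render (interval, sample_size, event_size, event_rate) rows as text.

    Hypothetical helper for illustration only.
    """
    header = ("interval", "sample_size", "event_size", "event_rate")
    rule = "- " * n_lines  # horizontal line made of n_lines dashes

    def fmt(cells):
        # Pad every cell to the same column width
        return "".join(str(cell).ljust(width) for cell in cells)

    lines = [rule, fmt(header), rule]
    lines += [fmt((i, n, e, round(r, 4))) for i, n, e, r in rows]
    lines.append(rule)
    return "\n".join(lines)


rows = [("-inf~1", 2, 0, 0.0), ("1~8", 3, 2, 2 / 3), ("8~inf", 1, 1, 1.0)]
table_string = feature_stat_str_sketch(rows)
print(table_string)
```

With four columns of width 20, each row is 80 characters long, which matches the length of a rule built from 40 repetitions of '- '.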