SCBK MLOps Official Documentation
Modules
Exploratory Data Analysis (EDA) Module
This module performs exploratory data analysis (EDA) on a dataset. Refer to the documentation and the example Jupyter Notebook files when running your analysis.
- scbk_mlops.eda.auto_eda(data, report_path='auto_eda_report.html')
Generate an Auto EDA report as an HTML file using sweetviz.
- Parameters:
data (pd.DataFrame) – Input data to run EDA on.
report_path (str, optional) – File path where the generated report is saved.
- Returns:
None
- scbk_mlops.eda.auto_eda_comparison(data1, data2, report_path='auto_eda_comparison_report.html')
Generate an Auto EDA comparison report (e.g. for sub-segments of a DataFrame) as an HTML file using sweetviz.
- Parameters:
data1 (pd.DataFrame) – First DataFrame to compare.
data2 (pd.DataFrame) – Second DataFrame to compare.
report_path (str, optional) – File path where the generated report is saved.
- Returns:
None
- scbk_mlops.eda.plot_area_chart(df: DataFrame, x_column: str, y_column: str, title: str, x_label: str, y_label: str, alpha: float = 0.6, color: str = '#0473ea') → None
Draw an area chart.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_column – Column for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
alpha – Transparency.
color – Color.
- scbk_mlops.eda.plot_box_plot(df: DataFrame, column: str, title: str, y_label: str, color: str = '#525355') → None
Draw a box plot.
- Parameters:
df – DataFrame.
column – Column for the box plot.
title – Chart title.
y_label – Y-axis label.
color – Color.
- scbk_mlops.eda.plot_density_plot(df: DataFrame, column: str, hue: str | None = None, shade: bool = True, title: str = 'Density Plot', x_label: str = '', y_label: str = '', color: str = '#38d200') → None
Draw a density plot.
- Parameters:
df – DataFrame.
column – Column for the density plot.
hue – Hue.
shade – Whether to apply shading.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
color – Color.
- scbk_mlops.eda.plot_grouped_donut_pie_chart(df: DataFrame, group_column: str, value_column: str, colors: list = ['#38d200', '#0473ea', '#525355', '#0061c7'], title_prefix: str = 'Distribution for') → None
Draw a grouped donut pie chart.
- Parameters:
df – DataFrame.
group_column – Group column.
value_column – Value column.
colors – List of colors.
title_prefix – Prefix for the chart title.
- scbk_mlops.eda.plot_heatmap(df: DataFrame, color: str = '#0473ea') → None
Draw a heatmap.
- Parameters:
df – DataFrame.
color – Color.
- scbk_mlops.eda.plot_histogram(df: DataFrame, column: str, bins: int, title: str, x_label: str, y_label: str, color: str = '#0473ea') → None
Draw a histogram.
- Parameters:
df – DataFrame.
column – Column for the histogram.
bins – Number of bins.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
color – Color.
- scbk_mlops.eda.plot_multi_line_chart(df: DataFrame, x_column: str, y_columns: list, title: str, x_label: str, y_label: str, colors: list = ['#38d200', '#0473ea', '#525355', '#0061c7']) → None
Draw a multi-line chart.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_columns – List of columns for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
colors – List of colors.
- scbk_mlops.eda.plot_pie_chart(df: DataFrame, label_column: str, size_column: str | None = None, colors: list = ['#38d200', '#0473ea', '#525355', '#0061c7'], title: str = 'Pie Chart') → None
Draw a pie chart.
- Parameters:
df – DataFrame.
label_column – Label column.
size_column – Size column.
colors – List of colors.
title – Chart title.
- scbk_mlops.eda.plot_single_value_card(value: float, title: str, subtitle: str | None = None, font_size: int = 24, subtitle_size: int = 16, color: str = '#38d200') → None
Draw a single value card.
- Parameters:
value – Value to display.
title – Card title.
subtitle – Subtitle.
font_size – Font size.
subtitle_size – Subtitle font size.
color – Background color.
- scbk_mlops.eda.plot_spline_chart(df: DataFrame, x_column: str, y_column: str, title: str, x_label: str, y_label: str, color: str = '#0473ea') → None
Draw a spline chart.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_column – Column for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
color – Color.
- scbk_mlops.eda.plot_stacked_area_chart(df: DataFrame, x_column: str, y_columns: list, title: str, x_label: str, y_label: str, alpha: float = 0.6, colors: list = ['#38d200', '#0473ea', '#525355', '#0061c7']) → None
Draw a stacked area chart.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_columns – List of columns for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
alpha – Transparency.
colors – List of colors.
- scbk_mlops.eda.plot_stacked_column_chart(df: DataFrame, category_column: str, stack_column: str, colors: list = ['#38d200', '#0473ea', '#525355', '#0061c7']) → None
Draw a stacked column chart.
- Parameters:
df – DataFrame.
category_column – Category column.
stack_column – Stack column.
colors – List of colors.
- scbk_mlops.eda.plot_table_chart(df: DataFrame, title: str, col_width: float = 0.2, row_height: float = 0.4, font_size: int = 12, header_color: str = '#0473ea', row_colors: list = ['#525355', '#0061c7'], edge_color: str = 'black') → None
Draw a table chart.
- Parameters:
df – DataFrame.
title – Table title.
col_width – Column width.
row_height – Row height.
font_size – Font size.
header_color – Header color.
row_colors – List of row colors.
edge_color – Border color.
- scbk_mlops.eda.plot_treemap(df: DataFrame, size_column: str, label_column: str, color: str = '#0473ea') → None
Draw a treemap.
- Parameters:
df – DataFrame.
size_column – Size column.
label_column – Label column.
color – Color.
- scbk_mlops.eda.plot_trend_line_chart(df: DataFrame, x_column: str, y_column: str, title: str, x_label: str, y_label: str, color: str = '#0473ea') → None
Draw a trend line chart.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_column – Column for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
color – Color.
- scbk_mlops.eda.plot_violin_plot(df: DataFrame, x_column: str, y_column: str, title: str, x_label: str, y_label: str, color: str = '#0473ea') → None
Draw a violin plot.
- Parameters:
df – DataFrame.
x_column – Column for the X axis.
y_column – Column for the Y axis.
title – Chart title.
x_label – X-axis label.
y_label – Y-axis label.
color – Color.
Data Ingestion Module
This module provides functions for loading data. We recommend loading predefined data starting from the data extraction stage; even when that is not the case, the functions are designed to be as general as possible.
- scbk_mlops.ingestion.capitalize_columns(df: DataFrame) → DataFrame
Convert all DataFrame column names to uppercase.
- Parameters:
df (pd.DataFrame) – Input DataFrame to transform.
- Returns:
A new DataFrame with column names in uppercase.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> capitalize_columns(df)
   COL1  COL2
0     1     3
1     2     4
- scbk_mlops.ingestion.convert_and_save_to_parquet(input_dataframe: DataFrame, output_file_name: str) → None
Save a DataFrame loaded from a txt, csv, or similar file in Parquet (pq) format.
- Parameters:
input_dataframe (pd.DataFrame) – The DataFrame to be converted. Recommended to bring from the Data Catalog.
output_file_name (str) – Desired name for the output Parquet file (including the “.parquet” extension).
- Returns:
None
Examples
>>> df = pd.read_csv('input.csv')
>>> convert_and_save_to_parquet(df, 'output.parquet')
- scbk_mlops.ingestion.partition_data(df: DataFrame, partition_column: str, output_dir: str, file_format: str = 'csv') → None
Split the DataFrame into individual files based on the unique values in a specified column.
- Parameters:
df (pd.DataFrame) – The input DataFrame to be partitioned.
partition_column (str) – The column name to partition the DataFrame by.
output_dir (str) – The directory where the partitioned files will be saved.
file_format (str, optional) – The format of the output files (‘csv’ or ‘parquet’). Default is ‘csv’.
- Returns:
None
- Raises:
ValueError – If the specified file format is not supported.
Examples
>>> partition_data(df, 'Scorecard', './output', 'csv')
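The partitioning behaviour described above can be sketched with a pandas groupby. This is a minimal illustration, not the library's implementation: the function name partition_by_column, the output file naming pattern, and the returned list of paths are all assumptions made for the example.

```python
from pathlib import Path

import pandas as pd


def partition_by_column(df: pd.DataFrame, partition_column: str,
                        output_dir: str, file_format: str = "csv") -> list:
    """Write one file per unique value of partition_column (illustrative sketch)."""
    if file_format not in ("csv", "parquet"):
        raise ValueError(f"Unsupported file format: {file_format}")

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    # groupby yields one sub-DataFrame per unique value (NaN keys are skipped)
    for value, part in df.groupby(partition_column):
        # Hypothetical naming pattern: <column>_<value>.<format>
        path = out / f"{partition_column}_{value}.{file_format}"
        if file_format == "csv":
            part.to_csv(path, index=False)
        else:
            part.to_parquet(path, index=False)  # requires pyarrow or fastparquet
        written.append(path)
    return written
```

Returning the written paths is a convenience of this sketch only; the documented function returns None.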
- scbk_mlops.ingestion.typecast(df: DataFrame) → DataFrame
Detect the type of each column (string, int, float, etc.) and cast it to the appropriate data type. Null values are preserved, using errors="coerce" where applicable.
- Parameters:
df (pd.DataFrame) – The input DataFrame to be typecast.
- Returns:
A new DataFrame with columns cast to appropriate data types.
- Return type:
pd.DataFrame
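One way such null-preserving casting could work is sketched below with pd.to_numeric and errors="coerce". The function name typecast_sketch and the rule "adopt the numeric version only when coercion creates no new nulls" are assumptions for illustration; the actual rules used by typecast may differ.

```python
import pandas as pd


def typecast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Cast text columns to numeric when every non-null value parses as a number."""
    result = df.copy()
    for col in result.columns:
        series = result[col]
        # Columns that are already numeric are left untouched
        if pd.api.types.is_numeric_dtype(series):
            continue
        converted = pd.to_numeric(series, errors="coerce")
        # Adopt the numeric version only if coercion introduced no new nulls,
        # so the original nulls are preserved and no real values are lost
        if converted.notna().sum() == series.notna().sum():
            result[col] = converted
    return result
```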
Reporting Module
This module provides functions for reporting modeling results. It focuses mainly on generating documents related to model development; EDA reports should be generated with the EDA Module.
- scbk_mlops.reporting.calculate_woe_iv(data, feature, target)
Helper function to calculate Weight of Evidence (WoE) and Information Value (IV)
- Parameters:
data (pd.DataFrame) – Data containing the feature and target variable.
feature (str) – The binned feature column name.
target (str) – The target variable column name.
- Returns:
DataFrame containing WoE and IV values for each bin.
- Return type:
woe_iv_df (pd.DataFrame)
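For readers unfamiliar with WoE and IV, the calculation can be sketched as follows. The sketch assumes a binary target where 1 marks the event, and uses the convention WoE = ln(share of non-events / share of events); the source does not state which convention calculate_woe_iv uses, and the name woe_iv_sketch and its output column names are hypothetical. Bins with zero events or zero non-events would need smoothing, which is not handled here.

```python
import numpy as np
import pandas as pd


def woe_iv_sketch(data: pd.DataFrame, feature: str, target: str) -> pd.DataFrame:
    """Per-bin Weight of Evidence and Information Value (illustrative sketch)."""
    grouped = data.groupby(feature)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]

    # Share of all events / non-events that fall into each bin
    grouped["dist_event"] = grouped["events"] / grouped["events"].sum()
    grouped["dist_non_event"] = grouped["non_events"] / grouped["non_events"].sum()

    # Assumed convention: WoE = ln(dist_non_event / dist_event)
    grouped["woe"] = np.log(grouped["dist_non_event"] / grouped["dist_event"])
    # IV contribution of each bin; the feature's total IV is the column sum
    grouped["iv"] = (grouped["dist_non_event"] - grouped["dist_event"]) * grouped["woe"]
    return grouped.reset_index()
```

Summing the iv column of the result gives the Information Value of the whole feature.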
- scbk_mlops.reporting.data_drift_report(reference_data, current_data, column_mapping=None, report_path='data_drift_report.html')
Generate data drift report (evidently)
- Parameters:
reference_data (pd.DataFrame) – The reference dataset to compare against.
current_data (pd.DataFrame) – The current dataset to evaluate for data drift.
column_mapping (dict, optional) – Column mapping for evidently.
report_path (str, optional) – Path to save the generated report.
- Returns:
None
- scbk_mlops.reporting.dq_add_variable_description(data: DataFrame, variable_description_file: str) → DataFrame
Add variable description from the provided description file.
- scbk_mlops.reporting.dq_add_variable_description_char(data: DataFrame, variable_description_df: DataFrame) → DataFrame
Add variable description (char) from the provided description DataFrame.
- Parameters:
data (pd.DataFrame) – Data with variables to describe.
variable_description_df (pd.DataFrame) – DataFrame containing variable descriptions.
- Returns:
Data with variable descriptions added.
- Return type:
pd.DataFrame
- scbk_mlops.reporting.dq_add_variable_serial_number_char(data: DataFrame) → DataFrame
Add variable serial number label (char) to the data.
- scbk_mlops.reporting.dq_add_variable_type(data: DataFrame) → DataFrame
Add variable type and serial number to the data.
- scbk_mlops.reporting.dq_add_variable_type_char(data: DataFrame) → DataFrame
Add variable type label (char) to the data.
- scbk_mlops.reporting.dq_data_processing(file_path: str) → DataFrame
Process data for the data quality report.
- scbk_mlops.reporting.dq_data_processing_char(df: DataFrame) → DataFrame
Process character data from a DataFrame for the data quality report. Selects string columns and computes basic statistics.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing the data.
- Returns:
Processed data for the DQ report.
- Return type:
pd.DataFrame
- scbk_mlops.reporting.dq_data_reconstruction_char(data: DataFrame) → DataFrame
Reconstruct the data into report format for character variables.
- scbk_mlops.reporting.dq_median_analysis(data: DataFrame) → DataFrame
Perform median analysis and determine variables to drop.
- scbk_mlops.reporting.dq_output_report(data: DataFrame)
Generate the final DQ report and save it as an Excel and CSV file.
- scbk_mlops.reporting.dq_output_report_char(data: DataFrame, output_excel: str, output_csv: str)
Generate final output for character type and save as Excel and CSV files.
- Parameters:
data (pd.DataFrame) – Data to output.
output_excel (str) – Filename for the Excel output.
output_csv (str) – Filename for the CSV output.
- scbk_mlops.reporting.dq_process_and_merge_snapshots(snapshot_files)
- scbk_mlops.reporting.feature_selection_report(model, X_train, y_train, report_path='feature_selection_report.html')
Generate feature selection report from GrootCV
- Parameters:
model – The model used for feature selection.
X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target variable.
report_path (str, optional) – Path to save the generated report.
- Returns:
None
- scbk_mlops.reporting.generate_dq_report(reference_data, current_data, report_path='custom_data_quality_report.html')
Generate Evidently DQ Report
- scbk_mlops.reporting.generate_eligibility_waterfall(df, segment_column, segment_values, timestamp_column, timestamps, exclusion_criteria, response_columns, output_folder)
Generate a waterfall eligibility report based on exclusion criteria.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing customer data.
segment_column (str) – Column name indicating the customer segment.
segment_values (list) – List of values in segment_column defining the segment of interest.
timestamp_column (str) – Column name indicating the timestamp.
timestamps (list) – List of timestamps to process.
exclusion_criteria (list) –
List of dictionaries, each defining an exclusion criterion. Each dictionary should have keys:
’name’: Name of the criterion.
’flag_column’: Name of the flag column to create.
’condition’: A function that takes df and returns a boolean Series.
response_columns (list) – List of response columns to analyze.
output_folder (str) – Folder path to save the CSV files.
- Returns:
None
- Outputs:
CSV files saved in the output_folder containing the eligibility waterfall report.
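The exclusion_criteria argument is the least obvious part of this signature, so the sketch below shows one possible list of criteria and how a waterfall could apply them in order. The column names AGE and DORMANT_YN, the criteria themselves, and the helper waterfall_sketch are invented for illustration; the real function also handles segments, timestamps, response columns, and CSV output, which are omitted here.

```python
import pandas as pd

# Hypothetical criteria: each dict has the three documented keys
exclusion_criteria = [
    {
        "name": "Under 19",
        "flag_column": "EXCL_UNDER_19",
        "condition": lambda df: df["AGE"] < 19,
    },
    {
        "name": "Dormant account",
        "flag_column": "EXCL_DORMANT",
        "condition": lambda df: df["DORMANT_YN"] == "Y",
    },
]


def waterfall_sketch(df: pd.DataFrame, criteria: list) -> pd.DataFrame:
    """Apply criteria in order and record how many rows each one removes."""
    remaining = df.copy()
    rows = [{"step": "Start", "excluded": 0, "remaining": len(remaining)}]
    for criterion in criteria:
        # 'condition' takes the DataFrame and returns a boolean Series
        flag = criterion["condition"](remaining)
        remaining[criterion["flag_column"]] = flag
        # Rows flagged by this criterion leave the eligible population
        remaining = remaining.loc[~flag].copy()
        rows.append({
            "step": criterion["name"],
            "excluded": int(flag.sum()),
            "remaining": len(remaining),
        })
    return pd.DataFrame(rows)


customers = pd.DataFrame({
    "AGE": [15, 30, 40, 50],
    "DORMANT_YN": ["N", "Y", "N", "N"],
})
report = waterfall_sketch(customers, exclusion_criteria)
```

Because each criterion only sees rows that survived the previous ones, the order of the list changes the per-step counts but not the final eligible population.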
- scbk_mlops.reporting.generate_fine_classing_report(data, target_variable, report_path='fine_classing_report.html')
Generate fine classing report
- Parameters:
data (pd.DataFrame) – Dataset for fine classing.
target_variable (str) – The target variable for classification.
report_path (str, optional) – Path to save the generated report.
- Returns:
None
- scbk_mlops.reporting.generate_toc_model_card(train_data_path='train_data.csv', test_data_path='test_data.csv', target_column='target', prediction_column='prediction', report_dir='./model_card', model_name='Marketing Propensity Model', version='v1.0', model_description='', model_author='', model_type='', model_architecture='', date='', primary_use_case='', out_of_scope='', training_dataset_description='', training_data_source='', training_data_limitations='', evaluation_dataset_description='', evaluation_metrics='', decision_threshold='', considerations='', threshold_comment='', features_of_interest=None, limitations=None, ethical_considerations=None)
Generate a Model Card using Evidently.ai with customizable text fields and plots.
- Parameters:
train_data_path (str) – Path to the training data CSV file.
test_data_path (str) – Path to the testing data CSV file.
target_column (str) – Name of the target column in the datasets.
prediction_column (str) – Name of the prediction column in the datasets.
report_dir (str) – Directory to store model card artifacts.
model_name (str) – Name of the model being documented.
version (str) – Version of the model.
model_description (str) – Description of the model.
model_author (str) – Author of the model.
model_type (str) – Type of the model.
model_architecture (str) – Architecture of the model.
date (str) – Date of the model.
primary_use_case (str) – Primary use case of the model.
out_of_scope (str) – Applications out of scope for the model.
training_dataset_description (str) – Description of the training dataset.
training_data_source (str) – Source of the training data.
training_data_limitations (str) – Limitations of the training data.
evaluation_dataset_description (str) – Description of the evaluation dataset.
evaluation_metrics (str) – Evaluation metrics used.
decision_threshold (str) – Decision threshold for classification.
considerations (str) – Caveats and recommendations.
threshold_comment (str) – Comments about decision thresholds.
features_of_interest (list, optional) – List of features to highlight in the report.
limitations (str, optional) – Known limitations of the model.
ethical_considerations (str, optional) – Ethical considerations for model use.
- Output:
None: Saves the model card report as an HTML file in the specified directory.
- scbk_mlops.reporting.make_float(x)
- scbk_mlops.reporting.output_dict_resp_analysis()
Output summary statistics (null count, decrease count, and percentiles; the percentiles are for reference only).
- scbk_mlops.reporting.response_analysis_to_excel(output_dict, file_name, topic='', leave_opened=False)
Exports data from the output_dict to an Excel file with the specified name.
- Parameters:
output_dict (dict) – Dictionary containing months as keys and dictionaries with dataframes ‘df1’ and ‘df2’ as values.
file_name (str) – Desired name/path for the Excel file.
topic (str) – Topic (product) for the analysis.
leave_opened (bool) – Leaves the workbook opened for final check.
- scbk_mlops.reporting.response_crosstab(df: DataFrame, r_col_name: str, a_col_name: str, band=[0, 1000000, 5000000, 10000000, 30000000, 50000000, 100000000, inf], band_labels=['0Mto1M', '1Mto5M', '5Mto10M', '10Mto30M', '30Mto50M', '50Mto100M', '100M+'], pct_bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, inf], pct_labels=['0-10%', '10%-20%', '20%-30%', '30%-40%', '40%-50%', '50%-60%', '60%-70%', '70%-80%', '80%-90%', '90%-100%', '100%+'], tot_period=['0'], positive_direction=True)
Create a crosstab to use when exploring the response definition.
- Parameters:
df (DataFrame) – Input data.
r_col_name – Ratio column name (for percentage calculation).
a_col_name – Amount column name (for balance calculation).
band (list) – List of predefined ranges for the bins.
band_labels (list) – List of predefined names for each bin.
pct_bins (list) – List of predefined percent ranges for the bins.
pct_labels (list) – List of predefined percent names for each bin.
tot_period – ‘0’ takes the entire period, while other values take individual snapshots based on the input (typically base_yymm).
positive_direction – Positives and negatives should be taken into consideration.
- Returns:
client_crosstab – A table based on client number counts for each cross bin.
balance_crosstab – A table based on balance amount for each cross bin.
By default, the bins are left inclusive and right exclusive.
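The left-inclusive, right-exclusive binning can be reproduced with pd.cut and right=False. The sketch below uses shortened band and percent lists (not the full defaults from the signature) and made-up sample values, and it only builds a count-based crosstab; it is not the library's implementation.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative versions of the band and percent bins
band = [0, 1_000_000, 5_000_000, 10_000_000, np.inf]
band_labels = ["0Mto1M", "1Mto5M", "5Mto10M", "10M+"]
pct_bins = [0, 50, 100, np.inf]
pct_labels = ["0-50%", "50%-100%", "100%+"]

amounts = pd.Series([0, 999_999, 1_000_000, 7_500_000, 10_000_000])
ratios = pd.Series([10, 50, 120, 99.9, 100])

# right=False makes every interval [left, right): a value equal to a
# boundary falls into the bin that starts at that boundary
balance_band = pd.cut(amounts, bins=band, labels=band_labels, right=False)
pct_band = pd.cut(ratios, bins=pct_bins, labels=pct_labels, right=False)

# Count of clients in each (balance band, percent band) cell
client_crosstab = pd.crosstab(balance_band.rename("balance_band"),
                              pct_band.rename("pct_band"))
```

Here 1,000,000 lands in 1Mto5M rather than 0Mto1M, and a ratio of exactly 100 lands in 100%+.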
- scbk_mlops.reporting.snapshot_count(df, base_yymm, segment, resp)
Snapshot profile report
Data Engineering Module
This module performs data preprocessing and feature engineering.
- scbk_mlops.data_engineering.Data_Sampling(samplingdata=None, user_specified_ratio=None, response_variable=None)
Returns sampled data with snapshot, id and target variable where target variable is deduplicated and non-responders are sampled using ratio at the snapshot level.
- Parameters:
samplingdata – Input data (n_cust and responders).
user_specified_ratio – Ratio used to select the non-responder sample.
response_variable – The response variable to be used for sampling.
- Returns:
DataFrame with sampled data.
- Return type:
final_sample (pd.DataFrame)
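A rough sketch of snapshot-level sampling is shown below. It assumes the ratio means "number of non-responders kept per responder within each snapshot"; the source does not define the ratio precisely, so this interpretation, the function name sample_non_responders, and the explicit column-name arguments are all assumptions.

```python
import pandas as pd


def sample_non_responders(df: pd.DataFrame, snapshot_col: str, id_col: str,
                          response_col: str, ratio: float,
                          seed: int = 42) -> pd.DataFrame:
    """Keep all responders and sample non-responders per snapshot (sketch)."""
    # Deduplicate: one row per (snapshot, id), keeping the highest response flag
    dedup = df.groupby([snapshot_col, id_col], as_index=False)[response_col].max()

    parts = []
    for _, snap in dedup.groupby(snapshot_col):
        responders = snap[snap[response_col] == 1]
        non_responders = snap[snap[response_col] == 0]
        # Assumed meaning of the ratio: non-responders kept per responder
        n_keep = min(len(non_responders), int(len(responders) * ratio))
        parts.append(responders)
        parts.append(non_responders.sample(n=n_keep, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```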
- scbk_mlops.data_engineering.add_change_features(df) → DataFrame
Add change features.
- scbk_mlops.data_engineering.add_penetration_features(df) → DataFrame
Calculate penetration features by dividing account balances by AUM balances.
- scbk_mlops.data_engineering.apply_feature_selection(train_df, test_df, resp=None)
Apply feature selection to train and test datasets.
- Parameters:
train_df (pd.DataFrame) – The training dataset.
test_df (pd.DataFrame) – The testing dataset.
resp (str) – The target field/response variable.
- Returns:
train_reduced_df (pd.DataFrame) – The reduced training dataset after feature selection.
test_reduced_df (pd.DataFrame) – The reduced testing dataset after feature selection.
This function transforms both the train and test datasets using the fitted feature selection pipeline and ensures that any protected variables (if applicable) are included in the final datasets for potential fairness analysis.
- scbk_mlops.data_engineering.feature_selection(S2: DataFrame, resp='', existing_pipeline='')
Apply GrootCV feature selection using the provided training data.
- Parameters:
S2 (pd.DataFrame) – The training dataset.
resp (str) – The target field/response variable.
existing_pipeline (str) – The name of an existing feature selection pipeline to load and apply.
This function either fits a new feature selection pipeline on the provided dataset or loads an existing one and applies it for feature selection.
- scbk_mlops.data_engineering.fit_feature_selection(train_df, resp)
Fit the feature selection pipeline and save it to a file.
- Parameters:
train_df (pd.DataFrame) – The training dataset.
resp (str) – The target field/response variable for feature selection.
This function fits a feature selection pipeline on the provided training data and saves the resulting pipeline object as a pickle file, named with a timestamp for versioning.
- scbk_mlops.data_engineering.fit_feature_selection_pipeline(train_df, customer_id_field='CIFNO', timestamp_field='base_yyyymm', target_field='')
Build and fit the feature selection pipeline.
- Parameters:
train_df (pd.DataFrame) – The training dataset.
customer_id_field (str) – The field representing customer IDs. Defaults to ‘CIFNO’.
timestamp_field (str) – The field representing timestamp or time period. Defaults to ‘base_yyyymm’.
target_field (str) – The target field/response variable.
- Returns:
A fitted feature selection pipeline.
This function breaks down the input dataset into features (X) and the target (y), applies sample weighting to account for class imbalance, and then fits a feature selection pipeline consisting of two stages of GrootCV and collinearity filtering.
- scbk_mlops.data_engineering.output_eligible_dataset()
Output eligible dataset for model input
- scbk_mlops.data_engineering.process_null_values(df)
Preprocess null values in the DataFrame based on column data types.
- Parameters:
df (pd.DataFrame) – Input DataFrame with potential null values.
- Returns:
DataFrame with null values processed based on data types.
- Return type:
pd.DataFrame
- scbk_mlops.data_engineering.split_data(df)
Split data into three DataFrames S1, S2, S3 based on unique BASE_YYMM values.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing a column ‘BASE_YYMM’.
- Returns:
Three DataFrames (S1, S2, S3), where S1 corresponds to the earliest, S2 to the next, and S3 to the latest BASE_YYMM.
- Return type:
tuple
- Raises:
ValueError – If ‘BASE_YYMM’ column does not have exactly three distinct values or if the values are not in the required format (YYYYMM).
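The splitting and validation rules described above can be sketched as follows. The function name split_by_snapshot and the exact error messages are assumptions; the sketch only mirrors the documented contract (exactly three distinct values in YYYYMM format, returned in chronological order).

```python
import pandas as pd


def split_by_snapshot(df: pd.DataFrame, column: str = "BASE_YYMM") -> tuple:
    """Split df into (S1, S2, S3) ordered from earliest to latest snapshot."""
    as_text = df[column].astype(str)
    # Sorting YYYYMM strings alphabetically also sorts them chronologically
    values = sorted(as_text.unique())

    if len(values) != 3:
        raise ValueError(f"Expected exactly 3 distinct {column} values, got {len(values)}")
    for value in values:
        # YYYYMM: six digits with a month between 01 and 12
        if not (len(value) == 6 and value.isdigit() and 1 <= int(value[4:]) <= 12):
            raise ValueError(f"{column} value {value!r} is not in YYYYMM format")

    return tuple(df[as_text == value].copy() for value in values)
```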
Model Engineering Module
This module performs modeling. It implements functions that build models using AutoML and MLFlow; use these functions for your modeling work.
- scbk_mlops.model_engineering.automl_pipeline(data, target, experiment_name)
Perform automated model selection and tuning using Pycaret.
- Parameters:
data (pd.DataFrame) – Input dataset.
target (str) – Name of the target variable in the dataset.
experiment_name (str) – Name of the Pycaret experiment.
- Returns:
The best model selected and tuned by Pycaret.
- Return type:
best_model
- scbk_mlops.model_engineering.cross_validation(model, data, target, cv, scoring)
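Apply cross-validation techniques (e.g. k-fold, stratified k-fold) for robust model evaluation.
- Parameters:
model – Model to evaluate.
data (pd.DataFrame) – Dataset to use for cross-validation.
target (str) – Name of the target variable.
cv (int) – Number of cross-validation folds.
scoring (str) – Scoring metric to use for evaluation.
- Returns:
Cross-validation results including the score for each fold.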
- Return type:
cv_results (pd.DataFrame)
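To make the idea of stratified k-fold concrete, the sketch below assigns row indices to folds so that each fold keeps roughly the same class proportions. It is a dependency-free illustration of the technique only; the helper stratified_folds is hypothetical, does no shuffling, and says nothing about how cross_validation builds its folds internally.

```python
from collections import defaultdict


def stratified_folds(labels: list, n_splits: int) -> list:
    """Assign each index to a fold so class proportions stay roughly equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, n_splits=4)
# Each fold serves once as the validation set; the rest form the training set
```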
- scbk_mlops.model_engineering.init_mlflow(experiment_name, tracking_uri=None)
Initialize MLFlow.
- Parameters:
experiment_name (str) – Name of the MLFlow experiment to use.
tracking_uri (str, optional) – URI of the MLFlow tracking server. Defaults to the local file system.
- Returns:
None
- scbk_mlops.model_engineering.load_oot_dataset(train_path, valid_path, test_path)
Load the training, validation, and test datasets.
- Parameters:
train_path (str) – Path to the training dataset file.
valid_path (str) – Path to the validation dataset file.
test_path (str) – Path to the test dataset file.
- Returns:
train_data (pd.DataFrame) – Training dataset.
valid_data (pd.DataFrame) – Validation dataset.
test_data (pd.DataFrame) – Test dataset.
- scbk_mlops.model_engineering.mlflow_pipeline(model, params, metrics, artifact_path)
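Use MLFlow for experiment tracking and model lifecycle management.
- Parameters:
model – Model to log to MLFlow.
params (dict) – Parameters associated with the model.
metrics (dict) – Evaluation metrics of the model.
artifact_path (str) – Path where model artifacts are stored.
- Returns:
MLFlow run ID.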
- Return type:
run_id (str)
- scbk_mlops.model_engineering.output_scored_predictions(data, model_path)
Output a prediction score for each row using the saved best model.
- Parameters:
data (pd.DataFrame) – Input data to predict on.
model_path (str) – Path to the saved model file.
- Returns:
Prediction score for each row.
- Return type:
predictions (pd.Series)
- scbk_mlops.model_engineering.pycaret_automl(data, target, session_id=None)
Run Pycaret for AutoML.
- Parameters:
data (pd.DataFrame) – Input data for model training.
target (str) – Column name of the target variable to predict.
session_id (int, optional) – Session ID for reproducibility.
- Returns:
The best model selected by Pycaret.
- Return type:
best_model
- scbk_mlops.model_engineering.save_best_model(model, file_name)
Save the best-performing model, according to the evaluation metrics, as a .pkl file.
- Parameters:
model – Trained model object to save.
file_name (str) – Name of the file to save the model to.
- Returns:
None
- scbk_mlops.model_engineering.save_experiment_parameters(params, file_path)
Save experiment parameters for use with Kedro.
- Parameters:
params (dict) – Dictionary of experiment parameters to save.
file_path (str) – File path where the parameters are saved.
- Returns:
None
- scbk_mlops.model_engineering.save_scored_predictions(predictions, file_path)
Save prediction scores to a .json file for later use.
- Parameters:
predictions (pd.Series or pd.DataFrame) – Prediction scores to save.
file_path (str) – Path to the .json file where the prediction scores are saved.
- Returns:
None
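Saving scores as JSON can be done directly with pandas, as in the sketch below. The helper name save_predictions_json and the customer IDs are hypothetical, and the JSON layout shown (pandas' default orientation) is an assumption; the layout written by save_scored_predictions is not documented here.

```python
import json
import os
import tempfile

import pandas as pd


def save_predictions_json(predictions, file_path: str) -> None:
    """Write a Series or DataFrame of scores to a .json file (sketch)."""
    # Both pd.Series and pd.DataFrame provide to_json
    predictions.to_json(file_path)


scores = pd.Series([0.25, 0.75], index=["cust_1", "cust_2"])
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.json")
    save_predictions_json(scores, path)
    # Read the file back to confirm what was written
    with open(path) as handle:
        loaded = json.load(handle)
```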
Model Evaluation Module
This module provides functions for evaluating modeling results. It mainly offers functions for classification and clustering; use them to evaluate your modeling results.
- scbk_mlops.model_evaluation.bias_fairness_assessment(model, X_test, y_test, sensitive_feature)
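Assess bias and fairness using Wasserstein distance plots or other fairness metrics.
- Parameters:
model – Model to evaluate.
X_test (pd.DataFrame) – Test input data.
y_test (pd.Series) – Test target variable.
sensitive_feature (str) – Column name of the sensitive feature.
- Returns:
Dictionary of bias and fairness assessment results.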
- Return type:
fairness_metrics (dict)
- scbk_mlops.model_evaluation.check_governance_measures(governance_policies)
Verify that all governance-related measures have been taken into account.
- Parameters:
governance_policies (list) – List of governance policies that must be applied.
- Returns:
Compliance status for each policy.
- Return type:
compliance_report (dict)
- scbk_mlops.model_evaluation.create_leaderboard(model_results)
Create model leaderboard based on @30 S3
- Parameters:
model_results (list of dict) – List of dictionaries containing the results of each model. Example: [{‘model_name’: ‘Model1’, ‘accuracy’: 0.95, ‘f1_score’: 0.94}, …]
- Returns:
Leaderboard DataFrame comparing model performance.
- Return type:
leaderboard (pd.DataFrame)
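Turning a model_results list of this shape into a ranked table is straightforward with pandas. The sketch below ranks by f1_score purely as an assumption; the metric that create_leaderboard actually ranks on is not specified in this section, and leaderboard_sketch is a hypothetical name.

```python
import pandas as pd


def leaderboard_sketch(model_results: list, sort_by: str = "f1_score") -> pd.DataFrame:
    """Build a ranked leaderboard from a list of per-model result dicts."""
    board = pd.DataFrame(model_results)
    # Best model first; the sort key is an assumption of this sketch
    board = board.sort_values(sort_by, ascending=False).reset_index(drop=True)
    board.insert(0, "rank", range(1, len(board) + 1))
    return board


results = [
    {"model_name": "Model1", "accuracy": 0.95, "f1_score": 0.94},
    {"model_name": "Model2", "accuracy": 0.93, "f1_score": 0.96},
]
board = leaderboard_sketch(results)
```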
- scbk_mlops.model_evaluation.explainable_ai(model, X_train)
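Strengthen transparency and interpretability by integrating explainability methods such as SHAP and LIME.
- Parameters:
model – Model to explain.
X_train (pd.DataFrame) – Training input data.
- Returns:
Dictionary containing the SHAP and LIME explanation results.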
- Return type:
explanations (dict)
- scbk_mlops.model_evaluation.feature_impact(model, X_train, y_train, grootcv_importances)
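Evaluate differences from the GrootCV results.
- Parameters:
model – Model to evaluate.
X_train (pd.DataFrame) – Training input data.
y_train (pd.Series) – Training target variable.
grootcv_importances (pd.Series) – Series of feature importances obtained from GrootCV.
- Returns:
Comparison of feature importances between the model and GrootCV.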
- Return type:
comparison (pd.DataFrame)
- scbk_mlops.model_evaluation.generate_leaderboard_report(models, X_test, y_test)
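Generate a leaderboard report comparing models (e.g. S1 vs S2, S1 vs S3).
- Parameters:
models (dict) – Dictionary of model names and model objects. Example: {‘Model1’: model1, ‘Model2’: model2}
X_test (pd.DataFrame) – Test input data.
y_test (pd.Series) – Test target variable.
- Returns:
Leaderboard report containing performance metrics for each model.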
- Return type:
report (pd.DataFrame)
- scbk_mlops.model_evaluation.model_evaluation_rai(model, X_test, y_test, sensitive_features)
Review fairness using Responsible AI (RAI) tools.
- Parameters:
model – Model to evaluate.
X_test (pd.DataFrame) – Test input data.
y_test (pd.Series) – Test target variable.
sensitive_features (pd.Series) – Series of sensitive features.
- Returns:
Dictionary of fairness evaluation metrics.
- Return type:
rai_metrics (dict)
- scbk_mlops.model_evaluation.model_explainability(model, X_train)
Provide model explainability insights (Global Explainability/Feature Impact/Local Explainability).
- Parameters:
model – Model to explain.
X_train (pd.DataFrame) – Training input data.
- Returns:
Dictionary containing the explainability reports.
- Return type:
explainability_reports (dict)
- scbk_mlops.model_evaluation.model_fairness_check(model, X_test, y_test, sensitive_features)
Perform a fairness audit to identify and mitigate bias in model predictions.
- Parameters:
model – Model to evaluate.
X_test (pd.DataFrame) – Test input data.
y_test (pd.Series) – Test target variable.
sensitive_features (pd.Series) – Series of sensitive features.
- Returns:
Fairness audit results.
- Return type:
audit_results (dict)
- scbk_mlops.model_evaluation.output_evaluation_artifacts(reference_data, current_data)
Output various RAI and governance evaluation artifacts, such as fairness and bias.
- Parameters:
reference_data (pd.DataFrame) – Reference dataset.
current_data (pd.DataFrame) – Current dataset to evaluate.
- Returns:
None
Model Monitoring Module
This module performs model monitoring. It mainly provides functions for monitoring model predictions and detecting anomalies.
- scbk_mlops.model_monitoring.alert_report(metric_thresholds, current_metrics)
Generate alert report upon trigger
- Parameters:
metric_thresholds (dict) – Dictionary of metric thresholds.
current_metrics (dict) – Dictionary of current metric values.
- Returns:
Dictionary containing details of the triggered alerts.
- Return type:
alert_report (dict)
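The threshold comparison behind such an alert report might look like the sketch below. The threshold format (a dict with optional "min" and "max" bounds per metric), the metric names auc and psi, and the shape of the returned alerts are all assumptions for illustration; the formats expected by alert_report are not documented in this section.

```python
def build_alert_report(metric_thresholds: dict, current_metrics: dict) -> dict:
    """Return details for every metric that breaches its bounds (sketch)."""
    alerts = {}
    for metric, bounds in metric_thresholds.items():
        if metric not in current_metrics:
            continue  # nothing to compare for this metric
        value = current_metrics[metric]
        low = bounds.get("min")
        high = bounds.get("max")
        if low is not None and value < low:
            alerts[metric] = {"value": value, "breached": "min", "limit": low}
        elif high is not None and value > high:
            alerts[metric] = {"value": value, "breached": "max", "limit": high}
    return alerts


# Hypothetical metrics: a performance floor and a drift ceiling
thresholds = {"auc": {"min": 0.70}, "psi": {"max": 0.20}}
current = {"auc": 0.65, "psi": 0.10}
alerts = build_alert_report(thresholds, current)
```

Using separate lower and upper bounds lets one report cover performance metrics, where lower is worse, and drift metrics, where higher is worse.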
- scbk_mlops.model_monitoring.evidently_dashboard(reference_data, current_data, column_mapping=None)
Generate evidently.ai dashboard for drift check
- Parameters:
reference_data (pd.DataFrame) – Reference dataset.
current_data (pd.DataFrame) – Current dataset to evaluate.
column_mapping (dict, optional) – Dictionary for column mapping.
- Returns:
None
- scbk_mlops.model_monitoring.model_decision_making(performance_metrics, drift_metrics, thresholds)
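Decide whether the current model is sufficient or needs retraining.
- Parameters:
performance_metrics (dict) – Dictionary of model performance metrics.
drift_metrics (dict) – Dictionary of data or model drift metrics.
thresholds (dict) – Dictionary of performance and drift thresholds.
- Returns:
Either ‘retrain’ or ‘keep’.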
- Return type:
decision (str)
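A minimal decision rule consistent with this description is sketched below. It assumes that performance metrics breach when they fall below their threshold and drift metrics breach when they rise above it, and that a single breach is enough to recommend retraining; the real rule in model_decision_making may be different. The function and metric names are hypothetical.

```python
def decide_sketch(performance_metrics: dict, drift_metrics: dict,
                  thresholds: dict) -> str:
    """Return 'retrain' if any metric breaches its threshold, else 'keep'."""
    # Assumption: performance metrics are "higher is better"
    performance_breach = any(
        value < thresholds[name]
        for name, value in performance_metrics.items()
        if name in thresholds
    )
    # Assumption: drift metrics are "lower is better"
    drift_breach = any(
        value > thresholds[name]
        for name, value in drift_metrics.items()
        if name in thresholds
    )
    return "retrain" if performance_breach or drift_breach else "keep"


decision = decide_sketch(
    performance_metrics={"auc": 0.72},
    drift_metrics={"psi": 0.31},
    thresholds={"auc": 0.70, "psi": 0.25},
)
```

In this usage the AUC is still above its floor, but the drift metric exceeds its ceiling, so the sketch returns 'retrain'.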
- scbk_mlops.model_monitoring.output_recommendation(decision)
Provide a recommendation on whether to keep using the current model or train a new one.
- Parameters:
decision (str) – Model decision (‘retrain’ or ‘keep’).
- Returns:
String describing the recommendation.
- Return type:
recommendation (str)
- scbk_mlops.model_monitoring.rai_dashboard(reference_data, current_data, model, sensitive_features, target_column)
Generate RAI dashboard for drift check
- Parameters:
reference_data (pd.DataFrame) – Reference dataset.
current_data (pd.DataFrame) – Current dataset to evaluate.
model – Model object to evaluate.
sensitive_features (list or pd.Series) – List or Series of sensitive features.
target_column (str) – Column name of the target variable.
- Returns:
The generated RAI dashboard object.
- Return type:
dashboard
- scbk_mlops.model_monitoring.update_monitoring(new_results, monitoring_data_path)
Update the monitoring information with the most recent model results.
- Parameters:
new_results (pd.DataFrame) – Latest model result data.
monitoring_data_path (str) – File path where the monitoring data is saved.
- Returns:
None