Metadata-Version: 2.1
Name: spd-eda
Version: 0.0.0
Summary: initial testing
Author-email: sdooley <author@example.com>
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas (>=1.4.*)
Requires-Dist: xlsxwriter (>=3.0.*)
Requires-Dist: pyodbc (>=4.0.*)
Requires-Dist: seaborn (>=0.12.*)
Requires-Dist: scipy (>=1.9.*)
Requires-Dist: awswrangler (>=2.16.*)


# SPD's collection of EDA tools

## For use on standard dataframes

##### EDA on pandas dataframes
Use the `DataFrameEda` class for quick analysis on a pandas dataframe.

Arguments:
- dataframe
- user-defined aggregation function (returns a series when applied to dataframe rows)
- control variable list: list of one-way variables to produce control totals for the dataframe

Usage:
```
from spd_eda import DataframeEda

eda_obj = DataframeEda(df, my_agg_fcn, control_var_list=['city'])`, where `my_agg_fcn` is some user-defined function.
```


##### Simple `ExcelExport` class
The `ExcelExport` class can write formatted output into excel.

Arguments:
- filename
- control_df_list: a list of dataframes to include on the 'control' tab of the output
- var_summary: single dataframe with one record per column... first column should be the variable name
- var_info_dict: dictionary where keys are column names & values are one-way stats for the variable (using the agg function)
- col_type_dict (optional): dictionary to control number formatting in excel
- col_cond_format_list (optional): list of columns to conditionally format in the excel

Usage:

```
from spd_eda import DataframeEda, ExcelExport

eda_obj = DataframeEda(df, my_agg_fcn, control_var_list=['city'])

# write it out
col_type_dict = {
        'fid': '#,##0',
        'building_a': '#,##0',
        'price_rang': '#,##0',
        'price_ra_1': '#,##0',
        }
col_cond_format_list = ['fid']
ExcelExport('EDA_station_16_addresses.xlsx',
            eda_obj.control_df_list, eda_obj.var_summary, eda_obj.var_info_dict,
            col_type_dict=col_type_dict, col_cond_format_list=col_cond_format_list)
```

## For use on PA/Predict entities

Must be running from a machine with database access.


##### DataView EDA 
The `DataViewEda` class for doing quick analysis on a data view in the Predict software.

Arguments:
- database name
- data view name
- secondary_var_list (optional): list of secondary fields that be part of two-way comparisons for all data elements

Once object is created, use the `export_summary()` method to create formatted output in excel.

Usage:

```
from spd_eda import DataViewEda

test_dv_eda = DataViewEda(
        'GuidewireResearchCM_Insight',
        'eda_testing',
        secondary_var_list=['ax_year', 'LOSSSTATE'],
        )

test_dv_eda.export_summary("DataViewEDA_eda_testing.xlsx", obj_to_format=['LR'])       
```


##### AnalysisSummary 
The `AnalysisSummary` class supplements the analysis that comes from the Predict software, producing a file of the form `ModelSummary_<analysis_id>.xlsx`.

Arguments:
- database name
- analysis ID
- geo_col: column name <b> already included in the data view</b> to use for geography (will be used in two-way exhibits)
- naics_col: column name <b> already included in the data view</b> to use for naics (can be anything... will just be used for two-ways)
- objectives_to_format: list of metrics to conditionally format in the output

Usage:

```
db_name='GuidewireCooperativeModelsClaimsHO_Insight'
analysis_id='711b0a41-a49b-46df-8dff-68dd04776520'

AnalysisSummary(db_name=db_name, analysis_id=analysis_id, geo_col='LOSSSTATE', naics_col='LOB', objectives_to_format=['Freq'])      
```



## For use on Athena tables

Must have appropriate AWS credentials to connect.


##### Athena Table
Use the `AthenaTable` class for doing quick analysis on a data view in the Predict software.

Arguments:
- database name
- table name

Once instantiated, a variety of methods are available to examine the table.

Instantiation:

```
from spd_eda import AthenaTable

EDA_osha_accident_injury_raw = AthenaTable('assess_db', 'osha_accident_injury_raw')      
```

Available properties:

- `row_count`: integer value with number of rows in the table
- `information_schema`: dataframe with metadata on the columns in the table
- `col_info`: dictionary of key: value pairs, where key is the column name & value is dataframe of (binned) record counts

Available methods:

- `get_sample_records(num_rows=10, filter_string="1=1", col_subset_list=[])` - returns dataframe
- `get_row_count(filter_string="1=1")` - returns row count, note the optional filter
- `get_custom_counts(custom_expression)` - returns record counts based on the provided expression
- `get_records_per_thread_summary(thread_list, filter_string="1=1")` - Define a "thread" as set of columns, provides value distribution of # records that exist within each thread (useful for finding keys)
- `get_thread_examples_with_specified_num_records(thread_list, records_per_thread, num_thread_examples=1, filter_string="1=1")` - returns dataframe with examples of "threads" with the desired number of "records-per-thread".  Useful in conjunction with `get_records_per_thread_summary()`
- `get_column_info()` - This populates the `col_info` attribute... can take a long time to run.
- `write_summary(filename)` - creates excel file summarizing the table

```
from spd_eda import AthenaTable

EDA_osha_accident_injury_raw = AthenaTable('assess_db', 'osha_accident_injury_raw')

# distributions by column & expression
EDA_osha_accident_injury_raw.get_custom_counts("degree_of_inj")
EDA_osha_accident_injury_raw.get_custom_counts("SUBSTRING (load_dt, 1, 4)")

# thread hunting
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr'])
EDA_osha_accident_injury_raw.get_records_per_thread_summary(['summary_nr', 'injury_line_nr'])
EDA_osha_accident_injury_raw.get_thread_examples_with_specified_num_records(['summary_nr', 'injury_line_nr'], 2, num_thread_examples=3)

# generating excel summary
EDA_osha_accident_injury_raw.get_column_info()
EDA_osha_accident_injury_raw.write_summary('EDA_osha_accident_injury_raw.xlsx')
```

