usc-isi-i2/dsbox-profiling

Introduction

A TA1 primitive for the D3M project. It currently generates data profiles for tabular data. We use the pandas DataFrame as our main data type.
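To give a rough idea of what a per-column data profile over a DataFrame can look like, here is a minimal sketch using plain pandas. It is not the dsbox-dataprofiling implementation: the helper `profile_column` is hypothetical, and the exact definitions the package uses for these metafeatures (for example, whether missing values count as distinct) are assumptions here.

```python
import pandas as pd


def profile_column(col: pd.Series) -> dict:
    """Hypothetical helper: compute a few simple metafeatures for one column.

    Assumption: missing values are excluded from the distinct count and
    ratios are taken over the total number of rows.
    """
    n = len(col)
    missing = int(col.isna().sum())
    distinct = int(col.nunique(dropna=True))
    return {
        "number_of_missing_values": missing,
        "ratio_of_missing_values": missing / n if n else 0.0,
        "number_of_distinct_values": distinct,
        "ratio_of_distinct_values": distinct / n if n else 0.0,
    }


# Small made-up table, used only for illustration.
df = pd.DataFrame({
    "age": [31, 45, None, 31],
    "city": ["LA", "NYC", "LA", None],
})

# One dictionary of metafeatures per column.
profile = {name: profile_column(df[name]) for name in df.columns}
print(profile)
```

The metafeature names in the returned dictionary are taken from the list of computable metafeatures in the Usage section; everything else in the sketch is illustrative.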

Requirements

See setup.py.

Installation

  1. Install the d3m project first.

  2. Use pip:

pip install dsbox-dataprofiling

Usage

See example.py.

You can choose the metafeatures based on what you need. The computable metafeatures are:

computable_metafeatures = ['ratio_of_values_containing_numeric_char', 'ratio_of_numeric_values', 
    'number_of_outlier_numeric_values', 'num_filename', 'number_of_tokens_containing_numeric_char', 
    'number_of_numeric_values_equal_-1', 'most_common_numeric_tokens', 'most_common_tokens', 
    'ratio_of_distinct_tokens', 'number_of_missing_values', 
    'number_of_distinct_tokens_split_by_punctuation', 'number_of_distinct_tokens', 
    'ratio_of_missing_values', 'semantic_types', 'number_of_numeric_values_equal_0', 
    'number_of_positive_numeric_values', 'most_common_alphanumeric_tokens', 
    'numeric_char_density', 'ratio_of_distinct_values', 'number_of_negative_numeric_values', 
    'target_values', 'ratio_of_tokens_split_by_punctuation_containing_numeric_char', 
    'ratio_of_values_with_leading_spaces', 'number_of_values_with_trailing_spaces', 
    'ratio_of_values_with_trailing_spaces', 'number_of_numeric_values_equal_1', 
    'natural_language_of_feature', 'most_common_punctuations', 'spearman_correlation_of_features', 
    'number_of_values_with_leading_spaces', 'ratio_of_tokens_containing_numeric_char', 
    'number_of_tokens_split_by_punctuation_containing_numeric_char', 'number_of_numeric_values', 
    'ratio_of_distinct_tokens_split_by_punctuation', 'number_of_values_containing_numeric_char', 
    'most_common_tokens_split_by_punctuation', 'number_of_distinct_values', 
    'pearson_correlation_of_features']

For the specific meaning and data structure of each metafeature, see this page: data_metafeatures
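When choosing a subset, it can help to check the requested names against the computable list before passing them on, since a misspelled metafeature name is easy to miss. The sketch below assumes nothing about the package API: `select_metafeatures` is a hypothetical helper, and the list is shortened here for brevity. How the selection is actually handed to the profiler is shown in example.py.

```python
# Shortened copy of the computable list above, for illustration only.
computable_metafeatures = [
    "ratio_of_missing_values",
    "number_of_missing_values",
    "number_of_distinct_values",
    "ratio_of_distinct_values",
    "pearson_correlation_of_features",
]


def select_metafeatures(requested, computable):
    """Hypothetical helper: validate a user-chosen subset of metafeatures.

    Raises ValueError for names that are not computable; otherwise returns
    the requested names in order, with duplicates removed.
    """
    allowed = set(computable)
    unknown = [name for name in requested if name not in allowed]
    if unknown:
        raise ValueError(f"unknown metafeatures: {unknown}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(requested))


wanted = select_metafeatures(
    ["ratio_of_missing_values", "number_of_distinct_values", "ratio_of_missing_values"],
    computable_metafeatures,
)
print(wanted)
```

A typo such as `'ratio_of_mising_values'` then fails immediately with a `ValueError` instead of silently producing an incomplete profile.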

About

The data profiling TA1 component of DSBox
