A framework and process for exploring new, unfamiliar data sets, using Python

This is an attempt to formalize the processes I use when becoming familiar with a new data set.

Edwin Fig

22 Jul 2020

Picking up from my other post, "What is Multivariate Data Analysis"...

A quick refresher:

Skewness and Kurtosis

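Skewness is a measure of asymmetry. The closer the value is to zero, the more symmetrical the distribution. A positive value means a longer tail on the right, and a negative value means a longer tail on the left.

This little histogram has a skewness of 1.882876.

Kurtosis is a measure of how heavy the tails of the distribution are, that is, whether the data has many or few outliers. The histogram above has a kurtosis of 6.536282.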

df['COLUMN'].skew()
df['COLUMN'].kurt()
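
To see what those two calls return, here is a minimal sketch on made-up data. The DataFrame, its column names ("skewed" and "symmetric"), and the random seed are all assumptions for illustration, not the data behind the histogram above. One thing worth knowing: pandas' `kurt()` reports excess kurtosis, so a normal distribution scores roughly 0 rather than 3.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one right-skewed (log-normal) column and one
# symmetric (normal) column, generated with a fixed seed.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=10_000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=10_000),
})

# skew() and kurt() also work on a whole DataFrame, one value per column.
summary = pd.DataFrame({
    "skew": df.skew(),
    "kurt": df.kurt(),  # excess kurtosis: a normal distribution is near 0
})
print(summary)
```

The "symmetric" column should land close to 0 on both measures, while the "skewed" column should come out clearly positive on both.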

Standard Deviation

This measures the variation of a set of values.

I never use this...but maybe I should.

A low value indicates less variation: more values sit near the average.

A higher value indicates more variation: values are more dispersed.

I like to think of it as a measure of the spread.
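
A quick sketch of that "spread" idea, using two made-up columns that share the same average but differ in how far the values sit from it. The column names and numbers are assumptions for illustration only. Note that pandas' `std()` computes the sample standard deviation (it divides by n - 1) by default.

```python
import pandas as pd

# Hypothetical data: both columns average 50, but "wide" is far more spread out.
df = pd.DataFrame({
    "tight": [48, 49, 50, 51, 52],
    "wide": [10, 30, 50, 70, 90],
})

means = df.mean()  # the same for both columns
stds = df.std()    # sample standard deviation (ddof=1 by default)

print(pd.DataFrame({"mean": means, "std": stds}))
```

The averages alone cannot tell these two columns apart. The standard deviation can.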

My Data Exploration process using Excel

Excel has a graphical UI. Python is code. In theory, the steps we use to approach data exploration should be similar in both. But there's a mental shift that needs to occur. Some of what is intuitive in Excel (at least after many years of practice) simply doesn't translate directly to Python.

Still, it should help to understand some of the steps I would traditionally take.

More to come...
