Data – The first challenge

The Premise

The premise of our Amateur Data Science course is to empower professionals, students, and hobbyists to pursue data science at a rapid clip without a hard reset of their career and educational decisions. Finding and curating data is one of the most consequential skills we will need to acquire.

There Is A Lot

When I first began to pursue data science, I was overwhelmed. Even after reading a great many Medium articles, subscribing to data science blogs, and messing around a little with Python, it became apparent that there were a ton of new concepts to digest before I could do even the simplest thing in this field.

What Do I Know

After coming up to speed a little on Python and the supporting libraries that everyone uses, it became obvious that data wrangling was a key part of the practice, and I knew little to nothing about it.

At this point I had a few skills under my belt. My familiarity with dataset acquisition grew as I lurked around Kaggle.com, followed PythonProgramming.net, and generally experimented with getting data from open APIs. This led me to develop useful EDA (Exploratory Data Analysis) skills that I knew in my bones would result in better analysis in the long run.

First Code

It is not hard to code up a pandas_datareader script that can go out and get tables from stock sites. Although nearly featureless, these stock price tables let me experiment with different visualization techniques, run some basic operations on the data, and sharpen my data manipulation prowess.

import pandas as pd
from datetime import datetime
import pandas_datareader.data as web

start = datetime(2015, 1, 1)
end = datetime(2018, 12, 31)

# AAPL is a DataFrame of Apple stock prices
AAPL = web.DataReader("AAPL", "iex", start, end)
AAPL.head()  # Show the first five observations

Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume
Date
2018-03-27	229.90	230.980	212.250	213.80	5164261.0	0.0	1.0	229.90	230.980	212.250	213.80	5164261.0
2018-03-26	218.83	229.150	218.500	228.91	4434467.0	0.0	1.0	218.83	229.150	218.500	228.91	4434467.0
2018-03-23	219.52	222.455	214.780	215.02	4188139.0	0.0	1.0	219.52	222.455	214.780	215.02	4188139.0
2018-03-22	223.86	225.870	220.255	220.52	2931334.0	0.0	1.0	223.86	225.870	220.255	220.52	2931334.0
2018-03-21	228.76	229.250	225.610	226.85	3910324.0	0.0	1.0	228.76	229.250	225.610	226.85	3910324.0
Slice Away

One of the fundamental structures of the Pandas library is the DataFrame. Think of it as a programmable spreadsheet where you can use Python operations to manipulate, rearrange, and run arithmetic on the data, either to change it or to select from it. Often we use ‘slicing’ to cut into parts of the table, tweeze out some specific data from a cell, row, or column, and then do something with it afterwards.

AAPL["Close"][10:20] # Slice out observations 10 through 19
Date
2018-03-13    179.97
2018-03-12    181.72
2018-03-09    179.98
2018-03-08    176.94
2018-03-07    175.03
2018-03-06    176.67
2018-03-05    176.82
2018-03-02    176.21
2018-03-01    175.00
2018-02-28    178.12
Name: Close, dtype: float64
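
Slicing by integer position is only one option. Here is a minimal sketch of the two other selectors you will use constantly, .loc (labels) and .iloc (positions), assuming the AAPL frame from the pandas_datareader example above is still in memory:

# Label-based selection with .loc: one row by its date label, two columns
one_day = AAPL.loc["2018-03-13", ["Open", "Close"]]

# Integer-position selection with .iloc: the first five rows and first two columns
first_rows = AAPL.iloc[:5, :2]

print(one_day)
print(first_rows)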
The Questions

This leaves me with a set of questions for the would-be amateur that I also ask myself before setting out on a new data quest.

  • How do we know which data we need?

Make up a story. Write a paragraph describing what you are trying to solve, and tell it like you are explaining it to a friend. Identify the missing information that, if you had it, would form the bones of the story. When you have a good idea of what you are trying to solve, it gets easier to describe the specific observations (rows of a data table) against a corresponding set of features (columns) that you need to tell the story. This is tricky because you don’t yet know what you are looking for or whether it is actually out there.

Ask yourself whether this data can be acquired. Is it owned by someone else? Can you get it free, or would you have to license it for a fee? Sometimes a good idea may not be feasible because the data is proprietary or too expensive. Other times it is just out in the wild to be used.

Once you have gone through this a few times, you start to become familiar with open-source data that you can fork (borrow, more or less), and in many cases it is already structured in a table or spreadsheet that is easy to ingest into your scripts.

  • Where do we find data?

In many cases, investigators share datasets on GitHub, like the Awesome-Public-Datasets repo. These are easily downloadable to your system and are often in popular formats like CSV (comma-separated values), XLSX, JSON, and XML.
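
Many of these repos expose raw file links that Pandas can read straight off the web, so you don’t even have to download anything by hand. A minimal sketch; the URL below is a placeholder, so substitute the raw link of whichever dataset you actually find:

import pandas as pd

# Placeholder URL - replace with the raw link to a real CSV in the repo you choose
url = "https://raw.githubusercontent.com/some-user/some-repo/master/some_dataset.csv"

df = pd.read_csv(url)  # Pandas fetches the file over HTTP and parses it in one step
print(df.shape)        # quick sanity check: rows x columns
print(df.head())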

Google Public Data has a search feature to find and access publicly available datasets.

U.C. Irvine has a machine-learning dataset repository on their site.

Other places where data is housed and shared include government sites like the Census, FBI, Fed, and World Bank; commercial sites like Yahoo Finance, IEX, Quandl, Yelp, and the NY Times; FiveThirtyEight’s dataset GitHub repo; Kaggle’s datasets; and NGOs like ProPublica.

  • How do we acquire data?

Well, that is what you need to learn. Simply put: by any (legal and ethical) means necessary. That can be just downloading an Excel sheet or JSON file. It can be scraping, the technique of programmatically copying data from web pages and pulling what you need into data structures in your code. Caveat: only if the site doesn’t object.
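
As a rough sketch of what scraping can look like, pandas.read_html pulls every HTML table on a page into a list of DataFrames (it needs lxml or html5lib installed). The URL here is a placeholder; check the site’s terms and robots.txt before pointing this at anything real:

import pandas as pd

# Placeholder URL - point this at a page whose terms allow scraping
page_url = "https://example.com/some-page-with-tables"

# read_html returns a list: one DataFrame per <table> element found on the page
tables = pd.read_html(page_url)
print(f"Found {len(tables)} tables")

first_table = tables[0]  # pick the table you actually want
print(first_table.head())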

Or it can be signing up for an API key with any number of benevolent companies that share their data with developers. Sometimes I just google ‘company-name developers’ and land on their developer portal. Sometimes this is open to non-customers. Facebook, Twitter, and Yelp all have sign-in based open APIs that are free, but you can get ‘premium data’ with some kind of paid account. This GitHub repo from Todd Motto, ‘Public APIs’, will introduce you to a plethora of datasets that can be acquired through API calls.
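
The pattern for a keyed API is usually the same: sign up, keep the key out of your code, and send it with each request. Here is a minimal sketch with a made-up endpoint and parameter names; every real service documents its own in its developer portal:

import os
import requests

# Hypothetical endpoint and parameters - consult the provider's developer docs for the real ones
API_KEY = os.environ.get("MY_API_KEY")  # keep keys in the environment, not in the script
url = "https://api.example.com/v1/businesses/search"
params = {"term": "coffee", "location": "Portland", "api_key": API_KEY}

response = requests.get(url, params=params)
response.raise_for_status()  # fail loudly on a bad status code
data = response.json()       # most APIs hand back JSON
print(type(data))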

IEX – Investors Exchange – has a completely open API with multiple ways to acquire its data. This is the way Yahoo and Google used to offer stock data, but they stopped in 2018. You will find more open APIs out there, but not for the really valuable business data, since its owners find it lucrative to charge for; that data probably took quite a bit of resources to create.

Often, ‘Pythonic’ data retrieval is a good thing because you can code post-processing right after pulling the data into your script. Also, you don’t have to store this data locally; just go out and fetch it when you need it. It really doesn’t matter whether the data is fetched by hand or programmatically; it just depends on the type, the size, and what you want to do with it. If you fetch data files by hand and store them locally, you can use Pandas’ read_csv, read_json, read_excel and friends to ingest them into your script. In this case, you will not have the advantage of fresh data acquired via API calls.
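
For hand-downloaded files, the read_* family maps straight onto the formats mentioned above. A minimal sketch, assuming the files sit next to your script (read_excel additionally needs an engine like openpyxl installed):

import pandas as pd

# Each reader returns a DataFrame; the file names here are placeholders
csv_df = pd.read_csv("my_data.csv")
json_df = pd.read_json("my_data.json")
xlsx_df = pd.read_excel("my_data.xlsx", sheet_name=0)  # first sheet of the workbook

print(csv_df.head())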

Get the Pandas Cheat Sheet. You should be able to make good use of it.

  • How do we structure it for analysis?

At this point there is no generic answer to the question. If you started by identifying what data you need and completed the acquisition, you may be ready to manipulate it for your analysis. We strongly suggest experimenting with Indexing and Selecting Data in the pandas.pydata.org documentation. This will give you the basic tools to navigate a Pandas DataFrame with Python. There have been many stackoverflow.com questions and answers about this, and you will be surprised how logical and straightforward it is. Prepare to slog for this knowledge. It will be worth it.
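
To give a feel for what that documentation covers, here are the selection idioms you will reach for most often, sketched against the AAPL frame from earlier (any DataFrame works the same way):

# Single column -> Series; list of columns -> DataFrame
closes = AAPL["Close"]
ohlc = AAPL[["Open", "High", "Low", "Close"]]

# Boolean mask: keep only the rows that satisfy a condition
big_days = AAPL[AAPL["Volume"] > 4_000_000]

# Combine conditions with & and |, wrapping each side in parentheses
up_big_days = AAPL[(AAPL["Volume"] > 4_000_000) & (AAPL["Close"] > AAPL["Open"])]

print(big_days.shape, up_big_days.shape)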

  • Are the features you will need all there?

If we examine one of the Titanic datasets from Kaggle, we see that there are a number of things we can do with Pandas to bend it to our will. Follow along with the Jupyter notebook embedded here.

In the notebook we saw that there are a number of ways to slice and dice DataFrames with Pandas to get what we want out of them.

Here are some questions I ask after my first EDA:

  • Are there unnecessary features present?
  • Are there empty cells?
  • Is the data wacky? (out of range or outliers)
  • Are columns rightly named?
  • Are there invalid observations? – Possibly incorrect data type or format
  • Are there missing features that can be derived from existing ones?

Asking these questions before re-structuring will give you a simple roadmap to preparing your data for clean analysis. Of course you will have to explore and experiment a little with the data so you can determine what shape it is in.
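
Here is a minimal sketch of how those checks translate into Pandas, using a Titanic-style frame (the column names assume Kaggle’s train.csv; adjust them to whatever your data actually has):

import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle's Titanic training file, downloaded by hand

# Unnecessary features: drop columns you will not use
df = df.drop(columns=["Ticket", "Cabin"])

# Empty cells: count missing values per column
print(df.isnull().sum())

# Wacky data: summary statistics surface out-of-range values and outliers
print(df.describe())

# Rightly named columns: rename anything cryptic
df = df.rename(columns={"SibSp": "SiblingsSpouses", "Parch": "ParentsChildren"})

# Invalid observations: check and fix data types
print(df.dtypes)
df["Survived"] = df["Survived"].astype("category")

# Missing features derived from existing ones
df["FamilySize"] = df["SiblingsSpouses"] + df["ParentsChildren"] + 1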

Pandas provides quite a bit of the blades and bits to cut this up as you see fit. Keep sharp by practicing and have no fear.

This post has an accompanying track in the Amateur Data Science course within this site.  Check it out.
