The Essence of Pandas

steve@thoughtsociety.org Amateur Data Science, Course, Wrangling

           

First Thoughts About Pandas

Everything depends, right? How a dataset is structured determines which explorations make sense for it. Having command of Pandas fundamentals lets you reach into your quiver and pick the arrows you need for the task at hand. There is no single recipe that applies to all datasets.

By most accounts, Pandas is:


A Python library geared towards the manipulation and analysis of data.


Get Acquainted

Pandas has arguably become the most important tool for data science after Python itself. In our quest to become proficient in data science, we need to acquire a knack for handling data within our code.

In case you want to explore the docs first, get the Pandas Cheatsheet and then go to https://pandas.pydata.org/pandas-docs/stable/index.html and start reading. An alternate cheat sheet by minsuk-heo on GitHub is another great collection of fundamental Pandas snippets that will help you practice.

Picture an Excel spreadsheet with column headings across the top and data rows going down the sheet. Someone has taken the time to either enter this data or generate it programmatically. It could be large but ‘flat’, as they say, meaning the data is simply a 2D matrix of rows and columns. Observations are generally in rows and features in columns. Remember this when we are scaling ‘Machine-Learning’ mountain.
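To make the ‘flat’ picture concrete, here is a minimal sketch of a spreadsheet-like table in Pandas. The county names and numbers are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation (a county),
# each column is one feature measured for it.
flat = pd.DataFrame(
    {
        "county": ["Adams", "Baker", "Clark"],
        "population": [1200, 850, 4300],
        "median_age": [38.5, 41.2, 35.0],
    }
)

# shape is (number of observations, number of features)
n_rows, n_cols = flat.shape
print(flat)
print(f"{n_rows} observations x {n_cols} features")

# A single column is a Series; a single row is reached by position with iloc.
populations = flat["population"]
first_row = flat.iloc[0]
```

Think of the DataFrame as the sheet, a Series as one column of it, and iloc as pointing at a row number.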

In our work, data will not always be flat and is often multi-dimensional. There could be multiple indices in a dataframe. There may also be aggregations in place, where some observations have already been reduced to a summary statistic, for example a mean or standard deviation over a time period or a region.
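As a rough sketch of what multiple indices and aggregations look like in practice, the example below indexes some invented temperature readings by station and year, then reduces each group to its mean and standard deviation. The station names, years and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical readings: two stations, each measured twice in two years.
readings = pd.DataFrame(
    {
        "station": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "year": [2014, 2014, 2015, 2015, 2014, 2014, 2015, 2015],
        "temp": [10.0, 12.0, 11.0, 15.0, 20.0, 22.0, 21.0, 25.0],
    }
)

# Two index levels (station, year) instead of one flat row number.
indexed = readings.set_index(["station", "year"])

# Reduce each (station, year) group to its mean and sample standard deviation.
summary = indexed.groupby(level=["station", "year"])["temp"].agg(["mean", "std"])
print(summary)

# Select one group by giving a value for every index level.
a_2015_mean = summary.loc[("A", 2015), "mean"]
```

The result is no longer one row per observation: each row of summary stands for a whole group, which is the kind of pre-aggregated data you will sometimes receive as your starting point.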

Whatever the shape and format of your incoming raw dataset, Pandas is your Swiss army knife for wrangling and re-structuring data for your investigation.

Start Simple

To get acquainted with the fundamental tools within Pandas, start out by working with Minsuk’s Pandas CheatSheet notebook. He takes you through a lot of very basic examples of using Pandas to load data into your notebook, then manipulate and structure it. You can clone his repo to your PC and run the exercises from a Jupyter notebook with ease. I highly recommend you do that before going beyond the basics. Once you clone his repo, you should have the practice data files there as well, in a folder called ‘data’.

Some Pandas Exercises from Minsuk’s Repo

Next Level 

After you are through the practice notebook at least once, we will work with a well-known dataset. I recommend something like the Kaggle Census Demographic Dataset. This has been sliced and diced in many ways by a sizable group of data scientists who have shared their notebooks and plots. This will give you a head start with the exercise so you can compare your results with others. Pretty helpful when you are learning.

After joining Kaggle (which is free), navigate to the link I provided and download the Census dataset to your hard drive (or cloud account). My notebook for working with the Census dataset, ‘The Pandas Play Pen’, is on my GitHub gist repo and will be posted in the Track-2-Section-3 ‘code’ toggle of the Amateur Data Science Course. There is also a video demo of what I cover in this post.

The Reason I Call It A Play Pen

In my own journey, I found that before I knew anything about Pandas, I was thinking like an Excel person. Every operation I wanted to perform on a data file had some equivalent in Excel. But as we know, Excel is overwhelmed by large data files, and working interactively in a Microsoft Office program feels much slower, right? Once you do the same work in memory with Pandas, you see how much time and processing power you save, and things go much smoother. You also don’t have to keep exporting and re-importing the .xlsx or .csv file into your code. Joins, groupby, slicing through a series, or copying a dataframe just aren’t well suited to Excel. With Pandas, you have a sharp set of scalpels combined with multi-dimensional views into very large datasets. Don’t try that with a spreadsheet program. Sounds like fun, right?
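For the Excel-minded, here is a small sketch of the operations just mentioned: a join, a groupby, and a slice with a copy. The two tables and their population figures are invented for illustration, and the Excel comparisons in the comments are loose analogies rather than exact equivalents.

```python
import pandas as pd

# Hypothetical tables: county populations and a state lookup.
counties = pd.DataFrame(
    {
        "state_code": ["OR", "OR", "WA", "WA"],
        "county": ["Baker", "Benton", "Clark", "King"],
        "population": [16000, 90000, 470000, 2100000],
    }
)
states = pd.DataFrame(
    {"state_code": ["OR", "WA"], "state": ["Oregon", "Washington"]}
)

# Join: roughly a VLOOKUP applied to the whole table in one call.
joined = counties.merge(states, on="state_code", how="left")

# Groupby: roughly a pivot table that sums population per state.
per_state = joined.groupby("state")["population"].sum()

# Slice: keep only the rows and columns of interest. The .copy() makes
# `large` independent, so later edits to it do not touch `joined`.
large = joined.loc[joined["population"] > 100_000, ["state", "county"]].copy()

print(per_state)
print(large)
```

None of these steps writes a file or re-imports anything; every intermediate result stays in memory as its own DataFrame or Series.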

So ‘What’ and ‘Why’

With the Census data you have a lot of demographic data that is related to states and counties in the U.S. If you think about it, this can tell you a lot about the U.S. but it can also aid in political polling, demographic analysis and a host of other insights that you can glean from the data.  

Pandas Play Pen

Track 2: Section 3: Pandas includes a Jupyter Notebook: ‘Pandas Play Pen.ipynb’. You can download this file and open it in your local instance of Jupyter. Just don’t forget to edit the census path to point to your copy of the Census dataset. The line below expects the CSV one folder above the notebook (that is what the leading ../ means); if you save it in the same folder as the notebook, drop the ../ prefix:

import pandas as pd

census = pd.read_csv("../acs2015_census_tract_data.csv")
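A wrong path is the most common stumbling block at this step, so here is a minimal sketch of a loader that fails early with a readable message, followed by the first things worth looking at after a load. The helper name load_csv is hypothetical, and the stand-in file and its three columns are made up so the sketch runs anywhere; they are not the real Census header.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early if the path is wrong."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No CSV at {path.resolve()}; edit the path.")
    return pd.read_csv(path)


# Stand-in file with hypothetical columns, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text(
        "State,County,TotalPop\nOregon,Baker,16000\nOregon,Benton,90000\n"
    )
    sample = load_csv(sample_path)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the type Pandas inferred for each column
print(sample.head())  # the first few rows
```

With the real file, you would call load_csv with the same path string you pass to pd.read_csv above.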

 

 
