Plotting For Data Science

steve@thoughtsociety.org Amateur Data Science, General

Viktor-Forgacs

As humans, we are pretty good at visual comprehension. Many of us can pick out features or highlights from complex visual representations. You know this works, so I won’t try to convince you with the minutiae of visual science. It is worth noting that in many instances a key focal point, or set of points, helps separate the subject from the noise. In data visualization, we want the insight we are sharing to stand out so that observers can see what we saw when we created it.

Very Well Formed Visualizations

Info-Is-Beautiful Runner-Up

One of the runners-up in the ‘Information is Beautiful’ World Data Visualization Prize shows how complexity can be brought to its knees without losing the idea being presented. In the realm of slightly absurd but fun, check out FlowingData’s map of the Best Breweries in the U.S.

Here is a snapshot of that. It is safe to say that the brewery map represents its information well and that we can glean at a glance what its creator meant to show us.

Flowing-Data : Top Brewery States

Information is Beautiful: World’s Biggest Hacks and Breaches

Another example, though significantly more complex, is this interactive viz of the world’s biggest hacks and breaches.

The Great Gapminder Visualization (with Plotly)

This is a world-famous visualization conceived by Hans Rosling (July 1948 – February 2017), which he presented in a seminal TED talk. It shows how you can impart insights by putting highly relevant data in a simple form and letting people interact with it. This one is done with Plotly, but it can also be done with Dash, which we cover in a future course.

The Gapminder visualization below is live. You can grab stuff and play with it.

Simpler Graphs and Plots

If you have focused on learning how to plot data in Python and have advanced through the course thus far, you may be more inclined to take on plots of this nature:

We can build plots like this from garden-variety datasets and visualize them with open-source tools such as the Python Plot.ly API.

Plotting Libraries You Should Get to Know

So far in the Amateur Data Science Course, we have used just one or two plotting libraries: Matplotlib, and Seaborn, which integrates with Matplotlib. In Track-3 : Visualization, we will cover the following libraries:

  • Matplotlib – The go-to library for basic plotting with Python
  • Seaborn – Matplotlib with extensions for statistical plots
  • Plot.ly – Extensively outfitted, open-source plotting API

 

 

What we should hope to learn in Track-3 : Visualization is how to use some of the most widely used plotting libraries in data science. We will see what fits our project needs and look at the pros and cons of each while working with real-world data. In my own experience, it looked daunting at first, but once you get some practice you will see how approachable it is. You can stack knowledge on top of that foundation until it is second nature.

Installation

Matplotlib and Seaborn are included in the Anaconda Distribution. Every other library you will have to install into your environment yourself. The track sections include detailed installation steps, and we can test to make sure the packages went in correctly.

Matplotlib, Seaborn and Plotly will run fine in a Jupyter notebook.  This course centers around Jupyter since it is so easy to share working notebooks online. 
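One quick way to test that the packages landed in your environment is to ask Python which versions it can see. This is a small sketch using only the standard library; the helper name `installed_versions` is my own, and I am assuming the packages are published under the names `matplotlib`, `seaborn` and `plotly`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


report = installed_versions(["matplotlib", "seaborn", "plotly"])
for name, ver in report.items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Run it in a notebook cell. Anything reported as NOT INSTALLED still needs to be added to the environment that your Jupyter kernel is using.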

Self-Learning Visualization

In my experience, once I wanted to write a visualization with Python and knew it was going to use one of the libraries in my list, I searched around and found many Medium posts, Kaggle kernels, GitHub repos, Stack Overflow questions and YouTube videos on that general subject. This helped me concentrate my learning by drawing on the shared knowledge and experience of other developers and data scientists. Sometimes a post was over my head, and other times it was right in my wheelhouse. You will probably run into the same thing, but I found it motivating: what started out seeming like a monumental task became manageable once I saw that I was not alone. This really helps, and I will refer to many of these sources in posts that go along with the course tracks and sections.

Time To Hit It Hard

Well, I never said it would be easy, but if I could do this, so can you. The visualization track is going to be one of the deepest so far, and you can expect to get lost in some of it pretty quickly. Don’t give up. The videos and notebooks will keep you above water and you will eventually swim out on your own.

Watch for references to supporting posts outside the course and prepare to spend time reading other people’s code on the various visualization framework example pages. This is very helpful. Kaggle is great for that since code-side commenting is very liberal. I wouldn’t have gotten very far without cutting and pasting example code into a notebook and trying it out. There is no shame in re-using that code because that is what the creators of these frameworks want us to do. Once you understand it, modify it.  You are essentially paying homage to the framework creators by using the framework.

If you come up with something cool, don’t forget to push it to your GitHub repo. Eventually, someone else is going to be searching for just what you have been working on, and more than likely they will fork and mod it themselves.

Share Please

The Discord channel will be a great place to exchange contextual knowledge about this track with co-learners. I will also lurk there to assist. You will be pleasantly surprised at how well this kind of collaborative environment works when you are slogging through a difficult stage of your learning. You will not be alone. Others may have already solved something you are just now encountering, and you may find yourself doing the same for others. That is the power of open-source collaboration. Where would we be without it?

 

 
