This is Social Learning
At the outset, we need a basic path that shows where we are and where we need to go. Getting there involves some bootstrapping to build muscle with the tools and techniques we will find ourselves using every day.
Remember, all of this is self-paced. Go at a speed that is comfortable for you. The information is persistent, so you can return to anything at any point. The Slack and Discord rooms will be live with fellow learners, so you can sharpen your newly acquired knowledge or ask questions of the group. There will be YouTube videos to provide working examples and help get things set up.
I use a MacBook Pro. I am sure a lot of you are on Windows. That’s ok. I will try to keep things system agnostic, since I assume you will know how to install things for your environment. There will be instances where things are more complex and very different between the two operating systems. In those cases, I will try to get someone in the group to provide system-specific insights, and these issues can be addressed in the chat channels.
Mapping It Out
I like visuals, so here is a map drawn to show how things relate and connect. We are open to improvements: if you think we left something important out, or you want to suggest a rearrangement, have at it. A complete clickable resource list will be included at the end of the post.
The map shows more than what might be strictly required to reach the end. Start by taking a crash course in Python from my great instructor on Udemy, Jose Portilla. If you stick with it, you can become comfortable with Python in less than 45 days.
Anaconda will keep you from going insane with Python installs, virtual environments, package management and everything else you feared would stop you from going down this path. Anaconda has been steadily improved by a large community of users working hand in hand with its developers.
With Anaconda you also get the absolutely amazing Jupyter notebooks. If data science is your trade, then Jupyter is your blade. I cannot stress enough how great Jupyter notebooks are when you are trying to develop an algorithm and get your code right. You can visualize right in the notebooks and then share them universally online. Jupyter will become your go-to tool for developing working code faster than you can imagine. Much of what you see on Kaggle runs in Jupyter notebooks.
Data wrangling sometimes requires you to go scrape some data from the web. In many cases this is allowed, but if there is an open API, you should go through that when feasible. Scrapy is an open-source Python framework that simplifies crawling and scraping a website. You could of course do it manually with your own primitive functions, but there is a lot of smarts built into this library that make it much easier to pull off.
Once a site is crawled, you can use BeautifulSoup to parse the HTML tags. NumPy is a scientific computing library: things like pi, sqrt, multi-dimensional arrays and other math facilities not intrinsic to Python itself are found within.
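To make that concrete, here is a minimal sketch that parses tags with BeautifulSoup and hands the extracted numbers to NumPy. The HTML snippet and the `price` class are made up for illustration; in practice the HTML would come from your crawler.

```python
import numpy as np
from bs4 import BeautifulSoup

# A hypothetical snippet of crawled HTML.
html = """
<table>
  <tr><td class="price">19.99</td></tr>
  <tr><td class="price">24.50</td></tr>
  <tr><td class="price">5.25</td></tr>
</table>
"""

# Parse the tags and pull out the text of every 'price' cell.
soup = BeautifulSoup(html, "html.parser")
prices = np.array([float(td.get_text())
                   for td in soup.find_all("td", class_="price")])

print(prices.mean())  # NumPy does the math on the extracted array
```

The same pattern scales up: BeautifulSoup isolates the tags you care about, and NumPy takes over once the data is numeric.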
Pandas turns out to be the superpower you want to have if you are dealing with reasonably large datasets. Designed to handle complex data, whether already clean or still to be cleaned, Pandas is an essential tool you will want to master. It can read in a CSV, JSON, XLSX or XML file and then reformat it to suit the task. When you have to find and neutralize null data, Pandas has a function for that. Many in the data science community have become very comfortable with Pandas, and it is indispensable in many analytics projects. In addition, it can perform statistical functions while manipulating data, which is really important. Think of it as a ‘programmatic Excel’.
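As a quick sketch, here is what reading a file and neutralizing nulls look like in Pandas. The CSV content is made up for illustration; `read_csv` works the same way on a file path or URL.

```python
import io
import pandas as pd

# A hypothetical CSV with a missing value, standing in for a real file.
csv_data = io.StringIO("name,score\nAda,91\nGrace,\nEdsger,87\n")

df = pd.read_csv(csv_data)           # same call works on a filename
print(df["score"].isnull().sum())    # count the null entries: 1

# Neutralize the nulls, e.g. by filling with the column mean,
# then compute a statistic on the cleaned column.
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].mean())
```

That read-inspect-fill-summarize loop is the everyday rhythm of Pandas work.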
Eventually, you will need to visualize data in graph form. This is where Matplotlib comes in. It is well understood and documented; you can start plotting minutes after installing it and doing a little reading and experimentation in Jupyter. This is a pretty standard learning path to get to a point where you can gather, clean, arrange, operate on and visualize data. Many newer and more sophisticated plotting packages for Python stand on the shoulders of Matplotlib.
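A minimal Matplotlib sketch, plotting a sine wave and saving it to a file (inside a Jupyter notebook you would use `plt.show()` instead of saving):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Plot a sine wave -- the "hello world" of Matplotlib.
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")
```

A few minutes of experimenting with `ax.plot`, labels and legends really is all it takes to get going.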
Seaborn is one of the packages that builds statistical functionality on top of Matplotlib. In addition, it integrates closely with Pandas. You will find that less and less code is required to achieve a wrangling-processing-visualization pipeline when you use these three tools together.
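Here is a small sketch of that three-tool pipeline with a made-up dataset: Pandas holds the data, Seaborn adds the statistics, and Matplotlib renders underneath.

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import pandas as pd
import seaborn as sns

# A small hypothetical dataset in a Pandas DataFrame.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 55, 61, 70, 74, 83],
})

# One Seaborn call gives a scatter plot plus a fitted regression line.
ax = sns.regplot(x="hours_studied", y="exam_score", data=df)
ax.figure.savefig("regression.png")
```

Doing the same thing in raw Matplotlib would mean computing the regression yourself before plotting it.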
Your soft introduction to machine learning will be with scikit-learn. It is built on NumPy, SciPy and Matplotlib. After just a short amount of experimentation, you will feel like you’re flying. Scikit-learn has all the pieces figured out to perform the classification, regression, clustering and modeling that define ML. Writing these functions yourself would require a basic understanding of linear algebra, but much of that is already done for you in this library. It took a few go-rounds for me to fully grok what was happening, and now I am much more comfortable with the concepts.
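A taste of that flying feeling, using the bundled iris dataset and a logistic-regression classifier (just one of the many estimators scikit-learn offers):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Classic classification exercise: predict iris species from measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)   # the linear algebra is done for you
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Every estimator follows the same fit/predict/score pattern, which is what makes the library so quick to pick up.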
Graphical plotting is an essential skill set we need to express the insights we gain from this work. Standard EDA, or exploratory data analysis, is made much easier when you have some experience with plotting tools. There are numerous plotting libraries out there, but three or four in particular are well adopted, which accounts for the ecosystems around them. I recommend picking one or more of these front-runners and mastering it. Forums are really important with these libraries, and each has a thriving developer community to draw expertise from. Believe me, you will need it.
For instance, Plotly and Bokeh are roughly equivalent interactive libraries, Dash builds on Plotly for making interactive dashboards, and Matplotlib is a basic plotting library that is easy to learn when you are starting out.
About Dash: I spent a fair amount of time learning this library. It started with a short course by Jose on Udemy, but I took it much further on a self-experimentation bent. At the end of the day, I could build Docker-containerized applications, push them to GitHub, and pull them down to an EC2 Linux server in a handful of keystrokes. Very rapid development, and nearly error free as far as environments are concerned. We will go through how all of that works in later posts. I was able to build a multi-graph stock dashboard with interactive magnifiers traversing time-series data. Pretty cool.
Kaggle deserves a big mention here. It is a data science contest site that has morphed into one of the most avid learning communities on the internet. Not much changed after it was acquired by Google, though you do see a lot more funded challenges coming from the big G than before. Kaggle makes intense use of Jupyter notebooks to share ‘kernels’, a team’s or individual’s code expression of how they would solve the data science problem. In addition, Kaggle shares all of the datasets used for these challenges and even has leaderboards to track team competition. It has a nice training section that takes a newbie through the different disciplines required for learning data science. The training modules are rendered through Jupyter notebooks (of course), which makes them easy to follow. I highly recommend, once you become more versed in the concepts and terminology, that you dive into those tracks and try to complete them all. You won’t be disappointed.
TensorFlow is unbelievable! It is a framework for high-performance numerical computing that implements many machine-learning algorithms. From the minds of Google Brain, the architecture supports numerous platforms: CPUs, GPUs and TPUs on desktops, mobile devices and servers. It is especially effective for deep learning. There is a ‘TensorFlow Playground’ on the web which you can try; it explains some of the algorithms interactively.
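A tiny sketch of what TensorFlow looks like at its core: tensors, numerical ops, and the automatic differentiation that powers deep learning. The values here are arbitrary examples.

```python
import tensorflow as tf

# Tensors and a matrix multiply -- the same code runs on CPU, GPU or TPU.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
c = tf.matmul(a, b)          # multiplying by the identity leaves a unchanged

# Automatic differentiation: TensorFlow tracks ops on a "tape"
# and can compute gradients through them.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
g = tape.gradient(y, x)      # dy/dx = 2x

print(c.numpy(), float(g))
```

Gradient tracking like this, applied to millions of parameters, is what training a deep network boils down to.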
Docker is something you will want to learn along your path. It makes it pretty easy to build interactive analytics that you plan to post to the web. If you’re building a ‘one-page app’, or simply a single dashboard or interactive chart, it will most likely need to be hosted on a virtual server someplace in the cloud. That requires 20 or so packages: Python, Flask, Gunicorn, Nginx and more. You also need an SSL cert in many cases. If you have to make changes, do you want to wrestle with a complicated package manager and virtual environment, or would you rather just use a container that holds everything that already works, update what changed, pack it up and push it to GitHub? Once you are even minimally comfortable with Docker, your development life will improve a lot. There will be a separate stop just for Docker in a future post.
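As a hint of what is coming, here is a hypothetical Dockerfile for one of those Gunicorn/Flask dashboard containers. The file names and the `app:server` entry point are illustrative assumptions, not a fixed recipe.

```dockerfile
# Hypothetical Dockerfile for a single Flask/Gunicorn dashboard app.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached between rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # Flask, Gunicorn, Dash, ...

COPY . .

# Gunicorn serves the Flask app object defined in app.py
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:server"]
```

With a file like this, `docker build` and `docker run` replace the whole virtual-environment dance on the server.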
I spent a good deal of time working out how I was going to allow interaction with my Python dashboards and realized that I needed to self-host them in the cloud. Then it was pretty simple to embed them as iFrames in WordPress for full interactivity. I could also have surfaced them by endpoint in a custom web page or application. WordPress won’t allow iFrame embeds from a non-SSL source, so I had to implement SSL certs on my Nginx server just to get started. We will go through how to do that.
The cloud side has the big three: Google Cloud Platform, Microsoft Azure and Amazon Web Services. There is also DigitalOcean. I use AWS because it has a one-year free tier, which is great to get you going. It has a huge catalog of services, and the documentation is great. I spun up a t2.micro Ubuntu Linux instance with 16 GB of EBS, installed Docker, and created containers for Nginx (a proxy with the SSL cert) and as many Gunicorn/Flask containers as I had dashboard applications. These used the virtual network between the containers to implement a web-server-and-proxy combo that lets me access any of these applications by its unique name. Using only a single Route 53 registered domain name, tsworker.com, I was able to route to several containerized Dash apps, each appearing at tsworker.com/app_name. Finally, I needed an SSL cert that automatically renewed every three months without my involvement, and Let’s Encrypt provided that. The renewal process is scripted into the public-facing Nginx container that my domain routes to first.
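To illustrate the routing, here is a hypothetical fragment of the Nginx configuration inside that proxy container. The app names and container hostnames are made-up examples; the certificate paths are the usual Let's Encrypt defaults.

```nginx
# Hypothetical server block routing tsworker.com/app_name
# to per-app Gunicorn containers on the Docker network.
server {
    listen 443 ssl;
    server_name tsworker.com;

    ssl_certificate     /etc/letsencrypt/live/tsworker.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/tsworker.com/privkey.pem;

    location /stocks/ {
        # container names resolve as hostnames on the Docker network
        proxy_pass http://stocks_app:8000/;
    }
    location /weather/ {
        proxy_pass http://weather_app:8000/;
    }
}
```

Adding another dashboard is then just another container plus another `location` block.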
In our ‘Resources’ category, you will find links for all of the tools and services mentioned in this post.
Just one more thing: I strongly recommend using Toby tab manager for Chrome, Firefox and Opera. It allows you to drag open tabs into organized categories and then visually select them. That will also be in the resource page. I would not be able to build my internal knowledge base without some way of quickly accessing tabs. Bookmarking sucks on all browsers and social bookmarking is flaky. This works great and I love it.
ThoughtSociety is the product of advanced collaborative online learning, and I am a true believer. In the past year, I have self-learned data science, Python programming, web technology and cloud development faster than in the previous three years, when I worked in Silicon Valley. I want to share that experience with as many people as are willing. It works, and it is great fun. So let’s get going.
Ciao for now.