In my last blog post, Getting on the Right Path for a Machine Learning Career, I talked about the skills, courses and requirements needed to pursue a career in data science. As I mentioned, artificial intelligence (AI) offers many specialized areas, and it's important to select the field that best fits your skills and interests.
Aside from the educational component, a good data scientist needs to be a good storyteller: someone who can uncover and clearly communicate what lies behind petabytes of data, and use it to answer questions that would have been impossible to answer without data science.
To reach this storyteller status, a good data scientist needs to be adept at the key steps of machine learning, including exploratory analysis, data preparation, statistical analysis, programming, algorithm implementation, research, visualization and writing. Sometimes one person handles all of these steps; more often they are shared across a team.
While each step is important to the full machine learning process, data preparation is perhaps the most essential ingredient of an effective algorithm, and also the most time-consuming. In fact, a recent article points to the 80/20 rule, which holds that most data scientists spend only 20 percent of their time on actual data analysis and 80 percent finding, cleaning and reorganizing huge amounts of data.
Recently, as participants in the xView Challenge, Wovenware addressed all of these critical stages of the machine learning life cycle while creating our object detection model based on deep learning. With this event, as with every new project, we learned new things about teamwork, our expertise and how best to leverage it.
Here are the two key steps we took in the xView Challenge project, along with the lessons we learned:
Data Preparation
We spent a good amount of time preparing the dataset before creating experiments, performing tasks such as image chipping, data augmentation and exploratory analysis in order to create clusters of related patterns. Interestingly, most of this work was done by software developers who followed the lead of the data scientists and had little knowledge of the predictive models to be created.
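To make image chipping and augmentation more concrete, here is a minimal sketch of the idea. It is not the code we used in the challenge: the function names `chip_image` and `augment_chip`, the chip size and the choice of a horizontal flip are all assumptions made for illustration. The image is a plain nested list so the example runs without extra libraries; a real pipeline would more likely use NumPy or an imaging library.

```python
def chip_image(image, chip_size, stride=None):
    """Cut a 2D image (list of rows) into square chips.

    Returns a list of ((top, left), chip) pairs. Only full-size chips are
    kept; partial chips at the right and bottom edges are dropped
    (an assumption of this sketch, padding is another common choice).
    """
    stride = stride or chip_size  # default: non-overlapping chips
    height, width = len(image), len(image[0])
    chips = []
    for top in range(0, height - chip_size + 1, stride):
        for left in range(0, width - chip_size + 1, stride):
            chip = [row[left:left + chip_size]
                    for row in image[top:top + chip_size]]
            chips.append(((top, left), chip))
    return chips


def augment_chip(chip):
    """Return the chip plus a horizontally flipped copy (one simple augmentation)."""
    flipped = [row[::-1] for row in chip]
    return [chip, flipped]


# Hypothetical 4 x 6 "image" whose pixel values are just running numbers.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
chips = chip_image(image, chip_size=2)
augmented = [variant for _, chip in chips for variant in augment_chip(chip)]
```

Passing a `stride` smaller than `chip_size` produces overlapping chips, which can help when objects straddle chip boundaries.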
Experiment Creation
Once the data was ready and we had completed the exploratory analysis, it was time to start experimenting. This is another task that takes time, as you iterate over and over with different parameters. In the past, we created four or five experiments and decided on a final approach from the results. For this challenge, we went crazy and defined dozens. So we took a step back and decided to automate our experiment creation, and from this our Wovenware Experiments Factory was born. In a matter of days, we created a beta version that accepts different input parameters, such as architectures, processors, augmentation methods, resolutions and lists of patterns. With it, we can configure our factory in a matter of minutes to run hundreds of experiments by reading recipes from our warehouse, just like a baker working from time-tested recipes kept in a little wooden box.
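The core idea of generating many experiments from a handful of parameters can be sketched in a few lines. This is only an illustration of the concept, not the Experiments Factory itself: the function `build_recipes` and every parameter name and value below are hypothetical.

```python
from itertools import product


def build_recipes(parameter_grid):
    """Expand a dict of parameter lists into one recipe (dict) per combination."""
    names = list(parameter_grid)
    value_lists = [parameter_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical parameter grid; real values would come from the project.
grid = {
    "architecture": ["arch_a", "arch_b"],
    "resolution": [300, 512],
    "augmentation": ["none", "flip", "rotate"],
}

recipes = build_recipes(grid)

for number, recipe in enumerate(recipes, start=1):
    # A real factory would launch a training run here; this sketch only prints.
    print(f"experiment {number}: {recipe}")
```

With two architectures, two resolutions and three augmentation methods, the grid expands to 12 recipes. Adding one more parameter with a few values is enough to reach the dozens or hundreds of experiments mentioned above.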
These two steps in the deep learning flow are examples of how software developers can start contributing to a team of data scientists and incrementally build their skills while studying to master the field.
In a follow-up post, I will talk about the tools and techniques for data cleansing and preparation. Until then, I would love to hear your thoughts and questions.