
Open Source Tools for Smarter Data Science

Wednesday 29 May 2019

By John Tansley

The advent of open source tools such as source control and notebooks has enabled data scientists to work far more productively. In this post, I’ll cover exactly how they’ve helped us work better, in a more transparent and reproducible way. In the next post, we’ll look at some of the new things these tools have allowed us to do.


A (Mainly) Free and Collaborative World

There is now a huge range of resources and tools readily available online, most of which are free. The main languages used for data science, R and Python, can be downloaded with a couple of clicks, and each has a polished development environment available (RStudio for R, Spyder for Python). These environments make it easy to run interactive and graphical analyses, as well as to develop robust code for production, and they are now at least as fully featured as older closed-source packages such as SAS, MATLAB, or SPSS.

Free training courses are available for all these tools, from specialised providers such as Coursera, Udacity, or DataCamp. Traditional universities are also getting involved: Caltech, for instance, provides its “Learning from Data” lecture course free online. Beyond formal training courses, there is a range of online help providers and communities of users on sites such as Stack Overflow. The large user base of open source tools means these communities are typically very active, and queries are often answered within minutes.

Although not part of the open source world, and generally not free, another key component of today’s data science is the availability of high-performance, scalable cloud platforms such as Amazon’s AWS, Microsoft Azure, and Google Cloud. These platforms make it straightforward to run analyses on powerful hardware, so even highly complex queries can be completed quickly when needed. Because they can spin up powerful machines on demand, you typically pay only for the short time your queries are actually running.

This combination of powerful data science programming languages, accessible training and help, and scalable computation has helped make data science both more powerful and more straightforward to use.


Notebooks for Data Science

One of the most important developments in today’s data science has been the rise of notebooks. Not to be confused with the paper kind, these are powerful interactive tools that let users combine code, documentation, charts, and maps in one place. The best-known examples are probably Jupyter notebooks for Python and R Markdown notebooks for R. By combining the code, documentation, and outputs in a single document, notebooks allow reproducible, understandable, and shareable working. A notebook serves almost as a data pipeline: raw data is fed in at the top, and a nicely formatted output comes out at the end.
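To make this concrete, here is a minimal sketch of what the source of an R Markdown notebook looks like. The file and column names are invented for illustration:

---
title: "Monthly Sales Summary"
output: html_document
---

This report is generated directly from the raw data.

```{r load-data}
# Read the raw data (file name is hypothetical)
sales <- read.csv("sales_2019.csv")
```

```{r monthly-chart}
# Chart the monthly totals alongside the narrative text
monthly <- tapply(sales$amount, sales$month, sum)
plot(monthly, type = "b", xlab = "Month", ylab = "Total sales")
```

Rendering this one file produces a finished HTML report, with the code, commentary, and chart all kept together.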

Ten years ago, a typical analytics project at CACI would have been built in SAS, using low-level code, with the outputs generally presented as PowerPoint, Word, or Excel documents. Typical projects included direct mail response models, customer churn models, and customer segmentations. They would have been run offline, as one-off processes, often involving several SAS scripts. As well as being time consuming, this way of working is quite brittle: more often than I’d have liked, I’ve been in the final stages of a project when some issue was discovered in the raw data, meaning all the previous analysis had to be reworked.

Notebooks provide a far more repeatable way of working. I’ve surprised clients recently by being relatively unfazed when they have spotted issues in the raw data they initially provided: working with notebooks, I’ve known that all I needed to do was feed the corrected raw data into my analysis notebook and re-run it. Notebooks can also be used to create the final client deliverables, with nicely presented web pages or HTML documents generated automatically from the raw data. And they can easily be run on platforms such as Amazon’s SageMaker, meaning they can scale up to huge data volumes or complex processing when needed.
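As an illustration, R Markdown notebooks can take parameters, so re-running an entire analysis against corrected data can be a single command. This sketch assumes the notebook declares a data_path parameter in its YAML header; the file names are hypothetical:

# Re-render the whole analysis against the corrected raw data
# (assumes analysis.Rmd declares a 'data_path' parameter;
#  file names are hypothetical)
rmarkdown::render(
  "analysis.Rmd",
  params = list(data_path = "corrected_raw_data.csv"),
  output_file = "client_report.html"
)

One command regenerates the full client deliverable from the new data.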


Source Control for Data Science

The use of source control tools such as GitHub has also improved ways of working within data science teams. In a previous post, I touched on how these tools have enabled the rise of open source software, by letting many users share and contribute to the same code base. This approach avoids duplicated effort and re-invented wheels, and leads to far more robust and maintainable code, as any user can test and propose improvements. The same benefits can be realised within an organisation’s data science team: by sharing code through source control, processes can be standardised and changes can be easily tracked and shared, supporting best practice across the team. As a result, source control enables rapid, cumulative improvement within a data science team, just as it has in the wider open source software space.
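As one sketch of what this looks like in practice, a data scientist can share a change through git directly from R using the gert package (one option among several; the repository URL and file names here are hypothetical):

# Share a model change with the team via source control,
# using the gert package (URL and file names are hypothetical)
library(gert)

git_clone("https://github.com/example-org/ds-models.git", path = "ds-models")
setwd("ds-models")

# ...edit churn_model.R, then record and share the change...
git_add("churn_model.R")
git_commit("Add extra input variables to churn model")
git_push()

The commit history then gives the whole team a shared, auditable record of how the model evolved.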


Porting Models to Open Source Platforms

At CACI, our data scientists regularly help clients move models or algorithms developed in traditional software environments to open source. For example, last year a large media company wanted a better process for their churn modelling. The existing process took a day to run in Excel, and its manual nature meant errors often crept in, leading to a certain lack of trust in the whole process. CACI demonstrated improved accuracy over the Excel approach using open source R models, and automating the data feeds and applying machine learning reduced processing time to one minute! This kind of approach frees analysts from tedious manual work and lets them concentrate on understanding and enhancing the models, for instance by sourcing additional input variables.
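The details of that engagement aren’t shown here, but to give a flavour, a simple churn model in R might look something like this sketch. The file and column names are invented for illustration:

# A minimal churn model sketch: logistic regression on customer
# history (file and column names are hypothetical)
customers <- read.csv("customer_history.csv")

churn_model <- glm(churned ~ tenure_months + monthly_spend + complaints,
                   data = customers, family = binomial)

# Score every customer with a predicted churn probability
customers$churn_prob <- predict(churn_model, type = "response")

# Inspect the customers most at risk
head(customers[order(-customers$churn_prob), ])

Because the whole process is code, it can be re-run automatically whenever fresh data arrives, with no manual spreadsheet steps.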


What Next?

To get started with open source modelling tools yourself, the easiest approach is probably R in the RStudio environment: download both for free, then learn the basics using any of the many free training courses. Set yourself the goal of automating a simple, useful business task (this may be plain data processing, not necessarily data science). If you’d like to find out more, contact us to hear about our experiences. We can also help with best practice, as well as with porting models from older systems such as SAS into R or Python.
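To give a flavour of what that first goal might look like, here is a small sketch that reads a file, summarises it, and writes the result back out. The file and column names are invented for illustration:

# A simple automated task: summarise weekly figures by region
# (file and column names are hypothetical)
library(dplyr)

weekly <- read.csv("weekly_figures.csv")

summary_table <- weekly %>%
  group_by(region) %>%
  summarise(total = sum(value), average = mean(value))

write.csv(summary_table, "weekly_summary.csv", row.names = FALSE)

Once a script like this works, it can be re-run on every new file in seconds, which is exactly the kind of quick win that builds confidence with these tools.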
