How open source is transforming data science

Tuesday 21 May 2019 | AI, Data Insight & Analytics, Marketing Technology

By John Tansley

Machine learning is increasingly part of everyday life, with home assistant boxes recognising our voices, websites pushing us highly customised recommendations, and face recognition at airports. All these developments seem to have come together very quickly, over the space of a couple of years. The rise of these innovations is inextricably linked with the rise of open source approaches. Today, we have access to a powerful range of machine learning and data science tools that we didn’t have a few years back and, intriguingly, most of them are free. How did this happen? Why are things both better and cheaper? In this post, I’ll dig into why I feel open source has enabled and supported this huge growth of powerful new machine learning approaches.


What is open source?

Open source covers any software for which the source code (the programming instructions that make up any piece of software) can be viewed by anyone.

This doesn’t necessarily mean that the software is free of charge, although that is often the case as well. What this openness does do, however, is enable sharing and collaboration in creating and updating the software. It is this collaboration that has led to open source tools becoming part of our daily lives. The internet runs largely on open source Linux servers, while 2 billion Android devices have reshaped the way we communicate. Bitcoin currently holds around £70 billion. To secure this vast amount of money, the core Bitcoin code has to be open source so that it is completely transparent, trusted, and verifiable. The open source approach isn’t limited to software either: in April 2019 Toyota open sourced 24,000 of its patents on hybrid cars to help stimulate further innovation across the industry.


Who contributes to open source?

One major advantage of open source is the huge number of contributors and developers. GitHub, probably the most commonly used platform for sharing open source code, had around 2 million contributors in 2017. Interestingly, 24,000 of those contributors came from large IT companies such as Google, Microsoft, IBM or Amazon. This vast number of contributors ensures that the main open source projects are incredibly well maintained, tested, and updated. Individual contributors tend to be motivated to work on problems in which they have a personal interest. Developers will often create tools that they would like to use themselves, which keeps projects well aligned with developer needs and current gaps in the market.

The rationale behind commercial companies contributing to open source may seem harder to understand. Surely the most commercially pragmatic approach would be to download any code of interest, make changes as needed, and then keep that code closed to maintain a competitive advantage? That view misses the fact that ongoing development by other companies will then pass you by. To get the very latest contributions and updates, it is generally in a company’s interest to submit and share its own improvements. This ensures that the company has access to both its own updates and the most up-to-date enhancements from the community.


Source control underpins open source software

One of the key enablers of open source tools is source control, provided by tools such as Git and hosting platforms such as GitHub. Source control enables many users to contribute to the same code base without causing too much confusion. All changes are traceable and linked to a particular contributor. If you’ve ever ended up with countless slightly different versions of the same presentation on your laptop, you need source control! Source control has fundamentally changed the way software is developed and shared.

One big advantage of source control is the way that it provides a centralised repository of knowledge. Before the widespread use of source control, there was often much duplication of effort in the implementation of new algorithms, often with many separate companies maintaining their own versions. With the advent of open source, we see this happening much less, with commercial software houses tending to increasingly link into open source algorithms rather than maintaining their own versions. This has a major effect on how open source algorithms are updated and improved: far more contributors are working on one core implementation, leading to a much faster innovation cycle. In fact, I would go as far as to say that the main advantage of open source is not being free, but enabling rapid cumulative improvement.

The seemingly unstoppable rise of data science over the last few years has been enabled by this collaborative approach of the open source community. Advanced techniques such as the natural language processing approach that assistants such as Siri or Alexa use to understand us are now taken for granted. It will be very exciting to see what other capabilities we’ll be taking for granted in a few years’ time.


What next?

In my next post, I’ll have a look at how these open source benefits can also be realised within your data science team, by using source control and notebook based techniques to work more effectively and collaboratively.


Find out more

If you’d like to find out more about how Forecaster leverages the power of open source to provide best-of-breed demand forecasting, get in touch.

Our resident data scientist, John Tansley, digs into why open source has enabled and supported a huge growth of new machine learning approaches.
