Open source is at the heart of rapid innovation and technological development. In this article, we discuss how to choose open source machine learning tools for different use cases. Although machine learning is still at an early stage of development, its potential value in healthcare, security, and personalized marketing has led companies to start adopting it.
Why choose a machine learning framework?
The benefits of open source tools are not limited to their usability. Projects of this scale usually attract a large number of data engineers and data scientists who are willing to share data sets and pre-trained models. For example, you can use a classification model already trained on ImageNet data instead of building image recognition from scratch. Open source machine learning tools also let you apply transfer learning, which means solving one machine learning problem by reusing knowledge gained on another. For instance, you can carry over some of the capabilities of a model that has learned to recognize cars to a model built for a related task.
Depending on the problem you are solving, pre-trained models and open datasets may not be as accurate as custom ones, but they spare you from collecting data sets yourself, which can save a lot of time and effort. According to Andrew Ng, former chief scientist at Baidu and a professor at Stanford University, the use of open source models and data sets will be the second biggest driver of business success after supervised learning.
Among the many open source tools available, both the actively developed and the less popular ones, we have selected five to discuss in depth, to help you find a tool that suits you and start exploring data science. Let's get to the point.
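As a rough illustration of transfer learning, the sketch below reuses a frozen feature extractor and trains only a small new classifier on top of it. Everything in it is hypothetical: the values in PRETRAINED_WEIGHTS stand in for a layer trained elsewhere, the data is made up, and the names extract_features and train_head are invented for this example rather than taken from any library.

```python
import math

# Assumption: these weights stand in for a layer learned earlier on a large
# public dataset. They are hard-coded here and never updated (frozen).
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen feature extractor: a linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.5):
    """Train only a new logistic-regression head on the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = extract_features(x)
            pred = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    feats = extract_features(x)
    score = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if score >= 0 else 0


# A small data set for the new task (made-up numbers).
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

head_weights, head_bias = train_head(samples, labels)
predictions = [predict(x, head_weights, head_bias) for x in samples]
print(predictions)
```

Only the head's weights and bias change during training, which is why a handful of samples is enough here. In a real project the frozen part would be a published pre-trained network rather than a two-row weight table.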
1. TensorFlow
TensorFlow was originally designed for internal use at Google. In 2015, it was released as open source under the Apache 2.0 license. Google's reputation and the library's convenient flow graphs for building models have attracted a large group of TensorFlow advocates.
TensorFlow is a great Python tool for deep neural network research and complex mathematical computation, and it even supports reinforcement learning. What sets TensorFlow apart is its data flow graph structure, which consists of nodes (mathematical operations) and edges (numeric arrays, or tensors).
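To make the idea of a data flow graph more concrete, here is a minimal sketch in plain Python. It does not use the TensorFlow API: the Constant and Operation classes are hypothetical stand-ins, and plain lists play the role of tensors. It only shows how operations act as nodes while arrays flow along the edges between them.

```python
class Constant:
    """A source node that feeds a fixed array (here, a plain list) into the graph."""

    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value


class Operation:
    """A node holding a mathematical operation; its inputs are the incoming edges."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def evaluate(self):
        # Pull the arrays flowing along each incoming edge, then apply the op.
        return self.func(*(node.evaluate() for node in self.inputs))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def multiply(a, b):
    return [x * y for x, y in zip(a, b)]


# Build the graph for (a * b) + c without computing anything yet.
a = Constant([1, 2, 3])
b = Constant([4, 5, 6])
c = Constant([10, 10, 10])
graph = Operation(add, Operation(multiply, a, b), c)

# Values only flow through the graph when it is evaluated.
result = graph.evaluate()
print(result)  # [14, 20, 28]
```

Separating the description of a computation from its execution is what lets a graph-based library inspect, optimize, or distribute the work before running it.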
1.1 Data sets and models
TensorFlow is flexible enough to support both research and recurring machine learning tasks. For the former, you can use the low-level API called TensorFlow Core, which gives you full control over your models and lets you train them on your own data sets. There are also public pre-trained models and higher-level APIs built on top of TensorFlow Core. Popular options today include MNIST, a classic dataset for recognizing handwritten digits in images, and Medicare Data, a dataset from Google that helps predict charges for medical services.
1.2 Audience and learning curve
For those exploring machine learning for the first time, the breadth of TensorFlow's functionality can be overwhelming. Some even argue that the library does not flatten the machine learning curve but makes it steeper. TensorFlow is a lower-level library: it takes a lot of code and a good grasp of data science details before you can use it comfortably in project development. So if your data science team is IT-centric, it may not be your best choice. We discuss simpler alternatives below.
1.3 Use cases
Given TensorFlow's complexity, its use cases mainly come from large companies that employ machine learning experts. For example, the UK online supermarket Ocado uses TensorFlow to prioritize incoming messages at its contact center and to improve demand forecasting. Meanwhile, the global insurance company AXA uses the library to predict large-loss car accidents involving its customers.
2. Theano: mature library with extended performance
Theano is a lower-level Python library for scientific computing. In deep learning it is typically used to define, optimize, and evaluate mathematical expressions. Despite its excellent computing performance, its complexity puts off many users. For this reason, Theano is mostly used as the engine underneath higher-level wrappers such as Keras, Lasagne, and Blocks, frameworks aimed at rapid prototyping and model testing.
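The define, optimize, evaluate cycle can be sketched in a few lines of plain Python. This is not Theano's API: expressions are represented as hypothetical nested tuples, and the simplify and evaluate helpers are invented for this example. It only illustrates the general idea of building a symbolic expression first, rewriting it into a cheaper form, and computing a value afterwards.

```python
def var(name):
    return ("var", name)


def const(value):
    return ("const", value)


def add(left, right):
    return ("add", left, right)


def mul(left, right):
    return ("mul", left, right)


def simplify(expr):
    """Optimize: remove multiplications by 1 or 0 and additions of 0."""
    kind = expr[0]
    if kind in ("var", "const"):
        return expr
    left, right = simplify(expr[1]), simplify(expr[2])
    if kind == "mul":
        if left == const(1):
            return right
        if right == const(1):
            return left
        if const(0) in (left, right):
            return const(0)
    if kind == "add":
        if left == const(0):
            return right
        if right == const(0):
            return left
    return (kind, left, right)


def evaluate(expr, env):
    """Evaluate: compute the expression for concrete variable values."""
    kind = expr[0]
    if kind == "var":
        return env[expr[1]]
    if kind == "const":
        return expr[1]
    left, right = evaluate(expr[1], env), evaluate(expr[2], env)
    return left + right if kind == "add" else left * right


# Define: x * 1 + (y * 0 + x * y), built symbolically with no values attached.
x, y = var("x"), var("y")
expression = add(mul(x, const(1)), add(mul(y, const(0)), mul(x, y)))

optimized = simplify(expression)
value = evaluate(optimized, {"x": 3, "y": 4})
print(optimized)
print(value)  # 15
```

The optimized expression has two operations instead of five, yet evaluates to the same number, which is the kind of saving a symbolic library can apply automatically on much larger expressions.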
2.1 Data sets and models
Public models are available for Theano itself, but the higher-level frameworks built on top of it offer a much larger selection of tutorials and training data sets. For example, Keras keeps its available models and detailed tutorials in its documentation.
2.2 Audience and learning curve
If you use Lasagne or Keras as a top-level wrapper, you will have plenty of tutorials and ready-made data sets at hand. In addition, Keras is considered one of the easiest libraries to start with when you first explore deep learning.
Because TensorFlow was designed to replace Theano, Theano has lost many of its fans. However, many data scientists still find enough advantages in it to keep using the older library.
2.3 Use cases
Regarded as an industry standard for deep learning research and development, Theano was originally designed to implement the most cutting-edge deep learning algorithms. You will probably not use Theano directly, but many of its capabilities serve as the foundation for other libraries in tasks such as digit and image recognition, object localization, and even chatbots.
3. Torch: a Lua-based framework backed by Facebook
Torch is often described as the simplest deep learning tool for beginners, because it is built on Lua, a simple scripting language. Although fewer people use Lua than Python, Torch is still widely adopted, with users including Facebook, Google, and Twitter.
3.1 Data sets and models
You can find a list of popular datasets ready to load on its GitHub cheatsheet page. In addition, Facebook has released official code implementing Deep Residual Networks (ResNets), together with pre-trained models that can be fine-tuned on your own data sets.
3.2 Audience and learning curve
Far fewer engineers on the market use Lua than Python. However, Torch's syntax benefits from Lua's readability. Active Torch contributors love Lua, so this is a great choice for beginners and for those who want to expand their toolset.
3.3 Use cases
Facebook used Torch to create DeepText, which classifies the text users post on the site in minutes and enables more personalized content targeting. With Torch's support, Twitter has been able to recommend tweets through an algorithmic timeline (rather than in reverse chronological order).
4. Scikit-learn
Scikit-learn is a high-level framework for supervised and unsupervised machine learning algorithms. As part of the Python ecosystem, it is built on top of the NumPy and SciPy libraries, each of which handles lower-level data science tasks. While NumPy takes care of numerical computation, SciPy contains more specialized numerical routines, such as optimization and interpolation. Scikit-learn, in turn, is used for machine learning itself. The relationship between these three tools and others in the Python ecosystem reflects the different levels of data science: the higher the level, the more specific the problems it can solve.
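A short sketch can show how the three layers divide the work. It assumes NumPy, SciPy, and scikit-learn are installed, and the toy data (points on the line y = 2x + 1) is made up for illustration.

```python
import numpy as np
from scipy import interpolate, optimize
from sklearn.linear_model import LinearRegression

# NumPy: general-purpose numerical work on arrays.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# SciPy: more specialized numerical routines built on NumPy arrays.
spline = interpolate.CubicSpline(x, y)
y_at_1_5 = float(spline(1.5))  # interpolation between the known points
best = optimize.minimize_scalar(lambda v: (v - 2.0) ** 2)  # optimization

# scikit-learn: machine learning estimators on top of both.
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope, intercept = float(model.coef_[0]), float(model.intercept_)

print(y_at_1_5, best.x, slope, intercept)
```

Each step takes the arrays produced by the layer below, so moving from raw numbers to a fitted model requires no conversion code in between.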
4.1 Data sets and models
The library already ships with a few standard data sets for classification and regression, although they are too small to represent real-world situations. Even so, the diabetes dataset for measuring disease progression and the iris plants dataset for pattern recognition are good illustrations of how machine learning algorithms work in scikit-learn. Moreover, the library explains how to load datasets from external sources, includes sample generators for tasks such as multiclass classification and decomposition, and offers advice on using popular datasets.
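The bundled data sets and sample generators mentioned above can be tried out as follows. This is a minimal sketch assuming scikit-learn is installed. The choice of LogisticRegression, the split size, and the generator parameters are arbitrary picks for illustration, not recommendations from the library.

```python
from sklearn.datasets import load_diabetes, load_iris, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Bundled toy data sets: iris for classification, diabetes for regression.
iris = load_iris()
diabetes = load_diabetes()
print(iris.data.shape, diabetes.data.shape)

# Sample generator: a synthetic three-class classification problem.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Fit a classifier on iris and check it on held-out samples.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```

The same fit and score calls work for other scikit-learn estimators, so swapping in a different algorithm usually means changing a single line.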
4.2 Audience and learning curve
Although it is a powerful library, scikit-learn focuses on ease of use and documentation. Non-experts and novice engineers can work with it because it is simple to use, comes with a large number of well-described examples, and lets you apply machine learning algorithms to data quickly. According to reviews from the software shops AWeber and Yhat, scikit-learn is ideal for projects with tight time and staffing constraints.
5. Caffe/Caffe2: Easy to use and with a large number of pre-trained models
While Theano and Torch were designed for research, Caffe is not suited to text, sound, or time series data. Caffe is a special-purpose machine learning library for image classification. Support from Facebook and the recently open-sourced Caffe2 have made the library a popular tool with 248 GitHub contributors.
Caffe was criticized for its slow development, but its successor, Caffe2, addresses the problems of the original technology by adding flexibility, a lighter footprint, and support for mobile deployment.
5.1 Data sets and models
Caffe welcomes data sets from industry and from other users. The team fosters collaboration and links to a large number of popular datasets that have already been used to train Caffe models. The framework's biggest advantage is the Model Zoo, a large collection of pre-trained models created by developers and researchers. You can use these models as they are, combine them, or simply learn from them and train your own.
5.2 Audience and learning curve
The Caffe team claims that you can skip the learning part and start exploring deep learning right away with existing models. The library's target audience is those who want first-hand experience of deep learning and are willing to contribute to the community's development.
5.3 Use cases
With state-of-the-art convolutional neural networks (CNNs), deep learning has been successfully applied to visual image analysis, and even to computer vision for autonomous driving. Caffe helped Facebook develop its real-time video filtering tool, which applies famous artistic styles to videos. Pinterest also uses Caffe to extend its visual search function, allowing users to find specific objects in an image.