AutoML is the automated construction of machine learning models; it simplifies the machine learning process and makes AI technology accessible to more people. Recently, software industry giant Salesforce open-sourced its AutoML library TransmogrifAI. Shubha Nabar, senior director of data science at Salesforce Einstein, wrote about the library, including its workflow and design principles, on Medium.
- GitHub link: https://github.com/salesforce/TransmogrifAI
- TransmogrifAI official website: https://transmogrif.ai/
In the past decade, machine learning has made great strides, yet building a production-ready machine learning system is still difficult. Three years ago, when we started building machine learning capabilities into the Salesforce platform, we found that building enterprise-scale machine learning systems was harder still. To solve the problems we encountered, we built TransmogrifAI, an end-to-end automated machine learning library for structured data. Today, this library helps drive our Einstein AI platform in production. Here, we are delighted to share this project with the open source community, enabling other developers and data scientists to build machine learning solutions quickly and at scale.
When building machine learning capabilities into consumer products, data scientists tend to focus on a handful of well-understood use cases and datasets. In contrast, the diversity of data and use cases in the enterprise makes machine learning for enterprise products a very different challenge. At Salesforce, our customers want to predict a whole range of outcomes: customer churn, sales forecasts, opportunity conversions, email marketing clicks, website purchases, offer acceptance, equipment failures, overdue payments, and more. It is critical for enterprise customers that their data is protected and never shared with other organizations or competitors. This means we have to build customer-specific machine learning models for any given use case. Even if we could build global models, it would make no sense to do so, because every customer's data is unique, with different schemas, different patterns, and different biases introduced by different business processes. For machine learning to truly serve our customers, we must build and deploy thousands of personalized machine learning models, trained on each individual customer's data for each individual use case.
Without hiring an army of data scientists, the only way to achieve this is automation. Today, most AutoML solutions either focus narrowly on a small part of the machine learning workflow or are built for unstructured, homogeneous data such as images, speech, and text. What we needed was a solution that could rapidly produce data-efficient models for heterogeneous structured data at massive scale. In the dictionary, "transmogrification" is the process of transformation, often in a surprising or magical way, and that is exactly what TransmogrifAI does for Salesforce: it enables our data science teams to transform customer data into meaningful, actionable predictions. Today, thousands of customer-specific machine learning models have been deployed on the platform, powering more than 3 billion predictions every day.
Below we introduce TransmogrifAI's workflow, discuss its design choices, and provide links for anyone who wants to use the library or contribute code to it.
Workflow for TransmogrifAI
In general, the research and development work required to build a good machine learning model is substantial. The tedious work of data preparation, feature engineering, and model training is an iterative process that can take data scientists weeks or even months to turn into a production-ready model. TransmogrifAI is a library built in Scala on the SparkML framework that does exactly this. With just a few lines of code, a data scientist can automate data cleansing, feature engineering, and model selection, arrive at a well-performing model, and then explore and iterate further.
TransmogrifAI encapsulates five major components:
The TransmogrifAI workflow
Feature inference: The first step in any machine learning workflow is preparing the data. Data scientists collect all relevant data and flatten, join, and aggregate different data sources to extract raw signals that may have predictive power. The extracted signals are then loaded into a flexible data structure, often called a DataFrame, for further processing downstream. Although these data structures are simple and easy to manipulate, they do not protect data scientists from downstream errors such as incorrect assumptions about data types or null values in the data. As a result, a data scientist may run a workflow all night only to have it fail the next morning because she tried to multiply two strings (an operation not defined for that data type).
In TransmogrifAI, we solve this problem by allowing users to specify a schema for their data and by automatically extracting the raw predictor and response signals as "features". Features are strongly typed, and TransmogrifAI supports a rich, extensible feature type hierarchy. This hierarchy goes beyond primitive types to support more nuanced data types such as geographic locations, phone numbers, and zip codes, types that data scientists want to treat differently. In addition to allowing users to specify data types, TransmogrifAI can also infer them on its own. For example, if it detects that a text feature with low cardinality (few unique values) is actually a categorical feature in disguise, it catalogs it as such and handles it appropriately. Strongly typed features let developers catch most errors at compile time rather than at runtime. They are also the key to the type-specific downstream processing that is common in automated machine learning workflows.
The TransmogrifAI feature type hierarchy
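To make the feature abstraction concrete, here is a minimal sketch of declaring typed features against a record schema, modeled on the patterns in the project's public examples; the `Deal` case class and its fields are hypothetical:

```scala
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

// A hypothetical record schema used throughout these sketches
case class Deal(
  isClosed: Double,
  stage: Option[String],
  contactEmail: Option[String],
  amount: Option[Double]
)

// The response is declared as a non-nullable real number
val isClosed = FeatureBuilder.RealNN[Deal].extract(_.isClosed.toRealNN).asResponse

// Predictors carry specific types: a categorical pick-list, an email, a nullable real
val stage        = FeatureBuilder.PickList[Deal].extract(_.stage.toPickList).asPredictor
val contactEmail = FeatureBuilder.Email[Deal].extract(_.contactEmail.toEmail).asPredictor
val amount       = FeatureBuilder.Real[Deal].extract(_.amount.toReal).asPredictor
```

Because each feature knows its type at compile time, an invalid operation on a column surfaces as a compiler error rather than as a failure hours into a Spark job.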
Transmogrification (automated feature engineering): Although strongly typed features are very helpful for reasoning about the data and minimizing downstream errors, all features ultimately need to be converted into a numerical representation that exposes the regularities in the data so that machine learning algorithms can exploit them. This process is called feature engineering. There are countless ways to transform the feature types shown above, and figuring out the most appropriate transformation is the art of data science.
For example, consider how to convert the name of a US state (CA, NY, TX, and so on) into a number. One option is to map each state's name to a number in the range [1, 50]. The problem with this encoding is that it preserves no information about the geographic proximity of states, yet proximity may be exactly what matters when modeling shopping behavior. Another option is to encode the distance between the center of each state and the center of the United States. This addresses the first problem to a degree, but it still fails to capture whether a state is in the north, south, east, or west of the country. And this was just one simple feature; imagine facing hundreds of such questions! Feature engineering is extremely challenging because no single approach covers every consideration, and successful methods depend heavily on the problem being optimized.
The automatic conversion of dozens of different feature types into numerical vectors is how TransmogrifAI got its name. TransmogrifAI provides default transformation techniques for every feature type it supports, including phone numbers, email addresses, geographic locations, and even text data. These transformations go beyond converting the data into a format algorithms can use: TransmogrifAI also tunes the results so that machine learning algorithms can learn from the data more easily. For example, it converts a numeric feature such as age into the age buckets best suited to a particular problem, since the right age buckets for the fashion industry may differ from those for wealth management.
But even with these measures, feature engineering remains an endless game. So, in addition to providing the default techniques, we have worked hard to make it easy to contribute and share feature engineering code quickly, so that developers can customize and extend the defaults in a reusable way.
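In code, the entire automated feature engineering step is a single call on a collection of features; a minimal sketch, reusing the hypothetical features declared earlier:

```scala
// Route each feature to a type-appropriate default transformation
// (e.g., pivot categorical values, decompose emails, vectorize text)
// and assemble everything into one numeric feature vector
val featureVector = Seq(stage, contactEmail, amount).transmogrify()
```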
Automated feature validation: Feature engineering can cause an explosion in data dimensionality, and high-dimensional data brings many problems of its own. For example, the use of particular fields in the data may change over time, and models trained on those fields may not perform well on new data. Another big (and often overlooked) problem is hindsight bias, or data leakage, which occurs when information that will not actually be available at prediction time "leaks" into the training samples. The result is a model that looks great on paper but is useless in practice. Imagine a dataset of sales deals whose task is to predict which deals are likely to close. Suppose the dataset has a "Closed Deal Amount" field that is only populated after a deal has closed. A blindly applied machine learning algorithm will consider this field highly predictive, since every closed deal has a nonzero "Closed Deal Amount". In practice, however, the field is not populated for deals still in progress, so the model performs poorly on precisely the deals for which predictions actually matter! Hindsight bias is especially problematic at Salesforce, where data is often populated by the unknown automated business processes of many different customers, making it easy for data scientists to confuse cause and effect.
TransmogrifAI has algorithms that perform automated feature validation to remove features with little predictive power: features whose usage changes over time, features that show zero variance, or features whose distribution in the training samples differs significantly from their distribution at prediction time. These algorithms are useful when working with high-dimensional and unfamiliar data that might otherwise be riddled with hindsight bias. They apply a series of statistical tests based on feature types and use feature lineage to detect and eliminate such biases.
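The validation step is likewise a single call against the response feature; a minimal sketch continuing the hypothetical deal example, where a leaky field such as "Closed Deal Amount" would be among the features dropped:

```scala
// Statistically test each element of the feature vector against the label;
// removeBadFeatures = true drops features flagged as leaky, constant, or unreliable
val cleanFeatures = isClosed.sanityCheck(featureVector, removeBadFeatures = true)
```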
Automated model selection: The final step for the data scientist is to apply machine learning algorithms to the prepared data to build a predictive model. There are many different algorithms to try, and each can be fine-tuned to varying degrees; finding the right algorithm and parameter settings is what separates a well-performing model from a poor one.
TransmogrifAI's model selector runs several different machine learning algorithms on the data and uses the average validation error to select the best one automatically. It also handles imbalanced data by sampling appropriately and by calibrating predictions to match the true prior. The performance gap between the best and worst model a data scientist can train on a dataset is often large, so exploring the space of possible models is essential to avoid leaving significant performance on the table.
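As a sketch of what restricting the search space looks like, the call below names two algorithm families explicitly; `withTrainValidationSplit` and the `modelTypesToUse` parameter follow the project's documentation for later releases (the release described in this post returns a (pred, raw, prob) triple instead, as in the full snippet below), so treat the exact names as assumptions:

```scala
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._

// Try only logistic regression and random forests; the selector trains each,
// compares average validation error, and returns the winner's prediction
val prediction = BinaryClassificationModelSelector
  .withTrainValidationSplit(modelTypesToUse = Seq(OpLogisticRegression, OpRandomForestClassifier))
  .setInput(isClosed, cleanFeatures)
  .getOutput()
```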
Hyperparameter optimization: Underlying all of the steps above is a hyperparameter optimization layer. In today's machine learning community, hyperparameters usually mean the adjustable parameters of machine learning algorithms. In reality, though, every one of the steps above has parameters worth tuning. For example, in feature engineering, a data scientist might adjust the number of binary variables derived from a categorical predictor; the sampling rate used to handle imbalanced data is another knob. Tuning all of these parameters is hard work, but it can produce a model whose performance differs dramatically from one generated with arbitrary settings. That is why TransmogrifAI ships with some automatic hyperparameter tuning techniques and a framework for plugging in more advanced ones.
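And when the defaults are not enough, a custom hyperparameter grid can be handed to the selector; a sketch under the assumption that the `modelsAndParameters` form from the project's documentation is available in your version (the class names and grid values here are illustrative):

```scala
import com.salesforce.op.stages.impl.classification.{BinaryClassificationModelSelector, OpLogisticRegression}
import org.apache.spark.ml.tuning.ParamGridBuilder

// One algorithm plus an explicit grid; cross-validation picks the best
// setting by average validation error
val lr = new OpLogisticRegression()
val lrParams = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

val prediction = BinaryClassificationModelSelector
  .withCrossValidation(modelsAndParameters = Seq(lr -> lrParams))
  .setInput(isClosed, cleanFeatures)
  .getOutput()
```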
At Salesforce, such automation has reduced the total time needed to train a model from weeks or months to hours. And the code that encapsulates this complexity is simple: the automated feature engineering, feature validation, and model selection described above take only the following few lines:
```scala
// Read the Deal data
val dealData = DataReaders.Simple.csvCase[Deal](path = pathToData).readDataset().toDF()

// Extract response and predictor features
val (isClosed, predictors) = FeatureBuilder.fromDataFrame[RealNN](dealData, response = "isClosed")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation
val cleanFeatures = isClosed.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val (pred, raw, prob) = BinaryClassificationModelSelector().setInput(isClosed, cleanFeatures).getOutput()

// Setting up the workflow and training the model
val model = new OpWorkflow().setInputDataset(dealData).setResultFeatures(pred).train()
```
Using TransmogrifAI to predict the likelihood of a deal closing
Design Choices
TransmogrifAI is designed to improve the productivity of machine learning developers, not only through machine learning automation, but also through an API that enforces compile-time type safety, modularity, and reusability. Here are some of the notable design choices we made.
Apache Spark: We chose to build TransmogrifAI on the Apache Spark framework for many reasons. First, we need to handle huge variations in data scale. Some of our customers and use cases require models trained over aggregates or joins of massive record sets, while others rely on only a few thousand records. Spark provides primitives for distributed joins and aggregations over big data, which is important to us. Second, we need to serve our machine learning models in both batch and streaming modes. With Spark Streaming, extending TransmogrifAI to both modes is easy. Finally, by building TransmogrifAI on an active open-source library, we can take advantage of the open source community's continuous improvements without reinventing the wheel.
Feature abstraction: SparkML workflows introduce the Transformer and Estimator abstractions for transforming DataFrames. TransmogrifAI is built on these abstractions (the transmogrification, feature validation, and model selection described above are all backed by estimators). In addition, TransmogrifAI introduces the feature abstraction. A feature is essentially a type-safe pointer to a column in a DataFrame, containing all the information about that column: its name, the type of data it holds, and lineage information about how it was produced.
Features then become the main primitive that developers interact with; defining and manipulating features feels more like working with variables in a programming language than with columns in a DataFrame. Features can also be shared, enabling collaboration and reuse among developers. TransmogrifAI additionally makes it easy to define features that are the result of complex time-series aggregations and joins, though that topic deserves a blog post of its own.
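As a small sketch of what this feels like in practice (the feature names are hypothetical), deriving a new feature reads like ordinary variable manipulation; nothing executes until a workflow materializes the transformation DAG:

```scala
// Define a derived feature from an existing Real feature; `logAmount` is
// just a value that can be shared and reused like any other variable
val logAmount = amount.map[Real](_.value.map(math.log).toReal)
```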
Type safety: Features are strongly typed. This allows TransmogrifAI to type-check the entire machine learning workflow and catch errors early, rather than hours into a running job. Type safety also brings other productivity benefits, including letting an intelligent integrated development environment (IDE) suggest code completions. In the screenshot below, you can see all of the transformations available for a numeric feature and pick one of them.
IDE code completion for converting numeric features
Type safety also increases transparency about the expected inputs and outputs at each stage of the machine learning workflow, which greatly reduces the amount of tribal knowledge that inevitably accumulates around any sufficiently complex machine learning workflow.
Finally, feature types are critical for type-specific downstream processing, especially for automated feature engineering and feature validation.
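As a tiny illustration of what this buys you, consider the sketch below, where `description` is a Text feature and `amount` a Real feature (both hypothetical, with the `tokenize` shortcut assumed from the project's documentation); the commented-out line is rejected by the compiler instead of failing hours into a job:

```scala
// Valid: tokenization is defined for Text features
val descriptionTokens = description.tokenize()

// Invalid, and caught at compile time: arithmetic is not defined for Text
// val nonsense = description * amount
```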
Customizable and extensible: While developers can use the automatic estimators to quickly tune model performance, every out-of-the-box estimator is parameterized, and data scientists who want more control over their models can set and tune those parameters directly. In addition, it is easy to specify custom transformers and estimators. Specifying a custom transformer is as simple as defining a lambda expression, and TransmogrifAI takes care of all of the serialization and deserialization boilerplate for you:
```scala
// A custom transformer defined as a lambda: lower-case a Text feature
val lowerCaseText = textFeature.map[Text](_.value.map(_.toLowerCase).toText)
```
Scale and performance: With automated feature engineering, data scientists can easily blow up the feature space, producing data frames that are hard even for Spark to process. TransmogrifAI workflows solve this by inferring the directed acyclic graph (DAG) of transformations needed to materialize the features, and by optimizing the DAG's execution, collapsing all transformations at the same level of the DAG into a single operation. And because TransmogrifAI is built on Spark, it automatically benefits from ongoing optimizations to the underlying Spark DataFrame.
As a result, we can apply automated machine learning techniques to data with millions of rows and hundreds of columns, expanding the feature space to tens of thousands of columns in the process.
TransmogrifAI for Everyone
TransmogrifAI has been transformative for us, enabling our data scientists to deploy thousands of models in production with minimal manual tuning and reducing the average time to train a well-performing model from weeks to hours. This level of automation is essential for us to deliver machine learning for our business, and we believe that every enterprise today has more machine learning use cases than it has data scientists; automation is the key to making machine learning technology broadly accessible.
Salesforce has long been a user of and contributor to Apache Spark, and we are delighted to continue building TransmogrifAI in the open source community. Machine learning has the potential to transform how businesses operate, and we believe the barriers to its adoption can only be lowered through an open exchange of ideas and code. By working in the open, we can bring diverse perspectives together and keep pushing the technology forward so that everyone can use it.
For more information on getting started with TransmogrifAI, check out the project: https://github.com/salesforce/TransmogrifAI.
Original link: https://engineering.salesforce.com/open-sourcing-transmogrifai-4e5d0e098da2