Introduction
Data products have always been an important part of Airbnb's service. For example, personalized search ranking makes it easier for guests to find homes they love, and smart pricing helps hosts set more competitive prices. However, we have long recognized that these data products are expensive to build: data scientists and engineers must invest a great deal of time and effort in each one.
Recently, Airbnb's machine learning infrastructure has improved, greatly lowering the cost of deploying new machine learning models to production. For example, our ML Infra team built a general-purpose feature repository that lets users apply high-quality, vetted, reusable features in their models. Data scientists have also begun incorporating automated machine learning (AutoML) tools into their workflows to speed up model selection and raise the performance bar. In addition, ML Infra created a new framework that automatically translates Jupyter notebooks into a format the Airflow pipeline accepts.
In this article, I'll show how these tools work together to speed up modeling and thereby lower the overall cost of developing an LTV model that predicts the lifetime value of Airbnb listings.
What is LTV?
LTV stands for Customer Lifetime Value, a popular concept in e-commerce and marketplace companies. It is the value a user is expected to bring to the company over some future time horizon, usually measured in US dollars.
At companies like Spotify or Netflix, LTV is often used to set product pricing (such as subscription fees). At a marketplace company like Airbnb, knowing users' LTV helps us allocate budget across marketing channels more effectively, compute more precise online bids for keyword advertising, and build better listing segments.
We can compute historical LTV from past data, and we can go further and use machine learning to predict the LTV of newly created listings.
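Before turning to prediction, it helps to see what the historical calculation looks like. The sketch below is a minimal illustration, not Airbnb's actual pipeline; the table and column names are invented for the example. It aggregates each listing's past booking revenue over a fixed window with pandas:

import pandas as pd

# Hypothetical bookings table: one row per booking, with the revenue earned.
bookings = pd.DataFrame({
    'id_listing': [1, 1, 2, 2, 2],
    'ds': pd.to_datetime(['2018-01-05', '2018-03-20', '2018-02-11',
                          '2018-04-02', '2018-05-30']),
    'revenue_usd': [120.0, 95.0, 200.0, 180.0, 210.0],
})

# Historical LTV over a fixed horizon: sum of revenue per listing in the window.
window = bookings['ds'].between('2018-01-01', '2018-06-30')
historical_ltv = (
    bookings[window]
    .groupby('id_listing')['revenue_usd']
    .sum()
    .rename('ltv_usd')
)
print(historical_ltv)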
Machine learning workflow for LTV models
Data scientists are usually well versed in machine learning tasks such as feature engineering, prototyping, and model selection. Putting a model prototype into production, however, requires a set of data engineering skills they may be less familiar with.
Fortunately, we have machine learning tools at Airbnb that separate the production deployment workflow from the analytical work of building models. Without these tools, moving a model to production would be far harder. The four topics below introduce our workflow and the tools we use:
Feature engineering: define relevant features
Prototype design and training: train a model prototype
Model selection and validation: select and tune a model
Production deployment: put the selected model prototype into production
Feature engineering
Tool used: Zipline, Airbnb's internal feature repository
The first step in any supervised learning project is to find relevant features that affect the outcome, a process called feature engineering. For example, when predicting LTV, a feature might be the percentage of the next 180 calendar days a listing is available, or the listing's price relative to comparable listings in the same market.
At Airbnb, feature engineering often meant writing Hive queries from scratch to create features. This work is tedious and time consuming, and because it requires domain knowledge and business logic, the resulting feature pipelines are hard to share or reuse. To make this work more scalable, we developed Zipline, a training feature repository that provides features at different levels of granularity (such as host, guest, listing, and market level).
The "multi-source sharing" feature of this internal tool allows data scientists to find a large number of high-quality, reviewed features in past projects. If you don't find the feature you want to extract, you can also write a configuration file to create the features you want:
source: {
  type: hive
  query: """
    SELECT
        id_listing as listing
      , dim_city as city
      , dim_country as country
      , dim_is_active as is_active
      , CONCAT(ds, ' 23:59:59.999') as ts
    FROM
      core_data.dim_listings
    WHERE
      ds BETWEEN '{{ start_date }}' AND '{{ end_date }}'
  """
  dependencies: [core_data.dim_listings]
  is_snapshot: true
  start_date: 2010-01-01
}

features: {
  city: "City in which the listing is located."
  country: "Country in which the listing is located."
  is_active: "If the listing is active as of the date partition."
}
When building a training set, Zipline finds the features the set needs, automatically performs key joins, and backfills the training data. When constructing the listing LTV model, we used some features that already existed in Zipline and wrote a few of our own. In total the model uses more than 150 features, including:
Location: country, market, neighborhood, and other geographic features
Price: nightly rate, cleaning fee, price difference relative to similar listings
Availability: total nights available, and the percentage of nights manually blocked by the host
Bookings: number of reservations and nights booked in the past X days
Quality: review scores, number of reviews, and amenities
Above: an example training dataset
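Zipline's join-and-backfill step is internal to Airbnb, but the idea is easy to illustrate. The sketch below is a minimal stand-in, assuming two invented feature tables keyed by listing ID that are merged into a single training frame with pandas:

import pandas as pd

# Hypothetical per-listing feature tables at different granularities.
location_features = pd.DataFrame({
    'listing': [1, 2, 3],
    'city': ['Paris', 'Tokyo', 'Austin'],
    'country': ['FR', 'JP', 'US'],
})
price_features = pd.DataFrame({
    'listing': [1, 2],              # listing 3 has no price features yet
    'nightly_rate': [150.0, 90.0],
})

# Key join on listing ID; missing rows are backfilled with a default value.
training_set = (
    location_features
    .merge(price_features, on='listing', how='left')
    .fillna({'nightly_rate': 0.0})
)
print(training_set)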
After defining the features and output variables, we can train the model based on our historical data.
Prototype design and training
Tool used: scikit-learn, a Python machine learning library
Taking the training set above as an example, we need to preprocess the data before training:
Data imputation: check whether data is missing and whether it is missing at random. If not, investigate the root cause; if it is missing at random, fill in the missing values.
Encoding categorical features: we usually can't feed raw categories to the model, because the model can't fit strings. When the number of categories is small, one-hot encoding works well. When the number of categories is large, we use ordinal encoding, encoding by each category's frequency count. A minimal sketch of both steps appears below.
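The sketch uses plain pandas and scikit-learn rather than our internal code, and the column names are invented for the example:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'nightly_rate': [150.0, None, 90.0, 120.0],              # numeric, one missing
    'room_type': ['entire', 'private', 'entire', 'shared'],  # few categories
    'city': ['Paris', 'Tokyo', 'Paris', 'Austin'],           # many categories in practice
})

# Impute missing numeric values with the column mean.
df[['nightly_rate']] = SimpleImputer(strategy='mean').fit_transform(df[['nightly_rate']])

# One-hot encode the low-cardinality column.
df = pd.get_dummies(df, columns=['room_type'])

# Ordinal-encode the high-cardinality column by frequency rank.
freq_rank = df['city'].value_counts().rank(method='first', ascending=False)
df['city'] = df['city'].map(freq_rank)
print(df)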
At this step we don't yet know the most effective set of features, so it's important to write code we can iterate on quickly. Pipeline constructs, available in open-source tools such as Scikit-Learn and Spark, are very handy for prototyping. Pipelines let data scientists design blueprints that specify how features are transformed and which model is trained. To be more concrete, here is the pipeline of our LTV model:
transforms = []

transforms.append(
    ('select_binary', ColumnSelector(features=binary))
)

transforms.append(
    ('numeric', ExtendedPipeline([
        ('select', ColumnSelector(features=numeric)),
        ('impute', Imputer(missing_values='NaN', strategy='mean', axis=0)),
    ]))
)

for field in categorical:
    transforms.append(
        (field, ExtendedPipeline([
            ('select', ColumnSelector(features=[field])),
            ('encode', OrdinalEncoder(min_support=10))
        ]))
    )

features = FeatureUnion(transforms)
At a high level, we use the pipeline to specify how data is transformed for each feature type (binary, categorical, numeric, and so on). FeatureUnion then simply combines the feature columns to form the final training set.
The advantage of prototyping with pipelines is that they describe all data conversions as data transforms, sparing us tedious ad hoc conversion code. Collectively, these transforms ensure data is shaped consistently at training and evaluation time, which avoids data inconsistency when the prototype is deployed to production.
In addition, pipelines separate data transformation from model fitting. Although not shown in the code above, data scientists can add a final step specifying an estimator for training the model. By trying out different estimators, they can pick the best-performing one and reduce the model's out-of-sample error.
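For example, an estimator can be appended as the final pipeline step. This is a minimal sketch with plain scikit-learn, assuming the features union from the snippet above and training data X_train, y_train; it is not our exact production code:

from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Attach an estimator as the final pipeline step; swapping it out
# (e.g. for Ridge or a decision tree) requires changing only this line.
model = Pipeline([
    ('features', features),        # the FeatureUnion defined above
    ('regressor', XGBRegressor()),
])

model.fit(X_train, y_train)        # transforms the features, then fits the regressor
predictions = model.predict(X_test)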
Model selection and validation
Tools used: various automated machine learning frameworks
As mentioned in the previous section, we need to decide which candidate model is best suited for production. To make that decision, we weigh the model's interpretability against its complexity. For example, a sparse linear model is easy to interpret, but not complex enough to fit the data well. A sufficiently complex tree model can fit all sorts of nonlinear patterns, but is hard to interpret. This tension is known as the bias-variance trade-off.
The figure above is from James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning with Applications in R.
In applications such as insurance and credit review, models need to be interpretable, because it's important to avoid unintentionally excluding legitimate customers. In applications such as image classification, however, model performance matters more than interpretability.
Since model selection is quite time consuming, we use a variety of automated machine learning tools to speed up this step, exploring a large number of models to surface the best performer. For example, we found that XGBoost performed significantly better than benchmark models such as a mean-response model, ridge regression, and a single decision tree.
Above: we can compare RMSE to choose the best-performing model
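The comparison behind a chart like this takes only a few lines of scikit-learn. A minimal sketch, assuming a prepared feature matrix X and target y (names invented for this example):

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

candidates = {
    'mean response': DummyRegressor(strategy='mean'),
    'ridge regression': Ridge(),
    'decision tree': DecisionTreeRegressor(),
    'xgboost': XGBRegressor(),
}

# Cross-validated RMSE for each candidate; lower is better.
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y,
                             scoring='neg_root_mean_squared_error', cv=5)
    print(f'{name}: RMSE = {-scores.mean():.2f}')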
Since our goal was to predict the value of listings, and we prioritized model flexibility over interpretability, we were comfortable using the XGBoost model in the final production environment.
Production deployment
Tool used: ML Automator, Airbnb's own notebook-translation framework
As mentioned at the outset, building a production workflow is completely different from building a prototype on a laptop. For example, how do we retrain the model periodically? How do we score a large number of instances efficiently? How do we build a pipeline to monitor model performance over time?
At Airbnb, we developed a framework called ML Automator that automatically converts Jupyter notebooks into Airflow machine learning pipelines. The framework is designed for data scientists who are familiar with prototyping in Python but lack experience putting models into production.
Overview of the ML Automator framework (image source: Aaron Keys)
First, the framework requires the user to specify a model configuration in the notebook. The configuration tells the framework how to locate the training data table, how much computing capacity to allocate for training, and how to compute model evaluation scores.
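A configuration might look something like the following. This is purely hypothetical: ML Automator is internal to Airbnb, and these keys and values are invented to show the kind of information the configuration carries:

# Hypothetical ML Automator configuration (keys invented for illustration).
config = {
    'training_table': 'core_data.listing_ltv_training',  # where to find training data
    'scoring_table': 'core_data.listing_ltv_scoring',    # instances to score daily
    'resources': {'executors': 50, 'memory_gb': 16},     # compute to allocate
    'evaluation': {'metric': 'rmse', 'holdout_fraction': 0.2},
}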
In addition, data scientists write their own fit and transform functions. The fit function specifies how to train the model, and the transform function is wrapped in a Python UDF for distributed scoring (when needed).
The code snippet below shows the fit and transform functions in our LTV model. The fit function tells the framework to train an XGBoost model, and the transformer converts the data according to the pipeline we defined earlier.
def fit(X_train, y_train):
    import multiprocessing
    from ml_helpers.sklearn_extensions import DenseMatrixConverter
    from ml_helpers.data import split_records
    from xgboost import XGBRegressor

    global model
    model = {}

    # Fit the transformations on a subset of the training examples.
    n_subset = N_EXAMPLES
    X_subset = {k: v[:n_subset] for k, v in X_train.items()}
    model['transformations'] = ExtendedPipeline([
        ('features', features),
        ('densify', DenseMatrixConverter()),
    ]).fit(X_subset)

    # Apply the transformations in parallel.
    Xt = model['transformations'].transform_parallel(X_train)

    # Fit the model in parallel.
    model['regressor'] = XGBRegressor().fit(Xt, y_train)

def transform(X):
    # Return a dictionary of scores.
    global model
    Xt = model['transformations'].transform(X)
    return {'score': model['regressor'].predict(Xt)}
Once the notebook is complete, ML Automator wraps the trained model in a Python UDF and creates an Airflow pipeline like the one below. Data engineering tasks such as data serialization, periodic retraining, and distributed scoring are folded into daily batch jobs. As a result, the framework significantly lowers the cost for data scientists of putting models into production, as if a dedicated data engineer were working alongside them!
The Airflow DAG for our LTV model, running in production
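For readers unfamiliar with Airflow, the generated pipeline boils down to a DAG of scheduled tasks. The sketch below is a generic, simplified stand-in for what ML Automator emits; the task names and stub callables are invented for illustration:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model():
    ...  # call the user's fit() on the training table

def score_listings():
    ...  # apply the user's transform() UDF to new instances

def publish_scores():
    ...  # write scores to the serving table

# Daily batch pipeline: retrain, then score, then publish.
with DAG('listing_ltv_pipeline',
         start_date=datetime(2018, 1, 1),
         schedule_interval='@daily') as dag:
    train = PythonOperator(task_id='train_model', python_callable=train_model)
    score = PythonOperator(task_id='score_listings', python_callable=score_listings)
    publish = PythonOperator(task_id='publish_scores', python_callable=publish_scores)
    train >> score >> publish  # run in sequence each day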
Note: beyond model production, there are other topics, such as tracking model performance over time and modeling on elastic compute environments, that we haven't covered in this article. They are all active areas of development.
Experience and outlook
Over the past few months, our data scientists have worked closely with ML Infra and produced many good models and ideas. We believe these tools will open up a new paradigm for developing machine learning models at Airbnb.
First, model development cost is significantly lower: by combining independent tools that each do one thing well (Zipline for feature engineering, Pipeline for model prototyping, AutoML for model selection and validation, and finally ML Automator for productionization), we have greatly shortened the model development cycle.
Second, the notebook-driven design lowers the barrier to entry: data scientists unfamiliar with the framework get immediate access to many real-world examples. Notebooks used in production are guaranteed to be correct, self-documenting, and up to date. This design pattern has been well received by new users.
As a result, teams are more willing to invest in machine learning product ideas: at the time of this writing, several other teams are exploring similar approaches, such as inspecting listing quality, predicting whether a listing will add a co-host, and automatically flagging low-quality listings.
We are extremely excited about the future of this framework and the new paradigm it brings. By closing the gap between prototype and production, we enable data scientists and data engineers to pursue end-to-end machine learning projects and make our products better.