First, let me explain my intentions. Machine learning is hot and impressive these days — that much I don't need to repeat. But because of the peculiarities of machine learning systems, building one that is reliable and useful is not easy. Every time I see an excellent talk from peers, I wonder how those intricate systems were built. What was the process? Were there pitfalls behind it? Lessons learned? Can any of it be "borrowed" for reference?
So I want to give a more focused, "process-oriented" talk: to share some of the practices we used in building our system, some of the pits we fell into, and how we climbed out of them.
In addition, this talk focuses on the small team: partly because the team that built our ML system was, and is, quite small, and partly because, as I understand it, not every company has a lineup as deep as BAT's. The experience and practices of a small team may therefore offer a distinctive point of reference, and I hope this talk provides one from a somewhat different angle.
Today's practices come from the machine learning team of Dangdang's recommendation group, back when it was a small team.
Our team is responsible for building, tuning, maintaining and improving Dangdang's recommendation/advertising machine learning system from scratch. Apart from the computing platform, which other teams maintain, we are responsible for every part of the ML pipeline. The production models are used to rank several recommendation modules and some ad modules.
Before the talk begins, let me clarify its scope. As the slide shows, this talk does not cover the topics listed there; students who need them can refer to other excellent material on CSDN.
These are the topics the talk does cover, and "process" is the focus this time. The intended audience is positioned as follows:
Whatever stage of building a machine learning system you are at, if you gain something, or even just some inspiration, from this talk, then this author — still suffering from post-holiday syndrome — will not have made these slides in vain...
This is the outline for today: first some brief thoughts of mine on small teams, then the bulk of the time on the machine learning practices of Dangdang's recommendation team. After that I will summarize some of the pits we stepped in along the way and what we learned from them. Then, using a couple of references as examples, I will look ahead to future work and possible directions. Finally, Q&A.
a brief discussion of small teams
First, my understanding of small teams.
Why do small teams exist? At first glance the question sounds silly, since every team grows from small to large. True — but machine learning teams have some characteristics of their own.
- In contrast to functional systems, one characteristic of machine learning systems is uncertainty : how much lift the system will ultimately deliver cannot be quantified at the outset. This makes decision-makers more cautious with their investment, unwilling to commit too many people at the start.
- Talent in this field really is scarce , and recruiting is hard. Plenty of résumés look good; very few candidates have real ability or experience. On the principle of "better fewer, but better", a small, refined team is the better choice.
What challenges does a small team face in building such a system? That is the first question we care about. The essence of the small-team challenge comes down to two words: fewer people. From this fundamental limitation a number of specific challenges follow.
- The first is the high demand on individual ability . This is easy to understand: fewer people means everyone has to play a big role, so the bar is relatively high. There are not many good solutions to this; mainly external recruiting and internal training.
- Second, during development each of us generally has to take responsibility across multiple tasks , which challenges not only individual ability but also the ability to collaborate. On the other hand, this is the best possible training for employees, and it lets everyone grow at top speed.
- Next is the choice of direction and requirements . With few people, you must be very careful deciding what to do next, and minimize unproductive effort. This is sometimes a constraint, but it also "pushes" us to focus on what matters most — to use the good steel on the blade's edge.
- Finally, single points of failure carry higher risk . Because everyone is responsible for more parts, any change — vacations, resignations — has a big impact on the system. Again this is mostly addressed by internal training and external recruiting, but another way is to retain people with challenging work. Which method works depends on the specific situation.
So the challenges for a small team are not small; but in turn, small teams have some unique advantages.
- The first is cohesion . This is the natural advantage of any small team.
- The second is ease of collaboration . Many things don't need a meeting; turn around, exchange a few words, and it's settled.
- Next is iteration speed . Since each task involves only a few responsible people and little resource coordination, as long as those few push hard, iteration is fast.
- The last and most important point is team growth . Because everyone is responsible for many things, growth is naturally fast and the personal sense of accomplishment is high; managed well, this keeps the whole team in a very energetic, positive state.
machine learning practices of the Dangdang recommendation team
Now let's spend most of our time on how Dangdang's machine learning team got this boulder of a machine learning system moving.
Shown here is the overall architecture of Dangdang's recommendation backend. As the schematic shows, the machine learning system exists as a subsystem and interacts directly with the recommendation job platform (the offline job platform that generates recommendation results).
These boxes are only meant to show where the machine learning system sits in the overall recommendation system and what role it plays; they are not the focus of this talk and need not be understood in detail.
This page magnifies the red-boxed part of the previous page. You can see that the machine learning system plays the role of ranking results. I won't expand on the details of this architecture here; interested students can refer to an earlier talk by a colleague in our group, which covers a similar structure.
The diagram above further expands the red-boxed part of the previous schematic; it is an architectural sketch of the machine learning system itself. Experienced students will recognize that it includes the main components of a machine learning pipeline.
Now, how was the system built, and what did the process look like? The initial stage of the system is an exploration phase . The point of this phase is to figure out whether your problem is one that actually suits machine learning techniques.
Machine learning is powerful but not omnipotent. In areas that depend on strong human priors it may not be the most suitable solution, and it may be especially unsuitable for bootstrapping a system. At this stage, the tools we used were R and Python.
In the figure on the right of this page, the red-boxed parts can be handled with R, the blue-boxed parts are better suited to Python, and the green-boxed parts need both.
Why choose R and Python?
First, R.
- R is versatile; it is the Swiss Army knife of the data science community.
- R has been popular for many years, so it is a mature tool, and when you hit a problem it is easy to find a solution.
- At the time (2013), sklearn and the like were not yet good enough to use, and when problems arose, solutions were hard to find.
Now, Python.
- Python's development efficiency is high, making it suitable for rapid development and iteration — provided, of course, that engineering quality gets attention.
- Python has strong text processing capabilities and suits text-related features well.
- Python integrates well with computing platforms such as Hadoop and Spark, and so scales as data volumes grow.
Today, however, parts of R's role can be replaced with Python, as the toolkit represented by sklearn, pandas and Theano has matured.
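To give a concrete flavor of this exploration phase, here is a minimal, self-contained sketch of the kind of quick check we mean — train a simple model on a small sample and see whether the offline metric clears random, i.e. whether the problem looks learnable at all. The data and column names are invented for illustration, not our production schema:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic click log; the label is loosely tied to the features.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "price_rank": rng.integers(1, 50, n),   # hypothetical feature
    "ctr_7d": rng.random(n),                # hypothetical feature
})
logit = 2.0 * df["ctr_7d"] - 0.02 * df["price_rank"]
df["clicked"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Quick learnability check: does a crude model beat random (AUC 0.5)?
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["price_rank", "ctr_7d"]], df["clicked"],
    test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"offline AUC: {auc:.3f}")
```

If a crude model on crude features cannot clear 0.5 AUC, that is a signal to question whether machine learning fits the problem at all before investing in engineering.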
Even in the exploration phase, though, once the system reaches large data volumes, R is no longer appropriate , for two main reasons: it can only process small amounts of data, and processing is slow .
- First, plain R only runs on a single machine and must load all data into memory, which is obviously a serious barrier to big-data processing. Some newer technologies may alleviate this, but we have not tried them.
- Second, computation is relatively slow — under large data volumes, that is.
So, as on the left side of the architecture diagram, once the big-data stage is reached, tools represented by Hadoop and Spark step onto the stage and become the main tools in use .
After the initial exploration and validation phase, we step into engineering iteration .
Shown here is the typical process we developed.
Once validation passes, we move to the next important step, which I call " whole-process construction ": building up the ML system to be constructed, together with everything downstream that consumes it, into one complete development environment.
What needs emphasis here is completeness: set up not only the model-related links — samples, features, training — but also the links that come after the model is used, such as ranking and display. I will come back to this later.
If you are building such a system for the first time, whole-process construction will take quite a while. But this step is the cornerstone of all the work that follows, and the time and effort are worth it.
Once this step is complete, a system actually exists — though a soulless one, because every part may be entirely unoptimized, and some parts may be only a shell with nothing inside.
Then comes the "Infernal Affairs" of optimization iteration: constantly finding points that can be optimized, trying various solutions, validating offline, and, when something seems to meet the bar for going live, running an online A/B test. Once the whole process is in place, subsequent iteration basically cycles through this loop. (The original meaning of "Infernal Affairs" is the lowest of the eighteen layers of hell: endless cycles of suffering.)
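The gate in that loop — between offline validation and an online A/B test — can be written as one small rule. The metric (AUC) and the threshold here are illustrative assumptions, not our production values:

```python
# Promote a candidate change to an online A/B test only when its offline
# metric clearly beats the current baseline; otherwise keep iterating.
# min_gain is an assumed, illustrative threshold.
def should_ab_test(candidate_auc: float, baseline_auc: float,
                   min_gain: float = 0.002) -> bool:
    return candidate_auc - baseline_auc >= min_gain

print(should_ab_test(0.721, 0.715))  # clear win -> run the A/B test
print(should_ab_test(0.716, 0.715))  # within noise -> keep iterating
```

The point of making the gate explicit is that it keeps noisy, marginal "improvements" from consuming scarce online test slots.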
In fact, this development process is much like building a house: first lay the foundation, then put up the bare shell, then renovate continuously, with inspection after inspection, until you can move in. After living there a while you may find things you dislike, or discover a newer, prettier style of decoration, and renovate again. And so on — until one day you are rich and move to a new house, which is when the system gets its overall restructuring and upgrade.
The tools on this page are among the most popular available, plus our own toolset, Dmilch.
Dmilch (Milch is German for milk): the Dangdang Machine Learning Toolchain, a set of feature-engineering tools distilled from our continuous iteration. It contains common utilities for feature processing, such as feature regularization, normalization, and computation of common metrics. It is similar in purpose to FeatureFu, which LinkedIn open-sourced a while ago — both aim to make feature handling easier — but approaches it from a different angle.
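Dmilch itself is internal, so as a hedged illustration only, helpers of the kind it bundles — feature normalization utilities — might look roughly like this sketch:

```python
import numpy as np

def min_max_scale(x):
    """Scale a feature column into [0, 1]; constant columns map to 0."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def z_score(x):
    """Standardize a feature column to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return np.zeros_like(x) if std == 0 else (x - x.mean()) / std

print(min_max_scale([2, 4, 6]))  # scaled into [0, 1]
print(z_score([1, 2, 3]))        # zero mean, unit variance
```

The value of such a toolchain is not any single function but consistency: every model sees features processed the same way, which removes one whole class of hard-to-find bugs.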
This page describes a few key points of our workflow. Small teams have a natural advantage here, so our central idea is to "run fast".
The first key point is serializing changes . This may be unique to algorithmic systems like machine learning: if several improvements go live together, you sometimes cannot tell which one actually worked — like traditional Chinese medicine, where nobody knows which ingredient had the effect. We want to be able to extract the real "artemisinin".
The second is the project-advancement mechanism . We hold one or two meetings a week, mainly to verify the effect of improvements, discuss proposals, and confirm the next actions on the spot.
Engineers don't actually like meetings, so why hold one every week? I think the most important aim is to get everyone involved in the discussion, taking responsibility for the project and growing together . The implementation work is divided up, but the discussion is not: everyone contributes ideas and suggestions about the whole system. This also ensures we absorb areas we are individually unfamiliar with, which is better for growth.
Another topic that has to be discussed is trying new technology . To continue the house-building analogy, new technology is like impressive furniture: if a home doesn't have a piece or two to anchor the place, you are almost embarrassed to receive guests.
Our experience here: thoroughly understand the existing technology and use it to the full before talking about the new — not the other way around . For example, with the classic collaborative filtering algorithm, you should generally compute it separately over purchase, browsing, review, favorite and other data, and see which dimension works best. Once the value of the familiar technology has been squeezed dry, it is not too late to try the new.
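As a sketch of that practice (toy matrices, invented interactions): compute item-item cosine similarity separately for each behavior dimension, so each dimension's neighborhoods can be evaluated on their own before reaching for anything fancier:

```python
import numpy as np

def item_similarity(user_item):
    """Cosine similarity between the item columns of a user-item matrix."""
    m = np.asarray(user_item, dtype=float)
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unseen items
    unit = m / norms
    return unit.T @ unit

# rows = users, columns = items; one 0/1 matrix per behavior dimension
purchases = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0]])
browses   = np.array([[1, 0, 1], [1, 1, 1], [0, 1, 0]])

sim_buy, sim_browse = item_similarity(purchases), item_similarity(browses)
# The two dimensions give different neighbors for item 0 -- evaluate each.
print(sim_buy[0], sim_browse[0])
```

Running the same familiar algorithm over each behavior dimension and comparing offline is exactly the kind of "squeezing" meant above.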
Just as important: other people's technology may not suit you. Companies differ in business scenarios, data scale and data characteristics, so new techniques proposed elsewhere should be adopted with care.
We once had great confidence in a technique from a major international company, but repeated attempts brought no improvement, only added complexity. Later we compared notes with peers and found that no one else had good results with it either. So the foreign moon is perhaps only rounder abroad. Which technology suits you depends on the soil your system grows in.
To close this section, a brief note on the effect of our models on recommendations and ads : the click-through rate of the first recommendation screen improved by 15%~20%, ad clicks increased by around 30%, and RPM increased by around 20%. The effect, as you can see, is quite significant.
those years, the pits we stepped in
Now to the next important part of today's talk: the various pits we stepped in.
Past lessons are the teacher of the future, and the pits are perhaps the most valuable part of any talk. We stepped in plenty while building our system; here I will share a few of the bigger ones and hope they help. I will introduce the pits first, and then talk about what we felt and gained as we crawled out of them.
seeing the model, but not the system
If we had to rank the pits we stepped in, this one would take first place — because if you fall into it, the basis on which you set your system's direction is probably completely wrong.
Specifically: when we started building the system, we focused almost entirely on the machine learning model — AUC, NE and the like — and not on how the model would eventually go live. The consequence was a model that looked great on every metric, yet once online had no effect at all. Because we ignored how the model would be used and kept "optimizing" it behind closed doors, the final effect could not be good.
What is the correct posture? In our experience, be clear from the early stage that you are not building a model, but a system centered on a model. Knowing what happens once the model is produced, and how it will be used, matters greatly. Keep the bigger picture in view.
Although the model is the center of the system, it is not the whole system. At every stage — design, development, tuning — look at problems from the system's perspective, not with only the model in your eyes and no system (or product). Otherwise, when you proudly hold up a model with AUC = 0.99, you may look up and find you have drifted farther and farther from the system.
So a machine learning practitioner must attend to both model and system. See only the model and not the system, and you are likely to build a "vase system": beautiful metrics, no effect.
not valuing visual analysis tools
This problem is easy to overlook at first, but it will make things very difficult later (ours was a non-deep-learning system).
Because a machine learning system is in some sense a black box, our energy naturally concentrates on parameters and models, and we instinctively feel that the model's inner workings need no attention. Our experience says otherwise: if you watch only the outside of the black box and never look within, then when the model performs poorly it is very hard to locate the problem — and when it performs well, it is faintly unsettling, like your bathroom light switching itself on, or the TV turning on by itself. It never feels quite safe.
We felt this deeply. Early on, when the system performed poorly, we had no disciplined way to locate the problem. All we could do was shuffle features back and forth and vary how samples were processed: if the effect improved, good; if not, keep tossing.
So we built a web page that displays each sample, the features it contains and their parameters, how often the sample occurs, its rank within the candidate set, and so on — like performing an autopsy on the whole system, model included. Seeing as much internal detail as possible is a great help in analyzing problems.
This tool has helped us enormously. It is not a "methodical" approach, but with so much laid out in front of you, you find things that differ from what you imagined — and things you would never have imagined. For a black-box-like system such as machine learning, that is especially valuable. To this day it is one of the things we rely on every time we verify an effect; it has become our second pair of eyes.
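The page itself is internal, but its core idea can be sketched: for a linear model, attribute a sample's score to individual features (weight × value), so a strange ranking can be traced to a concrete feature. The feature names and weights below are made up for illustration:

```python
# "Autopsy" sketch: break a linear model's score for one sample into
# per-feature contributions, largest magnitude first.
def explain_sample(feature_names, weights, values, bias=0.0):
    contribs = sorted(
        ((name, w * v) for name, w, v in zip(feature_names, weights, values)),
        key=lambda t: -abs(t[1]))
    score = bias + sum(c for _, c in contribs)
    return score, contribs

score, contribs = explain_sample(
    ["ctr_7d", "price_rank", "is_promo"],   # hypothetical features
    [1.8, -0.03, 0.4],                      # hypothetical learned weights
    [0.25, 12, 1])                          # one sample's feature values
print(f"score = {score:.2f}")
for name, c in contribs:
    print(f"  {name:<12}{c:+.3f}")
```

Even a dump this simple often reveals a single feature dominating a score in a way you never intended — exactly the kind of thing the web page surfaced for us.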
over-reliance on algorithms
Many students have probably met this pit. An example: we had a text processing problem, filtering out a large number of irrelevant, useless terms. At first we threw all kinds of algorithms and all kinds of tuning at it, yet the results kept disappointing.
Finally the trick came to us: filter by hand. Concretely, three people spent three days going through the text purely manually (thousands to tens of thousands of words), and the effect was immediate. Perhaps better algorithms existed at that point, but from a system and engineering perspective, manual work had the highest ROI.
So although machine learning systems are built around algorithms, don't think rigidly and reach for an algorithm to solve everything ; in some places, the primitive "millet plus rifles" approach is more appropriate.
critical processes and data not in your team's hands
This pit is not an easy one to spot; in the early stage of a system it is especially well hidden. We, too, discovered it only after taking a few losses.
In many companies, front-end display, log collection and similar work are handled by dedicated teams, and teams like recommendation and advertising simply consume the results. The benefit is obvious: the machine learning team can concentrate on its own job. The downside is that the data collected is not always what you expect.
An example: our exposure (impression) data was originally produced for us by a sibling team, and it took us a long time to discover that something was wrong with it. The problem directly affected the correctness of our samples, so its impact on us was severe.
What caused it? Not that the sibling team wasn't serious — they simply did not fully understand our requirements for the data, and they did not use the data themselves, so its quality was inherently at risk . After taking this loss, we took the work over ourselves. Now we can monitor data correctness end to end, and problems can be solved internally without coordinating outside resources.
the team not being "full stack" enough
This pit is a rather complicated one. In the previous pit I mentioned that, after finding the data-quality problem, we took over exposure collection ourselves. But locating that problem and taking it over did not happen when the data first went wrong. The reason is simple and brutal: there was no front-end talent in our group.
The exposure problem involves a chain of actions running from the browser to the backend, and the front end is the first link in that chain. When assembling the machine learning team, we did not realize front-end work would be involved, assuming backend plus modeling people would be enough — so when the problem arrived, we could do little about it. Only after a colleague with rich front-end experience joined us did we fix it and take it into our own hands.
The lesson: build the team more carefully, and from a more systemic perspective . You cannot say "this is machine learning, so hire only algorithm engineers"; that creates team-level short boards and buries mines for certain classes of problems.
Some problems, however, are hard to foresee before you meet them, which is what makes this pit complicated.
the mega-system
The last pit is, fittingly, a big one. I call it the "mega-system".
What do I mean by a mega-system? Simply put, building the whole thing as one single system, rather than as modules composed of subsystems. The symptom is high coupling and strong interdependence among internal modules: samples, features, training, prediction and so on all stuck together, impossible to pull apart. What are the consequences?
A direct example: the first version of our system took a full week just to deploy, and afterwards was very hard to maintain — changing anything was difficult. Why did we build it that way? My reflection: when studying the theory, one naturally treats samples, features and training as one pipeline, one set of things, and carrying that mindset straight into the implementation produces a mega-system. With only a dozen or so features and a few hundred samples there may be no problem; but when features grow to the millions and samples to the tens of millions, you have to ask whether the system has grown too large to control.
What is the better way? Our later solution: big system, small pieces . The phrase is not mine; I saw it in a talk around this year's (or perhaps last year's) Spring Festival by the team behind the red-envelope system, discussing their architecture. I think it is very well put and agree with it completely: however large and complex your system, build it with good module separation, which benefits development, expansion and maintenance alike.
A particular trait of machine learning systems is that at the beginning you may have very few features, so a single monolithic system feels sufficient; but as work proceeds you need ever more features and ever more kinds of sample processing, and the system balloons without your noticing. If on top of that you focus only on the model, it is easy to end up with an unmaintainable mega-system.
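As a toy illustration of "big system, small pieces" (stage names and data invented): each stage lives behind a narrow interface, so sample building, feature extraction and training can be developed, replaced and tested independently instead of being fused into one block:

```python
from typing import Callable, List, Tuple

class Pipeline:
    """Minimal staged pipeline; each stage is an independent component."""

    def __init__(self):
        self.stages: List[Tuple[str, Callable]] = []

    def stage(self, name: str, fn: Callable) -> "Pipeline":
        self.stages.append((name, fn))
        return self

    def run(self, data):
        for name, fn in self.stages:
            data = fn(data)  # each stage sees only its own input/output
        return data

pipe = (Pipeline()
        .stage("build_samples", lambda logs: [l for l in logs if l["valid"]])
        .stage("extract_features", lambda rows: [(r["x"], r["y"]) for r in rows])
        .stage("train", lambda pairs: {"n_samples": len(pairs)}))

logs = [{"valid": True, "x": 1, "y": 0}, {"valid": False, "x": 2, "y": 1}]
print(pipe.run(logs))  # {'n_samples': 1}
```

The point is the shape, not the code: because each stage is swappable, replacing, say, the feature extractor does not require redeploying the whole system — the opposite of our week-long monolithic deploys.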
the Long March has only just begun
Having stepped through all these pits, our team can say the system is built — but that is only the first step of the Long March. Machine learning systems are a relatively new kind of thing, with many complexities that differ from traditional software systems, and many challenges remain. I will use two references to briefly introduce these complexities and the challenges they bring; students interested in going deeper can read the originals.
The first is a paper from Google on the technical debt of machine learning. The title is also amusing; it translates as "Machine Learning: The High-Interest Credit Card of Technical Debt".
Its main point: machine learning systems are complex to build, and if you are inexperienced or incautious it is easy to take on "debt" in many places. The debts have no visible effect at the time, but the "interest" is very high, and it will make you miserable later.
The figure shows the specific dimensions of technical debt, which I organized myself after reading the paper. They match our own practice closely — reading it felt like taking arrow after arrow to the knee.
For example, the "blurred subsystem boundaries" mentioned above is similar to the "mega-system" I described earlier: the system never gets split internally.
Another example is the "system-level spaghetti" at the lower right. Spaghetti code usually means a mess of code; because machine learning systems are generally built up through exploration, rather than fully designed first like other systems, spaghetti code is easy to produce.
If you can consider these dimensions before building the system, later development, upgrading and maintenance become much easier. Presumably this experience, too, was distilled by giants like Google from many falls into many pits — being a giant isn't easy either.
The next reference is a tutorial given at ICML by Léon Bottou, the master of SGD, now at Facebook. Its title is "Two big challenges in machine learning", and it leans toward systems practice, discussing two new challenges facing machine learning.
The first is startling: machine learning breaks software engineering . But think about it, and it is true. Machine learning development is mostly exploratory and incremental, which differs greatly from traditional software engineering and challenges system developers. I suspect there will one day be a dedicated post of "machine learning system architect".
The second point is that current experimental methodology is reaching its limits . At first glance this sounds like laboratory science, but because machine learning development is exploratory, it involves constant experiments to verify effects, and the overall experimental framework needs careful design. In Bottou's view, current methods are clearly no longer adequate.
Small teams leverage big data: machine learning practices of the Dangdang recommendation team