Understanding how genetic algorithms work (with Python implementations)

Source: Internet
Author: User

Selected from Analytics Vidhya

Participation: Yanchi, Huang

Recently, Analytics Vidhya published an article titled "Introduction to Genetic algorithm & their application in data science" by Shubham Jain. The article gives a comprehensive yet brief overview of genetic algorithms in easy-to-understand language, lists practical applications in many fields, and focuses on how genetic algorithms are applied in data science. Heart of the Machine has compiled this article; see the original link at the end of the text.

Brief introduction

A few days ago, I set out to solve a practical problem: the Big Mart Sales problem. After using a few simple models and some feature engineering, I ranked 219th on the leaderboard.

Although the result was good, I still wanted to do better. So I began to study optimization methods that could improve my score. I found one, called the genetic algorithm. After applying it to the supermarket sales problem, my score finally climbed the leaderboard.

Yes, with the genetic algorithm alone I jumped from 219th directly to 15th. Not bad! After reading this article, you will also be able to apply genetic algorithms freely, and you may find that results improve considerably when you use them on the problems you are working on.

Table of contents

1. The origin of genetic algorithm theory

2. Biological inspiration

3. Definition of genetic algorithm

4. Specific steps of genetic algorithm

    • Initialization

    • Fitness function

    • Selection

    • Crossover

    • Mutation

5. Application of genetic algorithm

    • Feature Selection

    • Implementation using the TPOT library

6. Practical Application

7. Conclusion

1. The origin of genetic algorithm theory

Let's start with a quote from Charles Darwin:

It is not the strongest species that survive, nor the most intelligent, but the ones most adaptable to their environment.

You may be wondering: what does this quote have to do with genetic algorithms? In fact, the whole concept of a genetic algorithm is based on this sentence.

Let's use a basic example to explain:

Let's imagine a scenario: you are the king of a country, and in order to save your country from disaster you introduce a policy:

    • You select all the good people and ask them to extend their line by having children.

    • This process is repeated for several generations.

    • You will find that you now have an entire population of good people.

This example is not realistic, but I use it to help you understand the concept. The basic idea is that we change the input (the population) to get a better output (a better country). Now that you have a general feel for the concept, you may suspect that genetic algorithms are related to biology. So let's take a quick look at a few concepts so that we can connect them.

2. Biological inspiration

I believe you remember this sentence: "The cell is the building block of all living things." It follows that every cell of an organism contains the same set of chromosomes. A chromosome is a polymer composed of DNA.

Traditionally, these chromosomes are represented as strings of 0s and 1s.

A chromosome is made up of genes, which are essentially the building blocks of DNA. Each gene encodes a particular trait, such as hair color or eye color. I hope you recall these biological concepts before reading on. To close this part, let's look at what a genetic algorithm actually is.

3. Definition of genetic algorithm

First, let's go back to the example we discussed earlier and summarize what we did.

    1. First, we defined the initial population: the people of our country.

    2. Then we defined a function that distinguishes the good people from the bad.

    3. Next, we selected the good people and let them produce offspring.

    4. Finally, these offspring replaced some of the bad people in the original population, and the process was repeated.

This is how a genetic algorithm actually works: it basically tries to simulate the process of evolution to some extent.

To formalize this, a genetic algorithm can be viewed as an optimization method that searches for the input values that yield the best output value or result. The way genetic algorithms work is also derived from biology, as described in the following procedure:

So now let's step through the process.

4. Specific steps of genetic algorithm

To make the explanation easier, let's look at a famous combinatorial optimization problem, the knapsack problem. If you are not familiar with it, here is my version of it.

Suppose you are going hiking for one month, but you can only carry a backpack with a weight limit of 30 kilograms. You have several essential items, each with its own "survival points" (as given in the table below). Your goal is to maximize your survival points without exceeding the backpack's weight limit.

4.1 Initialization

Here we use a genetic algorithm to solve this knapsack problem. The first step is to define our population. The population consists of individuals, each with its own chromosome.

We know that chromosomes can be expressed as binary strings. In this problem, 1 means the gene at that position is present, and 0 means it is absent. (Translator's note: the author borrows the concepts of chromosomes and genes to solve the knapsack problem above, so the gene at a given position stands for the corresponding item in the knapsack table. For example, if the first item is the sleeping bag, it corresponds to the first "gene" of the chromosome.)

Now, we take the 4 chromosomes in the figure as our initial population.
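As a rough sketch of this initialization step, the snippet below builds a random population of binary chromosomes. Only the sleeping bag is named in the text; the other item names, the helper names `random_chromosome` and `init_population`, and the fixed seed are assumptions made for illustration.

```python
import random

# Only "sleeping bag" comes from the text; the other item names are assumed.
ITEMS = ["sleeping bag", "rope", "pocket knife", "torch", "bottle", "glucose"]

def random_chromosome(n_genes, rng):
    """Return a list of 0/1 genes; 1 means the item at that position is packed."""
    return [rng.randint(0, 1) for _ in range(n_genes)]

def init_population(pop_size, n_genes, rng):
    """Create the initial population as a list of random chromosomes."""
    return [random_chromosome(n_genes, rng) for _ in range(pop_size)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = init_population(pop_size=4, n_genes=len(ITEMS), rng=rng)
for chromosome in population:
    # Decode the chromosome back into the list of packed items.
    packed = [name for name, gene in zip(ITEMS, chromosome) if gene]
    print(chromosome, packed)
```

A population of 4 matches the size used in the text; real runs usually use a larger population.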

4.2 Fitness function

Next, let's calculate the fitness score of the first two chromosomes. For chromosome A1 [100110], we have:

Similarly, for chromosome A2 [001110], we have:

For this problem, we consider a chromosome fitter when it carries more survival points.

Therefore, chromosome 1 is fitter than chromosome 2.
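The fitness calculation can be sketched as follows. The item table is not reproduced in this text, so the weights and survival points below are assumed values used only for illustration; the 30 kg limit and the two chromosomes come from the text. Scoring an overweight chromosome as 0 is also an assumption, one of several possible ways to handle the constraint.

```python
# Assumed (name, weight in kg, survival points) rows; illustrative values only.
ITEMS = [
    ("sleeping bag", 15, 15),
    ("rope", 3, 7),
    ("pocket knife", 2, 10),
    ("torch", 5, 5),
    ("bottle", 9, 8),
    ("glucose", 20, 17),
]
MAX_WEIGHT = 30  # backpack limit from the text, in kilograms

def total_weight(chromosome):
    """Total weight of the items whose gene is 1."""
    return sum(weight for gene, (_name, weight, _pts) in zip(chromosome, ITEMS) if gene)

def fitness(chromosome):
    """Sum of survival points, or 0 if the weight limit is exceeded (assumed penalty)."""
    if total_weight(chromosome) > MAX_WEIGHT:
        return 0
    return sum(pts for gene, (_name, _weight, pts) in zip(chromosome, ITEMS) if gene)

a1 = [1, 0, 0, 1, 1, 0]
a2 = [0, 0, 1, 1, 1, 0]
print(fitness(a1), fitness(a2))  # with the assumed table: 28 23
```

With these assumed values A1 scores higher than A2, which agrees with the ordering stated above.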

4.3 Selection

Now we can select fit chromosomes from the population and let them "mate" to produce offspring. That is the general idea of selection, but always picking only the fittest chromosomes makes the population more and more alike and reduces diversity within a few generations. Therefore, we generally use the roulette wheel selection method.

Imagine a roulette wheel divided into m parts, where m is the number of chromosomes in the population. The area each chromosome occupies on the wheel is proportional to its fitness score.

Based on the fitness values, we create the following roulette wheel.

Now the wheel is spun, and the region that the fixed point lands on is selected as the first parent. We then do the same for the second parent. Sometimes we instead mark two fixed points on the wheel, as shown:

In this way, we get both parents in a single spin. This method is called stochastic universal selection (more commonly known as stochastic universal sampling).
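Both selection schemes can be sketched in a few lines. The function names and the fitness scores for the third and fourth chromosomes are assumptions for illustration; the scores for A1 and A2 reuse the assumed values from the fitness sketch.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    spin = rng.uniform(0, total)        # where the fixed point lands on the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin <= running:
            return individual
    return population[-1]               # guard against floating-point round-off

def sus_select(population, fitnesses, n, rng):
    """Stochastic universal sampling: n equally spaced pointers, a single spin."""
    total = sum(fitnesses)
    step = total / n
    start = rng.uniform(0, step)
    chosen, running, idx = [], fitnesses[0], 0
    for k in range(n):
        pointer = start + k * step
        # Advance along the wheel until the current slice contains the pointer.
        while running < pointer and idx < len(population) - 1:
            idx += 1
            running += fitnesses[idx]
        chosen.append(population[idx])
    return chosen

rng = random.Random(0)
population = ["A1", "A2", "A3", "A4"]
fitnesses = [28, 23, 12, 34]            # assumed scores for illustration
parent1 = roulette_select(population, fitnesses, rng)
parent2 = roulette_select(population, fitnesses, rng)
both_parents = sus_select(population, fitnesses, 2, rng)
print(parent1, parent2, both_parents)
```

Roulette selection needs one spin per parent, while the second function returns both parents from a single spin, as described above.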

4.4 Crossover

In the previous step, we selected the parent chromosomes that will produce offspring. In biological terms, "crossover" simply refers to reproduction. Now let's cross chromosomes 1 and 4 (selected in the previous step), as shown:

This is the most basic form of crossover, called "single-point crossover". Here we randomly select a crossover point and swap the parts of the two chromosomes after that point, which produces a new generation of offspring.

If you set two crossover points, the method becomes "multi-point crossover", as shown:
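A minimal sketch of both crossover variants follows. Chromosome 4 is not listed in the text, so the value of `a4` is an assumption, as are the function names and the chosen crossover points.

```python
import random

def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two parents after the given crossover point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def two_point_crossover(parent_a, parent_b, start, end):
    """Swap only the segment between the two crossover points."""
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b

a1 = [1, 0, 0, 1, 1, 0]
a4 = [0, 1, 1, 0, 0, 1]   # assumed; chromosome 4 is not given in the text

rng = random.Random(7)
point = rng.randint(1, len(a1) - 1)   # random point that leaves both sides non-empty
print(single_point_crossover(a1, a4, point))
print(two_point_crossover(a1, a4, 2, 4))
```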

4.5 Mutation

Looking at this from a biological point of view, let me ask: do the offspring produced by the process above have exactly the same traits as their parents? The answer is no. As offspring grow, their genes change, making them different from their parents. This process is called "mutation", and it can be defined as a random change occurring on a chromosome. It is precisely because of mutation that there is diversity in the population.

A simple example of a mutation:
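In code, one common way to realize this is bit-flip mutation, sketched below. The text does not specify which mutation operator is used, so the operator, the function name, and the mutation rate are assumptions.

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

rng = random.Random(3)
child = [1, 0, 0, 0, 0, 1]
print(mutate(child, rate=0.2, rng=rng))   # rate chosen for illustration only
```

A small rate keeps most genes intact, so offspring stay close to their parents while still introducing occasional variation.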

After mutation, we have new individuals and one round of evolution is complete. The whole process is as follows:

After a round of crossover and mutation, we use the fitness function to evaluate the new offspring. If they are fit enough, they replace the less fit chromosomes in the population. This raises a question: how do we know when the population has reached its best level of fitness, so that we can stop?

In general, there are several termination conditions:

    1. After X iterations, there is no significant change in the population.

    2. The algorithm reaches a number of generations defined in advance.

    3. The fitness function reaches a pre-defined value.

Now that you have a basic understanding of the essentials of genetic algorithms, let's use them in a data science scenario.
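To show how the three termination conditions fit into the overall loop, here is a compact sketch for the knapsack example. It is one possible arrangement, not the article's implementation: the item values, the parameter names (`patience`, `target`), the use of elitism, and the `+ 1` added to the selection weights are all assumptions.

```python
import random

# Assumed (weight in kg, survival points) pairs; illustrative values only.
ITEMS = [(15, 15), (3, 7), (2, 10), (5, 5), (9, 8), (20, 17)]
MAX_WEIGHT = 30

def fitness(chrom):
    weight = sum(w for g, (w, _p) in zip(chrom, ITEMS) if g)
    points = sum(p for g, (_w, p) in zip(chrom, ITEMS) if g)
    return points if weight <= MAX_WEIGHT else 0

def select(pop, fits, rng):
    # Roulette wheel; the +1 keeps zero-fitness individuals selectable.
    return rng.choices(pop, weights=[f + 1 for f in fits], k=1)[0]

def evolve(pop_size=20, max_generations=100, patience=15,
           target=None, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(ITEMS)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    stale = 0                 # generations without improvement
    generations_run = 0
    for _ in range(max_generations):              # condition 2: fixed budget
        generations_run += 1
        fits = [fitness(c) for c in pop]
        children = []
        while len(children) < pop_size - 1:
            a, b = select(pop, fits, rng), select(pop, fits, rng)
            point = rng.randint(1, n - 1)         # single-point crossover
            child = a[:point] + b[point:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children + [best]                   # elitism (an assumption)
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best, stale = gen_best, 0
        else:
            stale += 1
        if stale >= patience:                     # condition 1: no recent change
            break
        if target is not None and fitness(best) >= target:   # condition 3
            break
    return best, fitness(best), generations_run

best, score, generations = evolve()
print(best, score, generations)
```

Keeping the best chromosome in every generation means the best score never decreases, which makes the "no change for X iterations" check straightforward.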

5. Application of genetic algorithm

5.1 Feature Selection

Whenever you take part in a data science competition, how do you pick the features that matter for predicting your target variable? Typically, you estimate feature importances from a model, then manually set a threshold and keep the features whose importance exceeds it.

So, is there a better way to deal with this problem? In fact, one of the most advanced algorithms for feature selection is the genetic algorithm.

The method we used for the knapsack problem applies here in full. We again start by building a population of chromosomes, where each chromosome is a binary string: "1" means the model includes the feature, and "0" means the model excludes it.

One difference, however, is the fitness function. Here it should be the evaluation metric of the competition: the more accurate the predictions obtained with a chromosome's feature set, the higher its fitness.
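The encoding described above can be sketched as follows. The function names are hypothetical, and `toy_score` is only a stand-in for a real model score (for example a cross-validated metric); it is not part of the article.

```python
def selected_features(chromosome, feature_names):
    """Return the feature names whose gene is 1."""
    return [name for name, gene in zip(feature_names, chromosome) if gene]

def feature_fitness(chromosome, feature_names, score_model):
    """Fitness of a feature subset is the score of a model trained on it.

    `score_model` is a hypothetical user-supplied callable that takes a list
    of feature names and returns a number where higher is better.
    """
    subset = selected_features(chromosome, feature_names)
    if not subset:                    # an empty feature set cannot be scored
        return float("-inf")
    return score_model(subset)

# Toy stand-in for a real model score, for demonstration only: pretend two
# features are useful and every included feature costs a little.
USEFUL = {"Item_MRP", "Outlet_Type"}

def toy_score(subset):
    return sum(1.0 for f in subset if f in USEFUL) - 0.1 * len(subset)

features = ["Item_Weight", "Item_Visibility", "Item_MRP", "Outlet_Type"]
print(feature_fitness([0, 0, 1, 1], features, toy_score))
print(feature_fitness([1, 1, 1, 1], features, toy_score))
```

In a real competition, `score_model` would train and validate a model on the selected columns, so each fitness evaluation is much more expensive than in the knapsack example.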

I assume you now have a rough idea of this method. I won't walk through solving this problem step by step right away. Instead, let's start by implementing it with the TPOT library.

5.2 Implementation using the TPOT library

This is probably the part you were looking for when you started reading this article: the implementation. First, let's take a quick look at the TPOT library (Tree-based Pipeline Optimisation Technique), which is built on top of the scikit-learn library. The figure below shows a basic pipeline structure.

The gray area in the figure is automated by the TPOT library, and a genetic algorithm is what makes this automation possible.

Rather than explaining it in depth, let's apply it directly. To use TPOT, you first need to install the Python libraries it is built on. Let's install them quickly:

# Installing DEAP, update_checker and tqdm
pip install deap update_checker tqdm

# Installing TPOT
pip install tpot

Here I used the Big Mart Sales dataset (dataset address: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/). To prepare for the implementation, first download the training and test files. The Python code is as follows:

# Import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the downloaded files (the file names are assumptions; use your own paths)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## Preprocessing

### Mean imputation
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
test['Item_Weight'] = test['Item_Weight'].fillna(test['Item_Weight'].mean())

### Reducing fat content to only two categories
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['low fat', 'LF'], ['Low Fat', 'Low Fat'])
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['reg'], ['Regular'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['low fat', 'LF'], ['Low Fat', 'Low Fat'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['reg'], ['Regular'])

### Convert the establishment year to the outlet's age relative to 2013
train['Outlet_Establishment_Year'] = 2013 - train['Outlet_Establishment_Year']
test['Outlet_Establishment_Year'] = 2013 - test['Outlet_Establishment_Year']

### Fill missing outlet sizes and take the square root of the visibility
train['Outlet_Size'] = train['Outlet_Size'].fillna('Small')
test['Outlet_Size'] = test['Outlet_Size'].fillna('Small')
train['Item_Visibility'] = np.sqrt(train['Item_Visibility'])
test['Item_Visibility'] = np.sqrt(test['Item_Visibility'])

### Label-encode the categorical columns on the combined data
col = ['Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Fat_Content']
test['Item_Outlet_Sales'] = 0
combi = pd.concat([train, test])
number = preprocessing.LabelEncoder()
for i in col:
    combi[i] = number.fit_transform(combi[i].astype('str'))
    combi[i] = combi[i].astype('object')
n_train = train.shape[0]
train = combi[:n_train]
test = combi[n_train:].drop('Item_Outlet_Sales', axis=1)

## Removing ID variables
tpot_train = train.drop(['Outlet_Identifier', 'Item_Type', 'Item_Identifier'], axis=1)
tpot_test = test.drop(['Outlet_Identifier', 'Item_Type', 'Item_Identifier'], axis=1)
target = tpot_train['Item_Outlet_Sales']
tpot_train = tpot_train.drop('Item_Outlet_Sales', axis=1)

# Finally, building the model using the TPOT library
from tpot import TPOTRegressor
X_train, X_test, y_train, y_test = train_test_split(tpot_train, target, train_size=0.75, test_size=0.25)
# The population_size value is blank in the source; 50 is an assumed value
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

Once the code has run, the exported file tpot_boston_pipeline.py will contain the Python code for the optimized pipeline. We can see that ExtraTreesRegressor solves this problem best.

## Predicting using the TPOT-optimised pipeline
tpot_pred = tpot.predict(tpot_test)
sub1 = pd.DataFrame(data=tpot_pred)
# sub1.index = np.arange(0, len(test) + 1)
sub1 = sub1.rename(columns={0: 'Item_Outlet_Sales'})
# .values avoids index misalignment between sub1 and test
sub1['Item_Identifier'] = test['Item_Identifier'].values
sub1['Outlet_Identifier'] = test['Outlet_Identifier'].values
sub1.columns = ['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier']
sub1 = sub1[['Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales']]
sub1.to_csv('tpot.csv', index=False)

If you submit this CSV, you will find that what I promised at the start has not been fully delivered. Was I lying to you? Of course not. The TPOT library has a simple rule: if you do not run TPOT for long enough, it may not find the best possible pipeline for your problem.

So you have to increase the number of generations, grab a cup of coffee and step out for a while, and leave the rest to TPOT. You can also use this library for classification problems. Further information can be found in the documentation: http://rhiever.github.io/tpot/. Beyond competitions, genetic algorithms are also used in many real-life scenarios.

6. Practical Application

Genetic algorithms have many applications in the real world. I list a few interesting ones here, but for reasons of space I won't go into detail.

6.1 Engineering

Engineering design relies heavily on computer modeling and simulation to make the design cycle fast and economical. Genetic algorithms can be used here for optimization and give good results.

Related resources:

    • Paper: Engineering design using genetic algorithms

    • Address: http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=16942&context=rtd

6.2 Traffic and shipment routing (Travelling Salesman Problem)

This is a very well-known problem, and many sales-based companies use it to make transportation more time-saving and economical. Genetic algorithms are also used to solve it.

6.3 Robots

Genetic algorithms are widely used in robotics. In fact, they are being used to create learning robots that behave like humans and perform tasks such as cooking and doing laundry.

Related resources:

    • Paper: Genetic algorithms for auto-tuning Mobile Robot Motion Control

    • Address: https://pdfs.semanticscholar.org/7c8c/faa78795bcba8e72cd56f8b8e3b95c0df20c.pdf

7. Conclusion

Hopefully, after reading this article you have a good understanding of genetic algorithms and know how to apply them with the TPOT library. But the knowledge in this article is of limited use unless you put it into practice.

So, dear readers, do try it out yourselves, whether in a data science competition or in real life.

Original link: https://www.analyticsvidhya.com/blog/2017/07/introduction-to-genetic-algorithm/

This article was compiled by Heart of the Machine. To reprint it, please contact the official account for authorization.
