Machine Learning Workflow, First Step: How Do You Prepare Data in Python?


This article is the first in a series of tutorials on building a machine learning workflow from scratch in Python, covering the programming of the algorithms and related supporting tools from the ground up. The series will eventually grow into a set of hand-built machine learning packages. This installment begins with data preparation.

-- from Matthew Mayo, KDnuggets

It seems that everyone's idea of machine learning these days amounts to passing a series of arguments to an ever-growing collection of libraries and APIs, then expecting some magical results to emerge. Maybe you understand perfectly well what is going on inside those libraries, from data preparation to modeling to presentation of results, but you still rely on that myriad of tools to do your job.

This is actually quite normal. It is understandable that we should perform routine tasks with tools that have been rigorously tested and proven to work. Reinventing a wheel that won't roll well is not the best approach: it imposes plenty of limitations and wastes a great deal of unnecessary time. Whether you use open-source or licensed tools to do your work, that code has been tried and tested by many people, so you can get started and do your job at the best possible quality.

However, doing some of the dirty work yourself is also valuable, even as a purely educational exercise. I am not recommending that you write your own deep learning framework from scratch, at least not routinely, but implementing the algorithms and supporting tools yourself, even just once, through constant experimentation and failure, is very much worth doing. I may be wrong, but I suspect that most people learning machine learning, data science, and artificial intelligence today are not doing this.

So let's start from scratch and build up some machine learning know-how in Python.

What exactly does "from scratch" mean?

First, a disclaimer: when I say "from scratch", I mean using as little outside help as possible. This is relative, of course, but to serve our goal I will draw the boundary here: instead of writing our own matrix routines, data frames, or plotting libraries, we will use Python's NumPy, pandas, and Matplotlib libraries. In some cases we won't even use their full functionality; more on that later, but we name them up front for clarity. The features of Python's own standard library are fair game as well. Beyond that, though, we will write everything ourselves.

We have to start somewhere, so let's begin with some simple data preparation tasks. We will move slowly at first, but once we get a feel for what we are doing, we will gradually pick up speed. Besides data preparation, we will also need data transformation, results presentation, and visualization tools, not to mention machine learning algorithms themselves, to reach the goals we are setting out to accomplish.

The idea is to stitch together by hand whatever essential functionality we need to complete our machine learning tasks. As the series unfolds, we can add new tools and algorithms, and we can revisit (and correct) our earlier assumptions, making the whole process as iterative as possible and moving us steadily closer to the goal. Little by little, we will focus on a goal, work out a strategy to reach it, implement it in Python, and verify that it works.

The end result, as we envision it now, will be a collection of simple Python modules organized into our own simple machine learning library. For beginners, I believe this is a valuable way to come to grips with how machine learning processes, workflows, and algorithms work.

What exactly does a "workflow" mean?

Workflows mean different things to different people, but the workflow we're talking about here is the procedural part of a machine learning project. There are many process frameworks to help us keep track of our work, but for now let's simplify things to:

    • Get data

    • Process/prepare data

    • Build a model

    • Interpret and present results

We can expand this when we actually get to work, but this is the simple machine learning process framework we are designing for ourselves right now. Also, the "pipelines" (the small arrows between the steps) imply the ability to chain together all of the functionality in the workflow, so let's keep that in mind and move on.

Get the data

Before we build any models, we need some data, and we need to make sure that data matches our reasonable expectations. For testing purposes (not training or testing a model, but testing our own tooling), we will use the iris dataset, which you can download from here. Although many versions of the dataset can be found on the web, I recommend that we all use the same raw data to ensure our preparations work the same way.

Let's take a look:
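The file preview shown in the original is not reproduced here. A minimal sketch of taking that first look with pandas might be the following; the local filename iris.csv and the column names are my own illustrative choices, not from the original:

```python
import pandas as pd

# The raw iris file has no header row, so we supply column names ourselves.
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('iris.csv', header=None, names=names)

print(iris.head())   # the first few instances
print(iris.shape)    # (150, 5) for the standard iris data
```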


Now that we have seen the dataset and its file, let's think about what we need to do to turn the raw data into the form we want:

    • The data is stored in a CSV-format file

    • Instances are made up mostly of numeric attribute values

    • The class values are categorical (text)

None of the above applies to all datasets, but none of it applies only to this one dataset, either. This gives us the opportunity to write code that we can reuse later. Good programming practice here means focusing on reuse and modularity.

Some simple exploratory data analysis comes next: summary statistics for the numeric values, along with a few plots of the data.
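A minimal sketch of that kind of exploration, reusing the iris DataFrame from the snippet above (the column names are still my illustrative ones):

```python
import matplotlib.pyplot as plt

# Summary statistics for the numeric attributes.
print(iris.describe())

# How many instances of each class?
print(iris['species'].value_counts())

# A quick look at how two attributes separate the classes.
for name, group in iris.groupby('species'):
    plt.scatter(group['petal_length'], group['petal_width'], label=name)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.legend()
plt.show()
```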

Preparing data

Although little data preparation is needed in this particular case, some still is. Specifically, we need to make sure we account for the header row, strip out anything extraneous that pandas introduces, and convert each of our class values from nominal (name) form to numeric form, since the model cannot use name values, at least not without more complicated transformations.

Ultimately, we also want a data representation better suited to our own algorithms, so we will make sure we end up with a matrix, or NumPy ndarray, before moving on. Our data preparation workflow thus takes this shape:

    • Load the dataset file into a pandas DataFrame, accounting for the header row

    • Convert categorical (nominal) attribute values to numeric values

    • Convert the DataFrame to a NumPy ndarray

Also bear in mind that we have no reason to believe all interesting data will be stored in comma-separated files. We might want to grab data from a SQL database or straight from the Internet; we can come back to getting data from those places later.

First, let's write a simple function that loads a CSV file into a DataFrame. Of course, this is trivial to do with built-in functionality, but looking ahead we may want to add some extra steps of our own to dataset loading, so we will wrap it in a function we can extend later.
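The original presents its code as an image, so here is a minimal sketch matching the description below; the function and parameter names are my own:

```python
import pandas as pd

def dataset_to_dataframe(filename, header=True, sep=','):
    """Load a CSV (default) or TSV data file into a pandas DataFrame.

    Lines beginning with a pound sign ('#') are treated as comments
    and skipped; pass sep='\t' for TSV files.
    """
    rows = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Ignore blank lines and non-data (comment) lines.
            if not line or line.startswith('#'):
                continue
            rows.append(line.split(sep))
    if not rows:
        raise ValueError('no data rows found in ' + filename)
    if header:
        # The first data line holds the column names.
        return pd.DataFrame(rows[1:], columns=rows[0])
    return pd.DataFrame(rows)
```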


This code is quite straightforward. Reading the data file one line at a time lets us do some extra preprocessing, such as ignoring non-data lines (we assume that comments in the data file begin with a pound sign, however unlikely that may be). We can specify whether the dataset file includes a header, and we can accept both CSV and TSV files, with CSV as the default.

There is some error checking, but it is not very robust, so we may come back and shore it up at some point. In addition, reading the file line by line and deciding what to do with each line is slower than reading a clean, uniform CSV file straight into a DataFrame with a built-in function, but we judged that the extra flexibility is worth it at this stage (though reading large files may take a while). Don't forget: if parts of this hand-built approach turn out not to be the best way, we can adjust them later.

Before we try running our code, we need to write a function that converts nominal (name) class values into numeric values. To generalize the function, we should make it usable on the values of any attribute in a dataset, not just the class attribute, and we should keep track of which attribute names end up mapped to which integers. Having already built a step that loads a CSV or TSV data file into a pandas DataFrame, this function should accept both a pandas DataFrame and the name of the attribute to convert.

Note, too, that we are sidestepping the topic of one-hot encoding, another way of dealing with categorical attributes, but I expect we will return to it later.
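Again the original's code is shown as an image; a minimal sketch of such a conversion function, with my own names, might look like this:

```python
def to_numeric(df, attr_name):
    """Convert the named attribute's values to integers.

    Returns the modified DataFrame along with the value-to-integer
    mapping, so the original names can be recovered later.
    """
    mapping = {value: i for i, value in enumerate(df[attr_name].unique())}
    df[attr_name] = df[attr_name].map(mapping)
    return df, mapping
```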


The function above is simple, but it accomplishes our objective. We could have done this in a number of different ways, including with built-in pandas functionality, but working through a bit of the drudgery yourself is the whole point of the exercise.

Now we can load a dataset from a file and convert the categorical attribute values into numeric attribute values (keeping the mappings in a dictionary for later use). As mentioned earlier, we want our dataset to ultimately exist as a NumPy ndarray so that we can use it easily in our own algorithms. Again, this is a simple task, but writing a function for it means we can reuse it whenever we need to in the future.
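A sketch of that last conversion step, assuming all attribute values are numeric (or strings of numbers) by this point:

```python
def to_matrix(df):
    """Convert a DataFrame to a NumPy ndarray of floats."""
    return df.astype(float).to_numpy()
```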


If none of the previous functions seemed like overkill, this one might. But bear with me: we are following deliberately thorough programming guidelines, perhaps overly cautious ones. There is a good chance that as we go along we will need to change or extend our existing functionality, and having those changes implemented and documented in one place makes sense in the long run.

Testing the data preparation workflow

Our workflow so far may still be a pile of building blocks, but let's give our code a test.
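A minimal sketch of such a test, wiring together the functions sketched above (the filename and column names remain my illustrative choices):

```python
# Load the raw file (the classic iris file has no header row),
# name the columns, convert the class column, and build a matrix.
df = dataset_to_dataframe('iris.csv', header=False)
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
df, class_map = to_numeric(df, 'species')
matrix = to_matrix(df)

print(matrix[:5])
print(class_map)   # e.g. {'Iris-setosa': 0, 'Iris-versicolor': 1, ...}
```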


Our code works the way we want it to, so let's do a bit of simple housekeeping. Once we get rolling we will give our code a more comprehensive organizational structure, but for now we should add all of these functions to a single file and save it as dataset.py. This will make our later work more convenient, as we will see next time.
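Later installments can then reuse the helpers with a plain import (using the sketch's function names):

```python
from dataset import dataset_to_dataframe, to_numeric, to_matrix
```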

Future plans

After that we will take up a simple classification algorithm: k-nearest neighbors. We will look at how to build classification and clustering models within a simple workflow. No doubt this will require a number of additional tools to round out the project, and I am sure we will also revise parts of what we have already done.

Practicing machine learning is the best way to understand machine learning. Building the algorithms and support tools our workflow needs will ultimately prove to be useful.
