Software development skills for data scientists


Data scientists often come from diverse backgrounds and frequently don't have much, if any, formal training in computer science or software development. That being said, most data scientists will at some point find themselves in discussions with software engineers, because some of their code is already touching production code or soon will be.

This conversation will probably go something like this:

SE: "You didn't check your code and your tests into master without a code review, did you?"

DS: "Checked my what into what without a what?"

As it turns out, there are a number of skills that software developers often take for granted but that new data scientists may not possess, and may not even have heard of. I did a quick poll on Twitter about what these skills might be. I'll walk through the most common responses below, but I'd say the unifying theme for all of them is that many new data scientists don't know how to collaborate effectively. Perhaps they used to be academics and only worked alone or with a single collaborator. Perhaps they're self-taught. Regardless of the reason, they aren't used to writing code in an environment where many other people (and "other people" includes yourself at some later date) will be looking at, trying to understand, and using your code or the things your code produces.

You're used to your code living on your hard drive, or perhaps in a shared Dropbox folder. Now your code will be routinely checked into a repo (more on this below) where anyone can take a look at it. This is a little unnerving at first, and it can make you want to hold off on checking in any code until it's perfect. That's generally a bad idea.

Each of the following topics speaks to this idea in some way. I'm not saying that data scientists need to be experts in all of these fields right away, but some level of proficiency in each of them will be necessary sooner rather than later. You won't find most of these topics in "Introduction to Data Science in Python" or "Machine Learning in R" books -- these are the taken-for-granted skills.

Writing modular, reusable code

Many data scientists are self-taught programmers or learned to program as part of a project. Programming was a tool acquired to achieve a certain goal, like estimating a regression, modeling the movement of stars, or simulating atmospheric conditions. Rather than "programming" being a skill with its own norms, best practices, and so forth, writing code was about learning the right commands to type in the right order to produce output that could then be lovingly arranged in LaTeX.

Oftentimes, projects looked different enough from one another, or were sufficiently simple (code-wise), that you started from scratch each time or just copied and pasted the bits and pieces you needed from old projects. Your code was often very imperative in style and could be read start-to-finish to get an idea of what needed to be done: "First load the data, then do this, then do that, then print the results. The end."

Those days are over.

You should learn a principle called DRY, which stands for Don't Repeat Yourself. The basic idea is that many tasks can be abstracted into a function or piece of code that can be reused regardless of the specific task. This is more efficient from a "lines of code" perspective, but also in terms of your time. It can be taken to an illogical extreme, where code becomes very difficult to follow, but there is a happy medium to strive for. A good rule of thumb: if you find yourself writing the same line of code with only minor changes each time, think about how you can turn that code into a function that takes the changes as parameters. Avoid hard-coding values into your code. It is also good practice to revisit code you've written in the past to see if it can be made cleaner, more efficient, or more modular and reusable. This is called refactoring.
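
To make that concrete, here is a minimal sketch in Python. The file names, the columns, and the load_sales helper are hypothetical, invented just for illustration; the point is that the differences between the copied-and-pasted blocks become parameters:

    import pandas as pd

    # Before: the same three lines copied for every year and edited by hand.
    # After: the file path and the year become parameters of one function.
    def load_sales(path, year):
        """Load a sales CSV, drop incomplete rows, and tag each row with its year."""
        df = pd.read_csv(path).dropna()
        df['year'] = year
        return df

    sales = pd.concat([load_sales('sales_2014.csv', 2014),
                       load_sales('sales_2015.csv', 2015)])

If next year's file shows up, that's one more call to load_sales rather than another pasted block to keep in sync.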

Chances are good that you'll be asked to submit your code for a code review at some point. This is normal and doesn't mean people are skeptical of your work. Many organizations require that code be reviewed by multiple people before it's merged into production. Writing clean, reusable code will make this process easier for everyone and will lower the probability that you'll be rewriting significant portions of your code following a code review.

Further reading: Chris DuBois on becoming a full-stack statistician, The Pragmatic Programmer, Clean Code

Documentation/commenting

Because other people are going to be reading your code, you need to learn how to write meaningful, informative comments, as well as documentation for the code you write. It is a very good practice (although one you probably won't follow) to write comments and documentation before you actually write the code. This is the coding equivalent of writing an outline before you write a paper.

[Aside: some seasoned programmers will argue that you shouldn't write comments until the code is complete, because this forces you to write clear, self-explanatory code, and the only comments you'll have to write are for the situations that aren't crystal clear. As a beginning software developer, you should probably ignore this advice.]

Comments are non-executed blocks of code that explain what your code is doing and why it is doing it. Good comments make the purpose of code clearer; they don't just restate what's obvious in the code. If you're writing clean, well-styled code, your function, variable, and object names should be fairly self-explanatory.

You've probably heard many times that you should comment your code. So, you wrote things like this:

    # import packages
    import pandas as pd

    # load some data
    df = pd.read_csv('data.csv', skiprows=2)

These are bad comments. They don't add any information. Why is the skiprows parameter set to 2? Are there comment lines at the beginning of data.csv? Something like this might be preferable:

    # the first two lines of data.csv are description text; skip them to avoid errors
    df = pd.read_csv('data.csv', skiprows=2)

It's very important to update your comments as you update your code. Using the example from above, let's say the data source for your CSV has changed and there are no longer any description lines. You modify the read_csv call, but don't remove the comment, which produces:

    # the first two lines of data.csv are description text; skip them to avoid errors
    df = pd.read_csv('data.csv')

Now whoever is reading your code has no idea whether the comment is right or the code is right, which means they have to execute the code to find out. Then they're effectively debugging your code for you, and no one appreciates that.

If you write a function, write a docstring (or whatever your language of choice calls the attribute of a function that describes what it does) that clearly states what the function does, what parameters it takes, and what it returns.
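
As a rough illustration, here is what a docstring might look like for the hypothetical load_sales function sketched earlier, written in the NumPy docstring style (one common convention among several):

    import pandas as pd

    def load_sales(path, year):
        """Load a sales CSV, drop incomplete rows, and tag each row with its year.

        Parameters
        ----------
        path : str
            Path to the CSV file to load.
        year : int
            Year recorded in the new 'year' column.

        Returns
        -------
        pandas.DataFrame
            The cleaned, year-tagged sales data.
        """
        df = pd.read_csv(path).dropna()
        df['year'] = year
        return df

Tools like help() and most documentation generators will pick this text up automatically, so the effort pays off twice.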

Unlike comments, documentation is a document written in English (or whatever language you speak), rather than in a programming language, that explains the purpose of the code you are writing, how it operates, example use cases, who to contact for support, and so on. This can be as simple as a README that sits in the directory where your code lives, or as elaborate as a full-fledged manual that is printed and given to users.

Version Control

In my informal Twitter poll, version control (also known as source or revision control) was the most oft-cited skill that new data scientists need to learn. It is also probably one of the most confusing. In your former life, "version control" probably meant you had a folder somewhere on your hard drive that contained project.py, project2.py, project3.py, project3_final.py, project3_final_do_not_delete_final_revised.py, and so on.

Version control provides a centralized way for many people to work on a common codebase at the same time without writing over each other's work. Each person "checks out" a copy of the code and makes changes to it on a local "branch", which they can then "commit" and "merge" back into the common codebase. There's a lot of specialized vocabulary, but it starts to make (some) sense after a while. Version control also allows you to easily "revert" changes that broke things.

Many people use git as their version control system, although you may also encounter Mercurial (abbreviated hg) and Subversion (abbreviated svn). The terminology and exact workflows differ slightly, but the basic premise is usually the same. All of your code is stored in one or more repositories (repos), and within each repo you may have several branches -- different versions of the code. In each repo, the branch that everyone treats as the starting/reference point is called the master branch. GitHub is a service that hosts repos (both public and private) and provides a common set of tools for interacting with git.

There are three certainties in your life as a data scientist: death, taxes, and an inevitable git clusterfuck. You'll find yourself typing git reset --hard and hitting enter while sighing at least once. That's OK.

If you're not familiar with version control, start now. Install git (it works on pretty much every operating system) and start using it to manage your own code. Commit frequently, write meaningful commit messages (which are just comments), and get to know the system. Create a GitHub account and check your code into a remote repo.

Further reading: Code School.

Testing

There's a good possibility that if you have no formal computer science training, you don't even know what I mean when I say "testing." I'm talking about writing code that checks your code for bugs across a variety of situations. The most basic type of test you can write is called a unit test.

In the past, you probably ran most of your code interactively, either by typing it in line by line or by writing a script and sending portions of it to an interpreter of some kind. You're moving to a position where you may not even be awake when your code runs. Maybe you've built a recommender system and you want to generate recommendations in batch every night for customers that might visit the next day. You write a script that will run at 2am and dump the recommendations into a database.

What happens if the product list that feeds the recommendations has an error and returns too few columns? What if a column that used to be an integer suddenly becomes a floating point value? Do you want to be the one on the hook when there are no recommendations in the database the next day?

You write tests that describe the expected behavior of your code and that fail if that behavior is not produced. I'm working on another post about testing for data scientists, so I won't go into too much detail here, but it's a very important topic. Your code is going to be interacting with other code and messy data. You need to know what will happen when it's run on data that isn't as pristine as the data you're working with now.
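
As a rough sketch of what this looks like in practice, here are two unit tests written for pytest (listed in the further reading below); the top_products function and its data are hypothetical, made up just for this example:

    import pandas as pd
    import pytest

    def top_products(purchases, n=3):
        """Return the n most frequently purchased product ids."""
        return purchases['product_id'].value_counts().head(n).index.tolist()

    def test_top_products_returns_most_frequent_items():
        purchases = pd.DataFrame({'product_id': [1, 1, 2, 2, 2, 3, 4]})
        assert top_products(purchases, n=2) == [2, 1]

    def test_top_products_fails_loudly_on_missing_column():
        bad = pd.DataFrame({'sku': [1, 2, 3]})  # the column we rely on is missing
        with pytest.raises(KeyError):
            top_products(bad, n=2)

Running pytest executes every function whose name starts with test_ and reports which assertions failed, so the 2am batch job described above gets checked long before 2am.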

Further reading: Improve Your Python: Understanding Unit Testing, pytest, Thoughtful Machine Learning

Logging

In the scenario above, your code is running at 2am, and you're not around to see what happens when (and it's definitely when, not if) it breaks. For this you need logging. Logging is just a record of what happened as your code executed. It includes information about which parts of your program executed successfully, which parts didn't, and any other diagnostic information you'd like to include. Like comments, documentation, and testing, it's extra code you have to write in addition to the actual executable code you care about, but it's totally worth it.

When you get to work in the morning and find that your code barfed, you'll want to know what happened without re-running all of the code -- and re-running it isn't even guaranteed to reproduce the error, since it may have been due to a piece of data that has since been corrected. Logging lets you immediately identify the source of the problem (if your logging code is well written, that is) and quickly figure out what to do about it.

For instance, if your logs say the code didn't run because the file containing the products wasn't found, you immediately know to figure out whether the file was deleted, whether the network is down, and so on. If the code partially runs and then fails on a specific product or customer, you can go inspect that particular piece of data and fix your code so it won't happen again.

Disk space is cheap, so log generously. It's a lot easier (and faster) to grep through a big directory of logs than to try to reproduce an unusual error on a large codebase or dataset. Make logging work for you -- log more things than you think you'll need. Be smart about logging when functions are called and when steps in your program are executed.
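
Here is a minimal sketch using Python's standard logging module (the file names are hypothetical); it writes a timestamped record of the nightly run so you can see the next morning exactly how far the job got:

    import logging
    import pandas as pd

    # Send timestamped messages for the nightly batch job to a log file.
    logging.basicConfig(
        filename='recommendations.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s',
    )

    logging.info('Starting nightly recommendation run')
    try:
        products = pd.read_csv('products.csv')  # hypothetical input file
        logging.info('Loaded %d products', len(products))
    except FileNotFoundError:
        logging.error('Product file products.csv not found; aborting run')
        raise

If the product file disappears, the log tells you exactly that, instead of leaving you to rediscover it by rerunning the whole job.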

Further reading: Logging HOWTO (Python)

Conclusion

There are lots of things I didn't cover here:

    • How to conduct code reviews
    • Refactoring code
    • Navigating a *nix terminal, adding your SSH keys, setting up a dev environment
    • Working with distributed resources like AWS
    • IDE Choices
    • Programming paradigms (functional, object-oriented, etc.)

Posts like this one often balloon into a laundry list of skills and languages and make it seem impossible that any one person could ever master all of them. I've tried to avoid that and to focus on things that will help you write better code, interact better with software developers, and ultimately save you time and headaches. You don't need to have them all mastered your first day on the job, and some of them are more important at some companies than at others, but you'll encounter all of them at some point.


Posted on: May 12, 2015

Category: misc
