Seven Python Tools All Data Scientists Should Know


If you're an aspiring data scientist, you're inquisitive: always exploring, learning, and asking questions. Online tutorials and videos can help you prepare for your first role, but the best way to ensure you're ready to be a data scientist is to make sure you're fluent in the tools people use in the industry.

I asked our data science faculty to put together seven Python tools that they think all data scientists should know how to use. The Galvanize Data Science and GalvanizeU programs both focus on making sure students spend ample time immersed in these technologies, and investing the time to gain a deep understanding of these tools will give you a major advantage when you apply for your first job. Check them out below:

IPython

IPython is a command shell for interactive computing in multiple programming languages. Originally developed for the Python programming language, it offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython provides the following features:

    • Powerful interactive shells (terminal and Qt-based)
    • A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media
    • Interactive data visualization and use of GUI toolkits
    • Flexible, embeddable interpreters to load into one's own projects
    • Performance tools for parallel computing
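The embeddable-interpreter feature above is easy to demonstrate. With IPython installed, `from IPython import embed; embed()` drops a full shell into a running program; the sketch below uses the standard-library `code` module to show the same pattern without assuming IPython is available:

```python
import code

# With IPython installed, an embedded shell is one call away:
#   from IPython import embed; embed()
# The standard-library `code` module illustrates the same idea:
# an interpreter that runs inside, and shares state with, your program.
interp = code.InteractiveInterpreter(locals={"x": 21})
interp.runsource("y = x * 2")    # execute a line as the shell would
print(interp.locals["y"])        # the interpreter shares our namespace -> 42
```

IPython's embedded shell adds tab completion, introspection (`obj?`), and history on top of this basic mechanism.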

Contributed by Nir Kaldero, Director of Science, Head of Galvanize Experts

GraphLab Create

GraphLab Create is a Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products.

Here are a few of the features of GraphLab Create:

    • Ability to analyze terabyte-scale data at interactive speeds, on your desktop
    • A single platform for tabular data, graphs, text, and images
    • State-of-the-art machine learning algorithms, including deep learning, boosted trees, and factorization machines
    • Run the same code on your laptop or in a distributed system, using a Hadoop YARN or EC2 cluster
    • Focus on tasks rather than machine learning details with the flexible API
    • Easily deploy data products in the cloud using Predictive Services
    • Visualize data for exploration and production monitoring

Contributed by Benjamin Skrainka, Lead Data Science Instructor at Galvanize

Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate. Pandas does not implement significant modeling functionality outside of linear and panel regression; for that, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but pandas brings it much closer to that goal.
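A minimal sketch of that workflow, filtering and aggregating a made-up toy dataset (the column names here are purely illustrative):

```python
import pandas as pd

# A small, made-up dataset: the column names are illustrative only.
df = pd.DataFrame({
    "city": ["SF", "SF", "NYC", "NYC"],
    "sales": [100, 150, 200, 50],
})

# Munging and analysis in one pipeline: filter, group, aggregate.
totals = df[df["sales"] > 75].groupby("city")["sales"].sum()
print(totals)  # SF: 250, NYC: 200
```

The same chained filter/group/aggregate style scales from toy frames like this one to datasets with millions of rows.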

Contributed by Nir Kaldero, Director of Science, Head of Galvanize Experts

PuLP

Linear programming is a type of optimization in which an objective function is maximized (or minimized) subject to some constraints. PuLP is a linear programming modeler written in Python. PuLP can generate LP files and call highly optimized solvers, GLPK, COIN CLP/CBC, CPLEX, and Gurobi, to solve these linear problems.
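As a rough illustration of the modeler's API, here is a tiny, made-up LP; PuLP ships with the CBC solver, so no external solver needs to be configured:

```python
from pulp import LpMaximize, LpProblem, LpVariable, value

# Maximize 3x + 2y subject to x + y <= 4 and x <= 2, with x, y >= 0.
prob = LpProblem("tiny_lp", LpMaximize)
x = LpVariable("x", lowBound=0)
y = LpVariable("y", lowBound=0)
prob += 3 * x + 2 * y    # objective function
prob += x + y <= 4       # constraint
prob += x <= 2           # constraint
prob.solve()             # uses the bundled CBC solver by default
print(value(x), value(y))  # optimum at x = 2, y = 2
```

Pushing x to its bound of 2 (the larger objective coefficient) and filling the rest of the x + y budget with y gives the maximum objective value of 10.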

Contributed by Isaac Laughlin, Data Science Instructor at Galvanize

matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with just a few lines of code.

For simple plotting, the pyplot interface provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
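For instance, a complete figure takes only a handful of lines; this sketch uses the non-interactive Agg backend so it runs anywhere, even without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to files, no window needed
import matplotlib.pyplot as plt

# The object-oriented interface: a Figure containing one Axes.
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png", dpi=150)  # hardcopy output
```

Swapping `fig.savefig(...)` for `plt.show()` in an interactive session pops up the same figure in a window instead.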

Contributed by Mike Tamir, Chief Science Officer at Galvanize

Scikit-learn

Scikit-learn is a simple and efficient tool for data mining and data analysis. What's great about it is that it's accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. Scikit-learn is also open source and commercially usable under the BSD license. Scikit-learn has the following features:

    • Classification – identifying which category an object belongs to
    • Regression – predicting a continuous-valued attribute associated with an object
    • Clustering – automatic grouping of similar objects into sets
    • Dimensionality reduction – reducing the number of random variables to consider
    • Model selection – comparing, validating, and choosing parameters and models
    • Preprocessing – feature extraction and normalization
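Classification, the first item above, can be sketched in a few lines with one of scikit-learn's bundled datasets; the estimator choice here (a random forest) is just one of many that share the same fit/predict API:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Classification on the bundled iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)                # learn from the training split
print(clf.score(X_test, y_test))         # accuracy on held-out data
```

Because every estimator follows the same `fit`/`predict`/`score` convention, swapping in a different model is usually a one-line change.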

Contributed by Isaac Laughlin, Data Science Instructor at Galvanize

Spark

Spark consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file on the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
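Since a running Spark cluster can't be assumed here, the plain-Python stand-in below (the `MiniRDD` class is invented purely for illustration) mimics how an RDD's elements are partitioned, transformed per partition, and reduced in two stages, first within each partition and then across the partial results. In PySpark the equivalent would be `sc.parallelize(data).map(f).reduce(g)`:

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for Spark's RDD: elements split into partitions,
    operated on partition by partition. Not real Spark."""

    def __init__(self, elements, num_partitions=2):
        n = max(1, len(elements) // num_partitions)
        self.partitions = [elements[i:i + n]
                           for i in range(0, len(elements), n)]

    def map(self, f):
        # Each partition is transformed independently, as each
        # cluster node would transform its local slice of the data.
        out = MiniRDD([])
        out.partitions = [[f(x) for x in p] for p in self.partitions]
        return out

    def reduce(self, f):
        # Reduce each partition locally, then combine the partial
        # results, mirroring how Spark aggregates across nodes.
        partials = [reduce(f, p) for p in self.partitions if p]
        return reduce(f, partials)

rdd = MiniRDD([1, 2, 3, 4, 5], num_partitions=2)
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))  # 55
```

The two-stage reduce is the key idea: only small partial results, not whole partitions, cross node boundaries.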

Contributed by Benjamin Skrainka, Lead Data Science Instructor at Galvanize

Still hungry for more data science? Enter our Data Science Giveaway for a chance to win tickets to awesome conferences like PyData Seattle and the Data Science Summit, or get discounts on Python resources like Effective Python and Data Science from Scratch.

