Stanford Temporal Tagger (SUTime) - a library that recognizes and normalizes time expressions.
Stanford SPIED - pattern-based entity learning: starting from a seed set, it learns entities of a target class from unlabeled text in an iterative, bootstrapped fashion.
Stanford Topic Modeling Toolbox - a topic modeling tool for social scientists and other researchers who want to analyze their datasets.
twitter-text-java - a Java implementation of Twitter's text processing library.
MALLET - a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, and more.
Spark SQL lets users run SQL queries on Spark data with traditional BI and visualization tools. Users can also use Spark SQL to perform ETL on data in different formats (such as JSON, Parquet, and databases), transform it, and then expose it to specific queries.
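To make this concrete, here is a minimal ETL sketch, assuming Spark 2.x with SparkSession; the HDFS paths, view name, and column names are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JsonToParquetEtl").getOrCreate()

// Read semi-structured JSON (hypothetical path and columns).
val events = spark.read.json("hdfs:///data/raw/events.json")
events.createOrReplaceTempView("events")

// Reshape the data with SQL before handing it to BI/visualization tools.
val summary = spark.sql(
  "SELECT userId, eventType, COUNT(*) AS cnt FROM events GROUP BY userId, eventType")

// Persist the result as Parquet for downstream queries.
summary.write.mode("overwrite").parquet("hdfs:///data/warehouse/events_summary")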
Spark MLlib:
MLlib is a scalable Spark machine learning library consisting of common learning algorithms and utilities, including binary classification, linear regression, clustering, collaborative filtering, and dimensionality reduction.
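As a minimal sketch of binary classification with MLlib's RDD-based API (the LIBSVM-formatted input path and the train/test split are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("MLlibBinaryClassification"))

// Load LIBSVM-formatted sample data (hypothetical path) and split it.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_binary.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// Train a binary (two-class) logistic regression model.
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// Evaluate simple accuracy on the held-out set.
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
println(s"Test accuracy = $accuracy")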
the product of matrices B and C. Unlike the original interaction matrix, these two factor matrices are dense, and each column of B can be interpreted as one latent aspect of the items. 4. User k's preference for item h is then obtained by multiplying row k of matrix B with column h of matrix C.
Usage Scenarios
When there is no attribute data for either the users or the recommended items, and only the historical interaction data between users and items is available, this matrix factorization approach can be used to refine the user-item relationship and produce recommendations, as shown in the sketch below.
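For exactly this scenario, here is a hedged sketch using MLlib's ALS matrix factorization; the ratings path, file format, and parameter values are assumptions for illustration, not tuned settings.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(new SparkConf().setAppName("ALSRecommendation"))

// Each line is "userId,itemId,rating" (hypothetical path and format).
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// rank = 10 latent factors, 10 iterations, regularization 0.01;
// rank is the shared dimension of the factor matrices B and C discussed above.
val model = ALS.train(ratings, 10, 10, 0.01)

// The predicted preference of user k for item h is the dot product of
// row k of the user-factor matrix and column h of the item-factor matrix.
println(model.predict(42, 7))  // hypothetical user and item ids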
, generated automatically by Scala, and some functions are inherited from Product; 6. case class constructor parameters are public, so we can access them directly; 7. case classes support pattern matching. Breeze: in addition, the DenseVector used above is actually the DenseVector class from the Breeze library. LinearRegressionWithSGD: this is the linear regression implementation inside Spark, based on stochastic gradient descent; similar linear regression algorithms available in MLlib include RidgeRegressionWithSGD and LassoWithSGD.
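A small sketch of LinearRegressionWithSGD on a hand-made dataset; the feature values, iteration count, and step size are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val sc = new SparkContext(new SparkConf().setAppName("LinearRegressionSGD"))

// LabeledPoint is itself a case class, so it supports pattern matching
// and direct field access as described above.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.2)),
  LabeledPoint(2.0, Vectors.dense(1.0, 2.1)),
  LabeledPoint(3.0, Vectors.dense(1.5, 3.3))
)).cache()

// Train with stochastic gradient descent: 100 iterations, step size 0.1.
val model = LinearRegressionWithSGD.train(data, 100, 0.1)
println(model.weights)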
granularity, doing some things and deliberately not doing others. Other Spark features: as mentioned in the Spark ecosystem overview above, in addition to the core RDD, Spark has several very useful components built on top of it:
Spark SQL: a Hive-like implementation of SQL queries on top of RDDs
Spark Streaming: stream computing, providing real-time computation capabilities similar to Storm
MLlib: a machine learning library that provides common classification, clustering, and regression algorithms
a Kafka streaming source that leverages Kafka's replay capability to provide more reliable delivery semantics without a write-ahead log configuration. For applications that require strong consistency, it also provides primitives that implement exactly-once guarantees. On top of the Kafka support, version 1.3 also adds a Python API and the primitives that back it. New algorithms in MLlib: Spark 1.3 also provides a number of new algorithms, among them Latent Dirichlet Allocation (LDA).
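A hedged sketch of the direct (receiver-less) Kafka source introduced around Spark 1.3, using the spark-streaming-kafka integration for Kafka 0.8; the broker address and topic name are hypothetical.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaDirect"), Seconds(5))

// Direct stream: offsets are tracked by Spark itself, enabling exactly-once
// processing without a write-ahead log (broker and topic are placeholders).
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()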
Summary: The advent of Apache Spark has made it possible for ordinary people to have big data and real-time data analysis capabilities. In view of this, this article leads everyone to learn Spark quickly through hands-on demonstrations. This article is the first part of a four-part Apache Spark primer series.
Spark consists of the following major components.
Spark Core: RDDs and their operators.
Spark SQL: DataFrame and SQL.
Spark ML (MLlib): machine learning framework.
Spark Streaming: real-time computing framework.
Spark GraphX: graph computation framework.
PySpark (SparkR): Python and R frameworks on top of Spark.
From offline computation with RDDs to real-time stream computing, and from DataFrame and SQL support to machine learning and graph frameworks, Spark's scope keeps expanding.
-- Deep learning framework. Caffe is a modular and expressive deep learning framework built with speed in mind. It is released under the BSD 2-clause license and has supported many community projects and industrial applications in research, startup prototyping, and vision, speech, and multimedia. Official website: http://caffe.berkeleyvision.org/ 3. H2O -- distributed machine learning framework. H2O is an open-source, fast, scalable, and distributed machine learning framework, together with the algorithms the framework provides.
a variety of resource scheduling systems (distributed deployments), mainly: standalone deploy mode, Amazon EC2, Apache Mesos, and Hadoop YARN.
Spark can run on top of Hadoop (using Hadoop's HDFS as the storage file system and Hadoop's YARN as the resource scheduling system), but Spark can also run completely outside of Hadoop, for example using Red Hat's GlusterFS as the storage file system and Apache Mesos as the resource scheduling system. In other words, Spark is not strictly a part of the Hadoop ecosystem.
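A small sketch of how the choice of resource scheduler shows up in code through the master URL; the hostnames below are hypothetical, and in practice the master is usually supplied to spark-submit rather than hard-coded.

import org.apache.spark.{SparkConf, SparkContext}

// "local[*]"                    -> run locally on all cores
// "spark://master:7077"         -> Spark standalone cluster
// "yarn"                        -> Hadoop YARN (typically with HDFS for storage)
// "mesos://zk://zk1:2181/mesos" -> Apache Mesos
val conf = new SparkConf()
  .setAppName("DeployModeExample")
  .setMaster("spark://master:7077")  // swap for "yarn" or a mesos:// URL as needed

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())
sc.stop()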
Spark can be divided into the following layers.
1 Spark basics
1.1 Understanding the Spark ecosystem, installation, and deployment
Installation and deployment: a brief introduction to Spark, compiling Spark from source, Spark standalone installation, Spark standalone HA installation, and the Spark application deployment tool spark-submit.
The Spark ecosystem: Spark (in-memory computing framework), Spark Streaming (stream computing framework), Spark SQL (ad-hoc queries).
2. Treelink binary classification performance test
2.1 Lab purpose and data preparation
Transaction history data is used to predict, through Treelink, a seller's transaction status, i.e. whether a deal occurs, which converts the task into a binary classification problem.
Data format: target feature1 feature2 ... feature13, where target has only two values: 1 indicates a deal and 0 indicates no deal, and each sample is described by 13 features. In the sample data, the number of positive samples (target = 1) is 23285 and the number of negative samples is 20430, a ratio of about 1.14:1. All sample data is split into a training set and a test set.
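Treelink is a gradient boosted decision tree (GBDT) style model; as an open-source stand-in (this uses MLlib's GradientBoostedTrees, not Treelink itself), here is a hedged sketch that trains on data in the target feature1 ... feature13 format, with a hypothetical path and illustrative settings.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val sc = new SparkContext(new SparkConf().setAppName("DealClassification"))

// Each line is "target feature1 ... feature13" (hypothetical path).
val samples = sc.textFile("hdfs:///data/deals.txt").map { line =>
  val parts = line.split("\\s+").map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}
val Array(training, test) = samples.randomSplit(Array(0.7, 0.3), seed = 1L)

// Boosted trees for binary classification; 100 iterations is illustrative.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 100

val model = GradientBoostedTrees.train(training, boostingStrategy)
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
println(s"Test accuracy = $accuracy")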
Abstract: This article mainly describes how TalkingData gradually introduced Spark in the process of building its big data platform, and built a mobile big data platform based on Hadoop YARN and Spark. Spark is now widely recognized and supported in China: the 2014 Spark Summit China in Beijing drew a packed house, and in the same year Spark Meetups were held in four cities, Beijing, Shanghai, Shenzhen, and Hangzhou, with Beijing alone successfully hosting five of them; the content covered many areas, including Spark Core.
Spark Core implements the basic functionality of Spark, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of the resilient distributed dataset (RDD): an RDD represents a collection of elements distributed across multiple compute nodes that can be operated on in parallel, and it is Spark's main programming abstraction.
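A minimal sketch of that abstraction: creating an RDD, transforming it in parallel, and reducing it to a single result (the numbers and partition count are arbitrary).

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RddBasics"))

// An RDD is a distributed collection, partitioned across the cluster.
val numbers = sc.parallelize(1 to 1000, 4)

// Transformations (map) and actions (reduce) run in parallel on the partitions.
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(s"Sum of squares = $sumOfSquares")

sc.stop()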
Spark SQL: Spark SQL is the package Spark uses to work with structured data; with Spark SQL, we can query data using SQL or Apache Hive's HQL.
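A small sketch showing the same aggregation written once in SQL and once with the DataFrame API, assuming Spark 2.x; the Parquet path and column names are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SqlVsDataFrame").getOrCreate()

// Hypothetical Parquet dataset with columns (seller, amount).
val orders = spark.read.parquet("hdfs:///data/orders.parquet")
orders.createOrReplaceTempView("orders")

// The same aggregation, expressed in SQL and with the DataFrame API.
val bySql = spark.sql("SELECT seller, SUM(amount) AS total FROM orders GROUP BY seller")
val byApi = orders.groupBy("seller").agg(sum("amount").as("total"))

bySql.show(5)
byApi.show(5)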
Spark Machine Learning
1 Online learning
The model keeps updating itself as new data arrives, rather than being retrained from scratch as in offline training; a sketch combining online learning with Spark Streaming follows the list below.
2 Spark Streaming
Discretized stream (DStream)
Input sources: Akka actors, message queues, Flume, Kafka, ... (see http://spark.apache.org/docs/latest/streaming-programming-guide.html)
Lineage: the set of transformation and action operators that have been applied to an RDD
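Here is the sketch referred to above: it combines online learning with Spark Streaming using MLlib's StreamingLinearRegressionWithSGD, so the model is updated on every batch; the socket source, input format, and feature dimension are assumptions for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("OnlineRegression"), Seconds(10))

// Lines arriving on the socket look like "label,f1 f2 f3" (host/port are placeholders).
val training = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(label, features) = line.split(",")
  LabeledPoint(label.toDouble, Vectors.dense(features.split(" ").map(_.toDouble)))
}

// The model is updated batch by batch instead of being retrained from scratch.
val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(3))
model.trainOn(training)
model.predictOnValues(training.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()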
3 MLlib + Streaming application