When reprinting, please cite the source: http://blog.csdn.net/hsluoyc/article/details/43977779
This article discusses Spark security threats and modeling methods, drawing on the official documentation, related papers, and industry companies and products. The details are as follows.
Chapter 1 Official documentation [1]
Currently, Spark supports authentication through a shared secret. The spark.authenticate parameter controls whether Spark's communication protocols authenticate using this shared secret. The authentication is a basic handshake that checks that both parties hold the same shared secret before they are allowed to communicate; if the secrets differ, communication is refused. The shared secret is generated as follows:
- In YARN deployment mode, Spark generates and distributes the shared secret automatically. Each application uses its own unique shared secret;
- In all other deployment modes, the spark.authenticate.secret parameter must be configured on every node. This secret is used by all masters, workers, and applications;
- Note: the Netty shuffle path (spark.shuffle.use.netty) is still experimental and is not secure. Do not use Netty for shuffles in a production environment.
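The shared-secret idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the HMAC challenge-response shown here is a stand-in for "both sides must hold the same secret" and is not Spark's actual wire protocol, and the helper names (make_secret, spark_auth_conf, respond, verify) are hypothetical. Only the spark.authenticate and spark.authenticate.secret parameter names come from the text.

```python
import hashlib
import hmac
import secrets


def make_secret(nbytes=32):
    """Generate a random shared secret as a hex string."""
    return secrets.token_hex(nbytes)


def spark_auth_conf(secret):
    """Render spark-defaults.conf style lines that enable authentication."""
    return "\n".join([
        "spark.authenticate true",
        "spark.authenticate.secret " + secret,
    ])


def respond(secret, challenge):
    """Prove knowledge of the secret without sending it: HMAC the challenge."""
    return hmac.new(secret.encode(), challenge, hashlib.sha256).hexdigest()


def verify(secret, challenge, response):
    """Accept the peer only if its response matches what our secret produces."""
    return hmac.compare_digest(respond(secret, challenge), response)


# Illustrative use: one secret shared by both sides of a connection.
shared = make_secret()
challenge = secrets.token_bytes(16)
same_secret_ok = verify(shared, challenge, respond(shared, challenge))
other_secret_ok = verify(shared, challenge, respond(make_secret(), challenge))
```

With matching secrets the check succeeds; a peer configured with a different secret is rejected, which mirrors the behavior described for nodes whose spark.authenticate.secret values differ.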
1.1 Web UI security
The Spark web UI can be secured with javax servlet filters, which are enabled through the spark.ui.filters parameter. A user who does not want others to see his or her data can secure the UI in this way. The servlet filter authenticates the user; once the user is logged in, Spark compares that user against the view access control list (ACL) to make sure the user is authorized to view the UI. Note that the user who started the application always has view access to its UI. In YARN mode, the Spark UI uses the standard YARN web application proxy mechanism and authenticates through any installed Hadoop filters.
Spark also supports modify ACLs, which control which users may modify a running Spark application, including killing an application or a task. These are configured with the spark.acls.enable and spark.modify.acls parameters. In YARN mode, the modify ACLs are passed in and control who has modify access through the YARN interfaces.
Spark allows administrators to be listed in the ACLs; they always have permission to view and modify all applications. This is configured with the spark.admin.acls parameter. It is useful on a shared cluster where administrators or support staff help users debug problematic applications.
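The view, modify, and admin rules described in this section can be modeled as a small permission check. The following is a simplified sketch, not Spark's SecurityManager code; the function names and arguments are hypothetical, and treating the application owner as always allowed to modify is an assumption (the text only states this for viewing).

```python
def can_view(user, owner, view_acls=(), admin_acls=(), acls_enabled=True):
    """Return True if `user` may view the application's UI."""
    if not acls_enabled:
        # With ACL checking disabled, everyone is allowed.
        return True
    # The owner can always view; admins can view every application.
    return user == owner or user in view_acls or user in admin_acls


def can_modify(user, owner, modify_acls=(), admin_acls=(), acls_enabled=True):
    """Return True if `user` may modify (for example, kill) the application."""
    if not acls_enabled:
        return True
    # Assumption: the owner may modify their own application.
    return user == owner or user in modify_acls or user in admin_acls


# Hypothetical users on a shared cluster.
view_acls = {"bob"}
modify_acls = set()
admin_acls = {"ops"}

bob_views = can_view("bob", "alice", view_acls, admin_acls)
bob_kills = can_modify("bob", "alice", modify_acls, admin_acls)
ops_kills = can_modify("ops", "alice", modify_acls, admin_acls)
```

In this example bob can look at alice's UI but cannot kill her job, while the administrator ops can do both.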
1.2 Event logging security
If event logging is enabled, the directory where the event logs are stored (set with the spark.eventLog.dir parameter) must be created manually beforehand and Spark must be granted access to it. To keep the log files secure, set the permissions of this directory to drwxrwxrwxt, that is, world-writable with the sticky bit set (mode 1777). The owner of the directory should be the superuser who runs the history server, and group permissions should be restricted to the superuser group. With these settings, all users can write to the directory, but unprivileged users cannot remove or rename files they do not own. Event log files are then created and modified only by the owning user and by Spark, which keeps them secure.
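The directory setup described above can be sketched with the Python standard library. The path below is a hypothetical temporary location used only so the example runs anywhere; a real deployment would use the directory named by spark.eventLog.dir and would also set its owner and group, which this sketch does not attempt.

```python
import os
import stat
import tempfile


def make_event_log_dir(path):
    """Create the event log directory with mode 1777 (sticky, world-writable)."""
    os.makedirs(path, exist_ok=True)
    # chmod is applied explicitly because makedirs is subject to the umask.
    os.chmod(path, 0o1777)
    return stat.S_IMODE(os.stat(path).st_mode)


# Hypothetical location standing in for the spark.eventLog.dir directory.
base = tempfile.mkdtemp()
log_dir = os.path.join(base, "spark-events")
mode = make_event_log_dir(log_dir)
has_sticky_bit = bool(mode & stat.S_ISVTX)
```

The sticky bit is what prevents one user from removing or renaming another user's log files even though everyone can write to the directory.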
1.3 Network port security
Spark makes heavy use of the network, and some environments have strict firewall requirements. The following lists the main ports that Spark uses for communication and how to configure them.
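For firewalled environments, one common approach is to pin ports to fixed values so that firewall rules can name them. The sketch below renders such settings as configuration lines. The parameter names spark.ui.port, spark.driver.port, and spark.blockManager.port are Spark configuration keys, but the values 7078 and 7079 and the helper functions are hypothetical choices made for illustration.

```python
# Hypothetical fixed port assignments; 4040 is the usual default for the UI.
FIXED_PORTS = {
    "spark.ui.port": 4040,
    "spark.driver.port": 7078,
    "spark.blockManager.port": 7079,
}


def validate_ports(settings):
    """Check that every port is unprivileged, in range, and unique."""
    for name, port in settings.items():
        if not 1024 <= port <= 65535:
            raise ValueError("%s: port %d out of range" % (name, port))
    ports = list(settings.values())
    if len(set(ports)) != len(ports):
        raise ValueError("duplicate port assignment")
    return sorted(ports)


def render_conf(settings):
    """Render the settings as spark-defaults.conf style lines."""
    return "\n".join("%s %d" % (k, v) for k, v in sorted(settings.items()))


ports_to_open = validate_ports(FIXED_PORTS)
conf_text = render_conf(FIXED_PORTS)
```

The sorted list of ports is what a firewall rule set would need to allow between the driver and the cluster nodes under these assumptions.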
1.3.1 Standalone mode
1.3.2 Cluster manager (such as YARN) based mode
More detailed usage information is available under the security configuration parameters on the configuration page, and the implementation details of security management can be found in org.apache.spark.SecurityManager in the source code.
Chapter 2 Related papers
Currently, no academic papers are dedicated to Spark security; some papers merely touch on security issues. Representative examples follow:
According to [2], in a multimedia computing framework users store and process their multimedia application data in a distributed manner, which avoids installing large multimedia application software. Multimedia processing in the cloud poses major challenges in the following areas: content-based multimedia retrieval systems, distributed and complex data processing, cloud-based QoS support, multimedia cloud transmission protocols, multimedia cloud overlay networks, multimedia cloud security, P2P cloud-based multimedia services, and so on. Spark Streaming supports large-scale streaming data processing, and its security threats have much in common with those of multimedia cloud security. Because multimedia data such as video is often highly private, identity authentication is required when Spark Streaming is used to process multimedia data, and the data should be encrypted and transmitted using protocols such as RTMP [3].
According to [2], Spark and other in-memory computing platforms need to rely on distributed, or even third-party, services and infrastructure to store important data or perform critical operations, which poses a major challenge for monitoring and protecting dynamic data. With traditional MapReduce-based security mechanisms, it is enough to protect static datasets on disk. In Spark, by contrast, data is held in memory and often changes dynamically; the changes include changes to the data schema, changes to attributes, and newly added data. Effective privacy protection therefore has to be achieved in this more complex environment.
According to [4], security is very important in graph computing systems, yet existing research pays little attention to it. One likely problem is that the premise that all network nodes fully comply with the protocol is only an assumption: Byzantine faults may occur, so a mechanism is needed to detect and repair node failures and link failures. Spark GraphX is likewise a framework for graph computing and graph mining, so these concerns are relevant to it as well.
Chapter 3 Industry companies and products
3.1 DataStax
DataStax has launched DataStax Enterprise (DSE, latest version 4.6) [5], a commercial data analytics platform based on Apache Cassandra and Spark, which hardens the security of the original open-source Spark. Its security features include:
1) support for built-in encryption and authentication, as well as integration of trusted third-party security packages (such as Kerberos and LDAP) with DataStax Enterprise;
2) transparent data auditing and client-to-node encryption;
3) OpsCenter tools that improve manageability, such as simpler configuration, fine-grained control over backup and recovery, and better diagnostics;
4) password authentication when accessing the Cassandra database from Spark and Shark [6];
5) simple object permission management based on the GRANT/REVOKE model of relational databases.
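The GRANT/REVOKE style of object permission management mentioned in item 5 can be pictured as a table of (user, permission, resource) entries. The sketch below is a generic illustration of that model, not DataStax code; the class name, method names, and the example keyspace and table names are all hypothetical.

```python
class PermissionTable:
    """A minimal grant/revoke permission store."""

    def __init__(self):
        # Each entry is a (user, permission, resource) triple.
        self._grants = set()

    def grant(self, user, permission, resource):
        """Record that `user` holds `permission` on `resource`."""
        self._grants.add((user, permission, resource))

    def revoke(self, user, permission, resource):
        """Remove the permission; revoking an absent grant is a no-op."""
        self._grants.discard((user, permission, resource))

    def is_allowed(self, user, permission, resource):
        """Return True only if a matching grant currently exists."""
        return (user, permission, resource) in self._grants


# Hypothetical user and table names.
perms = PermissionTable()
perms.grant("analyst", "SELECT", "sales.orders")
before_revoke = perms.is_allowed("analyst", "SELECT", "sales.orders")
perms.revoke("analyst", "SELECT", "sales.orders")
after_revoke = perms.is_allowed("analyst", "SELECT", "sales.orders")
```

Access is denied by default and allowed only while an explicit grant exists, which is the essence of the relational GRANT/REVOKE model.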
3.2 Sqrrl
Sqrrl is a company specializing in secure big data platforms. Founded in 2012 and headquartered in Cambridge, Massachusetts, Sqrrl builds its products around Apache Accumulo, an open-source NoSQL database based on Bigtable (a big data technology developed by Google) that was originally developed by the National Security Agency (NSA) and later spun off as an open-source project.
Adam Fuchs, co-founder and chief technology officer of Sqrrl, is also one of the co-founders of Apache Accumulo. As of the ECP release date, Sqrrl had raised $2 million from venture capital firms such as Atlas Venture and Matrix Partners.
Sqrrl Enterprise [7] is a secure, scalable platform for developing real-time analytics applications. It uses the GraphX graph computing engine in Spark to construct and analyze dynamic object graphs [8], so Sqrrl Enterprise is a commercial product built on Spark.
"Using Sqrrl Enterprise and the GraphX library included in Apache Spark, we will construct a dynamic graph of entities and relationships that will allow us to build baseline patterns of normalcy, flag anomalies on the fly, and analyze the context of an event."
Sqrrl Enterprise's security features include [9]:
1) cell-level security enforcement: every time a user tries to perform an operation on the data, the system evaluates the visibility label carried by the data;
2) data labeling engine: based on user-defined rules, the system can automatically label data fields;
3) policy specification engine: based on predefined policies, the system can automatically grant users or user groups permission to access specific visibility labels. Acting as a policy decision point (PDP), the policy engine provides real-time evaluation and supports RBAC and ABAC policies;
4) encryption: the system can encrypt data at rest and data in motion, supports third-party encryption algorithms and libraries, and integrates seamlessly with third-party key management systems;
5) secure search: search indexes can leak data. The system enforces term-level security and ensures that the index complies with the security policies of the data elements;
6) audit: the system automatically generates tamper-resistant logs that record all actions and can be used for compliance verification, alerting, and digital forensics.
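Cell-level enforcement with visibility labels (item 1) can be illustrated with a small evaluator: each cell carries a boolean expression over labels, and a user sees the cell only if the user's authorizations satisfy it. This is a simplified sketch in the spirit of Accumulo-style visibility expressions, not Sqrrl's or Accumulo's implementation. The function names are hypothetical, and letting & bind tighter than | when no parentheses are given is an assumption of this sketch.

```python
import re

# A token is a label (letters, digits, underscore) or one of & | ( ).
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a visibility expression into labels and operators."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError("bad visibility expression at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, auths):
    """Return True if the authorizations in `auths` satisfy `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an unlabeled cell is visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse, even if result is already True
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError("unexpected token %r" % tok)
        return tok in auths

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# Hypothetical label and two users with different authorizations.
label = "admin&(audit|ops)"
operator_sees = is_visible(label, {"admin", "ops"})
guest_sees = is_visible(label, {"ops"})
```

A user holding admin and ops satisfies the label, while a user holding only ops does not, so the same stored cell is returned to one query and filtered out of the other.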
With the release of the latest version, Sqrrl Enterprise 2.0, Sqrrl moves from limited release to general availability. Sqrrl Enterprise also provides more advanced security tools built on Apache Accumulo, enhanced analytics, and features such as JSON support. The new analytics capabilities include full-text search (using Apache Lucene), SQL, statistics, and graph search.
Chapter 4 References
[1] http://spark.apache.org/docs/1.2.0/security.html
[2] Ji, Changqing, et al. "Big Data Processing: Big Challenges and Opportunities." Journal of Interconnection Networks 13.03n04 (2012).
[3] Chang, Qian, Zehong Yang, and Yixu Song. "A Scalable custom elive Video Streaming System Based on RTMP and HTTP Transmissions." Advanced Research on Computer Science and Information Engineering. Springer Berlin Heidelberg, 2011. 113-118.
[4] Rahimian, Fatemeh. "Gossip-based Algorithms for Information Dissemination and Graph Clustering." (2014).
[5] http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
[6] http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkPwdAppl.html
[7] http://sqrrl.com/product/sqrrl-enterprise/
[8] http://sqrrl.com/resources/
[9] http://sqrrl.com/product/data-centric-security/
Spark security threats and modeling methods