Enterprises routinely collect terabytes of security-relevant data (such as network events, software application events, and personnel activity events) for regulatory compliance and post hoc forensic analysis. Large enterprises generate on the order of billions of events per day, depending on their size, and these volumes keep growing as enterprises enable more event-logging sources, hire more employees, deploy more devices, and run more software. Unfortunately, this volume and variety of data quickly becomes overwhelming: existing analytical techniques do not scale to such large datasets and typically produce so many false positives that their effectiveness is undermined. The problem worsens as enterprises move to cloud architectures and collect ever more data.
Big data analytics, the large-scale analysis and processing of information, is already in active use in several fields and, in recent years, has attracted the interest of the security community for its promise to analyze and correlate security-related data efficiently and at a scale not previously possible. However, the difference between traditional data analysis and big data analytics is not obvious for security. After all, the information security community has been analyzing network traffic, system logs, and other information sources for more than a decade to identify threats and detect malicious activity, and it is not clear how these traditional approaches differ from big data.
To address this and related issues, the Cloud Security Alliance (CSA) formed a Big Data Working Group in 2012. The working group is made up of volunteers from industry and academia who identify the principles, best practices, and challenges in this area. Its most recent report, "Big Data Analytics for Security Intelligence", focuses on the role of big data in security. The report describes in detail how the arrival and widespread adoption of new tools that exploit large volumes of structured and unstructured data has changed the field of security analytics, lists some fundamental differences from traditional analytics, and points out possible research directions. We summarize some of the report's key points here.
The evolution of big data analytics
Data-driven information security dates back to bank fraud detection and anomaly-based intrusion detection systems (IDSs). Although analyzing logs, network flows, and system events for forensics and intrusion detection has occupied the information security community for more than a decade, traditional technologies have not always been adequate for long-term, large-scale analytics, for several reasons. First, retaining large quantities of data used to be economically infeasible, so in traditional infrastructures most event logs and other records of computer activity were deleted after a fixed retention period, such as 60 days. Second, performing analytics and complex queries on large, unstructured datasets that are incomplete and noisy was inefficient. For example, several popular security information and event management (SIEM) tools do not support the analysis and management of unstructured data and are rigidly bound to predefined schemas. Because big data applications can effectively clean, prepare, and query heterogeneous, incomplete, and noisy data, however, they are beginning to become part of security management software as well (see the sketch below). Finally, managing large data warehouses has traditionally been expensive, and their deployment usually requires a strong business case. Now that the Hadoop framework and other big data tools are commoditizing large-scale, reliable cluster deployments, new opportunities for data processing and analytics are emerging.
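The shift from rigid, predefined schemas to tooling that ingests heterogeneous, noisy records is easiest to see in code. The following minimal Python sketch (the log formats, field names, and schema are hypothetical, not taken from any particular SIEM or big data product) shows the kind of normalization step a big data pipeline performs before events from different sources can be queried together.

```python
import json
import re

# Two hypothetical raw formats: a firewall syslog line and a JSON application log.
RAW_EVENTS = [
    "Jun 12 08:15:03 fw01 DROP src=10.0.0.5 dst=203.0.113.9 dport=445",
    '{"ts": "2024-06-12T08:15:04Z", "app": "crm", "user": "alice", "action": "login_failed"}',
]

FW_PATTERN = re.compile(
    r"(?P<month>\w{3}) (?P<day>\d+) (?P<time>[\d:]+) (?P<host>\S+) "
    r"(?P<action>\w+) src=(?P<src>\S+) dst=(?P<dst>\S+) dport=(?P<dport>\d+)"
)

def normalize(raw):
    """Map a raw record of unknown format onto one queryable schema."""
    if raw.lstrip().startswith("{"):
        rec = json.loads(raw)
        return {"source": "app", "time": rec["ts"], "actor": rec.get("user"),
                "action": rec["action"], "detail": rec.get("app")}
    m = FW_PATTERN.match(raw)
    if m:
        return {"source": "firewall", "time": f"{m['month']} {m['day']} {m['time']}",
                "actor": m["src"], "action": m["action"].lower(),
                "detail": f"{m['dst']}:{m['dport']}"}
    # Keep unparsed records rather than dropping them; noisy data is expected.
    return {"source": "unknown", "time": None, "actor": None,
            "action": "unparsed", "detail": raw}

if __name__ == "__main__":
    for raw in RAW_EVENTS:
        print(normalize(raw))
```

In a real deployment this normalization would run at ingestion time over the full event stream; the point of the sketch is only that unparsed and partially parsed records are retained in a common schema rather than rejected.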
Fraud detection is one of the most visible uses of big data analytics: credit card and telephone companies have conducted large-scale fraud detection for decades; however, the custom-built infrastructure required to mine big data for fraud detection was not economical for most other organizations to adopt. A major impact of big data technologies is that they make it affordable for businesses in many industries to build infrastructure for security monitoring.
In particular, new big data technologies, such as the Hadoop ecosystem (including Pig, Hive, Mahout, and RHadoop), stream mining, complex event processing, and NoSQL databases, make it possible to analyze large-scale, heterogeneous datasets at unprecedented scale and speed. These technologies are transforming security analytics by facilitating the storage, maintenance, and analysis of security information. For example, the WINE platform1 and BotCloud2 use MapReduce to process data efficiently for security analysis (a minimal sketch of this style of job follows this paragraph). We can identify some of these trends by looking at how security tools have changed over the past decade. With the proliferation of IDS sensors, network monitoring probes and logging tools were deployed across corporate networks; however, managing the alerts from these disparate data sources became a challenging task. In response, security vendors began developing SIEMs, which aim to aggregate and correlate alerts and other network statistics and present all this information to security analysts through a dashboard. Big data tools now take this further, consolidating, correlating, and summarizing even more disparate data sources over longer time horizons, thereby improving the information available to security analysts.
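As a concrete illustration of the MapReduce style of processing mentioned above, here is a minimal sketch in the spirit of a Hadoop Streaming job: a Python mapper and reducer that count failed logins per source IP. The input schema (tab-separated timestamp, source IP, event type) is an assumption for illustration, not the interface of WINE, BotCloud, or any specific product.

```python
#!/usr/bin/env python3
"""Hadoop Streaming-style mapper and reducer (a minimal sketch).

Counts failed-login events per source IP. Local simulation:
    python3 mr_failed_logins.py map < events.tsv | sort | python3 mr_failed_logins.py reduce
"""
import sys

def mapper(lines):
    # Emit "src_ip<TAB>1" for every failed-login record.
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[2] == "login_failed":
            print(f"{parts[1]}\t1")

def reducer(lines):
    # Input arrives sorted by key, so counts can be summed in one pass.
    current_ip, count = None, 0
    for line in lines:
        ip, value = line.rstrip("\n").split("\t")
        if ip != current_ip:
            if current_ip is not None:
                print(f"{current_ip}\t{count}")
            current_ip, count = ip, 0
        count += int(value)
    if current_ip is not None:
        print(f"{current_ip}\t{count}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```

The same logic scales from a laptop to a cluster because the framework, not the analyst's code, handles distribution, sorting, and fault tolerance.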
A case study recently presented by Zions Bancorporation shows the concrete benefits of big data tools. It found that the volume of data and the number of events it had to analyze were more than its traditional SIEM could handle (searching a month's load of data took between 20 minutes and an hour). In its new Hadoop system, running queries with Hive, the same results come back in roughly one minute.3 The security data warehouse driving this implementation lets users mine meaningful security information not only from firewalls and security devices, but also from website traffic, business processes, and other day-to-day transactions. Incorporating unstructured data and many different datasets into a single analysis framework is one of the hallmarks of big data. Big data tools are also particularly well suited as the foundation for detecting and investigating advanced persistent threats (APTs).4,5 APTs operate in a low-and-slow mode (that is, execution is inconspicuous and stretched over a long time); as a result, they can persist for long periods while the victim remains unaware of the intrusion. Detecting these attacks requires collecting and correlating large quantities of dispersed data, including internal data sources and externally shared threat intelligence, and performing long-term historical correlation to incorporate after-the-fact knowledge of attacks that occurred earlier in the network's history (see the sketch below).
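As a hedged illustration of that long-term correlation, the following Python/pandas sketch joins a made-up month of internal connection records against an externally shared threat-intelligence feed and flags hosts that repeatedly contact known-bad infrastructure. All table names, fields, and values are assumptions; a production system would run the equivalent query in Hive or a similar warehouse rather than in memory.

```python
import pandas as pd

# Hypothetical month of outbound connection records from internal sources.
conns = pd.DataFrame({
    "day":      ["2024-06-01", "2024-06-02", "2024-06-30", "2024-06-30"],
    "src_host": ["hr-laptop-17", "hr-laptop-17", "build-srv-02", "hr-laptop-17"],
    "dst_ip":   ["203.0.113.9", "198.51.100.4", "203.0.113.9", "203.0.113.9"],
})

# Hypothetical externally shared threat-intelligence feed of known C2 addresses.
threat_intel = pd.DataFrame({
    "dst_ip": ["203.0.113.9"],
    "label":  ["known-C2"],
})

# Join internal history against external intelligence, then look for hosts
# that contact flagged infrastructure repeatedly across the whole window --
# the kind of low-and-slow pattern a short retention period would miss.
hits = conns.merge(threat_intel, on="dst_ip", how="inner")
suspects = (hits.groupby("src_host")
                .agg(first_seen=("day", "min"), last_seen=("day", "max"),
                     contacts=("dst_ip", "count"))
                .query("contacts > 1"))
print(suspects)
```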
Challenges
Although big data analytics holds significant promise for addressing security problems, several challenges must be met before it can realize its true potential. Particularly important is sharing data across the industry without violating the privacy principle of data reuse, which holds that data may be used only for the purpose for which it was collected. Until recently, privacy relied to a large extent on technological limits on the ability to extract, analyze, and correlate potentially sensitive datasets. The development of big data analytics, however, gives us the tools to extract and correlate such data, making privacy violations easier. We must therefore develop big data applications with an understanding of privacy regulations and recommended practices. Privacy rules do exist in some sectors: in the United States, for example, the Federal Communications Commission works with telecommunications companies, the Health Insurance Portability and Accountability Act (HIPAA) addresses medical data, several state public utility commissions limit the use of smart grid data, and the Federal Trade Commission is developing guidelines for web activity. But these rules are being stretched to cover new systems and are in many cases subject to differing interpretations. Even where privacy laws exist, we must recognize that such large-scale data collection and storage will attract attention from many quarters, including industry (which uses our information for marketing and advertising), government (which will argue that the data is necessary for national security or law enforcement), and criminals (who would like to steal our identities). Therefore, as architects and designers of big data applications, we should proactively build safeguards against the misuse of these large data stores.
Another challenge is data provenance. Because big data lets us broaden the range of data sources we process, it is hard to be certain that each source meets the level of trustworthiness our analysis algorithms need to produce accurate results. We therefore need to reconsider the authenticity and integrity of the data our tools consume. We can draw on ideas from adversarial machine learning and robust statistics to identify and mitigate the effects of maliciously injected data (a small sketch follows).
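One concrete direction from robust statistics is to score data points against estimators that a small amount of injected data cannot easily shift. The sketch below (plain Python; the feature and threshold are hypothetical) uses the median and the median absolute deviation instead of the mean and standard deviation to flag suspicious records.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag points far from the median, measured in MAD units (a robust z-score).

    The median and the median absolute deviation (MAD) are much harder for an
    adversary to shift by injecting a handful of extreme records than the mean
    and standard deviation are.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical feature: daily outbound gigabytes reported by one log source,
# with two suspiciously large values that might be malicious insertions.
daily_gb = [1.9, 2.1, 2.0, 2.3, 1.8, 2.2, 40.0, 2.0, 2.1, 55.0]
print(mad_outliers(daily_gb))   # -> [40.0, 55.0]
```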
This CSA report focuses on applying big data analytics to security, but the other side of the coin is using security technology to protect big data itself. As big data tools are increasingly deployed in enterprise systems, we will need not only to adapt traditional security mechanisms, such as integrating transport-layer security within Hadoop, but also to introduce new tools, such as Apache Accumulo, to address the security problems unique to big data management.
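To make that last point more concrete, the toy Python sketch below mimics the idea behind Accumulo-style cell-level visibility: each stored cell carries a label, and a scan returns only the cells the caller is authorized to read. The data model and the simple any-of token check here are illustrative simplifications, not Accumulo's actual API or its richer visibility-expression language.

```python
# Toy cell store: every cell carries a set of authorization tokens; an empty
# set means the cell is unrestricted. Rows, columns, and tokens are made up.
CELLS = [
    {"row": "host-42", "column": "av:alert",    "visibility": {"secops"},
     "value": "trojan.gen detected"},
    {"row": "host-42", "column": "hr:owner",    "visibility": {"hr", "secops"},
     "value": "alice"},
    {"row": "host-42", "column": "net:traffic", "visibility": set(),
     "value": "1.2 GB outbound"},
]

def scan(cells, authorizations):
    """Yield only the cells the caller's authorizations allow them to see."""
    for cell in cells:
        required = cell["visibility"]
        # Unrestricted cells pass; otherwise the caller must hold at least
        # one of the required tokens (a simplified any-of policy).
        if not required or required & authorizations:
            yield cell["row"], cell["column"], cell["value"]

if __name__ == "__main__":
    print(list(scan(CELLS, {"secops"})))   # sees alert, owner, and traffic
    print(list(scan(CELLS, {"audit"})))    # sees only the unrestricted cell
```

The design point is that access control is enforced at the level of individual cells inside the data store itself, rather than only at the perimeter of the analytics platform.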
Finally, an area the report does not cover, but that needs further development, is human-computer interaction, and in particular visual analytics, to help security analysts interpret query results. Visual analytics is the science of facilitating reasoning and analysis through interactive visual interfaces. Human-computer interaction has received less attention in big data than the technical mechanisms developed for efficient computation and storage, but it is also essential if big data analytics is to deliver on its promise, because its goal is to convey information to a human through the most effective representation. Big data is changing the landscape of security technologies for network monitoring, SIEM, and forensics. However, in the never-ending arms race between attackers and defenders, big data is not a panacea, and security researchers must keep exploring new ways to contain sophisticated attackers. Big data will also make it a constant challenge to maintain control over leaks of personal information. We therefore need greater efforts to train a new generation of computer scientists and engineers with privacy-preserving values, and to work with them to develop tools for designing big data systems that follow commonly accepted privacy guidelines.