Big data applications
The Obama administration announced a "Big Data Research and Development Initiative" in March 2012. In response, the National Science Foundation, the National Institutes of Health, the Department of Defense, the Department of Energy and the United States Geological Survey are investing in big data innovation. Many U.S. companies build their business around the ability to acquire and exploit big data, whether as part of their products or of their operational back end. Research groups, governments and the private sector are also generating large datasets at an increasing pace on a variety of topics, including climate change, transportation patterns, health and disease, purchasing behavior, and social behavior observed through interactive social media. Examples of big data applications include:
The partnership between INRIX and the New Jersey Department of Transportation. INRIX collects speed data on major roads using signals from the GPS devices in cars and mobile phones. It then immediately notifies the New Jersey Department of Transportation of any significant road conditions and sends alerts to drivers' in-car GPS devices or mobile phones to warn them of hazards.
The Climate Corporation is a weather insurance company that sells policies covering the gap between federal crop insurance and farmers' weather-related losses. Through a vast network of sensors, the company analyzes and forecasts temperature, precipitation, soil moisture and yield for 20 million plots of U.S. farmland. Using data such as the number of hot days and soil moisture, it builds models to predict how much weather insurance farmers need and how much the company is likely to pay out.
New York State uses a range of big data technologies to assess the impact of climate change on the state and to develop response strategies in areas such as agriculture, public health, energy and transportation. The approach has also been adopted by the U.S. Centers for Disease Control and Prevention, which is working with 10 other states and cities under its "Climate-Ready States and Cities" initiative to study and respond to climate change; big data technology is one of the initiative's most important components.
Open government data
Big data strategies are often built on open government data. Open government data is not an entirely new concept in the United States. Government information and data have changed over the years, and so have the ways in which government data is collected and distributed. Open government data has the potential to generate new scientific research, accelerate economic growth, inform policy formulation, and support new services for the public. Policy choices about open government data will have far-reaching implications for innovation and research on large-scale datasets, for openness and transparency in government, and for many other areas.
As part of the Obama administration's open government plan, the U.S. government launched the data.gov website in 2009 to open "high-value" datasets to the public. The government's open data platform now provides a huge amount of raw government data directly to users and expects them to find new value in it, deepening public understanding of government activities and of complex social issues in ways that were not previously possible. These technical approaches promote the availability, openness and transparency of data, and allow the public, organizations, communities and other members of society to generate new insights from existing data. As a public-facing platform, data.gov can serve as a tool for collaboration, dataset storage and community participation. In addition, these platforms can store and publish data in multiple formats, such as CSV, XML and Excel. Each format has particular characteristics that can restrict or facilitate the use of the data.
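To make the point about formats concrete, the following is a minimal Python sketch, not taken from data.gov, of how the same records require different handling depending on whether they are published as CSV or XML. The sample data, field names and XML layout are all hypothetical and chosen only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the same two records published in two formats.
CSV_TEXT = "station,year,precip_mm\nA1,2011,812.5\nB2,2011,640.0\n"
XML_TEXT = """<records>
  <record station="A1" year="2011"><precip_mm>812.5</precip_mm></record>
  <record station="B2" year="2011"><precip_mm>640.0</precip_mm></record>
</records>"""


def rows_from_csv(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "station": row["station"],
            "year": int(row["year"]),
            "precip_mm": float(row["precip_mm"]),
        }
        for row in reader
    ]


def rows_from_xml(text):
    """Parse the (assumed) XML layout above into the same list-of-dicts shape."""
    root = ET.fromstring(text)
    return [
        {
            "station": rec.get("station"),
            "year": int(rec.get("year")),
            "precip_mm": float(rec.findtext("precip_mm")),
        }
        for rec in root.findall("record")
    ]


if __name__ == "__main__":
    # Both loaders should yield identical records once normalized.
    print(rows_from_csv(CSV_TEXT) == rows_from_xml(XML_TEXT))
```

CSV is flat and needs only a header row, while the XML version needs a reader that knows which values are attributes and which are child elements. That extra knowledge is one way a format can restrict or facilitate reuse.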
Current policy analysis
A key issue for big data and open government data is the set of policies governing the management, use, reuse and accessibility of government information and data. The United States has a complex and evolving information policy framework (laws, regulations and memoranda) that governs the information lifecycle, from creation and dissemination through processing and archiving, and that seeks a balance among data availability, privacy, security, digital asset management, archiving and preservation. Although this framework is still being adjusted, it lags behind technological advances, which raises the question of whether current U.S. policy is sufficient to address the problems posed by big data. The key issues are:
Can we ensure the availability of data? How do we protect privacy in the big data era? How do we ensure the quality and accuracy of data? How do we manage digital assets under current archiving and preservation conditions? Can we develop strong data reuse policies in the big data era?
The following sections analyze in detail the current state of the U.S. information policy framework, and where it lags, in the era of big data and open government data, and offer recommendations for adjusting information policy.
(i) Data availability and distribution
The U.S. Office of Management and Budget provides broad guidance to government agencies on data access and dissemination, and establishes the principle that agencies must publish information to the public in a timely, equitable and effective manner. Agencies must establish and maintain an inventory of their information dissemination products. Agencies must take into account differences in citizens' ability to access important information, including those who lack such access. Agencies should develop a variety of strategies for disseminating information. When agencies use electronic media, the provisions on proper records management and archiving apply equally. Agencies need to assess and determine the most appropriate way to collect and preserve records.
The Office of Management and Budget also provides extensive guidance on information management for government agency websites. It requires agencies to conduct standardized risk assessments of all applications available online and to implement a number of privacy-related measures. Other policy instruments related to information access and dissemination include the following:
1. Agencies must provide appropriate access to information for persons with limited English proficiency, across all "federal programs and activities". The goal of this policy is to close the gap in citizens' use of e-government, especially for those who are not native speakers of English.
2. Persons with disabilities must have equal access to all electronic materials in public education. The government may not exclude persons with disabilities when providing services and benefits or when communicating with the public. Persons with disabilities must be able to participate equally in government activities and to access government information, and their general right to information and to the use of communications technology is established.
3. The accessibility of online information and communication technologies must be promoted and implemented.
4. Electronic and information technologies procured, maintained or used by the federal government must meet specific accessibility criteria to ensure that persons with disabilities can access online information and services.
(ii) Privacy, security, accuracy and archiving
Government websites are becoming two-way communities, which increases the likelihood that viruses or other attack vectors will penetrate government environments, as well as the likelihood of accidental information leaks. The information policy framework has been adapted to address this change. For example, the Office of Management and Budget requires agencies to take adequate security measures to ensure that information is not tampered with and that it remains accurate, confidential and accessible, in line with the expectations of government agencies and the needs of users.
However, current policy cannot guarantee protection against the many potential misuses of big data. Concerns about personally identifiable information, the security of government data and information, and the accuracy of public data all apply to big data. The quality, reliability and authority of big data are major concerns for governments, research groups, non-governmental organizations and the private sector. Data that has not been verified or validated, or low-quality data collected by flawed methods, can lead to erroneous research findings that seriously affect a range of decisions and policies.
Data.gov's data management policy is intended to address these issues. It requires the agencies that collect and distribute data to ensure its accuracy, timeliness and overall quality. Agencies must apply version control so that each dataset is clearly labeled. Agencies must ensure that data released on data.gov does not compromise national security. Agencies must also ensure that released data complies with confidentiality and privacy protection requirements.
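The requirements listed above can be read as a pre-release checklist. The sketch below is only an illustration of that idea in Python: the `DatasetRecord` fields and the `release_problems` function are hypothetical names invented here, not part of any Data.gov system or API.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical metadata an agency might attach to a dataset before release."""
    title: str
    version: str
    last_updated: str  # ISO date string, e.g. "2012-03-29"
    contains_pii: bool
    national_security_cleared: bool


def release_problems(record):
    """Return a list of problems; an empty list means no check failed."""
    problems = []
    if not record.title.strip():
        problems.append("missing title (dataset must be clearly labeled)")
    if not record.version.strip():
        problems.append("missing version (version control is required)")
    if not record.last_updated.strip():
        problems.append("missing last-updated date (timeliness cannot be judged)")
    if record.contains_pii:
        problems.append("contains personally identifiable information")
    if not record.national_security_cleared:
        problems.append("not cleared by national security review")
    return problems


if __name__ == "__main__":
    draft = DatasetRecord("County rainfall", "", "2012-03-29", False, True)
    for problem in release_problems(draft):
        print(problem)
```

A checklist like this can only confirm that the required labels and clearances are present. Judging accuracy and overall quality still depends on the agency's own review.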
Throughout the information lifecycle, the use, storage and preservation of big data pose challenges. Promoting the openness and accessibility of big data is significantly different from promoting its use. There is also a difference between domain-specific uses of data (that is, use only by scientists in a particular field) and broad interdisciplinary uses (that is, applications that cut across fields and shared research areas).
At the same time, specialized repositories for large-scale scientific datasets need to be established. One element of building a data community is the urgent need to consolidate and manage data from different sources and departments. These data flows must converge across governments, the private sector, public utilities, devices and individuals to be truly useful and to inform community and national development. It is therefore necessary for the various entities to establish, adopt and adhere to a formal set of data management standards and practices that ensure data compatibility, consistent naming conventions and consistent organizational structures. In addition, so that researchers can use datasets in an informed way, well-defined data documentation and codebooks need to be produced.
"Aggregation" refers to the combination of data from different websites, and together with big data it makes the information policy environment more complex. The Office of Management and Budget requires agencies' public websites to provide data in open, industry-standard formats so that users can aggregate, disaggregate, manipulate or analyze the data to meet their needs. At present, aggregated data often lacks formal authorization and verification procedures. As the Data.gov website states, once data has been downloaded from an agency's site, the government can no longer vouch for its quality and timeliness. Nor can the U.S. government vouch for any analysis conducted with data retrieved from data.gov. Although this disclaimer limits the liability of Data.gov, two data-use issues still need to be addressed.
The curation of big data is also a problem that cannot be neglected. Digital curation involves maintaining, preserving and adding value to electronic research data throughout its lifecycle, including conceptualizing digital assets, creating, accessing and using them, and appraising and selecting them. As new electronic data assets continue to grow, effective data management strategies are needed for the entire lifecycle of big data.
Finally, in digital "open spaces" such as the data.gov community, there are fewer permanent, final documents, yet almost all records management and archiving work is built around such documents. Now, because of the use of non-governmental third-party applications or software and because data is constantly adjusted and modified, data ownership, retention schedules and archiving face enormous challenges.
Policy and governance principles
As policymakers consider, debate and formulate policy, and as the private sector, the non-profit sector and the government cooperate, it is difficult to legislate for open government data and big data or to develop a single set of policies and governance structures. The government needs a set of guiding principles for data openness and the use of big data technologies. These principles are only a start, not an end. As understanding of big data innovation continues to deepen, a strong policy and governance framework will need to be built and maintained. The guiding principles are:
1. Do no harm. Sharing data among governments, the private sector and civil society may involve private, sensitive personal information, and most of these organizations do not have matching policies for data management, use and reuse. When non-governmental organizations participate in big data cooperation projects, individuals should not be forced or required to have data that governments collected for a specific purpose shared with those organizations.
2. Long-term vision. The long-term sharing, preservation, retrieval and access of data will require a vision that extends beyond current technology. Big data and its derived products must remain accessible in 10 years, 20 years and beyond. Adhering to open data standards and technical standards from the outset can be an effective catalyst for this.
3. Data description. Data elements, units of data collection (for example, individual or community level) and other aspects of the data must be well defined, and data collection and usage policies must be clearly articulated.
4. Take responsibility. Big data has great potential to inform decisions and policy development, but it can also cause harm. Big data typically aggregates multiple datasets that were not originally created to be combined. In big data innovation, governments need to take responsibility for harm caused when others use their data and must ensure that the public is protected.
Policy recommendations
Big data poses a series of problems that the current policy framework cannot address, which calls for a big data governance model. This governance model needs to consider the following specific issues:
Privacy. At the individual, household, device or other level, big data contains a variety of personal information. Privacy laws and policies may conflict with the opportunities big data offers, while big data may violate the privacy of individuals or communities.
Data reuse. Data is usually collected by government agencies or other organizations, generally in connection with the delivery of social services. Individuals, government agencies or companies usually have the right to use data within permitted limits, and privacy policies govern how data is collected and used. As big data applications increasingly combine datasets from different organizations, governments and households to generate new insights and inform policy development, clear guidelines on data use and reuse policies are also needed so that individuals can make informed decisions about their personal data.
Data accuracy. Because new datasets are generated by combining disparate data from different government agencies, researchers, scientists, the private sector and public groups, data quality standards need to be developed and followed. Data collected for one purpose is not necessarily fully compatible with other datasets, which may introduce errors and lead to a series of erroneous conclusions. The disclaimer on the Data.gov website places this responsibility on the agencies that publish the data and on the organizations or individuals that download and use it. Because the use of data can have a huge impact on society, policy and scientific projects, this disclaimer is not an adequate response.
Data access. What policies are in place to manage the availability and retention of newly generated datasets? In addition, the sheer size of big datasets has become an obstacle to public access to government data, so a public data access platform similar to data.gov is needed.
Archiving and preservation. If big data is separated from the technology and analysis platform in which it is embedded, the raw data alone cannot guarantee that similar findings can be reproduced, so preserving both the data and the technology used to analyze it is critical. We must also consider the archiving and long-term preservation of research datasets created by non-governmental organizations, such as universities and research centers funded by government research agencies. An overall data management strategy is needed to ensure that smaller datasets remain available as they become part of big data.
Data curation. One of the main goals of big data innovation is to encourage communities to integrate multiple large datasets to create new knowledge. Big data is not necessarily born big; it often results from accumulating, modifying, merging and processing many small datasets. Each permutation of the data is a new dataset that needs to be archived, managed and curated.
Build sustainable data platforms and architectures. Organizing, curating, storing and opening datasets to scientific groups, the private sector, other sectors and the public requires a strong technical infrastructure. These platforms need to make big data accessible at both the physical (technical) and intellectual (organizational) levels, and they require seamless integration of a range of technologies, analytical skills and information architectures. The infrastructure must be able to support both public-facing general-purpose platforms such as Data.gov and specialized platforms that hold many large datasets for particular agencies.
Establish data standards. Big data requires interoperability at the technical level and adherence to metadata standards at the data level. Different domains may have different metadata standards. The creation, development and release of big datasets should use appropriate standard data formats to promote collaboration and data reuse. Documentation standards also need to be established for externally released documents, and the limitations of the data need to be clearly explained.
Encourage cross-sectoral data-sharing policies. Because big data involves real-time data transfer between different systems, governments and departments, a framework for data sharing and interoperability is needed. As big data innovation advances collaborative analysis technologies, data collection and reporting systems will need to be integrated seamlessly. Information and data policies will in turn have to be adjusted to reflect this integrated data environment.
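The concerns above about data accuracy, metadata standards and interoperability can be illustrated with a small compatibility check run before two datasets are combined. This is a minimal sketch under stated assumptions: each dataset is assumed to ship with a simple column-to-unit description, and the function name, column names and units are hypothetical.

```python
# Hypothetical column -> unit descriptions published alongside two datasets.
AGENCY_A = {"county_fips": "code", "temperature": "celsius"}
AGENCY_B = {"county_fips": "code", "temperature": "fahrenheit", "rainfall": "inch"}


def check_compatible(schema_a, schema_b):
    """Compare two {column: unit} schemas and list reasons not to merge them as-is."""
    issues = []
    shared = sorted(set(schema_a) & set(schema_b))
    if not shared:
        issues.append("no shared columns to join on")
    for column in shared:
        if schema_a[column] != schema_b[column]:
            issues.append(
                f"{column}: unit mismatch ({schema_a[column]} vs {schema_b[column]})"
            )
    return issues


if __name__ == "__main__":
    for issue in check_compatible(AGENCY_A, AGENCY_B):
        print(issue)
```

A check of this kind only works when both publishers document their columns and units in a comparable way, which is the practical argument for shared metadata standards.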
Big data innovation is important for policy formulation and decision-making. It can deepen our understanding of major scientific and social challenges, promote cooperation among governments, citizens and enterprises, and usher in a new era of e-government services. However, we also need to address a range of policy issues related to managing big data, including privacy, accuracy, accessibility, equity and preservation, and to build a holistic model of big data governance.