Big Data Glossary
The emergence of big data has brought with it many new terms that can be hard to understand. This article therefore provides a glossary of frequently used big data terms to help you gain a deeper understanding. Some of the definitions refer to related blog articles. Of course, this glossary does not cover every term; if you notice any omissions, please let us know.
A
- Aggregation - the process of searching for, merging, and presenting data
- Algorithm - a mathematical formula or set of steps used to analyze data
- Analytics - the discovery of meaningful insights in data
- Anomaly detection - the search for data items in a dataset that do not match an expected pattern or behavior. Besides "anomalies", such items are also called outliers, exceptions, surprises, or contaminants, and they often carry key, actionable information (see the sketch after this list)
- Anonymization - making data anonymous by removing all information that could link it to an individual
- Application - computer software that performs a specific function
- Artificial Intelligence - the research and development of intelligent machines and software that can perceive their environment, respond as required, and even learn on their own
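
To make the anomaly detection entry above concrete, here is a minimal sketch that flags values lying far from the mean of a sample, using a z-score rule of thumb; the readings and the threshold are invented for illustration and are not tied to any particular product.

```python
import statistics

def find_anomalies(values, threshold=2.5):
    """Return values whose z-score exceeds the threshold.

    A very simple anomaly detector: points far from the mean
    (measured in standard deviations) are treated as outliers.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Example: a sensor reading of 480 stands out against typical values around 20.
readings = [18, 21, 19, 22, 20, 480, 19, 21, 20, 18, 22, 19]
print(find_anomalies(readings))  # -> [480]
```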
B
- Behavioural Analytics - an analysis method that draws conclusions from the "how", "why", and "what" of user behavior rather than simply from who acted and when; it focuses on the human patterns hidden in the data
- Big Data Scientist - a person who can design big data algorithms that turn big data into something useful
- Big data startup - a young company that develops new big data technology
- Biometrics - identification of individuals based on their physical characteristics
- Brontobyte (BB) - approximately 1000 YB (yottabytes), sometimes described as the size of the digital universe of the future; written out in bytes, one brontobyte has 27 zeros
- Business Intelligence - a set of theories, methodologies, and processes that make data easier to understand
C
- Classification analysis - a systematic process for obtaining important and relevant information about data; such data about data is also called metadata
- Cloud computing - a distributed computing system built on the network, in which data is stored in off-site data centers (the "cloud")
- Clustering analysis - grouping similar objects together into clusters; this method is used to analyze the differences and similarities between data
- Cold data storage - storing old, rarely used data on low-power servers; retrieving such data takes comparatively long
- Comparative analysis
- Complex structured data - data composed of two or more complex, interrelated parts that cannot be simply parsed by structured query languages or tools (SQL)
- Computer generated data - data generated by computers, such as log files
- Concurrency - executing multiple tasks or running multiple processes at the same time
- Correlation analysis - a data analysis method used to determine whether there is a positive or negative correlation between variables (see the sketch after this list)
- CRM (Customer Relationship Management) - technology used to manage sales and business processes; big data affects a company's customer relationship management strategy
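
As a small illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient between two made-up variables: values near +1 indicate a positive correlation, values near -1 a negative one, and values near 0 little or no linear correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical example: advertising spend vs. units sold.
ad_spend   = [10, 20, 30, 40, 50]
units_sold = [12, 25, 31, 45, 48]
print(round(pearson(ad_spend, units_sold), 3))  # close to +1 -> strong positive correlation
```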
D
- Dashboard - a graphical presentation, typically in charts, of the results produced by data analysis algorithms, designed for quick reading
- Data aggregation tools - tools that combine data scattered across multiple sources into a single new data source
- Data analyst - a professional who analyzes, models, cleans, and processes data
- Database - a repository that stores data sets using a specific technology
- Database-as-a-Service - a database hosted in the cloud and billed on a pay-per-use basis, for example on Amazon Web Services (AWS)
- Database Management System (DBMS) - software that collects and stores data and provides access to it
- Data center - a physical location housing the servers on which data is stored
- Data cleansing - the process of reviewing and verifying data in order to remove duplicates, correct errors, and ensure consistency (see the sketch after this list)
- Data custodian - a technical professional responsible for maintaining the technical environment required for data storage
- Data ethical guidelines
- Data feed - a stream of data, such as a Twitter feed or an RSS feed
- Data marketplace
- Data mining - the process of discovering specific patterns or information in a dataset
- Data modelling - analyzing data objects using data modeling techniques in order to gain insight into what the data means
- Data set - a collection of data, often a large amount
- Data virtualization - a data integration process used to gain more insight from data; it typically draws together other technologies such as databases, applications, file systems, web pages, and big data tools
- De-identification - also known as anonymization; ensuring that individuals cannot be identified from the data
- Discriminant analysis - a statistical method that classifies data: based on known information about certain groups or clusters in the data, it derives classification rules and assigns items to different groups, categories, or directories
- Distributed File System - a system that offers a simplified and highly available way to store, analyze, and process data
- Document Store Databases - also known as document-oriented databases, these are databases designed specifically to store, manage, and retrieve document data; such data is also called semi-structured data
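
To illustrate data cleansing, the following sketch removes duplicate customer records and normalizes an email field; the record layout and field names are assumptions made for the example.

```python
def clean_records(records):
    """Deduplicate records and normalize the email field.

    Cleansing here means: strip whitespace, lowercase emails,
    drop rows without an email, and keep only the first record per email.
    """
    seen = set()
    cleaned = []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if not email or email in seen:
            continue  # skip empty or duplicate entries
        seen.add(email)
        cleaned.append({**rec, "email": email})
    return cleaned

raw = [
    {"name": "Ann",  "email": "Ann@Example.com "},
    {"name": "Ann",  "email": "ann@example.com"},   # duplicate after normalization
    {"name": "Bob",  "email": ""},                  # missing email, dropped
    {"name": "Cleo", "email": "cleo@example.com"},
]
print(clean_records(raw))  # two records remain: Ann and Cleo
```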
E
- Exploratory analysis - exploring data for patterns without a standard process or method; a way to discover the main characteristics of data and datasets
- Exabyte (EB) - approximately 1000 PB (petabytes), or about 1 million TB. The world currently produces roughly 1 EB of new information every day
- Extract, Transform and Load (ETL) - a process for populating databases and data warehouses: data is extracted (E) from various sources, transformed (T) to meet business needs, and finally loaded (L) into the database (see the sketch below)
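
The following minimal sketch, built only on the Python standard library, illustrates the ETL idea: extract rows from a CSV source, transform them (convert types and filter), and load them into a SQLite table. The file names, column names, and filter rule are assumptions made for the example.

```python
import csv
import sqlite3

def etl(csv_path, db_path):
    # Extract: read raw rows from the CSV source (assumed columns: region, amount).
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: convert amounts to numbers and keep only positive sales.
    cleaned = [(r["region"], float(r["amount"])) for r in rows if float(r["amount"]) > 0]

    # Load: write the transformed rows into the target database table.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# Create a tiny example source file, then run the ETL into a local SQLite file.
with open("sales.csv", "w", newline="") as f:
    csv.writer(f).writerows([["region", "amount"], ["EU", "120.5"], ["US", "-3"], ["APAC", "87.0"]])
etl("sales.csv", "warehouse.db")
```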
F
- Failover - when a server in a system fails, running tasks are automatically switched over to another available server or node
- Fault-tolerant design - a system with a fault-tolerant design can continue running even when part of it fails
G
- Gamification - applying game thinking and game mechanics in non-game contexts; it is very effective and can be used to create and collect data in a very friendly way
- Graph Databases - databases that store data in a graph structure (such as finite ordered pairs or entities) made up of nodes, edges, and properties; they offer index-free adjacency, meaning every element is directly linked to its adjacent elements
- Grid computing - connecting many computers in different locations, usually over the cloud, to work together on a specific problem
H
- Hadoop - an open-source framework for distributed systems, used to develop distributed programs for big data computation and storage
- Hadoop database (HBase) - an open-source, non-relational, distributed database used together with the Hadoop framework
- HDFS (Hadoop Distributed File System) - a distributed file system designed to run on commodity hardware
- HPC (High-Performance Computing) - using supercomputers to solve extremely complex computing problems
I
- In-memory database (IMDB) - a database management system that, unlike conventional systems, stores data in main memory rather than on disk; its strength is high-speed data processing and access
- IoT (Internet of Things) - equipping ordinary devices with sensors so that they can connect to the network anytime, anywhere
J
- Juridical data compliance - relevant when you use a cloud computing solution that stores your data in different countries or on different continents; it concerns whether data stored in one country complies with that country's laws
K
- Key-value database - a database in which each data record is stored under, and retrieved by, a specific key, which makes lookups simple and fast; the values stored are usually basic data types from programming languages (see the sketch below)
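
As a sketch of the key-value idea, the class below implements a tiny in-memory key-value store in which every record is addressed by its key; real key-value databases follow the same put/get/delete pattern while adding persistence and distribution. The class and example keys are invented for illustration.

```python
class KeyValueStore:
    """A minimal in-memory key-value store: every record is looked up by its key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1001", {"name": "Ann", "country": "NL"})
print(store.get("user:1001"))   # fast lookup directly by key
store.delete("user:1001")
```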
L
- Latency - a measure of the time delay in a system
- Legacy system - an old application, old technology, or old computing system that is no longer supported
- Load balancing - distributing workload across multiple computers or servers to achieve the best results and maximum utilization of the system
- Location data - geographical position information, such as GPS data
- Log file - a file automatically created by a computer system to record what happens while the system runs
M
- M2M data (Machine-to-Machine data) - data communicated and transmitted between two or more machines
- Machine data - data generated by sensors or by algorithms running on machines
- Machine learning - a branch of artificial intelligence in which machines learn from the tasks they perform and improve themselves over time
- MapReduce - a software framework for processing large-scale data in two steps: Map (distributing the work) and Reduce (combining the partial results); see the sketch after this list
- Massively Parallel Processing (MPP) - processing the same computing task simultaneously on many processors (or many computers)
- Metadata - data that describes the attributes of other data (data about the data)
- MongoDB - an open-source non-relational (NoSQL) database
- Multi-Dimensional Databases - databases optimized for Online Analytical Processing (OLAP) applications and data warehouses
- MultiValue Databases - non-relational (NoSQL), special multi-dimensional databases that can handle three-dimensional data; aimed mainly at very long strings, they handle HTML and XML strings well
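
The classic way to illustrate MapReduce is a word count: the map step emits (word, 1) pairs, the pairs are grouped by key, and the reduce step sums the counts per word. The sketch below runs everything in a single process purely to show the data flow; a real framework such as Hadoop distributes the same steps across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```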
N
- Natural Language Processing - a branch of computer science that studies how computers can interact with human language
- Network analysis - analyzing the relationships between nodes in a network or graph, that is, the connections between nodes and their strength
- NewSQL - an elegant, well-defined class of database systems, newer than NoSQL, designed to be easier to learn and use than traditional SQL systems
- NoSQL - as the name implies, databases that do not use SQL; the term covers databases other than traditional relational databases, designed to handle data at very large scale and with high concurrency
O
- Object Database - (also called an object-oriented database) stores data in the form of objects, as used in object-oriented programming; unlike relational databases and graph databases, most object databases offer a query language that lets objects be accessed with declarative programming
- Object-based Image Analysis - whereas ordinary digital image analysis works on individual pixels, object-based image analysis works only on groups of related pixels, called objects or image objects
- Operational Databases - databases that support an organization's day-to-day operations and are essential to the business; they generally use online transaction processing, allowing users to access, collect, and retrieve company-specific information
- Optimization analysis - an algorithm-driven optimization process used during the product design cycle, in which companies design many product variants and test whether they meet preset targets
- Ontology - a knowledge representation, originally a philosophical concept, used to define the set of concepts in a domain and the relationships between them. (Translator's note: data has been raised to the philosophical level, given the meaning of a world ontology, and has become an independent, objective data world)
- Outlier detection - an outlier is an object that deviates significantly from the overall average of a dataset or a combination of data; it lies far from the other objects, so its appearance may signal a problem in the system, and it should be analyzed separately
P
- Pattern Recognition - using algorithms to identify patterns in data and to make predictions about new data from the same source
- Petabyte (PB) - approximately 1000 TB (terabytes), or about 1 million GB (gigabytes). The Large Hadron Collider at CERN produces roughly 1 PB of particle data per second
- Platform as a Service (PaaS) - a service that provides all the underlying platform components needed for a cloud computing solution
- Predictive analysis - one of the most valuable analysis methods in big data; it helps predict the future (near-term) behavior of an individual, for example which products a person may buy, which websites they may visit, or what actions they may take. It identifies risks and opportunities by combining different datasets, such as historical data, transaction data, social data, or customers' personal information
- Privacy - keeping data that can identify a person separate from other data in order to protect users' privacy
- Public data - public information or public data sets created with public funding
Q
- Quantified Self - using applications to track one's own daily movements and activities in order to better understand one's behavior
- Query - asking a question of the data in order to retrieve an answer
R
- Re-identification - combining several datasets in order to identify individuals from anonymized data
- Regression analysis - determines how one variable depends on another; this method assumes a one-way causal relationship between the two variables (see the sketch after this list)
- RFID (Radio Frequency Identification) - uses contactless radio-frequency electromagnetic fields to transmit data
- Real-time data - data that is created, processed, stored, analyzed, and displayed within milliseconds
- Recommendation engine - an algorithm that recommends products to users based on their previous purchases or on the purchasing behavior of others
- Routing analysis - finding the optimal route for a means of transport by analyzing multiple variables, in order to reduce fuel costs and improve efficiency
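
As a minimal sketch of regression analysis, the code below fits a simple linear regression (ordinary least squares) of one dependent variable on one independent variable and uses the fitted line to predict a new value; the data is invented for illustration.

```python
def linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical example: hours of machine use vs. energy consumed (kWh).
hours  = [1, 2, 3, 4, 5]
energy = [2.1, 4.0, 6.2, 7.9, 10.1]
slope, intercept = linear_regression(hours, energy)
print(f"predicted energy for 6 hours: {slope * 6 + intercept:.1f} kWh")
```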
S
- Semi-structured data - data that does not follow the strict storage structure of structured data, but uses tags or other markers to preserve a hierarchical structure
- Sentiment Analysis - using algorithms to determine how people feel about certain topics
- Signal analysis - assessing the performance of a product by measuring physical quantities that vary over time or space, in particular sensor data
- Similarity searches - querying a database for the objects most similar to a given object; the objects can be of any data type
- Simulation analysis - simulating the operation of a process or system as it would run in the real world; the simulation can vary many different factors to ensure optimal product performance
- Smart grid - using sensors in an energy network to monitor its operating state in real time, which helps improve efficiency
- Software as a Service (SaaS) - web-based application software used through a browser
- Spatial analysis - analyzing spatial data, such as geographic or topological information, to identify patterns and distributions of data in geographic space
- SQL - a programming language for retrieving data from relational databases (see the sketch after this list)
- Structured data - data that can be organized into rows and columns and clearly identified; it usually consists of records, files, or correctly labeled fields and can be located precisely
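
To make the SQL entry concrete, the sketch below creates a small relational table and retrieves data from it with an SQL query, using Python's built-in sqlite3 module; the table name and columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory relational database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "NL"), (2, "Bob", "US"), (3, "Cleo", "NL")],
)

# SQL query: retrieve all customers from the Netherlands, newest id first.
for row in conn.execute(
    "SELECT id, name FROM customers WHERE country = ? ORDER BY id DESC", ("NL",)
):
    print(row)  # (3, 'Cleo') then (1, 'Ann')
conn.close()
```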
T
- Terabyte (TB) - approximately 1000 GB (gigabytes); 1 TB can hold about 300 hours of HD video
- Time series analysis - analyzing well-defined data obtained by repeated measurements over time; the data must be taken from successive points in time at equal intervals
- Topological Data Analysis - focuses on three main points: composite data models, cluster identification, and data statistics
- Transactional data - dynamic data that changes over time
- Transparency - consumers want to know what happens to their data and how it is processed, and organizations are making this information transparent
U
- Unstructured data - generally thought of as large amounts of free text, which may also contain dates, numbers, and facts
V
- Value - all this available data can create great value for organizations, society, and consumers; large enterprises and entire industries stand to benefit from big data
- Variability - the meaning of data can change, and change fast; for example, the same word may have different meanings in different tweets
- Variety - data comes in many different forms, such as structured, semi-structured, and unstructured data, and even complex structured data
- Velocity - in the big data era, data must be created, stored, analyzed, and visualized at high speed
- Veracity - organizations must ensure that their data is accurate so that their analyses are correct; veracity refers to the correctness of the data
- Visualization - raw data only becomes usable once it is correctly visualized; "visualization" here does not mean ordinary bar or pie charts, but complex charts that contain large amounts of information while remaining easy to understand and read
- Volume - (one of the 4 Vs of big data) the amount of data, ranging from megabytes to brontobytes
W
- Weather data - an important open public data source; combined with other data sources it can give organizations a relevant basis for in-depth analysis
X
- XML Databases - databases that store data in XML format; they are usually linked to document-oriented databases, and developers can query, export, and serialize XML database data in specified formats
Y
- Yottabyte (YB) - approximately 1000 ZB (zettabytes), or roughly the data capacity of 250 trillion DVDs. Today the entire digital universe holds about 1 YB of data, and this amount doubles roughly every 18 months
Z
- Zettabyte (ZB) - approximately 1000 EB (exabytes), or about 1 billion TB. It is predicted that by 2016 roughly 1 ZB of data will flow across global networks every day
Appendix: storage capacity unit calculation table
1 Bit = 1 binary digit
8 Bits = 1 Byte (bytes)
1,000 Bytes = 1 Kilobyte
1,000 Kilobytes = 1 Megabyte
1,000 Megabytes = 1 Gigabyte
1,000 Gigabytes = 1 Terabyte
1,000 Terabytes = 1 Petabyte
1,000 Petabytes = 1 Exabyte
1,000 Exabytes = 1 Zettabyte
1,000 Zettabytes = 1 Yottabyte
1,000 Yottabytes = 1 Brontobyte
1,000 Brontobytes = 1 Geopbyte
Original reference: An Extensive Glossary Of Big Data Terminology