"Data Cleansing" 2013-Data quality and data cleansing methods

Source: Internet
Author: User

  • Classification of data quality problems
  • This article mainly discusses data quality problems at the instance level.

  • Data quality assessment (12 dimensions)
  • 1) Data specifications: a measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data;
    2) Data integrity fundamentals: a measure of the existence, validity, structure, content, and other basic characteristics of the data;
    3) Duplication: a measure of unwanted duplication of a particular field, record, or data set within or across systems;
    4) Accuracy: a measure of the correctness of the data content;
    5) Consistency and synchronization: a measure of the equivalence of information stored or used in different data stores, applications, and systems, and of the processes used to make the data equivalent;
    6) Timeliness and availability: a measure of how current the data is and whether it is available to a particular application within the expected time frame;
    7) Ease of use and maintainability: a measure of the degree to which data can be accessed and used, and the degree to which it can be updated, maintained, and managed;
    8) Data coverage: a measure of the availability and comprehensiveness of the data relative to the total data population or to all objects of interest;
    9) Presentation quality: a measure of how effectively information is presented to users and collected from them;
    10) Perception, relevance, and trust: a measure of how data quality is perceived and how much confidence users place in it, as well as the importance, usefulness, and relevance of the data to business needs;
    11) Data decay: a measure of the rate of negative change in the data;
    12) Transactability: a measure of the degree to which the data produces the expected business transaction or result.
    When assessing data quality for a project, first select the dimensions that fit the project, then draw up an assessment plan for each selected dimension, choose a suitable assessment method, and finally combine and analyze the results across all dimensions.
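The assessment process described above (select dimensions, score each one, then combine the results) can be sketched roughly as follows. This is only an illustration: the sample records, the two metric functions, the `plan` structure, and the weights are all assumptions made for the example, and the source does not prescribe any particular metric or weighting.

```python
# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 3, "email": "a@example.com", "age": 34},
    {"id": 4, "email": "d@example.com", "age": None},
]

def completeness(rows, fields):
    """Share of non-missing cells across the given fields."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return filled / total if total else 1.0

def uniqueness(rows, key_fields):
    """Share of distinct key combinations among all rows (1.0 means no duplicates)."""
    keys = [tuple(r.get(f) for f in key_fields) for r in rows]
    return len(set(keys)) / len(keys) if keys else 1.0

def assess(rows, plan):
    """Score every selected dimension, then combine the scores as a weighted mean."""
    scores = {name: metric(rows) for name, (metric, _weight) in plan.items()}
    total_weight = sum(weight for _metric, weight in plan.values())
    overall = sum(scores[name] * weight
                  for name, (_metric, weight) in plan.items()) / total_weight
    return scores, overall

# The assessment plan: one (metric, weight) pair per selected dimension.
# Both the choice of dimensions and the weights are illustrative assumptions.
plan = {
    "completeness": (lambda rows: completeness(rows, ["email", "age"]), 2.0),
    "duplication": (lambda rows: uniqueness(rows, ["email", "age"]), 1.0),
}

scores, overall = assess(records, plan)
print(scores, overall)
```

A weighted mean is just one way to combine per-dimension results; a project might equally report each dimension separately against its own threshold.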

  • Cleaning methods
  • 1) Missing data handling
    2) Similar-duplicate detection
    3) Abnormal data (outlier) handling
    4) Logic error detection
    5) Inconsistent data handling
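The source only names the five cleaning steps, so the sketch below fills each one in with a common, simple technique chosen for illustration: median imputation for missing values, string similarity from the standard-library `difflib` for similar duplicates, a median-absolute-deviation score for abnormal values, a single rule check for logic errors, and case/whitespace normalization for inconsistent values. The sample rows, field names, thresholds, and function names are all hypothetical.

```python
import statistics
from difflib import SequenceMatcher

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"name": "Zhang Wei", "city": "beijing", "age": 34, "start": 2010, "end": 2015},
    {"name": "Zhang  Wei", "city": "Beijing", "age": None, "start": 2011, "end": 2009},
    {"name": "Li Na", "city": "Shanghai", "age": 29, "start": 2012, "end": 2016},
    {"name": "Wang Fang", "city": "SHANGHAI ", "age": 31, "start": 2013, "end": 2018},
    {"name": "Chen Jie", "city": "Beijing", "age": 240, "start": 2014, "end": 2019},
]

def fill_missing(rows, field):
    """1) Missing data: impute the median of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    median = statistics.median(known)
    return [{**r, field: median if r[field] is None else r[field]} for r in rows]

def similar_pairs(rows, field, threshold=0.9):
    """2) Similar duplicates: index pairs whose strings are nearly equal."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = rows[i][field].lower(), rows[j][field].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return pairs

def outliers(rows, field, cutoff=3.5):
    """3) Abnormal data: indices whose modified z-score (based on MAD) exceeds the cutoff."""
    values = [r[field] for r in rows]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > cutoff]

def logic_errors(rows):
    """4) Logic errors: indices that violate the rule start <= end."""
    return [i for i, r in enumerate(rows) if r["end"] < r["start"]]

def normalize(rows, field):
    """5) Inconsistent data: trim whitespace and use one capitalization."""
    return [{**r, field: r[field].strip().title()} for r in rows]

cleaned = normalize(fill_missing(rows, "age"), "city")
dupes = similar_pairs(cleaned, "name")
odd_ages = outliers(cleaned, "age")
bad_ranges = logic_errors(cleaned)
print(dupes, odd_ages, bad_ranges)
```

The detection functions only report row indices; whether a flagged row is merged, corrected, or dropped is left to the caller, since that decision usually depends on business rules. The pairwise duplicate check is quadratic in the number of rows, so larger data sets would typically need blocking or sorting on a key first.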