Data cleansing is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by different data dictionary definitions of similar entities in different stores, by user entry errors, or by corruption in transmission or storage.
Data cleansing differs from data validation: validation means that data is rejected from the system at entry, and it is performed at entry time rather than on batches of data. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
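The difference between strict and fuzzy validation against a known list can be sketched as follows. This is only an illustration: the reference list `KNOWN_CITIES`, the function names, and the similarity cutoff of 0.8 are assumptions made for the example, and the fuzzy step uses Python's standard `difflib` module rather than any particular method named in the text.

```python
import difflib

# Hypothetical reference list of known entities (assumed for illustration).
KNOWN_CITIES = ["Springfield", "Shelbyville", "Capital City"]

def strict_validate(value, known):
    """Strict: accept the value only if it exactly matches a known entity."""
    return value in known

def fuzzy_correct(value, known, cutoff=0.8):
    """Fuzzy: return the closest known entity, or None if nothing is similar enough.

    The cutoff of 0.8 is an arbitrary choice for this sketch.
    """
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    typo = "Springfeild"
    print(strict_validate(typo, KNOWN_CITIES))  # the misspelled value is rejected
    print(fuzzy_correct(typo, KNOWN_CITIES))    # the misspelled value is corrected
```

A strict check simply rejects the misspelled value, while the fuzzy check maps it to the nearest known record. How aggressive the cutoff should be depends on how costly a wrong correction is.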
High-quality data needs to pass a set of quality criteria. These include:
Accuracy: An aggregated value over the criteria of integrity, consistency and density
Integrity: An aggregated value over the criteria of completeness and validity
Completeness: Achieved by correcting data containing anomalies
Validity: Approximately the amount of data satisfying integrity constraints
Consistency: Concerns contradiction and syntactical anomalies
Uniformity: Directly related to irregularity and in compliance with the set ‘Unit of Measure’
Density: The quotient of missing values in the data and the total number of values that ought to be known
Uniqueness: Related to the number of duplicates in the data
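A few of these criteria can be turned into simple measurements. The sketch below is one possible reading, not a standard definition: it assumes records are dictionaries, that a missing value is stored as `None`, and that an integrity constraint can be written as a function returning true or false. The function names and the sample data are hypothetical.

```python
from collections import Counter

def density(records, fields):
    """Quotient of missing (None) values over the total number of expected values."""
    total = len(records) * len(fields)
    missing = sum(1 for r in records for f in fields if r.get(f) is None)
    return missing / total if total else 0.0

def duplicate_count(records, key_fields):
    """Number of records that repeat an earlier record on the given key fields."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    return sum(c - 1 for c in counts.values() if c > 1)

def validity(records, constraint):
    """Share of records that satisfy the given integrity constraint."""
    if not records:
        return 1.0
    return sum(1 for r in records if constraint(r)) / len(records)

# Hypothetical sample data for illustration only.
SAMPLE = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 1, "age": 30},
    {"id": 3, "age": -5},
]

def age_in_range(record):
    """Example constraint (assumed): age must be present and between 0 and 120."""
    return record["age"] is not None and 0 <= record["age"] <= 120

if __name__ == "__main__":
    print(density(SAMPLE, ["id", "age"]))
    print(duplicate_count(SAMPLE, ["id", "age"]))
    print(validity(SAMPLE, age_in_range))
```

Aggregated criteria such as integrity or accuracy would combine several of these numbers; how to weight them is left open here because the text does not specify it.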
Data Cleansing Process:
Data Auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.
Workflow Specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the data has been audited and is crucial to achieving the end product of high-quality data. To specify a proper workflow, the causes of the anomalies and errors in the data have to be closely considered. If, for instance, an anomaly turns out to be the result of typing errors at the data input stage, the layout of the keyboard can help in identifying possible solutions.
Workflow Execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient even on large data sets, which inevitably poses a trade-off because executing a data cleansing operation can be computationally expensive.
Post-Processing and Controlling: After the cleansing workflow has been executed, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is corrected manually, if possible. The result is a new cycle in the data cleansing process, in which the data is audited again to allow the specification of an additional workflow that further cleanses the data by automatic processing.
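The cyclic nature of the process (audit, execute the workflow, audit again) can be sketched as a small loop. Everything here is an assumption made for illustration: `audit` is taken to be a function that returns the positions of anomalous records, `workflow` a function that tries to repair them, and the name-trimming example is hypothetical. Records still flagged at the end are the ones left for manual correction.

```python
def run_cleansing_cycle(records, audit, workflow, max_rounds=5):
    """Repeat audit -> workflow until the audit is clean or stops improving.

    Returns the cleansed records and the anomalies that remain for manual review.
    """
    previous = None
    for _ in range(max_rounds):
        anomalies = audit(records)
        # Stop when nothing is wrong, or when the last round changed nothing.
        if not anomalies or anomalies == previous:
            return records, anomalies
        records = workflow(records, anomalies)
        previous = anomalies
    return records, audit(records)

def audit_names(records):
    """Hypothetical audit: flag names that are padded with whitespace or empty."""
    return [
        i for i, r in enumerate(records)
        if r["name"] != r["name"].strip() or not r["name"].strip()
    ]

def strip_names(records, anomalies):
    """Hypothetical workflow: trim whitespace on the flagged records."""
    fixed = [dict(r) for r in records]
    for i in anomalies:
        fixed[i]["name"] = fixed[i]["name"].strip()
    return fixed

if __name__ == "__main__":
    data = [{"name": " Alice "}, {"name": "Bob"}, {"name": "  "}]
    cleaned, remaining = run_cleansing_cycle(data, audit_names, strip_names)
    print(cleaned)
    print(remaining)  # positions that still need manual correction
```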
Methods of Data Cleansing:
Parsing: In data cleansing, parsing is the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.
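As a small illustration of parsing as a cleansing step, the sketch below checks whether a string conforms to an assumed specification for dates, namely the form YYYY-MM-DD. The specification, the pattern, and the function name `parse_date` are choices made for this example only.

```python
import re
from datetime import datetime

# Assumed data specification: four-digit year, two-digit month, two-digit day.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def parse_date(text):
    """Return a date if the text conforms to YYYY-MM-DD, otherwise None."""
    if not DATE_PATTERN.fullmatch(text):
        return None  # syntax error: wrong shape
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None  # right shape, but not a real calendar date

if __name__ == "__main__":
    for raw in ["2021-02-28", "2021-02-30", "28/02/2021"]:
        print(raw, "->", parse_date(raw))
```

The pattern rejects strings with the wrong syntax, and the `strptime` call additionally rejects well-formed strings that do not name a real date.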
Data Transformation: Data transformation maps the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
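Both kinds of transformation mentioned above can be sketched briefly. The translation table `COUNTRY_CODES` and the function names are hypothetical, and min-max scaling is used here as one common way to make numeric values conform to a target minimum and maximum.

```python
# Hypothetical translation table from free-text values to a target code.
COUNTRY_CODES = {
    "united states": "US",
    "usa": "US",
    "germany": "DE",
    "deutschland": "DE",
}

def translate_country(raw):
    """Value translation: map a free-text country name to a code, or None if unknown."""
    return COUNTRY_CODES.get(raw.strip().lower())

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numbers linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; there is no spread to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

if __name__ == "__main__":
    print(translate_country("  USA "))
    print(min_max_normalize([10, 20, 30]))
```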
Duplicate Elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity or not. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
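A minimal sketch of the sort-then-compare idea follows. It assumes that two records represent the same entity when their name and email agree after trimming and lowercasing; that matching rule, the field names, and the function names are assumptions for the example, and real duplicate detection usually needs a more tolerant comparison.

```python
def normalize_key(record):
    """Sort key that brings likely duplicates next to each other (assumed rule)."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def eliminate_duplicates(records):
    """Sort by the normalized key, then keep only the first record of each run."""
    ordered = sorted(records, key=normalize_key)
    unique = []
    for record in ordered:
        # After sorting, a duplicate can only be the immediate neighbor.
        if unique and normalize_key(unique[-1]) == normalize_key(record):
            continue
        unique.append(record)
    return unique

if __name__ == "__main__":
    people = [
        {"name": "Ann Lee", "email": "ann@example.com"},
        {"name": "Bob Ray", "email": "bob@example.com"},
        {"name": " ann lee ", "email": "ANN@example.com"},
    ]
    for person in eliminate_duplicates(people):
        print(person)
```

Because the list is sorted first, each record only has to be compared with its neighbor instead of with every other record.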
Statistical Methods: By analyzing the data using the mean, standard deviation, range, or clustering algorithms, an expert can find values that are unexpected and thus likely erroneous. Although correcting such data is difficult because the true value is not known, the problem can be resolved by setting the value to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values that are usually obtained by extensive data augmentation algorithms.
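One simple way to apply this is to flag values that lie far from the mean in units of standard deviation, then replace flagged and missing values with the average of the remaining ones. The sketch below assumes missing values are stored as `None` and uses a threshold of 2 standard deviations; both are arbitrary choices for illustration, and mean replacement is a much simpler stand-in for the data augmentation algorithms mentioned above.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return positions whose distance from the mean exceeds threshold standard deviations."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.pstdev(present)
    if stdev == 0:
        return []  # no spread, so nothing can be an outlier
    return [
        i for i, v in enumerate(values)
        if v is not None and abs(v - mean) / stdev > threshold
    ]

def impute_with_mean(values, outlier_positions):
    """Replace missing and flagged values with the mean of the trusted values."""
    bad = set(outlier_positions)
    trusted = [v for i, v in enumerate(values) if v is not None and i not in bad]
    fill = statistics.mean(trusted)
    return [fill if (v is None or i in bad) else v for i, v in enumerate(values)]

if __name__ == "__main__":
    readings = [10, 12, 11, None, 13, 12, 11, 500]
    flagged = find_outliers(readings)
    print(flagged)
    print(impute_with_mean(readings, flagged))
```

An expert would still need to confirm that a flagged value is really an error, since an unusual value is not necessarily a wrong one.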