Common problems and solutions in data analysis

rakib · Posted at 12:38:46

Missing data processing:
Missing data is a common problem in data analysis. It can be caused by errors during data collection, loss during data transmission, or other unforeseen reasons. Missing data can compromise the completeness of the data set and reduce the reliability of the analysis results, so how to handle it is an important question.

Types of missing data
Missing completely at random (MCAR): The cause of the missingness is unrelated to any variable, observed or unobserved.
Missing at random (MAR): The cause of the missingness is related to the observed variables, but not to the unobserved values.
Missing not at random (MNAR): The cause of the missingness is related to the unobserved values themselves.
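
To make the three mechanisms concrete, here is a small simulation sketch. The variables `age` and `income`, the 30% rate, and the logistic form of the missingness probabilities are all made up for illustration; real data will not follow these exact rules.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 10, n)                       # always observed
income = 1000 + 50 * age + rng.normal(0, 300, n)  # the variable that gets gaps

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every value has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: the chance of being missing depends only on the observed variable (age).
mar = rng.random(n) < sigmoid((age - 40) / 5)
# MNAR: the chance of being missing depends on the unobserved value itself.
mnar = rng.random(n) < sigmoid(2 * (income - income.mean()) / income.std())

# Mean of the values that remain observed under each mechanism.
means = {
    "full": income.mean(),
    "MCAR": income[~mcar].mean(),
    "MAR": income[~mar].mean(),
    "MNAR": income[~mnar].mean(),
}
for name, m in means.items():
    print(f"{name:>4}: mean of observed income = {m:.1f}")
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled down. The difference is that under MAR the shift can be corrected by using `age`, whereas under MNAR the observed data alone cannot reveal it.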
Methods for handling missing data
Deletion:
Listwise deletion: Directly delete every record that contains a missing value. This is suitable when the proportion of missing values is small and the values are missing at random.
Pairwise deletion: For each pairwise analysis, delete only the records in which one of the variables used in that analysis is missing.
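
A minimal sketch of the two deletion approaches with pandas. The toy columns `x`, `y`, `z` are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [1.0, 0.0, 1.0, np.nan, 1.0],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: for each pair of columns, use all rows where both
# values are present. DataFrame.corr excludes missing values pair by pair.
pairwise_corr = df.corr()

# Number of rows that contribute to the (x, y) pair under pairwise deletion.
n_xy = len(df[["x", "y"]].dropna())

print("rows kept by listwise deletion:", len(listwise))
print("rows used for the (x, y) pair :", n_xy)
```

Pairwise deletion keeps more rows per analysis, but each pair is then based on a different subset of the data, which is worth remembering when comparing results across pairs.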
Imputation:
Mean/median/mode imputation: Use the mean, median or mode of the variable to fill in the missing values. This is simple and easy to implement, but it understates the variance of the variable.
Regression imputation: Fit a regression model using other variables that are related to the variable with missing values, then use the predicted values to fill in the missing values.
Multiple imputation: Generate multiple complete data sets, fill the missing values in each data set with different plausible values, analyze each data set separately, and finally combine the results.
Model-based methods: Use a statistical model (for example, the EM algorithm) to estimate the missing values.
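
The imputation methods listed above can be sketched roughly as follows. The data are simulated, the 30% missing rate is arbitrary, and the multiple-imputation part is a deliberately simplified version (bootstrap refit plus residual noise, pooled by a plain average) rather than a full implementation such as the `mice` package in R.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)
y_full_mean = y.mean()                      # known only because we simulate
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.3, "y"] = np.nan   # hypothetical 30% MCAR gaps in y
observed = df["y"].notna()

# Mean imputation: every gap gets the same value, so the spread shrinks.
y_mean = df["y"].fillna(df["y"].mean())

# Regression imputation: fit y ~ x on complete rows, then predict the gaps.
slope, intercept = np.polyfit(df.loc[observed, "x"], df.loc[observed, "y"], deg=1)
y_reg = df["y"].copy()
y_reg[~observed] = intercept + slope * df.loc[~observed, "x"]

# Simplified multiple imputation: build several completed data sets, each
# from a bootstrap refit plus random residual noise, then pool the estimates.
complete = df[observed]
estimates = []
for _ in range(5):
    boot = complete.sample(len(complete), replace=True,
                           random_state=int(rng.integers(1_000_000)))
    b1, b0 = np.polyfit(boot["x"], boot["y"], deg=1)
    resid = (boot["y"] - (b0 + b1 * boot["x"])).to_numpy()
    noise = rng.normal(0, resid.std(ddof=2), int((~observed).sum()))
    filled = df["y"].copy()
    filled[~observed] = b0 + b1 * df.loc[~observed, "x"] + noise
    estimates.append(filled.mean())
pooled_mean = float(np.mean(estimates))

print("variance, observed only :", df["y"].var())
print("variance, mean imputed  :", y_mean.var())
print("variance, regr. imputed :", y_reg.var())
print("pooled mean (5 data sets):", pooled_mean)
```

The variance after mean imputation is always smaller than the variance of the observed values, which is the main drawback of that method. Adding random noise to the regression predictions, as in the multiple-imputation loop, is one way to avoid shrinking the spread.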
No explicit handling:
Direct analysis: Some statistical methods can work with missing values directly, such as certain parametric tests.
Treat missing values as a category: Treat "missing" as a new category of the variable in the analysis.
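
Treating missing values as their own category could look like this in pandas. The column name `channel` and the label `"Missing"` are just placeholders.

```python
import pandas as pd

df = pd.DataFrame({"channel": ["email", None, "phone", "email", None]})

# Keep a flag for the original gaps, then give them their own category label.
df["channel_was_missing"] = df["channel"].isna()
df["channel_filled"] = df["channel"].fillna("Missing")

counts = df["channel_filled"].value_counts()
print(counts)
```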
Choosing an appropriate handling method
The proportion of missing values: If the proportion of missing values is small, deletion can be considered; if the proportion is larger, imputation should be considered.
The distribution of missing values: If the values are missing at random, consider mean/median/mode imputation; if the missingness is related to other variables, consider regression imputation or multiple imputation.
The type of data: For continuous variables, consider mean/median imputation or regression imputation; for categorical variables, consider mode imputation or multiple imputation.
The purpose of the analysis: Different analysis goals place different requirements on how missing data should be handled.
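
As a rough sketch, the first and third criteria can be turned into a small helper. The function `suggest_method` and its 5% threshold are my own assumptions, not an established rule; it does not check the missingness mechanism or the purpose of the analysis, which still need human judgment.

```python
import numpy as np
import pandas as pd

def suggest_method(col: pd.Series, small_share: float = 0.05) -> str:
    """Hypothetical rule of thumb based on missing share and data type."""
    share = col.isna().mean()          # proportion of missing values
    if share == 0:
        return "nothing to do"
    if share <= small_share:
        return "deletion"
    if pd.api.types.is_numeric_dtype(col):
        return "mean/median or regression imputation"
    return "mode or multiple imputation"

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 31.0, 47.0],
    "city": ["A", None, "B", "A"],
})
suggestions = {name: suggest_method(df[name]) for name in df.columns}
for name, s in suggestions.items():
    print(f"{name}: {df[name].isna().mean():.0%} missing -> {s}")
```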
Points to note
The missingness mechanism: Understanding why values are missing helps in choosing an appropriate handling method.
Data distribution: Imputation may change the distribution of the data, so it needs to be applied carefully.
Sensitivity analysis: Compare the results of different handling methods and choose the most suitable one.
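
A sensitivity analysis can be as simple as computing the same estimate under several handling methods and looking at how far the answers differ. The sketch below uses simulated data with an invented MAR mechanism (y is missing more often when x is large); the true mean of y is 1.0 by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
df = pd.DataFrame({"x": x, "y": y})

# Hypothetical MAR mechanism: y is missing more often when x is large.
df.loc[rng.random(n) < 1 / (1 + np.exp(-2 * x)), "y"] = np.nan
obs = df["y"].notna()

# Regression imputation of y from x, fitted on the complete rows.
slope, intercept = np.polyfit(df.loc[obs, "x"], df.loc[obs, "y"], deg=1)
y_reg = df["y"].copy()
y_reg[~obs] = intercept + slope * df.loc[~obs, "x"]

# Same target (the mean of y) estimated under three handling methods.
results = {
    "listwise deletion": df.loc[obs, "y"].mean(),
    "mean imputation": df["y"].fillna(df["y"].mean()).mean(),
    "regression imputation": y_reg.mean(),
}
spread = max(results.values()) - min(results.values())

for name, value in results.items():
    print(f"{name:<22} mean of y = {value:.3f}")
print(f"spread across methods  = {spread:.3f}")
```

A large spread is a sign that the conclusion depends on how the gaps were handled. Here listwise deletion and mean imputation give the same biased mean, while regression imputation recovers a value close to 1.0 because it uses the variable that drives the missingness.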
Tools and software
R: Provides a variety of functions and packages for handling missing data, such as mice and missForest.
Python: Libraries such as scikit-learn and pandas provide functions for handling missing data.
SPSS: Provides a variety of methods for handling missing data.
SAS: Provides a rich set of procedures for handling missing data.
Summary

Handling missing data is an important step in data analysis. Choosing an appropriate handling method improves the accuracy and reliability of the analysis results.
