Linear regression is a data analysis technique that fits a linear function to data and uses it to make predictions. Although the model is relatively simple, it is a mature statistical technique.
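As a minimal sketch of the idea, the snippet below fits a straight line to one feature with ordinary least squares. The function name `fit_line` and the toy data are illustrative assumptions, not taken from the article.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data (assumed for illustration) generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for an unseen x = 5
print(slope, intercept, prediction)
```

Because the toy data is exactly linear, the fitted line recovers the generating slope and intercept. Real data would leave residual error.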
A model that performs poorly cannot predict data accurately. The main cause may be overfitting or underfitting. When the cause is overfitting, regularization can be used to reduce it.
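The summary does not say which form of regularization is meant. As one possible illustration, the sketch below uses ridge (L2) regularization on a one-feature linear model, where the penalty strength `lam` shrinks the slope toward zero. The name `fit_ridge_line`, the choice of ridge, and the data are all assumptions made for this example.

```python
def fit_ridge_line(xs, ys, lam):
    """One-feature ridge regression with an unpenalized intercept.

    Minimizes sum((y - a - b*x)**2) + lam * b**2, whose solution is
    b = Sxy / (Sxx + lam) and a = mean(y) - b * mean(x).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam > 0 shrinks the slope toward zero
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope_plain, _ = fit_ridge_line(xs, ys, lam=0.0)   # no penalty: plain least squares
slope_ridge, intercept_ridge = fit_ridge_line(xs, ys, lam=5.0)
print(slope_plain, slope_ridge)
```

A larger `lam` gives a flatter, less flexible fit. In practice the value would be chosen by validation on held-out data.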
Overfitting and underfitting are two common root causes of poor model accuracy. Only by determining which of the two a model suffers from can we choose the correct approach to improve its performance.
The confusion matrix is a tool for measuring the performance of classification models. It allows data scientists to analyze and optimize models, so anyone learning machine learning should learn to use a confusion matrix. This article also introduces accuracy, recall, precision, and F1 score.
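To make the four metrics concrete, here is a small sketch for the binary case. It counts the four cells of the confusion matrix and derives accuracy, precision, recall, and F1 from them. The helper names and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical ground-truth labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = classification_metrics(tp, fp, fn, tn)
print(tp, fp, fn, tn, accuracy, precision, recall, f1)
```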
The Spearman correlation coefficient is a nonparametric measure of correlation. It ranks the values of each variable by sorting them, and then uses the differences between the paired ranks to measure the correlation between the two variables.
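The rank-difference calculation can be sketched as follows. This version uses the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which assumes there are no tied values in either variable. The function names and data are illustrative.

```python
def simple_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation via squared rank differences (no-ties formula)."""
    n = len(xs)
    rank_x = simple_ranks(xs)
    rank_y = simple_ranks(ys)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Toy paired observations (assumed for illustration).
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]

rho = spearman_rho(xs, ys)
print(rho)
```

With tied values, the usual approach is to assign average ranks and compute the Pearson correlation of the ranks instead of using this shortcut.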
The Mann-Whitney U test is a nonparametric test. It combines the two samples, sorts them, and assigns ranks according to that order to test whether the distributions of the two populations are equal.
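The combine, sort, and rank steps can be sketched as below. The code computes only the U statistics (with average ranks for ties) and stops short of a p-value, which would need a table or a normal approximation. The function names and samples are assumptions for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of the pooled samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Two small hypothetical samples.
sample_a = [1.1, 2.3, 3.0]
sample_b = [4.2, 5.5, 6.1, 2.9]

u_a, u_b = mann_whitney_u(sample_a, sample_b)
print(u_a, u_b)
```

Here `u_a` equals the number of (a, b) pairs in which the value from `sample_a` is larger, counting ties as one half. A very small or very large value suggests the two distributions differ.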
The chi-square test checks whether the variables in a crosstab are independent, based on the observed frequencies. SciPy’s chi2_contingency() can help us quickly calculate the chi-square statistic and p-value.
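To show what the test computes, the sketch below works out the expected frequencies, the chi-square statistic, and the degrees of freedom by hand for a made-up 2x3 crosstab, without SciPy. The p-value line relies on the fact that a chi-square distribution with exactly 2 degrees of freedom has survival function exp(-x/2), so it applies to this table shape only. The function name and the counts are assumptions.

```python
import math


def chi_square_statistic(table):
    """Return (statistic, dof, expected) for a crosstab of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical observed frequencies: 2 groups x 3 categories.
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

statistic, dof, expected = chi_square_statistic(observed)

# Valid only because dof == 2 here; other shapes need the general
# chi-square survival function (for example from SciPy).
p_value = math.exp(-statistic / 2)
print(statistic, dof, p_value)
```

For a table like this one, `scipy.stats.chi2_contingency(observed)` should report the same statistic, degrees of freedom, and expected frequencies. For 2x2 tables it applies Yates' continuity correction by default, so its statistic would differ from this uncorrected calculation.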