postgraduate thesis: The sample size design in ultrahigh dimensional regression models and extensive simulation validation
Title | The sample size design in ultrahigh dimensional regression models and extensive simulation validation |
---|---|
Authors | Duan, Zhenchen (段振辰) |
Issue Date | 2016 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Duan, Z. [段振辰]. (2016). The sample size design in ultrahigh dimensional regression models and extensive simulation validation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Most classical regression modeling methods are based on correlation learning. In ultrahigh dimensional scenarios, the correlations between the response variable and all potential predictors are affected by a phenomenon called spurious correlation: the squared sample correlation between the response variable and a predictor that is independent of it can still be large. This dissertation proposes addressing the problem with sample size design methods. We start from a theoretical explanation of the spurious correlation phenomenon, based on the independence structure of the sample correlation matrix obtained by Prof. K.W. Ng.
Using this characterization, we show that the spurious correlation follows a specific distribution that depends on the sample size and the number of independent predictors. We therefore develop a critical sample size design method that controls the spurious correlations at any required level with the minimum required sample size. Because the false predictors may still be correlated in real applications, we also show that the critical sample size remains effective in that case and gives a safer result.
Further, we generalize the sample size design method to protect the true predictors from the independent false predictors. Rather than keeping the spurious correlation below a constant level, we ensure that every true predictor shows a higher squared correlation with the response variable than any false predictor does. For the scenario with one true predictor, an exact solution for the critical sample size is provided. For cases with more than one true predictor, we provide a safer sample size and detailed tuning guidance for users.
Lastly, we propose a modified sampling algorithm, the Dynamic Sampling Importance Resampling (D-SIR) algorithm. It modifies the classical Sampling Importance Resampling algorithm by introducing a dynamic grouping mechanism. The algorithm increases sampling efficiency significantly, so that the sampling time is no longer linear in the required sample size, while the quality of the sample is not necessarily compromised. With the algorithm, a required sample size can be achieved far more easily.
|
Degree | Doctor of Philosophy |
Subject | Regression analysis |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/238346 |
HKU Library Item ID | b5824357 |
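
The abstract says the spurious correlation follows a specific distribution depending on the sample size and the number of independent predictors, but does not name it. As a rough illustration only, the sketch below assumes the textbook setting: normally distributed false predictors that are independent of the response and of each other. Under that assumption the squared sample correlation r² follows a Beta(1/2, (n−2)/2) law and the p sample correlations are mutually independent, so P(max r² ≤ c) = F(c)^p. The function names (`tail_prob`, `prob_all_below`, `critical_sample_size`), the numerical integration, and the linear search are hypothetical choices for this sketch and are not taken from the thesis.

```python
import math


def tail_prob(n, c, steps=2000):
    """P(r^2 > c) for the sample correlation r of n independent normal pairs.

    Assumption of this sketch: the density of r is proportional to
    (1 - r^2)^((n - 4) / 2). We substitute r = sin(theta) and apply
    Simpson's rule on [asin(sqrt(c)), pi / 2].
    """
    if n < 3:
        raise ValueError("need n >= 3")
    if not 0.0 < c < 1.0:
        raise ValueError("need 0 < c < 1")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    theta0 = math.asin(math.sqrt(c))
    h = (math.pi / 2 - theta0) / steps
    k = n - 3  # exponent of cos(theta) after the substitution

    def f(i):
        return math.cos(theta0 + i * h) ** k

    total = f(0) + f(steps)
    total += 4.0 * sum(f(i) for i in range(1, steps, 2))
    total += 2.0 * sum(f(i) for i in range(2, steps, 2))
    integral = total * h / 3.0
    # log B(1/2, (n - 2) / 2), the normalising constant of the density of r
    log_b = math.lgamma(0.5) + math.lgamma((n - 2) / 2) - math.lgamma((n - 1) / 2)
    return min(1.0, 2.0 * math.exp(-log_b) * integral)


def prob_all_below(n, p, c):
    """P(max of p independent spurious r^2 values <= c) at sample size n."""
    tail = tail_prob(n, c)
    if tail >= 1.0:
        return 0.0
    # (1 - tail)^p, computed via log1p to keep precision when tail is tiny
    return math.exp(p * math.log1p(-tail))


def critical_sample_size(p, c, alpha, n_max=100_000):
    """Smallest n with P(max spurious r^2 <= c) >= 1 - alpha (linear search)."""
    target = 1.0 - alpha
    for n in range(3, n_max + 1):
        if prob_all_below(n, p, c) >= target:
            return n
    raise ValueError("no sample size up to n_max meets the target")


if __name__ == "__main__":
    # Hypothetical inputs: 1000 false predictors, threshold 0.25, 5% risk.
    print(critical_sample_size(p=1000, c=0.25, alpha=0.05))
```

Computing the tail probability directly, rather than one minus the CDF, keeps the result accurate when p is large and the per-predictor tail is very small.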
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Duan, Zhenchen | - |
dc.contributor.author | 段振辰 | - |
dc.date.accessioned | 2017-02-10T07:29:33Z | - |
dc.date.available | 2017-02-10T07:29:33Z | - |
dc.date.issued | 2016 | - |
dc.identifier.citation | Duan, Z. [段振辰]. (2016). The sample size design in ultrahigh dimensional regression models and extensive simulation validation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/238346 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Regression analysis | - |
dc.title | The sample size design in ultrahigh dimensional regression models and extensive simulation validation | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5824357 | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.mmsid | 991021210489703414 | - |
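
The abstract describes D-SIR as a modification of the classical Sampling Importance Resampling algorithm. For orientation, here is a minimal sketch of the classical algorithm only: draw from a proposal, weight each draw by target density over proposal density, then resample in proportion to the weights. The dynamic grouping mechanism that defines D-SIR is not specified in the abstract and is not reproduced here. The function `sir_sample`, the helper `normal_logpdf`, and the example target and proposal are all hypothetical choices for illustration.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sir_sample(target_logpdf, proposal_draw, proposal_logpdf, n_proposals, m, rng):
    """Classical SIR: return m draws approximately distributed as the target."""
    # Step 1: sample from the proposal distribution.
    draws = [proposal_draw(rng) for _ in range(n_proposals)]
    # Step 2: importance weights, computed on the log scale for stability.
    log_w = [target_logpdf(x) - proposal_logpdf(x) for x in draws]
    shift = max(log_w)
    weights = [math.exp(lw - shift) for lw in log_w]
    # Step 3: resample with replacement, proportionally to the weights.
    return rng.choices(draws, weights=weights, k=m)


# Example (assumed, not from the thesis): target N(1, 0.5^2), proposal N(0, 2^2).
rng = random.Random(0)
sample = sir_sample(
    target_logpdf=lambda x: normal_logpdf(x, 1.0, 0.5),
    proposal_draw=lambda r: r.gauss(0.0, 2.0),
    proposal_logpdf=lambda x: normal_logpdf(x, 0.0, 2.0),
    n_proposals=20000,
    m=2000,
    rng=rng,
)
mean = sum(sample) / len(sample)
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (len(sample) - 1))
print(round(mean, 2), round(sd, 2))
```

In this classical form the cost grows with the number of proposal draws needed to obtain a resample of the desired size, which is the cost the abstract says D-SIR is designed to reduce.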