Abstract:
Recent advances in sensor technologies allow engineers and scientists to measure many processes in fine detail. The huge amount of data collected in this way forces them to rely heavily on computers for processing and analysis. Machine learning algorithms extract knowledge from experimental data and support complex decision making by computer. Decision rules are thus extracted automatically from data, exploiting the speed and robustness of computing machines. However, the amount of information available can easily overwhelm the processing power of even the most powerful computers. As a result, one of the mainstream research directions in learning from empirical data is the design of algorithms that can be applied efficiently to large-scale problems. Support Vector Machines (SVMs) and graph-based semi-supervised learning techniques are among the latest developments in machine learning for solving real-world problems. This thesis presents the latest results on applying SVMs and graph-based semi-supervised learning techniques to large-scale problems. The first original contribution of the thesis is the development of the Iterative Single Data Algorithm (ISDA) for training large-scale SVMs on classification and regression tasks with the explicit bias term b. A software implementation of ISDA developed in the thesis has shown significantly better performance than the standard SVM software LIBSVM, which implements the sequential minimal optimization (SMO) approach to solving the quadratic programming problem underlying SVM training. The second original contribution of the thesis arises from the application of SVMs to cancer diagnosis. An improved Recursive Feature Elimination with Support Vector Machines (RFE-SVMs) algorithm is developed by exploring the effect of the penalty parameter C.
In the diagnosis of colon cancer, the algorithm performs significantly better than the other methods considered, notably the nearest shrunken centroid method developed by Tibshirani et al. (2002). Finally, the thesis makes two further contributions to the area of semi-supervised learning: the introduction of a normalization step and the development of SemiL, the world's first large-scale graph-based semi-supervised learning software. The normalization step significantly improves the performance of graph-based semi-supervised learning techniques, whereas the SemiL software demonstrates its cost-effectiveness in applying semi-supervised learning techniques to real-world problems.