**UNCW REU 2017**

**Quick Facts**

May - July 2017

UNC Wilmington

NSF-sponsored Research

Publication Pending

**Skills Gained**

R

Python

LaTeX

Machine Learning

Independent Research

Professional Presentation

Scientific Writing

Teamwork

MATLAB

I spent the summer of 2017 at the University of North Carolina Wilmington conducting face recognition research. The program was sponsored by the National Science Foundation, and I worked alongside seven other students. Over the course of the 10-week program I became well versed in many machine learning algorithms commonly used for feature extraction and dimension reduction. In addition to giving weekly presentations and manipulating large datasets, I developed a new algorithm for face recognition that achieved high accuracy on two separate datasets.

Developing RS-2DLDA was my main focus during the latter part of the summer. It has achieved high accuracy on two well-known datasets, ORL and MORPH-II; however, my work is not finished. I plan to continue improving RS-2DLDA throughout the 2017-18 academic year and possibly beyond. Dr. Chen and Dr. Wang at UNCW are working with me to prepare the RS-2DLDA research paper for publication. I gave a poster presentation of my results at the conclusion of the summer, I will be presenting at the CURS conference this fall, and I plan to present at additional conferences in the future.

MORPH-II is a dataset of 55,134 mugshots of 13,617 individuals collected over the span of five years. Each image is accompanied by information about the person, such as their race, gender, and birthdate. Repeat offenders have multiple images in the dataset, and for some of these individuals the recorded information is not consistent from image to image. In total we found one person with an inconsistent gender, 33 with an inconsistent race, and 1,779 with inconsistent birthdates. Reliable research can't be done on dirty data, so our first task of the summer was to devise a systematic way to clean the dataset. Now that the inconsistencies in MORPH-II have been fixed, the hope is that researchers who use MORPH-II in the future will achieve more accurate results.

Along with the other REU participants, I had the opportunity to help develop a statistics workshop for a group of 24 eighth graders. We showed the Junior Seahawk students our research, split into groups, and came up with questions that we wanted to answer. My group decided to poll the class to see which colleges the eighth graders were planning to attend. We collected our data and together made a pie chart using R. The students concluded that UNC was a popular choice because it is a well-known school.

The first three weeks of the program were dedicated to an intensive introduction to machine learning. We focused on research for the next six, and the tenth and final week was reserved for giving final presentations. Regardless of the week's focus, we gave presentations to the group every Friday. Below, I give a summary of each week's activities.

**Week 1**

We began our introduction to machine learning. We reviewed regression, classification, and cross-validation and learned how to implement these techniques in R and MATLAB. Much of the machine learning research done on image datasets involves either face recognition or the prediction of gender, race, or age. By the end of the week we had implemented a simple linear regression model to predict age on the 1,002 images of the FG-NET dataset. My group presented our results on Friday, and I gave an additional tutorial on using R Markdown to the other students.
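A model along these lines can be sketched with ordinary least squares in NumPy. This is a hypothetical illustration, not our original code: the feature matrix here stands in for whatever numeric descriptors are extracted per FG-NET image.

```python
import numpy as np

def fit_age_regression(features, ages):
    """Fit age ~ X w + b by least squares.

    features: (N, p) array of per-image numeric descriptors (hypothetical).
    ages: (N,) array of known ages.
    Returns the (p + 1,) weight vector, last entry being the intercept.
    """
    X = np.column_stack([features, np.ones(len(features))])  # append intercept column
    w, *_ = np.linalg.lstsq(X, ages, rcond=None)
    return w

def predict_age(features, w):
    """Predict ages for new images with the fitted weights."""
    X = np.column_stack([features, np.ones(len(features))])
    return X @ w
```

With informative features, the same two functions cover both training and evaluation on a held-out split, which is how a cross-validated age predictor would be assessed.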

**Week 2**

We continued our machine learning crash course. We learned about several machine learning algorithms, such as logistic regression, linear and quadratic discriminant analysis (LDA and QDA), *k*-nearest neighbors (KNN), bagging, random forests, and boosting. In addition, we began our study of feature extraction. Feature extraction is a way of "extracting" important pieces of information (called features) from an image. If descriptive features are selected, then a machine learning algorithm can use them to do face recognition or predict age, gender, or race. The first feature extraction technique we discussed was local binary patterns (LBP), which we implemented from scratch in Python.
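The basic LBP operator can be sketched in a few lines of NumPy (an illustrative reimplementation, not our original code): each pixel is compared with its eight neighbors to form an 8-bit code, and the histogram of codes over the image serves as the feature vector.

```python
import numpy as np

def lbp_image(img):
    """Compute the basic 8-neighbor LBP code for each interior pixel.

    img: 2D array of grayscale values.
    Returns a 2D array of codes in [0, 255], one per interior pixel.
    """
    # Offsets of the 8 neighbors, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Each neighbor >= center contributes one bit to the 8-bit code.
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes: the image's feature vector."""
    codes = lbp_image(img)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```

In practice LBP is usually applied per block of the image and the block histograms are concatenated, so that some spatial information survives.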

**Week 3**

In addition to the FG-NET dataset, we began using MORPH-II, a dataset with 55,134 images of 13,617 people. It's difficult for computers to make any sort of prediction with so many high-resolution images, so we began investigating dimension reduction techniques such as principal component analysis (PCA). We continued our overview of feature extraction techniques, this week working with the biologically inspired features (BIF) popular among other researchers in the field. We also covered Gabor filters, a technique used for edge detection in images, and support vector machines (SVM), a popular machine learning algorithm. I worked with one other student to investigate how the choice of kernel function with SVM affected accuracy in gender and age prediction. Kernel functions map all data points to a higher-dimensional space where the data is hopefully more separable, and different kernel functions are better suited to different machine learning problems.
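Three kernels commonly compared in this kind of experiment can be written directly as Gram-matrix functions (a sketch of standard kernels, not necessarily the exact set we tested):

```python
import numpy as np

# Each function returns the Gram matrix K with K[i, j] = k(X[i], Y[j]),
# which is what a kernelized SVM consumes in place of raw dot products.

def linear_kernel(X, Y):
    """Plain inner products: no implicit feature map."""
    return X @ Y.T

def polynomial_kernel(X, Y, degree=3, c=1.0):
    """Implicitly maps to all monomials up to the given degree."""
    return (X @ Y.T + c) ** degree

def rbf_kernel(X, Y, gamma=0.1):
    """Gaussian (RBF) kernel: an infinite-dimensional implicit feature map."""
    # Squared distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))
```

Swapping the Gram matrix while holding the SVM fixed is exactly how such a kernel comparison is run; the RBF kernel's `gamma` (and the polynomial degree) would themselves be tuned by cross-validation.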

**Week 4**

At the conclusion of the three-week introduction to machine learning, the other students and I were ready to begin our research. But before we began, we had to ensure that all of the information in the MORPH-II dataset was accurate. The dataset is composed of mugshots, and each image is accompanied by information about the subject such as race, gender, and birthdate. Repeat offenders have multiple entries in the dataset, and upon closer inspection I discovered that some people's information was not consistent from image to image. In total I found one person with an inconsistent gender, 33 with an inconsistent race, and 1,779 with inconsistent birthdates. It appeared that other researchers who had used MORPH-II did not notice these discrepancies. For the duration of this week, I worked with some of the other students to develop a systematic way to clean the data, which we presented to the rest of the group at the end of the week. In addition, we wrote a whitepaper detailing our methodology so that future researchers wishing to use MORPH-II can benefit from the work we did to clean the data. I also began creating more professional-looking presentations with the Beamer LaTeX package. (The Week 4 Presentation is identical to the MORPH-II Presentation on the left.)

**Week 5**

After cleaning the MORPH-II dataset, I began focusing on my individual research project. I read literature on two-dimensional PCA (2DPCA), a generalization of PCA. The 2D version of PCA intrigued me because in traditional PCA, 2D image matrices must first be converted to 1D vectors. 2DPCA instead leaves the images in their original matrix form, and often achieves higher accuracy and lower computation time than traditional PCA. I implemented 2DPCA in Python and conducted experiments on the ORL dataset, replicating the results I found in the 2DPCA research paper. I presented an overview of 2DPCA to the group, along with my open questions and the future work I planned to conduct.
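The core of 2DPCA can be sketched in NumPy (an illustrative version of the standard unilateral formulation, not my original code): the n×n image covariance matrix is averaged over the unflattened image matrices, and each image is projected onto its top *d* eigenvectors.

```python
import numpy as np

def fit_2dpca(images, d):
    """Fit 2DPCA on a stack of m-by-n image matrices.

    Unlike classical PCA, the images are never flattened: the n-by-n
    image covariance matrix averages (A_i - mean)^T (A_i - mean).
    Returns the mean image and the n-by-d projection matrix.
    """
    A = np.asarray(images, dtype=float)      # shape (N, m, n)
    mean = A.mean(axis=0)
    centered = A - mean
    # Image covariance matrix G, size n x n.
    G = np.einsum('kij,kil->jl', centered, centered) / len(A)
    vals, vecs = np.linalg.eigh(G)           # eigenvalues ascending
    X = vecs[:, ::-1][:, :d]                 # keep the top-d eigenvectors
    return mean, X

def project_2dpca(image, mean, X):
    """Project one m-by-n image to its m-by-d feature matrix."""
    return (image - mean) @ X
```

Because G is only n×n (rather than mn×mn as in flattened PCA), the eigendecomposition is far cheaper, which is where the computation-time advantage comes from.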

**Week 6**

Continuing my 2DPCA research, I read all the literature I could find on the topic in order to get a good understanding of what other researchers had accomplished so far. I found two generalizations of 2DPCA and included them in my weekly presentation. Bilateral 2DPCA proved highly effective for face recognition because it takes advantage of the structural information in the face images. Kernel 2DPCA, on the other hand, was far too computationally expensive to be useful for our purposes.

**Week 7**

Two-dimensional LDA (2DLDA) is a 2D generalization of LDA. 2DLDA is usually more accurate than 2DPCA in face recognition problems because it is a supervised method: it takes into account which images belong to which person when learning from the training data, whereas 2DPCA does not. My work this week focused on a thorough literature review of 2DLDA and related topics. I implemented 2DLDA in Python and conducted experiments on the ORL dataset. In addition to replicating the results obtained in the papers I read, I tried a more difficult face recognition task. Instead of simply matching an image with one of the people in the dataset, I created a problem in which the algorithm had to either match an image with its owner or identify the image as an unknown person. I found that although 2DLDA is generally more accurate in closed-set face recognition, where every test image belongs to someone in the dataset, 2DPCA seemed better equipped to identify images that belong to no one in the dataset.
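The supervised step that distinguishes 2DLDA from 2DPCA can be sketched as follows (an illustrative NumPy version of the standard unilateral formulation; published variants differ in details such as regularization): between-class and within-class image scatter matrices are built per class, and the projection solves the generalized eigenproblem between them.

```python
import numpy as np

def fit_2dlda(images, labels, d):
    """2DLDA on a stack of m-by-n images with integer class labels.

    Maximizes between-class over within-class image scatter;
    returns the overall mean and the n-by-d projection matrix.
    """
    A = np.asarray(images, dtype=float)
    labels = np.asarray(labels)
    mean = A.mean(axis=0)
    n = A.shape[2]
    Sb = np.zeros((n, n))   # between-class image scatter
    Sw = np.zeros((n, n))   # within-class image scatter
    for c in np.unique(labels):
        Ac = A[labels == c]
        mc = Ac.mean(axis=0)
        diff = mc - mean
        Sb += len(Ac) * diff.T @ diff
        centered = Ac - mc
        Sw += np.einsum('kij,kil->jl', centered, centered)
    # Generalized eigenproblem Sb x = lambda Sw x, via inv(Sw) Sb.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return mean, vecs[:, order[:d]].real
```

Because the class means enter the scatter matrices, the learned directions separate identities rather than merely capturing variance, which is why 2DLDA usually wins in the closed-set setting.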

**Week 8**

2DLDA is generally an effective method for face recognition, but its performance is highly dependent on *d*, the number of eigenvectors kept. If *d* is chosen too low, 2DLDA performs poorly because it doesn't have enough information to make accurate predictions. On the other hand, if *d* is too high, 2DLDA will overfit to the training data and not generalize well to new images. To remedy this, I developed RS-2DLDA, in which many classifiers are created, each from a random sample of eigenvectors. This reduces the risk of overfitting, and no information is lost to discarded eigenvectors. I further increased the performance of RS-2DLDA by using multiple distance metrics and incorporating a weighting scheme. I presented my algorithm to the group at the end of the week (though I had not yet decided to call it RS-2DLDA).
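The random-subspace idea at the heart of this can be sketched as a voting ensemble (a hypothetical illustration only: the actual RS-2DLDA weighting scheme and its multiple distance metrics are not reproduced here, and a plain majority vote with one Frobenius-norm metric stands in for them):

```python
import numpy as np

def rs_ensemble_predict(test_img, train_imgs, train_labels, vecs, k=10, rng=None):
    """Classify one image with k random-subspace nearest-neighbor voters.

    vecs: n-by-D matrix of candidate eigenvectors (e.g. from 2DLDA).
    Each voter projects onto a random half of the eigenvectors and
    picks the nearest training image; the majority label wins.
    """
    rng = np.random.default_rng(rng)
    D = vecs.shape[1]
    votes = []
    for _ in range(k):
        # Random sample of eigenvectors defines this voter's subspace.
        idx = rng.choice(D, size=max(1, D // 2), replace=False)
        P = vecs[:, idx]
        t = test_img @ P
        # Nearest neighbor under the Frobenius norm in the subspace.
        dists = [np.linalg.norm(a @ P - t) for a in train_imgs]
        votes.append(train_labels[int(np.argmin(dists))])
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]
```

Because each voter sees only a random subset of eigenvectors, no single choice of *d* has to be made, and eigenvectors that would otherwise be discarded still contribute to some voters.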

**Week 9**

I conducted more experiments using RS-2DLDA on the MORPH-II and ORL image datasets. For the remainder of the week I focused on preparing my poster presentation, which I presented to UNCW faculty and students.

**Week 10**

My final presentation for the program focused on RS-2DLDA. This presentation was designed for a general audience, so it is less mathematical and more illustrative. (The Week 10 Presentation is identical to the RS-2DLDA Presentation on the left.)