At Quantifind we analyze a multitude of social data sources (Twitter, blog posts, reviews, etc.) to understand the core consumers for a brand, identify the main drivers of their business KPIs, and provide actionable insights and opportunities to improve performance.
From a marketer’s point of view, one starting point is a clear understanding of the demographic breakdown of the consumer base and how it stacks up against those of competitors. In our product, we track a variety of demographic slices including gender, age, ethnicity, location, and interest groups.
Ideally users would supply this information in their social media profiles, allowing us to filter and aggregate by those slices in our product. Unfortunately this is rare; the proportion of users who self-report their demographic information is < 0.3%. When's the last time you filled out all the details in a sign-up page before posting a comment, review, or reply?
Thus we're left with a data science problem: how can we take a user's profile and comments and make a prediction about the demographic groups he/she falls into?
The more accurately we can predict the demographics of a user, the more accurately we can glean insights from the collective voice of the consumer. In this post we will mainly focus on the gender classification problem for bloggers using textual information.
Our training data consists of approximately 400,000 users collected across hundreds of forums ranging from automobile to baby care to soccer enthusiasts. Below is an example of what a record might look like.
All of the users in our training set self-reported their gender when they signed up to be a member of the forum. Collecting this data took a little ingenuity, since the query language we were using didn't allow us to filter down to those users who self-reported a gender. To circumvent this, we first obtained a large sample of data by querying for common words like "is," "and," and "so." The goal was to approximate a random sample from the world of forum data. Once we had our sample, we identified those domains where users were more likely to self-report their gender. Then we made targeted queries to those domains to enlarge our training set. As a result, our training data was concentrated in a few domains where many users reported their information and fell off exponentially as we looked at smaller domains.
The first step in any text classification problem is tokenization, which converts messy raw text into a sequence of tokens (words). The tokens can then be stored in a much more compact vector or matrix format suitable for statistical analysis. We considered the top 20,000 tokens (unigrams, bigrams, and trigrams) from our training corpus after a fairly generic step of stemming and stop-word removal. After these pre-processing steps we get back a data set in a nice tabular format. In the sparse vector format, the indices field denotes the positions of the tokens in the 20,000-token dictionary and val the corresponding word count for each user.
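To make the vectorization step concrete, here is a minimal sketch in R of how such a sparse term-frequency matrix could be built. It is illustrative only: the 20,000-token vocabulary and the stop-word list are assumed inputs, and stemming and n-gram extraction are omitted for brevity.

```r
library(Matrix)

# Split raw text into lowercase word tokens and drop stop words.
tokenize <- function(text, stopwords) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[nchar(tokens) > 1 & !(tokens %in% stopwords)]
}

# Build a sparse user x token count matrix restricted to a fixed vocabulary.
term_frequency_matrix <- function(posts_by_user, vocab, stopwords) {
  triplets <- lapply(seq_along(posts_by_user), function(u) {
    counts <- table(tokenize(posts_by_user[[u]], stopwords))
    counts <- counts[names(counts) %in% vocab]
    if (length(counts) == 0) return(NULL)
    data.frame(i = u, j = match(names(counts), vocab), x = as.integer(counts))
  })
  triplets <- do.call(rbind, triplets)
  sparseMatrix(i = triplets$i, j = triplets$j, x = triplets$x,
               dims = c(length(posts_by_user), length(vocab)),
               dimnames = list(names(posts_by_user), vocab))
}

# Hypothetical usage, with `top_20k_tokens` and `stopword_list` assumed:
# tf <- term_frequency_matrix(posts_by_user, vocab = top_20k_tokens,
#                             stopwords = stopword_list)
```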
Training data was slightly skewed towards males (56% male vs 44% female), and an average user had 3 posts with 60 tokens. The term-frequency matrix is 99.2% sparse.
We started with a simple linear classifier using the term-frequency matrix as the design matrix. With such a high-dimensional problem and significant sparsity in the design matrix, we must impose some form of regularization to avoid overfitting and to select meaningful features. We used the popular glmnet library in R, which fits a generalized linear model via penalized maximum likelihood. For a binary classification problem we try to minimize

−(1/N) Σᵢ ℓ(yᵢ, β0 + xᵢᵀβ) + λ [ (1 − α) ‖β‖₂²/2 + α ‖β‖₁ ]

w.r.t. (β0, β), where ℓ( · ) is the logistic log-likelihood function, λ is the Lasso parameter, and 0 ≤ α ≤ 1 is the elastic-net parameter that controls the trade-off between the lasso and ridge penalties. The elastic-net penalty is known for encouraging grouped variable selection among features that have strong sample correlation. The glmnet package offers easy-to-use functions to optimize λ via cross-validation with a metric of your choice (classification error, likelihood, AUC, etc.). We tried a pure Lasso penalty approach, that is α = 1, preferring a parsimonious model.
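For reference, the core of such a fit looks roughly like the sketch below, where X stands for the sparse term-frequency matrix and y for the self-reported gender labels (both assumed from the previous steps); this is a simplified version, not our exact production call.

```r
library(glmnet)

# Penalized logistic regression with a pure Lasso penalty (alpha = 1);
# cv.glmnet picks lambda by cross-validated misclassification error.
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                    type.measure = "class", nfolds = 10)

plot(cv_fit)                             # cross-validation curve over lambda

# Coefficients at the CV-optimal lambda (lambda.1se is a sparser alternative).
coefs <- coef(cv_fit, s = "lambda.min")
top_features <- head(sort(abs(coefs[-1, 1]), decreasing = TRUE), 20)
```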
The cross-validation plot shows a pretty healthy U-shaped pattern with an optimal model consisting of 4,000 significant features, affirming our assumption that many of the features are not very useful. Classification accuracy was reasonably good. As a diagnostic check we then looked at the top features selected by the algorithm…
… most of which appear to be fairly intuitive. However, we do see many terms from a specific domain crowding the top of the list. For male features we see an over-abundance of technology-related terms (xps, bios, router, ssd...), whereas pregnancy and baby-related terms completely dominate the top of the female term list. This bias is related to the domains in our training set: domains which are heavily skewed towards either men or women are often specialty forums with very particular language. This might result in poor coverage / test performance in a new domain, as our top features are all concentrated in specific topics. We also tested a few other classification algorithms readily available in R, such as randomForest, gradient boosted trees, and SVM. Random forests and boosting ended up with similar classification error rates, while the SVM performed a bit worse.
The fact that most of the popular machine learning algorithms performed similarly given the raw tokens as features motivated us to explore the featurization piece in greater detail to achieve better classification performance. We tried an SVD approach, selecting the top k = 100 singular components as features. In terms of predictive performance it was not significantly better than the baseline raw-token model. Additionally, the singular components did not appear to have much semantic coherence / interpretability. Since SVD didn't do a great job of clustering our features, we moved to a more hands-on approach that could be easily interpreted.
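For completeness, the SVD experiment amounts to something like the following sketch. The post does not name the implementation we used; irlba is one common choice in R for a truncated SVD of a large sparse matrix, and the projected features are the left singular vectors scaled by the singular values.

```r
library(irlba)

# Truncated SVD with k = 100 components on the sparse term-frequency matrix X.
svd_fit <- irlba(X, nv = 100)

# Dense user-level features: 400,000 x 100 projection onto the singular components.
svd_features <- svd_fit$u %*% diag(svd_fit$d)
```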
Linguistic research on demographic feature attribution from user posts in social media has often focused on groups of function words as features rather than individual tokens. Word groups help reduce the dimensionality / sparsity and also allow us to build models that have greater coverage outside of the training corpus. The main groups that we considered are:
In total we ended up with 50 semantic features in addition to the individual tokens.
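As a rough illustration of how these theme features can be derived, the sketch below computes per-post average counts over each word group. The group word lists shown here are made up for the example; the actual 50 groups are the ones summarized above.

```r
library(Matrix)

# Illustrative word groups (the real lists are larger and curated).
word_groups <- list(
  familyAvg = c("husband", "wife", "son", "daughter", "family"),
  techAvg   = c("router", "bios", "ssd", "linux", "firmware")
)

# For each group, average the count of that group's tokens per post for a user.
# tf_matrix is the sparse user x token matrix; n_posts the per-user post counts.
group_features <- function(tf_matrix, n_posts, word_groups) {
  sapply(word_groups, function(words) {
    cols <- intersect(words, colnames(tf_matrix))
    as.vector(rowSums(tf_matrix[, cols, drop = FALSE])) / n_posts
  })
}
```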
On many popular social networking / media sites such as Facebook, Twitter, and Instagram, most users set up their account with their first and last name, so demographic annotation becomes a much easier task. With blog posts crawled across hundreds of domains we do not have the luxury of such clean user information; instead we only get a username. Usernames are an interesting data point, as people often get creative with them, reflecting their personality and interests. The spectrum is extremely wide, ranging from cases such as "name":"dave_mitchell", which is just as useful as having a first and last name, to a complete lack of information in "name":"idonotknowhoiam".
We created a list of popular male and female names obtained from census.gov, along with common male and female relation terms such as "bro", "guy", "girl", and "mom". The feature extraction algorithm searched for the longest sub-string in a username that is contained in either the male or female list. Doing this correctly is a bit tricky, since the terms we want to identify can themselves occur as sub-strings of other words. For example, the male name "Ethan" can occur as a sub-string of the female name "Bethany", and the male name "Ben" could be incorrectly detected in a username like "bentoutofshape". To combat this, we used an auxiliary dictionary of English terms as a reference to prevent false positives. We stored all the words (our male and female groups along with the dictionary) in a prefix tree, which can be used to perform the sub-string matching efficiently. The prefix tree approach resulted in a 66% reduction in featurization time over the naive sub-string search method, which was also less accurate. After all this, for each username we extract two binary features: includesMaleNames and includesFemaleNames. Here are some example tags from our algorithm.
Username features had 30% coverage (percentage of usernames which contained either a male or female name) with an accuracy of about 75%.
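A brute-force sketch of the matching logic might look like the following. It uses toy word lists, skips the prefix-tree optimization, and implements one plausible version of the dictionary-based false-positive rule; the production logic may differ in details.

```r
# Longest substring of u that appears in `lexicon` (0 if none). Brute force is
# fine for short usernames; the production version uses a prefix tree instead.
match_longest <- function(u, lexicon) {
  best_len <- 0L
  n <- nchar(u)
  if (n > 0) {
    for (i in 1:n) for (j in i:n) {
      len <- j - i + 1L
      if (len > best_len && substr(u, i, j) %in% lexicon) best_len <- len
    }
  }
  best_len
}

username_gender_features <- function(username, male_names, female_names, english_dict) {
  u <- gsub("[^a-z]", "", tolower(username))    # normalize: lowercase letters only
  m <- match_longest(u, male_names)
  f <- match_longest(u, female_names)
  d <- match_longest(u, english_dict)
  # Keep the longer of the male/female matches, unless an even longer plain
  # English word also matches (e.g. "ben" being swallowed by "bent").
  best  <- max(m, f)
  valid <- best > 0 && best >= d
  c(includesMaleNames   = valid && m >= f,
    includesFemaleNames = valid && f > m)
}

# Toy example: "ben" is rejected because longer dictionary matches ("bent",
# "shape") cover it, so the username gets no gender tag.
# username_gender_features("bentoutofshape",
#   male_names = c("ben", "ethan"), female_names = c("bethany"),
#   english_dict = c("bent", "out", "shape"))
```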
We re-fit the same penalized logistic regression model with the updated feature list and immediately noticed a 10% boost in prediction accuracy on the random hold-out test set. Here are the top features selected by this new model.
We were pleasantly surprised to see that many of the derived features from the feature engineering step now occupy the top spots in the updated list. Both the username tags and the possessives come out as extremely influential. Emoticons make an appearance in the female list. Many of the theme topics show up in the male list, such as {techAvg, carsAvg, sportsAvg, politicsAvg, articlesAvg, profanitiesAvg}, some of which validate existing research, such as the use of articles by males. For the female list we see the babyPregnantAvg theme appear prominently. Other top themes included {familyAvg, conjunctionsAvg, negativeEmotionsAvg,...}. Interestingly, none of the summary statistics made it to the top.
We also looked at the classification performance across domains, stratified by the true male/female proportion for the domain.
Each point in this plot represents a specific domain. On the X-axis we have the true proportion of male authors in that domain, while on the Y-axis we plot the classification accuracy of the updated model on a test data set. As domains become very one-sided, such as cars and soccer at the top right of the plot and pregnancy and nursing at the top left, the problem becomes easier and the algorithm tends to do a better job. In the middle portion we have the most diverse domains, which are inherently harder to predict.
The training data in the gender annotation problem was quite balanced. We are not so lucky in other annotation problems such as age classification (below 21, 21 to 34, 34 to 55, 55 and above). This is the inherent selection bias problem in social media, where age groups above 35 are relatively rare (less than 10% of the training data). In such cases a naive application of popular classification algorithms often results in highly skewed predicted probabilities that are heavily influenced by the training distribution. It is hard to find one solution that can be applied uniformly across all imbalanced classification problems. We briefly touch on a couple of approaches where we have had some success.
One strategy for addressing class imbalance is to up-sample the smaller classes so that the problem artificially becomes more balanced. When using logistic regression, this is essentially the same as weighting the objective function differently for different classes. Getting more creative, there are approaches like SMOTE (Synthetic Minority Oversampling Technique) where you perturb the data as you up-sample, with the hope of increasing the robustness of the model. In their formulation, a data point is perturbed by doing a convex combination of its features with the features of its nearest neighbors. For a regression problem, the algorithm would also do a convex combination of the responses to create the response for the perturbed point. However, computing nearest neighbors in such a large, sparse data set can be difficult. We opted instead to perturb points by randomly dropping one of the count or binary features, setting it to 0. One risk with any of these techniques is that the noise in the features of the up-sampled classes will also be increased. Despite that problem, we did see a small increase in precision due to our SMOTE-like pre-processing. We also found that for multi-class problems, like predicting a user's age bucket, our up-sampling procedure reduced the number of drastic misclassification errors where users who were over 50 might be classified as under 21.
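A minimal sketch of this perturb-and-up-sample step is shown below; the function and parameter names are illustrative rather than our production code, and y is assumed to be a plain vector of class labels.

```r
library(Matrix)

# Duplicate minority-class rows of the sparse design matrix X and perturb each
# copy by zeroing out one randomly chosen nonzero feature.
upsample_perturb <- function(X, y, minority_label, times = 2) {
  idx <- which(y == minority_label)
  new_rows <- list()
  for (t in seq_len(times - 1)) {
    for (i in idx) {
      row <- X[i, , drop = FALSE]
      nz  <- which(as.vector(row) != 0)
      if (length(nz) > 0) {
        drop_col <- nz[sample.int(length(nz), 1)]   # randomly drop one active feature
        row[1, drop_col] <- 0
      }
      new_rows[[length(new_rows) + 1]] <- row
    }
  }
  if (length(new_rows) == 0) return(list(X = X, y = y))
  extra <- do.call(rbind, new_rows)
  list(X = rbind(X, extra),
       y = c(y, rep(minority_label, nrow(extra))))
}
```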
In some real-world problems it is not unrealistic to expect a reasonable class distribution for the test data given a stratifying variable. Consider the example of predicting age groups in blogs where the training data is severely skewed young, but our test set is a Mercedes-Benz auto forum where the users are significantly older. Let's also assume that we are able to get an overall age distribution for the auto vertical showing that 50% of bloggers in the vertical are aged over 35. How can we use this information effectively to re-calibrate the predictions from a classifier? For some algorithms, such as the Naive Bayes classifier, incorporating priors comes naturally, but not so for logistic regression, which does not have such a clear Bayesian interpretation. Articles [3] and [5] frame this as a problem of bias-correcting the intercept in a logistic regression model, which can be interpreted as a posterior probability update under the assumption that the within-class densities remain identical between training and test data. We tested the method outlined in [5] on the gender classification problem, holding out extremely skewed domains as the test set. The overall precision / recall metrics remained relatively similar, but the marginal distribution was markedly better with this calibration approach.
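As a hedged sketch of the intercept-correction idea (the exact formulation in [5] may differ), the adjustment amounts to shifting predictions in log-odds space by the difference between the assumed test-time prior and the training prior, which is valid when the within-class densities are unchanged.

```r
# Re-calibrate predicted P(y = 1 | x) from a logistic model for a new prior.
recalibrate <- function(p_hat, prior_train, prior_test) {
  shift <- qlogis(prior_test) - qlogis(prior_train)   # change in prior log-odds
  plogis(qlogis(p_hat) + shift)                       # adjusted probabilities
}

# Hypothetical numbers: a model trained where 10% of users are "over 35",
# applied to an auto forum believed to be ~50% over 35.
# recalibrate(p_hat = 0.20, prior_train = 0.10, prior_test = 0.50)  # ~= 0.69
```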
If you want to learn more about selection bias in general, please take a look at the in-depth post by Arnau on Selection Bias: Origins and Mitigation.
We used a combination of Scala and Spark to create our training data set and perform featurization across millions of records. For modeling purposes we mainly used R. The flexibility offered by the glmnet package, combined with its efficiency, is truly amazing. On average it took less than 10 minutes to fit a Lasso-logistic regression model with 400,000 users and 20,000 features. When you consider the wide variety of complementary metrics and plots that glmnet returns for free, that's pretty fantastic for an offline learning algorithm. There are quite a few tuning parameters in the glmnet algorithm, and it usually does a great job of selecting default values. To increase the coverage of our selected feature set we eventually chose an elastic-net parameter close to (but below) 1, which adds a small ridge penalty.
The key takeaway from this project was the importance of feature engineering for a predictive algorithm. Often in statistics and machine learning we focus on building algorithms that exploit specific structures in the data, such as sparsity or non-linearity, and expect the model to do most of the heavy lifting with raw features. Featurization allows us to incorporate human intuition in a data-driven fashion and makes the final model more interpretable. We did not explore sophisticated NLP methods such as LDA for feature generation, yet even with simple word-group features the improvement in performance was quite significant.