Survey Text Analysis
To present the qualitative findings of a large university-wide survey, I transformed a large database of comments for cluster analysis. The processing pipeline consisted of the following steps:
- privacy filtering and redaction, using a named-entity search along with a few heuristics and targeted searches for specific phrases and topics;
- sentence embedding: this predated the transformer revolution in NLP, so I used a word-vector approach, with vector representations learned and tuned on the dataset. The method for computing sentence-level vectors without too much noise was partly based on https://openreview.net/pdf?id=SyK00v5xx, using TF-IDF weighting of the word vectors followed by removal of the first principal component;
- k-means clustering for broad topic areas;
- a custom front-end for searching and visualizing relationships between topics, sentiment, and survey audiences.
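The embedding and clustering steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original project code: the function names (`idf_weights`, `sentence_vectors`, `remove_first_component`, `kmeans`), the smoothed IDF formula, the use of the uncentered first singular vector as the "principal component", and the toy word vectors are all hypothetical choices made for the example.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def sentence_vectors(tokenized_docs, word_vectors, idf):
    """TF-IDF-weighted average of word vectors, one row per sentence."""
    dim = len(next(iter(word_vectors.values())))
    rows = np.zeros((len(tokenized_docs), dim))
    for i, doc in enumerate(tokenized_docs):
        counts = Counter(tok for tok in doc if tok in word_vectors)
        total = sum(tf * idf[tok] for tok, tf in counts.items())
        if total == 0:
            continue  # no known tokens: leave the zero vector
        for tok, tf in counts.items():
            rows[i] += (tf * idf[tok] / total) * np.asarray(word_vectors[tok])
    return rows


def remove_first_component(vectors):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    u = vt[0]
    return vectors - np.outer(vectors @ u, u)


def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with random initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids


# Toy usage: the comments and the 3-d word vectors are made up for illustration.
docs = [["parking", "fees"], ["parking", "permit"],
        ["dining", "hall"], ["dining", "menu"]]
toy_vectors = {
    "parking": [1.0, 0.2, 0.1], "fees": [0.9, 0.1, 0.3], "permit": [0.8, 0.3, 0.0],
    "dining": [0.1, 1.0, 0.2], "hall": [0.2, 0.9, 0.1], "menu": [0.0, 0.8, 0.3],
}
embedded = remove_first_component(sentence_vectors(docs, toy_vectors, idf_weights(docs)))
labels, _ = kmeans(embedded, k=2)
```

Removing the shared component is meant to strip out the direction that nearly every sentence vector has in common, so that distances between sentences reflect topic rather than general word frequency. In practice a library k-means implementation with a smarter initialization would likely be preferable to the bare Lloyd's loop shown here.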
For later work in this conceptual space, see machine learning.