Survey Text Analysis
My work with NLP reaches back far enough that I first processed text programmatically with the classic stack: POS tagging, dependency parsing, and other structured tagging systems, followed by vector embeddings and clustering.
The first notable project in this line was an exploratory analysis of large, unstructured survey responses. To bring meaning to the qualitative findings of a large university-wide survey, I turned a database of open-ended text responses into a clustering analysis, offering both pre-computed clusters and a novel search mechanism that let administrators explore the responses by topic. The overall processing consisted of the following steps:
- privacy filtering and redaction, using named-entity detection along with a few heuristics and matching on sensitive phrases and topics;
- sentence embedding: this was before the transformer revolution in NLP, so I used a word-vector-based approach, with vector representations tuned on the dataset. The algorithm for computing sentence-level vectors without too much noise was partly based on https://openreview.net/pdf?id=SyK00v5xx: a TF-IDF-weighted average of word vectors, followed by removal of the first principal component;
- k-means clustering for broad topic areas;
- a custom front-end for searching and visualizing relationships between topics, sentiment, and survey audiences. This front-end combined nearest-neighbor search with structured tagging of content, which let the user immediately answer questions like “How did our non-faculty staff in administrative support roles speak of a certain topic, how did their sentiment on it differ notably from other groups, and what are some characteristic text excerpts that illustrate this?”
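To make the sentence-embedding step above more concrete, here is a minimal, dependency-free sketch of the general idea: average each response's word vectors with TF-IDF weights, then subtract the projection onto the dominant shared direction. It is not the original project code. The function names, the toy word vectors, the smoothed IDF formula, and the use of the uncentered first principal direction are all assumptions made for illustration; a real pipeline would more likely use NumPy or scikit-learn.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_average(doc, word_vectors, idf, dim):
    """TF-IDF-weighted mean of the word vectors for one tokenized response."""
    total = [0.0] * dim
    weight_sum = 0.0
    # Tokens without a word vector are skipped.
    counts = Counter(tok for tok in doc if tok in word_vectors)
    for tok, tf in counts.items():
        weight = tf * idf.get(tok, 1.0)
        for i, x in enumerate(word_vectors[tok]):
            total[i] += weight * x
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


def dominant_direction(rows, iterations=100):
    """Unit vector along the first (uncentered) principal direction,
    estimated with power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # One step of v <- X^T (X v), followed by renormalization.
        w = [0.0] * dim
        for row in rows:
            proj = sum(a * b for a, b in zip(row, v))
            for i, x in enumerate(row):
                w[i] += proj * x
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def remove_direction(row, v):
    """Subtract from row its projection onto the unit vector v."""
    proj = sum(a * b for a, b in zip(row, v))
    return [x - proj * vi for x, vi in zip(row, v)]


def sentence_embeddings(docs, word_vectors):
    """Weighted averages with the dominant common direction removed."""
    dim = len(next(iter(word_vectors.values())))
    idf = idf_weights(docs)
    rows = [weighted_average(doc, word_vectors, idf, dim) for doc in docs]
    direction = dominant_direction(rows)
    return [remove_direction(row, direction) for row in rows], direction


# Hypothetical toy data: 3-dimensional word vectors and tokenized responses.
toy_vectors = {
    "parking": [1.0, 0.2, 0.0],
    "campus": [0.9, 0.1, 0.3],
    "salary": [0.1, 1.0, 0.2],
    "benefits": [0.2, 0.9, 0.1],
}
toy_docs = [
    ["parking", "campus"],
    ["salary", "benefits"],
    ["campus", "salary", "unknownword"],
]
embeddings, direction = sentence_embeddings(toy_docs, toy_vectors)
```

The resulting vectors are orthogonal to the removed direction, so similarity comparisons between responses are less dominated by whatever component all responses share. Such vectors could then feed the clustering and nearest-neighbor steps.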
For later work in this conceptual space, see machine learning.