Bigrams, Trigrams and Porter Stemming

From mtab wikisupport
Revision as of 15:26, 29 July 2013 by Mtabadmin (Talk | contribs)


With each new release of mTAB, we extend our commitment to enhancing the combined analysis of unstructured (i.e. verbatim) and structured (i.e. closed-ended) survey questions. This powerful ability enables users to extract the additional information trapped within a survey project’s verbatim responses, using mTAB’s built-in quantitative analysis tools and techniques.


The Basics

For some time now, mTAB has supported cross-tabulating verbatim questions against closed-end questions, and using closed-end questions as filter criteria to focus the analysis of verbatim questions. For example, within mTAB you can view the verbatim survey question “About my experience...” broken out by a “brand” (closed-end question) column banner and filtered to respondents who gave bottom-three-box ratings to the question “Would you recommend to a friend or relative?”.
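
As a rough illustration of what such a filtered cross tab does conceptually, the sketch below groups verbatims under a closed-end banner question after applying a filter. This is not mTAB's implementation: the record layout, the field names, the function name cross_tab_verbatims, and the 1-10 rating scale are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical survey records; the field names are illustrative, not mTAB's.
responses = [
    {"brand": "Brand A", "recommend": 9, "verbatim": "Great service overall"},
    {"brand": "Brand A", "recommend": 2, "verbatim": "The waiting room was dirty"},
    {"brand": "Brand B", "recommend": 1, "verbatim": "Confused by the paperwork"},
    {"brand": "Brand B", "recommend": 3, "verbatim": "Disappointed with the sales process"},
]


def cross_tab_verbatims(records, banner, keep):
    """Group verbatims by a closed-end banner question, keeping only filtered records."""
    columns = defaultdict(list)
    for record in records:
        if keep(record):
            columns[record[banner]].append(record["verbatim"])
    return dict(columns)


def bottom_three_box(record):
    # Assumption: a 1-10 recommend scale, so "bottom three box" means ratings 1 to 3.
    return record["recommend"] <= 3


table = cross_tab_verbatims(responses, "brand", bottom_three_box)
for brand, verbatims in table.items():
    print(brand, verbatims)
```

Each key of the resulting dictionary plays the role of one column in the verbatim cross tab.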


In recent versions we have extended mTAB’s ability to mix verbatim and closed-end questions in this manner by introducing term analysis of verbatim questions. Within mTAB, you can search verbatim responses for your own a priori key terms (e.g. “dirty”, “confused”, “disappointed”), or you can use mTAB’s tag cloud analysis tool to identify the most frequently occurring terms within the selected set of verbatim responses. mTAB’s tag cloud dialog can additionally display quantitative statistics about each term, for example the number of term occurrences, the percentage of verbatims containing the term, and the percentage of respondents mentioning the term.
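
To make the three statistics concrete, here is a minimal sketch of how they could be computed. It assumes that one respondent may contribute several verbatims, which is why the verbatim and respondent percentages can differ. The function names, the tokenizer, and the sample data are hypothetical and are not taken from mTAB.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a verbatim and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_statistics(verbatims):
    """Compute per-term statistics from a list of (respondent_id, text) pairs."""
    occurrences = Counter()          # every occurrence of the term
    verbatims_with_term = Counter()  # verbatims containing the term at least once
    respondents_with_term = {}       # term -> set of respondent ids
    for respondent_id, text in verbatims:
        tokens = tokenize(text)
        occurrences.update(tokens)
        for term in set(tokens):
            verbatims_with_term[term] += 1
            respondents_with_term.setdefault(term, set()).add(respondent_id)
    n_verbatims = len(verbatims)
    n_respondents = len({respondent_id for respondent_id, _ in verbatims})
    return {
        term: {
            "occurrences": count,
            "pct_verbatims": 100.0 * verbatims_with_term[term] / n_verbatims,
            "pct_respondents": 100.0 * len(respondents_with_term[term]) / n_respondents,
        }
        for term, count in occurrences.items()
    }


# Illustrative data: respondent 1 answered two verbatim questions.
sample = [
    (1, "Dirty floor and dirty seats"),
    (1, "Staff seemed confused"),
    (2, "The showroom was dirty"),
    (3, "Very happy with the visit"),
]
stats = term_statistics(sample)
print(stats["dirty"])
```

In this sample, “dirty” occurs three times, appears in two of the four verbatims, and is mentioned by two of the three respondents.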


A survey analyst can use the term analytics to compare term frequencies between columns of a verbatim cross tab. Expanding on our previous example, we could learn that 20% of Brand B’s respondents reported dissatisfaction with their sales transaction experience, twice the rate of any other brand. You need only select the individual spreadsheet column you wish to analyze (in this case, a brand) and then generate the tag cloud specific to that column.
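
The column-by-column comparison can be sketched as follows. The data and the helper name share_mentioning are invented for illustration; the percentages describe only this toy data, not real survey results.

```python
def share_mentioning(verbatims, term):
    """Percentage of verbatims whose words include the given term."""
    if not verbatims:
        return 0.0
    hits = sum(1 for text in verbatims if term in text.lower().split())
    return 100.0 * hits / len(verbatims)


# Hypothetical cross-tab columns: one list of verbatims per brand.
columns = {
    "Brand A": ["quick sales process", "friendly staff", "clean showroom",
                "easy paperwork", "fair price"],
    "Brand B": ["pushy sales rep", "sales paperwork took hours", "nice car",
                "long wait", "good coffee"],
}

by_brand = {brand: share_mentioning(texts, "sales") for brand, texts in columns.items()}
for brand, pct in by_brand.items():
    print(f"{brand}: {pct:.0f}% of verbatims mention 'sales'")
```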


Further Enhancements

Building upon the basics, mTAB now includes bigram and trigram term frequencies within the tag cloud analysis, extending the power of mTAB’s term analysis. Bigrams and trigrams are combined terms, like “gas mileage” (a bigram) and “room for improvement” (a trigram), that would previously have been reported as individual terms within mTAB’s tag cloud. If a combined term occurs frequently enough to rank alongside the most frequent individual tags, it appears within the tag cloud dialog with its words joined by underscore characters (example: “gas_mileage”).

[Image: Verbatims bigrams-trigrams-porter-stemming tag-cloud.jpg]
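
The idea of counting combined terms alongside single terms can be sketched as below. This is an assumption about the general technique, not a description of mTAB's internals: the function names are hypothetical, and ranking unigrams, bigrams, and trigrams in one shared list is a simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_terms(verbatims, limit=50):
    """Count single terms, bigrams, and trigrams together and keep the most frequent."""
    counts = Counter()
    for text in verbatims:
        tokens = text.lower().split()
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    return counts.most_common(limit)


sample = [
    "gas mileage is great",
    "poor gas mileage",
    "gas mileage could be better",
]
counts = dict(top_terms(sample))
print(counts["gas_mileage"], counts["gas"], counts["mileage"])
```

In this sample the bigram “gas_mileage” is exactly as frequent as its component words, so it would rank alongside them.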

Behind the scenes, mTAB’s tag cloud now incorporates a standard English-language “common misspellings” dictionary used by the text analytics community. While misspellings typically do not occur frequently enough to appear within mTAB’s “top 50” tag cloud display, correcting them increases the frequency of the correctly spelled terms, thereby improving the accuracy of the quantitative analysis of terms.
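
The effect of such a dictionary on term counts can be illustrated with a small sketch. The three-entry mapping below is a stand-in chosen for the example; it is not the dictionary mTAB uses, and the function name correct is hypothetical.

```python
from collections import Counter

# Tiny illustrative mapping; a real common-misspellings dictionary is far larger.
MISSPELLINGS = {
    "milage": "mileage",
    "recieve": "receive",
    "definately": "definitely",
}


def correct(tokens, corrections=MISSPELLINGS):
    """Replace known misspellings with their corrected form; leave other tokens alone."""
    return [corrections.get(token, token) for token in tokens]


tokens = "great mileage but the milage dropped in winter".split()
before = Counter(tokens)
after = Counter(correct(tokens))
print(before["mileage"], after["mileage"])
```

After correction, the count for “mileage” absorbs the occurrence that was previously recorded under “milage”.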


In addition, a new Apply Porter Stemming checkbox allows similar terms to be reduced to a root or “normalized” form using the Porter Stemming algorithm commonly used within text analytics. In a nutshell, the Porter algorithm reduces terms that differ by attributes such as tense or plurality to a common root form. For example, considered, considering, considers, considerer, etc. would be reduced to the root term form “consider”.


Reducing terms to a root form brings additional focus to the quantitative analysis of terms: variants of the same word are counted together, which gives a more realistic percentage for terms with similar intent.
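
The consolidating effect of stemming on counts can be seen in the sketch below. Note that it uses a deliberately crude single-suffix stripper as a stand-in and is not the Porter algorithm, which applies several ordered rule steps and whose stems are not always dictionary words. The suffix list, the minimum stem length, and the function name simple_stem are assumptions made for the example.

```python
from collections import Counter

# Assumption: a very small suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def simple_stem(word, min_stem=4):
    """Strip one common suffix; a crude stand-in, NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


tokens = ["considered", "considering", "considers", "consider", "price"]
unstemmed = Counter(tokens)
stemmed = Counter(simple_stem(token) for token in tokens)
print(unstemmed["consider"], stemmed["consider"])
```

Before stemming, each variant is counted once under its own spelling; afterwards, all four variants are counted under the single term “consider”.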


Within the mTAB tag cloud dialog, you can simply check the Apply Porter Stemming checkbox to reduce the existing terms within the tag cloud to the stemmed set, and then observe the verbatim and respondent percentage frequencies of the stemmed terms.

[Image: Verbatims bigrams-trigrams-porter-stemming apply-porter-stemming.jpg]