Ditch the generic dictionary: building smarter sentiment analysis with FCA

FCA
Sentiment Analysis
Text Mining
NLP
Algorithm

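Standard sentiment dictionaries often fail on slang, jargon, and domain-specific language. We present a method that uses Formal Concept Analysis (FCA) to automatically build custom dictionaries from your own data, and that classified tweets more accurately than standard dictionaries in our experiments.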

Authors

Manuel Ojeda-Hernández, Domingo López-Rodríguez, Ángel Mora

Published

3 February 2023

How good is “sentiment analysis”? If you ask a standard tool, it might say the tweet “This movie is sick!” is negative. It sees the word “sick” and flags it as bad.

This is the classic problem with sentiment analysis: context is everything. The one-size-fits-all dictionaries (or “lexicons”) that power most tools are often wrong because they don’t understand the specific slang, jargon, or context of a dataset.

In our 2023 paper in the International Journal of Approximate Reasoning, we propose a new method that throws away the generic dictionary and builds a new one from scratch, tailored to the data at hand.


🧐 The problem: generic lexicons don’t work

Most sentiment analysis tools rely on a pre-defined lexicon—a massive list of words tagged as “positive” or “negative”. But this approach has huge flaws:

  • Context: As we saw, “sick” can be good or bad. “Bad” can mean “cool.”
  • Domain-specific language: A generic lexicon has no idea if “high volatility” is good or bad in a financial tweet, or what “good p-value” means in a scientific paper.
  • Maintenance: These lists are created manually and are expensive to maintain and update.
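The first failure is easy to reproduce. Here is a minimal sketch of a lexicon-based scorer; the six-word `GENERIC_LEXICON` is a hypothetical stand-in written for this post, not an excerpt from any real dictionary.

```python
import re

# Hypothetical toy lexicon (illustrative only, not a real published dictionary).
GENERIC_LEXICON = {
    "great": 1, "love": 1, "good": 1,
    "sick": -1, "bad": -1, "awful": -1,
}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(text, lexicon=GENERIC_LEXICON):
    """Sum the polarity of every known word; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

def label(text, lexicon=GENERIC_LEXICON):
    """Turn the summed score into a coarse sentiment label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this toy lexicon, `label("This movie is sick!")` comes back as "negative", and a phrase like "high volatility" gets no signal at all because neither word is listed.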

💡 Our solution: build a custom dictionary with FCA

We asked: “What if, instead of using a generic list, we could automatically discover the groups of words that actually predict a positive or negative sentiment in our specific dataset?”

This is exactly what Formal Concept Analysis (FCA) is built for.
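For readers new to FCA, the standard definitions are short. The symbols below follow the usual textbook notation; mapping G to tweets and M to words is how we use them in this post.

```latex
% Formal context: objects G (tweets), attributes M (words), incidence I.
\[
  \mathbb{K} = (G, M, I), \qquad I \subseteq G \times M
\]
% Derivation operators for A \subseteq G and B \subseteq M.
\[
  A' = \{\, m \in M \mid (g, m) \in I \ \text{for all } g \in A \,\}, \qquad
  B' = \{\, g \in G \mid (g, m) \in I \ \text{for all } m \in B \,\}
\]
% A formal concept is a pair (A, B) that is closed in both directions.
\[
  (A, B) \ \text{is a formal concept} \iff A' = B \ \text{and} \ B' = A
\]
```

Here A is the extent (a set of tweets) and B is the intent (the words those tweets all share).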

Our method uses FCA to analyze a set of pre-labeled tweets (e.g., 1000 positive, 1000 negative) and finds the “concepts” (groups of words) that are common to the positive tweets and absent from the negative ones, and vice versa.

These concepts become our new, custom-built sentiment dictionary.

🛠️ How it works: from data to custom lexicon

Our approach is straightforward:

  1. Input: We take a set of labeled tweets (e.g., from a known dataset).
  2. Context: We build a formal context where tweets are “objects” and the words they contain are “attributes.”
  3. FCA: We run our FCA algorithm to extract all the formal concepts.
  4. Lexicon: These concepts (groups of words) form our new, data-driven lexicon, tailored to the language, slang, and topics of our dataset.
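Steps 1 to 3 can be sketched in a few lines of Python. This is a naive illustration on a hypothetical five-tweet corpus, not the algorithm or the data from the paper: it enumerates intents by closing the tweets' word sets under intersection, which is fine for a toy example but far too slow for a real dataset. All function names are our own.

```python
import re

# Hypothetical toy corpus of (text, label) pairs, for illustration only.
TWEETS = [
    ("this movie is sick", "pos"),
    ("sick unreal soundtrack", "pos"),
    ("loved it best movie", "pos"),
    ("boring movie", "neg"),
    ("boring and awful plot", "neg"),
]

def tokenize(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def build_context(tweets):
    """Formal context: map each object (tweet index) to its attributes (words)."""
    return {i: frozenset(tokenize(text)) for i, (text, _) in enumerate(tweets)}

def all_attributes(context):
    """The full attribute set M of the context."""
    return frozenset().union(*context.values())

def intent(objects, context):
    """Words shared by every tweet in `objects` (all of M if `objects` is empty)."""
    shared = all_attributes(context)
    for obj in objects:
        shared &= context[obj]
    return shared

def extent(words, context):
    """Tweets that contain every word in `words`."""
    return frozenset(obj for obj, attrs in context.items() if words <= attrs)

def formal_concepts(context):
    """Naive enumeration: the intents are M plus all intersections of tweet word sets."""
    intents = {all_attributes(context)}
    for attrs in context.values():
        intents |= {attrs & known for known in intents}
    return {(extent(words, context), words) for words in intents}

if __name__ == "__main__":
    context = build_context(TWEETS)
    for ext, words in sorted(formal_concepts(context), key=lambda c: len(c[1])):
        print(sorted(ext), sorted(words))
```

On this toy corpus, one of the extracted concepts pairs tweets 0 and 1 with the single shared word "sick", and both of those tweets are labeled positive.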

A concept might be {'amazing', 'loved', 'best'}, which is a strong positive signal. But the method might also find {'sick', 'unreal'}, a positive group in our data that a generic dictionary would miss.
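How does a word group earn its polarity? One simple way to think about it is to look at the labels of the tweets that contain the whole group. The sketch below does that with two thresholds, `min_support` and `min_purity`; both thresholds, the function names, and the toy data are assumptions made for illustration, not the selection rule used in the paper.

```python
from collections import Counter

# Hypothetical concepts: a word group plus the labels of the tweets containing it.
CONCEPTS = [
    (frozenset({"amazing", "loved", "best"}), ["pos", "pos", "pos"]),
    (frozenset({"sick", "unreal"}), ["pos", "pos", "pos", "neg"]),
    (frozenset({"boring"}), ["neg", "neg", "neg"]),
    (frozenset({"movie"}), ["pos", "neg", "pos", "neg"]),
]

def polarity(labels, min_support=2, min_purity=0.75):
    """Return the dominant label if the group is frequent and pure enough, else None."""
    if len(labels) < min_support:
        return None
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label if count / len(labels) >= min_purity else None

def build_lexicon(concepts, min_support=2, min_purity=0.75):
    """Keep only the word groups that receive a clear polarity."""
    lexicon = {}
    for words, labels in concepts:
        tag = polarity(labels, min_support, min_purity)
        if tag is not None:
            lexicon[words] = tag
    return lexicon
```

With these toy numbers, {'sick', 'unreal'} is kept as positive (three of its four tweets are positive), while {'movie'} is dropped because it appears equally often in both classes.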

A conceptual diagram of our method: using FCA to analyze text and automatically generate custom positive/negative dictionaries.

🚀 The results: it just works better

This wasn’t just a theory. We tested our method on a real-world dataset of tweets. We compared the classification accuracy of our FCA-generated lexicon against classifiers using standard, well-known sentiment dictionaries.

The results were clear: our FCA-based approach classified the tweets more accurately than the standard dictionaries did.

By building the dictionary from the data itself, the method produces a tool that understands the data’s unique context, which leads to more accurate sentiment classification.
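To show what “classification accuracy” means in this setting, here is a minimal sketch of scoring tweets with a word-group lexicon and measuring how often the prediction matches the label. The lexicon, the test tweets, and the matching rule (a group fires only when all of its words are present) are hypothetical choices for this post, not the experimental setup from the paper.

```python
import re

# Hypothetical word-group lexicon: each group maps to +1 (positive) or -1 (negative).
LEXICON = {
    frozenset({"sick"}): 1,
    frozenset({"loved"}): 1,
    frozenset({"boring"}): -1,
    frozenset({"awful", "plot"}): -1,
}

# Hypothetical labeled test tweets.
TEST_TWEETS = [
    ("That concert was sick", "pos"),
    ("Loved the ending", "pos"),
    ("Boring and slow", "neg"),
    ("Awful plot, great cast", "neg"),
    ("Not my thing", "neg"),
]

def tokenize(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def classify(text, lexicon=LEXICON):
    """Sum the polarity of every word group fully contained in the tweet."""
    tokens = tokenize(text)
    score = sum(value for group, value in lexicon.items() if group <= tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"

def accuracy(tweets, lexicon=LEXICON):
    """Fraction of tweets whose predicted label equals the true label."""
    hits = sum(classify(text, lexicon) == label for text, label in tweets)
    return hits / len(tweets)
```

On these five tweets the toy lexicon gets four right; the last one matches no word group and falls back to "neutral", which is the kind of gap a larger, data-derived lexicon is meant to close.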

🔬 Why does this matter?

This method provides a new pipeline for creating high-performance, custom-tailored sentiment analysis tools. Any company can now take its own customer reviews, its own support tickets, or its own social media data and build a sentiment classifier that understands its specific language and its customers, rather than relying on a generic tool that barely works.


📖 The full paper

For the complete methodology, the full experimental setup, and the performance comparison, you can read the original open-access article.

Lexicon-based sentiment analysis in texts using Formal Concept Analysis. Authors: Manuel Ojeda-Hernández, Domingo López-Rodríguez, Ángel Mora. Journal: International Journal of Approximate Reasoning (vol. 155, pp. 104-112)

[DOI Link] | [Article Website]