Text Pre-Processing

Regular Expression

Issue: Different texts contain different expressions. A general pattern is needed that can be used to remove unnecessary expressions from every text.
Important: Extracts the main words needed for analysis.

Remove Punctuation

Remove leading and trailing whitespace

Remove symbols/digits/unnecessary words
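
The cleaning steps listed above could be combined roughly as follows. This is a minimal sketch using only Python's standard `re` module; the function name `clean_text` and the exact patterns are assumptions for illustration, and a real pipeline would tune them to the texts at hand.

```python
import re

def clean_text(text: str) -> str:
    """Strip digits, punctuation/symbols, and surplus whitespace (illustrative only)."""
    text = re.sub(r"\d+", " ", text)        # remove digits
    text = re.sub(r"[^\w\s]|_", " ", text)  # remove punctuation and symbols
    text = re.sub(r"\s+", " ", text)        # collapse repeated whitespace
    return text.strip()                     # remove leading and trailing whitespace

print(clean_text("  Hello, world!!  Order #123 arrived :) "))
# prints: Hello world Order arrived
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the extra spaces are collapsed in the last two steps.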

Translate & Detect Language

Description: Translate the text into a universal language, which is English.
Issue: Need a package/library that supports multiple languages and is open source and free to use without limitation (TBC). Words may not be detected and translated accurately if the text is written informally.
Important: Avoids redundant words. Helps in analyzing the context.

Package/Library

translate

Issue: Limit on the number of words that can be used per day.

deep_translator

Tokenize

Description: Split the text into individual chunks/tokens of words.
Issue: Tokenizing can distort the meaning of the context. Need to consider the n-gram size of the tokens used (max 3).

n-gram

unigram

bigram

trigram
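
To make the unigram/bigram/trigram distinction concrete, here is a small sketch in plain Python. The helper name `ngrams` and the whitespace tokenizer are assumptions for illustration; a library tokenizer would handle punctuation and contractions more carefully.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, assumed here for simplicity.
tokens = "the service was not good".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triples of adjacent words

print(bigrams)
# prints: [('the', 'service'), ('service', 'was'), ('was', 'not'), ('not', 'good')]
```

The bigram `('not', 'good')` keeps the negation attached to the word it modifies, which the unigrams `not` and `good` on their own do not. This is one reason the n-gram size matters for context.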

Part of Speech Tagging

Description: Assign a part of speech (noun, verb, adjective, and others) to each word of a given text, based on its definition and its context.
Issue: The tags are based on the pre-trained model of the library. Some of the words in our text need to be updated.

Named Entity Recognition (NER)

Description: Find named entities in the text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).
Issue: Some of the words in our text might not be detected. The model of the library/package used needs to be updated.

Stemming/Lemmatize

Description: Transform each word in the text into its base form/root word.
Issue: Highly dependent on the pre-trained model available in the library. Some words are not available.
Important: Avoids word redundancy (WordCloud, BOW, TF-IDF).

Remove Stopword

Description: Remove the common words in the text, such as is, are, be, the, and, etc.
Issue: Might distort the actual context. The words removed depend on the pre-trained model of the library.
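
A minimal sketch of stopword removal, and of how it can distort context, is shown below. The stopword set here contains only the example words named above, and the helper `remove_stopwords` is hypothetical; a library-supplied list would be much longer.

```python
# Tiny illustrative stopword set (an assumption, not a library's list).
STOPWORDS = {"is", "are", "be", "the", "and"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the food is not good".split()

kept = remove_stopwords(tokens)
print(kept)
# prints: ['food', 'not', 'good']

# If the list also treats "not" as a stopword, the negation is lost.
kept_without_not = remove_stopwords(tokens, STOPWORDS | {"not"})
print(kept_without_not)
# prints: ['food', 'good']
```

The second call illustrates the issue: once "not" is removed, the remaining tokens suggest the opposite of what the sentence said, so the stopword list should be reviewed before it is applied.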

Sentiment Score

Description: Calculate the sentiment score of the text based on how the individual words in a sentence are categorized.
Issue: The current package can only be used for the English language. It might not determine the score accurately for some texts.

Library/Package

nltk.sentiment.vader

textblob
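
The idea of scoring a text from its individual words can be sketched with a toy lexicon. This is not how `nltk.sentiment.vader` or `textblob` are implemented; the lexicon values and the function `sentiment_score` below are made up purely for illustration.

```python
# Hypothetical toy lexicon; real sentiment lexicons are far larger and more nuanced.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

def sentiment_score(text):
    """Average the lexicon scores of the known words; 0.0 if none are known."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("the food was great but the service was bad"))
# prints: 0.5
print(sentiment_score("not good"))
# prints: 1.0
```

The second result shows why word-level scoring can be inaccurate on some texts: this sketch ignores the negation entirely, so "not good" scores as positive. Dedicated packages include rules for cases like this, but informal or non-English text can still be scored poorly.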

Text Normalization

Major Processing