Text Pre-Processing

Major Processing

Text Normalization

Sentiment Score

Description:

Calculates a sentiment score for the text based on how the individual words in each sentence are categorized

Issue:

Current packages support the English language only
Might not score some texts accurately

Library/Package

textblob

nltk.sentiment.vader
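
As a minimal sketch of word-level scoring, the example below averages per-word polarity values taken from a tiny hand-made lexicon. The lexicon, its values and the function name `sentiment_score` are assumptions made for illustration; textblob and nltk.sentiment.vader rely on much larger lexicons and additional rules.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1.0, 1.0].
TOY_LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "terrible": -1.0}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Average the polarity of the words found in the lexicon (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

A text with no words in the lexicon, for example one written in another language, scores 0.0, which reflects the English-only issue listed above.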

Remove Stopword

Description:

Remove common words from the text, such as is, are, be, the, and, etc.

Issue:

Might distort the actual context
The stopword list depends on the pre-trained model of the library
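
The sketch below shows how stopword removal can change the context, and one possible way to guard against it. The stopword list and the `keep` parameter are hypothetical; real libraries ship their own, longer lists.

```python
# Hypothetical, deliberately small stopword list; libraries ship longer ones.
STOPWORDS = {"is", "are", "be", "the", "and", "a", "not"}


def remove_stopwords(tokens, stopwords=STOPWORDS, keep=()):
    """Drop tokens that are in the stopword list, except those explicitly kept."""
    keep = set(keep)
    return [t for t in tokens if t.lower() not in stopwords or t.lower() in keep]


tokens = "the service is not good".split()
print(remove_stopwords(tokens))                # the negation is lost
print(remove_stopwords(tokens, keep={"not"}))  # the negation is preserved
```

Keeping a short allow-list of words such as "not" is only one option; whether it is needed depends on the downstream analysis.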

Stemming/Lemmatize

Description:

Transform each word in the text into its base form/root word

Issue:

Highly dependent on the pre-trained model available in the library
Some words are not available

Important:

Avoid word redundancy (WordCloud, BOW, TF-IDF)
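
To illustrate why base forms reduce redundancy in a bag of words, the sketch below uses a small lookup table in place of a library's lemma dictionary. The table `LEMMAS` and the function `lemmatize` are assumptions for illustration, not the API of any particular package.

```python
from collections import Counter

# Hypothetical lookup table standing in for a library's lemma dictionary.
LEMMAS = {"running": "run", "runs": "run", "ran": "run", "better": "good"}


def lemmatize(token, lemmas=LEMMAS):
    """Return the base form if known; unknown words pass through unchanged."""
    token = token.lower()
    return lemmas.get(token, token)


tokens = ["Running", "runs", "ran", "selfies"]
bag = Counter(lemmatize(t) for t in tokens)
print(bag)
```

The three forms of "run" collapse into a single entry, while "selfies" is left as-is because it is not in the table, which mirrors the "some words are not available" issue.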

Part of Speech Tagging

Description:

Assign a part of speech to each word of a given text (noun, verb, adjective, and others) based on its definition and its context

Issue:

The words recognized depend on the pre-trained model of the library
Some of the words in our text need to be updated

Named Entity Recognition (NER)

Description:

Aims to find named entities in the text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.)

Issue:

Some of the words in our text might not be detected
Need to update the model of the library package used

Tokenize

Description:

Split the text into single chunks/tokens of words

Issue:

Tokenizing can distort the meaning of the context
Need to consider the n-gram size of the tokens used (max 3)

n-gram

trigram

bigram

unigram
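
The unigram, bigram and trigram levels can be sketched with the standard library alone. The regular expression used for tokenizing and the helper names `tokenize` and `ngrams` are assumptions; a library tokenizer may split the text differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The food was not good")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

As unigrams, "not" and "good" are separate tokens; the bigram "not good" keeps them together, which is why the n-gram size matters for context.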

Translate & Detect Language

Description:

Translate to a universal language, which is English

Issue:

Need a package/library that supports multiple languages
Should be open source and free to use without limitation (TBC)
Not able to detect and translate words accurately if the text is written informally

Important:

Avoid redundant words
Helps in analyzing the context

Package/Library

deep_translator

translate

Issue:

Limited number of words can be used per day

Regular Expression

Description:

Issue:

Different texts contain different expressions
Need to form a general pattern that can be used to remove unnecessary expressions in every text
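
One possible shape for such a general pattern is a small pipeline of compiled expressions applied in order. The patterns below (URLs, mentions and hashtags, non-alphanumeric characters) are assumptions about what counts as unnecessary; they would need adjusting for each data source.

```python
import re

# Assumed patterns; real data may need more or different ones.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
NON_TEXT = re.compile(r"[^a-z0-9\s]")  # assumes lowercased English text
SPACES = re.compile(r"\s+")


def clean_text(text):
    """Lowercase the text, strip the assumed noise patterns, collapse spaces."""
    text = text.lower()
    text = URL.sub(" ", text)
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = NON_TEXT.sub(" ", text)
    return SPACES.sub(" ", text).strip()


print(clean_text("Check https://example.com NOW!!! @bob #wow  great :)"))
```

The order matters here: URLs are removed before punctuation is stripped, otherwise fragments of the address would survive as ordinary words.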

Text Pre-Processing

Text processing involves several key steps, each with its unique challenges. Tokenization transforms text into individual words or tokens, but can sometimes mislead the context, necessitating the use of n-grams up to three words.
