by Fathi Razi, 3 years ago

Text Pre-Processing

Text processing involves several key steps, each with its own challenges. Tokenization splits text into individual words (tokens), but single tokens can lose or distort the surrounding context, so n-grams of up to three words are used as well.

Major Processing

Text Normalization

Sentiment Score

Description:

Calculates the sentiment score of the text based on how the individual words in a sentence are categorized


Issue:

Library/Package
textblob
nltk.sentiment.vader
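
To make the word-level scoring idea concrete, here is a minimal sketch that uses only the standard library. The LEXICON dictionary and the sentiment_score function are hypothetical, made up for illustration; they are not the API of textblob or nltk.sentiment.vader, which ship much larger, weighted word lists.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch): word -> polarity.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}

def sentiment_score(text: str) -> float:
    """Average the polarity of the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("I love this great phone"))   # positive score
print(sentiment_score("The battery is terrible"))   # negative score
```

Because each word is scored on its own, a sketch like this treats "not good" as positive; handling negation and intensity is one reason to prefer an established package.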

Remove Stopwords

Description:

Remove common words from the text, such as is, are, be, the, and, etc.
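
A minimal sketch of this step, assuming the text has already been split into tokens. The STOPWORDS set below is a small hand-picked sample for illustration, not the list shipped by any particular library.

```python
# Small illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"is", "are", "be", "the", "and", "a", "an", "of", "to", "in"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The cat and the dog are in the garden".split()
print(remove_stopwords(tokens))  # ['cat', 'dog', 'garden']
```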


Issue:


Stemming/Lemmatize

Description:

Reduce each word in the text to its base form (root word)
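
As a rough illustration of stemming only, the sketch below strips a few common English suffixes. The naive_stem function and its SUFFIXES list are hypothetical and far cruder than an established stemmer such as Porter's; lemmatization is different again, since it needs a dictionary to return real base words.

```python
# Suffixes tried in order; a deliberately tiny, illustrative list (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

for word in ["walking", "jumped", "cats", "running"]:
    print(word, "->", naive_stem(word))
```

The last example comes out as "runn", which is not a real word. Rule-based stemming often produces such forms, while a lemmatizer would be expected to return "run".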


Issue:


Important:


Part of Speech Tagging

Description:

Assign a part of speech (noun, verb, adjective, and others) to each word of a given text, based on its definition and its context


Issue:

Named Entity Recognition (NER)

Description:

Aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.)


Issue:

Tokenize

Description:

Split the text into individual word tokens


Issue:


n-gram
trigram
bigram
unigram
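
The tokenization step and the three n-gram sizes listed above can be sketched with the standard library alone. The tokenize and ngrams helpers are hypothetical names chosen for this example, and the regex-based tokenizer is a simplification of what a dedicated tokenizer would do.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("New York is not cheap")
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, e.g. ('new', 'york') and ('not', 'cheap')
print(ngrams(tokens, 3))  # trigrams
```

Bigrams and trigrams keep pairs such as ('new', 'york') and ('not', 'cheap') together, which single tokens would separate.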

Translate & Detect Language

Description:

Translate the text into a common language, which is English


Issue:


Important:


Package/Library
deep_translator
translate

Issue:

Only a limited number of words can be translated per day

Regular Expression

Description:



Issue:


Important:


Remove symbols, digits, and unnecessary words
Remove leading and trailing whitespace
Remove punctuation
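
The cleaning steps above could be combined roughly as follows with Python's re module. The clean_text function name and the order of the substitutions are assumptions made for this sketch, and removing unnecessary words is left out because it depends on a project-specific word list.

```python
import re

def clean_text(text: str) -> str:
    """Remove digits, punctuation and symbols, then tidy up whitespace."""
    text = re.sub(r"\d+", " ", text)      # remove digits
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation and symbols (keeps letters and "_")
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()                   # remove leading and trailing whitespace

print(clean_text("  Hello, World!! 123 #nlp  "))  # Hello World nlp
```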