Natural Language Processing (5) - English Text Mining Preprocessing Process
Preamble
Link to original article: http://www.cnblogs.com/pinard/p/6756534.html
In the summary of the Chinese text mining preprocessing process, we summarized the preprocessing workflow for Chinese text mining. Here we give a similar summary of the preprocessing process for English text mining (ETM).
ETM Features
The preprocessing of English text differs in part from that of Chinese. First, English text mining preprocessing can generally skip word segmentation (except for special needs), while for Chinese preprocessing, word segmentation is an essential step. Second, most English text is encoded in utf-8, so most of the time we can process it without worrying about encoding conversion, while Chinese text processing must deal with unicode encoding issues. We have already covered these two points in the Chinese text mining preprocessing summary.
English text preprocessing also has its own particularities. The third point is spell checking. In many cases our preprocessing has to include a spell check, because errors like "Helo World" cannot be corrected later during analysis, so they need to be fixed before preprocessing. The fourth point is stemming and lemmatization. This is mainly because English words have singular and plural forms and various tenses, so one word can appear in several different forms. For example, for "countries" and "country", or "wolf" and "wolves", we expect each pair to be counted as a single word.
In the preprocessing steps below, we will focus on how to handle these third and fourth points.
ETM Preprocessing (I) - Data Acquisition
This part is similar for English and Chinese. There are generally two ways to get data: using a corpus prepared by someone else, or using a crawler to go out on the web and collect your own corpus data.
For the first method, there are many common text corpora available online. If you are just learning, you can simply download and use them. But if you need a corpus on a special topic, such as a corpus related to "deep learning", this approach does not work, and we need to use the second method to collect it ourselves.
For the second method, there are many open-source crawler tools. For generic crawling I generally use beautifulsoup. However, when we need corpus data on a specific topic, such as the "deep learning" corpus mentioned above, we need a topic crawler (also called a focused crawler) to do it. For this I usually use ache. ache lets us filter down to the topic corpus we need using keywords or a more powerful classification-model approach. A minimal sketch of grabbing one page is shown below.
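As an illustration only, here is a minimal sketch of fetching a single page and pulling out its text with requests and beautifulsoup; the URL is a placeholder, not one used in the original article.

import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.example.com/some-article")   # hypothetical URL
soup = BeautifulSoup(resp.text, "html.parser")
raw_text = soup.get_text()        # page text with html tags dropped
print(raw_text[:200])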
ETM Preprocessing (II) - Removing Non-Text
This step is mainly for the corpus data collected with a crawler, since the crawled content contains many html tags that need to be removed. Small amounts of non-text content can be removed directly with Python regular expressions (re), while more complex cases can be handled with beautifulsoup. There are also some special non-alphabetic characters that can likewise be removed with Python regular expressions (re).
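A small sketch of this step, under the assumption that raw_html already holds the crawled page content, might look like this:

import re
from bs4 import BeautifulSoup

raw_html = "<p>Deep learning &amp; NLP, 2017!</p>"
text = BeautifulSoup(raw_html, "html.parser").get_text()  # drop html tags
text = re.sub(r"[^a-zA-Z]", " ", text)                    # keep only letters
print(text)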
ETM Preprocessing (III) - Spell Check
Spell checking is generally needed because English texts may contain spelling errors. This step can be omitted if you are confident that the text being analyzed has no spelling problems.
For spell checking, we generally use the pyenchant library. pyenchant is simple to install: just run "pip install pyenchant".
For a paragraph of text, we can identify spelling errors in the following way.
from enchant.checker import SpellChecker
chkr = SpellChecker("en_US")
chkr.set_text("Many peope likee to watch In the Name of People.")
for err in chkr:
    print("ERROR:", err.word)
The output is:
ERROR: peope
ERROR: likee
After finding the errors, we can decide for ourselves whether to correct them. Of course, we can also use the wxSpellCheckerDialog class in pyenchant to decide interactively, through a dialog box, whether to ignore, replace, or replace all misspellings in the text. If you are interested, you can study the official pyenchant documentation.
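If a dialog box is not needed, one simple non-interactive alternative is to ask pyenchant's Dict class for correction candidates; a small sketch (the exact suggestions returned may vary by dictionary version):

import enchant

d = enchant.Dict("en_US")
for word in ["peope", "likee"]:
    if not d.check(word):                 # word is not in the dictionary
        print(word, "->", d.suggest(word))  # list of correction candidates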
ETM Preprocessing (IV) - Stemming and Lemmatization
Stemming and lemmatization are characteristic of English text preprocessing. The two actually have something in common: both aim to find the base form of a word. Stemming is just more aggressive; when it looks for the stem it can produce strings that are not real words. For example, the stem of "imaging" may come out as "imag", which is not a word. Lemmatization, on the other hand, is conservative in principle: it generally only reduces a word to a form that is itself a correct word. Personally, I prefer lemmatization over stemming.
In practice, nltk is generally used for stemming and lemmatization. Installing nltk is also easy: just run "pip install nltk". However, we usually also need to download the nltk corpora. This can be done with the code below; nltk will pop up a dialog box to select what to download. Just choose to download the corpora.
import nltk
nltk.download()
In nltk, the classes that do stemming are PorterStemmer, LancasterStemmer and SnowballStemmer. Personally, I recommend SnowballStemmer. This class can handle many languages, except, of course, Chinese.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")  # Choose a language
stemmer.stem("countries")  # Stem a word
The output is "countri"; this stem is not a word. For lemmatization, you can generally use the WordNetLemmatizer class, i.e. lemmatization based on wordnet.
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('countries'))
The output is "country", which fits the need better. In actual English text mining preprocessing, wordnet-based lemmatization is generally sufficient.
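One point worth noting, shown in the small sketch below: WordNetLemmatizer treats words as nouns by default, so passing a part-of-speech tag helps when the word is a verb or adjective.

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("wolves"))            # noun case, gives "wolf"
print(wnl.lemmatize("watched", pos="v"))  # verb case, gives "watch"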
There is also a demo of stemming and lemmatization here; if you are new to this area you can take a look, it is a good way to get started.
ETM Preprocessing (V) - Lowercase Normalization
Since English words are case-sensitive, we want statistics to treat "Home" and "home" as the same word. Therefore it is generally necessary to convert all words to lower case. This is straightforward to handle with Python's built-in string methods.
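For example, a trivial sketch using the built-in str.lower():

text = "Home is where the heart is. Many people like Home."
print(text.lower())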
ETM Preprocessing (VI) - Introducing Stop Words
English text contains many words that carry no useful information, such as "a", "to", other short words, and some punctuation marks, which we do not want to introduce into the text analysis, so we need to remove them; these are the stop words. A commonly used English stop word list that I use can be downloaded here. Of course there are other versions of stop word lists, but this is the one I commonly use.
When we do feature processing with scikit-learn, we can pass an array as the stop word list via the stop_words parameter. This method is the same as in the earlier article on Chinese stop words, so we will not repeat the code here; you can refer to the previous article.
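Just to make the parameter concrete, here is a minimal sketch; stop_list would normally be loaded from the downloaded stop word file, and the tiny list and sentence here are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer

stop_list = ["a", "to", "the", "of"]            # tiny illustrative stop word list
vectorizer = CountVectorizer(stop_words=stop_list)
X = vectorizer.fit_transform(["a wolf likes to watch the moon"])
print(vectorizer.vocabulary_)                   # stop words no longer appear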
ETM Preprocessing (VII) - Feature Processing
Now we are ready to use scikit-learn for feature processing on our text. In Text Mining Preprocessing: Vectorization and Hash Trick, we covered two feature processing methods: vectorization and the Hash Trick. Vectorization is the more commonly used method, since it can be followed by TF-IDF feature processing. In Text Mining Preprocessing: TF-IDF, we also covered the TF-IDF feature processing method.
The TfidfVectorizer class helps us with the three steps of vectorization, TF-IDF and normalization in one go. Of course, it can also help us remove stop words. This part of the work is identical to the feature processing for Chinese, so you can simply refer to the previous article.
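A minimal sketch of TfidfVectorizer doing all of this in one step; the two-sentence corpus is made up for illustration, and the built-in English stop list is used here instead of a downloaded one.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["many people like to watch movies",
          "people in many countries like wolves"]
tfidf = TfidfVectorizer(stop_words="english")   # built-in English stop word list
X = tfidf.fit_transform(corpus)                 # vectorization + TF-IDF + normalization
print(X.shape)
print(tfidf.vocabulary_)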
ETM Preprocessing (VIII) - Building the Analysis Model
With the TF-IDF feature vector of each text, we can use this data to build classification models, clustering models, or to perform topic modeling. At this point the classification and clustering models are no different from the data analysis of non-natural-language data covered earlier, so the corresponding algorithms can all be used directly. Topic models, however, are a piece specific to natural language processing, and we will cover them separately later.
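As a rough sketch of this idea, the TF-IDF feature matrix can go straight into an ordinary scikit-learn model, here KMeans clustering; the four-sentence corpus is made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = ["wolves hunt in packs", "people like to watch movies",
          "movies about wolves", "people in many countries"]
X = TfidfVectorizer(stop_words="english").fit_transform(corpus)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)   # cluster label assigned to each text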
Summary
Above we have summarized the English text mining preprocessing process; I hope it helps. Note that this process mainly targets common text mining and uses the bag-of-words model, so for some natural language processing needs it will have to be modified. For example, sometimes part-of-speech tagging is needed, and sometimes we also need phrase-aware tokenization for English, for example getting "New York" as one token rather than "New" and "York". So this process is intended for natural language processing beginners; we should choose the appropriate preprocessing methods according to the purpose of our data analysis.