How to do Chinese word segmentation in Python?


Planning to make a Chinese word cloud? Then you first need to learn how to segment Chinese text into words. Follow our tutorial and get hands-on with Python, step by step.


need

In the article "How to make a word cloud with Python", we introduced a method for making word clouds from English text. Did you enjoy it?

As mentioned in that article, we chose English text as the example because it is the easiest to handle. Soon enough, though, readers tried to make word clouds from Chinese text. If you followed the previous method, did you succeed?

Probably not, because an important step is missing.

Look at your English text. You will notice that spaces act as mandatory separators between English words.

For example:

Yes Minister is a satirical British sitcom written by Sir Antony Jay and Jonathan Lynn that was first transmitted by BBC Television between 1980 and 1984, split over three seven-episode series.

However, there is no such space separation for Chinese text. In order to make a word cloud, we first need to know what "words" are in the Chinese text.
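A quick way to see the difference is to apply Python's built-in str.split() to both kinds of text. The sketch below uses two made-up sample strings (illustrative assumptions, not taken from this tutorial's data file).

```python
# Two made-up sample strings, used only for illustration
english = "Yes Minister is a satirical British sitcom"
chinese = "是大臣是一部英国电视情景喜剧"

# split() with no arguments breaks the string on whitespace
english_words = english.split()
chinese_tokens = chinese.split()

print(len(english_words), english_words)
print(len(chinese_tokens), chinese_tokens)
```

The English sentence falls apart into separate words, while the Chinese sentence comes back as one unbroken chunk, because it contains no whitespace to split on.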

You may not think it's a problem at all - I can see the boundaries between words at a glance!

Yes, of course you can. You can manually process 1 sentence, 100 sentences, or even 10,000 sentences. But what if you were given a million sentences?

This is the most significant difference between manual processing and automated computer processing - scale.

Don't give up so quickly; you can get the computer to help.

The question you should be asking is: how can I use a computer to split Chinese text correctly into individual words?

In technical terms, this kind of work is called word segmentation.

Before we introduce the word segmentation tool and its installation, please make sure you have read the article "How to make a word cloud with Python" and completed the preparation steps described there. Then continue with the step-by-step practice in this article.

word segmentation

There are many tools for Chinese word segmentation. Some are free and some are paid. Some can be installed and run on your laptop, while others need an internet connection and do the computing in the cloud.

Today, I'm going to show you how to do Chinese word segmentation for free, using Python, on your laptop.

The tool we will use has a distinctive name: jieba, which literally means "stutter".

Why such a strange name?

After reading this article, you should be able to figure it out for yourself.

Let's start by installing this word segmentation tool. Go back to your terminal or command prompt.

Go to the demo folder you created earlier.

Enter the following command.

pip install jieba

Okay, the Python on your computer now knows how to segment Chinese text.
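If you would like to confirm that the installation succeeded before moving on, one option is to ask Python whether it can locate the package. This check is an addition for convenience rather than a step from the original workflow, and it assumes Python 3, where importlib.util is available.

```python
import importlib.util

# find_spec returns None when the package cannot be located
jieba_installed = importlib.util.find_spec("jieba") is not None

print("jieba installed:", jieba_installed)
```

If this prints False, run the pip command again and check that it targets the same Python that your notebook uses.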

data

In the article "How to make a word cloud with Python", we used the introductory text from the Wikipedia page of the British sitcom "Yes Minister". This time we turn to Wikipedia again and find the Chinese page for the same show. Its translated title is 《是，大臣》 ("Yes, Minister").

After copying the body of the page, save it in the text file yes-minister-cn.txt and move this file into our working directory, demo.

Okay, the Chinese text data for our analysis is now ready.

Don't rush into programming yet. Before we actually start writing code, there is one more thing to do: download a Chinese font file.

Please go to this URL to download simsun.ttf.

After downloading, move this ttf font file to the demo directory as well, and place it with the text file.
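Before starting the notebook, it can save some confusion later to check that both files really are in the working directory. The following is a small sketch that assumes you run it from inside the demo directory and that the two files carry exactly the names used in this tutorial.

```python
from pathlib import Path

# File names as used in this tutorial; paths are relative to the current directory
required_files = ["yes-minister-cn.txt", "simsun.ttf"]

# Map each file name to True (present) or False (missing)
status = {name: Path(name).is_file() for name in required_files}

for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

If either file is reported as missing, move it into the demo directory before continuing.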

code

At the command line, execute:

jupyter notebook

The browser will automatically open and the following screen will be displayed.

Here you can still see the results of our last word cloud exercise. The directory now also contains an additional text file, the Chinese introduction to "Yes, Minister".

Open this file and browse the contents.

We can confirm that the Chinese text has been stored correctly.

Go back to the main page of the Jupyter notebook. Click the New button to create a new notebook. Under Notebooks, select the Python 2 option.

We will be prompted to enter a name for the notebook. To distinguish it from the notebook we used last time for the English word cloud, let's call it wordcloud-cn.

We enter the following 3 statements in the only code cell on the page. After typing them, press Shift+Enter to execute.

filename = "yes-minister-cn.txt"
with open(filename) as f:
    mytext = f.read()
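If the read fails with a decoding error, or the text comes out garbled (this can happen on Python 3 when the system's default encoding is not UTF-8), a variant that states the encoding explicitly may help. The sketch below assumes the file was saved as UTF-8; read_utf8 is a hypothetical helper name introduced here for illustration. It uses io.open, which accepts the encoding argument on both Python 2 and Python 3.

```python
import io
import os


def read_utf8(path):
    """Return the file's contents as text, decoding explicitly as UTF-8."""
    with io.open(path, encoding="utf-8") as f:
        return f.read()


filename = "yes-minister-cn.txt"

# Only read the file if it is present in the current directory
if os.path.exists(filename):
    mytext = read_utf8(filename)
    print(mytext[:50])  # show the first 50 characters as a quick check
```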

Then we try to display the contents of mytext. After entering the following statement, press Shift+Enter again to execute it.

print(mytext)

The results displayed are shown in the figure below.

Since the Chinese text was read without any problem, let's start segmenting it into words. Enter the following two lines.

import jieba
mytext = " ".join(jieba.cut(mytext))

The system will print some messages. They come from the preparation jieba has to do the first time it is used. Just ignore them.

What does the segmentation result look like? Let's see. Enter:

print(mytext)

You will see the segmentation result, as shown in the image below.

The words no longer run together; they are separated by spaces, just like the natural divisions between English words.

Can't wait to turn the segmented Chinese text into a word cloud?

Sure, enter the following statements.
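To look at the segmentation on a smaller scale, you can try a single sentence. The sketch below uses a made-up sample sentence (an assumption for illustration) and jieba.lcut, which returns a list of words where jieba.cut returns a generator. The import is guarded so that the snippet still runs, with an empty result, if jieba is not installed.

```python
try:
    import jieba
except ImportError:
    jieba = None  # not installed; run "pip install jieba" first

sentence = "我们需要先对中文文本进行分词"  # made-up sample sentence

# lcut returns a list of words; fall back to an empty list without jieba
tokens = jieba.lcut(sentence) if jieba is not None else []

# Separate the words with slashes so the boundaries are easy to see
print(" / ".join(tokens))
```

Where exactly the boundaries fall depends on jieba's dictionary and version, so treat the output as something to inspect rather than a fixed answer.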

from wordcloud import WordCloud
wordcloud = WordCloud().generate(mytext)
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

Excited to see your Chinese word cloud?

Unfortunately, the word cloud you get looks like this.

Are you angry, feeling that you have fallen into a pit again?

Don't worry. The problem lies neither with the word segmentation tool nor with the word cloud tool, and certainly not with the steps in our tutorial. A font is simply missing. The default font used by the wordcloud package is an English one and contains no Chinese glyphs, which is why every character is drawn as an empty box. The solution is to take the simsun.ttf file you downloaded earlier and specify it as the output font.

Enter the following statements.

from wordcloud import WordCloud
wordcloud = WordCloud(font_path="simsun.ttf").generate(mytext)
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

This time, the output looks like this.

In this way, by making a Chinese word cloud, we have come to appreciate why Chinese word segmentation is necessary.

Here is a question for you to think about. Compare the Chinese word cloud generated this time with the English word cloud made last time.

Both word clouds come from Wikipedia texts describing the same show. How are they similar, and how do they differ? From this comparison, what interesting patterns can you find between Wikipedia's English and Chinese introductions?
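The reason becomes concrete once you count words. A word cloud sizes each word by how often it occurs, and counting only works when the text can be broken into words. The sketch below uses a made-up, already segmented sample string and plain whitespace splitting as a simplified stand-in for the word cloud tool's own tokenizer (an assumption for illustration; the real rules in wordcloud are more involved).

```python
from collections import Counter

# Made-up sample text, already segmented with spaces
segmented = "大臣 和 秘书 讨论 政策 大臣 同意 秘书 的 政策"

# The same text with the spaces removed, as it would look before segmentation
unsegmented = segmented.replace(" ", "")

# Count how often each whitespace-separated token occurs
segmented_counts = Counter(segmented.split())
unsegmented_counts = Counter(unsegmented.split())

print(segmented_counts.most_common(3))
print(unsegmented_counts.most_common(3))
```

With spaces, the repeated words stand out by their counts. Without them, the whole text is treated as a single token that occurs once, so there is nothing meaningful to scale.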

discussion

Now that you have mastered this method, what kind of Chinese word cloud did you make on your own? Besides word clouds, what other applications of Chinese word segmentation do you know of? Feel free to leave a comment and share with everyone, so we can discuss together.


