Applying machine learning to the analysis of text – The Case of “The Spring and Autumn Annals of Master Yan”
Abraham Meidan, PhD
WizSoft
This paper was submitted to a conference at Renmin University, Beijing, China
In what follows we present a machine learning technique for analyzing text. The examples refer to the text of the Yanzi Chunqiu (YZCQ), “The Spring and Autumn Annals of Master Yan”.
Machine learning is a set of computerized statistical techniques that “learn” by discovering valid patterns in the data. In the standard case the data are saved in a flat file (like an Excel sheet). The rows refer to the records of some population. One column is the field to be explained, the dependent variable. The rest of the columns are the independent variables. The machine learning algorithm is supposed to find a valid model that explains the values of the dependent variable as a function of the values of the independent variables.
As mentioned, the model has to be valid. “Validity” means that when the model is used to predict the values of new records (belonging to the same population that was used for creating the model), the accuracy of the predictions is higher than that of random predictions, or of predictions that are based only on the frequencies of the various values of the dependent variable. If the predictions fail to meet this expectation, the model is said to be the result of coincidental patterns (or, in professional language, the result of overfitting).
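The validity test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the small held-out sample are hypothetical and are not part of the program used in this research.

```python
from collections import Counter

def frequency_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent value of the dependent variable."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def accuracy(predicted, actual):
    """Share of records whose predicted value matches the actual value."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def is_valid(predicted, actual):
    """True when predictions on new records beat the frequency baseline."""
    return accuracy(predicted, actual) > frequency_baseline_accuracy(actual)

# Hypothetical new records (1 = the word exists, 0 = it does not)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 1]

print(frequency_baseline_accuracy(actual))  # baseline from frequencies only
print(accuracy(predicted, actual))          # accuracy of the model's predictions
print(is_valid(predicted, actual))
```

The comparison must be made on records that were not used for building the model; otherwise an overfitted model would pass the test trivially.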
As mentioned, the data should be saved in a flat file like an Excel sheet. These are structured data. Text data are usually saved as unstructured data: there is no straightforward way to present text data in an Excel sheet. Still, one can convert some of the contents of the text into structured data and then apply a machine learning algorithm.
In the present research the question was: Can the existence of certain words in sections of the YZCQ be explained as a function of the existence of other words in these sections?
We selected the following words as the dependent variables:
民
社稷
仁
The independent variables were the rest of the words in the text, but for reasons that will be explained later we restricted the list to the 200 most frequent words.
Since many signs in Chinese have more than one meaning, we created additional flat files where the independent variables were the 200 most frequent pairs of successive signs.
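Selecting the most frequent signs, or the most frequent pairs of successive signs, might look as follows. This is a minimal sketch and not the procedure actually used: the function name is hypothetical, frequency is assumed to mean the total number of occurrences, and punctuation is not treated specially.

```python
from collections import Counter

def most_frequent_units(sections, n=1, top_k=200):
    """Return the top_k most frequent runs of n successive signs.

    n=1 gives single signs, n=2 gives pairs of successive signs.
    Frequency is counted as total occurrences over all sections (an assumption).
    """
    counts = Counter()
    for text in sections:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [unit for unit, _ in counts.most_common(top_k)]

# Hypothetical miniature corpus of three short sections
sections = ["景公問晏子", "晏子對曰", "景公曰善"]
top_signs = most_frequent_units(sections, n=1, top_k=3)
top_pairs = most_frequent_units(sections, n=2, top_k=2)
print(top_signs)
print(top_pairs)
```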
So we analyzed six tables. Three tables (each one for a different dependent variable) had the following structure:
Section# | Word1 | Word2 | Word3 | … | Dependent variable |
1        |       |       |       |   |                    |
2        |       |       |       |   |                    |
3        |       |       |       |   |                    |
4        |       |       |       |   |                    |
…        |       |       |       |   |                    |
And three additional tables referred to pairs of words and had the following structure:
Section# | Pair of Words1 | Pair of Words2 | Pair of Words3 | … | Dependent variable |
1        |                |                |                |   |                    |
2        |                |                |                |   |                    |
3        |                |                |                |   |                    |
4        |                |                |                |   |                    |
…        |                |                |                |   |                    |
In each cell the value was either 1, when the word (or pair of words) exists in the section, or 0, when it does not.
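A table of this kind can be built with a short script. The sketch below is an illustration only: the function name, the three sample sections and the chosen columns are hypothetical, and a substring test stands in for "the word exists in the section".

```python
def presence_table(sections, features, target):
    """Build one row per section: 1 if the unit occurs in the section, else 0.

    `features` are the independent variables (single signs or pairs of signs),
    `target` is the dependent variable. Both are matched as substrings.
    """
    header = ["Section#"] + list(features) + [target]
    rows = []
    for number, text in enumerate(sections, start=1):
        row = [number]
        row += [1 if unit in text else 0 for unit in features]
        row.append(1 if target in text else 0)
        rows.append(row)
    return header, rows

# Hypothetical sections, used only to show the layout
sections = ["民之所欲", "社稷之臣", "君民同樂"]
header, rows = presence_table(sections, features=["之", "君", "樂"], target="民")

print(header)
for row in rows:
    print(row)
```

The same function serves both kinds of table, since a pair of successive signs is simply a two-character substring.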
There are various machine learning algorithms. We selected an algorithm that reveals if-and-only-if rules (necessary and sufficient conditions). You can read about this algorithm in: Abraham Meidan, “Wizsoft’s WizWhy”, in Oded Maimon and Lior Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook, Springer, 2005, pp. 1365-1369.
The main reason for using this algorithm is that it produces an easy-to-understand model. With other algorithms the model is either a black box (as with artificial neural networks) or too complex to be easily understood (as with random forests).
The YZCQ text includes 217 sections. This is quite small. The rule of thumb is that in order to avoid overfitting (that is, revealing accidental patterns) the number of rows should be at least 20 times the number of columns. In the current research, since we had 200 columns (words or pairs of words), we should have had 4,000 sections rather than 217. This is the reason why we limited the research to the 200 most frequent words (or the 200 most frequent pairs of words): if we had included more words or pairs of words, we would have increased the risk of revealing accidental patterns.
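The arithmetic behind this rule of thumb can be written out as follows. The helper names are hypothetical, and the factor of 20 is the rule of thumb quoted above, not a property of any particular algorithm.

```python
def required_rows(n_columns, rows_per_column=20):
    """Number of rows the rule of thumb asks for, given the number of columns."""
    return n_columns * rows_per_column

def max_columns(n_rows, rows_per_column=20):
    """Largest number of columns the rule of thumb allows for a given number of rows."""
    return n_rows // rows_per_column

print(required_rows(200))  # rows needed for 200 columns
print(max_columns(217))    # columns supported by 217 sections
```

With 200 columns the rule asks for 4,000 rows, while 217 sections would support only about 10 columns, which shows how far the present data set is from the recommended ratio.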
The results
Below is the analysis of the text when the dependent variable is: 民
If-and-only-if Rule 1 (out of 2)
The following conditions explain when
民 exists
1) 我 does not exist
and 歛 exists
2) 樂 exists
and 內 exists
3) 厚 does not exist
and 財 exists
4) 得 exists
and 和 exists
5) 知 does not exist
and 怨 exists
6) 是 exists
and 窮 exists
7) 此 does not exist
and 力 exists
8) 使 does not exist
and 禁 exists
9) 欲 exists
and 危 exists
10) 令 exists
and 多 exists
11) 用 exists
and 退 exists
12) 如 exists
and 姓 exists
13) 日 exists
and 厚 exists
14) 政 exists
and 當 exists
15) 用 exists
and 祿 exists
16) 國 exists
and 正 exists
17) 乎 does not exist
and 邪 exists
18) 臣 exists
and 財 exists
19) 請 does not exist
and 禁 exists
20) 王 exists
and 說 exists
21) 成 exists
and 臺 exists
22) 夫 does not exist
and 眾 exists
23) 王 exists
and 長 exists
When at least one of the conditions holds, the probability that
民 exists
is 0.971 (100 out of 103 cases)
When all the conditions do not hold, the probability that
民 does not exist
is 0.930 (106 out of 114 cases)
The total number of cases explained by the set of conditions: 206
The total number of cases in the data: 217
Success rate: 0.949 (206 / 217)
The primary probability that:
民 exists is 0.498 (108 out of 217 cases)
民 does not exist is 0.502 (109 out of 217 cases)
Improvement Factor: 9.818 (min((108*1),(109*1)) / (8*1+3*1))
If-and-only-if Rule 2 (out of 2)
The following conditions explain when
民 does not exist
1) 何 does not exist
and 賜 exists
2) 景 does not exist
and 善 exists
3) 矣 does not exist
and 左 exists
4) 國 does not exist
and 左 exists
5) 然 does not exist
and 乘 exists
6) 景 does not exist
and 齊 exists
7) 此 exists
and 二 exists
8) 公 does not exist
and 對 does not exist
9) 是 does not exist
and 去 exists
10) 問 does not exist
and 辭 exists
11) 所 does not exist
and 一 exists
12) 上 does not exist
and 酒 exists
13) 見 exists
and 入 exists
14) 治 does not exist
and 殺 exists
When at least one of the conditions holds, the probability that
民 does not exist
is 0.776 (90 out of 116 cases)
When all the conditions do not hold, the probability that
民 exists
is 0.812 (82 out of 101 cases)
The total number of cases explained by the set of conditions: 172
The total number of cases in the data: 217
Success rate: 0.793 (172 / 217)
The primary probability that:
民 does not exist is 0.502 (109 out of 217 cases)
民 exists is 0.498 (108 out of 217 cases)
Improvement Factor: 2.400 (min((108*1),(109*1)) / (26*1+19*1))
Each of these two rules explains when the sign 民 exists (or does not exist) in the sections of the YZCQ text. The first rule lists 23 conditions, and each condition is composed of two sub-conditions. The rule says that if condition #1 holds or condition #2 holds or …. condition #23 holds, then there is a high probability that 民 exists in the section, and if all the conditions do not hold, then there is a high probability that 民 does not exist in the section.
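The logical form of such a rule, an "or" over conditions each of which is an "and" of sub-conditions, can be sketched as follows. The data layout and function names are hypothetical; only the three conditions in the excerpt are copied from Rule 1 above, and the two test sections are invented.

```python
def condition_holds(condition, signs_in_section):
    """A condition is a list of (sign, must_exist) sub-conditions joined by 'and'."""
    return all((sign in signs_in_section) == must_exist
               for sign, must_exist in condition)

def rule_fires(conditions, signs_in_section):
    """The rule fires when at least one condition holds ('or' over conditions)."""
    return any(condition_holds(c, signs_in_section) for c in conditions)

# First three conditions of Rule 1 for 民
rule_1_excerpt = [
    [("我", False), ("歛", True)],   # 我 does not exist and 歛 exists
    [("樂", True), ("內", True)],    # 樂 exists and 內 exists
    [("厚", False), ("財", True)],   # 厚 does not exist and 財 exists
]

# Hypothetical sections, represented as sets of the signs they contain
section_a = set("樂內之事")
section_b = set("我歛厚財")

print(rule_fires(rule_1_excerpt, section_a))  # condition 2 holds
print(rule_fires(rule_1_excerpt, section_b))  # no condition holds
```

When the rule fires, the prediction is that 民 exists in the section; when it does not fire, the prediction is that 民 does not exist.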
Note that to say that all the conditions do not hold is to say that (referring to the second rule):
1) 何 exists
or 賜 does not exist, and
2) 景 exists
or 善 does not exist, and
3) 矣 exists
or 左 does not exist, and
etc.
This formulation follows De Morgan's laws in logic, according to which
Not (A and B) is equal to (Not-A or Not-B)
Not (A or B) is equal to (Not-A and Not-B)
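Both laws can be checked mechanically over all truth values, and the negation of a single condition can be derived in the same spirit. The sketch below is illustrative: the function names and the (sign, must_exist) representation are hypothetical.

```python
from itertools import product

def de_morgan_holds():
    """Check both laws on every combination of truth values for A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

def negate_condition(condition):
    """Negate an 'and' of (sign, must_exist) pairs.

    The result is to be read as an 'or' of the flipped sub-conditions.
    """
    return [(sign, not must_exist) for sign, must_exist in condition]

# Condition 2 of Rule 2: 景 does not exist and 善 exists
condition_2 = [("景", False), ("善", True)]
negated = negate_condition(condition_2)  # 景 exists or 善 does not exist

print(de_morgan_holds())
print(negated)
```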
As mentioned the rule presents necessary and sufficient conditions. These conditions refer to all the records (contrary to if-then rules that usually refer to some records only).
When revealing the rules, the target is to maximize the number of records that are explained by the conditions (both the sections where the dependent variable, here 民, exists and the sections where it does not) and to minimize the number of conditions. In other words, the program looks for a model that is as simple as possible and as accurate as possible. There is usually a trade-off between these two targets.
At the end of each rule the program displays the improvement factor: this number denotes how much the predictions based on the rule are better than predictions that are based on the frequencies of the values, taking into account the cost of a miss and the cost of a false alarm. However this issue is beyond the scope of this paper.
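For readers who wish to check the reported figures, the formula printed next to each improvement factor can be reproduced as below. This is a reading inferred from the printed output, not a description of the program's internals: the function name, the parameter names, and the interpretation of the two factors of 1 as the cost of a miss and the cost of a false alarm are assumptions.

```python
def improvement_factor(n_exists, n_not_exists, misses, false_alarms,
                       miss_cost=1, false_alarm_cost=1):
    """Cost of the best frequency-based prediction divided by the cost of the rule.

    misses:       sections where the word exists but the rule predicts it does not
    false_alarms: sections where the word does not exist but the rule predicts it does
    """
    baseline_cost = min(n_exists * miss_cost, n_not_exists * false_alarm_cost)
    rule_cost = misses * miss_cost + false_alarms * false_alarm_cost
    if rule_cost == 0:
        return float("inf")  # a rule with no errors at all
    return baseline_cost / rule_cost

# Counts taken from the reported output for 民
print(round(improvement_factor(108, 109, 8, 3), 3))    # Rule 1
print(round(improvement_factor(108, 109, 26, 19), 3))  # Rule 2
```

With both costs set to 1, the result is simply the number of errors of the best frequency-based guess divided by the number of errors of the rule.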
The second dependent variable was: 社稷
The program discovered only one rule. This rule includes just 5 conditions, so it is much simpler than the previous rules.
If-and-only-if Rule 1 (out of 1)
The following conditions explain when
社稷 does not exist
1) 令 does not exist
and 遂 does not exist
2) 所 exists
and 危 does not exist
3) 國 does not exist
and 朝 does not exist
4) 大 does not exist
and 遂 exists
5) 下 does not exist
and 令 exists
When at least one of the conditions holds, the probability that
社稷 does not exist
is 0.995 (204 out of 205 cases)
When all the conditions do not hold, the probability that
社稷 exists
is 1.00 (12 out of 12 cases)
The total number of cases explained by the set of conditions: 216
The total number of cases in the data: 217
Success rate: 0.995 (216 / 217)
The primary probability that:
社稷 does not exist is 0.940 (204 out of 217 cases)
社稷 exists is 0.060 (13 out of 217 cases)
Improvement Factor: 13.000 (min((13*1),(204*1)) / (1*1+0*1))
Finally we analyzed the dependent variable: 仁
Unfortunately the rules that explain the existence of this sign are much weaker than the previous rules:
The program discovered just one rule that refers to the single signs in each section.
If-and-only-if Rule 1 (out of 1)
The following conditions explain when
仁 does not exist
1) 臣 does not exist
and 聞 does not exist
2) 治 exists
and 歸 does not exist
3) 國 does not exist
and 行 exists
4) 君 does not exist
and 及 does not exist
5) 矣 does not exist
and 若 exists
6) 可 does not exist
and 欲 exists
7) 言 does not exist
and 朝 exists
8) 成 exists
and 入 does not exist
9) 三 does not exist
and 受 exists
10) 焉 does not exist
and 哉 exists
11) 死 does not exist
and 二 exists
12) 景 exists
and 邪 exists
13) 何 does not exist
and 日 exists
14) 君 exists
and 正 exists
15) 乎 does not exist
and 臣 does not exist
16) 矣 does not exist
and 賢 exists
When at least one of the conditions holds, the probability that
仁 does not exist
is 1.00 (190 out of 190 cases)
When all the conditions do not hold, the probability that
仁 exists
is 0.963 (26 out of 27 cases)
The total number of cases explained by the set of conditions: 216
The total number of cases in the data: 217
Success rate: 0.995 (216 / 217)
The primary probability that:
仁 does not exist is 0.880 (191 out of 217 cases)
仁 exists is 0.120 (26 out of 217 cases)
Improvement Factor: 26.000 (min((26*1),(191*1)) / (0*1+1*1))
The second rule refers to the pairs of signs. And once again, only one rule was discovered.
If-and-only-if Rule 1 (out of 1)
The following conditions explain when
仁 does not exist
1) 子曰 does not exist
and 天下 does not exist
2) 也公 does not exist
and 曰君 exists
3) 不足 exists
and !晏 does not exist
4) 何晏 exists
5) 君之 does not exist
and 之以 exists
6) 先君 exists
and !晏 does not exist
7) 曰嬰 does not exist
and 以不 exists
8) 之所 does not exist
and 曰夫 exists
9) 君子 does not exist
and 子晏 exists
10) 夫子 does not exist
and 之〕 exists
11) 天下 does not exist
and 之行 exists
12) 之晏 does not exist
and 曰善 exists
13) 公不 exists
and 之言 does not exist
14) 公問 exists
and 也公 does not exist
15) 不可 does not exist
and 曰臣 exists
16) 古之 does not exist
and 〕不 exists
17) 不可 does not exist
and 君子 exists
18) 景公 does not exist
and 曰嬰 does not exist
When at least one of the conditions holds, the probability that
仁 does not exist
is 0.979 (187 out of 191 cases)
When all the conditions do not hold, the probability that
仁 exists
is 0.846 (22 out of 26 cases)
The total number of cases explained by the set of conditions: 209
The total number of cases in the data: 217
Success rate: 0.963 (209 / 217)
The primary probability that:
仁 does not exist is 0.880 (191 out of 217 cases)
仁 exists is 0.120 (26 out of 217 cases)
Improvement Factor: 3.250 (min((26*1),(191*1)) / (4*1+4*1))
Conclusion
Machine learning techniques can be used to discover patterns in text. We demonstrated the application of a machine learning technique to the YZCQ text, but any text that can be divided into sections can be analyzed with this method.
Are the above-mentioned rules interesting? When the number of conditions is small and the accuracy (probability) is high, the rule is unexpected, and being unexpected is a necessary condition for being interesting. It is a necessary condition but not a sufficient one. The scholars of Chinese texts have to say whether or not these rules may contribute to their research.
If the answer is positive, the following two recommendations are relevant:
When analyzing other text files it is recommended to look for files having many sections (the more the better) in order to avoid revealing accidental patterns.
It is also recommended to use a machine learning algorithm that issues an easy-to-understand model. If you want to use the software program that was used in this research, download the WizWhy demo from www.wizsoft.com. The demo version is identical to the full version except for being limited to 1,000 records. If your text data include more than 1,000 sections, you may send me the Excel file (convert the Chinese signs into Unicode) and I’ll send you the analysis.