Structuring online text database for Mongolic languages
Hello, I’m Hirata, a Research Associates at the IRC. How are you doing after the Golden Week holidays?
For the first article for the “IRC Special Articles,” we would like to talk about “the 1st R training camp for beginners” which I participated in March 2017. This article refers to slides used in the training and “Text mining by R” written by Yuichiro Kobayashi (Ohmsha, 2017) used as its textbook.
First of all, what is R? The answer is as follows:
R is a free programming language which enables to perform statistical and texts analyses.
In other words, R is a “programming language” which enables us to perform statistical analysis or tag texts by repeating the actions that we type on a keyboard and press the Enter key.
R is available for free, and various function can be added by installing free additional packages. If we sort and appropriately search what we want to do, we will able to do various things with R.
In the workshop, following three usage environments for R was introduced:
First of all, R and RStudio are available as downloadable applications. The difference between R and RStudio can be broadly explained as follows. RStudio is an environment with various functions encouraging the use of R; as the top page of the RStudio official website states, “RStudio makes R easier to use. It includes a code editor, debugging & visualization tools (Access: May 10, 2017).”
RStudio Server requires setting up a server. Thus, it will be easier to use R or RStudio than to use RStudio for your individual trial. In this training, some classes used RStudio Server in order to unify the environments for all participants regarding the installed package(s).
From now, I would like to briefly explain how to use R with an example of executing Chi-squared test of 2×2 contingency table.
The following table is made with Excel on the number of male students and female students tallied according to each School (School of Language and Culture Studies and School of International and Area Studies). （Reference: http://www.tufs.ac.jp/abouttufs/outline/data.html , May 10, 2017）
When we make this kind of table with Excel, we can only use the keyboard to input words and figures using the cursor movement keys and the Enter key. We can also finish this work using only the mouse, except when inputting characters.
On the contrary, we have to input the following characters with the keyboard in order to make the same table with R with the name “table.”
It seems difficult to some people. It must be difficult for those who are used to the Windows or the Macintosh operating systems and do all tasks only by character input. However, it seems that this difficult process also has an advantage as it makes it easy to preserve histories of work processes and enables us to easily repeat the same task.
Press input “table” and press the Enter key to confirm if the “table” you made before is correctly made.
Finally, the p-value was so small that I received a finding that “the hypothesis ‘No difference between two schools exists on the ratio of male students to female students’ is denied”, namely, “male-to-female ratio is significantly different.” To tell the truth, I have had no knowledge about statistics until I had participated in this training, but they kindly taught me how to read the results.
So, I learned how to do this kind of task from scratch in this training. The time schedule was as follows: https://www.rbootcamp.org/?p=128
I had this training in a training and accommodation facility located in Makuhari, Chiba. I didn’t know what type of facility it was, but it had almost the same structure as an economy hotel except for a floor for training rooms.
During 5 days of training, I studied the above subjects very hard after installing the software. Trainees received hands-on practice of programming using their own laptop PC in a training room. With lecturers, TA answered questions as needed in making the rounds there. It was a precious experience that I received direct lectures from experts for learning the way to use something totally unknown. With my Macintosh-based PC, making graphs using Japanese language causes character corruption by default, but I finally became able to use Japanese for graph making because a lecturer referred me to a site stating the way of dealing with character corruption. It would have been very difficult for me to deal with this problem myself.
In addition, as the title “beginners” suggests, they gave lectures on all subjects from the basics with opportunities to ask questions at any time, so I was able to learn properly.
R has so many capabilities of processing as the above timetable conveys that I cannot say anything but “Great! We can do many things!” The below website has a list of recommended books. So you choose to buy the above textbook “Text mining by R” or one in the list.
When I had this training in May 2017, the next training camp had already been planned as the name of the training camp suggests “the 1st.” I think we have a rare opportunity in being able to receive information organized by experts through asking questions. Please access the below website if you are interested in R.
Website of R Bootcampers, Japan https://www.rbootcamp.org/
Thank you for reading the first article of “IRC Special Articles.” IRC members will continue to update our messages.
Digital archiving of text and sound data of morphosyntactic microvariation of southern Bantu languages