Summary: In this assignment, you will implement a topological sort algorithm as a way to reorder kanji (; Japanese logographic characters) into an order optimal for memorization.
In this assignment, you will practice applying your knowledge of graph data structures and algorithms to determine an optimal order to learn logographic characters.
Consider the following scenario: as a native speaker of the English language, you wish to learn a language like Chinese () or Japanese (), which uses a special character set. Unlike the English language, which uses the Roman alphabet to encode phonetics, many east Asian languages use Hanzi derived characters, which are symbols with semantics bound to them (to use the technical term: logographic character). Often times, English speakers learning a language using logographic characters find themselves stumped by the apparently insurmountable problem of memorizing thousands of unique characters. They all look so different and yet so similar - can we hope to tell them apart or write them? Wouldn't it be nice if we could use our knowledge of algorithms and data structures to somehow address this problem so that English speakers would have a better shot at learning or ...?
One interesting dependency graph that can be constructed is a "component" graph for Hanzi characters, or Hanzi derived characters (e.g., Kanji). You see, complex characters are often built from simpler characters as components. This sub-structure is very useful! Components not only define a common appearance and stroke order but can indicate phonetics and/or semantics. Furthermore, there are considerably fewer components (hundreds) than actual characters (thousands). (Please note that we use the term component very generally here - it does not map to the traditional notion of a radical.) This sub-structure is particularly useful for people memorizing characters - instead of looking at each character as a monolithic block, one can memorize the individual components, and then reuse that knowledge for more complex characters. The following graph is an example of this for the character ("method"):
Figure 1: Dependency graph for . See image
Reading this graph, we can see that is made from ("gone") and ("water") ( is written as here, a standard transformation). Recursively, is made from ("soil") and . Based on these dependencies, we would want to learn these characters in the order: . If we do that, then instead of learning each stroke in , we just have to remember "method=water+gone", which tells us to write and . (For more information on these ideas, see Remembering the Kanji by James Heisig.) That said, in order to make use of this recursive structure, we would have to learn characters in an order such that we also see simpler characters before we see the more complex characters that are built out of them. To solve this problem, we can produce a topological sort of a graph. A topological sort is an ordered list of vertices in graph such that all dependencies are listed before their dependent.
In this assignment, you will implement an editable graph data structure and a new algorithm for finding the topological sort of a graph. Our goal will be to load two data files, and print out a topological order for the characters that they list. Attached to this post are a base file and two interfaces. Further down the BlackBoard page, you'll find a PDF describing the Java implementation of hashtables (you may find it useful), and a visualization of the dependencies in the data files. The data is formatted as follows:
data-kanji.txt (UTF-8 formatted) stores nodes:
data-components.txt (ASCII formatted) stores edges:
In terms of programming, you will need to:
Below is a sample output for your program that contains the characters in the default order from the file, and the order resulting from a topological sort. Note that your topological sort may be in a slightly different order than is shown below. The key is that simpler characters are shown before the more complicated characters that are built from them.
Figure 2: Sample output. see image.