<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>notebook - Latest Comments in Cedict Sqlite Database</title><link>http://notebook-bwong.disqus.com/</link><description></description><language>en</language><lastBuildDate>Tue, 02 Sep 2008 21:17:24 -0000</lastBuildDate><item><title>Re: Cedict Sqlite Database</title><link>http://notebook.bwong.net/2008/04/27/cedict-sqlite-database/#comment-2138137</link><description>What to say for such a nice gift? You are my newest hero. Not only does your script fulfill my need, it also allows me to learn a bit of Python with something that is a current interest (learning Chinese and building software tools to help out). If ever I make any such tool public, you'll be on my "admirers" list. :)&lt;br&gt;&lt;br&gt;Thanks so much.&lt;br&gt;Jean-Philippe Valois</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Philippe Valois</dc:creator><pubDate>Tue, 02 Sep 2008 21:17:24 -0000</pubDate></item><item><title>Re: Cedict Sqlite Database</title><link>http://notebook.bwong.net/2008/04/27/cedict-sqlite-database/#comment-2138138</link><description>Hi Benny,&lt;br&gt;&lt;br&gt;One point: There's a constraint on CC_CEDIT, where the combination of traditional characters, simplified characters, and pronunciation for a given record is guaranteed to be unique. This is because Chinese words used as common nouns may also be used as proper nouns. The pinyin for proper nouns has the appropriate letters capitalized, and allows one to look up proper nouns, as well as count the number of proper nouns in the database, a metric which could come in handy if one does much reading about local events in China or Taiwan, or even personalities on the world stage.&lt;br&gt;&lt;br&gt;I have also developed a Unihan.txt to SQLite conversion system. I use the tag as the table name, the data point as the key, and the third field as the content. 
Since the data point is the primary key in all of the 88 tables, the indexing is done automatically when the table is populated. &lt;br&gt;&lt;br&gt;To keep things simple, my system only depends on gawk and the sqlite3 interpreter, so all the necessary development files could exist in a single directory. I just started my project on Google Code, but it can be accessed at &lt;a href="http://code.google.com/p/unihan-sqlite-3-database" rel="nofollow"&gt;http://code.google.com/p/unihan-sqlite-3-database&lt;/a&gt;. I plan to have more documentation and SQL scripts available as I find the time to develop and post.&lt;br&gt;&lt;br&gt;The lack of comments in your Python Unihan script makes it a bit hard to follow. It looks like you are creating a single table, unihan, which has columns for various values associated with a data point. This works if you know which records contain which values, but is not very versatile, and doesn't follow the principles of database design. &lt;br&gt;&lt;br&gt;Since some columns hold as few as tens of values, and others hold many thousands, it seems to make better sense to "unflatten" the database, and store each set of values in its own table. The SQL is a bit more complex, but avoids constructions like "select character, definition from unihan where definition != '';" If a number of columns are involved, the amount of logic to test for (sets) of empty values associated with (sets) of real values can grow exponentially. That is the logic behind the normalization of databases. &lt;br&gt;&lt;br&gt;So far, my testing has shown good response when joining a number of tables. I hope to write some scripts to produce the SQL for joining any number of tables. Of course, views could be used to simplify things. I just need to determine which views would be most useful for the average user.&lt;br&gt;&lt;br&gt;If you can explain what your Python program is doing, I'd be interested in knowing. Stay in touch. 
I'm interested in XUL as well, and may be able to be a better-than-average beta tester for your project. Take care.&lt;br&gt;&lt;br&gt;David Rutkowski</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Rutkowski</dc:creator><pubDate>Fri, 23 May 2008 02:56:01 -0000</pubDate></item></channel></rss>