Google App Engine Datastore Example: Dictionary Application


I have a dictionary website that needs to store more than 50,000 XML files of words in the datastore. The following is a snapshot of some of the XML files.

$ ls
ebhi.xml             ekaṃsena.xml          ekunasattati.xml  esaṃ.xml
edhati.xml           ekamsikata.xml        ekūnasattati.xml  esanam.xml
edha.xml             ekaṁsikatā.xml        ekunasatthi.xml   esānaṃ.xml
edhi.xml             ekamsika.xml          ekūnasaṭṭhi.xml   esana.xml
edh.xml              ekaṃsika.xml          ekunasiti.xml     esāna.xml
edisaka.xml          ekanika.xml           ekūnāsīti.xml     esanā.xml
edisa.xml            ekānika.xml           ekunatimsati.xml  esani.xml
ehi-bhikkhu.xml      ekantam.xml           ekūnatiṃsati.xml  esanī.xml
ehibhikkhu.xml       ekantaṃ.xml           ekunavisati.xml   esanta.xml
...

I use Google App Engine (GAE) to host my website, and it took me a while to figure out how to put more than 50,000 XML files into the GAE datastore and retrieve them efficiently. The following is how I do it.

Before we start, we need to define some terminology from GAE. Objects in the GAE datastore are known as entities. Each entity in the datastore has a key that uniquely identifies it. The key consists of the following components:

  • The kind of the entity, which categorizes it for the purpose of Datastore queries
  • An identifier for the individual entity

So now we define four terms: entity, key, kind, id (identifier). Refer to Entities, Properties, and Keys - Python — Google Cloud Platform for more details.
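
To make the idea concrete, here is a toy sketch in plain Python of how a key made of a kind plus an id identifies exactly one entity. It is not the Datastore API: ToyDatastore, the kind name 'Word', and the property name are made up for illustration, and the real datastore does far more than a dictionary lookup.

```python
class ToyDatastore(object):
    """In-memory stand-in for a datastore (illustration only, not ndb)."""

    def __init__(self):
        # Maps a key, modeled here as a (kind, id) tuple, to its properties.
        self._entities = {}

    def put(self, kind, entity_id, **properties):
        key = (kind, entity_id)
        self._entities[key] = dict(properties)
        return key

    def get(self, kind, entity_id):
        # Returns None when no entity has this kind and id.
        return self._entities.get((kind, entity_id))


store = ToyDatastore()
key = store.put('Word', u'edhati', xmlfilename=u'edhati.xml')
found = store.get('Word', u'edhati')
missing = store.get('Word', u'nosuchword')
```

Here key is the pair ('Word', u'edhati'), found holds the stored properties, and missing is None because no entity was stored under that id.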

In my application, I define a kind called PaliWord for storing XML files in the datastore.

from google.appengine.ext import ndb

...

class PaliWord(ndb.Model):
  xmlfilename = ndb.StringProperty()
  xmlfiledata = ndb.TextProperty()

To store one XML file in the datastore, I use the file name without the .xml extension as the id of the entity. The name and the data of the file are stored in the entity.

def storeToNDB(filename, filedata):
  # id = filename without the .xml extension
  paliword = PaliWord(id=filename[0:-4],
                      xmlfilename=filename,
                      xmlfiledata=filedata)
  paliword.put()  # write the entity to the datastore
  return '%s : ok' % filename
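
Slicing off the last four characters with filename[0:-4] assumes that every name really ends in .xml. A more defensive variant could derive the id with os.path.splitext instead. This is only a sketch: word_id_from_filename is a hypothetical helper name, and rejecting other extensions with ValueError is my assumption about what would be useful here.

```python
# -*- coding: utf-8 -*-
import os.path


def word_id_from_filename(filename):
    """Return the file name without its .xml extension, for use as an id."""
    base = os.path.basename(filename)        # drop any directory part
    stem, ext = os.path.splitext(base)       # u'edhati.xml' -> (u'edhati', u'.xml')
    if ext.lower() != '.xml':
        # Assumption: anything that is not an XML file should be rejected.
        raise ValueError('not an xml file: %r' % (filename,))
    return stem


plain_id = word_id_from_filename(u'ehi-bhikkhu.xml')
accented_id = word_id_from_filename(u'ekaṃsena.xml')
```

Names that differ only in diacritics, such as ekunasattati.xml and ekūnasattati.xml, still produce two distinct ids, so each file keeps its own entity.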

To retrieve a specific entity, we simply pass the id, i.e., the file name without the .xml extension, to the get_by_id() method of the PaliWord kind. The call returns the entity if it exists, or None otherwise. We can then process the entity however we want.

def lookup(word):
  paliword = PaliWord.get_by_id(word)
  if paliword:
    return decodeXML(paliword.xmlfilename, paliword.xmlfiledata.encode('utf8'))
  else:
    return u'查無此字(No Such Word)'
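
get_by_id() matches the id exactly, so a query that differs only in surrounding whitespace or in Unicode form (a precomposed ṃ versus an m followed by a combining dot below) would not find the word. One possible safeguard is to normalize the word before the lookup. This is a sketch under assumptions: normalize_word is a hypothetical helper, and it only helps if the ids in the datastore were stored in NFC form as well.

```python
# -*- coding: utf-8 -*-
import unicodedata


def normalize_word(word):
    """Strip surrounding whitespace and compose characters (NFC)."""
    # Assumption: stored ids are in NFC form, so queries should be too.
    # The argument must be a unicode string (str on Python 3).
    return unicodedata.normalize('NFC', word.strip())


decomposed = u'ekam\u0323sena'   # 'm' followed by COMBINING DOT BELOW
composed = u'eka\u1e43sena'      # the same word with a precomposed 'ṃ'
normalized = normalize_word(u'  ' + decomposed + u' ')
```

After this call normalized equals composed. In lookup(), the call would become PaliWord.get_by_id(normalize_word(word)), and the same function would have to be applied to the id when storing the entity.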

In my dictionary application, the use of the App Engine Datastore is very simple and straightforward because of the characteristics of my data. If you have a similar application and want to know how to use the GAE Datastore, I hope this is helpful.