
A Little About Google Docs

You all know what it is or you would not be reading this. One thing to watch out for: Google Docs does not enforce unique file names. You can have two files named ‘really.txt’ in the same folder ‘FacePalm’. Google Docs tells them apart using a generated ID as the unique index. You can get these IDs from the resource object and work with them. For the purposes of this example, though, I am going to assume that your file names are globally unique (the delete in my code would just delete the first file it finds with a given name). By this I mean that you either enforce uniqueness in your code, or simply never upload two files with the same name.

Hopefully after reading this you will see what I mean. You can handle file names that are not globally unique, it just takes more work. You would need to find the collection, then I think there is a parent attribute in the file resource you can use to check that you have the correct file in the correct folder.

Read on to make sense of that.

Hopefully.

Background

I recently had to create a script which backed up files to Google Docs from the command line.  Google have created an API to do this and the language I chose to use for it was Python.  I am not really sure why, probably because I found the already written GoogleCL.  This command line tool uses the gdata-python-client to manage not only Google Docs but things like Calendar and Blogger as well.  I wrote a bash script for my backup logic and got the backup up and running.

I soon discovered a problem, however.  When files were being deleted, like daily backups which were only to be kept for so many weeks, they were going into the recycle bin in Google Docs.  Items in the bin are included in the storage quota and I could not find any way to turn the bin off or set it to empty weekly or something.  I also could not get GoogleCL to bypass the bin with its deletions. So the account maxed out and required manual bin emptying, defeating the whole point of an automated backup.

Reading further into gdata-python I found the appropriate function to bypass the recycle bin.  It was, however, in version 2.0.16 and GoogleCL uses 2.0.14.  In the end I decided to write my own Python script, using Python 2.7.2 and gdata-2.0.16.  See the links in the first paragraph for the websites for these projects.  They all have pretty simple install instructions.  I tested on OSX 10.6 but deployed on Linux.

Please note I am not an expert on this.  Some of my understanding of the subtleties may be a bit off.  As far as I am aware my code works but it is presented here with no warranty or guarantee of being fit for any purpose.  There were various code snippets and forum posts I found but none gave the kind of basic introduction to doing the basic tasks that I was looking for.  I imported the following.  Some may no longer be required but I do not want to go back and check them all.  They are pretty basic imports anyway, plus gdata.

If you are running this code please back up any data first. I take no responsibility for anything you break, damage, destroy or incite into a robotic revolution.

#!/usr/bin/python
# Or wherever your python install is, remember to make the script executable as well (chmod +x)

import sys
import os.path
import gdata.data
import gdata.acl.data
import gdata.docs.client
import gdata.docs.service
import gdata.docs.data
import gdata.sample_util
import argparse

 
 


The Client Object

The client object in gdata is the object you use to communicate with Google Docs.  Think of it as Google Docs itself: you tell it you want to upload something, delete something or do a search.  The following creation code is a slightly modified version of what is in the gdata examples. For this post I have put everything in one script. In my own version I split it out into more functions, but you can do that as you prefer.

APP_NAME = 'What ever you want to name it'
DEBUG = False
client = gdata.docs.client.DocsClient(source=APP_NAME)
client.http_client.debug = DEBUG

# Authenticate the user with ClientLogin, OAuth, or AuthSub.
try:
  gdata.sample_util.authorize_client(
    client,
    service=client.auth_service,
    source=client.source,
    scopes=client.auth_scopes
  )
except gdata.client.BadAuthentication:
  # sys.exit prints the message to stderr and exits with a non-zero status
  sys.exit('Invalid user credentials given.')
except gdata.client.Error:
  sys.exit('Login Error')

When you run this code you will be asked three questions.

  • Which authentication type you want (I chose 1, which is username and password)
  • The email address / username of your Google account
  • The password for your Google account

You can pass these as arguments on the command line when you run the script and gdata will pick them up automatically, like:

./this_script.py --auth_type=1 --email=you@wherever.com --password=********
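If the script also needs flags of its own, one option is to parse only those and leave everything else alone. This is a rough sketch, not something from my working script: the --file and --folder flags and the parse_script_args name are made up for illustration, and it assumes gdata keeps reading its own flags from the command line as described above.

```python
import argparse


def parse_script_args(argv):
    """Parse this script's own flags and hand back anything unrecognised."""
    parser = argparse.ArgumentParser(description='Back up one file to Google Docs')
    # --file and --folder are hypothetical flag names chosen for this sketch
    parser.add_argument('--file', default='/tmp/test_file.txt')
    parser.add_argument('--folder', default='MyFolder')
    # parse_known_args returns (namespace, leftovers) instead of exiting
    # with an error when it meets a flag it does not know about
    return parser.parse_known_args(argv)


args, leftover = parse_script_args(
    ['--file', '/tmp/backup.tar.gz', '--auth_type=1', '--email=you@wherever.com'])
print(args.file)
print(args.folder)
print(leftover)
```

In a real script you would pass sys.argv[1:] instead of the hard-coded list.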

 
 


Upload

So now you have a script which can create a client object without you having to enter all the details at the CLI.  You can upload a file using the following:

# Give the document a title and put the location of the file to be uploaded in LOCAL_FILE
DOC_TITLE='test_file.txt'
LOCAL_FILE='/tmp/test_file.txt'
doc = gdata.docs.data.Resource(type='document', title=DOC_TITLE)
media = gdata.data.MediaSource()
media.SetFileHandle(LOCAL_FILE, 'application/octet-stream')
create_uri = gdata.docs.client.RESOURCE_UPLOAD_URI + '?convert=false'
upload_doc = client.CreateResource(doc, create_uri=create_uri, media=media)

The doc and media lines set the file information and the local path. The create_uri variable is just the default upload URI plus an argument to turn off file conversion. By default, when you upload files to Google Docs it will convert them to its own format. This will ruin your day if you are trying to upload a tar.gz file or something. Unfortunately turning conversion off only works on a paid Google account; if you have a free one then you can only upload files which you are happy to have converted. In that case just take the create_uri=create_uri out of the CreateResource call and it will use the default.
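The upload above hard-codes 'application/octet-stream' as the content type. If you would rather guess it from the file name, the standard library can do that. This is only a sketch under my own assumptions: content_type_for is a name I made up, and I have not checked how Google Docs treats each type.

```python
import mimetypes


def content_type_for(path, default='application/octet-stream'):
    """Guess a MIME type from the file name, falling back to `default`."""
    guessed, encoding = mimetypes.guess_type(path)
    # For compressed files (e.g. .tar.gz) the guessed type describes what is
    # inside the compression, so fall back to the generic default instead.
    if guessed is None or encoding is not None:
        return default
    return guessed


print(content_type_for('/tmp/test_file.txt'))
print(content_type_for('/tmp/backup.tar.gz'))
```

The result could then be passed as the second argument to media.SetFileHandle in place of the fixed string.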

 
 


Searching for a collection

To find a collection you need to pass the uri (see the API for more details) telling Google Docs you want collections. Took me a while to find as it is not in the examples that come with gdata (that I could see).

COLLECTION_QUERY_URI = '/feeds/default/private/full/-/folder'
# Note that the target_collection is still just a resource object like found_resource
target_collection = None
folder_to_find = 'MyFolder'
for resource in client.GetAllResources(uri=COLLECTION_QUERY_URI):
  if resource.title.text == folder_to_find:
    # target_collection = client.GetResourceById(resource.resource_id.text)
    target_collection = resource

This loops through the collections returned by that query and keeps the one with the target name. If more than one collection has that name, the last one wins because the loop does not stop at the first match. I know that the GetResourceById line is pointless, it is just there to show you where you get an ID for these resource objects and how you look it up again.
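The same lookup can be wrapped in a small function so the "not found" case is easy to check. This is a sketch only: find_collection and the Fake* stand-ins are names made up here so it runs without a Google account, and unlike the loop above it stops at the first match.

```python
from collections import namedtuple

COLLECTION_QUERY_URI = '/feeds/default/private/full/-/folder'


def find_collection(client, name):
    """Return the first collection titled `name`, or None if there is none."""
    for resource in client.GetAllResources(uri=COLLECTION_QUERY_URI):
        if resource.title.text == name:
            return resource
    return None


# Stand-ins for the gdata objects (not part of gdata, illustration only)
Title = namedtuple('Title', 'text')
FakeResource = namedtuple('FakeResource', 'title')


class FakeClient(object):
    def GetAllResources(self, uri=None):
        return [FakeResource(Title('Other')), FakeResource(Title('MyFolder'))]


folder = find_collection(FakeClient(), 'MyFolder')
if folder is None:
    raise SystemExit('Collection not found, create it in Google Docs first')
print(folder.title.text)
```

With the real client you would pass the client object created earlier instead of FakeClient(), and check for None before trying to move anything into the collection.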

 
 


Putting the uploaded file in an existing collection

I could not get the file to upload directly to a collection, but I could move it to one.  Make sure you have made the collection in Google Docs first:

client.MoveResource(upload_doc,target_collection)

 
 


Searching for a file

Now to search objects in Google Docs.  This could be refined using specific search queries, but I just queried all documents and looked for a title match, as this post is intended for people who need to do something quickly and simply.

file_to_find = 'test_file.txt'
found_resource = None               # Really, what is wrong with nil or null

# By default this will return all non-collection, non-bin resources
for resource in client.GetAllResources():
  if hasattr(resource.filename, 'text'):
    if resource.filename.text == file_to_find:
      print resource.__dict__ # So you can see all the crap it has
      found_resource = resource
      break
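Since Google Docs does not enforce unique names, the loop above stops at the first match, which may not be the only one. A minimal sketch of collecting every match instead, so you can spot duplicates before acting on them. find_all_by_filename and the stand-in tuples are made-up names for this example, and the list stands in for what client.GetAllResources() returns.

```python
from collections import namedtuple


def find_all_by_filename(resources, wanted):
    """Return every resource whose filename text equals `wanted`."""
    matches = []
    for resource in resources:
        # Some resources have no filename, hence the getattr fallback
        if getattr(resource.filename, 'text', None) == wanted:
            matches.append(resource)
    return matches


# Stand-ins for the gdata objects (not part of gdata, illustration only)
Name = namedtuple('Name', 'text')
FakeResource = namedtuple('FakeResource', 'filename')

resources = [
    FakeResource(Name('test_file.txt')),
    FakeResource(None),
    FakeResource(Name('test_file.txt')),
    FakeResource(Name('other.txt')),
]

matches = find_all_by_filename(resources, 'test_file.txt')
if len(matches) > 1:
    print('Warning: %d files share this name' % len(matches))
```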

 
 


Deleting

To delete the file you just uploaded you could use the following. This uses the resource object you got when you uploaded above. You could also use a resource object which you have searched for.

client.DeleteResource(upload_doc, True, force=True)

The True and force=True make the deletion bypass the recycle bin. This little line is the whole reason I started this in the first place. The rest is just because I did not want to be using this for some things and GoogleCL for others, even though GoogleCL has more features and is likely more robust. Hopefully they will add something like this if/when they move to gdata-2.0.16.
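Putting the search and the delete together gives the "delete the first file of a given name" behaviour mentioned at the top. Treat this as a sketch: delete_first_match and FakeClient are made-up names so it can run without touching a real account, and the only gdata calls it relies on are the ones already shown in this post.

```python
from collections import namedtuple


def delete_first_match(client, wanted):
    """Permanently delete the first resource whose filename is `wanted`.

    Returns True if something was deleted, False if nothing matched.
    """
    for resource in client.GetAllResources():
        if getattr(resource.filename, 'text', None) == wanted:
            # Same arguments as the delete line above, so the bin is bypassed
            client.DeleteResource(resource, True, force=True)
            return True
    return False


# Stand-ins for the gdata objects (not part of gdata, illustration only)
Name = namedtuple('Name', 'text')
FakeResource = namedtuple('FakeResource', 'filename')


class FakeClient(object):
    def __init__(self, names):
        self.resources = [FakeResource(Name(n)) for n in names]
        self.deleted = []

    def GetAllResources(self):
        return list(self.resources)

    def DeleteResource(self, resource, *args, **kwargs):
        # Record the deletion instead of talking to Google Docs
        self.deleted.append(resource.filename.text)
        self.resources.remove(resource)


fake = FakeClient(['a.txt', 'test_file.txt', 'test_file.txt'])
print(delete_first_match(fake, 'test_file.txt'))
print(len(fake.resources))
```

Note that the second 'test_file.txt' survives, which is exactly why unique names make life easier.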

To make sure this works I have put it all in one file and tested it; you can find it here.  I think I got all the mistakes that were in as I typed this (should have done it in the file first).  If there are any discrepancies then the file is correct.  When run, provided you have a /tmp/test_file.txt file (or have changed the code for your own test file) and have a MyFolder ‘Collection’ set up in Google Docs, the script should upload your file, find the MyFolder collection, move your file to MyFolder, then delete your file.

Hopefully this will help somebody.  A document explaining a way to do some simple operations in Google Docs was something I was struggling to find.  If there are better ways to do this let me know.  I realise it could be more robust but it is meant just to show the basic workings.

Seems WordPress will not let me save a .py file, so here is the full code for ease of copying:

 

#!/usr/bin/python
# Or wherever your python install is, remember to make the script executable as well (chmod +x)

import sys
import os.path
import gdata.data
import gdata.acl.data
import gdata.docs.client
import gdata.docs.service
import gdata.docs.data
import gdata.sample_util
import argparse

APP_NAME = 'What ever you want to name it'
DEBUG = False
client = gdata.docs.client.DocsClient(source=APP_NAME)
client.http_client.debug = DEBUG

# Authenticate the user with ClientLogin, OAuth, or AuthSub.
try:
  gdata.sample_util.authorize_client(
    client,
    service=client.auth_service,
    source=client.source,
    scopes=client.auth_scopes
  )
except gdata.client.BadAuthentication:
  # sys.exit prints the message to stderr and exits with a non-zero status
  sys.exit('Invalid user credentials given.')
except gdata.client.Error:
  sys.exit('Login Error')
  
# Give the document a title and put the location of the file to be uploaded in LOCAL_FILE
DOC_TITLE='test_file.txt'
LOCAL_FILE='/tmp/test_file.txt'
doc = gdata.docs.data.Resource(type='document', title=DOC_TITLE)
media = gdata.data.MediaSource()
media.SetFileHandle(LOCAL_FILE, 'application/octet-stream')
create_uri = gdata.docs.client.RESOURCE_UPLOAD_URI + '?convert=false'
upload_doc = client.CreateResource(doc, create_uri=create_uri, media=media)

COLLECTION_QUERY_URI = '/feeds/default/private/full/-/folder'
# Note that the target_collection is still just a resource object like found_resource
target_collection = None
folder_to_find = 'MyFolder'
for resource in client.GetAllResources(uri=COLLECTION_QUERY_URI):
  if resource.title.text == folder_to_find:
    # target_collection = client.GetResourceById(resource.resource_id.text)
    target_collection = resource
  
client.MoveResource(upload_doc,target_collection)

file_to_find = 'test_file.txt'  # Name of the uploaded file to look for in Google Docs
found_resource = None         # Really, what is wrong with nil or null

# By default this will return all non-collection, non-bin resources
for resource in client.GetAllResources():
  if hasattr(resource.filename, 'text'):
    if resource.filename.text == file_to_find:
      print resource.__dict__ # So you can see all the crap it has
      found_resource = resource
      break

client.DeleteResource(upload_doc, True, force=True)

 
/badger
