Uploading, Deleting (bypassing the bin) and Searching for Documents on Google Docs with python gdata

Posted: April 11, 2012 in Software Programming

A Little About Google Docs

You all know what it is or you would not be reading this. One thing to watch out for: Google Docs does not enforce unique file names. You can have two files named ‘really.txt’ in the same folder ‘FacePalm’. Google Docs stores each file under a generated ID as its unique index, and you can get these IDs and work with them from the resource object. For the purposes of this example, though, I am going to assume that you are going to make file names globally unique (the delete in my code would just remove the first file it finds with a given name). By that I mean you enforce uniqueness yourself, either in your code or simply by never creating duplicates.

Hopefully after reading this you will see what I mean. You can handle non-unique file names, it just takes more work: you would need to find the collection first, and I think there is a parent attribute on the file resource you can use to check that you have the correct file in the correct folder.
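
If you do need to handle duplicate names, one approach is to list just the contents of the folder you care about and match the title there. This is only a sketch: it uses the client and collection search introduced later in this post, and assumes a collection resource exposes its contents feed via content.src.

# Sketch: match a file name only within one collection rather than globally.
# Assumes target_collection was found as shown later in this post, and that its
# content.src attribute points at the feed listing the collection's contents.
file_to_find = 'really.txt'
match = None
for resource in client.GetAllResources(uri=target_collection.content.src):
  if resource.title.text == file_to_find:
    match = resource
    break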

Read on to make sense of that.

Hopefully.

Background

I recently had to create a script which backed up files to Google Docs from the command line.  Google have created an API to do this and the language I chose to use for it was Python.  I am not really sure why, probably because I found the already written GoogleCL.  This command line tool uses the gdata-python-client to manage not only Google Docs but things like Calendar and Blogger as well.  I wrote a bash script for my backup logic and got the backup up and running.

I soon discovered a problem however.  When files were being deleted, like daily backups which were only to be kept for so many weeks, they were going into the recycle bin in Google Docs.  Items in the bin count towards the storage quota and I could not find any way to turn the bin off or set it to empty weekly or anything like that.  I also could not get GoogleCL to bypass the recycle bin when deleting.  So the account maxed out and required manual bin emptying, defeating the whole point of an automated backup.

Reading further into gdata-python I found the appropriate function to bypass the recycle bin.  It was however in version 2.0.16 and GoogleCL uses 2.0.14.  In the end I decided to write my own Python script, using Python 2.7.2 and gdata-2.0.16.  See the links in the first paragraph for the websites for these projects.  They all have pretty simple install instructions.  I tested on OSX 10.6 but deployed on Linux.

Please note I am not an expert on this.  Some of my understanding of the subtleties may be a bit off.  As far as I am aware my code works, but it is presented here with no warranty or guarantee of being fit for any purpose.  There were various code snippets and forum posts around, but none gave the kind of simple introduction to the basic tasks that I was looking for.  I imported the following.  Some may no longer be required but I do not want to go back and check them all.  They are pretty basic imports anyway, plus gdata.

If you are running this code please back up any data first; I take no responsibility for anything you break, damage, destroy or incite into a robotic revolution.

#!/usr/bin/python
# Or wherever your python install is, remember to make the script executable as well (chmod +x)

import sys
import os.path
import gdata.data
import gdata.acl.data
import gdata.docs.client
import gdata.docs.service
import gdata.docs.data
import gdata.sample_util
import argparse

 
 


The Client Object

The client object in gdata is the object you use to communicate with Google Docs.  Think of it as Google Docs itself: you tell it you want to upload something, delete something or do a search.  The following creation code is a slightly modified version of what is in the gdata examples.  For this post I have put everything in one script; in my own version I split it out into more functions, but you can structure it however you prefer.

APP_NAME = 'Whatever you want to name it'
DEBUG = False
client = gdata.docs.client.DocsClient(source=APP_NAME)
client.http_client.debug = DEBUG

# Authenticate the user with ClientLogin, OAuth, or AuthSub.
try:
  gdata.sample_util.authorize_client(client,
                                     service=client.auth_service,
                                     source=client.source,
                                     scopes=client.auth_scopes)
except gdata.client.BadAuthentication:
  exit('Invalid user credentials given.')
except gdata.client.Error:
  exit('Login Error')

When you run this code you will be asked three questions:

  • Which authentication type you want; I chose 1, which is username and password
  • The email address / username of your Google account
  • The password for your Google account

You can pass these as arguments on the command line when you run the script and gdata will pick them up automatically, like so:

./this_script.py --auth_type=1 --email=you@wherever.com --password=********

 
 


Upload

So now you have a script which can create a client object without you having to enter all the details interactively.  You can upload a file using the following:

# Give the document a title and put the location of the file to be uploaded in LOCAL_FILE
DOC_TITLE='test_file.txt'
LOCAL_FILE='/tmp/test_file.txt'
doc = gdata.docs.data.Resource(type='document', title=DOC_TITLE)
media = gdata.data.MediaSource()
media.SetFileHandle(LOCAL_FILE, 'application/octet-stream')
create_uri = gdata.docs.client.RESOURCE_UPLOAD_URI + '?convert=false'
upload_doc = client.CreateResource(doc, create_uri=create_uri, media=media)

Now, the doc and media lines set the file information and the local path.  The create_uri variable is just the default upload URI plus an argument to turn off file conversion.  By default, when you upload files to Google Docs it converts them to its own format, which will ruin your day if you are trying to upload a tar.gz file or something similar.  Unfortunately turning conversion off only works on a paid Google account; if you have a free one then you can only upload files which you are happy to have converted.  In that case just take the create_uri=create_uri out of the CreateResource call and it will use the default.
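
For a free account that would look something like this (the same call, just letting Google Docs convert the file):

# Free account: omit create_uri and accept the conversion to Google's own format
upload_doc = client.CreateResource(doc, media=media)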

 
 


Searching for a collection

To find a collection you need to pass a uri telling Google Docs that you want collections (see the API documentation for more details).  This took me a while to find as it is not in the examples that come with gdata (that I could see).

COLLECTION_QUERY_URI = '/feeds/default/private/full/-/folder'
# Note that the target_collection is still just a resource object like found_resource
target_collection = None
folder_to_find = 'MyFolder'
for resource in client.GetAllResources(uri=COLLECTION_QUERY_URI):
  if resource.title.text == folder_to_find:
    # target_collection = client.GetResourceById(resource.resource_id.text)
    target_collection = resource

This loops through each collection returned by that URI and keeps the one with the target name.  I know the GetResourceById line is pointless here; it is just there to show you where to get an ID for these resource objects and how to look one up again.
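
If you want to come back to a resource later (for example in a different run of the script), you can stash that ID and look the resource up again, along the lines of the commented line above. A minimal sketch, assuming the stored ID is still valid:

# Store the generated ID now...
saved_id = target_collection.resource_id.text
# ...and look the same resource up again later, e.g. in another run of the script
same_collection = client.GetResourceById(saved_id)
print same_collection.title.text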

 
 


Putting the uploaded file in an existing collection

I could not get the file to upload directly into a collection, but I could move it into one afterwards.  Make sure you have created the collection in Google Docs first.
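
One small addition of my own, not in the original script: if the collection search above did not find anything, target_collection will still be None and the move will fail, so it is worth checking first.

# Guard against the collection not being found before trying to move into it
if target_collection is None:
  exit('Could not find the target collection, create it in Google Docs first')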

client.MoveResource(upload_doc, target_collection)

 
 


Searching for a file

Now to search for objects in Google Docs.  This could be refined using more specific queries, but I just listed all documents and searched for a title match.  You can query for specific things (there is a sketch of that after the code below), but this post is intended for people who need to do something quickly and simply.

file_to_find = 'test_file.txt'
found_resource = None               # Really, what is wrong with nil or null

# By default this will return all non-collection, non-bin resources
for resource in client.GetAllResources():
  if hasattr(resource.filename, 'text'):
    if resource.filename.text == file_to_find:
      print resource.__dict__ # So you can see all the crap it has
      found_resource = resource
      break
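
Alternatively, if your gdata version includes DocsQuery (it is mentioned in the API docs), you can ask the server to do the title filtering for you instead of looping over everything. I have not used this in the backup script itself, so treat it as a sketch of the documented query interface:

# A sketch of a server-side title search, assuming gdata.docs.client.DocsQuery is available
query = gdata.docs.client.DocsQuery(title=file_to_find, title_exact='true', show_collections='false')
results = client.GetAllResources(q=query)
if results:
  found_resource = results[0]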

 
 


Deleting

To delete the file you just uploaded you could use the following.  This uses the resource object you got back when you uploaded above.  You could also use a resource object which you have searched for, as shown below.

client.DeleteResource(upload_doc, True, force=True)

The True and force=True make the deletion bypass the recycle bin; this little line is the whole reason I started this in the first place.  The rest is just there because I did not want to be using this script for some things and GoogleCL for others, even though GoogleCL has more features and is likely more robust.  Hopefully they will add something like this if/when they move to gdata-2.0.16.
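
For example, the same call works on the resource located by the search above (assuming the search found something):

# Delete a resource located by the search above rather than the one just uploaded
if found_resource is not None:
  client.DeleteResource(found_resource, True, force=True)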

To make sure this works I have put it all in one file and tested it, you can find it here.  I think I have fixed all the mistakes that crept in as I typed this (I should have written it in the file first).  If there are any discrepancies then the file is correct.  When run, provided you have a /tmp/test_file.txt file (or have changed the code to point at your own test file) and have a ‘MyFolder’ collection set up in Google Docs, the script should upload your file, find the MyFolder collection, move your file into MyFolder and then delete it.

Hopefully this will help somebody.  A document explaining how to do some simple operations in Google Docs was something I struggled to find.  If there are better ways to do this, let me know.  I realise it could be more robust, but it is meant just to show the basic workings.

Seems WordPress will not let me save a .py file, so here is the full code for ease of copying:

 

#!/usr/bin/python
# Or wherever your python install is, remember to make the script executable as well (chmod +x)

import sys
import os.path
import gdata.data
import gdata.acl.data
import gdata.docs.client
import gdata.docs.service
import gdata.docs.data
import gdata.sample_util
import argparse

APP_NAME = 'Whatever you want to name it'
DEBUG = False
client = gdata.docs.client.DocsClient(source=APP_NAME)
client.http_client.debug = DEBUG

# Authenticate the user with ClientLogin, OAuth, or AuthSub.
try:
  gdata.sample_util.authorize_client(client,
                                     service=client.auth_service,
                                     source=client.source,
                                     scopes=client.auth_scopes)
except gdata.client.BadAuthentication:
  exit('Invalid user credentials given.')
except gdata.client.Error:
  exit('Login Error')
  
# Give the document a title and put the location of the file to be uploaded in LOCAL_FILE
DOC_TITLE='test_file.txt'
LOCAL_FILE='/tmp/test_file.txt'
doc = gdata.docs.data.Resource(type='document', title=DOC_TITLE)
media = gdata.data.MediaSource()
media.SetFileHandle(LOCAL_FILE, 'application/octet-stream')
create_uri = gdata.docs.client.RESOURCE_UPLOAD_URI + '?convert=false'
upload_doc = client.CreateResource(doc, create_uri=create_uri, media=media)

COLLECTION_QUERY_URI = '/feeds/default/private/full/-/folder'
# Note that the target_collection is still just a resource object like found_resource
target_collection = None
folder_to_find = 'MyFolder'
for resource in client.GetAllResources(uri=COLLECTION_QUERY_URI):
  if resource.title.text == folder_to_find:
    # target_collection = client.GetResourceById(resource.resource_id.text)
    target_collection = resource
  
client.MoveResource(upload_doc, target_collection)

file_to_find = 'test_file.txt'  # The title of the file to search for in Google Docs (the one uploaded above)
found_resource = None           # Really, what is wrong with nil or null

# By default this will return all non-collection, non-bin resources
for resource in client.GetAllResources():
  if hasattr(resource.filename, 'text'):
    if resource.filename.text == file_to_find:
      print resource.__dict__ # So you can see all the crap it has
      found_resource = resource
      break

client.DeleteResource(upload_doc, True, force=True)

 
/badger

Comments
  1. Deniz says:


    gdata-2.0.17
    googlecl_0.9.14-2_all.deb


    Traceback (most recent call last):
    File "/home/deniz/bin/gugl", line 38, in
    upload_doc = client.CreateResource(doc, create_uri=create_uri, media=media)
    File "/usr/local/lib/python2.7/dist-packages/gdata/docs/client.py", line 300, in create_resource
    return uploader.upload_file(create_uri, entry, **kwargs)
    File "/usr/local/lib/python2.7/dist-packages/gdata/client.py", line 1085, in upload_file
    start_byte, self.file_handle.read(self.chunk_size))
    File "/usr/local/lib/python2.7/dist-packages/gdata/client.py", line 1044, in upload_chunk
    raise error
    gdata.client.RequestError: Server responded with: 400, Invalid Request

  2. shadowbadger says:

    I may or may not have time to look at this. However, the above is insufficient. Would need at least the code being run, details of system. It would also be preferable to post asking a question instead of just a trace.

  3. Deniz says:

I just launched the script with no args, and after auth it said:
    Traceback
    ...
    Invalid Request

    system: Ubuntu 12.04.2 LTS amd64
    kernel: 3.2.0-23-lowlatency x86_64
    cpu: Pentium(R) Dual-Core CPU E5300 @ 2.60GHz

  4. shadowbadger says:

    OK,

I have not tested this since gdata-2.0.16, about a year ago, although I do have an app running fine on 2.0.17 currently. I will redo the test example based on that.

  5. shadowbadger says:

    I am not sure what has gone on here. I went to look at the code today and it works. I have confirmed this with gdata-2.0.17 on pythons 2.7.2 and 2.7.3. This is on OSX (although the 2.7.3 is on macports, not the OSX native python).

First, just check if it works now; if so then I suspect a temporary Google glitch (unless it keeps happening).

If you still cannot get it to work I suggest testing with a Python compiled from source (you can use the --prefix argument to install it in a place separate from your system Python). Then install gdata into that Python (you may need to install argparse as well).

    Doing so would indicate whether it is a problem specific to the Ubuntu python stuff. Did you install gdata from Google Code source here: https://code.google.com/p/gdata-python-client/downloads/detail?name=gdata-2.0.17.tar.gz

    Or did you use an Ubuntu package?

    Also, you do not need googlecl for this script (I wrote this because of flaws in googlecl at the time, although only with the minimal functionality I needed).

  6. shadowbadger says:

    To the user who replied saying the link was broken, I am suspicious of your post being aimed at me allowing a reply so you can spam. I have checked the links and they all seem to work. Can you please reply stating specifically which URL is not working and I will check it out.

  7. […] i am trying 1 of example written in python from this link: when i try to upload_doc […]

  8. Rich says:

    I’m struggling with resource deletion when attempting to upload a more recent version of a file: the call to GetAllResources() doesn’t appear to find the file that I know is there: I’m printing the filenames in the loop, so can see that some subfolders are being searched, but apparently not all. I’ve seen some mention of paging in the API docs (https://pythonhosted.org/gdata/docs/fetching.html#retrieving-a-list-of-resources), but GetAllResources sounds like paging shouldn’t be required. Anyone else seeing limited return collections from GetAllResources()?

    • shadowbadger says:

      I have not seen this issue. GetAllResources has just worked for me, I am using this on a near empty drive though.

Have you tried using the next-page link, like this with GetResources()?

      resource_feed = client.GetResources()
      nextpage_feed = client.GetResources(uri=resource_feed.GetNextLink().href)

I have not looked into the source code, but GetAllResources may just be a quick call to GetResources, specifying search variables of “anything” rather than making you specify them.

My way of doing this was quick and dirty; it would probably be better to actually use GetResources and specify the details of the file you want. There is, I think, a way to do that.

      • Rich says:

        Sorry slow reply: I’m afraid the Pi on which I was doing this work got killed by some clumsy soldering, and it took a while to revive. My notes from the time are incomplete, but the issue appeared to be something to do with file types & conversion from uploaded text to Google’s document representation. I definitely got deletion working by title, but ultimately moved to allowing the updater to convert my text file to the document representation, which seemed to mean that subsequent uploads overwrite existing, which suited my needs.
