Thursday, February 3, 2011

Google Goggles API

Update 5/18/11 - Finally I had the time to update the post. I now have a working Google Goggles API set. In case you are wondering what changed: The protobuffer number a has changed from x + 401 to x + 32, where x is the image size. Also, there are trailing bytes that you need to send after the image byte array when posting the photo. All of this information is detailed in the post - just copy the code as-is and it should work.  Also, I am no longer having issues sending images larger than 140 KB.  I do not know what the upper limit is, but it seems to have increased.

Update 2/14/11 - I have been battling a class of failures that I discovered when posting an image. It turns out Goggles expects images to be less than 140 KB in size. If you pass in an image that is far too large (say, 1 MB), you will get a 404 failure. If you pass in an image slightly larger than 140 KB, you will get a 200 OK, but with an empty response.

Google Goggles is a service that allows you to search the web using pictures taken from your mobile device. I was quite taken with it when I took a picture of a painting in my living room, and it correctly recognized it as Banksy's Flower Thrower stencil.

I started to look online for a Goggles API set to help me with a WP7 application that I have in mind, but did not find anything. So I went about trying to reverse engineer it myself, and I came up with some successful results.

A word of caution, though: Google hasn't officially published an API set yet, so a lot of this is subject to change. In other words, use it at your own risk.

So let's start. You do not need to authenticate to use the Goggles API. To use the API, you will need to generate the following:

  • CSSID: a string composed of 16 random hexadecimal digits. An example of a CSSID would be 43CBE97C0C31FE5A. The CSSID will remain valid for subsequent requests until you receive an error using it; then you will have to generate a new one.

Once you have generated a CSSID, you will need to send a POST request to http://www.google.com/goggles/container_proto?cssid=[your cssid] with the following headers:

    Content-Type: application/x-protobuffer
    Pragma: no-cache

and with the following body (as a byte array, NOT strings - see code below for clarification):

    22, 00, 62, 3C, 0A, 13, 22, 02, 65, 6E, BA, D3, F0, 3B, 0A, 08, 01, 10, 01, 28,
    01, 30, 00, 38, 01, 12, 1D, 0A, 09, 69, 50, 68, 6F, 6E, 65, 20, 4F, 53, 12, 03,
    34, 2E, 31, 1A, 00, 22, 09, 69, 50, 68, 6F, 6E, 65, 33, 47, 53, 1A, 02, 08, 02,
    22, 02, 08, 01
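
Since the body has to go over the wire as raw bytes rather than as the text "22, 00, 62, ...", here is a minimal sketch of turning the hex listing above into a byte array. The names HexBody, Parse and ToPrintable are my own and not part of anything Google publishes. Printing the result with the non-printable bytes masked also shows that the body appears to carry a language code and a device description (en, iPhone OS, 4.1, iPhone3GS), which suggests it imitates the iPhone client.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class HexBody
{
    // The CSSID validation body exactly as listed in the post, as hex text.
    public const string CssidBodyHex =
        "22, 00, 62, 3C, 0A, 13, 22, 02, 65, 6E, BA, D3, F0, 3B, 0A, 08, 01, 10, 01, 28, " +
        "01, 30, 00, 38, 01, 12, 1D, 0A, 09, 69, 50, 68, 6F, 6E, 65, 20, 4F, 53, 12, 03, " +
        "34, 2E, 31, 1A, 00, 22, 09, 69, 50, 68, 6F, 6E, 65, 33, 47, 53, 1A, 02, 08, 02, " +
        "22, 02, 08, 01";

    // Parses "22, 00, 62" style text into the raw bytes that go in the request body.
    public static byte[] Parse(string hex)
    {
        return hex
            .Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => byte.Parse(token, NumberStyles.HexNumber, CultureInfo.InvariantCulture))
            .ToArray();
    }

    // Replaces every non-printable byte with '.', much like the ASCII pane of a packet trace.
    public static string ToPrintable(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append(b >= 0x20 && b <= 0x7E ? (char)b : '.');
        }

        return sb.ToString();
    }
}
```

Calling HexBody.Parse(HexBody.CssidBodyHex) yields the 64 bytes to POST, and HexBody.ToPrintable on that array makes the embedded strings visible.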

If your post is successful, you should receive a 200 OK. This code demonstrates how to create and post your CSSID (synchronous requests are used for simplicity):

namespace GoogleGoggles.NET
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net;

    public class GoogleGoggles
    {
        // The POST body required to validate the CSSID.
        private static byte[] CssidPostBody = new byte[] { 34, 0, 98, 60, 10, 19, 34,
            2, 101, 110, 186, 211, 240, 59, 10, 8, 1, 16, 1, 40, 1, 48, 0, 56, 1, 18,
            29, 10, 9, 105, 80, 104, 111, 110, 101, 32, 79, 83, 18, 3, 52, 46, 49, 
            26, 0, 34, 9, 105, 80, 104, 111, 110, 101, 51, 71, 83, 26, 2, 8, 2, 34,
            2, 8, 1 };
 
        // Bytes trailing the image byte array. Look at the next code snippet to see
        // where it is used in SendPhoto() method.
        private static byte[] TrailingBytes = new byte[] { 24, 75, 32, 1, 48, 0, 146,
            236, 244, 59, 9, 24, 0, 56, 198, 151, 220, 223, 247, 37, 34, 0 };

        // Generates a cssid.
        private static string Cssid
        {
            get
            {
                Random random = new Random((int)DateTime.Now.Ticks);
                return string.Format(
                    "{0}{1}",
                    random.Next().ToString("X8"), 
                    random.Next().ToString("X8"));
            }
        }

        static void Main(string[] args)
        {
            string cssid = GoogleGoggles.Cssid;
            GoogleGoggles.ValidateCSSID(cssid);

            // See next code snippet for SendPhoto()
            //GoogleGoggles.SendPhoto(cssid, yourImageByteArray);
            Console.ReadLine();
        }

        // Validates the CSSID we just created, by POSTing it to Goggles.
        private static void ValidateCSSID(string cssid)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                string.Format(
                "http://www.google.com/goggles/container_proto?cssid={0}", 
                cssid));

            GoogleGoggles.AddHeaders(request);
            request.Method = "POST";

            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(
                    GoogleGoggles.CssidPostBody, 
                    0,
                    GoogleGoggles.CssidPostBody.Length);

                stream.Flush();
            }

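            // GetResponse() throws a WebException for non-success status codes, so
            // reaching the using block means the CSSID was accepted. Disposing the
            // response releases the underlying connection.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("CSSID validation returned {0}", response.StatusCode);
            }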
        }

        private static void AddHeaders(HttpWebRequest request)
        {
            request.ContentType = "application/x-protobuffer";
            request.Headers["Pragma"] = "no-cache";
            request.KeepAlive = true;
        }
    }
}

If you received a 200 OK response, you can reuse the same CSSID for future requests until you receive an error using it. As a rule of thumb, I generate a new one each time I launch the application, and keep using it until I get an error.
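
That reuse rule can be sketched as a small wrapper. CssidSession and its two delegates are hypothetical names of mine: generate would wrap the Cssid property and validate would wrap ValidateCSSID from the code above. I am also assuming that a failed request surfaces as a WebException, which is how HttpWebRequest reports non-success status codes.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of any Goggles API): caches one CSSID and
// replaces it only after a request made with it fails.
public sealed class CssidSession
{
    private readonly Func<string> generate;    // creates a fresh CSSID
    private readonly Action<string> validate;  // POSTs the CSSID; throws on failure
    private string current;

    public CssidSession(Func<string> generate, Action<string> validate)
    {
        this.generate = generate;
        this.validate = validate;
    }

    // Runs a request with the cached CSSID. If it fails, a new CSSID is generated
    // and validated, and the request is retried exactly once. A second failure
    // propagates to the caller.
    public T Execute<T>(Func<string, T> request)
    {
        if (this.current == null)
        {
            this.Renew();
        }

        try
        {
            return request(this.current);
        }
        catch (WebException)
        {
            this.Renew();
            return request(this.current);
        }
    }

    private void Renew()
    {
        string candidate = this.generate();
        this.validate(candidate);
        this.current = candidate;
    }
}
```

Retrying only once is a deliberate choice in this sketch: if a freshly validated CSSID also fails, the problem is more likely the request itself (an oversized image, for instance) than the CSSID.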

Moving on. We can start posting images now that you have a valid CSSID. But first we will need to understand a bit about Google's ProtoBuffer. Protocol Buffers, or ProtoBuffer, is Google's way of serializing structured data for use in communication protocols, data storage, and more (in their own words). Explaining ProtoBuffer in detail is out of the scope of this post - I will only cover the basics, so you will have to follow the link if you would like to know more about it. First you will need the following:

  • A JPEG image - I haven't played with other formats. If you will be using images taken from a phone camera (if you are building a phone application, for example), you will likely need to resize them, since high-megapixel phone cameras are common nowadays.
  • The size of the jpeg image in bytes.

You will then need to calculate the following integers. Let x be the size of the image in bytes.

    a = x + 32
    b = x + 14
    c = x + 10

Now, you will need to encode a, b, c and x into varints. Let's assume x = 27985. To encode it into a varint, do the following:
  1. Convert the number into binary:
    110110101010001
  2. From right to left, divide the bits into groups of 7 bits each. Append zeros to the most left group to make it 7 bits if needed:
    0000001 1011010 1010001
  3. Reverse the groups:
    1010001 1011010 0000001
  4. Convert the groups of 7 bits into bytes. From left to right, set the msb (most significant bit) if there are further bytes to come. When you reach the last byte, the msb should be 0:
    11010001 11011010 00000001
    0xD1DA01
So x is 0xD1DA01 in varint. Repeating the exercise for a, b and c gives 0xF1DA01, 0xDFDA01 and 0xDBDA01 respectively.
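
To double-check a hand-encoded value, it helps to be able to go both ways. The following is a minimal sketch of the four steps above plus the reverse operation. VarintSketch, Encode and Decode are names I made up for illustration, and the sketch only deals with non-negative 32-bit values, which is all that image sizes need.

```csharp
using System;
using System.Collections.Generic;

public static class VarintSketch
{
    // Encodes a value following the steps above: take 7 bits at a time from
    // the right and set the msb on every byte except the last one.
    public static byte[] Encode(uint value)
    {
        var bytes = new List<byte>();
        do
        {
            byte group = (byte)(value & 0x7F);  // lowest 7 bits
            value >>= 7;
            if (value != 0)
            {
                group |= 0x80;                  // more groups follow
            }

            bytes.Add(group);
        }
        while (value != 0);

        return bytes.ToArray();
    }

    // Decodes the varint that starts at offset and reports how many bytes it used.
    public static uint Decode(byte[] data, int offset, out int length)
    {
        uint result = 0;
        int shift = 0;
        length = 0;
        while (true)
        {
            byte b = data[offset + length];
            length++;
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
            {
                return result;                  // msb clear: this was the last byte
            }

            shift += 7;
            if (shift > 28)
            {
                throw new FormatException("Varint is too long for 32 bits.");
            }
        }
    }
}
```

With the example above, BitConverter.ToString(VarintSketch.Encode(27985)) gives "D1-DA-01", and decoding those three bytes returns 27985 with a length of 3.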

Now you are ready to conduct a search using a picture. Send a POST request to the same URL http://www.google.com/goggles/container_proto?cssid=[your cssid] with the same headers:

    Content-Type: application/x-protobuffer
    Pragma: no-cache

and with the following body (as a byte array, NOT strings):

    0A a 0A b 0A c 0A x [image as a byte array] [trailing bytes]

With our example, that would be:

    0A F1 DA 01 0A DF DA 01 0A DB DA 01 0A D1 DA 01 [image as a byte array] [trailing bytes]

The trailing bytes are (as byte array, NOT strings - see code below for more details):

    18, 4B, 20, 01, 30, 00, 92, EC, F4, 3B, 09, 18, 00, 38, C6, 97, DC, DF, F7, 25,
    22, 00
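
Before sending anything, it can be handy to assemble the whole body in memory and inspect it. This is a minimal sketch under my own naming (GogglesBody and Build are not from any official API); it uses the same a, b, c and x definitions and the same trailing bytes as above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class GogglesBody
{
    // Trailing bytes from the post, sent verbatim after the image.
    private static readonly byte[] TrailingBytes =
    {
        0x18, 0x4B, 0x20, 0x01, 0x30, 0x00, 0x92, 0xEC, 0xF4, 0x3B, 0x09,
        0x18, 0x00, 0x38, 0xC6, 0x97, 0xDC, 0xDF, 0xF7, 0x25, 0x22, 0x00
    };

    // Builds: 0A a 0A b 0A c 0A x [image] [trailing bytes]
    public static byte[] Build(byte[] image)
    {
        int x = image.Length;
        using (var body = new MemoryStream())
        {
            // a = x + 32, b = x + 14, c = x + 10, then x itself.
            foreach (int length in new[] { x + 32, x + 14, x + 10, x })
            {
                body.WriteByte(0x0A);
                byte[] varint = ToVarint((uint)length);
                body.Write(varint, 0, varint.Length);
            }

            body.Write(image, 0, image.Length);
            body.Write(TrailingBytes, 0, TrailingBytes.Length);
            return body.ToArray();
        }
    }

    // Standard varint encoding: 7 bits per byte, msb set on all but the last byte.
    private static byte[] ToVarint(uint value)
    {
        var bytes = new List<byte>();
        do
        {
            byte group = (byte)(value & 0x7F);
            value >>= 7;
            if (value != 0)
            {
                group |= 0x80;
            }

            bytes.Add(group);
        }
        while (value != 0);

        return bytes.ToArray();
    }
}
```

For a 27985-byte image, the first 16 bytes of the result match the example above and the total length is 16 + 27985 + 22 = 28023 bytes. Working through the arithmetic, the offsets 32, 14 and 10 look like nested length prefixes that assume each length fits in a 3-byte varint. If that reading is right, they would only hold for images between roughly 16 KB and 2 MB. I have not verified this against the service.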

This code summarizes the above and shows how to post an image (again, using synchronous requests for simplicity):

// Conducts an image search by POSTing an image to Goggles, along with a valid CSSID.
public static void SendPhoto(string cssid, byte[] image)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        string.Format(
        "http://www.google.com/goggles/container_proto?cssid={0}", 
        cssid));

    GoogleGoggles.AddHeaders(request);
    request.Method = "POST";

    // x = image size
    int x = image.Length;
    byte[] xVarint = GoogleGoggles.ToVarint32(x).ToArray<byte>();

    // a = x + 32
    byte[] aVarint = GoogleGoggles.ToVarint32(x + 32).ToArray<byte>();

    // b = x + 14
    byte[] bVarint = GoogleGoggles.ToVarint32(x + 14).ToArray<byte>();

    // c = x + 10
    byte[] cVarint = GoogleGoggles.ToVarint32(x + 10).ToArray<byte>();

    // 0A [a] 0A [b] 0A [c] 0A [x] [image bytes] [trailing bytes]
    using (Stream stream = request.GetRequestStream())
    {
        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // a
        stream.Write(aVarint, 0, aVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // b
        stream.Write(bVarint, 0, bVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // c
        stream.Write(cVarint, 0, cVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // x
        stream.Write(xVarint, 0, xVarint.Length);

        // Write image 
        stream.Write(image, 0, image.Length);

        // Write trailing bytes
        stream.Write(
            GoogleGoggles.TrailingBytes, 
            0, 
            GoogleGoggles.TrailingBytes.Length);

        stream.Flush();
    }

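    // GetResponse() throws a WebException on non-success status codes (such as the
    // 404 for oversized images). Disposing the response releases the connection.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Console.WriteLine("Goggles returned {0}", response.StatusCode);
    }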
}

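// Encodes a non-negative int32 into a varint (at most 5 bytes).
public static IEnumerable<byte> ToVarint32(int value)
{
    // Work on the unsigned bit pattern so the right shift never sign-extends.
    uint remaining = (uint)value;

    // Loop on the whole remaining value, not just its low 7 bits. Otherwise a
    // value such as 128 or 16385, which contains an all-zero 7-bit group, would
    // be cut short. The do/while also makes sure 0 is encoded as a single 0x00.
    do
    {
        // Take the lowest 7 bits.
        int i = (int)(remaining & 0x7F);
        remaining >>= 7;

        // Set the msb if there are further bytes to come.
        if (remaining != 0)
        {
            i += 128;
        }

        yield return (byte)i;
    }
    while (remaining != 0);
}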

If the POST succeeded, you should receive a 200 OK with a ProtoBuffer response body. The response will include a bunch of URLs depending on the results your image search generated. If Goggles did not find any results for your image search, it returns URLs of similar images to browse. If it found one or more results for your search, it returns URLs representing the search results. In other words, the search results are URLs meant to be viewed in a web browser, rather than an XML or JSON blob. I will not explain how to interpret/parse the URLs out of the ProtoBuffer response body in this post (perhaps in a future post).

Before I end this post, let's take a quick look at an example. If you post the image shown below, using the above directions/code:

you will get results similar to this (obtained from a Wireshark trace):

    24b


    ..
    .....;B
    .......s '.../.-
    http://www.google.com/goggles/images/text.png2.en...;Q.O
    http://www.google.com/goggles/result?
    cssid=3FEDE280756ECB89&rh=F0FB212061A4DC0D...=.
    .Hello, goggles!..Text
    .....;..
    .......} ......
    Bhttp://benedictehh.blogg.no/images/315802-2-1241960116211-
    n400.jpg._http://t1.gstatic.com/images?q=
    tbn:TGUR3fPDjEkjMM&w=120&h=87&usg=___LKNIk0JoGL1z6Rc43y8v
    yv9DDE=.Ahttp://benedictehh.blogg.no/1241959699_lesevurdere_
    din_blogg.html...;Q.Ohttp://www.google.com/goggles
    /result?cssid=3FEDE280756ECB89&rh=066B8E469AA45B09...="
    .religi..se bilder.
    Similar Image...;...E..Y...

    0

You will notice that Goggles was able to convert the image to text (Hello, goggles! in green). The link highlighted in yellow is the search result with "Hello, goggles!" as the search term. The link will probably have expired by the time you check it out.

The rest of the ProtoBuffer (mostly shown as '.' because the values don't have an ASCII representation) is probably additional metadata that Google provides for a better UX. I am willing to bet that some of these are coordinates inside the picture which, when interpreted correctly, show the user where the text in the picture resides. I am not interested enough to decode them myself though (yet).
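
Proper parsing is still out of scope here, but the trace hints at a shortcut. Each URL appears to be preceded by a single byte holding its length: the 'B' (0x42 = 66) in front of the blogg.no image URL matches that URL's 66 characters, for example. The sketch below leans on that observation. It is a rough heuristic of mine rather than real ProtoBuffer decoding, UrlScan and Extract are made-up names, and it silently skips any URL of 128 characters or more, since those would need a multi-byte length.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class UrlScan
{
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("http://");

    // Heuristic: finds each "http://" in the raw response and treats the byte
    // just before it as the string length (a one-byte varint, so below 128).
    public static List<string> Extract(byte[] response)
    {
        var urls = new List<string>();
        for (int i = 1; i + Marker.Length <= response.Length; i++)
        {
            if (!StartsWith(response, i, Marker))
            {
                continue;
            }

            int length = response[i - 1];

            // A set msb two bytes back means the length is a multi-byte varint.
            bool multiByte = i >= 2 && response[i - 2] >= 0x80;
            if (multiByte || length < Marker.Length || length >= 0x80
                || i + length > response.Length)
            {
                continue;
            }

            urls.Add(Encoding.ASCII.GetString(response, i, length));
            i += length - 1;  // continue scanning after this URL
        }

        return urls;
    }

    private static bool StartsWith(byte[] data, int offset, byte[] prefix)
    {
        for (int j = 0; j < prefix.Length; j++)
        {
            if (data[offset + j] != prefix[j])
            {
                return false;
            }
        }

        return true;
    }
}
```

Passing the response body bytes (with the HTTP chunk sizes already removed) to UrlScan.Extract should return the image, result and page URLs as plain strings. Using the length byte instead of scanning for printable characters avoids swallowing the bytes that follow a URL, such as the "2" after text.png in the trace.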

If you are planning to build an application using this, I am interested to hear what it is you are trying to build - please feel free to drop me a line. Similarly, if you have any comments or suggestions, please do not hesitate to post a comment.