Monday, January 16, 2012

Bing Vision API

As part of the Windows Phone 7.5 (Mango) upgrade, Microsoft added a nifty feature to Bing search called Bing Vision. Bing Vision can scan barcodes, book covers, album covers and posters, and it also performs OCR quite capably. It appears to be Microsoft's response to Google Goggles, and, just like Google Goggles, it does not seem to have a public API. This post is an attempt at reverse-engineering the image search portion of the Bing Vision API. I have not been able to reverse-engineer the OCR portion yet, but once I do, I will be the first to post it.

Warning: this API is not publicly released by Microsoft, so a lot of it is subject to change. Use at your own risk.

It is important to note that Bing Vision's image search is limited to searching products. If I were to feed it an image of the Mona Lisa, I would get back a list of books and frames of the Mona Lisa for sale rather than identifying it as a portrait by the Italian artist Leonardo da Vinci.


That said, the beauty of Bing Vision lies in its simplicity. You feed the API an image, and you get the result back as XML. Sounds simple, right? Now let's dig deeper.

The API that we are going to examine in this post is Bing Vision's image search. Send a POST request to http://wp.bingvision.ar.glbdns.microsoft.com/ImageSearchV2.ashx with the following headers:

    Pragma: no-cache
    Content-Type: image/jpeg

with the image you would like to search as the body of the POST request.

The following working code sample illustrates how you could issue your request in C#. Note that the HTTP requests here are synchronous for simplicity.

namespace BingImageSearch.NET
{
    using System;
    using System.Net;
    using System.IO;

    class BingImageSearch
    {
        static void Main(string[] args)
        {
            string path = @"C:\image.jpg";

            // File.ReadAllBytes reads the whole file; a single Stream.Read call
            // is not guaranteed to fill the buffer.
            byte[] imageByteArray = File.ReadAllBytes(path);

            BingImageSearch.BingImageQuery(imageByteArray);
        }

        private static void BingImageQuery(byte[] image)
        {
            HttpWebRequest request =
                (HttpWebRequest)WebRequest.Create("http://wp.bingvision.ar.glbdns.microsoft.com/ImageSearchV2.ashx");

            request.Method = "POST";
            BingImageSearch.AddHeaders(request);

            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(image, 0, image.Length);
                stream.Flush();
            }

            // Dispose the response so the underlying connection is released.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                BingImageSearch.PrintResponse(response);
            }
        }

        private static void AddHeaders(HttpWebRequest request)
        {
            request.ContentType = "image/jpeg";
            request.Headers["Pragma"] = "no-cache";
            request.KeepAlive = true;
        }

        private static void PrintResponse(HttpWebResponse response)
        {
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}
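The sample above uses the synchronous HttpWebRequest. As a minimal sketch, assuming a .NET version that ships System.Net.Http, the same request can be built with HttpClient and sent asynchronously. The BingVisionClient class and its method names are hypothetical; only the URL and the two headers come from the traffic described above, and since the endpoint is undocumented it may no longer respond. Building the HttpRequestMessage in its own method lets you inspect it without touching the network.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class BingVisionClient
{
    // Endpoint observed on the wire; undocumented and subject to change.
    public const string ImageSearchUrl =
        "http://wp.bingvision.ar.glbdns.microsoft.com/ImageSearchV2.ashx";

    // Builds (but does not send) the POST: JPEG bytes as the body,
    // Content-Type: image/jpeg and Pragma: no-cache.
    public static HttpRequestMessage BuildImageSearchRequest(byte[] jpeg)
    {
        var content = new ByteArrayContent(jpeg);
        content.Headers.ContentType = new MediaTypeHeaderValue("image/jpeg");

        var request = new HttpRequestMessage(HttpMethod.Post, ImageSearchUrl)
        {
            Content = content
        };
        request.Headers.Pragma.Add(new NameValueHeaderValue("no-cache"));
        return request;
    }

    // Sends the request and returns the raw XML response as a string.
    public static async Task<string> SearchAsync(HttpClient client, byte[] jpeg)
    {
        using (HttpRequestMessage request = BuildImageSearchRequest(jpeg))
        using (HttpResponseMessage response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Reuse a single HttpClient instance across searches rather than creating one per request.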

The service can accept a barcode image, in which case it returns the barcode type and its number to the caller.

It can also accept the cover of a product. When I post the cover of the game EA FIFA Soccer 12, I get back an XML response with the following values (abridged):

      EA FIFA Soccer 12
      FIFA Soccer 12 delivers a true soccer experience with authentic club and league licenses, and intelligent gameplay that mirrors real-world soccer. Compete as any one of over 500 officially licensed clubs and experience responsive, intelligent and realistic action. Enjoy turning defenders with sophisticated dribbling and ball control, snapping off precision shots and placing beautifully timed passes with pin point accuracy.
      EA FIFA Soccer 12
      http://bingvision.blob.core.windows.net/thumbnails/655f98db86ca372371078f0abc4016d6.jpg
      E0703D6431D7D0845005
    .
    .
    .

It is up to you how you parse the XML response. If you plan to explore this API further with C#, you might want to generate proxy classes (via xsd.exe) so that you can easily deserialize the XML and enumerate the objects and their properties. I will not be covering that in this post.
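Short of generating proxy classes, a quick way to see what the service returns is to walk every leaf element. This sketch uses LINQ to XML (XDocument) and deliberately assumes nothing about the schema; the class name BingVisionXml is hypothetical, and the element names in the test are made up for illustration, not the service's real tag names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BingVisionXml
{
    // Returns (element path, trimmed text) for every element that has no
    // child elements, in document order.
    public static List<KeyValuePair<string, string>> Leaves(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants()
            .Where(e => !e.HasElements)
            .Select(e => new KeyValuePair<string, string>(PathOf(e), e.Value.Trim()))
            .ToList();
    }

    // Builds a slash-separated path such as "Root/Child/Leaf".
    private static string PathOf(XElement element)
    {
        IEnumerable<string> names = element.AncestorsAndSelf()
            .Reverse()
            .Select(e => e.Name.LocalName);
        return string.Join("/", names);
    }
}
```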

Sunday, May 22, 2011

Sudoku Wizard - a Google Goggles Experiment

Google Goggles posts:
Back in February, I worked on reverse engineering the Google Goggles service and came out with some pretty good results. With more or less a working API set, I decided to write an application just to have fun with the stuff.

So you ask, "What exactly did you build?" Back in January, when Goggles 1.3 was announced, Google demoed a nifty new feature that recognizes a Sudoku puzzle from a picture and solves it for you. I have been intrigued by the service ever since, and that's what drove me to reverse-engineer the protocol. So, back to the question: I wrote a Sudoku solver Windows Phone 7 application that does just that. You might also ask, "Why Windows Phone 7?" Well, it's simple really. Google Goggles is already available on iOS and Android, so it wouldn't make much sense to build another app targeting those platforms.

If you happen to have a WP7 device, check out the "Sudoku Wizard" app here. Did I mention it's free (and ad-free, too)?

Sudoku Wizard was written using the information from my previous post, where I said that two headers need to be added to all HTTP communication with Goggles. It turns out a third header is required as well: the user-agent. Apparently, Goggles requires an iOS/Android device (or perhaps a WebKit browser) in order to solve your Sudoku puzzle, and that's easily taken care of by modifying the user-agent of each HTTP request. I omitted it from my previous post because I thought it was insignificant. So the only thing I needed to change was the user-agent in all of the HTTP POST requests. The headers now look like:

    Content-Type: application/x-protobuffer
    Pragma: no-cache
    User-Agent: Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_3 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7E18 Safari/528.16 GoogleMobileApp/0.7.3.5675 GoogleGoggles-iPhone/1.0; gzip

In C#:

httpWebRequest.UserAgent = "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_3 like " +
    "Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7E18 " +
    "Safari/528.16 GoogleMobileApp/0.7.3.5675 GoogleGoggles-iPhone/1.0; gzip";

When you post an image of a Sudoku, Goggles responds with a protobuffer blob containing some URLs. Goggles solves the Sudoku and returns the solution as a URL, which needs to be displayed in a browser. An example response is shown below. The solution is behind one of the http://www.google.com/goggles/result?cssid=... URLs, and that is the one to show the user (keep in mind that these URLs will have expired by the time you view them):

    2f6


    ..
    ..
    ..00407003060000810701300000503871000000000000000005932050000
    210107800006090060700|854971632629538147713246895238  
    714569965382471471659328586497213147823956392165784.
    similar image...;..

    ....... .....1./http://www.google.com/goggles/images/sudoku.png:
    Sudoku PuzzleB.J
    Sudoku PuzzleR.ZOhttp://www.google.com/goggles/result?cssid=   
    30F2B33236EF604A&rh=061F6991C6883113...;Q.Ohttp://  
    www.google.com/goggles/result?
    cssid=30F2B33236EF604A&rh=E3378E086956A162...=.

    Sudoku Puzzle.
    Sudoku Puzzle...;..+.8.n._.2.965f0e6e1638152b"..
    +TR=T=47zjUtTuOC0:X=N2eE6:S=_Wfmn5yVnPqmIncl
    +TR=T=VRi6Lyke0hg:X=N2eE6:S=v6Pd5Kfnqtmr6F1u
    +TR=T=1Ez8_2tSjsc:X=N2eE6:S=ZH8A8bLT2fSEU0L0
    +TR=T=itRprQ9o70M:X=N2eE6:S=H8Oyx0Srj8HaeyQL
    +TR=T=A4_uRb4_Axg:X=N2eE6:S=_afui5_H60kiNDEg....

    0

Obviously, I had to do some parsing to extract the URLs I was interested in. I did not implement a protobuffer interpreter/parser; I just used some .NET String methods and a regex to get what I needed, and it works great.
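Here is one way to do that kind of string-and-regex extraction, offered as a sketch rather than the code Sudoku Wizard actually uses. It assumes the result links look like the ones in the dump above (http://www.google.com/goggles/result?cssid=...&rh=...), with cssid and rh made only of hexadecimal digits; the GogglesResponse class and its methods are hypothetical. Mapping each byte straight to the char with the same code keeps the binary protobuf bytes from being mangled by a text decoder.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GogglesResponse
{
    // Assumed shape of a result link: cssid and rh are runs of hex digits,
    // so the match stops at the first non-hex byte after rh.
    private static readonly Regex ResultUrl = new Regex(
        @"http://www\.google\.com/goggles/result\?cssid=[0-9A-Fa-f]+&rh=[0-9A-Fa-f]+");

    // One char per byte (Latin-1 style), so embedded ASCII URLs survive intact.
    public static string AsLatin1(byte[] body)
    {
        return new string(body.Select(b => (char)b).ToArray());
    }

    // Returns the distinct result URLs in the order they appear.
    public static List<string> ExtractResultUrls(byte[] body)
    {
        return ResultUrl.Matches(AsLatin1(body))
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .ToList();
    }
}
```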


Thursday, February 3, 2011

Google Goggles API

Update 5/18/11 - I finally had the time to update the post, and I now have a working Google Goggles API set. In case you are wondering what changed: the protobuffer number a has changed from x + 401 to x + 32, where x is the image size. Also, there are trailing bytes that you need to send after the image byte array when posting the photo. All of this is detailed in the post; just copy the code as-is and it should work. Also, I am no longer having issues sending images larger than 140 KB. I do not know what the upper limit is, but it seems to have increased.

Update 2/14/11 - I have been battling a class of failures that I discovered when posting an image. It turns out Goggles expects images to be smaller than 140 KB. If you pass in an image that is far too large (e.g., 1 MB), you will get a 404 failure. If you pass in an image slightly larger than 140 KB, you will get a 200 OK, but with an empty response.

Google Goggles is a service that allows you to search the web using pictures taken with your mobile device. I was quite taken with it when I took a picture of a painting in my living room and it correctly recognized it as Banksy's Flower Thrower stencil.

I started looking online for a Goggles API set to help me with a WP7 application I have in mind, but did not find anything. So I went about reverse-engineering it myself, and got some successful results.

A word of caution, though: Google hasn't officially published an API set yet, so a lot of this is subject to change. In other words, use it at your own risk.

So let's start. You do not need to authenticate to use the Goggles API, but you will need to generate the following:

  • CSSID: a string of 16 random hexadecimal digits, for example 43CBE97C0C31FE5A. The CSSID remains valid for subsequent requests until you receive an error using it; then you will have to generate a new one.

Once you have generated a CSSID, send a POST request to http://www.google.com/goggles/container_proto?cssid=[your cssid] with the following headers:

    Content-Type: application/x-protobuffer
    Pragma: no-cache

and with the following body (as a byte array, NOT strings - see code below for clarification):

    22, 00, 62, 3C, 0A, 13, 22, 02, 65, 6E, BA, D3, F0, 3B, 0A, 08, 01, 10, 01, 28, 
    01, 30, 00, 38, 01, 12, 1D, 0A, 09, 69, 50, 68, 6F, 6E, 65, 20, 4F, 53, 12, 03, 
    34, 2E, 31, 1A, 00, 22, 09, 69, 50, 68, 6F, 6E, 65, 33, 47, 53, 1A, 02, 08, 02, 
    22, 02, 08, 01

If your post is successful, you should receive a 200 OK. This code demonstrates how to create and post your CSSID (synchronous requests are used for simplicity):

namespace GoogleGoggles.NET
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net;

    public class GoogleGoggles
    {
        // The POST body required to validate the CSSID.
        private static byte[] CssidPostBody = new byte[] { 34, 0, 98, 60, 10, 19, 34,
            2, 101, 110, 186, 211, 240, 59, 10, 8, 1, 16, 1, 40, 1, 48, 0, 56, 1, 18,
            29, 10, 9, 105, 80, 104, 111, 110, 101, 32, 79, 83, 18, 3, 52, 46, 49, 
            26, 0, 34, 9, 105, 80, 104, 111, 110, 101, 51, 71, 83, 26, 2, 8, 2, 34,
            2, 8, 1 };
 
        // Bytes trailing the image byte array. Look at the next code snippet to see
        // where it is used in SendPhoto() method.
        private static byte[] TrailingBytes = new byte[] { 24, 75, 32, 1, 48, 0, 146,
            236, 244, 59, 9, 24, 0, 56, 198, 151, 220, 223, 247, 37, 34, 0 };

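        // Generates a CSSID: 16 random hexadecimal digits (8 random bytes).
        private static string Cssid
        {
            get
            {
                Random random = new Random();
                byte[] bytes = new byte[8];
                random.NextBytes(bytes);

                // BitConverter.ToString yields "43-CB-E9-..."; drop the dashes.
                return BitConverter.ToString(bytes).Replace("-", string.Empty);
            }
        }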

        static void Main(string[] args)
        {
            string cssid = GoogleGoggles.Cssid;
            GoogleGoggles.ValidateCSSID(cssid);

            // See next code snippet for SendPhoto()
            //GoogleGoggles.SendPhoto(cssid, yourImageByteArray);
            Console.ReadLine();
        }

        // Validates the CSSID we just created, by POSTing it to Goggles.
        private static void ValidateCSSID(string cssid)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                string.Format(
                "http://www.google.com/goggles/container_proto?cssid={0}", 
                cssid));

            GoogleGoggles.AddHeaders(request);
            request.Method = "POST";

            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(
                    GoogleGoggles.CssidPostBody, 
                    0,
                    GoogleGoggles.CssidPostBody.Length);

                stream.Flush();
            }

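            // GetResponse() throws a WebException on HTTP errors, so reaching
            // this point means the CSSID was accepted.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine(response.StatusCode);
            }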
        }

        private static void AddHeaders(HttpWebRequest request)
        {
            request.ContentType = "application/x-protobuffer";
            request.Headers["Pragma"] = "no-cache";
            request.KeepAlive = true;
        }
    }
}

If you received a 200 OK response, use the same CSSID for future requests until you receive an error using it. As a rule of thumb, I generate a new one each time I launch the application and keep using it until I get an error.

Moving on. Now that you have a valid CSSID, we can start posting images. But first we need to understand a bit about Google's Protocol Buffers (ProtoBuffer), which is, in Google's own words, a way of serializing structured data for use in communication protocols, data storage, and more. Explaining ProtoBuffer in detail is out of the scope of this post; I will only cover the basics, so refer to Google's Protocol Buffers documentation if you would like to know more. First, you will need the following:

  • A JPEG image (I haven't played with other formats). If you are using images taken with a phone camera (for example, if you are building a phone application), you will likely need to resize them, since high-megapixel phone cameras are common nowadays.
  • The size of the JPEG image in bytes.

You will then need to calculate the following integers. Let x be the size of the image in bytes.

    a = x + 32
    b = x + 14
    c = x + 10

Now you need to encode a, b, c and x as varints. Let's assume x = 27985. To encode it as a varint, do the following:
  1. Convert the number into binary:
    110110101010001
  2. From right to left, divide the bits into groups of 7 bits each. Pad the leftmost group with leading zeros to make it 7 bits if needed:
    0000001 1011010 1010001
  3. Reverse the groups:
    1010001 1011010 0000001
  4. Convert the groups of 7 bits into bytes. From left to right, set the msb (most significant bit) if there are further bytes to come. On the last byte, the msb should be 0:
    11010001 11011010 00000001
    0xD1DA01
So x is 0xD1DA01 as a varint. Going through the same exercise, you will find that a, b and c encode to 0xF1DA01, 0xDFDA01 and 0xDBDA01 respectively.
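Going the other way is handy for sanity-checking the arithmetic above, and for reading lengths back out of a ProtoBuffer response. This hypothetical Varint.Decode helper simply reverses the steps: it takes the low 7 bits of each byte, least significant group first, and stops at the first byte whose msb is 0.

```csharp
using System;

public static class Varint
{
    // Decodes one varint starting at data[offset] and advances offset past it.
    public static uint Decode(byte[] data, ref int offset)
    {
        uint result = 0;

        // A 32-bit value needs at most 5 groups of 7 bits.
        for (int shift = 0; shift < 35; shift += 7)
        {
            if (offset >= data.Length)
            {
                throw new ArgumentException("Truncated varint.");
            }

            byte b = data[offset++];
            result |= (uint)(b & 0x7F) << shift;

            // msb clear: this was the last byte of the varint.
            if ((b & 0x80) == 0)
            {
                return result;
            }
        }

        throw new ArgumentException("Varint longer than 5 bytes.");
    }
}
```

Decoding D1 DA 01 gives back 27985, and decoding F1 DA 01 gives 28017, which is 27985 + 32 as expected for a.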

Now you are ready to conduct a search using a picture. Send a POST request to the same URL http://www.google.com/goggles/container_proto?cssid=[your cssid] with the same headers:

    Content-Type: application/x-protobuffer
    Pragma: no-cache

and with the following body (as a byte array, NOT strings):

    0A a 0A b 0A c 0A x [image as a byte array] [trailing bytes]

With our example, that would be:

    0A F1 DA 01 0A DF DA 01 0A DB DA 01 0A D1 DA 01 [image as a byte array] [trailing bytes]

The trailing bytes are (as byte array, NOT strings - see code below for more details):

    18, 4B, 20, 01, 30, 00, 92, EC, F4, 3B, 09, 18, 00, 38, C6, 97, DC, DF, F7, 25,
    22, 00
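The constants in a, b and c look like the lengths of nested length-delimited ProtoBuffer fields, where each 0A is a tag byte followed by a varint length. Assuming that reading is right: the innermost field (length c) holds 0A [x], the image and the first six trailing bytes (18 4B 20 01 30 00), so c = 1 + 3 + x + 6 = x + 10; the next field (length b) holds 0A [c] plus that, so b = x + 14; the outer field (length a) adds 0A [b] and the next fourteen trailing bytes (92 EC F4 3B 09 ...), so a = x + 32; the final 22 00 sits outside all of them. The 3 in those sums is the size of each varint, so the fixed offsets only hold while every length encodes to three bytes (roughly 16 KB to 2 MB). This sketch, with hypothetical names (GogglesBody, Build, Field), builds the same body by nesting the fields and computing each length, rather than hard-coding the offsets.

```csharp
using System.IO;

public static class GogglesBody
{
    // The trailing bytes from the post, split where (by assumption) each field ends.
    private static readonly byte[] Tail6 = { 0x18, 0x4B, 0x20, 0x01, 0x30, 0x00 };
    private static readonly byte[] Tail14 =
        { 0x92, 0xEC, 0xF4, 0x3B, 0x09, 0x18, 0x00, 0x38, 0xC6, 0x97, 0xDC, 0xDF, 0xF7, 0x25 };
    private static readonly byte[] Closing = { 0x22, 0x00 };

    // Writes value as a varint: 7 bits per byte, least significant group
    // first, msb set on every byte except the last.
    private static void WriteVarint(Stream stream, uint value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        stream.WriteByte((byte)value);
    }

    // Length-delimited field: tag byte, varint(payload length), payload.
    private static byte[] Field(byte tag, params byte[][] parts)
    {
        var payload = new MemoryStream();
        foreach (byte[] part in parts)
        {
            payload.Write(part, 0, part.Length);
        }
        byte[] payloadBytes = payload.ToArray();

        var field = new MemoryStream();
        field.WriteByte(tag);
        WriteVarint(field, (uint)payloadBytes.Length);
        field.Write(payloadBytes, 0, payloadBytes.Length);
        return field.ToArray();
    }

    // Produces 0A [a] 0A [b] 0A [c] 0A [x] [image] [trailing bytes].
    public static byte[] Build(byte[] image)
    {
        byte[] xField = Field(0x0A, image);            // 0A [x] image
        byte[] cField = Field(0x0A, xField, Tail6);    // c = x + 10 for 3-byte varints
        byte[] bField = Field(0x0A, cField);           // b = x + 14
        byte[] aField = Field(0x0A, bField, Tail14);   // a = x + 32

        var body = new MemoryStream();
        body.Write(aField, 0, aField.Length);
        body.Write(Closing, 0, Closing.Length);
        return body.ToArray();
    }
}
```

For a 27,985-byte image this yields exactly the 0A F1 DA 01 0A DF DA 01 0A DB DA 01 0A D1 DA 01 prefix shown above, followed by the image and the 22 trailing bytes.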

This code summarizes the above and shows how to post an image (again, using synchronous requests for simplicity):

// Conducts an image search by POSTing an image to Goggles, along with a valid CSSID.
public static void SendPhoto(string cssid, byte[] image)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        string.Format(
        "http://www.google.com/goggles/container_proto?cssid={0}", 
        cssid));

    GoogleGoggles.AddHeaders(request);
    request.Method = "POST";

    // x = image size
    int x = image.Length;
    byte[] xVarint = GoogleGoggles.ToVarint32(x).ToArray<byte>();

    // a = x + 32
    byte[] aVarint = GoogleGoggles.ToVarint32(x + 32).ToArray<byte>();

    // b = x + 14
    byte[] bVarint = GoogleGoggles.ToVarint32(x + 14).ToArray<byte>();

    // c = x + 10
    byte[] cVarint = GoogleGoggles.ToVarint32(x + 10).ToArray<byte>();

    // 0A [a] 0A [b] 0A [c] 0A [x] [image bytes]
    using (Stream stream = request.GetRequestStream())
    {
        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // a
        stream.Write(aVarint, 0, aVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // b
        stream.Write(bVarint, 0, bVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // c
        stream.Write(cVarint, 0, cVarint.Length);

        // 0x0A
        stream.Write(new byte[] { 10 }, 0, 1);

        // x
        stream.Write(xVarint, 0, xVarint.Length);

        // Write image 
        stream.Write(image, 0, image.Length);

        // Write trailing bytes
        stream.Write(
            GoogleGoggles.TrailingBytes, 
            0, 
            GoogleGoggles.TrailingBytes.Length);

        stream.Flush();
    }

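    // GetResponse() throws a WebException on HTTP errors; the ProtoBuffer
    // result can be read from response.GetResponseStream().
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Console.WriteLine(response.StatusCode);
    }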
}

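// Encodes a non-negative int32 as a varint: 7 bits per byte, least
// significant group first, msb set on every byte except the last.
// (The previous loop stopped at any all-zero 7-bit group, so values
// such as 128 or 16389 were encoded incorrectly.)
public static IEnumerable<byte> ToVarint32(int value)
{
    uint remaining = (uint)value;

    while (remaining >= 0x80)
    {
        yield return (byte)((remaining & 0x7F) | 0x80);
        remaining >>= 7;
    }

    yield return (byte)remaining;
}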

If the POST succeeds, you should receive a 200 OK with a ProtoBuffer response body. The response includes a number of URLs, depending on the results your image search generated. If Goggles does not find any results for your image, it returns URLs of similar images to browse. If it finds one or more results, it returns URLs representing the search results. In other words, the search results are web pages to be viewed in a browser, rather than an XML or JSON blob. I will not explain how to interpret/parse the URLs out of the ProtoBuffer response body in this post (perhaps in a future post).

Before I end this post, let's take a quick look at an example. If you post an image containing the text "Hello, goggles!" using the directions and code above, you will get results similar to this (captured with Wireshark):

    24b


    ..
    .....;B
    .......s '.../.-
    http://www.google.com/goggles/images/text.png2.en...;Q.O
    http://www.google.com/goggles/result?
    cssid=3FEDE280756ECB89&rh=F0FB212061A4DC0D...=.
    .Hello, goggles!..Text
    .....;..
    .......} ......
    Bhttp://benedictehh.blogg.no/images/315802-2-1241960116211-
    n400.jpg._http://t1.gstatic.com/images?q=
    tbn:TGUR3fPDjEkjMM&w=120&h=87&usg=___LKNIk0JoGL1z6Rc43y8v
    yv9DDE=.Ahttp://benedictehh.blogg.no/1241959699_lesevurdere_
    din_blogg.html...;Q.Ohttp://www.google.com/goggles
    /result?cssid=3FEDE280756ECB89&rh=066B8E469AA45B09...="
    .religi..se bilder.
    Similar Image...;...E..Y...

    0

You will notice that Goggles was able to convert the image to text ("Hello, goggles!"). The goggles/result link next to it is the search result with "Hello, goggles!" as the search term. The link has probably expired by the time you check it out.

The rest of the ProtoBuffer (mostly shown as '.' because the values don't have an ASCII representation) is probably additional metadata that Google provides for a better UX. I am willing to bet that some of these values are coordinates inside the picture which, when interpreted correctly, show the user where the text in the picture resides. I am not interested enough to decode them myself, though (yet).

If you are planning to build an application using this, I would be interested to hear what you are trying to build, so please feel free to drop me a line. Similarly, if you have any comments or suggestions, please do not hesitate to post a comment.