Andrew C. Oliver
Contributing Writer

How to use the Google Vision API

Apr 26, 2018 | 6 mins

Happy or sad? Cat or person? Use the Google Vision API to detect details about images.


Recently, I covered how computers can see, hear, feel, smell, and taste. One of the ways your code can “see” is with the Google Vision API, which connects your code to Google’s image recognition capabilities. You can think of it as a kind of REST interface to images.google.com, but it does much more than show you similar images.

Google Vision can detect whether you’re a cat or a human, as well as the parts of your face. It tries to detect whether you’re posed or doing something that wouldn’t be okay for Google SafeSearch. It even tries to detect whether you’re happy or sad.

Setting up the Google Vision API

To use the Google Vision API, you have to sign up for a Google Compute Engine account. GCE is free to try, but you will need a credit card to sign up. From there, you select a project (My First Project is selected if you have just signed up). Then get yourself an API key from the left-hand menu.

[Screenshot: Google Vision API setup, screen 1 (IDG)]

Here, I’m using a simple API key that I can use with the command-line tool curl (if you prefer, you can use a different tool that can call REST APIs):

[Screenshot: Google Vision API setup, screen 2 (IDG)]

Save the key it generates to a text file or buffer somewhere (I’ll refer to it as YOUR_KEY from now on) and enable the API on your project (go to this URL and click Enable the API):

[Screenshot: Google Vision API setup, screen 3 (IDG)]

Select your project from the next screen:

[Screenshot: Google Vision API setup, screen 4 (IDG)]

Now you’re ready to go! Stick this text in a file called google_vision.json:

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://upload.wikimedia.org/wikipedia/commons/9/9b/Gustav_chocolate.jpg"
        }
      },
      "features": [
        {
          "type": "TYPE_UNSPECIFIED",
          "maxResults": 50
        },
        {
          "type": "LANDMARK_DETECTION",
          "maxResults": 50
        },
        {
          "type": "FACE_DETECTION",
          "maxResults": 50
        },
        {
          "type": "LOGO_DETECTION",
          "maxResults": 50
        },
        {
          "type": "LABEL_DETECTION",
          "maxResults": 50
        },
        {
          "type": "TEXT_DETECTION",
          "maxResults": 50
        },
        {
          "type": "SAFE_SEARCH_DETECTION",
          "maxResults": 50
        },
        {
          "type": "IMAGE_PROPERTIES",
          "maxResults": 50
        },
        {
          "type": "CROP_HINTS",
          "maxResults": 50
        },
        {
          "type": "WEB_DETECTION",
          "maxResults": 50
        }
      ]
    }
  ]
}

This JSON request tells the Google Vision API which image to parse and which of its detection features to enable. I simply requested most of them, with up to 50 results apiece.

Now use curl:

curl -v -s -H "Content-Type: application/json" "https://vision.googleapis.com/v1/images:annotate?key=YOUR_KEY" --data-binary @google_vision.json > results
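If you’d rather make the same call from code, here’s a minimal Python sketch (assuming the third-party requests library is installed; any HTTP client would do). It posts the same google_vision.json body and saves the response to a results file, mirroring the curl command:

import json

import requests  # assumed dependency: pip install requests

API_KEY = "YOUR_KEY"  # the API key you saved earlier
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

# Load the same request body used with curl.
with open("google_vision.json") as f:
    body = json.load(f)

# POST the annotation request; the API key goes in a query parameter.
resp = requests.post(ENDPOINT, params={"key": API_KEY}, json=body)
resp.raise_for_status()

# Save the JSON response to a file named results, as the curl example does.
with open("results", "w") as f:
    json.dump(resp.json(), f, indent=2)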

Looking at the Google Vision API response

You should see something like this:

* Connected to vision.googleapis.com (74.125.196.95) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.googleapis.com
* Server certificate: Google Internet Authority G3
* Server certificate: GlobalSign
> POST /v1/images:annotate?key=YOUR_KEY HTTP/1.1
> Host: vision.googleapis.com
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2252
> Expect: 100-continue
>
* Done waiting for 100-continue
} [2252 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Vary: X-Origin
< Vary: Referer
< Date: Tue, 24 Apr 2018 18:26:10 GMT
< Server: ESF
< Cache-Control: private
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< Alt-Svc: hq=":443"; ma=2592000; quic=51303433; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="43,42,41,39,35"
< Accept-Ranges: none
< Vary: Origin,Accept-Encoding
< Transfer-Encoding: chunked
<
{ [905 bytes data]
* Connection #0 to host vision.googleapis.com left intact

If you look in the results file, you’ll see this:

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/01yrx",
          "description": "cat",
          "score": 0.99524164,
          "topicality": 0.99524164
        },
        {
          "mid": "/m/035qhg",
          "description": "fauna",
          "score": 0.93651986,
          "topicality": 0.93651986
        },
        {
          "mid": "/m/04rky",
          "description": "mammal",
          "score": 0.92701304,
          "topicality": 0.92701304
        },
        {
          "mid": "/m/07k6w8",
          "description": "small to medium sized cats",
          "score": 0.92587274,
          "topicality": 0.92587274
        },
        {
          "mid": "/m/0307l",
          "description": "cat like mammal",
          "score": 0.9215815,
          "topicality": 0.9215815
        },
        {
          "mid": "/m/09686",
          "description": "vertebrate",
          "score": 0.90370363,
          "topicality": 0.90370363
        },
        {
          "mid": "/m/01l7qd",
          "description": "whiskers",
          "score": 0.86890864,
          "topicality": 0.86890864
…

Google knows you have supplied it with a cat picture. It even found the whiskers!
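If you want just the labels rather than the raw JSON, a short Python sketch (assuming the response was saved to the results file, as above) can pull them out:

import json

# Read the response saved by curl (or by the earlier Python sketch).
with open("results") as f:
    data = json.load(f)

# One entry in "responses" per request in the batch.
for label in data["responses"][0].get("labelAnnotations", []):
    print(label["description"], label["score"])

For the cat picture above, this prints lines like cat 0.99524164.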

Now, I’ll try a larger mammal. Replace the URL in the request with my Twitter profile picture and run it again. It’s a picture of me getting smooched by an elephant on my 2014 trip to Thailand.

The results will include locations of my facial features.

…
"landmarks": [
  {
    "type": "LEFT_EYE",
    "position": {
      "x": 114.420876,
      "y": 252.82072,
      "z": -0.00017215312
    }
  },
  {
    "type": "RIGHT_EYE",
    "position": {
      "x": 193.82027,
      "y": 259.787,
      "z": -4.495486
    }
  },
  {
    "type": "LEFT_OF_LEFT_EYEBROW",
    "position": {
      "x": 95.38249,
      "y": 234.60289,
      "z": 11.487803
    }
  },
…

Google isn’t as good at judging emotion as it is at locating facial features:

"rollAngle": 5.7688847,
"panAngle": -3.3820703,
"joyLikelihood": "UNLIKELY",
"sorrowLikelihood": "VERY_UNLIKELY",
"angerLikelihood": "UNLIKELY",
"surpriseLikelihood": "VERY_UNLIKELY",
"underExposedLikelihood": "VERY_UNLIKELY",
"blurredLikelihood": "VERY_UNLIKELY",
"headwearLikelihood": "VERY_UNLIKELY"

I was definitely surprised, because I was not expecting the kiss (I was just aiming for a selfie with the elephant). The picture may show a bit of joy combined with “yuck,” because elephant-snout kisses are messy and a bit slimy.
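If you want to act on those guesses in code, the emotion likelihoods come back as strings on each entry in faceAnnotations. Here’s a small sketch, again reading the saved results file:

import json

with open("results") as f:
    data = json.load(f)

# Each detected face carries a likelihood string per emotion.
for face in data["responses"][0].get("faceAnnotations", []):
    for emotion in ("joy", "sorrow", "anger", "surprise"):
        print(emotion, "->", face[emotion + "Likelihood"])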

Google Vision also noticed some other things about the picture and me:

{
  "mid": "/m/0jyfg",
  "description": "glasses",
  "score": 0.7390568,
  "topicality": 0.7390568
},
{
  "mid": "/m/08g_yr",
  "description": "temple",
  "score": 0.7100323,
  "topicality": 0.7100323
},
{
  "mid": "/m/05mqq3",
  "description": "snout",
  "score": 0.65698373,
  "topicality": 0.65698373
},
{
  "mid": "/m/07j7r",
  "description": "tree",
  "score": 0.6460454,
  "topicality": 0.6460454
},
{
  "mid": "/m/019nj4",
  "description": "smile",
  "score": 0.60378826,
  "topicality": 0.60378826
},
{
  "mid": "/m/01j3sz",
  "description": "laughter",
  "score": 0.51390797,
  "topicality": 0.51390797
}
]
…

Google recognized the elephant snout! It also noticed that I’m smiling and that I’m laughing. Note that the lower scores indicate lower confidence, but it’s good that the Google Vision API noticed them at all.

…
"safeSearchAnnotation": {
  "adult": "VERY_UNLIKELY",
  "spoof": "POSSIBLE",
  "medical": "VERY_UNLIKELY",
  "violence": "UNLIKELY",
  "racy": "UNLIKELY"
}
…

Google doesn’t believe that this is more than a platonic kiss, and it realizes that I’m not being harmed by the elephant.
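That points to a practical pattern: screening user-submitted images automatically. Here is a minimal sketch (still reading the saved results file) that flags an image when any SafeSearch category comes back LIKELY or VERY_LIKELY:

import json

with open("results") as f:
    data = json.load(f)

safe = data["responses"][0].get("safeSearchAnnotation", {})

# Flag the image if any category is at least LIKELY.
flagged = [category for category, verdict in safe.items()
           if verdict in ("LIKELY", "VERY_LIKELY")]
print("flagged for:", ", ".join(flagged) if flagged else "nothing")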

Aside from this, you’ll find things like matching images and similar images in the response. You’ll also find topic associations. For example, I tweeted once about a “Xennials” article, and now I’m associated with it!

How is the Google Vision API useful?

Whether you’re working in security or retail, being able to figure out what something is from an image can be fundamentally helpful. Whether you’re trying to figure out what breed of cat you have, who a customer is, or whether Google thinks a columnist is influential in a topic, the Google Vision API can help. Note that Google’s terms only allow this API to be used in personal computing applications. Still, whether you’re adorning data in a search application or checking whether user-submitted content is racy, Google Vision might be just what you need.

While I used the version of the API that takes public image URIs, you can also post raw binary image data or a Google Cloud Storage file location using different permutations of the request.
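For example, here’s a sketch of the raw-binary permutation: the image bytes go base64-encoded into a content field instead of a source.imageUri (the local file name cat.jpg is just a placeholder). A Cloud Storage object would instead use a gcsImageUri inside source, as noted in the comment:

import base64
import json

import requests  # assumed dependency: pip install requests

API_KEY = "YOUR_KEY"
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

# Base64-encode a local image instead of pointing at a public URI.
with open("cat.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

body = {
    "requests": [{
        "image": {"content": encoded},
        # For a Cloud Storage object, you would instead use:
        # "image": {"source": {"gcsImageUri": "gs://your-bucket/cat.jpg"}},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 50}],
    }]
}

resp = requests.post(ENDPOINT, params={"key": API_KEY}, json=body)
print(json.dumps(resp.json(), indent=2))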

Author’s note: Thanks to my colleague at Lucidworks, Roy Kiesler, whose research contributed to this article.