practicing techie

tech oriented notes to self and lessons learned

Monthly Archives: November 2017

Scala client for Amazon Glacier

Leave a comment Posted by marko on 2017-11-19

Amazon Glacier is a secure, durable, and extremely low-cost cloud storage service for data archiving and long-term backup. Glacier offers a cold storage data archival solution meaning the stored data is not available for immediate retrieval. You need to first request retrieval of the data and access time can vary from minutes to several hours, depending on the service level you choose.

While cold storage may feel cumbersome at first, it also has its advantages. No one will be able to accidentally modify important, archived files. It’s also possible to prevent deletion altogether, if needed.

Glacier is designed for use cases in which retrievals are infrequent and exceptional, and data will be stored for extended periods of time.

Concepts

If you haven’t worked with AWS services or Glacier before, it’s helpful to learn a few concepts first:

AWS Region – a named set of AWS resources in the same geographical area. Regions are completely isolated from each other, so when you view your resources, you’ll only see the resources tied to the region you’ve specified. In Glacier terms, stored data is bound to a particular region. Glacier storage prices vary across regions.

Vault – a container for storing data in the form of archives. An unlimited number of archives can be stored in a vault. Vaults and their contents are available only in the region where they were created. Access permissions, notifications and compliance controls are configured on vault level.

Archive – an archive can be any data such as a photo, video, or document and is a base unit of storage in Amazon Glacier. Each archive has a unique ID and an optional description. You may upload a single file as an archive, but your costs will be lower if you aggregate your data. Archives stored in Amazon Glacier are immutable, i.e. archives can be uploaded, downloaded and deleted, but cannot be edited or overwritten as with services like Dropbox.

(Vault) Inventory – AWS Console will show you a list of vaults, but not a list of vault contents, or inventory. Inventory needs to be separately requested for retrieval and fulfilling the request can take several hours.

Job – retrieving an archive or vault inventory (list of archives) are asynchronous operations in Amazon Glacier. You first initiate a job, and then download the job output after Amazon Glacier completes the job. With Amazon Glacier, your data retrieval requests are queued and jobs will take hours to complete.

Notification-configuration – because jobs take time to complete, Amazon Glacier supports a notification mechanism to notify you when a job is complete. You can configure a vault to send notification to an Amazon Simple Notification Service (Amazon SNS) topic when jobs complete. You can specify one SNS topic per vault in the notification configuration.

More info on these concepts can be found here: Amazon Glacier data model

Glacier client

Amazon Glacier can be used with the Amazon AWS CLI, but it’s quite clumsy to use, especially for archive uploads. Some backup tools support Glacier based storage, but the ones I came across didn’t seem to be suited for server side backups or programmatic use. Amazon AWS Console allows you to e.g. create and configure vaults, but archive operations are not supported.

Glacier client is a simple tool I created for working with Amazon Glacier. It was designed to support both interactive use (with Scala REPL), as well as programmatic use with Scala or Java. It’s well suited for server side use. Glacier client is built on Amazon AWS SDK for Java.

The code can be found on GitHub: https://github.com/marko-asplund/glacier-client

Setting up Glacier

AWS configuration

To use Glacier you need to first set up an AWS user account and permissions in AWS Console as follows:

Create user account in AWS IAM (identity and access management)
Grant the user the following permissions: AmazonGlacierFullAccess, Grant AmazonSQSFullAccess, AmazonSNSFullAccess
Create an access key

Some operations, such as creating a vault inventory or preparing an archive for download are performed asynchronously. Setting up notifications will be helpful with these operations. You need to enable notifications on the vault and configure a corresponding SNS topic in AWS Console.

Glacier client setup

Set up AWS credentials

The simplest way to setup Glacier client authorisation is to configure “default credential profiles file” as described in Working with AWS Credentials.

The profiles file is a text file with a simple file format, so you can set it up with just a text editor by following instructions on the above mentioned page.

You can also set up the file by using the AWS CLI invoking “aws configure” command to set up the default credentials file, as described in AWS CLI configure options.

Get glacier-client

To run glacier-client, you need to have Git, sbt and Java JRE installed.

git clone https://github.com/marko-asplund/glacier-client.git
cd glacier-client

Basic operation

Start up Scala REPL with sbt

~/glacier-backup-cli (master ✔) ᐅ sbt console

[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/marko/glacier-backup-cli/project
[info] Loading settings from build.sbt ...
[info] Set current project to glacier-backup-cli (in build file:/Users/marko/glacier-backup-cli/)
[info] Starting scala interpreter...
Welcome to Scala 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151).
Type in expressions for evaluation. Or try :help.

List the names of available AWS regions

scala> fi.markoa.glacier.GlacierClient.regions
res0: Array[String] = Array(us-gov-west-1, us-east-1, us-east-2, us-west-1, us-west-2, eu-west-1, eu-west-2, eu-central-1, ap-south-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, sa-east-1, cn-north-1, ca-central-1)

Create a Glacier client that connects to the us-west-2 region

scala> val c = fi.markoa.glacier.GlacierClient("us-west-2")
c: fi.markoa.glacier.GlacierClient = fi.markoa.glacier.GlacierClient@11b6e34a

Create new vault. The ID (or ARN) for the newly created vault is returned.

scala> c.createVault("test-vault-1")
res1: String = /429963740182/vaults/test-vault-1

List all vaults in the region. A sequence of vault objects is returned, in this case it includes just the vault we created above. Please note that with vault operations the results are visible immediately.

scala> c.listVaults
res2: Seq[fi.markoa.glacier.Vault] = ArrayBuffer(Vault(arn:aws:glacier:us-west-2:429963740182:vaults/test-vault-1,test-vault-1,2017-11-19T08:18:38.990Z,None,0,0))

Now we’re ready to upload an archive into the vault:

scala> c.uploadArchive("test-vault-1", "my backup archive", "my-backup.zip")
TransferStarted: transfer started
TransferProgress: transfer progress: 5% (bytes: 516096)
TransferProgress: transfer progress: 10% (bytes: 1024000)
TransferProgress: transfer progress: 15% (bytes: 1540096)
TransferProgress: transfer progress: 20% (bytes: 2048000)
TransferProgress: transfer progress: 25% (bytes: 2564096)
TransferProgress: transfer progress: 30% (bytes: 3072000)
...
TransferProgress: transfer progress: 90% (bytes: 9216000)
TransferProgress: transfer progress: 95% (bytes: 9732096)
TransferProgress: transfer progress: 100% (bytes: 10240000)
TransferCompleted: transfer completed
res3: fi.markoa.glacier.Archive = Archive(WREjqj2BItYhI5BGV7mdJGsDl3oztPvpvVh_hngm5SWqJkOd5jnLipLyYy2KkM74-3mkt85nUjI4a_hcQZhtLnQF03K0sv2Bc97BYEwYQ7M4O_lmtgrCTuGCyAEEiuQmCFfRSnBkTw,Some(my-backup.zip),0c5dc86251d157e29cfadb04ac615426600a4e1177a8ac2c1134d895378b3acd,10240000,Some(my backup archive))

Note that Glacier doesn’t maintain an up-to-date list vault contents – a list of contents needs to be requested explicitly and preparing it can take a very long time. For this reason Glacier client stores a local catalogue of archives per vault. Vault contents can be listed as follows:

scala> c.catListArchives("test-vault-1")
res4: Seq[fi.markoa.glacier.Archive] = ArraySeq(Archive(WREjqj2BItYhI5BGV7mdJGsDl3oztPvpvVh_hngm5SWqJkOd5jnLipLyYy2KkM74-3mkt85nUjI4a_hcQZhtLnQF03K0sv2Bc97BYEwYQ7M4O_lmtgrCTuGCyAEEiuQmCFfRSnBkTw,Some(my-backup.zip),0c5dc86251d157e29cfadb04ac615426600a4e1177a8ac2c1134d895378b3acd,10240000,Some(my backup archive)))

Archives need to be prepared prior to their retrieval and preparation can take several hours. For this reason it’s often more convenient to retrieve them asynchronously: 1) you request archive retrieval and after Glacier has finished preparing archive you can 2) download it.

scala> c.prepareArchiveRetrieval("test-vault-1", "WREjqj2BItYhI5BGV7mdJGsDl3oztPvpvVh_hngm5SWqJkOd5jnLipLyYy2KkM74-3mkt85nUjI4a_hcQZhtLnQF03K0sv2Bc97BYEwYQ7M4O_lmtgrCTuGCyAEEiuQmCFfRSnBkTw")
res1: Option[String] = Some(h479o4kxdawFsho0POzQAznw6e6beampFAIBYuI7s41O_HmzqqWsg2qk2vL2Lw_4MOsI1VFarvokz7NXczBq0CrwPKzv)

Archive retrieval is added in the vault’s list of jobs. You can list unfinished jobs as follows:

scala> c.listJobs("test-vault-1")
res4: Seq[fi.markoa.glacier.Job] = ArrayBuffer(Job(h479o4kxdawFsho0POzQAznw6e6beampFAIBYuI7s41O_HmzqqWsg2qk2vL2Lw_4MOsI1VFarvokz7NXczBq0CrwPKzv,arn:aws:glacier:us-west-2:429963740182:vaults/test-vault-1,ArchiveRetrieval,null,2017-11-19T09:00:34.339Z,InProgress,null,None,Some(WREjqj2BItYhI5BGV7mdJGsDl3oztPvpvVh_hngm5SWqJkOd5jnLipLyYy2KkM74-3mkt85nUjI4a_hcQZhtLnQF03K0sv2Bc97BYEwYQ7M4O_lmtgrCTuGCyAEEiuQmCFfRSnBkTw)))

Notice the InProgress status. Once archive preparation has been finished the job list will look something like this:

scala> c.listJobs("test-vault-1")
res8: Seq[fi.markoa.glacier.Job] = ArrayBuffer(Job(h479o4kxdawFsho0POzQAznw6e6beampFAIBYuI7s41O_HmzqqWsg2qk2vL2Lw_4MOsI1VFarvokz7NXczBq0CrwPKzv,arn:aws:glacier:us-west-2:429963740182:vaults/test-vault-1,ArchiveRetrieval,null,2017-11-19T09:00:34.339Z,Succeeded,Succeeded,Some(2017-11-19T12:52:38.363Z),Some(WREjqj2BItYhI5BGV7mdJGsDl3oztPvpvVh_hngm5SWqJkOd5jnLipLyYy2KkM74-3mkt85nUjI4a_hcQZhtLnQF03K0sv2Bc97BYEwYQ7M4O_lmtgrCTuGCyAEEiuQmCFfRSnBkTw)))

Setting up notifications relieves you from having to periodically poll job completion status and instead receive notifications. Notifications can be set up via AWS Console.

A prepared archive can then be downloaded from Glacier using the retrieval job ID:

scala> c.downloadPreparedArchive("test-vault-1", "h479o4kxdawFsho0POzQAznw6e6beampFAIBYuI7s41O_HmzqqWsg2qk2vL2Lw_4MOsI1VFarvokz7NXczBq0CrwPKzv", "my-backup-restored.zip")
TransferStarted: transfer started
TransferProgress: transfer progress: 5% (bytes: 520869)
TransferProgress: transfer progress: 10% (bytes: 1025701)
TransferProgress: transfer progress: 15% (bytes: 1547941)
TransferProgress: transfer progress: 20% (bytes: 2052773)
TransferProgress: transfer progress: 25% (bytes: 2575013)
TransferProgress: transfer progress: 30% (bytes: 3079845)
...
TransferProgress: transfer progress: 90% (bytes: 9228965)
TransferProgress: transfer progress: 95% (bytes: 9736869)
TransferProgress: transfer progress: 100% (bytes: 10240000)
TransferCompleted: transfer completed

That’s it for the basic operations!

Some other tasks Glacier client let’s you do include delete vault, request a vault inventories (list of archives a vault contains), download inventories and delete archives.

development, devops