One of the biggest changes we made to the recent SkyDrive release was how we deal with permissions on files and folders. Making these underlying changes to our service without impacting customers is a bit like replacing the engines on an airplane while it’s flying. The technical challenges were tremendous, but the end result is a system that allows far more flexibility in how you share your files and photos. This post was authored by David Nichols, Software Development Lead for our Storage system, and discusses the technical challenges in making app-centric sharing possible.

-Omar Shahine, Group Program Manager, SkyDrive.com

Our latest releases of SkyDrive include a major revision to our sharing system that lets you give other people permission to see, or even edit, your documents and photos. These releases involved a lot of work in both our front-end web system, which implements the user interface to SkyDrive.com, and our back-end file system, which provides persistent storage for your documents and photos. You can also see this capability in SkyDrive for Windows Phone and iPhone in the form of “view-only” and “view and edit” link sharing. Along the way we had several design challenges, and in this post we’ll look at three of them: sharing your data with people who don’t use Windows Live, sharing your data from anywhere in your file tree, and finding the files that people have shared with you.

Share your data with anyone

Social networks were still new when we first designed SkyDrive. Facebook wasn’t available outside of universities; MySpace was in its heyday; the idea of integration between networks was a long way off. We expected the sharing patterns to be either sharing with a specific list of contacts in Windows Live or with Messenger buddies. As a result, it was awkward to share with someone who didn’t have a Windows Live account. The solution to this problem lies in the way we represent sharing permission for files and folders.

Every file or folder in SkyDrive has an optional “access control list” that shows who’s allowed to read or edit the file or folder. You can apply permissions at the folder level (which means that everything inside the folder has the same set of permissions), or you can apply different permissions to individual items inside the folder. This is similar to how enterprise systems (such as Microsoft Windows) track permission information, but our system has a twist.

In addition to being able to hold entries such as “user x” or “buddies of user y,” our system can also hold “token-based” access items. A token is just a string of random (and thus unguessable) bits. If you know the bits, you can gain whatever access the token gives you. We embed these tokens in URLs and send them out in the invitation email when you share a file. When the recipient clicks the link in the invitation, they either get direct access to the file, or get the option to add their Windows Live ID to the access list for the file.
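As a rough illustration of the token idea, here is a minimal Python sketch. It is not SkyDrive's actual code: the function names, the 16-byte token length, the example host, and the `token` query parameter are all assumptions made for this example.

```python
import secrets

def new_share_token(num_bytes=16):
    """Return an unguessable, URL-safe string built from random bytes."""
    return secrets.token_urlsafe(num_bytes)

def invitation_url(base_url, token):
    """Embed the token in a link. The URL shape is hypothetical."""
    return f"{base_url}?token={token}"

token = new_share_token()
url = invitation_url("https://skydrive.example.com/view/fried-okra", token)
```

Anyone who holds the URL holds the token, so the link itself acts as the credential.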

Here’s an example of how this works

Let’s say that Alice wants to share her famous fried okra recipe with Bob, Carol, and David. She knows their email addresses but only has a Windows Live ID for Carol, who is one of her Messenger buddies. Alice uses the Share dialog on the file “Fried Okra.docx” and enters the email addresses for Bob, Carol, and David. After sending the invitation, the access list for “Fried Okra.docx” looks something like this:

| Who | Access | Comment |
| --- | --- | --- |
| Token 23 (the real ones are longer) | Read | bob@contoso.com |
| carol@hotmail-example.com (a Windows Live ID) | Read | |
| Token 51 | Read | david@contoso.com |

Bob gets an email with the token URL, and simply uses it to read the document. As long as he saves the email, he can continue to use that URL (unless Alice changes her mind; see below). Carol uses the URL and logs in with her Windows Live ID. By doing so, not only can she see the document, but it also shows up on her “Shared With Me” list whenever she uses SkyDrive. David has a Windows Live ID that Alice didn’t know about, so when he uses the URL, he’s able to substitute his actual Windows Live ID for the token and also see the okra recipe in his “Shared With Me” list. At this point, the access list looks like this:

| Who | Access | Comment |
| --- | --- | --- |
| Token 23 (the real ones are longer) | Read | bob@contoso.com |
| carol@hotmail-example.com (a Windows Live ID) | Read | |
| david@live-example.com | Read | david@contoso.com |

Why the comments? Their purpose is to help with revocation. Say Alice has a change of heart about sharing and wants to remove access from Bob and Carol. When she goes to edit access for the document, she needs to see something more informative than “Token 23.” Because the system remembered the original recipients the tokens were intended for, Alice can choose the correct items to remove from the access list. Once the token has been revoked, the URL in Bob’s saved email will stop working.
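The okra example can be walked through in a few lines of Python. This is only a sketch of the behavior described above, not the real data model: the class names, method names, and short token strings are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class AccessEntry:
    who: str           # a token string or a Windows Live ID
    access: str        # e.g. "Read"
    comment: str = ""  # address the invitation was originally sent to

class AccessList:
    def __init__(self):
        self.entries = []

    def grant(self, who, access, comment=""):
        self.entries.append(AccessEntry(who, access, comment))

    def allows(self, who):
        return any(e.who == who for e in self.entries)

    def redeem(self, token, live_id):
        """Swap a token entry for a Windows Live ID, keeping the comment."""
        for e in self.entries:
            if e.who == token:
                e.who = live_id
                return True
        return False

    def revoke(self, label):
        """Remove entries whose ID or comment matches the label."""
        if not label:
            return  # never match entries that simply have no comment
        self.entries = [e for e in self.entries
                        if label not in (e.who, e.comment)]

acl = AccessList()
acl.grant("token-23", "Read", comment="bob@contoso.com")
acl.grant("carol@hotmail-example.com", "Read")
acl.grant("token-51", "Read", comment="david@contoso.com")

acl.redeem("token-51", "david@live-example.com")  # David signs in
acl.revoke("bob@contoso.com")                     # Alice picks Bob by comment
acl.revoke("carol@hotmail-example.com")           # and Carol by her ID
```

After these calls, only David's entry remains, and the token in Bob's saved email no longer matches anything in the list.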

Share your files without moving them

The old sharing system for SkyDrive was optimized for the way we expected people to use the system at the time. SkyDrive was used mostly for sharing photos, so we wanted to make it as simple as possible to share an album at a time. We understood that tracking what was shared and what wasn’t could get complex, so we limited the possible “sharable things” to top-level albums in someone’s SkyDrive.

As we added support for storing, editing and finding Office documents, we realized that this simple sharing model wouldn’t capture the sharing patterns our users needed. As Tony East mentioned in his post Designing app-centric sharing for SkyDrive, part 1 of 2: Complexity of “simple,” the ability to share shouldn’t depend on file organization. You should be able to point to any file, anywhere, and share it without moving it.

The problem with this lay in an early decision to store file access information in a different service than the SkyDrive backend. Until this release, the access lists for folders were stored in our contacts and relationships system, ABCH. While this made sense in light of the scenarios at the time, the new sharing model was going to cause scaling issues, because every shared file in SkyDrive would require data in ABCH.

To get the access lists back in SkyDrive, we needed a data migration. Data migrations are quite complicated in large-scale online systems, because the user data is partitioned across many servers in our data centers. Both SkyDrive and ABCH partition the users across servers, but we use different patterns to do so. So while Alice and Bob’s data might be on the same server in SkyDrive, their data is likely on different servers in ABCH.

We know how to do this: start up a set of migration tasks in our job system, have them examine each user individually, and then move that user’s data. Because we’re moving data from one system to another, this can take as long as a few months to complete. To shorten the effective migration time, we ran a local-to-SkyDrive pass that tweaked our internal data format to support on-demand migration. As soon as this was done, we were ready to support the new features. If a user edits sharing on an existing folder, we bring the data for that folder over right away. In the meantime, our migration job is moving all the data, whether it’s changed or not.
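The combination of on-demand and background migration can be sketched as follows. This is a toy model under stated assumptions: the two dictionaries stand in for the ABCH and SkyDrive stores, and the function names are hypothetical. The real format change is not described in this post.

```python
# Legacy store (stands in for ABCH) and new store (stands in for SkyDrive).
abch_acls = {
    "folder-1": ["carol@hotmail-example.com"],
    "folder-2": ["david@live-example.com"],
}
skydrive_acls = {}

def migrate_folder(folder_id):
    """Copy a folder's access list once; later calls are no-ops."""
    if folder_id not in skydrive_acls:
        skydrive_acls[folder_id] = list(abch_acls.get(folder_id, []))
    return skydrive_acls[folder_id]

def edit_sharing(folder_id, new_entry):
    """On-demand path: bring the folder over right away, then edit it."""
    acl = migrate_folder(folder_id)
    acl.append(new_entry)

def background_migration():
    """Bulk path: walk every folder, whether or not it has changed."""
    for folder_id in abch_acls:
        migrate_folder(folder_id)

edit_sharing("folder-2", "bob@contoso.com")  # user edit arrives first
background_migration()                       # bulk job catches up later
```

The point of the idempotent `migrate_folder` is that the bulk job cannot overwrite an edit made through the on-demand path.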

Find what’s shared with you

Another feature of our sharing system that’s different from conventional file systems is the “Shared With Me” list. While you can save all the invitation emails that tell you about files your friends have shared, we’ve found that it’s great if the system can manage this list for you. Because we partition our file data on servers by the user who owns the data, this isn’t trivial to do. If ten people share files with Alice, the access lists for those files are on ten different servers out of hundreds in our system, so there’s no one good place to go to for the list. To solve this problem, our implementation builds on our full-text indexing system, so let’s take a look at that.

Full-text systems work by taking documents in the system and finding all the words in each. From this, they create “inverted indices,” which have words and the corresponding list of documents that contain those words. For example, there might be an entry like “okra: 1,7,107,243,512,514,…” and another, “recipe: 3,56,107,201,512,703,…” which means that the word “okra” appears in the first, seventh, 107th, 243rd, etc. documents, and that “recipe” appears in the third, 56th, 107th, 201st, etc. documents. To find all documents with “okra” and “recipe”, we take the intersection of the two lists (which is easy, since they’re in order), and discover that the 107th and 512th documents contain both words.
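The intersection of two sorted lists can be done in a single pass. The sketch below uses only the document numbers listed above (the real lists continue past the “…”), and the function name is our own choice for illustration.

```python
def intersect_sorted(a, b):
    """Return the values present in both sorted lists, in one pass."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return common

okra = [1, 7, 107, 243, 512, 514]
recipe = [3, 56, 107, 201, 512, 703]
both = intersect_sorted(okra, recipe)  # [107, 512]
```

Because both lists are in order, each entry is looked at once, and neither list has to be searched repeatedly.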

SkyDrive Full-Text Index

For SkyDrive, we have a full-text index of all documents in the system. However, we can’t let people see all the documents in a search result, only the ones they are allowed to view. To do this, we index the Windows Live IDs of the allowed viewers onto the documents as well. In addition to the word entries above, we add special strings to the documents that get indexed much like the words do, but which encode the permission data. For example, the string “VIEWER=carol@hotmail-example.com” would mean that Carol has view permission for a specific document. Then the inverted index gets an entry like “VIEWER=carol@hotmail-example.com: 39, 107, 762, ...” When Carol searches for “okra recipe,” we change the query to “okra recipe VIEWER=carol@hotmail-example.com.” So Carol gets document 107 back, but not document 512, which she isn’t allowed to read.
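Here is a small sketch of that query rewrite, using the example entries from the text. The `search` function and the in-memory dictionary are assumptions for illustration. The real index is a distributed system, not a Python dict.

```python
index = {
    "okra": [1, 7, 107, 243, 512, 514],
    "recipe": [3, 56, 107, 201, 512, 703],
    "VIEWER=carol@hotmail-example.com": [39, 107, 762],
}

def search(index, query, viewer):
    """AND together the query words plus a VIEWER term for the searcher."""
    terms = query.lower().split() + [f"VIEWER={viewer}"]
    result = None
    for term in terms:
        postings = set(index.get(term, []))
        result = postings if result is None else result & postings
    return sorted(result or [])

carol_hits = search(index, "okra recipe", "carol@hotmail-example.com")
# carol_hits == [107]; document 512 is filtered out by the VIEWER term
```

The permission check costs no more than adding one extra word to the query, which is why it fits naturally into the existing index.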

With this index, an obvious way to implement “Shared With Me” is to search for the documents Carol is allowed to read. This isn’t exactly right, but it’s close. First, we want to exclude documents that she owns, because we’re showing them elsewhere. Second, we need to include photos, which normally aren’t in the full-text index. Finally, we don’t really want all the files Carol has access to, but only the files or folders where someone explicitly added Carol. If Alice shares a folder with 100 documents, we want only the folder to show up in Shared With Me, not all 100 of the contained documents. If she shares a single spreadsheet, we want to show it too.

The answer to these problems is to index all the shared files or folders with a second index field which tracks exactly the documents and folders that got explicitly shared. This field is only on the shared items, not on files contained within folders, and doesn’t include the document owner. Our search is then for “SHARED-WITH=carol@hotmail-example.com,” which gives us exactly what we want.
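To make the difference between the two fields concrete, here is a toy sketch. The item records, field names, and the rule that folder permissions flow down to contained items are assumptions made for this example. Only the “VIEWER=” and “SHARED-WITH=” strings come from the text.

```python
ALICE = "alice@example.com"
CAROL = "carol@hotmail-example.com"

items = [
    {"id": "folder-recipes", "owner": ALICE, "parent": None, "shared_with": [CAROL]},
    {"id": "doc-okra", "owner": ALICE, "parent": "folder-recipes", "shared_with": []},
    {"id": "sheet-budget", "owner": ALICE, "parent": None, "shared_with": [CAROL]},
    {"id": "carol-notes", "owner": CAROL, "parent": None, "shared_with": []},
]
by_id = {item["id"]: item for item in items}

def index_terms(item):
    """Return the permission terms to index on one item."""
    viewers = {item["owner"]}
    node = item
    while node is not None:  # collect access from the item and its ancestors
        viewers.update(node["shared_with"])
        node = by_id.get(node["parent"])
    terms = {f"VIEWER={v}" for v in viewers}
    # SHARED-WITH goes only on explicitly shared items, never for the owner.
    terms |= {f"SHARED-WITH={u}" for u in item["shared_with"]
              if u != item["owner"]}
    return terms

inverted = {}
for item in items:
    for term in index_terms(item):
        inverted.setdefault(term, []).append(item["id"])

viewable = sorted(inverted.get(f"VIEWER={CAROL}", []))
shared_with_me = sorted(inverted.get(f"SHARED-WITH={CAROL}", []))
```

In this sketch Carol can view four items, but her “Shared With Me” list holds only the recipes folder and the budget spreadsheet. The okra document inside the folder and her own notes stay out of it.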

Moving forward

Our changes in the system are a big step forward in our ability to support our sharing scenarios, but we know we aren’t done yet. As we collect feedback from you, we’ll continue to evolve how the sharing system works. With this work, we think we’re in a good spot to move forward rapidly.

David Nichols

Software Development Lead, SkyDrive.com