CKM Blog

UCSF Center for Knowledge Management

Curing Cancer (No, Not Really) With HTML5 (Sort Of)

Wed, 2014-11-05 09:49

Here’s a talk about federated search and Amalgamatic that I gave at HTML5DevConf in October. Hopefully, the conference will post a better quality video where you can actually see the demos. But for now, this is what I recorded.

Categories: CKM

Chunked Uploads with jQuery File Upload and Ruby on Rails

Tue, 2014-10-21 08:13

I was inspired by this thread on the Hydra-Tech Google group to put together a short tutorial on chunked uploads with jQuery File Upload and Ruby on Rails.

There’s a nice example on GitHub for using this plugin with Rails, so we don’t need to do this from scratch. To see how this plugin works without chunked file uploads, clone the example

https://github.com/tors/jquery-fileupload-rails-paperclip-example

then run bundle install and rake db:migrate, start rails server, and go to port 3000. You should see a nice UI for file uploads.

Try uploading a small file (under 500 KB). It works well, has a nice-looking upload bar that flashes for just a few moments, and allows you to delete or cancel an upload. Now try uploading something a little bigger, maybe around 500 MB. The progress bar will really come in handy here, but once it is complete, your server will most likely hang for a few minutes (at least). A file a few GB in size, while permitted, will take even longer and may slow down your web browser and localhost server to the point where they are unusable and you need to terminate a few processes manually from the command line.

At some point, you’ll probably start wanting to use the chunked file uploads feature, which splits the file into smaller pieces and submits them, one after the other, across the wire to your server.

To activate chunked file uploads, open the jquery-fileupload-rails-paperclip-example app and navigate to the views/uploads/index.html.erb file. At the bottom of this file (line 118 on my system), you’ll see a call to:

$('#fileupload').fileupload();

To use chunked file uploads, replace the above bit of code with:

$('#fileupload').fileupload({ maxChunkSize: 1000000, maxFileSize: 1000000 * 10000 });

Note: these values were chosen to illustrate the app working; you may want different values in production. For more on chunked file uploads, check the documentation.

Give it another try (you may need to reload the JavaScript for your change to take effect). Upload a medium-sized file (at least a few dozen MB to see the full effect) and watch your dev server spin.

Watch your server log as Rails processes the upload, and you’ll see that the controller is called repeatedly as the file uploads: the file uploader is splitting the medium-sized file into 1 MB chunks and submitting them sequentially to the server.

Great, except… refresh the page.

The Rails app is processing each chunk as if it were a separate upload. Keep in mind, this has nothing to do with the jQuery File Upload plugin; it’s doing exactly what we asked it to do – split up a file into small pieces and submit each one to the server. We just need to change how we’re processing this on the server side.

Before we go on, go ahead and delete those files (the fastest way is to check the box next to the upper menu delete button).

To process each chunk as part of a single file, rather than as a separate independent file upload, take a look at the create method in the uploads_controller.rb controller.

The first line

@upload = Upload.new(params[:upload])

creates a new upload object and processes it as a single file upload. You’ll need to modify this to process it as a chunk rather than as a complete file.

There are plenty of ways to do this. To keep it all contained in an example, I’ll just go ahead and hack the controller. Remove the line above and replace it with the following code (entire method posted to avoid confusion about what to paste where)…

def create
  # Read the incoming chunk's parameters into a temporary upload object
  @temp_upload = Upload.new(params[:upload])
  # Look for an existing upload with the same file name
  @upload = Upload.where(:upload_file_name => @temp_upload.upload_file_name).first
  if @upload.nil?
    # First chunk: create the upload record
    @upload = Upload.create(params[:upload])
  else
    # Subsequent chunks: keep a running total of the file size
    if @upload.upload_file_size.nil?
      @upload.upload_file_size = @temp_upload.upload_file_size
    else
      @upload.upload_file_size += @temp_upload.upload_file_size
    end
  end
  # Append the chunk to a file in the top-level "uploads" directory
  p = params[:upload]
  name = p[:upload].original_filename
  directory = "uploads"
  path = File.join(directory, name.gsub(" ", "_"))
  File.open(path, "ab") { |f| f.write(p[:upload].read) }
  respond_to do |format|
    if @upload.save
      format.html { render :json => [@upload.to_jq_upload].to_json, :content_type => 'text/html', :layout => false }
      format.json { render json: { files: [@upload.to_jq_upload] }, status: :created, location: @upload }
    else
      format.html { render action: "new" }
      format.json { render json: @upload.errors, status: :unprocessable_entity }
    end
  end
end

Note – to keep things simple, I hard-coded a new upload directory into the method, so you’ll need a top-level “uploads” directory in your Rails app for this to work.

In this method, we are now reading the upload parameters into a temporary upload object. We then look for an existing upload object with that file name (yes, different uploads could have the same file name, so you’d need a different approach for production). If that object doesn’t exist, we create it and treat the file chunk as the start of a new upload. If it does exist, we retrieve it and append the newly uploaded chunk to the existing file. This repeats until the last chunk is processed (there’s also a running counter for file size).
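
As an aside, one such different approach: as far as I can tell from the plugin’s documentation, jQuery File Upload sends a Content-Range header with each chunk (for example “bytes 0-999999/5242880”), and parsing it is a more reliable way to recognize chunks and detect the final one than matching on file names. A minimal sketch, not part of the example app:

# Sketch only: parse the Content-Range header the plugin sends with each chunk.
def chunk_info
  range = request.headers['Content-Range'] # nil when the upload is not chunked
  return nil unless range && range =~ %r{bytes (\d+)-(\d+)/(\d+)}
  first_byte, last_byte, total = $1.to_i, $2.to_i, $3.to_i
  { first_byte: first_byte,
    last_byte: last_byte,
    total_size: total,
    final_chunk: last_byte + 1 == total }
end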

Give it a try and refresh the screen. This time, you should see only one file upload, and you should be able to retrieve that file from your uploads directory. You may also notice that the lag between the progress bar completing and the upload actually finishing is much shorter now, as the file is written to disk in small increments (the server no longer has to process one huge temp file after the upload completes, just the final chunk).

One note – some of the functionality of the upload plugin (such as delete) will no longer work with the new directory location.

So, should I do chunked file uploads?

The general approach above, with some modifications, does make it possible to process bigger files and stick with a pure Ruby solution, and it could take care of the problem of medium-sized file uploads that need to be chunked but perhaps don’t require an industrial-strength solution.

However, if you want to allow really big file uploads, you might want to consider a solution that allows a user to upload a file to Box, Dropbox, Google Drive, etc., and then transfer it to your server from there as a background job. In fact, there’s a very nice gem from Hydra Labs called “browse-everything” that provides this functionality and integrates nicely with Sufia. If you’re already using the “browse-everything” type of approach, you might just go ahead and limit direct non-chunked uploads to small files that won’t tax the system, rather than managing the complexity overhead of a solution that sits in between.

In addition to the obvious problems I dismissed with a bit of hand waving and vague excuses about this being a sample exercise, there are other issues to consider with large file uploads. What do you do about partial uploads, where the user closed the browser or lost the network connection partway through? What about checksums (we can check the sum on each chunk, but were the chunks assembled properly on the server)?
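
For the checksum question, one simple option is to have the client send a digest of the whole file and compare it against the assembled result once the last chunk has been written. A rough sketch (the expected_md5 value would have to come from the client, which is left as an exercise):

require 'digest'

# Illustration only: compare a client-supplied checksum against a digest of
# the assembled file to catch chunks that were appended out of order or lost.
def assembled_file_ok?(path, expected_md5)
  Digest::MD5.file(path).hexdigest == expected_md5
end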

Chunked file uploads are a very useful way to get that middle ground, and I suspect many of your users would really appreciate being able to upload medium-sized files without having to create an external account. I just want to emphasize that while the above approach can work, there’s a lot to consider here.

Categories: CKM

Introducing Amalgamatic

Thu, 2014-10-16 07:40
“Search!” by Jeffrey Beall licensed CC BY-ND 2.0

Academic libraries offer many resources, but users cannot be expected to search in, say, a half dozen different interfaces to find what they’re looking for. So academic libraries typically offer federated search.

Sometimes, a solution is purchased. Many libraries, for example, use 360 Search.

Here at UCSF, we are among the libraries that have built our own federated search. Twice.

There are (at least) three ways to pull data out of other resources in real time.

  1. Cool, they have an API for that!
    This almost never happens.

  2. I will screen-scrape the #*%!?@ out of your website!
    This is by far the most common scenario.

  3. Web New-dot-Oh: It’s full of JavaScript that injects the content.
    An edge case, but one that is becoming more important all the time. Make friends with PhantomJS to scrape these sites.

When trying to implement these solutions, a common approach is to build your screen-scraping federated search tool with traditional server-side languages like Java or PHP.

These strategies and technologies bring with them pitfalls to be avoided. Recompiling your WAR every time one of your target systems modifies their HTML layout, anyone?

Here’s another pitfall: our group built a solution years ago (our first one) that is implemented in Drupal with no externally facing API. So, if I want to experiment with a different results interface, I have to write it in Drupal. This tight coupling prevents experimentation with other technologies or things that don’t fit neatly into the Drupal paradigm.

A lot of the pitfalls can be avoided by following sound software architecture principles. But one thing should be uncontroversial:

No programming language has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than JavaScript.
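
To illustrate, here’s a rough sketch of technique #2 in Node.js, using the cheerio library to pull results out of a scraped page. The URL and selectors are invented for the example:

// Fetch a (made-up) catalog results page and extract result links with cheerio.
var http = require('http');
var cheerio = require('cheerio');

http.get('http://catalog.example.edu/search?q=anatomy', function (res) {
  var html = '';
  res.setEncoding('utf8');
  res.on('data', function (chunk) { html += chunk; });
  res.on('end', function () {
    var $ = cheerio.load(html);
    $('.result a.title').each(function () {
      console.log($(this).text() + ' -> ' + $(this).attr('href'));
    });
  });
});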

So how about building your federated search server using Node.js? Or maybe even take it a step further and just let your users’ browsers execute the federated search entirely on their own, with no need to talk to your server at all! If it’s all just JavaScript, why not?

That is our approach this second time around.

First, we wrote a pluggable, extensible federated search tool called Amalgamatic.

Second, we wrote the plugins that we needed to search the resources we were interested in.

In the course of writing these plugins, we used all three of the techniques described above (API, scraping HTML, and using a headless browser to get JavaScript-generated content).

Third, we used Amalgamatic to expose federated search on our API server. (source code)

Fourth, we set up a prototype search interface to use that API. (source code)

Lastly, because we could, we used Browserify to create a demo showing how to use Amalgamatic so that all the retrieval and processing happens in the browser—no need for an intermediary API or search server! (source code)

I hope others find this work useful. Use Amalgamatic, ask questions, file issues for bugs or feature requests, and write and publish your own plugins.
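
To give a flavor of what that looks like, here’s a sketch of registering a plugin and running a search. I’m writing the method and option names from memory, so treat them as placeholders and check the Amalgamatic README for the real API:

// Sketch only -- names are approximate, not copied from the docs.
var amalgamatic = require('amalgamatic');
var pubmed = require('amalgamatic-pubmed');

amalgamatic.add({ pubmed: pubmed });

amalgamatic.search({ searchTerm: 'anatomy' }, function (err, results) {
  if (err) { return console.error(err); }
  results.forEach(function (result) {
    console.log(result.name, result.data);
  });
});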

(Or tell me about your project that already does this better and I need to fold up shop or at least steal all your ideas.)

While I’m at it with the asides, here’s a caveat on the Browserified version: the one thing the browser couldn’t do was launch the PhantomJS headless browser for scraping sites that depend on JavaScript execution to display results. Fortunately for us, that was needed only in the LibGuides plugin. And LibGuides offers an API, so if we really wanted LibGuides results, we could use the API. We initially implemented it that way, actually, but found that the API results differed from the LibGuides search page results. We thought that might be confusing to users, so we went with PhantomJS-assisted scraping.

Categories: CKM

Random Forests and Datashare at the CDL Code Jam

Fri, 2014-09-05 10:28

The California Digital Library hosted a code jam earlier this month at the Oakland City Conference center.

This gathering brought together librarians and developers from University of California campuses for a series of working meetings with an eye toward system-wide projects, especially involving data curation.

In the spirit of an informal, code-jam style meeting, I presented a bit on my recent experiments using machine learning to categorize data. As a starting example, I applied a random forest to suggest subject keywords for data sets uploaded to the recently launched DataShare website.

A researcher who uploads a data set to Datashare through the datashare-ingest app is prompted to provide keyword information along with other metadata describing the data set. One relatively common keyword is “middle-aged”.

http://datashare.ucsf.edu/xtf/search?f1-keyword=Middle%20Aged

For the code jam, I showed one possible way to use these existing records to train a random forest to determine whether a new dataset should be tagged with the keyword “middle-aged”.

For this particular exercise, there isn’t a great deal of data – we’d need a much larger data set to really apply this. However, the small dataset has some advantages for exploring the use of random forests, as it’s possible to visually inspect the input and gain a better understanding of how the random forest categorization is working.

“Middle aged” is a relatively good term to use for an experiment, as it isn’t as obvious (or as objective) as more technical terminology. Highly technical keywords often show up repeatedly in the data set and are often very predictable. More subjective and less technical keywords such as “middle aged” may apply to a wide range of subjects, and they may show up in some records but not in others, as some researchers include the keyword and others don’t. Random forest classification can be particularly useful in this case, as these subjective keywords often apply even when they don’t show up in the description, title, or technical methods for a particular dataset.

For example, here’s a record tagged with the keyword “middle aged”.

To get started, I used a simple “bag of words” approach, with a small modification. Rather than taking a very large bag of words for all records (or a sample of all records) in Datashare, I limited the count to words that show up in records with the keyword “middle aged”.

data: 14
changes: 12
disease: 12
alzheimer: 11
diffusion: 10
dti: 9
subjects: 9
images: 9
slice: 9
brain: 8
ftd: 8
imaging: 8
axial: 7
white: 7
matter: 7
tracts: 7
reductions: 7
acquired: 7
gwi: 7
center: 7
diffusivity: 7
thickness: 6
type: 6
symptoms: 6

(for the full list, run the wordcount_summary.py script in the GitHub repo).
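
For reference, the counting step boils down to something like the following sketch (not the actual wordcount_summary.py; the stop-word list and the way record text is gathered are simplified assumptions):

# Tally the most common words in the text of records tagged "middle aged".
from collections import Counter

STOP_WORDS = {'the', 'and', 'of', 'in', 'to', 'a', 'for', 'with', 'were', 'was'}

def top_words(texts, n=25):
    counts = Counter()
    for text in texts:
        words = (w.strip('.,();:').lower() for w in text.split())
        counts.update(w for w in words if w and w not in STOP_WORDS)
    return counts.most_common(n)

# texts would be the title + abstract + methods of each "middle aged" record:
# print(top_words(texts))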

I used this bag of words to populate a random forest. For each record, I created a vector indicating the word count for each of the most common words in each dataset, using the title, abstract, and methods as text fields.

Here’s the bag of words for a single record that includes the “middle-aged” subject tag:

ark+=b7272=q67p8w9z
data 1
changes 7
disease 0
alzheimer 0
diffusion 6
dti 5
subjects 2
images 4
slice 0
brain 2
ftd 0

And here’s the bag of words for a single record that doesn’t contain this keyword:

data 5
changes 6
disease 4
alzheimer 4
diffusion 0
dti 0
subjects 16
images 1
slice 0
brain 10
ftd 0

Each of these records is converted into a vector containing the word counts, along with information about whether it was tagged with the keyword “middle-aged” (for the Python scikit-learn library, I expressed this as a binary 0 or 1).
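
In code, that conversion might look roughly like this, assuming the records have already been pulled into Python dicts with title, abstract, methods, and keywords fields (the field names are assumptions for the sketch, not the actual Datashare schema):

# Turn each record into a fixed-length word-count vector plus a 0/1 label.
def to_vector(text, vocabulary):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def build_training_data(records, vocabulary):
    X, y = [], []
    for rec in records:
        text = ' '.join([rec['title'], rec['abstract'], rec['methods']])
        X.append(to_vector(text, vocabulary))
        y.append(1 if 'Middle Aged' in rec['keywords'] else 0)
    return X, y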

A categorization problem like this, where a bag of words is used as the basis for determining whether a record belongs in a particular category, can be approached with a number of different supervised learning techniques. Logistic Regression and Decision Trees are common approaches, and Random Forest is a particularly accessible and often effective algorithm for many categorization problems. I used a Random Forest classifier here, though other algorithms might turn out to be more effective.

For this application, I converted each record into a vector representing the bag of words, with each position representing a common term, and each value representing the number of instances of that word. I ignore a number of stop words and common terms, but there’s a lot more that could be done here, especially around identifying common phrases rather than single words. Each vector, along with an indicator representing whether this record was tagged with the subject “middle aged”, is then used to train a random forest classifier.

The rforestDS-Middle-Age.py script contains python code using scikit-learn to train and evaluate a random forest classifier on the Datashare records for the “middle aged” keyword tag, using the strategy described above.
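
The core scikit-learn calls that script relies on look roughly like this, assuming X and y were built as sketched above:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X, y)

print('Train data score:', forest.score(X, y))   # accuracy on the training set itself
print('Feature importances:', forest.feature_importances_)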

Note that if you run this multiple times, you’ll get slightly different output. This is probably amplified by the relatively small number of training samples.

The train data score indicates how well the random forest categorizes the records that were used to train it. Because of the small sample size and the relatively specific vocabulary, this score comes out quite high.

Train data score: 0.941176470588

While it is interesting to observe the Random Forest’s assessment of how it performs on its own training set, a common practice is to split the training data into a training set and testing set (often at a two-thirds training, one-third testing ratio). You can then use the testing data (which was not used to build or train the random forest) to evaluate the accuracy of the classifier. I didn’t in this particular case, as the training set was very small and this is a small experiment/exercise, but it would be an important step on a larger dataset where we plan to make real use of a classifier.
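
For completeness, here’s a minimal sketch of that split-and-evaluate step (train_test_split lives in sklearn.model_selection in current scikit-learn releases; older versions kept it in sklearn.cross_validation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a third of the records for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print('Test data score:', forest.score(X_test, y_test))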

Random forests can also estimate the relative importance of each feature (in this case, how much the count of each of the most common words contributes to determining whether a record has been tagged with the subject “middle aged”). Here’s the feature importance estimate for a run of the random forest. The numbers will change slightly each time the forest is run.

data: 0.115221184049
changes: 0.0497659524022
disease: 0.0414350226064
alzheimer: 0.0438029075727
diffusion: 0.0870807358224
dti: 0.0287297792369
subjects: 0.0548861615427
images: 0.0783953700445
slice: 0.401884434536
brain: 0.0378775508884
ftd: 0.0609209012994

Although such a small sample size isn’t ideal for building a useful classifier, it can be illuminating, as the data set is small enough to hint at why certain words are so important. “Slice”, for instance, probably wouldn’t be such a strong predictor for whether a record should be tagged as “middle aged” in a larger data set. This is almost certainly a quirk related to our small sample size.

The output for this script is stored in the “categories” folder in a file named “1-Middle-Aged.txt”.

ark+=b7272=q6154f00,""
ark+=b7272=q61z429d,""
ark+=b7272=q62z13fs,"Middle-Aged"
ark+=b7272=q65q4t1r,""
ark+=b7272=q66q1v54,"Middle-Aged"
ark+=b7272=q67p8w9z,"Middle-Aged"
ark+=b7272=q6bg2kwf,""
ark+=b7272=q6cc0xmh,"Middle-Aged"
ark+=b7272=q6h41pb7,""
ark+=b7272=q6kw5cxv,"Middle-Aged"
ark+=b7272=q6mw2f2n,""
ark+=b7272=q6pn93h6,""
ark+=b7272=q6qn64nk,"Middle-Aged"
ark+=b7272=q6rn35sz,""
ark+=b7272=q6td9v7j,""
ark+=b7272=q6x63jt1,""

To apply the random forest to all subject tags, run the rforestDS.py script. You should see similar output for each keyword tag, along with the training data score and feature importance estimates. To assemble all the estimates into a single file, run the MergeFiles.py script.

GitHub repository for this exercise

GitHub repository for Dash (formerly Datashare)

Categories: CKM

Redesign Your Website in 4,000 Easy Steps

Tue, 2014-08-12 15:33

Earlier this month, Rich Trott and I delivered a session at the University of California Computing Services Conference (UCCSC) in San Francisco. It was about our experience using an approach of continuous iterative improvements and frequent feedback to keep our site fresh and meeting user needs. We talked about why this approach has been working better than the traditional complete redesign that might happen every few years (or not).

If you can’t wait to hear more, see the slides with notes.

Or if you’re more of the video type, you can check that out too.

Let us know about your experiences using this kind of approach to website upkeep and positive user experience. What works for your site or organization?

Photo by chexee

Categories: CKM

Announcing Symfony Ember.js Edition

Tue, 2014-08-05 12:30

We’ve spent a lot of time getting the configuration right for setting up Ember.js and EmberData to work with a Symfony backend. So we have decided to release a working example of getting this right:

Symfony Ember.js Edition

Background

The Ilios project is investigating a migration to Ember.js. Because we have a lot of PHP experience and a lot of PHP code, it makes sense to serve the content using Symfony. We chose Ember.js because of its convention over configuration approach and wanted to make as few customizations as possible.

EmberData provides a clean way to represent your data models in JavaScript and bind them to templates and controllers. It also has built-in REST functionality for keeping those models up to date on the server.
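
For readers new to EmberData, a model and adapter in the globals style we were using at the time look roughly like this; the model and attribute names are made up for illustration rather than taken from Ilios:

// Illustrative Ember Data (1.x-era) model definition.
App.Course = DS.Model.extend({
  title: DS.attr('string'),
  startDate: DS.attr('date'),
  sessions: DS.hasMany('session', { async: true })
});

// Point the REST adapter at the Symfony backend's API prefix.
App.ApplicationAdapter = DS.RESTAdapter.extend({
  namespace: 'api/v1'
});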

Specific Fixes

Compiling Handlebars templates

Ember expects compiled Handlebars templates to be in the JavaScript Ember.TEMPLATES object instead of Handlebars.template. That’s fine if you put all of your templates in index.html like they do in most examples. In that case, Ember does the compilation itself.

However, we wanted separate templates and routers in different files. This required pre-compiling the templates for Ember. Thankfully, there is already a Node.js tool for doing this, called ember-precompile.

It is even supported in the latest version of Assetic. However, AsseticBundle hasn’t been updated in a while, so we had to mess with the Composer definition to get this working. The Assetic compiler will fail silently if you don’t have ember-precompile installed at /usr/bin/ember-precompile. Hopefully a fix for that will be available soon.

Testing the API

We want test coverage for our API, but actually getting the right input proved to be a bit complicated. There is a demo controller test and a base test in the AcmeApiBundle in this distribution. You can use it as a starting point to make writing other tests easier.

JS Dependencies

We use Bower to install all of our dependencies, include them in the layout, and manage their versions without checking the code into our repo.

None of this would have been possible without:

Categories: CKM