Feed aggregator

Introducing Amalgamatic

CKM Blog - Thu, 2014-10-16 07:40
“Search!” by Jeffrey Beall licensed CC BY-ND 2.0

Academic libraries offer many resources, but users cannot be expected to search in, say, a half dozen different interfaces to find what they’re looking for. So academic libraries typically offer federated search.

Sometimes, a solution is purchased. Many libraries, for example, use 360 Search.

Here at UCSF, we are among the libraries that have built our own federated search. Twice.

There are (at least) three ways to pull data out of other resources in real time.

  1. Cool, they have an API for that!
    This almost never happens.

  2. I will screen-scrape the #*%!?@ out of your website!
    This is by far the most common scenario.

  3. Web New-dot-Oh: It’s full of JavaScript that injects the content.
    An edge case, but one that is becoming more important all the time. Make friends with PhantomJS to scrape these sites.

When trying to implement these solutions, a common scenario is to build your screen-scraping federated search tool with traditional server-side languages like Java or PHP.

These strategies and technologies bring with them pitfalls to be avoided. Recompiling your WAR every time one of your target systems modifies their HTML layout, anyone?

Here’s another pitfall: Our group built a solution years ago (our first one) that is implemented in Drupal with no external facing API. So, if I want to experiment with a different results interface, I need to write it in Drupal. This tight coupling prevents experimentation with other technologies or things that don’t fit neatly into the Drupal paradigm.

A lot of the pitfalls can be avoided by following sound software architecture principles. But one thing should be uncontroversial:

No programming language has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than JavaScript.

So how about building your federated search server using Node.js? Or maybe even take it a step further and just let your user’s browser execute the federated search entirely by itself, no need to talk to your server! If it’s all just JavaScript, why not?

That is our approach this second time around.

First, we wrote a pluggable, extensible federated search tool called Amalgamatic.

Second, we wrote the plugins that we needed to search the resources we were interested in:

In the course of writing these plugins, we used all three of the techniques described above (API, scraping HTML, and using a headless browser to get JavaScript-generated content).

Third, we used Amalgamatic to expose federated search on our API server. (source code)

Fourth, we set up a prototype search interface to use that API. (source code)

Lastly, because we could, we used Browserify to create a demo showing how to use Amalgamatic so that all the retrieval and processing happens in the browser—no need for an intermediary API or search server! (source code)

I hope others find this work useful. Use Amalgamatic, ask questions, file issues for bugs or feature requests, and write and publish your own plugins.

(Or tell me about your project that already does this better and I need to fold up shop or at least steal all your ideas.)

While I’m at it with the small-text thing, here’s a caveat on the Browserify-ed version: The one thing the browser couldn’t do was launch the PhantomJS headless browser for scraping sites that depend on JavaScript execution to display results. Fortunately for us, that was needed in the LibGuides plugin only. And LibGuides offers an API, so if we really wanted LibGuides results, we could use the API. We initially implemented it that way, actually, but found that the API results differed from the LibGuides search page results. We thought that might be confusing to users, so we went with PhantomJS-assisted scraping.

Categories: CKM

Random Forests and Datashare at the CDL Code Jam

CKM Blog - Fri, 2014-09-05 10:28

The California Digital Library hosted a code jam earlier this month at the Oakland City Conference center.

This gathering brought together librarians and developers from University of California campuses for a series of working meetings with an eye toward system-wide projects, especially involving data curation.

In the spirit of an informal, code-jam style meeting, I presented a bit on my recent experiments using machine learning to categorize data. As a starting example, I applied a random forest to suggest subject keywords for data sets uploaded to the recently launched DataShare website.

A researcher who uploads a data set to Datashare through the datashare-ingest app is prompted to provide keyword information along with other metadata describing the data set. One relatively common keyword is “middle-aged”.


For the code jam, I showed one possible way to use these existing records to train a random forest to determine whether a new dataset should be tagged with the keyword “middle-aged”.

For this particular exercise, there isn’t a great deal of data – we’d need a much larger data set to really apply this. However, the small dataset has some advantages for exploring the use of random forests, as it’s possible to visually inspect the input and gain a better understanding of how the random forest categorization is working.

“Middle aged” is a relatively good term to use for an experiment, as it isn’t as obvious (or as objective) as more technical terminology. Highly technical keywords often show up repeatedly in the data set, and are often very predictable. More subjective and less technical keywords such as “middle aged” may apply to a wide range of subjects, and they may show up in some records but not in others, as some researchers include the keyword and others don’t. Random Forest classification can be particularly useful in this case, as these subjective keywords are more likely than technical ones to apply even when they don’t show up in the description, title, or technical methods for a particular dataset.

For example, here’s a record tagged with the keyword “middle aged”.

To get started, I took a simple “bag of words” approach, with a small modification. Rather than taking a very large bag of words for all records (or a sample of all records) in Datashare, I limited the count to words that show up in records with the keyword “middle aged”.

data: 14
changes: 12
disease: 12
alzheimer: 11
diffusion: 10
dti: 9
subjects: 9
images: 9
slice: 9
brain: 8
ftd: 8
imaging: 8
axial: 7
white: 7
matter: 7
tracts: 7
reductions: 7
acquired: 7
gwi: 7
center: 7
diffusivity: 7
thickness: 6
type: 6
symptoms: 6

(For the full list, run the wordcount_summary.py script in the GitHub repo.)
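As a rough illustration of the counting step, here is a minimal sketch that tallies words only across records tagged with a given keyword. It is not the wordcount_summary.py script itself: the record layout, the tiny stop-word list, and the helper names (tokenize, keyword_word_counts) are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical records; real Datashare records carry more fields.
records = [
    {
        "title": "DTI changes in aging",
        "abstract": "Diffusion changes in brain images.",
        "methods": "Axial slice data acquired.",
        "keywords": ["middle aged", "dti"],
    },
    {
        "title": "Mouse genome data",
        "abstract": "Sequencing data for mice.",
        "methods": "",
        "keywords": ["genomics"],
    },
]

# A tiny illustrative stop-word list, not the one used in the post.
STOP_WORDS = {"a", "and", "for", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_word_counts(records, keyword):
    """Count words only across records tagged with `keyword`."""
    counts = Counter()
    for record in records:
        if keyword in record["keywords"]:
            text = " ".join(record[f] for f in ("title", "abstract", "methods"))
            counts.update(w for w in tokenize(text) if w not in STOP_WORDS)
    return counts


counts = keyword_word_counts(records, "middle aged")
for word, n in counts.most_common(5):
    print(f"{word}: {n}")
```

Because the second record is not tagged “middle aged”, none of its words enter the count, which is the modification to the usual bag of words described above.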

I used this bag of words to populate a random forest. For each record, I created a vector indicating the word count for each of the most common words in each dataset, using the title, abstract, and methods as text fields.

Here’s the bag of words for a single record that includes the “middle-aged” subject tag:

data 1
changes 7
disease 0
alzheimer 0
diffusion 6
dti 5
subjects 2
images 4
slice 0
brain 2
ftd 0

And here’s the bag of words for a single record that doesn’t contain this keyword:

data 5
changes 6
disease 4
alzheimer 4
diffusion 0
dti 0
subjects 16
images 1
slice 0
brain 10
ftd 0

Each of these records is converted into a vector containing the word count along with information about whether it was tagged with the keyword “middle-aged” (for the python scikit-learn library, I expressed this as a binary 0 or 1).
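The conversion from a record to a vector plus a binary label might look something like the following sketch. The vocabulary is a shortened, assumed slice of the common-word list, and record_to_vector and label_for are hypothetical helper names, not functions from the repository.

```python
import re

# Assumed vocabulary: a few of the most common words from the tagged records.
VOCABULARY = ["data", "changes", "disease", "alzheimer", "diffusion", "dti"]


def record_to_vector(text, vocabulary):
    """Return one word count per vocabulary term, in vocabulary order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(term) for term in vocabulary]


def label_for(keywords, target="middle aged"):
    """Binary label: 1 if the record carries the target keyword, else 0."""
    return 1 if target in keywords else 0


# Made-up text standing in for a record's title, abstract, and methods.
text = "DTI changes: diffusion data and more changes in diffusion."
vector = record_to_vector(text, VOCABULARY)
label = label_for(["middle aged", "brain"])
print(vector, label)
```

Keeping the vocabulary order fixed matters: every record has to place the count for a given word in the same position, or the classifier would be comparing unrelated columns.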

A categorization problem like this, where a bag of words is used as the basis for determining whether a record belongs in a particular category, can be approached with a number of different supervised learning techniques. Logistic Regression and Decision Trees are common approaches, and Random Forest is a particularly accessible and often effective algorithm for many categorization problems. I used a Random Forest classifier here, though other algorithms might turn out to be more effective.

For this application, I converted each record into a vector representing the bag of words, with each position representing a common term, and each value representing the number of instances of that word. I ignore a number of stop words and common terms, but there’s a lot more that could be done here, especially around identifying common phrases rather than single words. Each vector, along with an indicator representing whether this record was tagged with the subject “middle aged”, is then used to train a random forest classifier.

The rforestDS-Middle-Age.py script contains python code using scikit-learn to train and evaluate a random forest classifier on the Datashare records for the “middle aged” keyword tag, using the strategy described above.

Note that if you run this multiple times, you’ll get slightly different output. This is probably amplified by the relatively small number of training samples.

The train data score indicates how well the random forest categorizes the records that were used to train it. Because of the small sample size and relatively specific vocabulary, the assessment is fairly high.

Train data score: 0.941176470588

While it is interesting to observe the Random Forest’s assessment of how it performs on its own training set, a common practice is to split the training data into a training set and testing set (often at a two-thirds training, one-third testing ratio). You can then use the testing data (which was not used to build or train the random forest) to evaluate the accuracy of the classifier. I didn’t in this particular case, as the training set was very small and this is a small experiment/exercise, but it would be an important step on a larger dataset where we plan to make real use of a classifier.
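For reference, a two-thirds/one-third split can be sketched in a few lines of plain Python. The function name split_train_test, the fixed seed, and the nine made-up records are assumptions for illustration; scikit-learn also ships a train_test_split helper in sklearn.model_selection that does the same job.

```python
import random


def split_train_test(vectors, labels, test_fraction=1 / 3, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [vectors[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [vectors[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


# Nine made-up records, each a two-word count vector with a 0/1 label.
vectors = [[i, i % 3] for i in range(9)]
labels = [i % 2 for i in range(9)]

X_train, y_train, X_test, y_test = split_train_test(vectors, labels)
print(len(X_train), "training records,", len(X_test), "testing records")
```

The classifier would then be trained on X_train and y_train only, and scored on X_test and y_test, so that the score reflects records the forest has never seen.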

Random Forests can also estimate the relative importance of the features (in this case, the importance of the word count for each of the most common words in determining whether a record has been tagged with the subject “middle aged”). Here’s the feature importance estimate for a run of the random forest. The numbers will change slightly each time the forest is run.

data: 0.115221184049
changes: 0.0497659524022
disease: 0.0414350226064
alzheimer: 0.0438029075727
diffusion: 0.0870807358224
dti: 0.0287297792369
subjects: 0.0548861615427
images: 0.0783953700445
slice: 0.401884434536
brain: 0.0378775508884
ftd: 0.0609209012994

Although such a small sample size isn’t ideal for building a useful classifier, it can be illuminating, as the data set is small enough to hint at why certain words are so important. “Slice”, for instance, probably wouldn’t be such a strong predictor for whether a record should be tagged as “middle aged” in a larger data set. This is almost certainly a quirk related to our small sample size.
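To make the training, scoring, and importance steps concrete, here is a minimal scikit-learn sketch. It is not the rforestDS-Middle-Age.py script: the four-word feature list, the six word-count vectors, and the labels are all invented for illustration, so the numbers it prints will not match the output shown above.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up word-count vectors; columns follow FEATURES, one row per record.
FEATURES = ["data", "changes", "diffusion", "slice"]
X = [
    [1, 7, 6, 0],
    [2, 5, 4, 1],
    [3, 6, 5, 0],
    [5, 6, 0, 0],
    [4, 1, 0, 2],
    [6, 0, 1, 0],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = tagged "middle aged", 0 = not tagged

# Fixing random_state makes repeated runs build the same forest;
# leave it unset to see the run-to-run variation mentioned earlier.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Accuracy on the same records the forest was trained on.
print("Train data score:", forest.score(X, y))

# One importance estimate per feature; the estimates sum to 1.
for name, importance in zip(FEATURES, forest.feature_importances_):
    print(f"{name}: {importance}")
```

With only a handful of records, a single word can dominate the importance estimates simply because it happens to separate the two groups in this sample.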

The output of the rforestDS-Middle-Age.py script is stored in the “categories” folder in a file named “1-Middle-Aged.txt”.


To apply the random forest to all subject tags, run the rforestDS.py script. You should see similar output for each keyword tag, along with the training data score and feature importance estimates. To assemble all the estimates into a single file, run the MergeFiles.py script.

GitHub repository for this exercise

GitHub repository for Dash (formerly Datashare)

Categories: CKM

Redesign Your Website in 4,000 Easy Steps

CKM Blog - Tue, 2014-08-12 15:33

Earlier this month, Rich Trott and I delivered a session at the University of California Computing Services Conference (UCCSC) in San Francisco. It was about our experience using an approach of continuous iterative improvements and frequent feedback to help keep our site fresh and meeting user needs. We talked about why this approach has been working better than the traditional complete redesign that might happen every few years (or not).

If you can’t wait to hear more, see the slides with notes.

Or if you’re more of the video type, you can check that out too.

Let us know about your experiences using this kind of approach to website upkeep and positive user experience. What works for your site or organization?

Photo by chexee

Categories: CKM

Announcing Symfony Ember.js Edition

CKM Blog - Tue, 2014-08-05 12:30

We’ve spent a lot of time getting the configuration right for setting up Ember.js and EmberData to work with a Symfony backend. So we have decided to release a working example of getting this right:

Symfony Ember.js Edition


The Ilios project is investigating a migration to Ember.js. Because we have a lot of PHP experience and a lot of PHP code, it makes sense to serve the content using Symfony. We chose Ember.js because of its convention over configuration approach and wanted to make as few customizations as possible.

EmberData provides a clean way to represent your data models in JavaScript and bind them to templates and controllers. It also has built-in REST functionality for keeping those models up to date on the server.

Specific Fixes

Compiling Handlebars templates

Ember expects compiled Handlebars templates to be in the JavaScript Ember.TEMPLATES object instead of Handlebars.template. That’s fine if you put all of your templates in index.html like they do in most examples. In that case, Ember does the compilation itself.

However, we wanted separate templates and routers in different files. This required pre-compiling the templates for Ember. Thankfully, there is already a Node.js application for doing this, called ember-precompile.

It is even supported in the latest version of Assetic. However, AsseticBundle hasn’t been updated in a while, so we had to mess with the Composer definition to get this working. The Assetic compiler will fail silently if you don’t have ember-precompile installed in /usr/bin/ember-precompile. Hopefully a fix for that will be available soon.

Testing the API

We want test coverage for our API, but actually getting the right input proved to be a bit complicated. There is a demo controller test and a base test in the AcmeApiBundle in this distribution. You can use it as a starting point to make writing other tests easier.

JS Dependencies

We use Bower to install all of our dependencies, include them in the layout, and manage their version without checking the code into our repo.

None of this would have been possible without:

Categories: CKM

5 Questions with Dr. Daniel Lowenstein

The Better Presenter - Mon, 2013-07-29 07:30

In the previous post, we were introduced to Dr. Daniel Lowenstein and his “Last Lecture” presentation, which was both powerful and inspiring. Shortly after writing the post, Dr. Lowenstein contacted me, and we had an interesting discussion about his experience preparing for and delivering that presentation.

I have always wanted to incorporate the voices of the instructors, students, and staff at UCSF, who work in the trenches and present or attend presentations on a daily basis. This post marks the beginning of a new series that will feature interviews of those people. I hope you enjoy the first episode of “5 Questions!”

5 Questions with Dr. Lowenstein

Bonus track: The Basement People

The full version of the original presentation has recently been uploaded to the UCSF Public Relations YouTube channel, so please head over there to watch the video, like it, and leave your comments!

If you have any ideas about who the next 5 Questions interviewee should be, please contact me or leave your ideas in the comments section below.

Categories: Better Presenter

Top 5 Lessons Learned from The Last Lecture

The Better Presenter - Thu, 2013-05-16 11:58

Powerful. Inspirational. Emotionally moving.

Those are the words that best describe Dr. Daniel Lowenstein’s “The Last Lecture” presentation, delivered to a packed house in Cole Hall on April 25th. The Last Lecture is an annual lecture series hosted by a UCSF professional school government group (and inspired by the original last lecture), in which the presenter is hand-picked by students and asked to respond to the question, “If you had but one lecture to give, what would you say?” Dr. Daniel Lowenstein, epilepsy specialist and director of the UCSF Epilepsy Center, did not disappoint. In fact, I can say with confidence that he delivered one of the best presentations that I have attended.

Rather than attempt to paraphrase his words, or provide a Cliff Notes version that doesn’t do his presentation justice, I will instead encourage you to watch the video recording of his presentation. The video is an hour in length, and if you have any interest in becoming a better presenter yourself, it is a must-watch. After the jump, we’ll explore my “top 5 lessons learned” from Dr. Lowenstein’s presentation.

Last Lecture – Top 5 Lessons Learned:

  1. “PowerPoint” is still boring. Dr. Lowenstein’s projected slide show was not typical PowerPoint. It did not consist of any bullet points, familiar and boring templates, or images “borrowed” from a last minute Google image search. Instead, he used images from his own collection, and Prezi to build a canvas of images that moved in all directions, expanding, contracting and rotating to craft his message. The resulting slide show was personal, meaningful and most importantly, relatable.
  2. Storytelling is the secret to success. When I first began studying the art of presenting, the idea of incorporating storytelling into a presentation was an elusive one. I am now convinced that storytelling is the secret to transforming a good presentation into a great presentation. It is the glue that holds all of the elements of your presentation together, as well as the glitter that makes it shine. Dr. Lowenstein’s entire presentation was crafted into a story, the setting of which was established right from the beginning and illustrated by his first content slide. There were also chapters within the story, the most memorable of which for me was the Justice segment of his presentation, and his depiction of The Basement People. He didn’t begin by pointing out the original members of the UCSF Black Caucus that were in the audience, as most presenters would have done. Instead, he gradually painted a picture for us, so we could imagine what it was like to be a minority at UCSF over 50 years ago. He described their struggles in detail, and gave us time to relate, and even pointed out the fact that they had met in that very hall where we all sat. He didn’t reveal their presence until the end of the chapter, creating a crescendo of emotion, and the moment brought tears to the eyes of many audience members.
  3. Vulnerability equals trust. If you want your audience to believe in your message, you must first give them a reason to believe in you. And one of the most effective ways to make that happen is to share your vulnerabilities. In the eyes of the audience, this makes the presenter human, and it creates a bond between both parties. No one wants to listen to a sales-pitch presentation. Instead, they want the whole story with the ups and downs, so they can decide how they feel about it on their own terms. Just be sure to share vulnerabilities that relate to the subject of the presentation, because you’re going for empathy, not sympathy (which could have a negative effect). Dr. Lowenstein, when talking about Joy and Sorrow, shared one of his deepest personal sorrows, which was the unexpected passing of his son. In contrast, he shared a touching moment with his wife, expressing his love for her, right in front of the whole audience. These moments worked perfectly in the presentation because they were genuine, and they gave the audience a deeper understanding of Dr. Lowenstein.
  4. Don’t forget humor. No matter how serious, no matter how technical, there is a place in your presentation for a little humor. It can be used to lighten a heavy moment, open closed minds, and bring everyone in a room together (even if your audience members have very different backgrounds). Amidst Dr. Lowenstein’s presentation were timely moments of humor that seemed to come naturally from his personality. And hey, who doesn’t like a good male-pattern-baldness joke, anyway?! But seriously, if you can laugh at yourself, the audience has no excuse to not laugh along with you. There are two keys to using humor in your presentation: (1) it should be relevant to the current topic or story, and (2) it can’t be forced. If you’re not good at telling jokes, then try another form of humor!
  5. Present on your passions. As a presenter, your goal is simple – to instill in the audience an understanding of your message, and a belief in you. If you give them the impression, even for a moment, that you don’t believe in yourself or the message you’re presenting, you’re a dead man walking (or presenting) in the audience’s eyes. If you choose topics that you are passionate about, however, you will never have this problem. You may think it was easy for Dr. Lowenstein to be passionate about his presentation, because his task was, in essence, to present about his life’s passions… but I can assure you, it’s not easy to talk about your own life in front of an audience. In contrast, imagine that you have to give a presentation on, say, your department’s new accounting policies. To make matters worse, imagine that your audience is being forced to attend. What do you do? Surely, there is no passion to be found in accounting policy, is there?! Well, actually, there is, if you take the right angle. For example, does this new accounting policy save the department time, or money? And then, can that saved time and money be applied towards more constructive, or creative tasks that your coworkers actually want to do? If so, and you frame the presentation in a positive light, the audience will listen.

To top it all off, Dr. Lowenstein spent the last few minutes of his presentation reviewing each of the 4 segments of his talk, and then related it all back to a single, clear message. That, my friends, is an example of storytelling 101, so I hope you were taking notes!

Continue on to part 2 of this post, where I interview Dr. Lowenstein about his experiences preparing for and delivering the Last Lecture presentation!

If you also found inspiration in Dr. Lowenstein’s presentation, please share your thoughts below, and I’ll see you at next year’s “Last Lecture” event.

Categories: Better Presenter