Who was there?
The Wolfram Data Summit is an invitation-only annual conference attended by folks who spend their days grappling with and analysing the phenomenal amount of data being generated across disciplines and around the world each day (a.k.a. ‘Big Data’).
The conference was introduced by Stephen Wolfram, the founder of Wolfram Research, who is passionate about data. Any data: be it personal analytics (see his wonderful blog post on his email habits over the last 25 years), the data generated by analysing a plate of food with an RGB camera to establish its calorific content, or the uses of the huge amount of data generated by mobile phones, both for good (e.g. traffic updates) and for evil (e.g. Big Brother).
Stephen explained that Wolfram have a 20-year-long to-do list…mostly generated from user searches submitted at WolframAlpha.com, which disambiguates free-text input using algorithms courtesy of Wolfram Mathematica. Apparently 93% of queries (a.k.a. web “utterances”) are ‘understood’ by WolframAlpha.com. This is the same linguistic technology used by Siri (on the iPhone), although Stephen explained that Siri only has a 90% efficacy rate … as people typically like to ask questions about themselves, and Siri struggles with questions like “Who’s the fairest of them all, Siri?”!
To see WolframAlpha’s linguistic understanding in action, type the following at wolframalpha.com:
How many baseballs does it take to fill a 747?
The wonderful thing here is that WolframAlpha understands both the need for a volume-based metric and that ‘747’ refers to the plane. Smart!
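For anyone wanting to script this sort of thing rather than type it into the website, WolframAlpha also exposes its answers via a web API. Here is a minimal sketch of building such a request in Python — the AppID is a hypothetical placeholder (you'd need to register for your own), and I'm only constructing the URL rather than calling it:

```python
from urllib.parse import urlencode

# WolframAlpha's "Short Answers" API endpoint; requires a registered AppID.
API_BASE = "http://api.wolframalpha.com/v1/result"


def build_query_url(question: str, appid: str) -> str:
    """Build a WolframAlpha API URL for a natural-language question."""
    # urlencode handles spaces and punctuation in the free-text query.
    return API_BASE + "?" + urlencode({"appid": appid, "i": question})


url = build_query_url("How many baseballs does it take to fill a 747?",
                      "DEMO-APPID")  # placeholder, not a real AppID
print(url)
```

Fetching that URL (with a valid AppID) returns the plain-text answer, natural-language disambiguation included.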
Wolfram are now starting to release private implementations of WolframAlpha, which is apparently producing some very interesting combinations of public and private data. It’s also now possible to submit image-based searches, but I am still unsure as to why anyone would want to do that.
As per Stephen’s blog, Wolfram are also interested in personal analytics. This is the sort of thing we’re increasingly being asked to produce by academics, so if you’ve not seen it already, go to http://www.wolframalpha.com/facebook/ to see some pretty visualisations of your Facebook ‘friends’. I’ve only got about 6, so I won’t be showcasing my results here.
Stephen wrapped up his talk by heralding the birth of knowledge-based computer programming. Sounds good to me!
Other highlights from the event:
- A fast-paced talk by Kalev Leetaru on GDELT, whose goal is to help uncover previously obscured spatial, temporal, and perceptual trends (his words, not mine) through new forms of analysis of huge textual repositories that capture global societal activity, from news and social media archives to knowledge repositories. One example was the analysis of social media data to establish what’s really going on in Syria (or not). See http://gdelt.utdallas.edu for more info.
- A talk on the analysis of symptoms entered into the WebMD Symptom Checker from across the world to better understand the geographical distribution of disease activity, e.g. flu tracking.
- An interesting talk by Greg Newby on Project Gutenberg, which is attempting to publicly release all out-of-copyright books, and which even included a free giveaway of 29k books on DVD.
- A fascinating talk by Paul Lamere from The Echo Nest on mining music data for tasks such as automatic genre detection, song similarity for music recommendation, and data visualisation for music exploration and discovery. Very cool stuff!
- A whimsical talk by Anthony Scriffignano from Dun & Bradstreet on the ways in which companies are struggling to ask meaningful questions that leverage the “V’s” of large amounts of data (volume, variety, velocity, veracity). This was much more familiar territory for me, and Anthony did a really good job of summarising both the good and bad practices of analysing large datasets, with data ‘assumptions’ being one of his biggest bugbears.
- A talk by Leslie Johnston, Chief of Repository Development at the Library of Congress, who explained that the Library was now planning to expand its collections to also serve research data associated with journals. This move had seen them add 45 new file types in the first 12 months. Leslie summarised the challenges associated with hosting datasets in a public-facing repository. We can relate to a lot of them: multiple file formats, unclear and undocumented rights, security, missing metadata, data citation and identifier issues, discovery expectations and, of course, cost.
- Finally, a talk by Jan Brase, Executive Officer of DataCite, on what DataCite is and where it is heading. This was a particularly interesting session for me to sit in on, as I was one of the very few people in the room who had heard of DataCite. Interestingly, it was also one of the sessions that produced the most questions during the Q&A … probably because a lot of the audience were interested in getting their data cited! N.B. figshare are members of DataCite through CDL.
I asked whether affiliation information would be included in a future version of the metadata schema, and Jan confirmed this would be discussed at an upcoming DataCite summit. Until then, it’s going to be hard to include DataCite as an Elements data source.
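For context, a DataCite metadata record looks roughly like the sketch below (element names follow the DataCite Metadata Schema; the DOI and names are placeholders, and the affiliation line is the hypothetical addition under discussion, not something the schema supported at the time):

```xml
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <creators>
    <creator>
      <creatorName>Doe, Jane</creatorName>
      <!-- Hypothetical: affiliation support was still being discussed -->
      <affiliation>Example University</affiliation>
    </creator>
  </creators>
  <titles>
    <title>Example research dataset</title>
  </titles>
  <publisher>figshare</publisher>
  <publicationYear>2013</publicationYear>
</resource>
```

Without an affiliation element like that one, there is no reliable way to match a DataCite record to an institution, which is exactly why it matters for Elements.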
He also demoed Citeproc, a collaborative effort with CrossRef.
One of the big topics at the conference was data privacy and the anonymisation (or not) of data. For example, the US Census people go to huge lengths to ensure any information based on US Census data cannot be traced back to the individual, and yet we allow mobile phone operators to track our every waking move with apparently little concern.
Sadly for us, the lack of persistent identifiers means bibliographic data is naturally anonymised, but it was nice to be reminded that our collective efforts form part of the wider knowledge economy.
Like this post? Follow Jonathan on Twitter: @breezier