Talk:Data Storage

Vision for data storage

I thought I'd outline here how I'd really like the data on this site to work, so that everyone understands what the basic issues are. Update: I've abandoned this ideal slightly, and now outline just the "pragmatic" vision here. --Brucebartlett 21:20, 15 August 2007 (EDT)

I'd like things to be done through Semantic Mediawiki. This means that the site is a brave new mix of a wiki and a database. It's perhaps not a big problem that we can't run version 0.7, since I think everything we need is already in 0.4.

How wiki contributors will enter data

We run via Option B, and customize our own templates. An ordinary wiki editor, when they edit a page, will then just see something simple like this :

{{Journal Info
|Name = Antarctica Journal of Mathematics
|Published by = Y. Eti and Sons
|Editor1 = M.K.R.S. Veera Kumar
|Editor2 = K. Chandrasekhara Rao
|Annual price = 400
|Eigenfactor = 1.7
}}

In other words, it looks just like an ordinary wiki template. It is an ordinary wiki template, except that those fields are now also stored as searchable data. This is really useful: for example, on any page, anyone can now write an inline query like

<ask> [[Category:Mathematics Journals]] [[Eigenfactor:=>1.8]] </ask>

to show all the mathematics journals whose eigenfactor is greater than or equal to 1.8. These results can be displayed any way one wants.

Carrying on: the contributor then adds human text to the journal page, like "The Antarctica Journal of Mathematics began in the 1990s, when it was realized how temperature can influence mathematical prowess." If the contributor has updated info (like an updated price, or new editors), they add this to the 'human' text (if they want to) and also change the table. In short, it works just like ordinary Wikipedia.
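
To make the template format in this section concrete, here is a minimal Python sketch of how a script might generate a Journal Info call from a record of fields. The function name render_journal_info, and the idea of generating the text with a script at all, are assumptions for illustration; nothing like this is set up on the site.

```python
def render_journal_info(fields):
    """Return a {{Journal Info}} template call for a dict of field -> value.

    Hypothetical helper: it only formats text, it does not talk to the wiki.
    """
    lines = ["{{Journal Info"]
    for name, value in fields.items():
        lines.append(f"|{name} = {value}")
    lines.append("}}")
    return "\n".join(lines)


# The example record from this section.
record = {
    "Name": "Antarctica Journal of Mathematics",
    "Published by": "Y. Eti and Sons",
    "Editor1": "M.K.R.S. Veera Kumar",
    "Editor2": "K. Chandrasekhara Rao",
    "Annual price": 400,
    "Eigenfactor": 1.7,
}

print(render_journal_info(record))
```

Keeping one field per line means the generated text looks the same as what a human editor would type, so hand edits and generated pages stay interchangeable.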

How the Journal Info template will function

The hard work here is reserved for the Great Wiki Genius (a moderately good wiki person will do) who will construct the Journal Info template. There, the data will be annotated, i.e. it will contain things like [[Published by::{{{Published by}}}]] . But... to get proper functionality, this template is going to have to be hardcoded/coded in Java/use PHP tricks.

Here's how I hope it will function. Consider the lines

Published by : Springer (+)
Managing Editors : A. Einstein (+) and I. Newton (+)
Subfields : Topology (+), Geometry (+).
Eigenfactor : 1.3 (+)

Clicking on "Published by" takes you to the page which explains the relation "Published by". Clicking on Springer takes you to Springer's page in the wiki. Clicking on the (+) sign next to Springer takes you to the category [[Category:Journals published by Springer]] .

Clicking on "Managing Editors" takes you to the page which defines that term. Clicking on A. Einstein takes you to his page in the wiki. Clicking on (+) takes you to a page listing all journals for which A. Einstein is an editor. I had originally imagined this as "a page giving nicely formatted results", i.e. "set up a nice template, using inline <ask> tags, that does this for you". Instead, and this is the key point, clicking on (+) simply takes you to the web address of the Special:SearchTriple page, searching for all journals edited by that editor. In other words, we go to the link:

http://www.eurekajournalwatch.org/index.php?title=Special:SearchTriple&relation=Edited+by&object=M.K.R.S.+Veera+Kumar&do=Search+Relations

And so on. Thus some of the (+) signs - like the one next to Springer - evaluate to categories (e.g. the category of journals published by Springer), while other (+) signs - like the ones next to editors' names, or the price of a journal - take you to the Special:SearchTriple web address which searches those particular fields.
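
As a sketch of what the (+) link has to contain, the following Python snippet builds the Special:SearchTriple address shown above from a relation name and a value. The parameter names (title, relation, object, do) are copied from that link; the function name search_triple_url is an assumption for illustration. The encoding step is the part that matters, since spaces in names must become "+" for the link to work.

```python
from urllib.parse import urlencode

# Base address taken from the example link in this section.
BASE = "http://www.eurekajournalwatch.org/index.php"


def search_triple_url(relation, obj):
    """Build a Special:SearchTriple link for one relation/object pair."""
    params = {
        "title": "Special:SearchTriple",
        "relation": relation,
        "object": obj,
        "do": "Search Relations",
    }
    # safe=":" keeps the colon in "Special:SearchTriple" readable;
    # spaces are encoded as "+" by the default quoting function.
    return BASE + "?" + urlencode(params, safe=":")


print(search_triple_url("Edited by", "M.K.R.S. Veera Kumar"))
```

Inside a wiki template the same encoding would have to come from something like the urlencode parser function, which is why its availability matters for this approach.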

Users can always go directly to the Special:SearchTriple page (in 0.7 this is called the Ask page I think) where they can ask complicated questions directly, instead of via inline queries.

But I'm thinking most of the time people will just want to click on something quickly in the table, like they'll click on the (+) next to "Eigenfactor" to compare this journal's eigenfactor with some other ones.

The advantages of this pragmatic system are:

  • It requires no special coding, no peeping under the hood, and can be implemented right away.
  • It just needs the template {{urlencode}} which for some reason doesn't work.
  • When we upgrade to SMW 0.7, things improve automatically (you get better sorted tables for your queries, etc.)
  • The Journal Info template can be made to look very professional. You can have an image in the top, everything, just like that of the Tajiks.
  • We completely dump the inflexible "Factbox" system of SMW.

The disadvantages are:

  • For searching things like "what journals does G. Segal edit for", we have to rely on the ugly tables generated at the Special:SearchTriple page. Will improve with SMW 0.7. The alternative was to completely generate our own pages, using inline queries, and then format them to our heart's content. But to do that we'd need to be able to call wiki pages with parameters, something which seems completely impossible.
  • Since we're not poking under the hood, the display will never be optimal. For instance, I'd prefer the editors to be an expandable list, but such widgets are not possible in standard Mediawiki, I don't think.


We might not all be wiki geniuses (I'm not!), but we all have a fair idea of how we'd like the site to function. What do people think? --Brucebartlett 21:20, 15 August 2007 (EDT)


Update on Option B

It seems we might be able to use option B (see discussion below). As an example of the kind of Journal Info box that will be obtained, see the yellowish box I hacked together near the bottom of Antarct._J._Math.. This is supposed to be a replacement for the less flexible, automatically generated, "factbox" at the bottom of the page. It looks horrible now but I'm hoping that a wiki genius can make it look very pretty.

The aforementioned wiki genius will have to think about the following:

  • Look at the Infobox Ethnic Group template for the Tajiks. This infobox doesn't use semantic markup; that's easy. You just change {{{journalprice}}} to [[journalprice:={{{journalprice}}}]]. See the help page for semantic wiki templates.
  • Notice that advanced parser syntax, if-then statements etc. are possible.
  • Multiple editors' names can be handled by having fields {{{editor1}}}, {{{editor2}}}, ... , and then using the code [[Edited by::{{{editor1}}}]], [[Edited by::{{{editor2}}}]], ...
  • But here's the 'make-or-break' issue. The whole point is to be able to do inline queries, like the "+" functionality on the standard factboxes. I'm not sure if this is possible, which could scrap the entire idea. In particular, the bottom of this page suggests it's not possible, while the bottom of this page suggests it might be. This is the crucial point!
  • If it's not possible, then it's a big problem. Semantic searches on the data can always be made via inline searches or via the Special:Ask page, but we'd really like to be able to click on those tables to do the searches automatically. --Brucebartlett 08:00, 15 August 2007 (EDT)
Update : I hacked away at this, by attempting to fill in the actual web link for the search, in the journal info textbox. See the entry for the editor M.K.R.S. Veera Kumar at Antarct._J._Math.. It seems spaces are a problem, for one thing. --Brucebartlett 08:21, 15 August 2007 (EDT)
Update : I tried to use the standard parser function urlencode, but it didn't recognize it for some reason.


Prototype Journal Page

The page for the Journal of Knot Theory and its Ramifications has been set up as a prototype page, to show the proposed way in which data can be stored in Eureka. It uses Semantic Mediawiki version 0.4, since there are issues installing the latest version 0.7. It looks completely awful at the moment, but nevertheless the basic idea is there.

The underlying philosophy is that one enters text on a journal page in ordinary English, and not directly in template/tabular form. When important attributes are mentioned, they are marked up, e.g.:

"...is a journal published by [[Published by::World Scientific]]. It was started by Louis Kauffman in [[First Published In:=1992]] during the 'quantum topology' revolution."

This is only a very small addition to ordinary wiki syntax, in which the terms "World Scientific", etc. would have been marked up anyway. The new fields for marking up must be explained to new wiki users.

The idea is that all this data (called annotation) is then also displayed in a handy "infobox" which Blake is tweaking to look better. Ultimately we want it to look like this table on the Tajik ethnic group.

Clicking on the "+"-symbol next to Louis Kauffman's name in the "Editor-in-chief" box then takes you to a page showing all the journals he is an editor for. Again, the way these results are shown must be made to look MUCH more elegant; we wait in eager anticipation.

This leaves some questions though... what data do we store under the page for Louis Kauffman, for instance? An automatically generated table? How do we categorize the data? And so on.

I've entered in all the data available from www.journalprices.com and www.eigenfactor.org (the AMS data has not been incorporated, though it should have). All the editors (not just the main ones) appear too. Obviously this might be overkill, but I wanted to show all the data there is, so we can chop some out if need be. For instance, as regards the editors, we could (a) display them as a drop-down-list widget, or (b) not have them at all.

It is important that we settle these issues before we enter too much data into the site. --Brucebartlett 17:53, 14 August 2007 (EDT)

I feel like the page on Louis Kauffman should include an automatically generated table, including his editorships, and a space for user-generated content, probably a brief blurb on him and his interests (stealing the content from Wikipedia would be a good start). I don't feel like this sort of information should be a high priority, since it's available elsewhere, but it is good to include, and I don't see any reason not to include it.
Ok. I was just a bit worried that it might give editors a fright to see their data pop up like that. Do you think we should be including all the editors (possibly as a drop-down-list widget), and not just the "chief" ones? I'd like to include all of them... or at least have the option of including all of them. One thing that worries me though is naming conventions : if we include all of them, then our list of editors will soon grow to over a thousand, and I'm a bit concerned as to what to do with naming conventions, although I guess a "disambiguation page" (whatever that is) will do the trick. --Brucebartlett 06:07, 15 August 2007 (EDT)
Incidentally, if the data from journalprices.com is better, why don't we get rid of that table from the AMS survey? --Ben Webster 21:48, 14 August 2007 (EDT)
I copied Voevodsky's wikipedia entry, and made a template (Template:from-wp) to attribute wikipedia for any copied text. Does anyone know how to automate this process, and how to fix interwiki linking, so that the links on his page link back to wikipedia? --Ben Webster 22:09, 14 August 2007 (EDT)

All this stuff you're doing is wonderful, Bruce! Here are a couple of small comments:

  • Right now, some journal prices are listed in pounds and some in dollars; we can imagine others listed in euros. But, a mere number without units appears in the 'infobox'. This could be a real problem!
  • You write: "I was just a bit worried that it might give editors a fright to see their data pop up like that." I think it's very important that editors get that scary sensation their activities are being monitored!!! But, I don't think we should let any old Joe Schmoe add information to the editors' pages. Of course this is allowed in Wikipedia, where George Herbert Walker Bush's entry was once transformed by replacing 'Walker' with 'Wanker'. But, precisely because we're advocating a cause, it's important to avoid this sort of rudeness.
  • To me it seems quite tiring to list all editors of all journals. What matters most is the managing/head editor. Links to the journal webpages will reveal the rest --- this would mainly become important if someone wanted to pressure an editorial board to resign. But, maybe we can leave it as an option. It would certainly be amusing to see who is on the most editorial boards!

--John Baez 11:30, 16 August 2007 (EDT)

Regarding the issue of currencies... yes, I agree. Remember all these attributes have their own data type, click on "Annual Price" and you'll be taken to the page which explains it. These are also listed in the "Special Pages" part of the wiki, click on Annual Price under Special:Attributes. I said a bit about this currency issue there; the long and the short of it is that the currency data type is a feature of SMW 0.7 and not of 0.4. However we can, and must, make workarounds. --Brucebartlett 12:10, 16 August 2007 (EDT)
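
One possible workaround, sketched here in Python, is to refuse bare numbers and always keep the currency next to the amount. This assumes prices are typed with a leading $, £ or € symbol; the helper parse_price and its symbol table are hypothetical and not part of SMW 0.4 or 0.7.

```python
import re

# Hypothetical symbol table; extend it as other currencies turn up.
SYMBOLS = {"$": "USD", "£": "GBP", "€": "EUR"}


def parse_price(text):
    """Split a price like '$400' or '£250.50' into (amount, currency code).

    Raises ValueError for a bare number, so a price without units
    can never slip into the infobox unnoticed.
    """
    match = re.fullmatch(r"\s*([$£€])\s*(\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"price needs a currency symbol: {text!r}")
    symbol, amount = match.groups()
    return float(amount), SYMBOLS[symbol]


print(parse_price("$400"))
print(parse_price("£250.50"))
```

The amount and the currency code could then be stored as two separate fields, so that numeric comparisons are only made between prices in the same currency.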

Option D

Didn't Blake suggest that there was an Option D, which is "produce a code fork of SemanticWiki which has the features we want from 0.7 but is compatible with his installation of PHP?"

Though, my inclination is Option B. Finding a host which uses a sufficiently modern version of PHP just doesn't seem like that big a hurdle. --Ben Webster 21:40, 14 August 2007 (EDT)

Hooray! I was wrong about Semantic markup in templates only being a feature of version 0.7. It works in 0.4 as well! See the "Thanked by" template in the Antarct._J._Math., and see the code [[1]]. The help page for this can be found here. Now just need someone to make a nice template. I'm going to try my hand at a Tajiks style one. To see a site that uses this technology (of using Semantic Wiki in templates), look at the Building Info template here.
To summarize : I am also in favour of option B. It gives us (a) greater flexibility over the format - we can dump those inflexible "factboxes" which we're currently using, (b) new users don't have to learn any new syntax, (c) all journal pages can be set up with these basic vital statistix. Of course, we can still use semantic markup in the main text if you want to; we're not ruling that option out. --Brucebartlett 06:49, 15 August 2007 (EDT)

Older discussion

  • It is urgent that we incorporate the data from www.journalprices.com and www.eigenfactor.org soon. The data of the former is available as a comma-delimited text file. They have 401 math journals, as opposed to our 270. Their data is also more up-to-date. I'll do this myself by hand if no-one more technically competent than me offers a better solution soon.
I can get to this either today or possibly tomorrow. Blake Stacey 12:58, 14 August 2007 (EDT)
Thanks Blake! Perhaps we should wait until we find a suitable data storage model though. Because it's not just the data from www.journalprices.com, it's the eigenfactor data too... and we should get our system well-oiled and ready for this data before it comes.--Brucebartlett 14:27, 14 August 2007 (EDT)
  • I am concerned that the way Mediawiki stores data might be insufficient for our purposes. Consider editors' names for instance - at the moment we are giving each editor a category of their own, as it seems to be the only solution. At some stage we might have over a thousand such categories! Also, journal data is being stored on the individual pages in a haphazard fashion. This is ok: in my opinion it makes the pages look more human, and sometimes it's impossible to give "computer-style" data. For instance, "the editors at the moment are R. Bradfield and G. Tong, but in August G. Tong is resigning to make way for A Pelushi" is something we can enter at the moment, but which would be impossible to enter if we got too strict about things.
  • Nevertheless, it does worry me a bit. We'd like to be a one-stop resource. Thus if people want to see the "10 most expensive topology journals", it's going to be a problem, since the data storage model is not amenable to such calculations. However, these "Top 10" lists could be prepared ahead of time by the wikiworkers.
  • I am currently trying to understand the basic data storage possibilities of MediaWiki, such as SemanticWiki. We'd better get the data storage model right in the beginning, else we might have some big headaches later. I don't think our current system is quite right. What do the experts say?

--Brucebartlett 09:29, 14 August 2007 (EDT)
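
To illustrate why structured data matters for lists like the "10 most expensive topology journals", here is a minimal Python sketch. It assumes each journal is available as a record with a name, a list of subfields and an annual price; the field names and the sample records (Journal A, B, C) are made up for illustration and are not real data.

```python
def most_expensive(journals, subfield, n=10):
    """Return the n most expensive journals in a subfield, priciest first."""
    matching = [j for j in journals if subfield in j["subfields"]]
    return sorted(matching, key=lambda j: j["annual_price"], reverse=True)[:n]


# Made-up sample records, only to show the shape of the data.
journals = [
    {"name": "Journal A", "subfields": ["Topology"], "annual_price": 900},
    {"name": "Journal B", "subfields": ["Topology", "Geometry"], "annual_price": 1500},
    {"name": "Journal C", "subfields": ["Algebra"], "annual_price": 2000},
]

top = most_expensive(journals, "Topology", n=2)
print([j["name"] for j in top])  # ['Journal B', 'Journal A']
```

Once the fields are stored separately, a list like this is a one-line filter and sort; with prices buried in free text, someone has to prepare it by hand.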

Follow up on this. I suggest we use Semantic MediaWiki, which is an extension of MediaWiki. The quick-and-dirty guide to what it does can be found here. Ultimately, this means that when one is entering the text on the journal page for the journal “K-Theory”, one will write something like:
K-Theory is a journal published by [[published by::Springer]]. Its managing editors are [[managing editors::A. Bak]].
Well, something like that. Anyhow the advantage of this is that this nice, loose English information can then also be automatically displayed in a uniformly structured infobox, which appears alongside. Thus the page will look something like this one, for the Tajik ethnic group.
Update. Look at the bottom of this page to see the kind of factbox I'm talking about. In our case, where it says "Publisher : Springer", one will be able to click on Springer to get a list of all journals published by Springer. In other words, there's no need for a category "Journals published by Springer". --Brucebartlett 11:11, 14 August 2007 (EDT)
And moreover, it enables very powerful searches and bots and things to be let loose on the data. Anyhow, I believe this issue of the method of data storage is crucially important and needs to be decided urgently. Using Semantic MediaWiki, for instance, involves some subtle but far-reaching changes to the basic way a wiki works. We can't just leave it for later... it is an important decision. --Brucebartlett 12:40, 14 August 2007 (EDT)


Interesting questions! I'm not the one who can answer them: I'm a mere rabblerouser, not a hacker. So here's my only comment: please sign your posts so we can more easily see who you are! Just type four tildes after your post, to get something like this: John Baez 09:21, 14 August 2007 (EDT)
Right, sorry about that. Fixed now. --Brucebartlett 09:29, 14 August 2007 (EDT)
I like what Semantic MediaWiki might have to offer, and I'll try to get it up and running. It's an extension of MediaWiki, so it shouldn't upset anything we've got right now. My only worry is that it might depend upon a newer version of PHP than my server currently has installed, and installing a new version of that without breaking anything which depends upon the old one would be a real headache. Blake Stacey 12:58, 14 August 2007 (EDT)
Ok thanks Blake, I hope it installs ok. It must be said that Semantic MediaWiki doesn't yet offer one or two small features that are desirable... but hopefully these will appear in the newer versions. For instance, there are two basic ways of entering "semantic data". The first is via semantic templates, which are just ordinary wiki templates upgraded to actually store the annotated data. As an example, see the Building Info template at this site. The second way is to type in normal English text, just in an annotated way, like I described above (" K-Theory is a journal published by [[published by::Springer]] "), and then let Semantic MediaWiki automatically generate a factbox, like at the bottom of this page. Personally I prefer the latter "normal English" method, since it's more "organic". However, a small nuisance is that the current version of Semantic MediaWiki doesn't seem to allow you to fine-tune the way the factbox is displayed... which is an advantage of the former template method. --Brucebartlett 13:16, 14 August 2007 (EDT)


Yep. It doesn't work with PHP 4. Let me see if I can find an older version of the extension which does.
Sigh. Blake Stacey 13:14, 14 August 2007 (EDT)
SemanticMediaWiki 0.4 is running! Now, we just have to figure out what to do with it. I suggest experimenting with a few pages (such as K-Theory and Topology) to decide what we like. (Some examples of syntax can be found here.) Once we know what a "semantically enabled" page should look like, we can run a bot to modify the other pages, and then incorporate new datasets to make pages with the enhanced features. I'd rather not have to modify too much existing content (that's just asking for trouble), so I'd rather we decide how the basic article should be phrased, including all the semantic whoosywhatsits, before I try grabbing hundreds more journals. Blake Stacey 13:35, 14 August 2007 (EDT)
Great, thanks Blake! But sadly... I suspect that version 0.4 is not advanced enough for our needs. Let's give it a go, and see what it can do, but I don't think this is a long-term solution. Mmm. This data storage model issue does need to be sorted out urgently though, one way or another. It makes a critical difference to the scope, purpose and utility of the site. --Brucebartlett 13:41, 14 August 2007 (EDT)
Here are my first impressions of version 0.4 (big thanks to Blake for installing it!). It seems to provide the basic annotation functionality we need, like "is published by" and "is editor for" or "cost per article". However, its graphical display is very much inferior to that of 0.7: compare the bottom of the K-theory page to the bottom of this page. This tends to negate its usefulness. Moreover, the namespaces "Relations" and "Attributes" under the Special pages section are much more primitive than their 0.7 counterparts. I feel we need to sort out the PHP problem so as to install 0.7, or think about another system of storing data.
It may well be possible to improve the graphical output of 0.4 without moving to 0.7. (I'm going through the code now, and it's surprisingly readable for a MediaWiki extension. The relevant stuff appears to be in SMW_SemanticData.php, and in particular the printFactbox() function and its dependencies.) I'm not so sure about the "Relations" and "Attributes" namespaces.
I tentatively propose that we work with 0.4 for the moment. I'll try to make its output prettier/more useful, and if the functionality it provides seems to be serving our purposes, then we can sort out the version issues. After all, upgrading to 0.7 won't involve changing any page text — and if we move to a new server (UCR?) then we'll be able to build up the infrastructure with the proper versions of everything. Blake Stacey 14:36, 14 August 2007 (EDT)
Ok. I see you're making the output prettier already! You could sneakily just take a peek at the source of 0.7... although we even want to make that look prettier. Anyhow, right now I am concerned : we need to have multiple "Edited by" relations, see [Journal of K-Theory]. I'm not sure this is possible? --Brucebartlett 15:14, 14 August 2007 (EDT)
I think we need n-ary relations. We want to be able to type "The editors of the Journal of K-Theory are [[edited by::A.Bak; M. Hopkins; B. Simpson]]", as opposed to "The editors of the Journal of K-Theory are [[edited by::A. Bak]], [[edited by::M. Hopkins]], [[edited by::B. Simpson]] ". Good news though. Biowiki has already extended Semantic Mediawiki 0.4 to do this! Can you use Biowiki, or understand what they've done, Blake? --Brucebartlett 15:39, 14 August 2007 (EDT)
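
Until n-ary relations are available, the long form can at least be generated mechanically. The Python sketch below expands a semicolon-separated list into one annotation per editor; the function name expand_relation is an assumption for illustration, and this is not how Biowiki implements it.

```python
def expand_relation(relation, values, sep=";"):
    """Turn 'A. Bak; M. Hopkins' into one [[relation::name]] per name."""
    names = [v.strip() for v in values.split(sep) if v.strip()]
    return ", ".join(f"[[{relation}::{name}]]" for name in names)


print(expand_relation("edited by", "A. Bak; M. Hopkins; B. Simpson"))
# [[edited by::A. Bak]], [[edited by::M. Hopkins]], [[edited by::B. Simpson]]
```

A bot could apply this to editor lists before saving a page, so contributors type the short form while the wiki stores ordinary binary relations.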