Issue Details

Key: CODEBASE-200
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Jonathan Rochkind
Reporter: Jonathan Rochkind
Votes: 0
Watchers: 0

Blacklight Plugin

SolrMarc 2.1 update

Created: 12/Jan/10 04:38 PM   Updated: 13/May/10 03:03 PM
Component/s: None
Affects Version/s: Down the Road
Fix Version/s: 2.5


Description
I have a branch on github that updates the embedded SolrMarc to the latest 2.1 jar. It also updates the rake solr:marc:index task to use the new SolrMarc 2.1, which allows the rake task to do a number of things much more neatly, due to new features in SolrMarc.

The config/SolrMarc/ files used by the rake task are also updated.

The idea is that a user, after installing Blacklight via the template, and choosing the optional jetty/solr install, should simply be able to run "rake solr:marc:index my_file.mrc", and that marc file will be indexed against that optional jetty/solr install, using the solrmarc properties files found in the plugin (vendor/plugins/blacklight/config/SolrMarc/*). For that to work, the relative paths to Solr in the plugin config/SolrMarc/config.properties need to actually be right for the template install optional jetty install scenario. I believe they are, but am not entirely sure, as I can't currently successfully do the template install.

Note that config/SolrMarc now includes ./translation_maps/ and ./index_scripts/ directories, used by SolrMarc 2.1. These directories _could_ have been placed in blacklight/solr_marc/ instead, along with SolrMarc.jar, but they are in config/SolrMarc in order to provide a model for what is described in the next paragraph.

When a user wants to customize the SolrMarc config (their own index mappings, different locations of solr, etc), they can simply copy the plugin's config/SolrMarc dir to their own RAILS_ROOT/config/SolrMarc. The rake task will now find all of that there. Then they can customize all they want, including changed/new translation maps or scripts. (They will probably immediately HAVE to at least customize the relative paths to solr in config.properties, once they've made that copy -- except for that, a straight copy will Just Work).
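The lookup order described above could be sketched roughly like this. This is only an illustration, not the code in the branch: the helper name solrmarc_config_dir and its two arguments are hypothetical.

```ruby
# Hypothetical helper (not taken from the branch): prefer the application's
# own config/SolrMarc directory and fall back to the copy shipped in the plugin.
def solrmarc_config_dir(app_root, plugin_root)
  local = File.join(app_root, "config", "SolrMarc")
  return local if File.directory?(local)

  File.join(plugin_root, "config", "SolrMarc")
end
```

With this shape, copying the plugin's config/SolrMarc into the app is enough for the local copy to take over; nothing else needs to be told about it.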

I believe that this code should be reusable by Matt's idea for an 'index test solr data' rake task. You would need to use a different config.properties file pointing to the (embedded?) test solr, instead of the anticipated location of the optionally installed quick-start solr. (The rake task will already look for config-test.properties if you run with RAILS_ENV=test.) (Alternately, we could take these paths out of the config.properties file, and instead re-write the rake task to dynamically calculate them and pass them on the command line. We'd want to leave them commented-out in config.properties, so a user copying that file to change it locally would see that they COULD be specified there to be different from the rake task's calculated defaults!)

And then pre-set ENV['MARC_FILE'] to the location of the marc file to be loaded. And then it should Just Work.


I suggest this patch is ready to be applied to trunk, even before it's figured out what to do with the test index rake task idea. This is an improved version of the already existing solr:marc:index task, which includes SolrMarc 2.1 and just works better.

http://github.com/jrochkind/blacklight/tree/solr_marc_2_1



Matt Mitchell added a comment - 19/Jan/10 05:21 PM
Nice to get SolrMarc 2.1 going here.

* When calling solr:marc:index without a MARC_FILE argument, it'd be nice to get a message saying that the MARC_FILE arg is missing above the config dump.

* When attempting to get this going in the root of the plugin (for development) I created a new config/SolrMarc/development.properties file. To run this, I executed:

  rake solr:marc:index MARC_FILE=../blacklight-data/test_data.utf8.mrc CONFIG_PATH=config/SolrMarc/development.properties

  I got an error -- not able to find the Solr home and solr.war file (I expected that). I ended up getting the indexing to work when I set the paths like:

  solrmarc.solr.war.path = ../../jetty/webapps/solr.war
  solr.path = ../jetty/solr

  Obviously this is a SolrMarc thing but, where is the reference root directory for SolrMarc?
  Because you can see that the depth of those paths is not the same; "jetty" is at the root of the plugin, where I'm executing Rake.

* I'd be interested in knowing if you can override the solrmarc.solr.war.path and solr.path when executing the SolrMarc jar file? Along with that, the solr.hosturl and solr.indexer.properties?

Jonathan Rochkind added a comment - 19/Jan/10 05:37 PM
Thanks Matt.

"* When calling solr:marc:index without a MARC_FILE argument, it'd be nice to get a message saying that the MARC_FILE arg is missing above the config dump. "

The config dump does list "MARC_FILE: [marc file needed]" already. I was worrying that screen was getting too long already, but if you really think it needs an extra line at the very top saying the same thing, that can be done. You do?

"* When attempting to get this going in the root of the plugin (for development) I created a new config/SolrMarc/development.properties file. To run this, I executed: "

Do what works, but I think there's a confusion over multiple jetty instances being there. I was targeting the jetty instance that is optionally installed with the template installer, for your own data -- NOT the jetty instance (currently) included for testing purposes. So finding the "jetty at the root of the plugin" was not the intended goal, I would consider that a bug, heh. I think you must have been finding two different instances of jetty/solr with ../ vs ../..?

It may be hard to get the default config.properties to work for all cases, in some cases you'll just have to write your own -- all you've got to do is put a config/SolrMarc in your own Rails app including the BL plugin, and the rake task, when executed from your app (NOT from inside the plugin) will find that.

But here is the idea. The relative paths in config.properties basically end up being relative to config.properties itself. So the scenario I was coding the default relative paths for was:
1) You install the BL plugin via template installer.
2) You tell the template installer that, yes, you do want to install a jetty.
3) You then execute rake solr:marc:index from your app (NOT cd'ing into the plugin; that confuses rake, since rake will recognize the plugin AS a Rails app).
4) Outcome => It finds the jetty/solr installed by the template installer, at the relative paths specified.

As far as I could tell, I had the right relative paths for that. I tried to do that scenario as best I could, although it wasn't exact since I had to point at the existing 'template' using the existing trunk Blacklight. But can you try out that scenario and see if it works as intended?

But I think you'll get more ideas of possibilities for doing what you really want from the next point...

"I'd be interested in knowing if you can override the solrmarc.solr.war.path and solr.path when executing the SolrMarc jar file? Along with that, the solr.hosturl and solr.indexer.properties?"

So solrmarc itself right now won't really allow an "over-ride", no. I tested that. If the paths were NOT in the config.properties, then they could be passed on the command line -- the rake task can't do that now, but it hypothetically could be written to do so. However, if the paths ARE in the config.properties, then SolrMarc doesn't allow a command line 'override'.

The anticipated use case is that rather than pass them on the command line to the rake task, you'll just make a different config.properties file and pass THAT on the command line, as you've done.

However, for the basic expected use cases, you don't even need to pass CONFIG_FILE on the command line. In your example, you had a "development.properties" config file. If you instead name this "config-development.properties", then the rake task will automatically find and use it if you execute:

rake solr:marc:index RAILS_ENV=development


So perhaps we should bundle and include a "config-test.properties" that has the relative paths set to the embedded test data jetty/solr instead of the jetty/solr optionally installed by the template installer? Then you could just execute "rake solr:marc:index RAILS_ENV=test" to get the proper paths for the embedded test data jetty.

Hope this helps clear things up. Let me know what you think appropriate next steps are.

Jonathan Rochkind added a comment - 19/Jan/10 05:46 PM
Actually, since I think "development" is the default Rails env, if you had a config-development.properties, then simply running "rake solr:marc:index" would use that in preference over a generic "config.properties". If you instead manually specify RAILS_ENV=production, then the rake task would use config.properties -- so long as a config-production.properties didn't exist.

You could theoretically use this with any arbitrary 'environment' name.

rake solr:marc:index RAILS_ENV=my_made_up_env

will indeed find config-my_made_up_env.properties
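
The environment-based lookup described in this comment could look something like the following sketch. The function name solrmarc_properties_file is hypothetical and the actual rake task may structure this differently; only the file naming (config-ENV.properties, falling back to config.properties) comes from the discussion above.

```ruby
# Hypothetical helper: pick config-<env>.properties when it exists,
# otherwise fall back to the generic config.properties.
# The "development" default mirrors the default Rails environment.
def solrmarc_properties_file(config_dir, env = ENV["RAILS_ENV"] || "development")
  env_specific = File.join(config_dir, "config-#{env}.properties")
  return env_specific if File.exist?(env_specific)

  File.join(config_dir, "config.properties")
end
```

Because the environment name is only interpolated into a file name, any made-up environment works as long as a matching file is present.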

Matt Mitchell made changes - 09/Feb/10 02:47 PM
Field Original Value New Value
Assignee Bess Sadler [ eos8d ] Matt Mitchell [ mmitchell ]
Matt Mitchell added a comment - 09/Feb/10 05:06 PM
Thanks Jonathan,

For the MARC_FILE, or any other required param, I'd prefer an explicit error report than a config dump/view. We can always get the config dump in the "info" task.
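
As a rough illustration of the explicit error report suggested here, a check along these lines could run before anything else in the task. The method name require_marc_file! and the wording of the message are made up for this sketch.

```ruby
# Hypothetical guard: fail fast with a clear message when MARC_FILE is
# missing or blank, instead of printing the whole configuration.
# `env` defaults to the process environment; a plain Hash also works.
def require_marc_file!(env = ENV)
  marc_file = env["MARC_FILE"].to_s.strip
  if marc_file.empty?
    raise ArgumentError,
          "MARC_FILE is required, e.g. rake solr:marc:index MARC_FILE=/path/to/records.mrc"
  end
  marc_file
end
```

The config dump would then stay available through the separate "info" task mentioned above.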

The paths/properties files to me are confusing, not because of your patch though. I did find out that overriding SolrMarc params from the command line does indeed work: http://groups.google.com/group/solrmarc-tech/browse_thread/thread/12005daeb1c50043

In my opinion, since we're using Rake to execute this stuff, we should completely override the path based properties that are in the config files using the command line opts. We could keep the path based properties in the config files, but add comments, "These paths are not used when running the solr:marc:* tasks" for example.

This would also allow developers to test within the plugin directory and not worry about relative path stuff. And the same code would work when used in an application... because we have a dynamic relative base: Rails.root

I'd do something similar for the SM solr.hosturl, fetching the value from Blacklight.solr_config[:url].

Of course this doesn't prevent people from using the properties files, it's only that our solr:marc:* Rake tasks would override in a useful way.

I like the SOLRMARC_MEM_ARGS, but how about a JAVA_OPTS variable instead so you could append arbitrary java opts:

  rake solr:marc:index MARC_FILE=data.marc JAVA_OPTS=xxx -- just a thought?

It'd also be nice to specify a MARC_FILE with a glob/* so you could index an entire directory of marc files.

Definitely like the RAILS_ENV stuff you have in there. Makes a lot of sense to switch based on that value.

Jonathan Rochkind added a comment - 15/Feb/10 11:00 AM
Matt, thanks for taking a look at this.

I think most of the things you're asking for are changes from what is currently in trunk for SolrMarc 2.0. Is it possible to apply this patch now -- as it gets up to SolrMarc 2.1 and takes advantage of some new features in 2.1 to make things more convenient -- and file your other ideas as 'enhancements' that can be added subsequently?

The current behavior of spitting out the usage screen if you omit the MARC_FILE is actually there because it's what Bess requested when I wrote the original 2.0 version of the task. It's easy enough to change though. What do you want it to look like?

I don't really like having the rake task _require_ that certain properties NOT be in the config.properties file. I don't like changing SolrMarc default behavior like this, I think it will lead to confusion if a properties file that works with 'raw' SolrMarc does NOT work or works differently with the rake task. However, we could add a JAVA_OPTS env argument to allow arbitrary solrmarc args _if_ you choose to include them on execution. It would end up looking like... JAVA_OPTS="-Dsome.property=value -Dother.property=value". In general, I like including actual individual env args for particular arguments we actually have a known use case for, I think it's less confusing; but the JAVA_OPTS one could be there 'just in case'. But I do see what you mean about fetching the value from Blacklight.solr_config -- although that would require the rake task to load the Rails environment, which it doesn't do now, and which adds 2-3 seconds to startup time for the rake task on my machine. Perhaps it needs more consideration -- can we go ahead and apply the patch, and then consider that enhancement subsequently?
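
To make the JAVA_OPTS idea concrete, here is a minimal sketch of how the task might assemble the java command. SOLRMARC_MEM_ARGS and JAVA_OPTS are the variable names from this thread; the function name solrmarc_command, its keyword arguments, and the exact argument order (JVM options, then -jar, then the properties file and the MARC input) are assumptions for illustration, not the committed implementation.

```ruby
require "shellwords"

# Hypothetical command builder. Options from SOLRMARC_MEM_ARGS and JAVA_OPTS
# are split shell-style and placed before -jar so the JVM sees them as
# JVM/system-property options rather than program arguments.
def solrmarc_command(jar:, config:, marc_file:, env: ENV)
  cmd = ["java"]
  cmd.concat(Shellwords.split(env["SOLRMARC_MEM_ARGS"].to_s))
  cmd.concat(Shellwords.split(env["JAVA_OPTS"].to_s))
  cmd.concat(["-jar", jar, config, marc_file])
end
```

Returning an array rather than a single string means the result can be handed to system(*cmd) without further quoting of paths that contain spaces.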

I _believe_ that you can supply a directory as the (possibly mis-named) MARC_FILE argument already, and it will Just Work to index all files in that directory, because SolrMarc itself is already prepared to recognize a directory in that slot. If you test this and it works, perhaps we should change the name of MARC_FILE or something. (Just "MARC_INPUT" ?). But since this wasn't there in the previous trunk version of the rake task either, again, I'd like it if we could go ahead and apply this patch and consider this subsequently.




Matt Mitchell added a comment - 15/Feb/10 11:26 AM
Hey Jonathan,

We can add enhancements later sure, this is a good base patch.

BTW, I wasn't suggesting that the Rake task require that the SolrMarc properties file be setup any different. It would just override the path/url based variables. So it's not changing SolrMarc behavior. Rake is just introducing some dynamically generated paths/urls for convenience, to send to SolrMarc.

Also, the solr.yml file wouldn't need to load the entire Rails environment. It could be parsed using the yaml lib, if start-up speed was a problem.
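
A sketch of that approach, reading the Solr URL with only the standard yaml library: the helper name solr_url_from_yaml is hypothetical, and the file layout (top-level environment names, each with a url key) is an assumption about solr.yml rather than something confirmed in this ticket.

```ruby
require "yaml"

# Hypothetical helper: extract the Solr URL for one environment from the
# text of a solr.yml-style file, without booting Rails.
# Assumed layout: { "development" => { "url" => "..." }, ... }
def solr_url_from_yaml(yaml_text, env)
  config = YAML.safe_load(yaml_text) || {}
  (config[env] || {})["url"]
end
```

A caller would pass File.read of the solr.yml path and the current RAILS_ENV; a nil result means the file has no entry for that environment, in which case the properties file value could be left in effect.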

Glad to know the sub-dir will work for MARC_FILE... should have tested this before I mentioned it :) Maybe the arg should just be named "MARC"?

So I'll apply the patch and go from there. I have some ideas for getting multi-core working using some of this too.

Thanks again for your patch!

Jonathan Rochkind added a comment - 15/Feb/10 11:35 AM
Great, thanks. If you outline the changes in another ticket (or tickets), I can try to get to writing the code; of course you (or anyone else!) is also welcome to.

I'm just back from a week away from work on "Snow Holiday" (i.e., crazy snow here, everything was closed), so kind of overwhelmed catching up at the moment, sorry!

I'm also not sure how long I'm going to stick with SolrMarc in general, I'm considering other indexing options, so if I end up not using SolrMarc at all here, my motivation to improve the rake task for SolrMarc will be somewhat diminished. :)

Jonathan Rochkind added a comment - 17/Feb/10 12:09 PM
Can you add a note on this ticket when you've applied the patch to trunk?

Naomi Dushay made changes - 05/Apr/10 12:14 PM
Fix Version/s 2.5 [ 10030 ]
Bess Sadler made changes - 05/Apr/10 12:23 PM
Affects Version/s Down the Road [ 10021 ]
Fix Version/s Down the Road [ 10021 ]
Fix Version/s 2.5 [ 10030 ]
Jonathan Rochkind made changes - 04/May/10 04:19 PM
Assignee Matt Mitchell [ mmitchell ] Jonathan Rochkind [ jrochkind ]
Fix Version/s 2.6 [ 10040 ]
Fix Version/s Down the Road [ 10021 ]
Jonathan Rochkind added a comment - 04/May/10 04:38 PM
Okay, this never got applied, but revisiting it now.

1. Added load of solr.yml file to use solr url for appropriate environment, put in -Dsolr.hosturl, over-riding config.properties.

2. Fixed paths in default config.properties to point to where the current template installer will install the optional jetty. config-test.properties points to the submodule testing jetty. The paths in config.properties are a little bit weird, because SolrMarc requires them to be relative to SolrMarc.jar, which we have in blacklight/solr_marc/SolrMarc.jar.

3. Updated included SolrMarc.jar to SolrMarc tagged 2.1.1

4. Fixed solr:marc:index_test_data script to be written in terms of this solr:marc:index -- it simply sets the MARC_FILE arg and calls solr:marc:index. DRY, and more flexible -- now you can call index_test_data with whatever RAILS_ENV you want (test, development, production).

5. Fixed template.rb installer to NOT modify the config.properties solr.path before copying to local app. That is no longer necessary and in fact messes things up. Also fixed template.rb to NOT make a copy of the distro sample data from plugin to local app; the one copy we already have in the plugin is enough and the rake task is perfectly happy eating it.

Jonathan Rochkind added a comment - 13/May/10 02:58 PM
My branch got all weird and out of sync with master, the change from direct files jetty/data to submodule confused it, and I couldn't figure out how to get it back in sync without causing weirdness in master.

So I am just committing these files straight, as one bulk commit.

Fixed up specs to pass, and tested as best I could that solr:marc:index produces the correct java command.

Jonathan Rochkind added a comment - 13/May/10 03:03 PM
Committed!

Jonathan Rochkind made changes - 13/May/10 03:03 PM
Status Open [ 1 ] Closed [ 6 ]
Fix Version/s 2.5 [ 10030 ]
Fix Version/s 2.6 [ 10040 ]
Resolution Fixed [ 1 ]