Channel: MarcEdit – Terry's Worklog

MarcEdit 6.1 Update

This update will have four significant changes to three specific algorithms that are high use — so I wanted to give folks a heads up.

1) Merge Records — I’ve updated the process in two ways.  

   a) Users can now change the data in the dropdown box to a user-defined field/subfield combination.  At present, you have defined options: 001, 020, 022, 035, marc21.  You will now be able to specify another field/subfield combination (must be the combination) for matching.  So say you exported your data from your ILS, and your bibliographic number is in a 907$b — you could change the textbox from 001 to 907$b and the tool will now utilize that data, in a control number context — to facilitate matching.  

   b) This meant making a secondary change.  When I shifted to using the MARC21 method, I removed the ability for the algorithm to collapse multiple records of the same type within the merge file into the source.  For example, after the change to the marc21 algorithm, the following would be true in this scenario:

 source 1 — record 1
merge 1 — matches record 1
merge 2 — matches record 2
merge 3 — matches record 1

 

The data moved into source 1 would be the data from merge 1; merge 3 wouldn’t be seen.  In the version prior to utilizing just the MARC21 option, users could collapse records when using the control number index match.  I’ve updated the merge algorithm so that the default is now to assume that all source data could have multiple merge matches.  This has the practical effect of allowing users to take a merge file with multiple duplicates and merge all data into a single corresponding source file.  But this does represent a significant behavior change, so users need to be aware.
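To make the new default concrete, here is a minimal Python sketch of the idea, not MarcEdit's actual code. It assumes a toy record layout (a dict mapping a field tag to a list of values), and the names `collapse_merge` and `key_field` are made up for the example.

```python
from collections import defaultdict

def collapse_merge(source_records, merge_records, key_field="001"):
    """Fold every merge record that shares a key with a source record into it."""
    # Index merge records by match key; one key may map to several records.
    by_key = defaultdict(list)
    for rec in merge_records:
        for key in rec.get(key_field, []):
            by_key[key].append(rec)

    merged = []
    for src in source_records:
        out = {tag: list(values) for tag, values in src.items()}
        for key in src.get(key_field, []):
            for match in by_key.get(key, []):
                for tag, values in match.items():
                    if tag == key_field:
                        continue  # the match key itself is not copied
                    for value in values:
                        if value not in out.setdefault(tag, []):
                            out[tag].append(value)
        merged.append(out)
    return merged

source = [{"001": ["rec1"], "245": ["Title"]}]
merge = [
    {"001": ["rec1"], "650": ["Opera"]},
    {"001": ["rec2"], "650": ["Ballet"]},
    {"001": ["rec1"], "650": ["Music"], "856": ["http://example.org"]},
]
result = collapse_merge(source, merge)
print(result[0]["650"])
```

Both merge records keyed to rec1 contribute data to the single source record, while the unmatched rec2 is ignored.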

 

2) RDA Helper — 

   a) I’ve updated the error processing to ensure that the tool can fail a bit more gracefully

   b) Updating the abbreviation expansion because the expression I was using could miss values on occasion.  This will catch more content — it should also be a bit faster.

 

3) Linked Data tools — I included the ability to link to OCLC works ids — there were problems when the json outputted was too nested.  This has been corrected.

 

4) Bibframe tool — I’ve updated the mapping used to the current LC flavor.

 

Updates can be found on the downloads page (Windows/Linux) or via the automated update tool.

Direct Links:

 


Merge Record Changes

With the last update, I made a few significant modifications to the Merge Records tool, and I wanted to provide a bit more information around how these changes may or may not affect users.  The changes can be broken down into two groups:

  1. User Defined Merge Field Support
  2. Multiple Record merge support

Prior to MarcEdit 6.1, the merge records tool utilized 4 different algorithms for doing record merges.  These were broken down by field class and, because the limited scope of the data being evaluated made it possible, had specific functionality built around them.  Two of these specific functions were the ability for users to change the value in a field group class (say, change control numbers from 001 to 907$b) and the ability for the tool to merge multiple records in a merge file into the source.

When I made the update to 6.1, I tossed out the 3 field specific algorithms, and standardized on a single processing algorithm – what I call the MARC21 option.  This is an algorithm that processes data from a wide range of fields, and provides a high level of data evaluation – but in doing this, I set the fields that could be evaluated, and the function dropped the ability to merge multiple records into a single source file.  The effect of this was that:

  • Users could no longer change the fields/subfields used to evaluate data for merge outside of those fields set as part of the MARC21 option.
  • if a user had a file that looked like the following —
    sourcefile1 – record 1
    mergefile – record1 (matches source1)
    mergefile – record2
    mergefile – record3 (matches source1)

    Only data from the mergefile – record 1 would be merged.  The tool didn’t see the secondary data that might be in the merge file.  This has always been the case when working with the MARC21 merge option, but by making this the only option, I removed this functionality from the program (as the 3 custom field algorithms did make accommodations for merging data from multiple records into a single source).

With the last update, I’ve brought both of these elements back to the tool.  When a user utilizes the Merge Records tool, they can change the textbox with the field data – and enter a new field/subfield combination for matching (at this point, it must be a field/subfield combination).  Secondly, the tool now handles the merging of multiple records if those data elements are matched via a title or control number.  Since MarcEdit will treat user defined fields as the same class as a standard number (ISBN technically) for matching – users will now see that the tool can merge duplicate data into a single source file.
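For readers who want to see what a field/subfield combination such as 907$b amounts to, here is a small Python sketch under stated assumptions. The record layout (tag mapped to a list of subfield dicts) and the function names `parse_match_spec` and `match_key` are hypothetical and only illustrate the idea; MarcEdit's own parsing may differ.

```python
import re

def parse_match_spec(spec):
    """Split a spec such as '907$b' into (field, subfield)."""
    m = re.fullmatch(r"(\d{3})\$([0-9a-z])", spec.strip())
    if m is None:
        raise ValueError(f"expected field$subfield, e.g. 907$b, got {spec!r}")
    return m.group(1), m.group(2)

def match_key(record, spec):
    """Return the first value of the field/subfield named by spec, or None."""
    field, subfield = parse_match_spec(spec)
    for subfields in record.get(field, []):
        if subfield in subfields:
            return subfields[subfield].strip()
    return None

# Made-up sample record with a local bibliographic number in 907$b.
record = {"907": [{"a": ".b1234567", "b": "b1234567"}]}
print(parse_match_spec("907$b"))
print(match_key(record, "907$b"))
```

A bare tag such as 907 is rejected here, mirroring the requirement that the value be a field/subfield combination.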

Questions about this – just let me know.

–tr

MarcEdit 6 Updates

I hadn’t planned on putting together an update for the Windows version of MarcEdit this week, but I’ve been working with someone putting the Linked Data tools through their paces and came across instances where some of the linked data services were not sending back valid XML data – and I wasn’t validating it.  So, I took some time and added some validation.  However, because the users are processing over a million items through the linked data tool, I also wanted to provide a more user friendly option that doesn’t require opening the MarcEditor – so I’ve added the linked data tools to the command line version of MarcEdit as well. 
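As an illustration of the kind of check involved, the Python sketch below tests whether a response body is well-formed XML before it is used. This is only an assumption about what such validation might look like; the function name `is_well_formed_xml` and the sample payloads are invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(payload):
    """Return True if payload parses as XML, False otherwise."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True

# Made-up payloads: one well-formed, one with a mismatched closing tag.
good = "<rdf><label>Arnim, Bettina von, 1785-1859</label></rdf>"
bad = "<rdf><label>Service unavailable</rdf>"
print(is_well_formed_xml(good), is_well_formed_xml(bad))
```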

Linked Data Command Line Options:

The command line tool is probably one of those under-used and unknown parts of MarcEdit.  The tool is a shim over the code libraries – exposing functionality from the command line, and making it easy to integrate with scripts written for automation purposes.  The tool has a wide range of options available to it – and for users unfamiliar with the command line tool – they can get information about the functionality offered by querying help.  For those using the command line tool – you’ll likely want to create an environment variable pointing to the MarcEdit application directory so that you can call the program without needing to navigate to the directory.  For example, on my computer, I have an environment variable called: %MARCEDIT_PATH% which points to the MarcEdit app directory.  This means that if I wanted to run the help from my command line for the MarcEdit Command Line tool, I’d run the following and get the following results:

C:\Users\reese.2179>%MARCEDIT_PATH%\cmarcedit -help
***************************************************************
* MarcEdit 6.1 Console Application
* By Terry Reese
* email: reeset@gmail.com
* Modified: 2015/7/29
***************************************************************
Arguments:
        -s:     Path to file to be processed.
                        If calling the join utility, source must be files
                        delimited by the ";" character
        -d:     Path to destination file.
                        If call the split utility, dest should specify a folder
                        where split files will be saved.
                        If this folder doesn't exist, one will be created.
        -rules: Rules file for the MARC Validator.
        -mxslt: Path to the MARCXML XSLT file.
        -xslt:  Path to the XML XSLT file.
        -batch: Specifies Batch Processing Mode
        -character:     Specifies character conversion mode.
        -break: Specifies MarcBreaker algorithm
        -make:  Specifies MarcMaker algorithm
        -marcxml:       Specifies MARCXML algorithm
        -xmlmarc:       Specifics the MARCXML to MARC algorithm
        -marctoxml:     Specifies MARC to XML algorithm
        -xmltomarc:     Specifies XML to MARC algorithm
        -xml:   Specifies the XML to XML algorithm
        -validate:      Specifies the MARCValidator algorithm
        -join:  Specifies join MARC File algorithm
        -split: Specifies split MARC File algorithm
        -records:       Specifies number of records per file [used with split command].
        -raw:   [Optional] Turns of mnemonic processing (returns raw data)
        -utf8:  [Optional] Turns on UTF-8 processing
        -marc8: [Optional] Turns on MARC-8 processing
        -pd:    [Optional] When a Malformed record is encountered, it will modify the process from a stop process to one where an error is simply noted and a stub note is added to the result file.
        -buildlinks:    Specifies the Semantic Linking algorithm
This function needs to be paired with the -options parameter
        -options        Specifies linking options to use: example: lcid,viaf:lc,oclcworkid,autodetect
                lcid: utilizes id.loc.gov to link 1xx/7xx data
                autodetect: autodetects subjects and links to know values
                oclcworkid: inserts link to oclc work id if present
                viaf: linking 1xx/7xx using viaf.  Specify index after colon. If no index is provided, lc is assumed.
                        VIAF Index Values:
                        all -- all of viaf
                        nla -- Australia's national index
                        vlacc -- Belgium's Flemish file
                        lac -- Canadian national file
                        bnc -- Catalunya
                        nsk -- Croatia
                        nkc -- Czech.
                        dbc -- Denmark (dbc)
                        egaxa -- Egypt
                        bnf -- France (BNF)
                        sudoc -- France (SUDOC)
                        dnb -- Germany
                        jpg -- Getty (ULAN)
                        bnc+bne -- Hispanica
                        nszl -- Hungary
                        isni -- ISNI
                        ndl -- Japan (NDL)
                        nli -- Israel
                        iccu -- Italy
                        LNB -- Latvia
                        LNL -- Lebannon
                        lc -- LC (NACO)
                        nta -- Netherlands
                        bibsys -- Norway
                        perseus -- Perseus
                        nlp -- Polish National Library
                        nukat -- Poland (Nukat)
                        ptbnp -- Portugal
                        nlb -- Singapore
                        bne -- Spain
                        selibr -- Sweden
                        swnl -- Swiss National Library
                        srp -- Syriac
                        rero -- Swiss RERO
                        rsl -- Russian
                        bav -- Vatican
                        wkp -- Wikipedia

        -help:  Returns usage information

The linked data option uses the following pattern: cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options [linkoptions]

As noted above in the list, -options is a comma delimited list that includes the values that the linking tool should query.  For example, for a user looking to generate workids and URIs on the 1xx and 7xx fields using id.loc.gov, the command would look like:

cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options oclcworkid,lcid

Users interested in building all available linkages (using viaf, autodetecting subjects, etc.) would use:

cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options oclcworkid,lcid,autodetect,viaf:lc

Notice the last option – viaf. This tells the tool to utilize viaf as a linking option in the 1xx and the 7xx – the data after the colon identifies the index to utilize when building links.  The indexes are found in the help (see above).
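If you are scripting these runs, a small helper can assemble the argument list. The Python sketch below uses only the flags shown in the help output (-s, -d, -buildlinks, -options); the helper name `build_links_command`, the `KNOWN_OPTIONS` set, and the file names are assumptions made for illustration, and the sketch only builds the command rather than running it.

```python
KNOWN_OPTIONS = {"lcid", "autodetect", "oclcworkid", "viaf"}

def build_links_command(source, dest, options, exe="cmarcedit.exe"):
    """Return the argument list for a -buildlinks run."""
    for opt in options:
        name = opt.split(":", 1)[0]
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown linking option: {opt}")
        if ":" in opt and name != "viaf":
            # Per the help text, only viaf takes an index after the colon.
            raise ValueError(f"only viaf takes an index: {opt}")
    return [exe, "-s", source, "-d", dest,
            "-buildlinks", "-options", ",".join(options)]

cmd = build_links_command(
    "in.mrc", "out.mrc", ["oclcworkid", "lcid", "autodetect", "viaf:lc"]
)
print(" ".join(cmd))
```

On a machine with MarcEdit installed, the resulting list could be handed to `subprocess.run(cmd, check=True)`.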

Download information:

The update can be found on the downloads page: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit.  Direct links:

Mac Port Update:

Part of the reason I hadn’t planned on doing a Windows update of MarcEdit this week is that I’ve been heads down making changes to the Mac Port.  I’ve gotten good feedback from folks letting me know that so far, so good.  Over the past few weeks, I’ve been integrating missing features from the MarcEditor into the Port, as well as working on the Delimited Text Translation.  I’ll now have to go back and make a couple of changes to support some of the update work in the Linked Data tool – but I’m hoping that by Aug. 2nd, I’ll have a new Mac Port Preview that will be pretty close to completing (and expanding) the initial port sprint. 

Questions, let me know.

–tr

MarcEdit Mac Preview Update

MarcEdit Mac users, a new preview update has been made available.  This is getting pretty close to the first “official” version of the Mac version.  And for those that may have forgotten, the preview designation will be removed on Sept. 1, 2015.

So what’s been done since the last update?  Well, I’ve pretty much completed the last of the work that was scheduled for the first official release.  At this point, I’ve completed all the planned work on the MARC Tools and the MarcEditor functions.  For this release, I’ve completed the following:

****************************
** 1.0.9 ChangeLog
****************************

  • Bug Fix: Opening Files — you cannot select any files but a .mrc extension. I’ve changed this so the open dialog can open multiple file types.
  • Bug Fix: MarcEditor — when resizing the form, the filename in the status can disappear.
  • Bug Fix: MarcEditor — when resizing, the # of records per page moves off the screen.
  • Enhancement: Linked Data Records — Tool provides the ability to embed URI endpoints to the end of 1xx, 6xx, and 7xx fields.
  • Enhancement: Linked Data Records — Tool has been added to the Task Manager.
  • Enhancement: Generate Control Numbers — globally generates control numbers.
  • Enhancement: Generate Call Numbers/Fast Headings – globally generates call numbers/fast headings for selected records.
  • Enhancement: Edit Shortcuts — added back the tool to enable Record Marking via a comment.

Over the next month, I’ll be working on trying to complete four other components prior to the first “official” release Sept. 1.  This means that I’m anticipating at least 1, maybe 2 more large preview releases before Sept. 1, 2015.  The four items I’ll be targeting for completion will be:

  1. Export Tab Delimited Records Feature — this feature allows users to take MARC data and create delimited files (often for reporting or loading into a tool like Excel).
  2. Delimited Text Translator — this feature allows users to generate MARC records from a delimited file.  The Mac version will not, at least initially, be able to work with Excel or Access data.  The tool will be limited to working with delimited data.
  3. Update Preferences windows to expose MarcEditor preferences
  4. OCLC Metadata Framework integration…specifically, I’d like to re-integrate the holdings work and the batch record download.

How do you get the preview?  If you have the current preview installed, just open the program and as long as you have the notifications turned on – the program will notify you that an update is available.  Download the update, and install the new version.  If you don’t have the preview installed, just go to: http://marcedit.reeset.net/downloads and select the Mac app download.

If you have any questions, let me know.

–tr

MarcEdit 6 Wireframes — Validating Headings

Over the last year, I’ve spent a good deal of time looking for ways to integrate many of the growing linked data services into MarcEdit.  These services, mainly revolving around vocabularies, provide some interesting opportunities for augmenting our existing MARC data, or enhancing local systems that make use of these particular vocabularies.  Examples like those at the Bentley (http://archival-integration.blogspot.com/2015/07/order-from-chaos-reconciling-local-data.html) are real-world demonstrations of how computers can take advantage of these endpoints when they are available.

In MarcEdit, I’ve been creating and testing linking tools for close to a year now, and one of the areas I’ve been waiting to explore is whether libraries can utilize linking services to build their own authorities workflows.  Conceptually, it should be possible – the necessary information exists…it’s really just a matter of putting it together.  So, that’s what I’ve been working on.  Utilizing the linked data libraries found within MarcEdit, I’ve been working to create a service that will help users identify invalid headings and records where those headings reside.

Working Wireframes

Over the last week, I’ve prototyped this service.  The way that it works is pretty straightforward.  The tool extracts the data from the 1xx, 6xx, and 7xx fields, and if they are tagged as being LC controlled, I query the id.loc.gov service to see what information I can learn about the heading.  Additionally, since this tool is designed for work in batch, there is a high likelihood that headings will repeat – so MarcEdit is generating a local cache of headings as well – this way it can check against the local cache rather than the remote service when possible.  The local cache will grow constantly – with materials set to expire after a month.  I’m still toying with what to do with the local cache, expirations, and what the best way to keep it in sync might be.  I’d originally considered pulling down the entire LC names and subjects headings – but for a desktop application, this didn’t make sense.  Together, these files, uncompressed, consumed GBs of data.  Within an indexed database, this would continue to be true.  And again, this file would need to be updated regularly.  So, I’m looking for an approach that will give some local caching, without the need to make the user download and manage huge data files.
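The caching idea can be sketched in a few lines of Python. This is an in-memory illustration only, assuming a one-month expiry; the class name `HeadingCache`, the `lookup` callback, and the fake lookup function are hypothetical and say nothing about how MarcEdit actually stores its cache.

```python
import time

MONTH_SECONDS = 30 * 24 * 60 * 60  # entries expire after roughly a month

class HeadingCache:
    def __init__(self, lookup, ttl=MONTH_SECONDS, clock=time.time):
        self._lookup = lookup      # function performing the remote query
        self._ttl = ttl
        self._clock = clock
        self._entries = {}         # heading -> (timestamp, result)

    def get(self, heading):
        now = self._clock()
        hit = self._entries.get(heading)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh local copy, no remote call
        result = self._lookup(heading)
        self._entries[heading] = (now, result)
        return result

    def clear(self):
        """Drop everything so the cache is rebuilt on the next run."""
        self._entries.clear()

calls = []
def fake_lookup(heading):
    # Stand-in for a remote request; records each call it receives.
    calls.append(heading)
    return {"heading": heading, "found": True}

cache = HeadingCache(fake_lookup)
cache.get("Opera--Social aspects--United States")
cache.get("Opera--Social aspects--United States")
print(len(calls))
```

The second lookup of the same heading is served locally, which is where the savings come from when headings repeat across a batch.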

Anyway – the function is being implemented as a Report.  Within the Reports menu in the MarcEditor, you will eventually find a new item titled Validate Headings.

image

When you run the Validate Headings tool, you will see the following window:

image

You’ll notice that there is a Source file.  If you come from the MarcEditor, this will be prepopulated.  If you come from outside the MarcEditor, you will need to define the file that is being processed.  Next, you select the elements to authorize.  Then click Process.  The Extract button will initially be disabled until after the data run.  Once completed, users can extract the records with invalid headings.

When completed, you will receive the following report:

image

This includes the total processing time, average response from LC’s id.loc.gov service, total number of records, and the information about how the data validated.  Below, the report will give you information about headings that validated, but were variants.  For example:

Record #846
Term in Record: Arnim, Bettina Brentano von, 1785-1859
LC Preferred Term: Arnim, Bettina von, 1785-1859

This would be marked as an invalid heading, because the data in the record is incorrect.  But the reporting tool will provide back the Preferred LC label so the user can then see how the data should be currently structured.  Actually, now that I’m thinking about it – I’ll likely include one more value – the URI to the dataset so you can actually go to the authority file page, from this report.

This report can be copied or printed – and as I noted, when this process is finished, the Extract button is enabled so the user can extract the data from the source records for processing.

Couple of Notes

So, this process takes time to run – there just isn’t any way around it.  For this set, there were 7702 unique items queried.  Each request from LC averaged 0.28 seconds.  In my testing, depending on the time of day, I’ve found that the response time can run from 0.20 to 1.2 seconds per request.  None of those times are that bad when done individually, but when taken in aggregate against 7700 queries – it adds up.  If you do the math using the low end, 7702*0.2 = 1540 seconds just to ask for the data.  Divide that by 60 and you get roughly 25.7 minutes.  The total time to process that means that there are 11 minutes of “other” things happening here.  My guess is that the other 11 minutes are being eaten up by local lookups, character conversions (since LC expects UTF-8 and my data was in MARC-8) and data normalization.  Since there isn’t anything I can do about the latency between the user and the LC site – I’ll be working over the next week to try and remove as much local processing time from the equation as possible.
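The arithmetic is easy to replay for other response times. The Python sketch below just multiplies the number of unique headings by a per-request time; the function name `request_minutes` is made up, and the three rates are the low end, the reported average, and the high end quoted above.

```python
def request_minutes(unique_headings, seconds_per_request):
    """Minutes spent waiting on the remote service alone."""
    return unique_headings * seconds_per_request / 60

for rate in (0.2, 0.28, 1.2):
    minutes = request_minutes(7702, rate)
    print(f"{rate:.2f} s/request -> {minutes:.1f} minutes")
```

This counts only time spent waiting on responses and leaves out local processing.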

Questions – let me know.

–tr

MarcEdit Validate Headings: Part 2

Last week, I posted an update that included the early implementation of the Validate Headings tool.  After a week of testing, feedback and refinement, I think that the tool now functions in a way that will be helpful to users.  So, let me describe how the tool works and what you can expect when the tool is run.

Background:

The Validate Headings tool was added as a new report to the MarcEditor to enable users to take a set of records and get back a report detailing how many records had corresponding Library of Congress authority headings.  The tool was designed to validate data in the 1xx, 6xx, and 7xx fields.  The tool has been set to only query headings and subjects that utilize the LC authorities.  At some point, I’ll look to expand to other vocabularies.

How does it work

Presently, this tool must be run from within the MarcEditor – though at some point in the future, I’ll extract this out of the MarcEditor, and provide a stand alone function and an integration with the command line tool.  Right now, to use the function, you open the MarcEditor and select the Reports/Validate Headings menu.

image

Selecting this option will open the following window:

image

Options – you’ll notice 3 options available to you.  The tool allows users to decide which values they would like to have validated.  They can select names (1xx, 600, 610, 611, 7xx) or subjects (6xx).  Please note, when you select names, the tool does look up the 600, 610, 611 as part of the process because the validation of these subjects occurs within the name authority file.  The last option deals with the local cache.  As MarcEdit pulls data from the Library of Congress – it caches the data that it receives so that it can use it on subsequent headings validation checks.  The cache will be used until it expires in 30 days…however, a user at any time can check this option and MarcEdit will delete the existing cache and rebuild it during the current data run. 

Couple things you’ll also note on this screen. There is an extract button and it’s not enabled.  Once the Validate report is run, this button will become enabled if there are any records that are identified as having headings that could not be validated against the service. 

Running the Tool:

Couple notes about running the tool.  When you run the tool, what you are asking MarcEdit to do is process your data file and query the Library of Congress for information related to the authorized terms in your records.  As part of this process, MarcEdit sends a lot of data back and forth to the Library of Congress utilizing the http://id.loc.gov service.  The tool attempts to use a light touch, only pulling down headings for a specific request – but do realize that a lot of data requests are generated through this function.  You can estimate approximately how many requests might be made on a specific file by using the following formula: (number of records x 2)  + (number of records), assuming that most records will have 1 name to authorize and 2 subjects per record.  So a file with 2500 records would generate ~7500 requests to the Library of Congress.  Now, this is just a guess; in my tests, I’ve had some sets generate as many as 12,000 requests for 2500 records and as few as 4000 requests for 2500 records – but the 7500 estimate tended to be within 500 requests of the actual count in most test files.
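As a quick sketch, the formula (number of records x 2) + (number of records) can be written as a Python function. Reading the x 2 term as two subject headings per record and the remaining term as one name per record is an interpretation, and the function and parameter names are invented for the example.

```python
def estimate_requests(records, names_per_record=1, subjects_per_record=2):
    """Rough request count: (records x subjects) + (records x names)."""
    return records * subjects_per_record + records * names_per_record

print(estimate_requests(2500))
```

Changing the per-record defaults gives a feel for how quickly the request count moves for heading-heavy files.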

So why do we care?  Well, this report has the potential to generate a lot of requests to the Library of Congress’s identifier service – and while I’ve been told that there shouldn’t be any issues with this – I think that question won’t really be known until people start using it.  At the same time – this function won’t come as a surprise to the folks at the Library of Congress – as we’ve spoken a number of times during the development.  At this point, we are all kind of waiting to see how popular this function might be, and if MarcEdit usage will create any noticeable up-tick in the service usage.

Validation Results:

When you run the validation tool, the program will go through each record, making the necessary validation requests of the LC ID service.  When the process has completed, the user will receive a report with the following information:

Validation Results:
Process completed in: 121.546001431667 minutes. 
Average Response Time from LC: 0.847667984420415
Total Records: 2500
Records with Invalid Headings: 1464
**************************************************************
1xx Headings Found: 1403
6xx Headings Found: 4106
7xx Headings Found: 1434
**************************************************************
1xx Headings Not Found: 521
6xx Headings Not Found: 1538
7xx Headings Not Found: 624
**************************************************************
1xx Variants Found: 6
6xx Variants Found: 1
7xx Variants Found: 3
**************************************************************
Total Unique Headings Queried: 8604
Found in Local Cache: 1001
***************************************************************

This represents the header of the report.  I wanted users to be able to quickly, at a glance, see what the Validator determined during the course of the process.  From here, I can see a couple of things:

  1. The tool queried a total of 2500 records
  2. Of those 2500 records, 1464 had at least one heading that was not found
  3. Within those 2500 records, 8604 unique headings were queried
  4. Within those 2500 records, there were 1001 duplicate headings across records (these were not duplicate headings within the same record, but for example, multiple records with the same author, subject, etc.)
  5. We can see how many Headings were found by the LC ID service within the 1xx, 6xx, and 7xx blocks
  6. Likewise, we can see how many headings were not found by the LC ID service within the 1xx, 6xx, and 7xx blocks.
  7. We can see number of Variants as well.  Variants are defined as names that resolved, but that the preferred name returned by the Library of Congress didn’t match what was in the record.  Variants will be extracted as part of the records that need further evaluation.
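If you copy the report into another program, the header is simple to pick apart. The following Python sketch assumes the "Label: number" layout shown in the sample header above; the function name `parse_summary` is hypothetical, and the sample text is an excerpt of that header.

```python
import re

def parse_summary(report_text):
    """Collect 'Label: number' lines from the report header into a dict."""
    summary = {}
    pattern = re.compile(r"\s*([^:]+):\s*(\d+(?:\.\d+)?)(?:\s+minutes\.)?\s*")
    for line in report_text.splitlines():
        m = pattern.fullmatch(line)
        if m:
            label, value = m.group(1).strip(), m.group(2)
            summary[label] = float(value) if "." in value else int(value)
    return summary

sample = """Validation Results:
Process completed in: 121.546001431667 minutes.
Total Records: 2500
Records with Invalid Headings: 1464
**************************************************************
1xx Headings Not Found: 521
6xx Headings Not Found: 1538
7xx Headings Not Found: 624
Total Unique Headings Queried: 8604
Found in Local Cache: 1001
"""
stats = parse_summary(sample)
not_found = sum(
    stats[f"{block} Headings Not Found"] for block in ("1xx", "6xx", "7xx")
)
print(stats["Total Records"], not_found)
```

Separator lines and the title line are skipped because they carry no number after a colon.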

After this summary of information, the Validation report returns information related to the record # (record number count starts at zero) and the headings that were not found.  For example:

Record #0
Heading not found for: Performing arts--Management--Congresses
Heading not found for: Crawford, Robert W

Record #5
Heading not found for: Social service--Teamwork--Great Britain

Record #7
Heading not found for: Morris, A. J

Record #9
Heading not found for: Sambul, Nathan J

Record #13
Heading not found for: Opera--Social aspects--United States
Heading not found for: Opera--Production and direction--United States

The current report format includes specific information about the heading that was not found.  If the value is a variant, it shows up in the report as:

Record #612
Term in Record: bible.--criticism, interpretation, etc., jewish
LC Preferred Term: Bible. Old Testament--Criticism, interpretation, etc., Jewish
URL: http://id.loc.gov/authorities/subjects/sh85013771
Heading not found for: Bible.--Criticism, interpretation, etc

Here you see – the report returns the record number, the normalized form of the term as queried, the current LC Preferred term, and the URL to the term that’s been found.

The report can be copied and placed into a different program for viewing or can be printed (see buttons).

image

To extract the records that need work, minimize or close this window and go back to the Validate Headings Window.  You will now see two new options:

image

First, you’ll see that the Extract button has been enabled.  Click this button, and all the records that have been identified as having headings in need of work will be exported to the MarcEditor.  You can now save this file and work on the records. 

Second, you’ll see the new link – save delimited.  Click on this link, and the program will save a tab delimited copy of the validation report.  The report will have the following format:

Record ID [tab] 1xx [tab] 6xx [tab] 7xx [new line]

Multiple headings within a column will be delimited by a colon, so if two 1xx headings appear in a record, the current process would create a single column with the headings separated by a colon, like: heading 1:heading 2. 
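Reading the saved file back is straightforward. Here is a minimal Python sketch, assuming the four tab-separated columns described above with a colon between multiple headings in a column; the function name `read_validation_report` and the sample rows are made up for illustration.

```python
import csv
import io

def read_validation_report(text):
    """Yield (record_id, {column: [headings]}) from the tab-delimited report."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        record_id, *columns = row[:4]
        headings = {
            name: [h for h in cell.split(":") if h]
            for name, cell in zip(("1xx", "6xx", "7xx"), columns)
        }
        yield record_id, headings

# Made-up rows in the Record ID / 1xx / 6xx / 7xx layout.
sample = (
    "0\tCrawford, Robert W\tPerforming arts--Management--Congresses\t\n"
    "13\t\tOpera--Social aspects--United States:"
    "Opera--Production and direction--United States\t\n"
)
rows = list(read_validation_report(sample))
for record_id, headings in rows:
    print(record_id, headings)
```

One caveat with this approach: a heading that itself contains a colon would be split in two.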

Future Work:

This function required making a number of improvements to the linked data components – and because of that, the linking tool should work better and faster now.  Additionally, because of the variant work I’ve done, I’ll soon be adding code that will give the user the option to update headings for Variants as the report or the linking tool is running – and I think that is pretty cool.  If you have other ideas or find that this is missing a key piece of functionality – let me know.

–tr

MarcEdit Mac–Release Version 1 Notes

This has been a long time coming – taking countless hours and the generosity of a great number of people to test and provide feedback (not to mention the folks that crowd sourced the purchase of a Mac) – but MarcEdit’s Mac version is coming out of Preview and will be made available for download on Labor Day.  I’ll be putting together a second post officially announcing the new versions (all versions of MarcEdit are getting an update over Labor Day), so if this interests you – keep an eye out.

So exactly what is different from the Preview versions?  Well, at this point, I’ve completed all the functions identified for the first set of development tasks – and then some.  New to this version will be the new Validate Headings tool just added to the Windows version of MarcEdit, the new Build New Field utility (and inclusion into the Task Automation tool), updates to the Editor for performance, updates to the Linking tool due to the validator, inclusion of the Delimited Text Translator and the Export Tab Delimited Text Translator – and a whole lot more.

At this point, the build is made, the tests have been run – so keep an eye out tomorrow – I’ll definitely be making it available before the Ohio State/Virginia Tech football game (because everything is going to stop here once that comes on).  :)

To everyone that has helped along the way, providing feedback and prodding – thanks for the help.  I’m hoping that the final result will be worth the wait and be a nice addition to the MarcEdit family.  And of course, this doesn’t end the development on the Mac – I have 3 additional sprints planned as I work towards functional parity with the Windows version of MarcEdit.

–tr

MarcEdit 6.1 (Windows/Linux)/MarcEdit Mac (1.1.25) Update

So, this update is a bit of a biggie.  If you are a Mac user, the program officially moves out of Preview and into release, and this version brings the following changes:

****************************
** 1.1.25 ChangeLog
****************************

  • Bug Fix: MarcEditor — changes may not be retained after save if you make manual edits following a global update.
  • Enhancement: Delimited Text Translator completed.
  • Enhancement: Export Tab Delimited complete
  • Enhancement: Validate Headings Tool complete
  • Enhancement: Build New Field Tool Complete
  • Enhancement: Build New Field Tool added to the Task Manager
  • Update: Linked Data Tool — Added Embed OCLC Work option
  • Update: Linked Data Tool — Enhance pattern matching
  • Update: RDA Helper — Updated for parity with the Windows Version of MarcEdit
  • Update: MarcValidator – Enhancements to support better checking when looking at the mnemonic format.

If you are on the Windows/Linux version – you’ll see the following changes:

*************************************************
* 6.1.60 ChangeLog
*************************************************

  • Update: Validate Headings — Updated patterns to improve the process for handling heading validation.
  • Enhancement: Build New Field — Added a new global editing tool that provides a pattern-based approach to building new field data.
  • Update: Added the Build New Field function to the Task Management tool.
  • UI Updates: Specific to support Windows 10.

The Windows update is a significant one. A lot of work went into the Validate Headings function, which impacts the Linked Data tools and the underlying linked data engine. Additionally, the Build New Field tool provides a new global editing function that should simplify complex edits. If I can find the time, I’ll try to put together a YouTube video demoing the process.

You can get the updates from the MarcEdit downloads page: http://marcedit.reeset.net/downloads or if you have MarcEdit configured to check automated updates – the tool will notify you of the update and provide a method for you to download it.

If you have questions – let me know.

–tr


Automatic Headings Correction–Validate Headings

After about a month of working with the headings validation tool, I’m ready to start adding a few enhancements to provide some automated headings corrections. The first change to be implemented will be automatic correction of headings where the preferred heading is different from the in-use heading. This will be implemented as an optional element. If this option is selected, the report will continue to note variants as part of the validation report, but when exporting data for further processing, automatically corrected headings will not be included in the record sets for further action.

image

Additionally, I’ll continue looking at ways to improve the speed of the process. While there are some limits to what I can do since this tool relies on a web service (outside of providing an option for users to download the ~10GB worth of LC data locally), there are a few things I can do to ensure that only new items are queried when resolving links.

These changes will be made available on the next update.

–tr

Validate Headings Update

MarcEdit’s Validate Headings tool is getting a refresh to add a few missing elements. Two new features are being added to the tool: the ability to automatically correct variants when they are detected, and the ability to automatically generate preliminary authority records for personal name (100/700) headings.

The new interface looks like:

image

 

Example of a sample generated authority record:

=LDR  00000nz\a2200000o\4500
=008  151016n|\acannaabn\\\\\\\\\\|n\a|d\\\\||
=100  10$aWillson, Meredith,$d1902-
=670  \\$aWillson, Meredith,1902-. $bWhat every young musician should know.

The records are generated directly off the data in the record. This means that if the heading is coded incorrectly (dates not in the $d, etc.), then the generated data will be as well, but this is a start. You’ll notice that the data is coded as being preliminary because these records are automatically generated, and probably should be evaluated at some point.

–tr

Task Automation Modifications

An interesting question came up on the ListServ this week – a user was wondering if a task could be created with the option that data stored in the task was variable. An example use case might be a task where the values used by the Replace All function vary depending on which vendor file is being processed.

By default, the Task Automation tool has been designed to be pretty much like a macro recorder. You set values, and it simply uses those values. However, at its core, the task automation tool is just a script engine – the tasks represent a simple set of commands that get interpreted by the automation engine. Given that, it would be pretty easy to provide the ability to support user-defined values within a task. So, I’m giving it a go. I’ve defined a special mnemonic – {inputbox_[yourvalue]} – which can be defined within a task, and when encountered, the task engine will prompt the user for data.

The important part of the mnemonic – the part that tells the engine that user data is required – is the first part of the mnemonic: {inputbox_. When this statement is seen, the engine pauses and passes the command to the pre-processor. The pre-processor looks at the start of the mnemonic, and then pulls the data after the {inputbox_ to give the user a prompt regarding the data that is being requested.

For example, say the user is creating a Replace All task and the program should request data for both the Find and the Replace strings.  The mnemonic should look like the following for the Find expression: {inputbox_Find} and for the replace: {inputbox_Replace}. 

image

When run, the pre-parser, when coming across these values, will break them down and prompt the user for input:

image

image

The pre-parser will then substitute the user provided values into the task and process the data accordingly.  If the user cancels the dialog – the pre-parser will take that as an indication that this process should be skipped, and will move on to the next operation in the task. 
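To make the substitute-or-skip behavior concrete, here is a minimal sketch in Python (not the C# that MarcEdit itself is written in). Only the {inputbox_...} mnemonic comes from the description above; the function names, the shape of a task step, and the `ask` callback standing in for the prompt dialog are all hypothetical.

```python
import re

# Matches the mnemonic described above: {inputbox_Label}
INPUTBOX = re.compile(r"\{inputbox_([^{}]*)\}")


def resolve_inputs(arguments, ask):
    """Return arguments with every {inputbox_...} replaced by user input.

    `ask(label)` is a hypothetical stand-in for the prompt dialog: it
    returns the user's text, or None when the dialog is cancelled.
    A cancelled prompt makes the whole step resolve to None.
    """
    resolved = []
    for argument in arguments:
        cancelled = False

        def substitute(match):
            nonlocal cancelled
            answer = ask(match.group(1))  # the label after "inputbox_"
            if answer is None:
                cancelled = True
                return ""
            return answer

        value = INPUTBOX.sub(substitute, argument)
        if cancelled:
            return None
        resolved.append(value)
    return resolved


def run_task(steps, ask, execute):
    """Run each (command, arguments) step, skipping steps the user cancels."""
    for command, arguments in steps:
        values = resolve_inputs(arguments, ask)
        if values is None:
            continue  # cancelled prompt: move on to the next operation
        execute(command, values)
```

With a step such as `("replace_all", ["{inputbox_Find}", "{inputbox_Replace}"])`, `ask` would be called once with "Find" and once with "Replace", and the step would run with whatever the user typed. Arguments that contain no mnemonic pass through unchanged, so fixed and variable values can be mixed in one task.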

This change will be part of the next update.

–tr

MarcEdit Windows/Linux Update Notes

I’ve posted a new MarcEdit update.  You can get the builds directly from: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit.  Direct links:

The change log follows:

–tr

***********************************************************************************************

MarcEdit Windows/Linux ChangeLog: 11/8/2015

MarcEdit Application Changes:
* Updates to the Build New Field Tool
** Code moved into meedit code library (for portability to the mac system)
** Separated options to provide an option to add new field only, add when not present, replace existing fields
** Updated Task Manager signatures — if you use this function in a task, you will need to update the task

* Updates to Linked Data tool
** Added option to select oclc number for work id embedding
** Updated Task Manager signatures
** Updated cmarcedit commandline options

* Edit Indicators
** Removed a blank space as legacy wildcard value.  Wildcards are now strictly “*”

Merge Records Tool
* Updated User defined fields options to allow 776$w to be used (fields used as part of the MARC21 option couldn’t previously be redefined to act as a single match point)

Validator
* Results page will print UTF8 characters (always) if present

Validate ISBN/ISSN
* Results page now includes the 001 if present in addition to the record # in the file

Sorting
* Adding an option so if selected, 880 will be sorted as part of their paired field.

Preferences:
* Added Sorting Preferences
* Added New Options Option, shifting the place where the folder settings are set.

UI Improvements
* Various UI improvements made to better support Windows 10.

MarcEdit Mac Updates

I’ve posted a new MarcEdit update.  You can get the builds directly from: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit.  Direct links:

The change log follows:

–tr

***********************************************************************************************

MarcEdit Mac ChangeLog: 11/8/2015

MarcEdit Applications Changes:
* Build New Field Tool Added
** Added Build New Field Tool to the Task Manager
* Validate Headings Tool Added
* Extract/Delete Selected Records Tool Added

* Updates to Linked Data tool
** Added option to select oclc number for work id embedding
** Updated Task Manager signatures

* Edit Indicators
** Removed a blank space as legacy wildcard value.  Wildcards are now strictly “*”

Merge Records Tool
* Updated User defined fields options to allow 776$w to be used (fields used as part of the MARC21 option couldn’t previously be redefined to act as a single match point)

Validator
* Results page will print UTF8 characters (always) if present

Sorting
* Adding an option so if selected, 880 will be sorted as part of their paired field.

Z39.50 Client
* Supports Single and Batch Search Options

MarcEdit: Build Links Data tool enhancements

I’ve been working with the PCC Linked Data in MARC Task Group over the past couple of months, and as part of this process, I’ve been working on expanding the values that can be recognized by the Linking tool in MarcEdit. As those who have used it might remember, MarcEdit’s linking tool showed up about a year and a half ago, and leverages id.loc.gov, MESH, and VIAF (primarily). As part of this process with the PCC, a number of new vocabularies and fields have been added to the tool’s capacity. This has also meant creating profiles for linking data in both bibliographic and authority data.

The big changes come in the range of indexes now supported by the tool (if defined within the record).  At this point, the following vocabularies are profiled for use:

  1. NAF
  2. LCSH
  3. LCSH Children
  4. MESH
  5. ULAN
  6. AAT
  7. LCGFT
  8. AGROVOC
  9. LCMPT
  10. LCDGT
  11. TGM
  12. RDA Vocabularies

The data profiled has also expanded beyond just 1xx, 6xx, and 7xx data to include 3xx data and data unique to the authority data.

This has required changing the interface slightly:

image

But I believe that I have the bugs worked out. This function will be changing often over the next month or so as the PCC utilizes this and other tools while piloting a variety of methods for embedding linked data into MARC records and considering the implications. As such, I’ll be adding to the list of profiled data over the coming month. However, if you use a specific vocabulary and don’t see it in the list, let me know. I can look at profiling it as long as the resource provides a set of APIs that can support a high volume of queries. It cannot be a data download, since that doesn’t work for client applications: at this point, the profiled resources would require users to download almost 12 GB of data almost monthly if I went that route.

Questions…let me know.

Happy Holidays: MarcEdit Update

Over the past few years, holiday updates have become a part of a MarcEdit tradition. This year, I’ve been spending the past month working on two significant sets of changes. On the Windows side, I’ve been working on enhancing the Linked Data tools, profiling more fields and more services. This update represents a first step in the process, as I’ll be working with the PCC to profile additional services and add new elements as we work through a pilot test around embedding linked data into MARC records and the potential implications. For a full change list, please see: http://blog.reeset.net/archives/1822

The Mac version has seen a lot of changes – and because of that, I’ve moved the version number from 1.3.35 to 1.4.5.  In addition to all the infrastructure changes made within the Windows/Linux program (the tools share a lot of code), I’ve also done significant work exposing preferences and re-enabling the ILS Integration.  I didn’t get to test the ILS integration well – so there may be a few updates to correct problems once people start working with them – but getting to this point took a lot of work and I’m glad to see it through.  For a full list of updates on the Mac Version, please see: http://blog.reeset.net/archives/1824

Before Christmas, I’d mentioned that I was working on three projects, with the idea that all would be ready by the time these updates were complete. I was wrong – so it looks like I’ll have one more Christmas/New Year’s gift left to give, and I’ll be trying to wrap that work up this week.

Downloads – you can pick up the new downloads at: http://marcedit.reeset.net/downloads or if you have the automatic update notification enabled, the tool should provide you with an option to update from within the program.

This represents a lot of work, and a lot of changes.  I’ve tested to the best of my ability – but I’m expecting that I may have missed something.  If you find something, let me know.  I’m saving time over the next couple weeks to fix problems that might come up and turn around builds faster than normal.

Here’s looking forward to a wonderful 2016.

–tr


Heads Up: MarcEdit Linked Data Components Update (all versions) scheduled for this evening

A heads up to those folks using MarcEdit and using the following components:

  • Validate Headings
  • Build Links
  • Command-Line tool using the build links option

These components rely on MarcEdit’s linked data framework to retrieve semantic data from a wide range of vocabulary services.  I’ll be updating one of these components in order to improve the performance and how they interact with the Library of Congress’s id.loc.gov service.  This will provide a noticeable improvement on the MarcEdit side (with response time cut by a little over 2/3rds) and will make MarcEdit much more friendly to the LC id.loc.gov service.  Given the wide range of talks at Midwinter this year discussing experimentations related to embedding semantic data into MARC records and the role MarcEdit is playing in that work – I wanted to make sure this was available prior to ALA.

Why the change

When MarcEdit interacts with id.loc.gov, its communications are nearly always just HEAD requests. This is because over the past year or so, the folks at LC have been incredibly responsive, building into their header statements nearly all the information someone might need if they are just interested in looking up a controlled term and finding out:

  1. Whether it exists
  2. Its preferred label
  3. Its URI

Prior to the HEAD lookup, this had to be done using a different API, which resulted in two requests – one to the API, and then one to the XML representation of the document for parsing. By moving the most important information into the document headers (X- elements), I can minimize the amount of data I’m having to request from LC. And that’s a good thing, because LC tends to have strict guidelines around how often and how much data you are allowed to request from them at any given time. In fact, were it not for LC’s willingness to allow me to bypass those caps when working with this service, a good deal of the new functionality being developed into the tool simply wouldn’t exist. So, if you find the linked data work in MarcEdit useful, you shouldn’t be thanking me – this work has been made possible by LC and their willingness to experiment with id.loc.gov.

Anyway – the linked data tools have been available in MarcEdit for a while, and they are starting to generate significant traffic on the LC side of things. Adding the Validate Headings tool only exacerbated this – enough so that LC has been asking if I could do some things to help throttle the requests coming from MarcEdit. So, we are working on some options – but in the meantime, LC noticed something odd in their logs. While MarcEdit only makes HEAD requests, and only processes the information from that request, they were seeing 3 requests showing up in their logs.

Some background on the LC service: it performs a lot of redirection. One request to the label service is redirected, so it ends up as ~3 requests once the redirects are followed. All the information MarcEdit needs is found in the first response, but when looking at the logs, they could see MarcEdit following the redirects, resulting in 2 more HEAD requests for data that the tool is simply throwing away. This means that in most cases, a single request for information is generating 3 HEAD requests – and if you take a file of 2000 records, with ~5 headings to be validated (on average), that means MarcEdit would generate ~30,000 requests (10,000 x 3). That’s not good – and when LC approached me to ask why MarcEdit was asking for the other data files, I didn’t have an answer. It wasn’t till I went to the .NET documentation that the answer became apparent.

As folks should know, MarcEdit is developed using C#, which means it utilizes .NET. The network interactions are handled by the System.Net namespace – specifically, the System.Net.HttpWebRequest class. Here’s the function:

        public System.Collections.Hashtable ReadUriHeaders(string uri, string[] headers)
        {
            System.Net.ServicePointManager.DefaultConnectionLimit = 10;
            System.Collections.Hashtable headerTable = new System.Collections.Hashtable();
            uri = System.Uri.EscapeUriString(uri);

            //after escape -- we need to catch ? and &
            uri = uri.Replace("?", "%3F").Replace("&", "%26");

            System.Net.WebRequest.DefaultWebProxy = null;
            System.Net.HttpWebRequest objRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(MyUri(uri));
            objRequest.UserAgent = "MarcEdit 6.2 Headings Retrieval";
            objRequest.Proxy = null;
            
            //Changing the default timeout from 100 seconds to 30 seconds.
            objRequest.Timeout = 30000;

            objRequest.Method = "HEAD";

            try
            {
                using (var objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse())
                {
                    //objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
                    if (objResponse.StatusCode == System.Net.HttpStatusCode.NotFound)
                    {
                        foreach (string name in headers)
                        {
                            headerTable.Add(name, "");
                        }
                    }
                    else
                    {

                        foreach (string name in headers)
                        {
                            if (objResponse.Headers.AllKeys.Contains(name))
                            {
                                
                                string orig_header = objResponse.Headers[name];
                                byte[] b = System.Text.Encoding.GetEncoding(28591).GetBytes(orig_header);

                                headerTable.Add(name, System.Text.Encoding.UTF8.GetString(b));
                                
                            }
                            else
                            {
                                headerTable.Add(name, "");
                            }
                        }
                    }
                }
                
                return headerTable;
            }
            catch (System.Exception p)
            {
                foreach (string name in headers)
                {
                    headerTable.Add(name, "");
                }
                headerTable.Add("error", p.ToString());
                return headerTable;
            }
        }

It’s a pretty straightforward piece of code – the tool looks up a URI, reads the headers, and outputs a hash of the values. There doesn’t appear to be anything in the code that would explain why MarcEdit was generating so many requests (because this function was only being called once per item). But looking at the documentation – well, there is. The HttpWebRequest object has a property, AllowAutoRedirect, and it’s set to true by default. This tells the component that a web request can be automatically redirected, up to the value set in MaximumAutomaticRedirections (50 by default). Since every request to the LC service generates redirects, MarcEdit was following them and just tossing the data. So that was my problem.

Allowing redirects is a fine assumption to make for a lot of things – but for my purposes, not so much. It’s an easy fix: I added a parameter to the function signature, false by default, and use that value to set the AllowAutoRedirect property. This way I can allow redirects when I need them, but leave them turned off by default when I don’t (which is almost always). Once finished, I tested against LC’s service and they confirmed that this reduced the number of HEAD requests. On my side, I noticed that things were much, much faster. On the LC side, they are pleased because MarcEdit is generating a lot of traffic, and this should help to reduce and focus that traffic. So win, win, all around.
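The same idea (one HEAD request, read the headers from the first response, and only follow redirects when explicitly asked) can be sketched in Python rather than C#. This is an illustration and not MarcEdit’s code: the function name, parameters, and User-Agent string are made up for the example. Python’s http.client never follows redirects on its own, so the sketch has to opt in to following them, which mirrors the "false by default" switch described above.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def read_uri_headers(uri, names, follow_redirects=False,
                     max_redirects=5, timeout=30):
    """Send a HEAD request and return {header name: value or ""}.

    By default the headers of the FIRST response are returned, even when
    that response is a redirect. Redirects are followed only when
    follow_redirects=True, and then at most max_redirects times.
    """
    for _ in range(max_redirects + 1):
        parts = urlsplit(uri)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
        try:
            target = parts.path or "/"
            if parts.query:
                target += "?" + parts.query
            # Hypothetical User-Agent, used only for this example.
            conn.request("HEAD", target,
                         headers={"User-Agent": "example-headings-client"})
            response = conn.getresponse()
            response.read()  # HEAD responses carry no body
            location = response.getheader("Location")
            if follow_redirects and 300 <= response.status < 400 and location:
                uri = urljoin(uri, location)
                continue  # issue another HEAD against the redirect target
            return {name: response.getheader(name, "") for name in names}
        finally:
            conn.close()
    raise RuntimeError("too many redirects")
```

Called with the default arguments, this issues exactly one request per lookup, which is the behavior the fix is after. Which header names to ask for (for example, names along the lines of X-PrefLabel or X-URI) is an assumption here; the text above only says the values live in X- headers.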

What does this mean

So what this means – I’ll be posting an update this evening.  It will include a couple tweaks based on feedback from the update this past Sunday – but most importantly, it will include this change.  If you use the linked data tools or the Validate Headings tools – you will want to update.  I’ve updated MarcEdit’s user agent string, so LC will now be able to tell if a user is using a version of MarcEdit that is fixed.  If you aren’t and you are generating a lot of traffic – don’t be surprised if they ask you to update. 

The other thing that I think it shows (and this I’m excited about) is that LC really has been incredibly accommodating when it has come to using this service. Rather than telling me that MarcEdit needed to start following LC’s data request guidelines for the id.loc.gov service (which would make this service essentially useless), they worked with me to figure out what was going on so we could find a solution that everyone is happy with. And like I said, we both are concerned that as more users hit the service, there will be a need to throttle those requests globally, so we are talking about how that might be done.

For me, this type of back and forth has been incredibly refreshing and somewhat new. It certainly has never happened when I’ve spoken to any ILS vendor or data provider (save for members of the Koha and OLE communities), and it gives me some hope that just maybe we can all come together and make this semantic web thing actually work. The problem with linked data is that unless there is trust – trust in the data and trust in the service providing the data – it just doesn’t work. And honestly, I’ve had concerns that in Library land, there are very few services that I feel you could actually trust (and that includes OCLC at this point). Service providers are slowly wading in, but these types of infrastructure components take resources – lots of resources – and they are invisible to the user…or, when they are working, they are invisible. Couple that with the fact that these services are infrastructure components, not profit engines, and it’s not a surprise that so few services exist, and that the ones that do are not designed to support real-time, automated lookup. When you realize that this is the space we live in right now, it makes me appreciate the folks at LC, and especially Nate Trail, all the more. Again, if you happen to be at ALA and find these services useful, you really should let them know.

Anyway – I started the process to run tests and then build this morning before heading off to work.  So, sometime this evening, I’ll be making this update available.  However, given that these components are becoming more mainstream and making their way into authority workflows – I wanted to give a heads up.

Questions – let me know.

–tr

MarcEdit updates

I noted earlier today that I’d be making a couple MarcEdit updates.  You can see the change logs here:

Please note – if you use the Linked Data tools, it is highly recommended that you update. This update was done in part to make the interactions with LC more efficient on all sides.

You can get the download from the automated update mechanism in MarcEdit or from the downloads page: http://marcedit.reeset.net/downloads

Questions, let me know.

–tr

Build New Field Enhancements

Couple of interesting questions this week got me thinking about a couple of enhancements to MarcEdit.  I’m not sure these are things that other folks will make use of often, but I can see these being really useful answering questions that come up on the listserv.

The particular question that got me thinking about this today was the following scenario:

The user has two fields – an 099 that includes data that needs to be retained, and then an 830$v that needs to be placed into the 099.  The 830$v has trailing punctuation that will need to be removed. 

Example data:
=099  \\$aELECTRONIC DATA
=830  \\$aSeries Title $v 12-031.

The final data output should be:
=099  \\$aELECTRONIC RESOURCE 12-031
=830  \\$aSeries Title $v 12-031.

With the current tools, you can do this but it would require multiple steps.  Using the current build new field tool, you could create the pattern for the data:
=099  \\$a{099$a} {830$v}

This would lead to an output of:
=099  \\$aELECTRONIC DATA 12-031.

To remove the period – you could use a replace function and fix the $a at the same time.  You could have also made the ELECTRONIC RESOURCE string a constant in the build new field – but the problem is that you’d have to know that this was the only data that ever showed up in the 099$a (and it probably won’t be).

So thinking about this problem, I’ve been thinking about how I might be able to add a few processing “macros” into the pattern language – and that’s what I’ve done.  At this point, I’ve added the following commands:

  • replace(find,replace)
  • trim(chars)
  • trimend(chars)
  • trimstart(chars)
  • substring(start,length)

The way that these have been implemented, these commands are stackable. They are also very rigid in structure: the commands are case sensitive (command labels are all lower case), and in the places where you have multiple parameters, there are no spaces after the commas.

So how does this work – here’s some examples (not full patterns):
{099$a.trim(".")}
{050$b.replace("1950","1980").trim(".")}
{LDR.substring(6,1)}

As you can see in the patterns, the commands are initialized by adding “.command” to the end of the field pattern. So how would we apply this to the user story above? It’s easy:
=099  \\$a{099$a.replace("DATA","RESOURCE")} {830$v.trimend(".")}

And that would be it.  With this single pattern, we can run the replacement on the data in the 099$a and trim the data in the 830$v. 
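For anyone who wants to reason about how stacked commands evaluate, here is a rough Python model of the pattern expansion. It is a sketch under stated assumptions, not MarcEdit’s implementation: the `record` dictionary keyed by strings like "099$a" is a stand-in for real MARC data, trim/trimend/trimstart are assumed to strip any of the listed characters, substring is assumed to be zero-based, and arguments containing commas or parentheses are not handled.

```python
import re

# Assumed semantics for the five commands listed above.
COMMANDS = {
    "replace": lambda s, find, repl: s.replace(find, repl),
    "trim": lambda s, chars: s.strip(chars),
    "trimend": lambda s, chars: s.rstrip(chars),
    "trimstart": lambda s, chars: s.lstrip(chars),
    "substring": lambda s, start, length: s[int(start):int(start) + int(length)],
}

TOKEN = re.compile(r"\{([^{}]+)\}")            # {field$subfield.command(...)}
CALL = re.compile(r"\.([a-z]+)\(([^()]*)\)")   # .command(arg,arg)


def expand(pattern, record):
    """Fill each {...} token from `record`, applying commands left to right."""

    def one(match):
        body = match.group(1)
        first_call = CALL.search(body)
        key = body[:first_call.start()] if first_call else body
        value = record.get(key, "")
        for name, raw_args in CALL.findall(body):
            # Arguments are comma-separated; surrounding quotes are dropped.
            args = [a.strip('"') for a in raw_args.split(",")] if raw_args else []
            value = COMMANDS[name](value, *args)
        return value

    return TOKEN.sub(one, pattern)


record = {"099$a": "ELECTRONIC DATA", "830$v": "12-031."}
pattern = r'=099  \\$a{099$a.replace("DATA","RESOURCE")} {830$v.trimend(".")}'
print(expand(pattern, record))
# prints: =099  \\$aELECTRONIC RESOURCE 12-031
```

Because each command receives the output of the one before it, the order matters: trimming a trailing period before a replace can give a different result than trimming after it.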

Now, I realize that this syntax might not be the easiest for everyone right out of the gate, but I’m hoping it will be useful for folks interested in learning the new options. Either way, I’m really excited to have this in my toolkit for answering questions posed on the listserv.

This has been implemented in all versions of MarcEdit, and will be part of this weekend’s update.

–tr

MarcEdit Mac: Edit 006/008 data

One of the functions that didn’t make the initial migration cut in the MarcEditor was the ability to edit the 006/008 in a graphical interface.  I’ve added this back into the OSX version.  You can find it in the Edit Menu:

MarcEdit Mac — Edit 006/008 Menu Location

Invoking the tool works a little differently than in the Windows/Linux version. Just put your cursor into the field that you want to edit, and then select Edit. MarcEdit will then read your record data and generate an edit form based on the material format selected (or the material format from the record if editing).

MarcEdit Mac — Edit 006/008 Screen

Questions — let me know.

–tr

MarcEdit Mac: Verify URLs

In the Windows/Linux version, one of the oldest tools has been the ability to validate URLs. This tool generates a report providing the HTTP status codes returned for URLs in a record set. This didn’t make the initial migration, but has been added to the current OSX version of MarcEdit.

To find the resource, you open the main window and select the menu:

MarcEdit Mac: Main Window Menu — Verify URLs

Once selected, it works a lot like the Windows/Linux version. You have two report types (HTML/XML), you can define a title field, and you can also set the fields to check. By default, MarcEdit selects all. To change this, you just need to add each new field/subfield combination on a new line.

MarcEdit Mac: Verify URLs screen

Questions, let me know.

–tr
