Upgrade tika to 1.14

Description

https://dist.apache.org/repos/dist/release/tika/CHANGES-1.13.txt

Release 1.13 - 05/08/2016

  • Upgrade to PDFBox 2.0.1 (TIKA-1285/TIKA-1959).
    MAJOR CHANGES in PDFParser:

  • The classic sequential parser is no longer available.

  • Tiff files are no longer extracted by default. See
    https://pdfbox.apache.org/2.0/dependencies.html#optional-components
    for optional components to process Tiff files.

  • Some truncated/corrupted files that had some content extracted
    with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912).

  • The MIT-NLP Information Extraction (MITIE) Named Entity
    Recognition (NER) system is now supported in Tika
    (TIKA-1913, GitHub-108).

  • Tika now supports the use of the Yandex translation
    service (TIKA-1943, GitHub-106).

  • Tika now uses NER to extract scientific measurements
    from text using either GROBID Quantities, which uses
    conditional random fields, or NLTK, which uses regular
    expressions (TIKA-1917, GitHub-104).

  • Fixed JournalParser to handle null responses from
    GROBID and to log a message (TIKA-1925).

  • Refactored Language Detector into tika-langdetect module,
    added default N-Gram implementation, Optimaize Lang
    Detector and MIT Text.jl implementation
    (TIKA-1872, TIKA-1696, TIKA-1723).

  • Extract metadata from MP4 videos whether or not the
    PooledTimeSeries parser is available via Aditya Dhulipala
    (TIKA-1844).

  • Fix NPE when trying to get embedded image identifier in
    WordParser (TIKA-1956).

  • Improvements to MIME database for detection of Scientific
    and other formats present in the TREC-DD-Polar dataset
    (TIKA-1881, GitHub-85, TIKA-1883, TIKA-1884, TIKA-1886,
    TIKA-1882).

  • LinkContentHandler now extracts links from script tags
    via Joseph Naegele (TIKA-1937).

  • Handle per page IOExceptions more robustly in PDFParser (TIKA-1948).

  • Upgrade commons-compress to 1.11 (TIKA-1949).

  • Add detection for embedded MSChart.Graph files (TIKA-1033).

  • Fix NPE in Sqlite parser from Nick C (TIKA-1927).

  • Fix NPE in Open Document parser from Nick C (TIKA-1916).

  • Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931).

  • Upgrade BouncyCastle to 1.54 (TIKA-1923).

  • Upgrade Jackcess to 2.1.3 (TIKA-1922).

  • Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921).

  • Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920).

  • Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919).

  • Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894).

  • Move serialization of TikaConfig to tika-core and enable dumping
    of the config file via tika-app (TIKA-1657).

  • Tika now incorporates the Natural Language Toolkit (NLTK) from the
    Python community as an option for Named Entity Recognition (TIKA-1876).

  • Add support for XFA extraction via Pascal Essiembre (TIKA-1857).

  • Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861). NOTE: this dependency
    is still <scope>provided</scope>. You need to include this dependency
    in order to parse sqlite files.

  • Upgrade to POI 3.15-beta1 (TIKA-1895).

  • Upgrade to Jackson 2.7.1 (TIKA-1869).

  • Upgrade to Apache SIS 0.6 (TIKA-1878).

  • RichTextContentHandler moved from the Server package to Core (TIKA-1870).

  • Added ZeroSizeFileDetector to support application/x-zerovalue via
    Adesh Gupta (TIKA-1885).

  • Addition of types information to Grobid quantities parser via
    Can Menekse (TIKA-1965).

Activity

Matthew Jones February 9, 2017 at 10:44 AM

I ran these tests locally and didn't see these errors. Since 1.14 has now been released, I just put in a PR for that version.

David Horwitz October 4, 2016 at 5:31 AM

tika:

Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.576 sec - in org.sakaiproject.content.impl.test.SortTest
testSorts(org.sakaiproject.content.impl.test.SortTest) Time elapsed: 0.037 sec
testNameCompare(org.sakaiproject.content.impl.test.SortTest) Time elapsed: 0.491 sec
testLocaleSorts(org.sakaiproject.content.impl.test.SortTest) Time elapsed: 0 sec
Running org.sakaiproject.content.impl.test.RoleAccessTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.851 sec <<< FAILURE! - in org.sakaiproject.content.impl.test.RoleAccessTest
org.sakaiproject.content.impl.test.RoleAccessTest Time elapsed: 2.851 sec <<< ERROR!
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.sakaiproject.antivirus.api.VirusScanner' defined in file /home/dhorwitz/git/sakai/kernel/kernel-component/src/main/webapp/WEB-INF/antivirus-components.xml: Cannot resolve reference to bean 'org.sakaiproject.content.api.ContentHostingService' while setting bean property 'contentHostingService'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.sakaiproject.content.api.ContentHostingService' defined in file /home/dhorwitz/git/sakai/kernel/kernel-component/src/main/webapp/WEB-INF/content-components.xml: Instantiation of bean failed; nested exception is java.lang.ExceptionInInitializerError
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.sakaiproject.content.api.ContentHostingService' defined in file /home/dhorwitz/git/sakai/kernel/kernel-component/src/main/webapp/WEB-INF/content-components.xml: Instantiation of bean failed; nested exception is java.lang.ExceptionInInitializerError
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
Caused by: java.lang.ExceptionInInitializerError
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
Caused by: java.lang.RuntimeException: Unable to parse the default media type registry
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
Caused by: org.apache.tika.mime.MimeTypeException: Invalid type configuration
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
at org.sakaiproject.content.impl.test.RoleAccessTest.beforeClass(RoleAccessTest.java:47)
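
The innermost cause above is the SAX parser rejecting the secure-processing feature, whose URI is the value of XMLConstants.FEATURE_SECURE_PROCESSING. One possible explanation (an assumption, not confirmed in this issue) is that an older XML parser implementation is being picked up on the test classpath. A minimal sketch for checking which SAXParserFactory is in use and whether it accepts the feature follows; the class name SecureProcessingProbe is hypothetical and only standard JAXP calls are used.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic helper; not part of Sakai or Tika.
final class SecureProcessingProbe {

    // Returns true if the factory accepts the secure-processing feature,
    // false if it throws (for example SAXNotRecognizedException).
    static boolean supportsSecureProcessing(SAXParserFactory factory) {
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // newInstance() resolves the implementation from the current classpath,
        // so run this with the same classpath as the failing test.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        System.out.println("SAXParserFactory implementation: "
                + factory.getClass().getName());
        System.out.println("secure-processing supported: "
                + supportsSecureProcessing(factory));
    }
}
```

If the reported implementation class is not the JDK's built-in one and the feature is reported as unsupported, the dependency that supplies that parser is a reasonable place to look next.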

Fixed

Details

Created October 4, 2016 at 4:55 AM
Updated April 25, 2018 at 3:18 PM
Resolved February 13, 2017 at 9:25 AM