Apache Tika 1.5
The most notable changes in Tika 1.5 over the previous release are:
- Fixed bug in handling of embedded file processing in PDFs (TIKA-1228).
- Added SourceCodeParser to support Java, Groovy, and C++ files (TIKA-1224).
- Updated Tika Server to support multipart/form-data payloads (TIKA-1198).
- Updated Tika Server to CXF 2.7.8 (TIKA-1197).
- Updated Tika Server to accept requests over wildcard addresses (TIKA-1196).
- Added option to use alternate NonSequentialPDFParser (TIKA-1201).
- Content from PDF AcroForms is now extracted (TIKA-973).
- Fixed invalid asterisks from master slide in PPT (TIKA-1171).
- Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817).
- Text from tables in PPT files is once again extracted correctly (TIKA-1076).
- Text is extracted from text boxes in XLSX (TIKA-1100).
- Tika no longer hangs when processing Excel files with a custom fraction format (TIKA-1132).
- A disconcerting stack trace from missing beans is no longer printed for some DOCX files (TIKA-792).
- Upgraded POI to 3.10-beta2 (TIKA-1173).
- Upgraded PDFBox to 1.8.4 (TIKA-1230).
- Made HtmlEncodingDetector more flexible in finding meta header charset (TIKA-1001).
- Added sanitized test HTML file for local file test (TIKA-1139).
- Fixed bug that prevented attachments within a PDF from being processed if the PDF itself was an attachment (TIKA-1124).
- Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130).
- RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192).
- CLI: TikaCLI now escapes invalid filename characters as hex characters (TIKA-1078).
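To illustrate the kind of filename escaping described in TIKA-1078, here is a minimal, self-contained sketch. It is not TikaCLI's actual code: the class name FilenameEscapeSketch, the set of characters treated as invalid, and the %XX output format are all assumptions made for this example.

```java
public class FilenameEscapeSketch {

    // Assumed set of characters that are unsafe in file names on common
    // file systems. '%' is included so the escaped form stays unambiguous.
    private static final String UNSAFE = "\\/:*?\"<>|%";

    // Replaces each unsafe or control character with '%' followed by its
    // two-digit uppercase hex code; all other characters pass through.
    public static String escape(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            i += Character.charCount(cp);
            boolean control = cp < 0x20 || cp == 0x7F;
            if (control || UNSAFE.indexOf(cp) >= 0) {
                // Every escaped character here is ASCII, so two hex digits suffice.
                out.append(String.format("%%%02X", cp));
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An embedded document name containing characters that cannot be
        // used in a file name on some platforms.
        System.out.println(escape("report:final?.txt"));
    }
}
```

With these assumptions, the name "report:final?.txt" becomes "report%3Afinal%3F.txt", while a name such as "plain.txt" is left unchanged.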
The following people have contributed to Tika 1.5 by submitting or commenting on the issues resolved in this release:
- Albert L.
- Andrew Jackson
- Andrzej Bialecki
- Boris Naguet
- Chris A. Mattmann
- Curtis Warner
- Damien Dykman
- Daniel Bonniot de Ruisselet
- Daniel Gibby
- Dave Kincaid
- Dave Meikle
- Dietmar Glachs
- Emil Burzo
- Gaurav
- Giuseppe Totaro
- Grzegorz Kaczmarczyk
- Hong-Thai Nguyen
- Jason Sherman
- Jeremy
- Jukka Zitting
- Kabron Kline
- Kai-Uwe Schmidt
- Kazuaki Matsuba
- Ken Krugler
- Lewis John McGibbney
- Lutz Theurer
- Marius Dumitru Florea
- Markus Jelsma
- Michael Graessle
- Michael McCandless
- Nick Burch
- Niels Beekman
- Oliver Heger
- Paul Brinich
- Ralf Schmitt
- Ray Gauss II
- Rian Stockbower
- Ryan Krueger
- Sergey Beryozkin
- Stefano Fornari
- Sumeet Gorab
- Tim Allison
- Timo Boehme
- Uwe Schindler
- Vadim Roizman
- Yegor Kozlov
- brat
- David Rapin
- Gunter Rombauts
- Isha Marwah
See http://s.apache.org/oQ for more details on these contributions.