Apache Tika and tikatree

Zeigren

Apache Tika is a nice little tool from the Apache Software Foundation. In the project's own words:
"The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)."

As part of the Keep Dreaming Project I wrote tikatree, a little command line tool that uses Apache Tika to parse metadata from all files in a directory. It creates a JSON file of all the metadata, another JSON file with the file tree and some basic file information, a graphical representation of the directory, and a checksum.

Basically shoot it at a directory and a bunch of neat information about everything in it pops out.
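
To give a rough idea of what the file tree output involves, here is a minimal sketch using only the Python standard library. It is not tikatree's actual code: the function names, the `size_bytes` and `sha256` keys, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_file_tree(root):
    """Map each file path (relative to root) to some basic file information.

    The key names below are hypothetical, not tikatree's real output format.
    """
    tree = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # keep the traversal order deterministic
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            tree[rel_path] = {
                "size_bytes": os.path.getsize(full_path),
                "sha256": sha256_of(full_path),
            }
    return tree
```

Calling `build_file_tree("some_directory")` and passing the result to `json.dumps(..., indent=2)` would give a JSON document along the lines of the file tree described above.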

As an example, here's a metadata tidbit from some Dreamcast documentation. I've omitted most of it as it's really long.

"E_DC_HW_outline.doc": {
"Application-Name": [
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0",
"Microsoft Word 8.0"
],
"Author": [
"TOY開発生産本部",
"Shuji Hori",
"Shuji Hori",
"TOY開発生産本部",
"",
"TOY開発生産本部",
"TOY開発生産本部",
"TOY開発生産本部",
"TOY開発生産本部",
"TOY開発生産本部",
"TOY開発生産本部"
],
"Character Count": [
"48232",
"80",
"16",
"39",
"19",
"108",
"88",
"33",
"23"
],
"Comments": [
" CPU:SH4\rSH4外部インターラプト\r\r表 3-1\r\r\r\r\r\r\r\r\r システムROM\r SH4の設定\r",
"",
"",
"",
"",
"",
"",
""
],
"Company": [
"セガ・エンタープライゼス",
"SEGA",
"SEGA",
"(株)セガ・エンタープライゼス",
"sega europe",
"(株)セガ・エンタープライゼス",
"セガ・エンタープライゼス",
"セガ・エンタープライゼス",
"Sega of America",
"(株)セガ・エンタープライゼス",
"セガ・エンタープライゼス"
],
"Content-Type": [
"application/msword",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"image/wmf",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"application/msword",
"image/bmp",
"application/msword",
"image/bmp"
],
"Creation-Date": [
"1999-10-11T08:52:00Z",
"1999-10-15T07:40:00Z",
"1999-10-15T08:31:00Z",
"1999-10-18T08:59:00Z",
"1999-10-18T09:17:00Z",
"1999-10-20T06:03:00Z",
"1999-10-18T08:17:00Z",
"1999-10-15T07:51:00Z",
"1999-10-15T06:49:00Z",
"1999-10-15T07:25:00Z",
"1999-09-03T04:36:00Z"
],
"Edit-Time": [
"1043400000000",
"600000000",
"600000000",
"1200000000",
"600000000",
"600000000",
"600000000",
"600000000",
"3600000000",
"600000000"
],
"Keywords": [
"sh holly 接続 データ システム ",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"Last-Author": [
"CatherineBarnfather",
"CatherineBarnfather",
"CatherineBarnfather",
"CatherineBarnfather",
"",
"CatherineBarnfather",
"CatherineBarnfather",
"CatherineBarnfather",
"CatherineBarnfather",
"CatherineBarnfather",
"Shuji Hori"
],
"Last-Modified": [
"1999-10-25T08:31:00Z",
"1999-10-15T07:41:00Z",
"1999-10-15T08:32:00Z",
"1999-10-18T09:06:00Z",
"1999-10-18T09:18:00Z",
"1999-10-20T06:04:00Z",
"1999-10-18T08:19:00Z",
"1999-10-15T07:52:00Z",
"1999-10-15T06:49:00Z",
"1999-10-15T07:31:00Z",
"1999-09-03T04:37:00Z"
],



I think you can see how this would be useful lol
I'm surprised Tika has never been mentioned here before, though I didn't find out about it myself until recently.
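
Each key in the excerpt holds a list of values rather than a single one (going by the "Content-Type" list, this looks like the main document plus its embedded objects, but that is a guess). A small sketch of how such a list could be boiled down, using the "Company" values copied from the excerpt; the `summarize` helper is hypothetical and not part of tikatree or Tika:

```python
from collections import Counter

# "Company" values copied from the E_DC_HW_outline.doc excerpt.
metadata = {
    "Company": [
        "セガ・エンタープライゼス",
        "SEGA",
        "SEGA",
        "(株)セガ・エンタープライゼス",
        "sega europe",
        "(株)セガ・エンタープライゼス",
        "セガ・エンタープライゼス",
        "セガ・エンタープライゼス",
        "Sega of America",
        "(株)セガ・エンタープライゼス",
        "セガ・エンタープライゼス",
    ]
}


def summarize(values):
    """Count how often each non-empty value occurs in a metadata list."""
    return Counter(value for value in values if value)


company_counts = summarize(metadata["Company"])

# Print the distinct values, most frequent first.
for value, count in company_counts.most_common():
    print(f"{count:2d}  {value}")
```

The same helper works on any of the other keys, and it skips the empty strings that show up in lists like "Author" and "Keywords".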

If you're dealing with a really big directory, or the files in it have tons of metadata, you can use tika-python, as it will create a metadata file for each individual file instead of one big metadata file. It's what tikatree uses internally. Tika itself can be run as a Java JAR with a GUI, but the GUI is limited to a single file and can't save anything.
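
For the one-metadata-file-per-input-file approach, something like the sketch below might do. It assumes the `tika` package (tika-python) and a Java runtime are installed, and it relies on `tika.parser.from_file` returning a dict with a "metadata" key. The function names, the `extract` parameter, and the `.meta.json` suffix are made up for this example and are not what tika-python or tikatree use.

```python
import json
import os


def tika_metadata(path):
    """Extract metadata for one file via tika-python (needs `tika` and Java)."""
    from tika import parser  # imported lazily so the rest works without Tika

    parsed = parser.from_file(path)
    return parsed.get("metadata") or {}


def write_metadata_per_file(root, out_dir, extract=tika_metadata):
    """Write one JSON metadata file under out_dir for every file below root.

    out_dir should live outside root, otherwise the walk may pick up the
    files this function writes. The ".meta.json" suffix is an assumption.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            source = os.path.join(dirpath, name)
            rel_path = os.path.relpath(source, root)
            target = os.path.join(out_dir, rel_path + ".meta.json")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "w", encoding="utf-8") as handle:
                json.dump(extract(source), handle, indent=2, ensure_ascii=False)
            written.append(target)
    return written
```

Passing the extractor in as a parameter keeps the directory walk testable without a running Tika server; `write_metadata_per_file("docs", "docs_metadata")` would use tika-python by default.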

Alternative tikatree links:

GitHub
PyPI
 