Discussion:
Tika maxStringLength limit reached
zabrane Mikael
2010-04-21 08:24:10 UTC
Hi List,

While extracting text and metadata from an in-memory string, I got this exception:
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException

Googling a bit, I found that this limit is set to 100K with the variable
"maxStringLength":
http://www.mail-archive.com/tika-commits-PPu3vs9EauNd/SJB6HiN2Ni2O/***@public.gmane.org/msg00468.html

How can I increase this limit to 10 MB, please?

Here is my code:

ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
BasicConfigurator.configure(new WriterAppender(new SimpleLayout(), System.err));
Logger.getRootLogger().setLevel(Level.INFO);
context.set(Parser.class, parser);

InputStream input = new ByteArrayInputStream(data);
textHandler = new BodyContentHandler();
metadata = new Metadata();
parser.parse(input, textHandler, metadata, context);
input.close();

Thanks
Zabrane
Jukka Zitting
2010-04-21 08:36:34 UTC
Hi,
Post by zabrane Mikael
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException
Googling a bit, I found that this limit is set to 100K with the variable
"maxStringLength".
How can I increase this limit to 10 MB, please?
You can pass a custom string length limit (or -1 to disable the limit)
to the BodyContentHandler constructor. See
http://lucene.apache.org/tika/0.7/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler(int)
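As a minimal sketch of the constructor-based approach, the original parsing code could be adapted roughly as follows. The class name LargeTextExtraction, the constant WRITE_LIMIT and the method extract are hypothetical names chosen for illustration, and the sketch assumes the Tika core and parser libraries are on the classpath. Note that the limit counts characters written to the handler, not input bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; only the Tika types come from the library.
class LargeTextExtraction {

    // Roughly "10 MB" worth of characters; pass -1 instead to disable the limit.
    static final int WRITE_LIMIT = 10 * 1024 * 1024;

    static String extract(byte[] data) throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // The int argument replaces the default write limit.
        BodyContentHandler textHandler = new BodyContentHandler(WRITE_LIMIT);
        Metadata metadata = new Metadata();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = new ByteArrayInputStream(data)) {
            parser.parse(input, textHandler, metadata, context);
        }
        return textHandler.toString();
    }
}
```

Disabling the limit with -1 is convenient, but an explicit bound keeps memory use predictable when the input size is not under your control.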
Post by zabrane Mikael
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
BasicConfigurator.configure(new WriterAppender(new SimpleLayout(), System.err));
Logger.getRootLogger().setLevel(Level.INFO);
context.set(Parser.class, parser);
InputStream input = new ByteArrayInputStream(data);
textHandler = new BodyContentHandler();
metadata = new Metadata();
parser.parse(input, textHandler, metadata, context);
input.close();
Have you checked out the new Tika facade class [1]? You can use it to
simplify this code (minus the logging configuration) to:

Tika tika = new Tika();
tika.setMaxStringLength(10*1024*1024);

InputStream input = new ByteArrayInputStream(data);
Metadata metadata = new Metadata();
String content = tika.parseToString(input, metadata);

[1] http://lucene.apache.org/tika/0.7/api/org/apache/tika/Tika.html

BR,

Jukka Zitting
zabrane Mikael
2010-04-21 09:19:42 UTC
Hi Jukka,

Thanks for the simplification. The code works as expected now.

One more question, about UTF-8 output encoding.

In my previous code (which was heavily inspired by TikaCLI.java), I was
unable to set the encoding of the extracted text and metadata to UTF-8.

Is there a way to achieve that?

Regards
Zabrane
Jukka Zitting
2010-04-21 10:44:11 UTC
Hi,
Post by zabrane Mikael
One more question, about UTF-8 output encoding.
In my previous code (which was heavily inspired by TikaCLI.java), I was
unable to set the encoding of the extracted text and metadata to UTF-8.
Is there a way to achieve that?
The Java strings returned by Tika.parseToString() and Metadata.get()
are Unicode strings. You only need an encoding like UTF-8 when you
convert the strings to bytes, for example when writing them to an
OutputStream. See the standard Java API docs for that.
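To illustrate the point, here is a small sketch using only the standard library. The class and method names (EncodingDemo, utf8Length, utf16Length) are made up for this example. The same string has one length in characters, and different byte lengths depending on the encoding chosen at conversion time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the String itself carries no encoding,
// an encoding only appears when the String is turned into bytes.
class EncodingDemo {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE avoids the byte-order mark that plain UTF_16 prepends.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "abc\u5639\u563b";              // 3 ASCII + 2 CJK characters
        System.out.println(text.length());            // 5 chars
        System.out.println(utf8Length(text));         // 9 bytes (3 * 1 + 2 * 3)
        System.out.println(utf16Length(text));        // 10 bytes (5 * 2)
    }
}
```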

BR,

Jukka Zitting
zabrane Mikael
2010-04-21 14:01:13 UTC
Hi Jukka,

Post by Jukka Zitting
The Java strings returned by Tika.parseToString() and Metadata.get()
are Unicode strings and you only need an encoding like UTF-8 when you
convert the strings to bytes for example when writing them to an
OutputStream. See the standard Java API docs for that.
As I'm very new to Java, could you please help me a bit with converting
Unicode to UTF-8?

Regards
Zabrane!
Thilo Goetz
2010-04-21 14:19:31 UTC
Try Google. For example:
http://www.roseindia.net/java/example/java/io/WriteUTF8.shtml
Jukka Zitting
2010-04-21 14:20:14 UTC
Hi,
Post by zabrane Mikael
As I'm very new to Java, could you help me a bit please to convert Unicode
to UTF8?
To get started, see the java.lang.String.getBytes(String) method [1]
and the java.io.OutputStreamWriter(OutputStream, String) constructor
[2].
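A rough sketch of the second option, writing a string to a stream as UTF-8, might look like this. The class Utf8StreamWriter and its method names are hypothetical; a ByteArrayOutputStream stands in for whatever OutputStream (file, socket) is actually used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class built on the standard OutputStreamWriter.
class Utf8StreamWriter {

    // Encodes the text as UTF-8 while writing it to the given stream.
    static void writeUtf8(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        writer.write(text);
        writer.flush(); // push buffered characters through to the stream
    }

    // Convenience wrapper that collects the encoded bytes in memory.
    static byte[] toUtf8Bytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeUtf8(text, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

For text returned by Tika.parseToString(), the same writeUtf8 call could be pointed at a FileOutputStream instead of the in-memory buffer.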

For more background on Unicode and character encodings, see the article at [3].

[1] http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes(java.lang.String)
[2] http://java.sun.com/j2se/1.5.0/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter(java.io.OutputStream,%20java.lang.String)
[3] http://www.joelonsoftware.com/articles/Unicode.html

BR,

Jukka Zitting
zabrane Mikael
2010-04-21 14:52:19 UTC
Thanks, Jukka!
zabrane Mikael
2010-04-21 21:33:57 UTC
That seems to work, Jukka:

try {
    // Convert from Unicode to UTF-8
    String string = "abc\u5639\u563b";
    byte[] utf8 = string.getBytes("UTF-8");

    // Convert from UTF-8 to Unicode
    string = new String(utf8, "UTF-8");
} catch (UnsupportedEncodingException e) {
}