Processing large files that mix text and binary data with Java

Source: Internet
Author: User
Tags: int, size, mixed, processing, text, java, fileinputstream

Files generally fall into three types: text files, binary data files, and mixed files. Processing mixed files, especially large ones, poses particular challenges for developers.

First, the binary data has to be located so that its content can be extracted for further processing.

Second, the text around the binary data has to be read to obtain the background information without which the next processing step cannot proceed.

Third, it is often unrealistic to read all the data of a large file into memory at once. When the file is read in several passes, keeping the binary data continuous and correct across those reads is a problem that cannot be neglected.

This article discusses these three problems and proposes a feasible method for processing large files that mix binary data and text with Java.

The processing of XML mixed files has been discussed elsewhere, and XML documents can be handled with JAXP and similar tools, so this article does not dwell on them. Interested readers can refer to http://www-128.ibm.com/developerworks/cn/xml/x-wxxm33/. Here, PDF documents serve as the example of a large mixed file.

1. Problem description

Since PDF 1.3, attachment annotations can be embedded in a PDF document, as shown in the following figure.

[Figure: a PDF document containing an attachment annotation]

If the contents of attachment annotations cannot be extracted when a PDF document is searched in full text, a keyword that appears only in an attachment may fail to retrieve the document. It is therefore necessary to extract the contents of the attachment annotations and then decide, according to actual requirements, how to run the full-text search over them. An example of the source of a PDF 1.3 file is shown in the following figure.

[Figure: source of a PDF 1.3 file]

2. Choosing the Java classes

Processing a mixed file involves two tasks: extracting information from the text content, and extracting the binary data. Each task calls for different Java classes.

2.1 Processing text content

Text content can be read with Java's BufferedReader class, for example:

    // fileName is a placeholder for the path of the mixed file
    File f = new File(fileName);
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));

To locate an attachment, analyze the document structure that the PDF format defines. Parsing the PDF document line by line reveals where each attachment is, what it is named, and so on, in preparation for the binary data extraction that follows.

* See the Acrobat home page for the PDF document structure.

2.2 Processing binary data

Binary data cannot be read through BufferedReader, because the bytes are decoded into characters on the way and the extracted data does not remain intact. The FileInputStream class is needed instead, for example:

    // fileName is a placeholder for the path of the mixed file
    File f = new File(fileName);
    FileInputStream fin = new FileInputStream(f);

The binary data still has to be located, otherwise the wrong bytes will be extracted. To do so, compare the bytes read from the file input stream against the text that marks the beginning and the end of the binary data to be extracted. The comparison calls for a method similar to the following:

    // Returns the position, relative to offset, of the first occurrence of
    // search in buffer[offset .. bufferSize), or -1 if it does not occur.
    int findStringInBuffer(byte[] buffer, char[] search, int bufferSize, int offset)
    {
        int len = search.length;
        // Last relative position at which a complete match still fits.
        int last = bufferSize - len - offset;
        for (int pos = 0; pos <= last; pos++)
        {
            boolean fnd = true;
            for (int i = 0; i < len; i++)
            {
                // Mask to 0..255 so that bytes above 0x7F are not sign-extended.
                if ((char) (buffer[offset + pos + i] & 0xFF) != search[i])
                {
                    fnd = false;
                    break;
                }
            }
            if (fnd)
            {
                return pos;
            }
        }
        return -1;
    }

3. Challenges and countermeasures when processing large mixed files

3.1 Challenges of large mixed files

If the mixed file is small, the whole document can be read into memory, where the attachment annotation is quickly located and the binary data extracted. Large mixed files bring new challenges, mainly the following.

First, how to choose the buffer size. If the buffer is too small, the extraction algorithm may fail because the text data is too long to fit in the buffer. If the buffer is too large, memory may be exhausted.

Second, how to divide the target document into segments and still determine the location of the binary data correctly, so that it can be extracted. The indicating text (the text that marks where binary data begins and ends) may be split across different segments, so recovering it is key.

Third, a binary data stream may span several consecutive document segments, and splicing those pieces back into the original data is also a key issue.

The following countermeasures address these three problems.

3.2 Choosing a reasonable buffer size

The buffer size depends on the length of the indicating text of each binary data stream. In theory, any buffer longer than the indicating text is long enough. Data is read from the input file stream into the buffer as follows:

    byte[] buffer = new byte[buffer_size];
    int size = fin.read(buffer);

Here size holds the number of bytes actually read into buffer, or -1 if the end of the file has been reached.

3.3 Stitching the indicating text together

Based on the length of the indicating text, the buffers have to be designed so that the end of one read overlaps the beginning of the next, and only a fixed length of binary data is written to the output file at a time.
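The overlap idea can be sketched as follows. This is a minimal illustration rather than the article's own code: the class name MarkerSearch, the method indexOf, and the use of a ByteArrayInputStream in place of a real file are all assumptions made for the example. After each read, the last marker.length - 1 bytes are moved to the front of the buffer, so a marker cut in two by a read boundary is whole again after the next read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article.
final class MarkerSearch {

    /**
     * Returns the absolute offset of the first occurrence of marker in the
     * stream, or -1 if it never occurs. At most bufferSize bytes are held in
     * memory at any time.
     */
    static long indexOf(InputStream in, byte[] marker, int bufferSize) throws IOException {
        if (marker.length == 0 || bufferSize < marker.length) {
            throw new IllegalArgumentException("buffer must be at least as long as the marker");
        }
        byte[] buffer = new byte[bufferSize];
        int filled = 0;  // number of valid bytes in buffer
        long base = 0;   // absolute stream offset of buffer[0]
        while (true) {
            // Append new data behind the bytes carried over from the last round.
            int n = in.read(buffer, filled, bufferSize - filled);
            if (n == -1) {
                return -1; // end of stream, marker not found
            }
            filled += n;
            for (int pos = 0; pos + marker.length <= filled; pos++) {
                if (matchesAt(buffer, pos, marker)) {
                    return base + pos;
                }
            }
            // Carry over the tail: it may hold the first part of a marker.
            int keep = Math.min(filled, marker.length - 1);
            System.arraycopy(buffer, filled - keep, buffer, 0, keep);
            base += filled - keep;
            filled = keep;
        }
    }

    private static boolean matchesAt(byte[] buffer, int pos, byte[] marker) {
        for (int i = 0; i < marker.length; i++) {
            if (buffer[pos + i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "aaaaastreambbbb".getBytes(StandardCharsets.US_ASCII);
        byte[] marker = "stream".getBytes(StandardCharsets.US_ASCII);
        // With 8-byte reads, the first read ends in the middle of the marker.
        System.out.println(indexOf(new ByteArrayInputStream(data), marker, 8)); // prints 5
    }
}
```

Carrying over marker.length - 1 bytes is the smallest overlap that still catches every split marker. A larger overlap also works but searches some bytes twice.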
The purpose of the overlapping buffer is to stitch the indicating text together correctly, because any single read from the input file stream may cut the indicating text in two. One use of the overlapping buffer is as follows:

    // Move the unprocessed tail of the previous data (tmp[tmp_start .. size))
    // to the front of buffer. Assumes the previous read filled the whole
    // buffer, that is, size == buffer_size.
    for (int j = tmp_start; j < size; j++) {
        buffer[j - tmp_start] = tmp[j];
    }
    size = 0;
    int bt;
    // Refill the rest of buffer byte by byte until it is full or the file ends.
    for (int j = 0; (j < tmp_start) && ((bt = fin.read()) != -1); j++) {
        size++;
        buffer[buffer_size - tmp_start + j] = (byte) bt;
    }
    size = size + buffer_size - tmp_start;

3.4 Splicing the binary data from consecutive buffers into complete binary data

First, determine the starting position of the binary data. Because the binary data may be much longer than the chosen buffer, its end position has to be determined while it is being read. Locating that end position may in turn require the end-indicating text to be stitched together first. Care is also needed not to misidentify the end of the data. This means determining the length of the end-indicating text and, every time binary data is written to the output file, holding back that many bytes for the next read. Binary data is read as follows:

    // Save the last endstream_length bytes of the current buffer.
    for (int j = 0; j < endstream_length; j++) {
        tmp[j] = buffer[size - endstream_length + j];
    }
    // Put the saved bytes at the front of buffer for the next round.
    for (int j = 0; j < endstream_length; j++) {
        buffer[j] = tmp[j];
    }
    // Read the next piece; fin.read returns -1 at the end of the file,
    // which the caller has to check for.
    size = fin.read(ass_buf_4_endstream);
    for (int j = 0; j < size; j++) {
        buffer[endstream_length + j] = ass_buf_4_endstream[j];
    }
    size = size + endstream_length;

4. Conclusion

By combining Java's FileInputStream and BufferedReader classes with sound structural analysis and algorithm design, the binary data in large files that mix binary data and text can be processed. This article presents a feasible scheme for processing such files, which is of practical value.

Contact the author: wtcforever (a) 163.com
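As an illustration of the hold-back technique from section 3.4, the following is a minimal sketch under stated assumptions, not the author's implementation. The class StreamExtractor and its methods are hypothetical names, the input stream is assumed to be positioned already at the first byte of the binary data, and a ByteArrayInputStream stands in for the file. Each round writes out everything except the last endMarker.length - 1 bytes, which might be the beginning of the end-indicating text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article.
final class StreamExtractor {

    /** Returns the index of marker in buf[0 .. to), or -1 if absent. */
    static int find(byte[] buf, int to, byte[] marker) {
        for (int pos = 0; pos + marker.length <= to; pos++) {
            int i = 0;
            while (i < marker.length && buf[pos + i] == marker[i]) {
                i++;
            }
            if (i == marker.length) {
                return pos;
            }
        }
        return -1;
    }

    /**
     * Copies bytes from in to out up to, but not including, the first
     * occurrence of endMarker. Returns true if the marker was found; if the
     * stream ends first, the remaining bytes are written and false is returned.
     */
    static boolean copyUntil(InputStream in, OutputStream out, byte[] endMarker, int bufferSize)
            throws IOException {
        if (endMarker.length == 0 || bufferSize < endMarker.length) {
            throw new IllegalArgumentException("buffer must be at least as long as the marker");
        }
        byte[] buffer = new byte[bufferSize];
        int filled = 0; // number of valid bytes in buffer
        while (true) {
            int n = in.read(buffer, filled, bufferSize - filled);
            if (n == -1) {
                out.write(buffer, 0, filled); // flush the held-back bytes
                return false;
            }
            filled += n;
            int end = find(buffer, filled, endMarker);
            if (end >= 0) {
                out.write(buffer, 0, end); // data before the marker only
                return true;
            }
            // Hold back the tail: it may be the start of the end marker.
            int keep = Math.min(filled, endMarker.length - 1);
            out.write(buffer, 0, filled - keep);
            System.arraycopy(buffer, filled - keep, buffer, 0, keep);
            filled = keep;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] marker = "endstream".getBytes(StandardCharsets.US_ASCII);
        byte[] data = "BINARYDATAendstream tail".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean found = copyUntil(new ByteArrayInputStream(data), out, marker, 12);
        System.out.println(found + " " + out.size()); // prints "true 10"
    }
}
```

The article's snippet reserves the full length of the end-indicating text. The sketch reserves one byte less, which is the minimum that still prevents part of a split marker from being written out as data.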