How to do a "Clean" text extraction from an EMF file?

Last post 07-19-2008, 4:59 AM by NKuzmin. 3 replies.
  •  06-10-2008, 1:38 PM 130811

    How to do a "Clean" text extraction from an EMF file?

    Hi,

    I'm evaluating your product for a potential customer. The key factor in the purchase decision is whether we can extract clean text from an EMF file.

    I'm using the following code snippet:

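            // Needs: java.io.FileInputStream, java.nio.ByteBuffer, java.nio.charset.Charset
            FileInputStream file = new FileInputStream(fileName);
            EmfMetafile efile = new EmfMetafile(file);
            MetafileComment[] metaFileComment = efile.getComments();
            for (int i = 0; i < metaFileComment.length; i++) {
                // Decode the raw comment bytes, then strip control characters
                // and every character from 'z' (U+007A) upward.
                String temp = Charset.defaultCharset()
                        .decode(ByteBuffer.wrap(metaFileComment[i].getCommentData()))
                        .toString()
                        .replaceAll("[\u0000-\u001F]", "")
                        .replaceAll("[\u007A-\uFFFF]", "");
                // Skip comments containing markers seen on records without text.
                if (temp.indexOf("F+@") == -1 && temp.indexOf("*>*>?@,") == -1) {
                    // Drop everything up to and including the "*>*>?@" prefix.
                    // Note: if the prefix is absent, indexOf returns -1 and this cuts the first 5 characters.
                    temp = temp.substring(temp.indexOf("*>*>?@") + 6);
                    System.out.println(" -- metaFileComment " + metaFileComment[i].getRecordIndex() + ": " +
                                        temp );
                }
            }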

    With the text stripping above, I was able to remove some of the junk text. However, there is still a lot more that I have no idea how to filter out.

    Here's sample output from a run (please note: I'd like to cleanly extract only the text shown in red below).

         [java] Reading file C:\project\Aspose.Total.Java\Aspose.Metafiles\.\{3258E1A7-0A80-40DA-967B-D7232536DC7F}.emf
         [java]  File content: com.aspose.metafiles.EmfMetafile@341960
         [java]  -- metaFileComment 3767: 4(!E!s%DDWG NO
         [java]  -- metaFileComment 3789: 0$7E!=DREV
         [java]  -- metaFileComment 3798: 4(E'D402711
         [java]  -- metaFileComment 3949: 4(EyIDTITLE
         [java]  -- metaFileComment 3958: 0$EWIDDATE
         [java]  -- metaFileComment 3965: 4(D7NDSTRESS
         [java]  -- metaFileComment 3972: 8,DxLDCHECKED
         [java]  -- metaFileComment 3979: 4(DJDDRAWN
         [java]  -- metaFileComment 3986: <0%D4IDAPPROVALS
         [java]  -- metaFileComment 3993: @4DPEDCONTRACT NO.
         [java]  -- metaFileComment 4000: 0$EPDSIZE
         [java]  -- metaFileComment 4007: <0EvPDCAGE CODE
         [java]  -- metaFileComment 4014: 8,ESDSCALE 
         [java]  -- metaFileComment 4021: 0$nESD1/4
         [java]  -- metaFileComment 4028: 8,EwPDDWG. NO.
         [java]  -- metaFileComment 4035: 0$3EoPDREV
         [java]  -- metaFileComment 4042: D8.E SDSHEET       
         [java]  -- metaFileComment 4049: 8,oE SD1     
         [java]  -- metaFileComment 4056: 4(4E SDOF  
         [java]  -- metaFileComment 4070: THEEDSARGENT FLETCHER INC.
         [java]  -- metaFileComment 4079: TH@EGD9400 E. FLAIR DR. - 
         [java]  -- metaFileComment 4088: D8p=EGDEL MONTE, CA 
         [java]  -- metaFileComment 4095: 4(EGD91731
         [java]  -- metaFileComment 4102: 4(E"KQD72429
         [java]  -- metaFileComment 4118: 8,ESDCALC.WT.
         [java]  -- metaFileComment 4127: \PD2EDUNLESS OTHERWISE SPECIFIED
         [java]  -- metaFileComment 4134: XLDtWFDDIMENSIONS ARE IN INCHES
         [java]  -- metaFileComment 4141: H<D*GDTOLERANCES ARE:
         [java]  -- metaFileComment 4148: D8DGDANGLES       
         [java]  -- metaFileComment 4155: 8,pYDGDDECIMALS
         [java]  -- metaFileComment 4162: PD9DLDDO NOT SCALE DRAWING
         [java]  -- metaFileComment 4169: @4DSDAPPLICATION
         [java]  -- metaFileComment 4176: <0DQDNEXT ASSY
         [java]  -- metaFileComment 4183: 8,jDRrQDUSED ON
         [java]  -- metaFileComment 4190: 0$DQDPART
         [java]  -- metaFileComment 4197: 0$DRDDASH
         [java]  -- metaFileComment 4204: 0$D\SDNO.
         [java]  -- metaFileComment 4211: <0EMCDPARTS LIST
         [java]  -- metaFileComment 4220: 8,(DBDQTY REQD
         [java]  -- metaFileComment 4229: <04PD_BDCAGE CODE
         [java]  -- metaFileComment 4238: XL-DhxBDPART OR IDENTIFYING NO.
         [java]  -- metaFileComment 4247: `TEjBDNOMENCLATURE OR DESCRIPTION
         [java]  -- metaFileComment 4256: THEn_BDMATERIAL SPECIFICATION
         [java]  -- metaFileComment 4263: 0$ETBDZONE
         [java]  -- metaFileComment 4270: 0$DgEtADFIND
         [java]  -- metaFileComment 4277: 0$EBDNO.
         [java]  -- metaFileComment 4284: THOD5JDDO NOT REVISE MANUALLY
         [java]  -- metaFileComment 4316: 0$&DHD30'
         [java]  -- metaFileComment 4323: 0$$TDID.X
         [java]  -- metaFileComment 4344: 0$D;ID.XX
         [java]  -- metaFileComment 4358: 0$D;ID.03
         [java]  -- metaFileComment 4365: 4(TDJD.XXX
         [java]  -- metaFileComment 4379: 0$DJD.010
         [java]  -- metaFileComment 4386: D8DQODMANUFACTURING
         [java]  -- metaFileComment 4393: @4DQDPROJECT ENGR
         [java]  -- metaFileComment 4400: <0D+jSDCHIEF ENGR
         [java]  -- metaFileComment 4407: 0$eDGSDDATE
         [java]  -- metaFileComment 4414: PDjDSDADDITIONAL APPROVALS
         [java]  -- metaFileComment 4421: 8,DQODDESIGNER
         [java]  -- metaFileComment 4428: L@pDQDQUALITY ASSURANCE
         [java]  -- metaFileComment 4435: THcD,LDTHIRD ANGLE PROJECTION
         [java]  -- metaFileComment 4444: H<Dk6NDDESIGN ACTIVITY
         [java]  -- metaFileComment 4453: TH=EKDSHELL, CENTER SECTION
         [java]  -- metaFileComment 4462: 4(QEgPD402711
         [java]  -- metaFileComment 4469: <0ERqAREVISIONS
         [java]  -- metaFileComment 4478: @4E0JADESCRIPTION
         [java]  -- metaFileComment 4487: 0$\EADATE
         [java]  -- metaFileComment 4494: 8,sE0AAPPROVED
         [java]  -- metaFileComment 4501: 0$sDFAZONE
         [java]  -- metaFileComment 4508: 0$hEnAREV
         [java]  -- metaFileComment 4515: 4(3E+AVDCATIA
         [java]  -- metaFileComment 4524: 4(VDRDDWG NO
         [java]  -- metaFileComment 4540: 0$-DRDREV
         [java]  -- metaFileComment 4563: 4(%DuRD402711
         [java]  -- metaFileComment 4570: @@ AAICA
         [java]  -- metaFileComment 4610: ?A\ATHIS REPRODUCTION IS A PROPRIETARY DESIGN AND IS A CONFIDENTIAL
         [java]  -- metaFileComment 4619: @A<ADISCLOSURE BY SARGENT FLETCHER INC., EL MONTE, CALIFORNIA. IT IS
         [java]  -- metaFileComment 4626: t+AALOANED SUBJECT TO THE CONDITIONS THAT IT: 
         [java]  -- metaFileComment 4633: PD4CA1) SHALL BE USED FOR
         [java]  -- metaFileComment 4640: h\ABRECORD AND REFERENCE PURPOSE; 
         [java]  -- metaFileComment 4647: l`"uKBB2) SHALL NOT BE USED NOR CAUSED TO
         [java]  -- metaFileComment 4654: BA+BBE USED FOR PROCUREMENT OR IN ANY OTHER WAY PREJUDICIAL TO SARGENT
         [java]  -- metaFileComment 4661: H<AnABFLETCHER INC.; 
         [java]  -- metaFileComment 4668: /JBnAB3)SHALL NOT BE REPRODUCED OR COPIED IN WHOLE OR
         [java]  -- metaFileComment 4675: <0AnWBIN PART; 
         [java]  -- metaFileComment 4682: 5,=BnWB4) SHALL NOT BE USED TO PRODUCE OR MANUFACTURE ITEMS
         [java]  -- metaFileComment 4689: AA^mBEXCEPT WITH THE EXPRESS WRITTEN CONSENT OF SARGENT FLETCHER INC.;
         [java]  -- metaFileComment 4696: =AB5) SHALL NOT BE RELEASED TO A THIRD PARTY WITHOUT THE EXPRESS
         [java]  -- metaFileComment 4703: BA'BWRITTEN CONSENT OF SARGENT FLETCHER INC.; AND 6) SHALL BE RETURNED
         [java]  -- metaFileComment 4710: @4ABUPON DEMAND.
         [java]  -- metaFileComment 4717: @@ D[:DE[:D
         [java]  -- metaFileComment 4866: 8,D;:DER 4043
         [java]  -- metaFileComment 4873: @4hD;:DWELDING ROD
         [java]  -- metaFileComment 4880: H<D;:DANSI/AWS A 5.10
         [java]  -- metaFileComment 4887: 0$E;:D1G15
         [java]  -- metaFileComment 4915: 0$hD:@<DSKIN
         [java]  -- metaFileComment 4922: <0D:@<DSEE NOTE 3
         [java]  -- metaFileComment 4929: 0$E:@<D1B5
         [java]  -- metaFileComment 4957: 0$hD9=DSKIN
         [java]  -- metaFileComment 4964: <0D9=DSEE NOTE 3
         [java]  -- metaFileComment 4971: 0$E9=D1B7
         [java]  -- metaFileComment 4999: 0$hD7F?DSKIN
         [java]  -- metaFileComment 5006: <0D7F?DSEE NOTE 3
         [java]  -- metaFileComment 5013: 0$E7F?D1B9
         [java]  -- metaFileComment 5027: 4(15ETQD-    
         [java]  -- metaFileComment 5036: 8,XABNOTES: 
         [java]  -- metaFileComment 5045: \PNBBBUNLESS OTHERWISE SPECIFIED
         [java]  -- metaFileComment 5052: 4(XAB1.  
         [java]  -- metaFileComment 5059: pd#BBINTERPRET DRAWING PER ASME Y14.100.
         [java]  -- metaFileComment 5066: 4(XAM>B2.  
         [java]  -- metaFileComment 5073: 3BM>BDIMENSIONING AND TOLERANCING PER ASME Y14.5M -1994.
         [java]  -- metaFileComment 5080: 4(XAOC3   
         [java]  -- metaFileComment 5087: 4BOCMATERIAL: .090 AL ALY SH 6061-0 PER AMS-QQ-A-250/11.
         [java]  -- metaFileComment 5094: 4(XAjC4.  
         [java]  -- metaFileComment 5101: 8BjCAFTER FORMING FIND NO 1, 2, & 3, SOLUTION HEAT TREAT AND
         [java]  -- metaFileComment 5108: 1'&BCARTIFICIALLY AGE TO T42 CONDITION PER MIL-H-6088.
         [java]  -- metaFileComment 5115: 4(XAZ'C5   
         [java]  -- metaFileComment 5122: 8BZ'CFUSION WELD PER AWS D17.1:2001 CLASS B USING FIND NO. 4.
         [java]  -- metaFileComment 5129: 4(XA95C6.  
         [java]  -- metaFileComment 5136: dXB95CWELDING SYMBOLS PER AWS A2.4.
         [java]  -- metaFileComment 5143: 4(XABC7   
         [java]  -- metaFileComment 5150: =BBCRUBBER STAMP OR STENCIL WITH 72429/402711 AND APPLICABLE DASH
         [java]  -- metaFileComment 5157: <'&BICNO PER MIL-STD-130, CHARACTER SIZE OPTIONAL. USE CONTRASTING
         [java]  -- metaFileComment 5164: 0'&B:PCCOLOR INK PER A-A-56032. LOCATE APPROX AS SHOWN.
         [java]  -- metaFileComment 5189: 4(DxOD402771
         [java]  -- metaFileComment 5196: 0$4DxODF-2
         [java]  -- metaFileComment 5203: 8,H.DJDJ. GRAF
         [java]  -- metaFileComment 5210: 8,\EJD03-07-12
         [java]  -- metaFileComment 5219: L@%EjAPRODUCTION RELEASE
         [java]  -- metaFileComment 5235: @@ K[CeAKCeA
         [java]  -- metaFileComment 5275: 4K[CRfATHIS DRAWING IS SUBJECT TO THE INTERNATIONAL TRAFFIC
         [java]  -- metaFileComment 5284: dXK[CiBIN ARMS REGULATIONS (ITAR). 
         [java]  -- metaFileComment 5291: THXgCiBIT MAY NOT BE EXPORTED
         [java]  -- metaFileComment 5298: 2K[Cu,BFROM THE UNITED STATES OR TRANSFERRED TO A FOREIGN
         [java]  -- metaFileComment 5305: 0K[CGBPERSON WITHOUT THE PRIOR WRITTEN APPROVAL OF THE
         [java]  -- metaFileComment 5312: \PK[C)8cBU.S. DEPARTMENT OF STATE.
         [java]  -- metaFileComment 5319: @

         [java] Reading file C:\project\Aspose.Total.Java\Aspose.Metafiles\.\{8C9BB600-6126-4050-BBAE-1CFE6B8669C6}.emf
         [java]  File content: com.aspose.metafiles.EmfMetafile@1d2fc36
         [java]  -- metaFileComment 190: @ :ac/@ C-@ :aBcC@:[BAbC@@@@
         [java]  -- metaFileComment 200: @@ :YB^C:YBiC
         [java]  -- metaFileComment 334: @ :ac/@ C-@ :aBcC@:[BAbC@@@@
         [java]  -- metaFileComment 344: 0$SABNOTE
         [java]  -- metaFileComment 352: THSAI[B1. END SHAPE OF TUBE:
         [java]  -- metaFileComment 359: l`"ABTOLERANCE SHALL BE APPLIED TO AREA
         [java]  -- metaFileComment 366: dXABWITHIN 20mm FROM CANISTER END
         [java]  -- metaFileComment 373: l`"AUtBWITHIN 17mm FROM TWO WAY VALVE END
         [java]  -- metaFileComment 380: L@SA!C2. PRINT ON TUBE:
         [java]  -- metaFileComment 387: t+A'CMANUFACTURER'S TRADEMARK, NOMINAL DIA., AND
         [java]  -- metaFileComment 394: PDAg=,CMANUFACTURED DATE 
         [java]  -- metaFileComment 401: h\Bg=,COR LOT NUMBER (ABBREVIATION) TO
         [java]  -- metaFileComment 408: THA>l1CBE PRINTED REPEATEDLY.
         [java]  -- metaFileComment 415: XLSA6C3. MARKING OF TUBE END:
         [java]  -- metaFileComment 422: /A;CMARKING POSITION AND ANGLE DEPENDS ON 3D MODEL.
         [java]  -- metaFileComment 429: pd$A@CSHAPE AND COLOR TO BE SHOWN AS NOTE.
         [java]  -- metaFileComment 436: 1SA@eC4. BEND RADII SHALL BE R15 ALONG TUBE CENTERLINE.
         [java]  -- metaFileComment 443: `TSAojC5. ASSEMBLY CONDITION:CLAMP
         [java]  -- metaFileComment 450: 3AToCASSEMBLY LOCATION BEFORE DELIVERY TO BE COORDINATED
         [java]  -- metaFileComment 457: `TA+tCBETWEEN SUPPLIER AND PLANT.
         [java]  -- metaFileComment 464: 3AyCASSEMBLY LOCATION IN THIS DWG. SHOWS ACTUAL VEHICLE
         [java]  -- metaFileComment 471: PDA*CASSEMBLY CONDITION.
         [java]  -- metaFileComment 478: 0SA,C6. PROTECTION FOR TUBE ENDS AND DUST PROOF TO BE
         [java]  -- metaFileComment 485: xl(ADCCOORDINATED BETWEEEN SUPPLIER AND PLANT.
         [java]  -- metaFileComment 492: @@ gBZCBZC
         [java]  -- metaFileComment 502: H<gB0OCMARKING (WHITE)
         [java]  -- metaFileComment 525: @4DBTC3 OR EQUIV)
         [java]  -- metaFileComment 534: @@ ifBbCgBZC
         [java]  -- metaFileComment 644: 0$!5,Bl2C5.5
         [java]  -- metaFileComment 664: @@, @0$BOvCBO6CB. *CBJC
         [java]  -- metaFileComment 744: 0$B!C10.5
         [java]  -- metaFileComment 765: @@, @0$:YBGUC:]BGUCg*AGUCBGUC
         [java]  -- metaFileComment 854: 4(g*ACPCAPPRX.
         [java]  -- metaFileComment 861: @

     

    Thanks.

    Brandon

     
  •  06-10-2008, 5:15 PM 130839 in reply to 130811

    Re: How to do a "Clean" text extraction from an EMF file?

    Hello Brandon.

    Thanks for considering Aspose.Metafiles.


    Metafile comments are special records in a metafile where an application can store arbitrary data. A reader usually ignores these records unless it knows something about their format. You cannot extract text from them unless the author stored it there.
    As for your EMF file, it looks like an EMF+ Dual file: a standard EMF file with "fallback" GDI records to render it, plus advanced GDI+ records encapsulated in EMF comment records. So what you have extracted is the other (GDI+) metafile, not text.
    Currently there is no way to extract text from a metafile, but I'll add this feature to our to-do list.
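
    For readers who want to experiment in the meantime, here is a minimal sketch of walking the GDI+ records inside one comment payload and pulling the text out of DrawString records. It is not part of Aspose.Metafiles. It assumes the record layout described in Microsoft's EMF+ specification: an optional 4-byte "EMF+" signature, then records with a 12-byte header (Type, Flags, Size, DataSize), where record type 0x401C (EmfPlusDrawString) carries BrushId, FormatID, Length, a 16-byte layout rectangle, and then Length UTF-16LE characters. Whether getCommentData() returns the payload with or without the signature is not stated in this thread, so the sketch accepts both. The class and method names (EmfPlusTextSketch, extractStrings, sampleComment) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EmfPlusTextSketch {
    private static final int EMF_PLUS_SIGNATURE = 0x2B464D45; // bytes 'E','M','F','+' read little-endian
    private static final int DRAW_STRING = 0x401C;            // EmfPlusDrawString record type
    private static final int HEADER_SIZE = 12;                // Type(2) + Flags(2) + Size(4) + DataSize(4)
    private static final int FIXED_FIELDS = 28;               // BrushId(4) + FormatID(4) + Length(4) + LayoutRect(16)

    /** Returns the strings of all DrawString records found in one comment payload. */
    public static List<String> extractStrings(byte[] commentData) {
        List<String> result = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(commentData).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        // Assumption: the payload may or may not start with the "EMF+" signature.
        if (commentData.length >= 4 && buf.getInt(0) == EMF_PLUS_SIGNATURE) {
            pos = 4;
        }
        while (pos + HEADER_SIZE <= commentData.length) {
            int type = buf.getShort(pos) & 0xFFFF;
            long size = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            // Stop on a malformed record instead of reading past the buffer.
            if (size < HEADER_SIZE || pos + size > commentData.length) {
                break;
            }
            if (type == DRAW_STRING && size >= HEADER_SIZE + FIXED_FIELDS) {
                long charCount = buf.getInt(pos + HEADER_SIZE + 8) & 0xFFFFFFFFL;
                int textStart = pos + HEADER_SIZE + FIXED_FIELDS;
                long textBytes = charCount * 2; // UTF-16LE, two bytes per character
                if (textStart + textBytes <= pos + size) {
                    result.add(new String(commentData, textStart, (int) textBytes,
                            StandardCharsets.UTF_16LE));
                }
            }
            pos += (int) size;
        }
        return result;
    }

    /** Builds a synthetic payload holding one DrawString record (for demonstration only). */
    public static byte[] sampleComment(String text) {
        byte[] chars = text.getBytes(StandardCharsets.UTF_16LE);
        int padded = (chars.length + 3) & ~3; // records are padded to a multiple of 4 bytes
        int dataSize = FIXED_FIELDS + padded;
        ByteBuffer b = ByteBuffer.allocate(4 + HEADER_SIZE + dataSize).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(EMF_PLUS_SIGNATURE);
        b.putShort((short) DRAW_STRING).putShort((short) 0);
        b.putInt(HEADER_SIZE + dataSize).putInt(dataSize);
        b.putInt(0).putInt(0).putInt(text.length()); // BrushId, FormatID, Length
        // The 16-byte LayoutRect and the padding stay zero-filled.
        b.position(b.position() + 16);
        b.put(chars);
        return b.array();
    }

    public static void main(String[] args) {
        System.out.println(extractStrings(sampleComment("REV"))); // prints [REV]
    }
}
```

    In a real file, each array returned by getCommentData() would be passed to extractStrings. For a three-character string such as "REV", the synthetic record has Size 0x30 and DataSize 0x24, which as ASCII are '0' and '$'. That appears consistent with the "0$" fragments in front of short strings in the output above, though this is an inference from the listing, not something verified against the attached file. Text drawn through other record types (for example, driver strings made of glyph indices) would not be recovered by this sketch.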

    Nikolay Kuzmin
    Developer
    Aspose.Slides Team
    http://www.aspose.com
    Your File Format Experts
     
  •  06-10-2008, 5:35 PM 130843 in reply to 130839

    Re: How to do a "Clean" text extraction from an EMF file?

    Attachment: Present (inaccessible)

    Hi Nikolay,

    Thanks for the speedy reply. It would be nice to be able to extract text from an EMF file. I'll check back on Aspose.Metafiles in the future for this update.

    I'm attaching my sample EMF file (it comes from a set of CAD drawing files); maybe it will help you determine what kind of information you want to capture from it.

    Thanks,

    Brandon Nguyen

     

     

     

     
  •  07-19-2008, 4:59 AM 136179 in reply to 130843

    Re: How to do a "Clean" text extraction from an EMF file?

    Hello Brandon,
    Please check Aspose.Metafiles 1.3.0.0.

    Nikolay Kuzmin
    Developer
    Aspose.Slides Team
     