Using POI in Java to Convert Word to HTML for Online Reading

I. Analysis: converting to HTML with POI
From what I found online, there are several ways to implement online reading of Word documents in Java:

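1. Word => PDF (OpenOffice + JodConverter) => SWF (pdf2swf) => view with FlexPaper
2. Word => PDF (MS Office + JACOB) => SWF (pdf2swf) => view with FlexPaper
3. Word => SWF (FlashPaper) => view with FlexPaper
4. Word => SWF (print2flash) => view with FlexPaper
5. A commercial third-party component: PageOffice
6. Word => HTML, in one of two ways:
    1) use POI to convert Word 2003 to HTML;
    2) use OpenOffice + JodConverter to convert Word 2003 to HTML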
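  The first four approaches share the same goal: turn the Word document into a Flash file. Only the intermediate steps differ. The first two are more cumbersome: convert to PDF, then to SWF, and finally display with FlexPaper. The latter two are quicker, converting the source file straight to SWF for FlexPaper. The second approach relies on JACOB, a Java-COM bridge that drives Microsoft Office, so it is essentially hopeless on Linux and was the first to be eliminated. FlashPaper is not open source and is not compatible with Windows 8 (the system I am using now), so I did not adopt the third approach. Print2flash, as I understood it, is open source, so using it in a company product would not raise licensing disputes; unfortunately I could not find a command for driving its conversion from a program. So approaches 3 and 4 were eliminated as well. After downloading and trying it, I found the fifth approach, PageOffice, to be the least effort, and it renders Word documents perfectly. But it costs money! Once money is involved, that option had to be shelved for now.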
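At first I intended to use OpenOffice + JodConverter to convert to SWF. Then, while browsing Baidu Wenku, I noticed something that made me curious: nearly all the documents there are displayed as HTML. In other words, Baidu converts the Word documents we upload to HTML, and the result blends into the page quite well. That made me wonder whether our project could preview Word documents online the same way.
      With that idea, I searched Google for material and found that there are few open-source tools for converting Word to HTML. The one mentioned most often is POI. However, POI's Word handling is rather weak, and so began the arduous journey of converting Word => HTML with POI (I later found write-ups online on converting Word 2003 to HTML with OpenOffice + JodConverter, but I did not look into them further):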
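II. Implementing the POI-to-HTML conversion

1. About POI:
Apache POI is a free, open-source, cross-platform Java API written in Java. It gives Java programs the ability to read and write Microsoft Office file formats. POI stands for "Poor Obfuscation Implementation".
Apache POI provides Java APIs for working with documents based on the Office Open XML (OOXML) standard and Microsoft's OLE 2 Compound Document format (OLE2). With it you can read, create, and modify MS Excel files in Java, and also read and create MS Word and MS PowerPoint files. Apache POI provides a Java solution for Excel (Excel 97-2008).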
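Basic structure:
HSSF - reads and writes Microsoft Excel XLS files.
XSSF - reads and writes Microsoft Excel OOXML XLSX files.
HWPF - reads and writes Microsoft Word DOC files.
HSLF - reads and writes Microsoft PowerPoint files.
HDGF - reads Microsoft Visio files.
HPBF - reads Microsoft Publisher files.
HSMF - reads Microsoft Outlook files.
In practice POI is strongest at Excel spreadsheets, i.e. HSSF and XSSF above; most of our projects that involve reports use it. HWPF, the package for DOC files, is much less complete, and its API is not polished either.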

III. Code
See the comments for details.
1. Reading the Word document

package com;  
import java.awt.image.BufferedImage;  
import java.io.BufferedWriter;  
import java.io.File;  
import java.io.FileInputStream;  
import java.io.FileNotFoundException;  
import java.io.FileOutputStream;  
import java.io.IOException;  
import java.io.OutputStream;  
import java.io.OutputStreamWriter;  
import javax.imageio.ImageIO;  
import org.apache.poi.hwpf.HWPFDocument;  
import org.apache.poi.hwpf.model.PicturesTable;  
import org.apache.poi.hwpf.usermodel.CharacterRun;  
import org.apache.poi.hwpf.usermodel.Paragraph;  
import org.apache.poi.hwpf.usermodel.Picture;  
import org.apache.poi.hwpf.usermodel.Range;  
import org.apache.poi.hwpf.usermodel.Table;  
import org.apache.poi.hwpf.usermodel.TableCell;  
import org.apache.poi.hwpf.usermodel.TableIterator;  
import org.apache.poi.hwpf.usermodel.TableRow;  
import org.apache.xmlbeans.impl.piccolo.io.FileFormatException;  
/**
 * @Description: Simple conversion of a Word document to an HTML file with POI
 * @author 柯颖波
 * @date 2013-12-20 09:32:44 AM
 * @version v1.0
 */
public class Word2Html {
    /**
     * ASCII code of the carriage return
     */
    private static final short ENTER_ASCII = 13;
    /**
     * ASCII code of the space character
     */
    private static final short SPACE_ASCII = 32;
    /**
     * ASCII code of the horizontal tab
     */
    private static final short TABULATION_ASCII = 9;
    private static String htmlText = "";
    private static String htmlTextTbl = "";
    private static int counter = 0;
    private static int beginPosi = 0;
    private static int endPosi = 0;
    private static int beginArray[];
    private static int endArray[];
    private static String htmlTextArray[];
    private static boolean tblExist = false;
    /**
     * Project path
     */
    private static String projectRealPath = "";
    /**
     * Temporary file path
     */
    private static String tempPath = "/upfile/" + File.separator + "transferFile" + File.separator;
    /**
     * Name of the Word document
     */
    private static String wordName = "";
    public static void main(String argv[]) {  
        try {  
            wordToHtml("F:\\SVN\\BobUtil\\web\\", "2012年高考广东数学（文）试卷解析（精析word版）（学生版）.doc");  
        } catch (Exception e) {  
            e.printStackTrace();  
        }  
    }  
    /**
     * Reads the document character by character, together with each character's style
     *
     * @param fileName path of the .doc file
     * @throws Exception
     */
    private static void getWordAndStyle(String fileName) throws Exception {
        FileInputStream in = new FileInputStream(new File(fileName));
        HWPFDocument doc = new HWPFDocument(in);
        in.close(); // the document is fully loaded by the constructor
        Range rangetbl = doc.getRange();// range covering the whole document
        TableIterator it = new TableIterator(rangetbl);
        int num = 100;
        beginArray = new int[num];
        endArray = new int[num];
        htmlTextArray = new String[num];
        tblExist = false;
        // total number of characters in the document
        int length = doc.characterLength();
        // the document's picture table
        PicturesTable pTable = doc.getPicturesTable();
        // opening HTML of the page
        htmlText = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" /><title>"
                + doc.getSummaryInformation().getTitle()
                + "</title></head><body><div style='margin:60px;text-align:center;'><div style='width:620px;text-align:left;line-height:24px;'>";
        // convert the tables first so the main loop can splice them in
        if (it.hasNext()) {
            readTable(it, rangetbl);
        }
        int cur = 0;  
        String tempString = "";  
        for (int i = 0; i < length - 1; i++) {  
            // examine the document one character at a time; range covers the single character at i
            Range range = new Range(i, i + 1, doc);  
            CharacterRun cr = range.getCharacterRun(0);  
            // beginArray=new int[num];  
            // endArray=new int[num];  
            // htmlTextArray=new String[num];  
            if (tblExist) {  
                if (i == beginArray[cur]) {  
                    htmlText += tempString + htmlTextArray[cur];  
                    tempString = "";  
                    i = endArray[cur] - 1;  
                    cur++;  
                    continue;  
                }  
            }  
            if (pTable.hasPicture(cr)) {  
                htmlText += tempString;  
                // read the picture and write it to disk
                try {  
                    readPicture(pTable, cr);  
                } catch (Exception e) {  
                    e.printStackTrace();  
                }  
                tempString = "";  
            } else {  
                Range range2 = new Range(i + 1, i + 2, doc);
                // the next character
                CharacterRun cr2 = range2.getCharacterRun(0);
                char c = cr.text().charAt(0);
                // System.out.println(c);
                // /System.out.println(i+"::"+range.getEndOffset()+"::"+range.getStartOffset()+"::"+c);
                // carriage return?
                if (c == ENTER_ASCII) {
                    tempString += "<br/>";
                }
                // space?
                else if (c == SPACE_ASCII)
                    tempString += " ";
                // horizontal tab?
                else if (c == TABULATION_ASCII)
                    tempString += "    ";
                // check whether this character and the next one share the same style
                boolean flag = compareCharStyle(cr, cr2);
                if (flag)  
                    tempString += cr.text();  
                else {  
                    String fontStyle = "<span style=\"font-family:" + cr.getFontName() + ";font-size:"  
                            + cr.getFontSize() / 2 + "pt;";  
                    if (cr.isBold())  
                        fontStyle += "font-weight:bold;";  
                    if (cr.isItalic())  
                        fontStyle += "font-style:italic;";  
                    if (cr.isStrikeThrough())  
                        fontStyle += "text-decoration:line-through;";  
                    int fontcolor = cr.getIco24();  
                    int[] rgb = new int[3];  
                    if (fontcolor != -1) {  
                        rgb[0] = (fontcolor >> 0) & 0xff; // red;  
                        rgb[1] = (fontcolor >> 8) & 0xff; // green  
                        rgb[2] = (fontcolor >> 16) & 0xff; // blue  
                    }  
                    fontStyle += "color: rgb(" + rgb[0] + "," + rgb[1] + "," + rgb[2] + ");";  
                    htmlText += fontStyle + "\">" + tempString + cr.text() + "</span>";  
                    tempString = "";  
                }  
            }  
        }  
        htmlText += tempString + "</div></div></body></html>";  
        // System.out.println(htmlText);  
    }  
    /**
     * Reads the tables in the document and caches the HTML of each one
     *
     * @param it       iterator over the tables of the document
     * @param rangetbl range covering the whole document
     * @throws Exception
     */
    private static void readTable(TableIterator it, Range rangetbl) throws Exception {
        htmlTextTbl = "";
        // iterate over the tables in the document
        counter = -1;
        while (it.hasNext()) {
            tblExist = true;
            htmlTextTbl = "";
            Table tb = (Table) it.next();
            beginPosi = tb.getStartOffset();
            endPosi = tb.getEndOffset();
            // System.out.println("............"+beginPosi+"...."+endPosi);
            counter = counter + 1;
            // remember where the table starts and ends so the main loop can skip over it
            beginArray[counter] = beginPosi;
            endArray[counter] = endPosi;
            htmlTextTbl += "<table border='1' cellpadding='0' cellspacing='0' >";
            // iterate over the rows, starting from 0
            for (int i = 0; i < tb.numRows(); i++) {
                TableRow tr = tb.getRow(i);
                htmlTextTbl += "<tr align='center'>";
                // iterate over the cells of the row, starting from 0
                for (int j = 0; j < tr.numCells(); j++) {
                    TableCell td = tr.getCell(j);// the current cell
                    int cellWidth = td.getWidth();
                    // read the content of the cell
                    for (int k = 0; k < td.numParagraphs(); k++) {
                        Paragraph para = td.getParagraph(k);
                        CharacterRun crTemp = para.getCharacterRun(0);
                        String fontStyle = "<span style=\"font-family:" + crTemp.getFontName() + ";font-size:"
                                + crTemp.getFontSize() / 2 + "pt;color:" + crTemp.getColor() + ";";
                        if (crTemp.isBold())
                            fontStyle += "font-weight:bold;";
                        if (crTemp.isItalic())
                            fontStyle += "font-style:italic;";
                        // an empty paragraph still gets a space so the cell is rendered;
                        // the original compared the whole string with ==, which never matched
                        String text = para.text().trim();
                        if (text.isEmpty()) {
                            text = " ";
                        }
                        String s = fontStyle + "\">" + text + "</span>";
                        // System.out.println(s);
                        htmlTextTbl += "<td width=" + cellWidth + ">" + s + "</td>";
                        // System.out.println(i + ":" + j + ":" + cellWidth + ":" + s);
                    } // end for
                } // end for
                htmlTextTbl += "</tr>"; // the original never closed the row
            } // end for
            htmlTextTbl += "</table>";
            htmlTextArray[counter] = htmlTextTbl;
        } // end while
    }
    /**
     * Extracts a picture from the document and writes it to disk
     *
     * @param pTable the document's picture table
     * @param cr     the character run that holds the picture
     * @throws Exception
     */
    private static void readPicture(PicturesTable pTable, CharacterRun cr) throws Exception {
        // extract the picture
        Picture pic = pTable.extractPicture(cr, false);
        BufferedImage image = null;// image object
        // picture size as displayed in Word
        int picHeight = pic.getHeight() * pic.getAspectRatioY() / 100;
        int picWidth = pic.getAspectRatioX() * pic.getWidth() / 100;
        if (picWidth > 500) {
            picHeight = 500 * picHeight / picWidth;
            picWidth = 500;
        }
        String style = " style='height:" + picHeight + "px;width:" + picWidth + "px'";
        // file name suggested by POI
        String afileName = pic.suggestFullFileName();
        // path used for unit testing
        String directory = "images/" + wordName + "/";
        // path used inside the project
        //String directory = tempPath + "images/" + wordName + "/";
        makeDir(projectRealPath, directory);// create the directory
        int picSize = cr.getFontSize();
        int myHeight = 0;
        if (afileName.indexOf(".wmf") > 0) {  
            OutputStream out = new FileOutputStream(new File(projectRealPath + directory + afileName));  
            out.write(pic.getContent());  
            out.close();  
            afileName = Wmf2Png.convert(projectRealPath + directory + afileName);  
            File file = new File(projectRealPath + directory + afileName);  
            try {  
                image = ImageIO.read(file);  
            } catch (Exception e) {  
                e.printStackTrace();  
            }  
            int pheight = image.getHeight();
            int pwidth = image.getWidth();
            if (pwidth > 500) {
                // wide picture: treat it as an ordinary image and use the PNG's own size
                // (the original wrote myHeight here, which is still 0 at this point)
                htmlText += "<img style='width:" + pwidth + "px;height:" + pheight + "px'" + " src=\"" + directory
                        + afileName + "\"/>";
            } else {
                // narrow picture: assume a formula and size it relative to the font
                myHeight = (int) (pheight / (pwidth / (picSize * 1.0)) * 1.5);
                htmlText += "<img style='vertical-align:middle;width:" + picSize * 1.5 + "px;height:" + myHeight
                        + "px'" + " src=\"" + directory + afileName + "\"/>";
            }
        } else {  
            OutputStream out = new FileOutputStream(new File(projectRealPath + directory + afileName));  
            // pic.writeImageContent(out);  
            out.write(pic.getContent());  
            out.close();  
            // handle jpg and other formats (anything except png)
            if (afileName.indexOf(".png") == -1) {  
                try {  
                    File file = new File(projectRealPath + directory + afileName);  
                    image = ImageIO.read(file);  
                    picHeight = image.getHeight();  
                    picWidth = image.getWidth();  
                    if (picWidth > 500) {  
                        picHeight = 500 * picHeight / picWidth;  
                        picWidth = 500;  
                    }  
                    style = " style='height:" + picHeight + "px;width:" + picWidth + "px'";  
                } catch (Exception e) {  
                    // e.printStackTrace();  
                }  
            }  
            htmlText += "<img " + style + " src=\"" + directory + afileName + "\"/>";  
        }  
        if (pic.getWidth() > 450) {  
            htmlText += "<br/>";  
        }  
    }  
    private static boolean compareCharStyle(CharacterRun cr1, CharacterRun cr2) {  
        boolean flag = false;  
        if (cr1.isBold() == cr2.isBold() && cr1.isItalic() == cr2.isItalic()  
                && cr1.getFontName().equals(cr2.getFontName()) && cr1.getFontSize() == cr2.getFontSize()) {  
            flag = true;  
        }  
        return flag;  
    }  
    /**
     * Writes a string to a file (returns true on success, false on failure)
     *
     * @param s
     *            the content to write
     * @param filePath
     *            the target file
     */
    private static boolean writeFile(String s, String filePath) {
        FileOutputStream fos = null;
        BufferedWriter bw = null;
        // strip the residue left over from embedded equation (MathType) fields
        s = s.replaceAll("EMBED", "").replaceAll("Equation.DSMT4", "");
        try {
            makeDir(projectRealPath, tempPath);// create the directory
            File file = new File(filePath);
            if (file.exists()) {
                return false;
            }
            fos = new FileOutputStream(file);
            bw = new BufferedWriter(new OutputStreamWriter(fos, "utf-8"));
            bw.write(s);
            // System.out.println(filePath + " written successfully");
        } catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
            return false; // the original fell through and reported success
        } catch (IOException ioe) {
            ioe.printStackTrace();
            return false;
        } finally {
            try {
                if (bw != null)
                    bw.close();
                if (fos != null)
                    fos.close();
            } catch (IOException ie) {
                ie.printStackTrace();
            }
        }
        return true;
    }
    /**
     * Creates a multi-level directory from a path
     *
     * @param root
     *            the base directory
     * @param url
     *            the relative path, in the form "\classes\cn\qtone\" or "/classes/cn/qtone/"
     */
    private static String makeDir(String root, String url) {
        String[] sub;
        url = url.replaceAll("\\/", "\\\\");
        if (url.indexOf("\\") > -1) {
            sub = url.split("\\\\");
        } else {
            return "-1";
        }
        File dir = null;
        try {
            dir = new File(root);
            for (int i = 0; i < sub.length; i++) {
                if (!dir.exists() && !sub[i].equals("")) {
                    dir.mkdir();
                }
                File dir2 = new File(dir + File.separator + sub[i]);
                if (!dir2.exists()) {
                    dir2.mkdir();
                }
                dir = dir2;
            }
        } catch (Exception e) {
            e.printStackTrace();
            return "-1";
        }
        return dir.toString();
    }
    /**
     * Converts a Word document and returns the path of the converted file
     *
     * @param projectPath
     *            project path
     * @param relativeFilePath
     *            path of the file relative to the project path
     * @return path of the generated .htm file (null if an error occurs)
     */
    public static String wordToHtml(String projectPath, String relativeFilePath) {
        String resultPath = null;
        projectRealPath = projectPath;// project path
        String filePath = "";
        // System.out.println(projectRealPath + tempPath);
        // System.out.println(makeDir(projectRealPath, tempPath));
        try {
            File file = new File(projectPath + relativeFilePath);
            if (file.exists()) {
                if (file.getName().indexOf(".doc") == -1 || file.getName().indexOf(".docx") > 0) {
                    throw new FileFormatException("Please make sure the file is in .doc format!");
                } else {
                    wordName = file.getName();
                    wordName = wordName.substring(0, wordName.indexOf("."));
                    filePath = projectRealPath + tempPath + wordName + ".htm";
                    synchronized (relativeFilePath) {// guard against concurrent conversion
                        File ff = new File(filePath);
                        if (!ff.exists()) {// convert only if the output does not exist yet
                            getWordAndStyle(projectPath + relativeFilePath);
                            writeFile(htmlText, filePath);
                        }
                    }
                    resultPath = tempPath + wordName + ".htm";
                }
            } else {
                throw new FileNotFoundException("File not found!");
            }
        } catch (NullPointerException e) {  
            e.printStackTrace();  
        } catch (FileNotFoundException e) {  
            e.printStackTrace();  
        } catch (Exception e) {  
            e.printStackTrace();  
        }  
        return resultPath;  
    }  
}
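
The core idea of getWordAndStyle, merging consecutive characters that share a style into one <span>, can be shown without POI. The sketch below is only an illustration: CharStyle and RunMerger are hypothetical names standing in for the attributes read from a CharacterRun, and the font size is assumed to be in half-points, which is why the listing above divides it by 2.

```java
import java.util.List;

/** Hypothetical stand-in for the style attributes read from a POI CharacterRun. */
final class CharStyle {
    final String fontName;
    final int halfPointSize; // font size in half-points, as the listing assumes
    final boolean bold;
    final boolean italic;

    CharStyle(String fontName, int halfPointSize, boolean bold, boolean italic) {
        this.fontName = fontName;
        this.halfPointSize = halfPointSize;
        this.bold = bold;
        this.italic = italic;
    }

    /** Same comparison as compareCharStyle in the listing. */
    boolean sameAs(CharStyle o) {
        return bold == o.bold && italic == o.italic
                && fontName.equals(o.fontName) && halfPointSize == o.halfPointSize;
    }

    String toCss() {
        StringBuilder css = new StringBuilder();
        css.append("font-family:").append(fontName)
           .append(";font-size:").append(halfPointSize / 2).append("pt;");
        if (bold) css.append("font-weight:bold;");
        if (italic) css.append("font-style:italic;");
        return css.toString();
    }
}

final class RunMerger {
    /** Emits one span per run of equally styled characters; styles.get(i) belongs to text.charAt(i). */
    static String toHtml(String text, List<CharStyle> styles) {
        StringBuilder html = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            pending.append(text.charAt(i));
            boolean last = i == text.length() - 1;
            // Flush when the next character has a different style or the text ends
            if (last || !styles.get(i).sameAs(styles.get(i + 1))) {
                html.append("<span style=\"").append(styles.get(i).toCss()).append("\">")
                    .append(pending).append("</span>");
                pending.setLength(0);
            }
        }
        return html.toString();
    }
}
```

Using a StringBuilder instead of repeated String concatenation is a design choice of this sketch, not of the original listing; for long documents it avoids copying the whole HTML string on every character.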

2. Image handling

package com;  
import java.io.ByteArrayInputStream;  
import java.io.ByteArrayOutputStream;  
import java.io.File;  
import java.io.FileInputStream;  
import java.io.FileOutputStream;  
import java.io.InputStream;  
import java.io.OutputStream;  
import java.util.Scanner;  
import java.util.zip.GZIPOutputStream;  
import javax.xml.parsers.DocumentBuilder;  
import javax.xml.parsers.DocumentBuilderFactory;  
import javax.xml.transform.OutputKeys;  
import javax.xml.transform.Transformer;  
import javax.xml.transform.TransformerFactory;  
import javax.xml.transform.dom.DOMSource;  
import javax.xml.transform.stream.StreamResult;  
import net.arnx.wmf2svg.gdi.svg.SvgGdi;  
import net.arnx.wmf2svg.gdi.wmf.WmfParser;  
import org.apache.batik.transcoder.TranscoderInput;  
import org.apache.batik.transcoder.TranscoderOutput;  
import org.apache.batik.transcoder.TranscodingHints;  
import org.apache.batik.transcoder.image.PNGTranscoder;  
import org.apache.batik.transcoder.wmf.tosvg.WMFTranscoder;  
import org.apache.commons.lang.StringUtils;  
import org.w3c.dom.Document;  
import org.w3c.dom.Element;  
public class Wmf2Png {  
    public static void main(String[] args) throws Exception {  
        // convert("F:\\SVN\\BobUtil\\web\\25177.wmf");  
        // System.out.println((20 / (21 * 1.0)));  
        // svgToPng("F:\\SVN\\BobUtil\\web\\25177.svg", "F:\\SVN\\BobUtil\\web\\25177.png");  
    }  
    /**
     * @Description: Converts a WMF file to PNG
     * @param filePath
     *            path of the WMF file
     * @return name of the resulting PNG file
     */
    public static String convert(String filePath) {
        String pngFile = "";
        File wmfFile = new File(filePath);
        try {
            if (!wmfFile.getName().contains(".wmf")) {
                throw new Exception("Please make sure the input file is a wmf file");
            }
            // wmf -> svg
            String svgFile = filePath.replace("wmf", "svg");
            wmfToSvg(filePath, svgFile);
            // preprocess the svg
            PreprocessSvgFile(svgFile);
            // svg -> png
            pngFile = filePath.replace("wmf", "png");
            svgToPng(svgFile, pngFile);
            // delete the svg
            File file = new File(svgFile);
            if (file.exists()) {
                file.delete();
            }
            // delete the wmf
            if (wmfFile.exists()) {
                wmfFile.delete();
            }
        } catch (Exception e) {  
            try {  
                e.printStackTrace();  
                wmfToJpg(filePath);  
            } catch (Exception e1) {  
                e1.printStackTrace();  
            }  
        }  
        return wmfFile.getName().replace("wmf", "png");  
    }  
    /**
     * Converts a WMF file to SVG
     *
     * @param src  path of the source WMF file
     * @param dest path of the SVG file to write
     */
    public static void wmfToSvg(String src, String dest) throws Exception {
        boolean compatible = false;
        InputStream in = new FileInputStream(src);
        try {
            WmfParser parser = new WmfParser();
            final SvgGdi gdi = new SvgGdi(compatible);
            parser.parse(in, gdi);
            Document doc = gdi.getDocument();
            OutputStream out = new FileOutputStream(dest);
            if (dest.endsWith(".svgz")) {
                out = new GZIPOutputStream(out);
            }
            output(doc, out); // output() flushes and closes the stream
        } finally {
            in.close(); // the original never closed the input stream
        }
    }
    /**
     * @Description: Writes the svg document to a stream
     * @param doc the svg DOM document
     * @param out the stream to write to (closed by this method)
     * @throws Exception
     */
    private static void output(Document doc, OutputStream out) throws Exception {  
        TransformerFactory factory = TransformerFactory.newInstance();  
        Transformer transformer = factory.newTransformer();  
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");  
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");  
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");  
        transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD SVG 1.0//EN");  
        transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,  
                "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd");  
        transformer.transform(new DOMSource(doc), new StreamResult(out));  
        out.flush();  
        out.close();  
        out = null;  
    }  
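    /**
     * @Description: Preprocesses the svg file (mainly resizing: shrink to one tenth first; if the width is still above the default, scale down proportionally)
     * @param svgFile
     *            path of the svg file
     * @throws Exception
     */
    private static void PreprocessSvgFile(String svgFile) throws Exception {
        int defaultWeight = 500;// default (maximum) width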
        FileInputStream inputs = new FileInputStream(svgFile);  
        Scanner sc = new Scanner(inputs, "UTF-8");  
        ByteArrayOutputStream os = new ByteArrayOutputStream();  
        while (sc.hasNextLine()) {  
            String ln = sc.nextLine();  
            if (!ln.startsWith("<!DOCTYPE")) {  
                os.write((ln + "\r\n").getBytes());  
            }  
        }  
        os.flush();  
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();  
        DocumentBuilder builder;  
        builder = factory.newDocumentBuilder();  
        Document doc = null;  
        try {  
            doc = builder.parse(new ByteArrayInputStream(os.toByteArray()));  
        } catch (Exception e) {  
            inputs = new FileInputStream(svgFile);  
            os = new ByteArrayOutputStream();  
            int noOfByteRead = 0;  
            while ((noOfByteRead = inputs.read()) != -1) {  
                os.write(noOfByteRead);  
            }  
            os.flush();  
            doc = builder.parse(new ByteArrayInputStream(os.toByteArray()));  
        } finally {  
            os.close();  
            inputs.close();  
        }  
        int height = Integer.parseInt(((Element) doc.getElementsByTagName("svg").item(0)).getAttribute("height"));  
        int width = Integer.parseInt(((Element) doc.getElementsByTagName("svg").item(0)).getAttribute("width"));  
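        int newHeight = 0;// new height
        int newWidth = 0;// new width
        newHeight = height / 10;// shrink the height to one tenth
        newWidth = width / 10; // shrink the width to one tenth
        // if the width is still larger than defaultWeight after shrinking, adjust it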
        if (newWidth > defaultWeight) {  
            newWidth = defaultWeight;  
            newHeight = defaultWeight * height / width;  
        }  
        ((Element) doc.getElementsByTagName("svg").item(0)).setAttribute("width", String.valueOf(newWidth));  
        ((Element) doc.getElementsByTagName("svg").item(0)).setAttribute("height", String.valueOf(newHeight));  
        OutputStream out = new FileOutputStream(svgFile);  
        output(doc, out);  
    }  
    /**
     * Converts an svg image to a png image
     *
     * @param svgPath path of the source svg file
     * @param pngFile path of the png file to write
     * @throws Exception
     */
    public static void svgToPng(String svgPath, String pngFile) throws Exception {  
        File svg = new File(svgPath);  
        FileInputStream wmfStream = new FileInputStream(svg);  
        ByteArrayOutputStream imageOut = new ByteArrayOutputStream();  
        int noOfByteRead = 0;  
        while ((noOfByteRead = wmfStream.read()) != -1) {  
            imageOut.write(noOfByteRead);  
        }  
        imageOut.flush();  
        imageOut.close();  
        wmfStream.close();  
        ByteArrayOutputStream jpg = new ByteArrayOutputStream();  
        FileOutputStream jpgOut = new FileOutputStream(pngFile);  
        byte[] bytes = imageOut.toByteArray();  
        PNGTranscoder t = new PNGTranscoder();  
        TranscoderInput in = new TranscoderInput(new ByteArrayInputStream(bytes));  
        TranscoderOutput out = new TranscoderOutput(jpg);  
        t.transcode(in, out);  
        jpgOut.write(jpg.toByteArray());  
        jpgOut.flush();  
        jpgOut.close();  
        imageOut = null;  
        jpgOut = null;  
    }  
    /**
     * Converts a wmf image to a png image (fallback, used when the conversion above fails)
     *
     * @param wmfPath path of the source wmf file
     * @throws Exception
     */
    public static String wmfToJpg(String wmfPath) throws Exception {
        // first wmf --> svg
        File wmf = new File(wmfPath);  
        FileInputStream wmfStream = new FileInputStream(wmf);  
        ByteArrayOutputStream imageOut = new ByteArrayOutputStream();  
        int noOfByteRead = 0;  
        while ((noOfByteRead = wmfStream.read()) != -1) {  
            imageOut.write(noOfByteRead);  
        }  
        imageOut.flush();  
        imageOut.close();  
        wmfStream.close();  
        // WMFHeaderProperties prop = new WMFHeaderProperties(wmf);  
        WMFTranscoder transcoder = new WMFTranscoder();  
        TranscodingHints hints = new TranscodingHints();  
        transcoder.setTranscodingHints(hints);  
        TranscoderInput input = new TranscoderInput(new ByteArrayInputStream(imageOut.toByteArray()));  
        ByteArrayOutputStream svg = new ByteArrayOutputStream();  
        TranscoderOutput output = new TranscoderOutput(svg);  
        transcoder.transcode(input, output);  
        // then svg --> png
        ByteArrayOutputStream jpg = new ByteArrayOutputStream();  
        String jpgFile = StringUtils.replace(wmfPath, "wmf", "png");  
        FileOutputStream jpgOut = new FileOutputStream(jpgFile);  
        byte[] bytes = svg.toByteArray();  
        PNGTranscoder t = new PNGTranscoder();  
        TranscoderInput in = new TranscoderInput(new ByteArrayInputStream(bytes));  
        TranscoderOutput out = new TranscoderOutput(jpg);  
        t.transcode(in, out);  
        jpgOut.write(jpg.toByteArray());  
        jpgOut.flush();  
        jpgOut.close();  
        return jpgFile;  
    }  
}
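
One detail in convert is worth a note: filePath.replace("wmf", "svg") replaces every occurrence of "wmf" in the path, so a directory that happens to be called wmf would be renamed too. A small helper that only touches the trailing extension avoids this. The sketch below is a suggestion, not part of the original classes; Extensions and swapExtension are hypothetical names.

```java
final class Extensions {
    /** Replaces only the extension of the file name, leaving the directories untouched. */
    static String swapExtension(String path, String newExt) {
        int dot = path.lastIndexOf('.');
        int sep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        if (dot <= sep) {
            // The file name itself has no extension, so append one
            return path + "." + newExt;
        }
        return path.substring(0, dot + 1) + newExt;
    }
}
```

For example, swapExtension("F:\\wmf\\25177.wmf", "svg") keeps the wmf directory and changes only the file name, whereas String.replace would rewrite both.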

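IV. Key points and difficulties:
1)  Reading tables:
    a)       find the start and end offsets of each table;
    b)       walk through the whole table, taking out the content of each cell and appending it to a variable.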
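
The two steps fit together as follows: the table HTML is built up front and cached with its start and end offsets, and the character loop jumps over each table range while splicing in the cached HTML. A minimal POI-free sketch of that skip-and-splice loop, with hypothetical names (TableRegion, TableSplicer) and a plain String standing in for the document text:

```java
import java.util.List;

/** Hypothetical holder for one converted table: its character range and cached HTML. */
final class TableRegion {
    final int begin; // offset of the first character of the table
    final int end;   // offset just past the last character of the table
    final String html;

    TableRegion(int begin, int end, String html) {
        this.begin = begin;
        this.end = end;
        this.html = html;
    }
}

final class TableSplicer {
    /** Copies text to the output, replacing each table range with its cached HTML. Tables must be in document order. */
    static String splice(String text, List<TableRegion> tables) {
        StringBuilder out = new StringBuilder();
        int cur = 0; // index of the next table to splice in
        int i = 0;
        while (i < text.length()) {
            if (cur < tables.size() && i == tables.get(cur).begin) {
                out.append(tables.get(cur).html);
                i = tables.get(cur).end; // skip the raw characters of the table
                cur++;
                continue;
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }
}
```

A list is used here instead of the fixed arrays of size 100 in the listing, so a document with more tables does not overflow; that is a choice of this sketch.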
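2)  Reading pictures
    a)       The picture file format.
If the picture is a png or jpg, it can be handled directly and placed in an <img> tag, and the HTML displays fine. But if the picture is a wmf (see Appendix 1), browsers cannot render it in HTML, so we have to convert it to another format:
After searching Baidu, I found many posts recommending the batik toolkit for the conversion, the idea being wmf -> svg -> png. Reading up on it (see Appendix 2), I found it handles svg files very well, i.e. the svg -> png step is close to perfect. The wmf -> svg step, however, lost parts of the image, and the distortion was severe. I checked the API to see whether it was a matter of parameters, but whatever I set, the results were unsatisfactory. At one point I wanted to give up and look for another package.
Later, by chance, I saw a CSDN user suggest first converting wmf to svg with the wmf2svg tool and then using batik to convert svg to png. Very good!! With that idea I felt I could see the light at the end of the tunnel.
Once the class was written, conversion tests indeed looked great, with no distortion at all. So I embedded it into the Word -> HTML utility class and tested it with all sorts of documents containing wmf pictures. The generated HTML files were basically fine. I was so happy!! (Well, that's programmers for you.)
The good times did not last. When testing in the real project, a few documents brought the server down as soon as conversion started, with an out-of-memory error. Investigation showed that the picture conversion was exhausting memory. Odd: I knew image processing is memory-hungry, but I did not expect 1 GB to be used up so quickly (about 300 MB right after startup, and an OutOfMemoryError shortly after the Word conversion started).
For a while I suspected a bug in batik that kept it from handling large svg files, and I even posted the question on the batik site. After more reading, it turned out that the svg files generated by wmf2svg had far too large a width and height, for example 15040 * 13088. With both dimensions above ten thousand, the image has close to 200 million pixels, so running out of memory is no surprise.
Using a DOM parser, I preprocess every generated svg file: first shrink both height and width to one tenth; if the width is still larger than 500, set it to 500 and scale the height proportionally. Svg files produced by this step convert with batik without any problem.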
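
To make the numbers concrete, here is the same resizing rule as in PreprocessSvgFile, isolated as a pure function, together with a rough estimate of the raster size. The 4 bytes per pixel is an assumption (an uncompressed ARGB raster); SvgSize, shrink and rasterBytes are hypothetical names used only for this sketch.

```java
final class SvgSize {
    static final int MAX_WIDTH = 500;

    /** Returns {width, height} after shrinking to one tenth and capping the width at MAX_WIDTH. */
    static int[] shrink(int width, int height) {
        int newWidth = width / 10;
        int newHeight = height / 10;
        if (newWidth > MAX_WIDTH) {
            newWidth = MAX_WIDTH;
            newHeight = MAX_WIDTH * height / width; // keep the aspect ratio
        }
        return new int[] { newWidth, newHeight };
    }

    /** Rough size of an uncompressed raster, assuming 4 bytes per pixel. */
    static long rasterBytes(int width, int height) {
        return (long) width * height * 4;
    }
}
```

For the 15040 * 13088 example, rasterBytes gives 787,374,080 bytes, roughly 750 MB for a single image, which is consistent with a 1 GB heap being exhausted. After shrink the image is 500 * 435, about 870 KB under the same assumption.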
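At this point the picture conversion was mostly solved. In use, though, wmf2svg turned out not to be very stable either and occasionally throws an exception. My tests showed that for the wmf files that trigger the exception, the earlier batik-only route wmf -> svg -> png produces an undistorted png. So I catch the exception raised by wmf2svg and call the fallback method wmfToJpg(String wmfPath). With that, most wmf conversion problems are solved.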
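
The try-primary-then-fallback pattern can be written once and reused. The sketch below is a generic illustration, not the code of Wmf2Png.convert: WmfConverter and FallbackConversion are hypothetical names, and unlike the original, which only prints the stack trace, it reports the case where both routes fail.

```java
/** Hypothetical converter abstraction: takes a wmf path and returns the path of the png. */
interface WmfConverter {
    String convert(String wmfPath) throws Exception;
}

final class FallbackConversion {
    /** Tries the primary converter first and the fallback only if the primary throws. */
    static String convert(String wmfPath, WmfConverter primary, WmfConverter fallback) {
        try {
            return primary.convert(wmfPath);
        } catch (Exception primaryFailure) {
            try {
                return fallback.convert(wmfPath);
            } catch (Exception fallbackFailure) {
                // Both routes failed: keep both causes so the problem can be diagnosed
                IllegalStateException e = new IllegalStateException(
                        "Both converters failed for " + wmfPath, fallbackFailure);
                e.addSuppressed(primaryFailure);
                throw e;
            }
        }
    }
}
```

In the article's setting the primary would be the wmf2svg + batik route and the fallback the batik-only route; a caller can then decide whether a failed picture should abort the whole document or just be skipped.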
    b)       The width and height of the <img /> tag in the generated HTML.
If the picture was originally a png, simply using

// picture size as displayed in Word
int picHeight = pic.getHeight() * pic.getAspectRatioY() / 100;
int picWidth = pic.getAspectRatioX() * pic.getWidth() / 100;

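sets the picture's width and height to match the Word document. For wmf, however, two cases need to be distinguished:
- If the png produced by the conversion is wider than 500, treat it as an ordinary picture: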

BufferedImage  image = ImageIO.read(file);
int pheight = image.getHeight();
int pwidth = image.getWidth();
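- If the png produced by the conversion is no wider than 500, assume it is an ordinary formula, which should be about as wide as the text next to it. Here the width is set to 1.5 times the font size, and the height is: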

myHeight= (int) (pheight / (pwidth / (picSize * 1.0)) * 1.5);
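If the picture is neither wmf nor png (e.g. jpg), the method above for getting the height and width does not work; I do not know whether this is a POI bug. It can only be handled as follows: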

BufferedImage  image = ImageIO.read(file);
int pheight = image.getHeight();
int pwidth = image.getWidth();
i.e. the same as the first case for wmf above.
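
The sizing rules above can be checked with a small worked example. The sketch restates the two formulas from readPicture as pure functions; ImgSize and its method names are hypothetical, and fontSize is whatever value cr.getFontSize() returned in the listing.

```java
final class ImgSize {
    static final int MAX_WIDTH = 500;

    /** Ordinary picture: keep the size, but cap the width at MAX_WIDTH and scale the height to match. */
    static int[] fitOrdinary(int width, int height) {
        if (width > MAX_WIDTH) {
            height = MAX_WIDTH * height / width;
            width = MAX_WIDTH;
        }
        return new int[] { width, height };
    }

    /** Formula picture: the width becomes 1.5 x fontSize, and the height keeps the png's aspect ratio. */
    static int formulaHeight(int pngWidth, int pngHeight, int fontSize) {
        return (int) (pngHeight / (pngWidth / (fontSize * 1.0)) * 1.5);
    }
}
```

For a 120 x 40 png next to text with a font size value of 24, the ratio 120 / 24 is 5, so the height is 40 / 5 * 1.5 = 12 and the width is 24 * 1.5 = 36, which preserves the 3:1 shape of the formula.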
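V. Conclusion
That more or less covers converting Word to HTML. Learning while doing these past few days, and especially actually solving the problems, gave me a real sense of achievement. That said, the approach above still has the following open problems:
1) Reading tables:
a)       If a table contains another table, POI cannot tell them apart properly. For example, in a 2x2 table whose first-row, first-column cell contains another 2x2 table, POI interprets the first row as 2 + 2*2 = 6 cells and the second row as 2 cells, so the resulting table looks odd.
b)       Merged cells are not handled yet (I will see later whether this can be improved), so such tables also look odd.
c)       Pictures inside tables are not handled.
2) Reading pictures:
a) A few pictures still fail in the wmf -> png conversion and throw an exception, but this does not affect the overall function;
b) Some pictures generated from Word formulas have an unrecognizable format. I do not know whether POI cannot parse them or whether there is another cause: for some documents, picture files without an extension are generated, they cannot be read, and image tools cannot open them either. I have not found a good solution yet.
3) Reading the Word table of contents:
When the table of contents is read, its formatting codes are output as well.
4) Other unknown problems. In short, parsing Word with POI is hard going. Plain text is fine, but once the document contains pictures, tables, formulas and similar objects, POI is just too weak.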
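
For the table-of-contents problem in item 3, one possible direction, which I have not verified against the author's documents, is to strip field instructions before generating HTML. The assumption is that the stray codes are Word field instructions: in the binary .doc text stream a field starts with character 0x13, its instruction runs up to 0x14, and its visible result runs up to 0x15. FieldCodes is a hypothetical helper name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldCodes {
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    /** Drops field instructions and the three marker characters, keeping field results and normal text. */
    static String stripInstructions(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are still inside its instruction part
        Deque<Boolean> open = new ArrayDeque<>();
        int hidden = 0; // number of open fields that are currently in their instruction part
        for (char c : text.toCharArray()) {
            if (c == FIELD_BEGIN) {
                open.push(true);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(false); // from here on the field shows its result
                    hidden--;
                }
            } else if (c == FIELD_END) {
                if (!open.isEmpty() && open.pop()) {
                    hidden--; // field had no separator, so it was all instruction
                }
            } else if (hidden == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

If the assumption holds, this would also cover the "EMBED Equation.DSMT4" residue that writeFile currently removes with two replaceAll calls, since that text is the instruction of an embedded-object field.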

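Appendix:
1.    wmf files:
Microsoft Office clip art uses this format.
WMF is short for Windows Metafile. It is a graphics file format for the Windows platform defined by Microsoft.
Characteristics of the wmf format:
1)                wmf is a graphics format supported by the Microsoft Windows platform; at present, other operating systems such as Unix and Linux do not support it.
2)                Unlike bmp, wmf files are device independent, i.e. their output does not depend on the specific output device.
3)                The image is drawn entirely by GDI functions of the Win32 API.
4)                A wmf file usually takes far less disk space than other graphics formats.
5)                A metafile is not drawn as it is created. The GDI calls are recorded in the metafile and replayed later in a GDI environment to display the image.
6)                Displaying a metafile is slower than displaying other image formats, but creating one is much faster.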
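2.    About Batik
Batik is a Java-based toolkit for applications and applets that want to use images in SVG format for various purposes.
With Batik you can manipulate SVG documents anywhere Java is available, and use Batik modules in your application to generate, manipulate, and transcode SVG images. Batik makes it easy for Java-based applications or applets to deal with SVG content. For example, with Batik's SVG generator module, a Java application or applet can easily export its graphics in SVG format. With Batik's SVG viewing component, an application or applet can easily integrate SVG viewing and interaction. Another possibility is to use Batik's modules to convert SVG to other formats, such as raster images (JPEG, PNG, or TIFF) or other vector formats (EPS or PDF, the latter two via transcoders provided by Apache FOP). The Batik project aims to give developers a set of core modules that can be used together or individually to support specific SVG solutions. The main modules include the SVG Parser, the SVG Generator, and the SVG DOM. Another goal of the project is to be highly extensible.
(The SVG specification: Scalable Vector Graphics (SVG) is a W3C Recommendation. It defines an XML grammar for rich 2D graphics, including features such as transparency, geometric shapes, filter effects (shadows, lighting effects, etc.), scripting, and animation.)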