poi 读大数据

2018-06-15 12:09:46 zl_momomo 阅读数 8388

 

前言

poi的读取的三种模式

模式 说明 读写性
SXSSF 内存中保留一定行数数据,超过行数,将索引最低的数据刷入硬盘 只写
eventmodel 基于事件驱动,SAX的方式解析excel,cup和内存消耗低 只读
usermodel 传统方式,cpu和内存消耗大 可读可写

 

依赖包(3.17版本)

<!-- apache poi 操作Microsoft Document -->
		<dependency>
		    <groupId>org.apache.poi</groupId>
		    <artifactId>poi</artifactId>
		    <version>3.17</version>
		</dependency>
	  	<!-- Apache POI - Java API To Access Microsoft Format Files -->
		<dependency>
		    <groupId>org.apache.poi</groupId>
		    <artifactId>poi-ooxml</artifactId>
		    <version>3.17</version>
		</dependency>
		<!-- Apache POI - Java API To Access Microsoft Format Files -->
	  	<dependency>
		    <groupId>org.apache.poi</groupId>
		    <artifactId>poi-ooxml-schemas</artifactId>
		    <version>3.17</version>
		</dependency>
	  	<!-- poi eventmodel方式 依赖包 -->
	  	<dependency>
		    <groupId>xerces</groupId>
		    <artifactId>xercesImpl</artifactId>
		    <version>2.11.0</version>
		</dependency>

 

一、SXSSF (Since POI 3.8 beta3)

 

说明

3.8-beta3以来,POI提供了一个低内存占用的SXSSF API,它构建在XSSF之上。

SXSSF是一个兼容于api的XSSF的流扩展,当需要生成非常大的电子表格时,它将被使用,而堆空间是有限的。SXSSF通过限制对滑动窗口中的行的访问来实现它的低内存占用,而XSSF允许访问文档中的所有行。不再出现在窗口中的较老的行变得不可访问,因为它们被写到磁盘上

在自动刷新模式中,可以指定存取窗口的大小,以便在内存中持有一定数量的行。当达到这个值时,额外一行的创建会导致从存取窗口删除最低索引的行,并将其写到磁盘上。或者,窗口大小可以被设置为动态增长;根据需要,可以通过显式调用flushRows(int keepRows)定期对其进行修剪。

由于实现的流特性,与XSSF相比有以下限制:

  • 只有有限数量的行可以在某个时间点访问。
  • 不支持Sheet.clone()。
  • 不支持公式评估

更多细节

下面的表格对POI的电子表格API的比较特性进行了比较:

示例-写数据到excel

下面的例子写了一张有百行窗口的表格。当行数达到101时,rownum=0的行会被刷新到磁盘,并从内存中删除,当rownum达到102时,          rownum=1的行被刷新

@Test
	public void test() throws Exception{
		Long start = System.currentTimeMillis();
		//内存最大存放100行数据 超过100自动刷新到硬盘中
		SXSSFWorkbook wb = new SXSSFWorkbook(100); // keep 100 rows in memory, exceeding rows will be flushed to disk
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 500000; rownum++){
            Row row = sh.createRow(rownum);//一行
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell = row.createCell(cellnum); //一行中一个方格
                String address = new CellReference(cell).formatAsString();
                cell.setCellValue(address);
            }

        }  
		
        FileOutputStream out = new FileOutputStream("f:/temp/sxssf.xlsx");
        wb.write(out);
        out.close();

        // dispose of temporary files backing this workbook on disk
        //处理在磁盘上支持本工作簿的临时文件
        wb.dispose();
        wb.close();
        Long end = System.currentTimeMillis();
        System.out.println(end - start + "ms"); //50万条数据写入大概在16秒
		
	}

 

下一个例子关闭了自动刷新(windows size=-1),代码手动控制将数据的部分写入磁盘

 

@Test
	public void test3() throws IOException{
		Long start = System.currentTimeMillis();
		SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 1000; rownum++){
            Row row = sh.createRow(rownum);
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell = row.createCell(cellnum);
                String address = new CellReference(cell).formatAsString();
                cell.setCellValue(address);
            }

           // manually control how rows are flushed to disk 
           if(rownum % 100 == 0) {
                ((SXSSFSheet)sh).flushRows(100); // retain 100 last rows and flush all others

                // ((SXSSFSheet)sh).flushRows() is a shortcut for ((SXSSFSheet)sh).flushRows(0),
                // this method flushes all rows
           }

        }

        FileOutputStream out = new FileOutputStream("f:/temp/sxssf.xlsx");
        wb.write(out);
        out.close();

        // dispose of temporary files backing this workbook on disk
        wb.dispose();
        Long end = System.currentTimeMillis();
        System.out.println(end - start + "ms"); //100条数据 650ms
	}

小结

其核心是减少存储在内存当中的数据,达到一定行数就存储到硬盘的临时文件中

 

二、XSSF and SAX (Event API)

说明

如果内存占用是一个问题,那么对于XSSF来说,您可以获得底层XML数据,并自己处理它。

要使用这个API,您可以构建一个org.apache.poi.xssf.eventmodel.xssfreader的实例。这将在共享字符串表和样式上提供一个不错的接口。它提供了从文件的其余部分获取原始xml数据的方法,然后您将把这些数据传递给SAX。

背景

 

Excel2003与Excel2007

两个版本的最大行数和列数不同,2003版最大行数是65536行,最大列数是256列,2007版及以后的版本最大行数是1048576行,最大列数是16384列。

excel2003是以二进制的方式存储,这种格式不易被其他软件读取使用;而excel2007采用了基于XML的ooxml开放文档标准,ooxml使用XML和ZIP技术结合进行文件存储,XML是一个基于文本的格式,而且ZIP容器支持内容的压缩,所以其一大优势是可以大大减小文件的尺寸。

把xlsx后缀改为zip,打开文件发现目录

打开xl

sharedStrings.xml  共享字符串

styles.xml  excel的样式数据

workbooks.xml excel的sheet

示例

package com.java.poi;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

import com.mysql.jdbc.util.LRUCache;

public class ExampleEventUserModel {
	public void processFirstSheet(String filename) throws Exception {
		try(OPCPackage pkg = OPCPackage.open(filename,PackageAccess.READ);){
			XSSFReader r = new XSSFReader( pkg );
			SharedStringsTable sst = r.getSharedStringsTable();
			
			XMLReader parser = fetchSheetParser(sst);
			//process the first sheet
			try(InputStream sheet = r.getSheetsData().next()){
				InputSource sheetSource = new InputSource(sheet);
				parser.parse(sheetSource);
			}
		}

	}
	
	
	public void processAllSheets(String filename) throws Exception {
		try (OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ)) {
            XSSFReader r = new XSSFReader(pkg);
            SharedStringsTable sst = r.getSharedStringsTable();

            XMLReader parser = fetchSheetParser(sst);

            Iterator<InputStream> sheets = r.getSheetsData();
            while (sheets.hasNext()) {
                System.out.println("Processing new sheet:\n");
                try (InputStream sheet = sheets.next()) {
					InputSource sheetSource = new InputSource(sheet);
					parser.parse(sheetSource);
				}
                System.out.println("");
            }
        }
	}
	
	public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
		XMLReader parser =SAXHelper.newXMLReader();
		ContentHandler handler = new SheetHandler(sst);
		parser.setContentHandler(handler);
		return parser;
	}
	
	
	
	/** 
	 * See org.xml.sax.helpers.DefaultHandler javadocs 重写 startElement characters endElements方法 
	 */
	private static class SheetHandler extends DefaultHandler {
		private SharedStringsTable sst;
		private String lastContents;
		private boolean nextIsString; //是否为string格式标识
	    private final LruCache<Integer,String> lruCache = new LruCache<>(60);
		/*private int sheetIndex = -1;
		private int curRow = 0;
		private int curCol = 0;
		private List<String> rowlist = new ArrayList<String>(); */  
		
	    /**
	     * 缓存
	     * @author Administrator
	     *
	     * @param <A>
	     * @param <B>
	     */
		private static class LruCache<A,B> extends LinkedHashMap<A, B> {
            private final int maxEntries;

            public LruCache(final int maxEntries) {
                super(maxEntries + 1, 1.0f, true);
                this.maxEntries = maxEntries;
            }

            @Override
            protected boolean removeEldestEntry(final Map.Entry<A, B> eldest) {
                return super.size() > maxEntries;
            }
        }
		
		private SheetHandler(SharedStringsTable sst) {
			this.sst = sst;
		}
		
		/**  
	     * 该方法自动被调用,每读一行调用一次,在方法中写自己的业务逻辑即可 
	     * @param sheetIndex 工作簿序号 
	     * @param curRow 处理到第几行 
	     * @param rowList 当前数据行的数据集合 
	     */  
	   /* public void optRow(int sheetIndex, int curRow, List<String> rowList) {   
	        String temp = "";   
	        for(String str : rowList) {   
	            temp += str + "_";   
	        } 
	        this.rowlist.clear();
	        this.curRow++;
	        this.curCol=0;
	        System.out.println(temp);   
	    } */
		
		@Override
		public void startElement(String uri, String localName, String name,
				Attributes attributes) throws SAXException {
			// c => cell 代表单元格
			if(name.equals("c")) {
				// Print the cell reference
				//获取单元格的位置,如A1,B1
				System.out.print(attributes.getValue("r") + " - "); 
				// Figure out if the value is an index in the SST 如果下一个元素是 SST 的索引,则将nextIsString标记为true
				//单元格类型
				String cellType = attributes.getValue("t"); 
				//cellType值 s:字符串 b:布尔 e:错误处理
				if(cellType != null && cellType.equals("s")) { 
					//标识为true 交给后续endElement处理
					nextIsString = true;
				} else {
					nextIsString = false;
				}
			}
			// Clear contents cache
			lastContents = "";
		}
		
		/**
		 * 得到单元格对应的索引值或是内容值
		 * 如果单元格类型是字符串、INLINESTR、数字、日期,lastIndex则是索引值
		 * 如果单元格类型是布尔值、错误、公式,lastIndex则是内容值
		 */
		@Override
		public void characters(char[] ch, int start, int length)
				throws SAXException {
			lastContents += new String(ch, start, length);
		}
		
		@Override
		public void endElement(String uri, String localName, String name)
				throws SAXException {
			// Process the last contents as required.
			// Do now, as characters() may be called more than once
			if(nextIsString) {
				int idx = Integer.parseInt(lastContents);
				lastContents = lruCache.get(idx);
				//如果内容为空 或者Cache中存在相同key 不保存到Cache中
				if(lastContents == null &&!lruCache.containsKey(idx)){
					lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
					lruCache.put(idx, lastContents);
				}
				nextIsString = false;
			}

			// v => contents of a cell
			// Output after we've seen the string contents
			if(name.equals("v")) {
				System.out.println(lastContents);
				//rowlist.add(curCol++,lastContents);
			}else{
				//如果标签名称为 row , 已到行尾
				if(name.equals("row")){
					//optRow(sheetIndex, curRow, rowlist);
					System.out.println(lruCache);
					lruCache.clear();
				}
			}
		}

		
	}
	
	public static void main(String[] args) throws Exception {
		new ExampleEventUserModel().processFirstSheet("F:/temp/template.xlsx");
	}
	
}

SheetHandler类说明:程序依次调用重写的startElement,characters,endElement方法

 

小结

excel2007后采用了基于XML的ooxml开放文档标准,通过操作原始xml数据的方法获得数据。

 

三、User API (HSSF and XSSF)

 

 

New Workbook 

Workbook wb = new HSSFWorkbook();
    ...
    try  (OutputStream fileOut = new FileOutputStream("workbook.xls")) {
        wb.write(fileOut);
    }

    Workbook wb = new XSSFWorkbook();
    ...
    try (OutputStream fileOut = new FileOutputStream("workbook.xlsx")) {
        wb.write(fileOut);
    }

 

Files vs InputStreams

在打开一本工作簿时,不管是一个.xls HSSFWorkbook,还是一个.xlsx XSSFWorkbook,工作簿都可以从文件或InputStream中加载。使用File对象可以降低内存消耗,而InputStream则需要更多的内存,因为它必须缓冲整个文件。

如果使用WorkbookFactory,它很容易使用其中一个或另一个:

// Use a file
  Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));

  // Use an InputStream, needs more memory
  Workbook wb = WorkbookFactory.create(new FileInputStream("MyExcel.xlsx"));

 

如果直接使用HSSFWorkbook或XSSFWorkbook,您通常应该通过 NPOIFSFileSystemOPCPackage的使用来完全控制生命周期(包括在完成时关闭文件):

 

// HSSFWorkbook, File
  NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("file.xls"));
  HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
  ....
  fs.close();

  // HSSFWorkbook, InputStream, needs more memory
  NPOIFSFileSystem fs = new NPOIFSFileSystem(myInputStream);
  HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);

  // XSSFWorkbook, File
  OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
  XSSFWorkbook wb = new XSSFWorkbook(pkg);
  ....
  pkg.close();

  // XSSFWorkbook, InputStream, needs more memory
  OPCPackage pkg = OPCPackage.open(myInputStream);
  XSSFWorkbook wb = new XSSFWorkbook(pkg);
  ....
  pkg.close();

 行列操作

 Workbook wb = WorkbookFactory.create(inputStream);
 Sheet sheet = wb.getSheetAt(0);
 //行数
 int lastRowNum = sheet.getPhysicalNumberOfRows();
 //读取body
 for(int rowNum = 1;rowNum <lastRowNum;rowNum++){
     //获得当前行
     Row row = sheet.getRow(rowNum); //index序列号从0开始,跳过首行,取1
     //当前列数
     int lastCellNum = row.getPhysicalNumberOfCells();
     for(int cellNum=0;cellNum<lastCellNum;cellNum++){
         //当前行的某列,确定小方格
         Cell cell = row.getCell(cellNum);
         String str = cell.getStringCellValue();//获取字符值,当然提供不同的接口获得不同类型值
         // TODO
     }
 wb.close();
 }

 

小结:

数据量比较小使用NPOIFSFileSystem或OPCPackage来操作excel,并尽可能使用文件对象参数

 

参考 apache poi官网文档 (随版本更新,api可能有变动)

 

 

 

 

 

 

 

 

 

2016-01-12 16:11:33 huoyizu 阅读数 953
package org.hzjun.hlt.excel;


import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;


import javax.xml.parsers.ParserConfigurationException;


import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;


/**
 * 针对poi加载xlsx大数据量的内存溢出,用此方法加载数据保证不溢出
 */
public class XlsxProcessHandler {
/**
* Uses the XSSF Event SAX helpers to do most of the work of parsing the
* Sheet XML, and outputs the contents as a (basic) CSV.
*/
private class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
private List<String> rowList = null;


private void outputMissingRows(int number) {
// System.out.println( "number=" + number );
for (int i = 0; i < number; i++) {
for (int j = 0; j < minColumns; j++) {
// output.append( ',' );
}
// output.append( '\n' );
}
}


public void startRow(int rowNum) {
// If there were gaps, output the missing rows
outputMissingRows( rowNum - currentRow - 1 );
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
}


public void endRow(int rowNum) {
// Ensure the minimum number of columns
for (int i = currentCol; i < minColumns; i++) {
// output.append( ',' );
}
// output.append( '\n' );
XlsxProcessHandler.this.processRow( rowList );
}


public void cell(String cellReference, String formattedValue, XSSFComment comment) {
// System.out.println( comment );
if (firstCellOfRow) {
firstCellOfRow = false;
rowList = new ArrayList<String>();
} else {
// output.append( ',' );
}


// Did we miss any cells?
int thisCol = (new CellReference( cellReference )).getCol();
int missedCols = thisCol - currentCol - 1;
for (int i = 0; i < missedCols; i++) {
// output.append( ',' );
}
currentCol = thisCol;


// Number or string?
try {
rowList.add( formattedValue );
// Double.parseDouble( formattedValue );
// output.append( formattedValue );
} catch (NumberFormatException e) {
// output.append( '"' );
// output.append( formattedValue );
// output.append( '"' );
}
}


public void headerFooter(String text, boolean isHeader, String tagName) {
// Skip, no headers or footers in CSV
}
}


// /////////////////////////////////////


private OPCPackage xlsxPackage;


/**
* Number of columns to read starting with leftmost
*/
private int minColumns;


/**
* Destination for data
*/
// private PrintStream output;
/**
* Creates a new XLSX -> CSV converter

* @param pkg
*            The XLSX package to process
* @param output
*            The PrintStream to output the CSV to
* @param minColumns
*            The minimum number of columns to output, or -1 for no minimum
*/
// public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
// }
/**
* Parses and shows the content of one sheet using the specified styles and
* shared-strings tables.

* @param styles
* @param strings
* @param sheetInputStream
*/
public void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings, SheetContentsHandler sheetHandler, InputStream sheetInputStream)
throws IOException, ParserConfigurationException, SAXException {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource( sheetInputStream );
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler( styles, null, strings, sheetHandler, formatter, false );
sheetParser.setContentHandler( handler );
sheetParser.parse( sheetSource );
} catch (ParserConfigurationException e) {
throw new RuntimeException( "SAX parser appears to be broken - " + e.getMessage() );
}
}


/**
* Initiates the processing of the XLS workbook file to CSV.

* @throws IOException
* @throws OpenXML4JException
* @throws ParserConfigurationException
* @throws SAXException
*/
public void process() throws IOException, OpenXML4JException, ParserConfigurationException, SAXException {
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable( this.xlsxPackage );
XSSFReader xssfReader = new XSSFReader( this.xlsxPackage );
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
// 暂时只处理一个sheet
if (index > 0)
continue;
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
// this.output.println();
// this.output.println( sheetName + " [index=" + index + "]:" );
processSheet( styles, strings, new SheetToCSV(), stream );
stream.close();
++index;
}
}


public void processExcel(String filePath) {
// "f:/workbook1452231301852.xlsx"
File xlsxFile = new File( filePath );


try {
OPCPackage p = OPCPackage.open( xlsxFile.getPath(), PackageAccess.READ );
// XLSX2CSV xlsx2csv = new XLSX2CSV( p, System.out, minColumns );
this.xlsxPackage = p;
// this.output = output;
// this.minColumns = minColumns;
process();
p.close();
} catch (Exception e) {
e.printStackTrace();
}


}


public interface RowHandler {
public void processRow(List<String> rowList);
}


public void processRow(List<String> rowList) {


handler.processRow( rowList );
}


private RowHandler handler;


public XlsxProcessHandler(RowHandler handler) {
this.handler = handler;
}


public static void main(String[] args) throws Exception {


XlsxProcessHandler a = new XlsxProcessHandler( new RowHandler() {
@Override
public void processRow(List<String> row) {
System.out.println( row );
}


} );
a.processExcel( "f:/workbook1452231301852.xlsx" );


}
}
2018-05-07 17:38:19 qq_37511875 阅读数 2959

另一篇文章http://www.cnblogs.com/tootwo2/p/8120053.html里面有xml的一些解释。

大数据量的excel一般都是.xlsx格式的,网上使用POI读写的例子比较多,但是很少提到读写非常大数据量的excel的例子,POI官网上提到XSSF有三种读写excel,POI地址:http://poi.apache.org/spreadsheet/index.html。官网的图片:

可以看到有三种模式:

1、eventmodel方式,基于事件驱动,SAX的方式解析excel(.xlsx是基于OOXML的),CPU和内存消耗非常低,但是只能读不能写

2、usermodel,就是我们一般使用的方式,这种方式可以读可以写,但是CPU和内存消耗非常大

3、SXSSF,POI3.8以后开始支持,这种方式只能写excel

下面介绍下使用方式(官网地址:http://poi.apache.org/spreadsheet/how-to.html):

第一种方式:

pom文件需要添加依赖:

复制代码
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xerces</artifactId>
    <version>2.4.0</version>
</dependency>
复制代码

java官网示例代码:

复制代码
package excel;

import java.io.InputStream;
import java.util.Iterator;

import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ExampleEventUserModel {
    
    
    public void processOneSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        // To look up the Sheet Name / Sheet Order / rID,
        //  you need to process the core Workbook stream.
        // Normally it's of the form rId# or rSheet#
        InputStream sheet2 = r.getSheet("rId2");
        InputSource sheetSource = new InputSource(sheet2);
        parser.parse(sheetSource);
        sheet2.close();
    }

    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();
        
        XMLReader parser = fetchSheetParser(sst);

        Iterator<InputStream> sheets = r.getSheetsData();
        while(sheets.hasNext()) {
            System.out.println("Processing new sheet:\n");
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
            System.out.println("");
        }
    }

    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
        XMLReader parser =
            XMLReaderFactory.createXMLReader(
                    "com.sun.org.apache.xerces.internal.parsers.SAXParser"
            );
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }

    /** 
     * See org.xml.sax.helpers.DefaultHandler javadocs 
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;
        
        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }
        
        public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
            // c => cell
            if(name.equals("c")) {
                // Print the cell reference
                System.out.print(attributes.getValue("r") + " - ");
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // Clear contents cache
            lastContents = "";
        }
        
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                System.out.println(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            lastContents += new String(ch, start, length);
        }
    }
    
    public static void main(String[] args) throws Exception {
        ExampleEventUserModel example = new ExampleEventUserModel();
        System.out.println("11");
        example.processOneSheet(args[0]);
        example.processAllSheets(args[0]);
    }
}
复制代码

运行的时候使用本地的文件地址替代main函数里面的参数就可以运行(亲测可以)。

第三种方式:

其核心是减少存储在内存当中的数据,达到一定行数就存储到硬盘的临时文件中。

pom文件需要增加依赖:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
</dependency>

java代码如下:

复制代码
package excel;


//import junit.framework.Assert;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SXSSFDemo {
    public static void main(String[] args) throws Throwable {
        
        SXSSFWorkbook wb = new SXSSFWorkbook(100); // 在内存当中保持 100 行 , 超过的数据放到硬盘中
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 10000; rownum++){
            Row row = sh.createRow(rownum);
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell = row.createCell(cellnum);
                String address = new CellReference(cell).formatAsString();
                cell.setCellValue(address);
            }

        }     
        
        FileOutputStream out = new FileOutputStream("/Users/tootwo2/Documents/sxssf.xlsx");
        wb.write(out);
        out.close();

        // dispose of temporary files backing this workbook on disk
        wb.dispose();
    }

}
转载:

另一篇文章http://www.cnblogs.com/tootwo2/p/8120053.html里面有xml的一些解释。

大数据量的excel一般都是.xlsx格式的,网上使用POI读写的例子比较多,但是很少提到读写非常大数据量的excel的例子,POI官网上提到XSSF有三种读写excel,POI地址:http://poi.apache.org/spreadsheet/index.html。官网的图片:

可以看到有三种模式:

1、eventmodel方式,基于事件驱动,SAX的方式解析excel(.xlsx是基于OOXML的),CPU和内存消耗非常低,但是只能读不能写

2、usermodel,就是我们一般使用的方式,这种方式可以读可以写,但是CPU和内存消耗非常大

3、SXSSF,POI3.8以后开始支持,这种方式只能写excel

下面介绍下使用方式(官网地址:http://poi.apache.org/spreadsheet/how-to.html):

第一种方式:

pom文件需要添加依赖:

复制代码
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xerces</artifactId>
    <version>2.4.0</version>
</dependency>
复制代码

java官网示例代码:

复制代码
package excel;

import java.io.InputStream;
import java.util.Iterator;

import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ExampleEventUserModel {
    
    
    public void processOneSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        // To look up the Sheet Name / Sheet Order / rID,
        //  you need to process the core Workbook stream.
        // Normally it's of the form rId# or rSheet#
        InputStream sheet2 = r.getSheet("rId2");
        InputSource sheetSource = new InputSource(sheet2);
        parser.parse(sheetSource);
        sheet2.close();
    }

    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();
        
        XMLReader parser = fetchSheetParser(sst);

        Iterator<InputStream> sheets = r.getSheetsData();
        while(sheets.hasNext()) {
            System.out.println("Processing new sheet:\n");
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
            System.out.println("");
        }
    }

    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
        XMLReader parser =
            XMLReaderFactory.createXMLReader(
                    "com.sun.org.apache.xerces.internal.parsers.SAXParser"
            );
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }

    /** 
     * See org.xml.sax.helpers.DefaultHandler javadocs 
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;
        
        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }
        
        public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
            // c => cell
            if(name.equals("c")) {
                // Print the cell reference
                System.out.print(attributes.getValue("r") + " - ");
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // Clear contents cache
            lastContents = "";
        }
        
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                System.out.println(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            lastContents += new String(ch, start, length);
        }
    }
    
    public static void main(String[] args) throws Exception {
        ExampleEventUserModel example = new ExampleEventUserModel();
        System.out.println("11");
        example.processOneSheet(args[0]);
        example.processAllSheets(args[0]);
    }
}
复制代码

运行的时候使用本地的文件地址替代main函数里面的参数就可以运行(亲测可以)。

第三种方式:

其核心是减少存储在内存当中的数据,达到一定行数就存储到硬盘的临时文件中。

pom文件需要增加依赖:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
</dependency>

java代码如下:

复制代码
package excel;


//import junit.framework.Assert;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SXSSFDemo {
    public static void main(String[] args) throws Throwable {
        
        SXSSFWorkbook wb = new SXSSFWorkbook(100); // 在内存当中保持 100 行 , 超过的数据放到硬盘中
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 10000; rownum++){
            Row row = sh.createRow(rownum);
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell = row.createCell(cellnum);
                String address = new CellReference(cell).formatAsString();
                cell.setCellValue(address);
            }

        }     
        
        FileOutputStream out = new FileOutputStream("/Users/tootwo2/Documents/sxssf.xlsx");
        wb.write(out);
        out.close();

        // dispose of temporary files backing this workbook on disk
        wb.dispose();
    }

}
转载:https://www.cnblogs.com/tootwo2/p/6683143.html
2017-07-06 10:46:01 u012809273 阅读数 5892
package com.jk.controller.user;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.ss.formula.functions.T;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.tomcat.jni.User;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

import com.jk.model.user.UserBean;
import com.jk.service.user.UserService;

public class ReadExcel extends DefaultHandler {
	private SharedStringsTable sst;
	private String lastContents;
	private boolean nextIsString;
	private int sheetIndex = -1;
	private List<String> rowlist = new ArrayList<String>();
	private int curRow = 0;
	private int curCol = 0;
	public List<UserBean> dataList = new ArrayList<UserBean>();
	private UserService userService;
	public ReadExcel(UserService userService) {
		super();
		this.userService = userService;
	}
    /**  
     * 读取所有工作簿的入口方法  
     * @param path  
     * @throws Exception 
     */  
    public void process(InputStream inputStream) throws Exception {   
        OPCPackage pkg = OPCPackage.open(inputStream);
        XSSFReader r = new XSSFReader(pkg);   
        SharedStringsTable sst = r.getSharedStringsTable();   
        XMLReader parser = fetchSheetParser(sst);   
        Iterator<InputStream> sheets = r.getSheetsData();   
        while (sheets.hasNext()) {   
            curRow = 0;   
            sheetIndex++;   
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);   
            parser.parse(sheetSource);   
            sheet.close();   
        }   
    }   
    /**  
     * 该方法自动被调用,每读一行调用一次,在方法中写自己的业务逻辑即可 
     * @param sheetIndex 工作簿序号 
     * @param curRow 处理到第几行 
     * @param rowList 当前数据行的数据集合 
     */  
    public void optRow(int sheetIndex, int curRow, List<String> rowList) {   
    	if (curRow >=1) {
    		if (dataList.size() >= 500 || rowlist.size() == 0) {
    			userService.addUserBatch(dataList);
    			dataList.clear();
    		}
    		if (rowlist.size() > 0) {
    			UserBean userBean = new UserBean();
        		userBean.setName(rowList.get(0));
        		userBean.setSex(rowList.get(1).trim().equals("男") ? 1:0);
        		userBean.setRegisTime(rowList.get(2));
        		userBean.setLoginNumer(rowList.get(3));
        		userBean.setPassword(rowList.get(4));
        		userBean.setHeadImg(rowList.get(5));
        		String status = rowList.get(6);
        		if (status.trim().equals("正常")) {
        			userBean.setStatus(1);
        		}else if(status.trim().equals("冻结")){
        			userBean.setStatus(2);
        		}else if(status.trim().equals("账户异常")){
        			userBean.setStatus(3);
        		}else if(status.trim().equals("锁定")){
        			userBean.setStatus(4);
        		}
        		userBean.setRemark(rowList.get(7));
        		dataList.add(userBean);
			}
		}
    }   
       
    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {   
        XMLReader parser = XMLReaderFactory   
                .createXMLReader("org.apache.xerces.parsers.SAXParser");   
        this.sst = sst;   
        parser.setContentHandler(this);   
        return parser;   
    }   
       
    public void startElement(String uri, String localName, String name,   
            Attributes attributes) throws SAXException {   
        // c => 单元格  
        if (name.equals("c")) {   
            // 如果下一个元素是 SST 的索引,则将nextIsString标记为true  
            String cellType = attributes.getValue("t");   
            if (cellType != null && cellType.equals("s")) {   
                nextIsString = true;   
            } else {   
                nextIsString = false;   
            }   
        }   
        // 置空   
        lastContents = "";   
    }   
       
    public void endElement(String uri, String localName, String name)   
            throws SAXException { 
    	if (localName.equals("worksheet")) {
    		rowlist.clear();
    		optRow(sheetIndex, 1, rowlist);
    		curRow = 0;
		}
        // 根据SST的索引值的到单元格的真正要存储的字符串  
        // 这时characters()方法可能会被调用多次  
        if (nextIsString) {   
            try {   
                int idx = Integer.parseInt(lastContents);   
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx))   
                        .toString();   
            } catch (Exception e) {   
  
            }   
        }   
        // v => 单元格的值,如果单元格是字符串则v标签的值为该字符串在SST中的索引  
        // 将单元格内容加入rowlist中,在这之前先去掉字符串前后的空白符  
        if (name.equals("v")) {   
            String value = lastContents.trim();   
            value = value.equals("") ? " " : value;   
            rowlist.add(curCol, value);   
            curCol++;   
        } else {   
            // 如果标签名称为 row ,这说明已到行尾,调用 optRows() 方法  
            if (name.equals("row")) {   
                optRow(sheetIndex, curRow, rowlist);   
                rowlist.clear();   
                curRow++;   
                curCol = 0;   
            }   
        }   
    }   
    public void characters(char[] ch, int start, int length)   
            throws SAXException {   
        // 得到单元格内容的值  
        lastContents += new String(ch, start, length);   
    }
}

2018-05-22 00:33:37 sinat_30314715 阅读数 15043

java代码使用poi的API解决在读取大数据量的Excel数据时候内存溢出的问题:首先我需要声明下面的工具类是在老袁博客(https://laoyuan.me/posts/java-read-big-excel-with-poi.html)基础上做了稍微的改造,我将老袁的的工具类需要2个参数改成只需要一个参数就可以完成调用,当然你可以根据你自己的情况使用。
下面是一个工具类,复制到自己的项目中直接调用即可:

1、工具类

package com.xxx.xxx.xxx;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
/**
 * 解析大数据量Excel工具类
 * @author RobinTime
 *
 */
@Component
public class ExcelParser {
    private static final Logger logger = LoggerFactory.getLogger(ExcelParser.class);
    /**
     * 表格默认处理器
     */
    private ISheetContentHandler contentHandler = new DefaultSheetHandler();
    /**
     * 读取数据
     */
    private List<String[]> datas = new ArrayList<String[]>();

    /**
     * 转换表格,默认为转换第一个表格
     * @param stream
     * @return
     * @throws InvalidFormatException
     * @throws IOException
     * @throws ParseException
     */
    public ExcelParser parse(InputStream stream)
            throws InvalidFormatException, IOException, ParseException {
        return parse(stream, 1);
    }


    /**
     * 
     * @param stream
     * @param sheetId:为要遍历的sheet索引,从1开始
     * @return
     * @throws InvalidFormatException
     * @throws IOException
     * @throws ParseException
     */
    public synchronized ExcelParser parse(InputStream stream, int sheetId)
            throws InvalidFormatException, IOException, ParseException {
        // 每次转换前都清空数据
        datas.clear();
        // 打开表格文件输入流
        OPCPackage pkg = OPCPackage.open(stream);
        try {
            // 创建表阅读器
            XSSFReader reader;
            try {
                reader = new XSSFReader(pkg);
            } catch (OpenXML4JException e) {
                logger.error("读取表格出错");
                throw new ParseException(e.fillInStackTrace());
            }

            // 转换指定单元表
            InputStream shellStream = reader.getSheet("rId" + sheetId);
            try {
                InputSource sheetSource = new InputSource(shellStream);
                StylesTable styles = reader.getStylesTable();
                ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
                getContentHandler().init(datas);// 设置读取出的数据
                // 获取转换器
                XMLReader parser = getSheetParser(styles, strings);
                parser.parse(sheetSource);
            } catch (SAXException e) {
                logger.error("读取表格出错");
                throw new ParseException(e.fillInStackTrace());
            } finally {
                shellStream.close();
            }
        } finally {
            pkg.close();

        }
        return this;

    }

    /**
     * 获取表格读取数据,获取数据前,需要先转换数据<br>
     * 此方法不会获取第一行数据
     * 
     * @return 表格读取数据
     */
    public List<String[]> getDatas() {
        return getDatas(true);

    }

    /**
     * 获取表格读取数据,获取数据前,需要先转换数据
     * 
     * @param dropFirstRow
     *            删除第一行表头记录
     * @return 表格读取数据
     */
    public List<String[]> getDatas(boolean dropFirstRow) {
        if (dropFirstRow && datas.size() > 0) {
            datas.remove(0);// 删除表头
        }
        return datas;

    }

    /**
     * 获取读取表格的转换器
     * 
     * @return 读取表格的转换器
     * @throws SAXException
     *             SAX错误
     */
    protected XMLReader getSheetParser(StylesTable styles, ReadOnlySharedStringsTable strings) throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new XSSFSheetXMLHandler(styles, strings, getContentHandler(), false));
        return parser;
    }

    public ISheetContentHandler getContentHandler() {
        return contentHandler;
    }

    public void setContentHandler(ISheetContentHandler contentHandler) {
        this.contentHandler = contentHandler;
    }

    /**
     * 表格转换错误
     */
    public class ParseException extends Exception {
        private static final long serialVersionUID = -2451526411018517607L;

        public ParseException(Throwable t) {
            super("表格转换错误", t);
        }

    }

    public interface ISheetContentHandler extends SheetContentsHandler {

        /**
         * 设置转换后的数据集,用于存放转换结果
         * 
         * @param datas
         *            转换结果
         */
        void init(List<String[]> datas);
    }

    /**
     * 默认表格解析handder
     */
    class DefaultSheetHandler implements ISheetContentHandler {
        /**
         * 读取数据
         */
        private List<String[]> datas;
        private int columsLength;
        // 读取行信息
        private String[] readRow;
        private ArrayList<String> fristRow = new ArrayList<String>();

        @Override
        public void init(List<String[]> datas) {
            this.datas = datas;
//          this.columsLength = columsLength;
        }

        @Override
        public void startRow(int rowNum) {
            if (rowNum != 0) {
                readRow = new String[columsLength];
            }
        }

        @Override
        public void endRow(int rowNum) {
        //将Excel第一行表头的列数当做数组的长度,要保证后续的行的列数不能超过这个长度,这是个约定。
            if (rowNum == 0) {
                columsLength = fristRow.size();
                readRow = fristRow.toArray(new String[fristRow.size()]);
            }else {
                readRow = fristRow.toArray(new String[columsLength]);
            }
            datas.add(readRow.clone());
            readRow = null;
            fristRow.clear();
        }

        @Override
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            int index = getCellIndex(cellReference);//转换A1,B1,C1等表格位置为真实索引位置
            try {
                fristRow.set(index, formattedValue);
            } catch (IndexOutOfBoundsException e) {
                int size = fristRow.size();
                for (int i = index - size+1;i>0;i--){
                    fristRow.add(null);
                }
                fristRow.set(index,formattedValue);
            }
        }

        @Override
        public void headerFooter(String text, boolean isHeader, String tagName) {
        }

        /**
         * 转换表格引用为列编号
         * 
         * @param cellReference
         *            列引用
         * @return 表格列位置,从0开始算
         */
        public int getCellIndex(String cellReference) {
            String ref = cellReference.replaceAll("\\d+", "");
            int num = 0;
            int result = 0;
            for (int i = 0; i < ref.length(); i++) {
                char ch = cellReference.charAt(ref.length() - i - 1);
                num = (int) (ch - 'A' + 1);
                num *= Math.pow(26, i);
                result += num;
            }
            return result - 1;
        }
    }
}

2、调用

File tempFile = new File(this.getClass().getClassLoader().getResource("").getPath() + "tempFile\\" + (new Date()).getTime() + ".xlsx");
//传入一个路径产生流再将流传入工具类,返回解析对象,Excel的所有数据就被解析到List<String[]> 里面,遍历list任由你处置。
FileInputStream inputStream = new FileInputStream(tempFile);
ExcelParser parse = excelParser.parse(inputStream);
List<String[]> datas = parse.getDatas();