Featured content
  • Unicode can be stored as UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes), or UTF-8 (1-4 bytes). UTF-16 is widely used in memory, but files usually store UTF-8 because it saves space. In Python 3, encode() turns a string into the bytes type at the same time as it encodes it, ...
  • Python 3: reading UTF-8 and GBK files, encoding conversion, and tests

    Execution environment:

    The cmd console runs with code page 936, i.e. GBK.

    PyCharm runs with UTF-8 (a multi-byte encoding).

    Setting the console encoding in the Windows cmd command line:

    At the command line, enter chcp to display the current active code page number, which is effectively the encoding of the current environment.

    You can see that the current ANSI code page is 936, i.e. GBK.

    UTF-8 corresponds to code page 65001, so running chcp 65001 switches the console to UTF-8,

    and likewise chcp 936 switches it back to GBK.

    Python 3's default codec when handling text is UTF-8.

    1. Testing a UTF-8 text file

    Create a text file on the desktop, saved as UTF-8, with a few Chinese characters in it.

    Read it in binary mode in PyCharm, then decode and print it:

    # Read the file in binary mode
    with open('../test.txt','rb') as f:
        data = f.read()
        print(data)
        print(data.decode('utf-8'))  # decode the bytes as UTF-8
    

    Running the code in PyCharm and in cmd prints the raw bytes followed by the decoded text (screenshots omitted).
    If you change decode('utf-8') above to decode('gbk'), everything fails with:
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 2: illegal multibyte sequence
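
    As a side note (not part of the original post), bytes.decode() also accepts an errors= argument, which is a handy way to see what a wrong codec does to the data without crashing:

    # A minimal sketch: decoding with error handling instead of raising
    with open('../test.txt', 'rb') as f:
        data = f.read()
    print(data.decode('gbk', errors='replace'))  # undecodable bytes become U+FFFD
    print(data.decode('gbk', errors='ignore'))   # undecodable bytes are dropped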

    2. Testing a GBK text file (same procedure as above)

    This time the contents of test.txt are saved with GBK encoding.

    Only the decoding line needs to change:

    print(data.decode('gbk'))  # decode the bytes as GBK

    The rest of the code is the same; following the steps above, the test succeeds.

    If you change decode('gbk') to decode('utf-8'), everything fails with:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
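
    Instead of reading bytes and decoding by hand, Python 3's open() can also decode for you in text mode via its encoding= parameter. A minimal sketch of the same test:

    # Let open() decode the file directly (text mode)
    with open('../test.txt', 'r', encoding='gbk') as f:   # use encoding='utf-8' for the UTF-8 file
        print(f.read())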

    3. Converting the UTF-8 text to another encoding and printing it

    # Read the file, decode from UTF-8, then re-encode as GB18030
    # (GB18030 is a superset of GBK, so it covers all GBK text)
    with open('../test.txt','rb') as f:
        data = f.read()
        print("utf-8 bytes:", data)
        data1 = data.decode('utf-8')
        print("utf-8 str:", data1)
        s = data1.encode('gb18030')
        print("gb18030 bytes:", s)
        s1 = s.decode('gb18030')
        print("gb18030 str:", s1)

    PyCharm output shows the bytes and strings in both encodings (screenshot omitted).
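
    To persist the result instead of just printing it, the re-encoded bytes can be written out to a new file. A minimal sketch; the output name test_gbk.txt is just an example, not from the original post:

    # Read UTF-8, write the same text back out as GB18030
    with open('../test.txt', 'rb') as f:
        text = f.read().decode('utf-8')          # bytes -> str
    with open('../test_gbk.txt', 'wb') as f:
        f.write(text.encode('gb18030'))          # str -> GB18030 bytes on disk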

    4. Testing data read from the network

    Fetch Baidu's UTF-8 home page:

    # Read from the network
    import urllib.request
    url = "http://www.baidu.com/"
    # Build the request
    request = urllib.request.Request(url)
    # Fetch the response
    response = urllib.request.urlopen(request)
    s = response.read()
    print(type(s))
    s = s.decode('utf-8')
    print(s)
    

    Both PyCharm and cmd run this without problems!
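
    Rather than hard-coding 'utf-8', the charset declared in the HTTP response headers can be used. A minimal sketch; get_content_charset() returns None when no charset is declared, so a fallback is kept:

    import urllib.request

    with urllib.request.urlopen("http://www.baidu.com/") as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or 'utf-8'  # declared charset, else UTF-8
        print(charset)
        print(raw.decode(charset))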

  • Python: converting between UTF-8 and GBK

    2019-09-24 21:19:07

    The source file is declared and saved as UTF-8, but the local default encoding on Chinese Windows is cp936 (GBK), so printing a UTF-8 string directly in the Windows console produces mojibake.
    Solution:

    Transcode at the point where the string is printed to the console (Python 2, where str is a byte string):
    print str.decode('UTF-8').encode('GBK')

    #coding:utf-8
    '''
    Created on 2012-11-1

    @author: Administrator
    '''
    ff = raw_input('Input: ')
    print ff.decode('utf-8').encode('gbk')

    Reposted from: https://www.cnblogs.com/sunny5156/archive/2012/11/01/2749669.html
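
    For comparison, in Python 3 the decode/encode pair is not needed: str is already Unicode and print() encodes it for the console. A minimal sketch, assuming Python 3.7+ for reconfigure():

    import sys

    s = "中文测试"           # example text, not from the original post
    print(s)                 # usually just works: print() encodes for the console

    # If the console encoding (e.g. cp936) cannot represent some character,
    # print() raises UnicodeEncodeError; replacing such characters avoids that:
    sys.stdout.reconfigure(errors='replace')
    print(s)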

  • Batch-converting file encodings with Python: decode GBK files and rewrite them as UTF-8, using chardet to skip files that are already UTF-8

    Converting a file's encoding with Python is actually very simple:

    1. Open the file and read its contents into a string variable; for a GBK-encoded file, decode that string into Unicode.
    2. Then use encode to convert it to UTF-8.
    3. Finally, write the string back to the file.

    Before converting, the file's current encoding needs to be checked: if it is already UTF-8, skip the decode step, otherwise the script raises an error.
    The chardet package is used here to report the file's encoding.
    Install it with pip install chardet and import it.

    The script (Python 2):
    convergbk2utf.py
    # -*- coding:utf-8 -*-
    __author__ = 'tsbc'

    import os, sys
    import chardet

    def convert(filename, in_enc="GBK", out_enc="UTF8"):
        try:
            print "convert " + filename,
            content = open(filename).read()
            result = chardet.detect(content)   # chardet.detect returns a dict describing the detected encoding
            coding = result.get('encoding')    # pull out the encoding name
            if coding != 'utf-8':              # only transcode when the file is not already UTF-8
                print coding + " to utf-8!",
                new_content = content.decode(in_enc).encode(out_enc)
                open(filename, 'w').write(new_content)
                print " done"
            else:
                print coding
        except IOError, e:
            print " error"


    def explore(dir):
        for root, dirs, files in os.walk(dir):
            for file in files:
                path = os.path.join(root, file)
                convert(path)

    def main():
        for path in sys.argv[1:]:
            if os.path.isfile(path):
                convert(path)
            elif os.path.isdir(path):
                explore(path)

    if __name__ == "__main__":
        main()
    

      

    Run it as
    python convergbk2utf.py d:\test
    to convert every file under d:\test to UTF-8.

    PS: to make it more robust, add a filter on the file type, i.e. inspect filename and only convert the extensions you actually want to convert (see the Python 3 sketch after the repost link below).





    Reposted from: https://www.cnblogs.com/tsbc/p/4450675.html
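
    A hedged Python 3 re-write of the same idea (not the original author's code), including the extension filter suggested in the PS; the EXTENSIONS tuple is an example and should be adjusted:

    import os
    import sys
    import chardet

    EXTENSIONS = ('.java', '.txt', '.py')      # only convert these file types

    def convert(filename, out_enc='utf-8'):
        if not filename.lower().endswith(EXTENSIONS):
            return
        with open(filename, 'rb') as f:
            content = f.read()
        coding = chardet.detect(content).get('encoding')   # detected source encoding
        if coding and coding.lower() not in ('utf-8', 'ascii'):
            print("%s: %s -> utf-8" % (filename, coding))
            with open(filename, 'w', encoding=out_enc, newline='') as f:
                f.write(content.decode(coding))

    def main():
        for path in sys.argv[1:]:
            if os.path.isfile(path):
                convert(path)
            elif os.path.isdir(path):
                for root, dirs, files in os.walk(path):
                    for name in files:
                        convert(os.path.join(root, name))

    if __name__ == '__main__':
        main()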

  • Python file encoding --- GBK or UTF-8?

    2013-04-02 22:15:30

    Python file encoding --- GBK or UTF-8?

    Windows file names are encoded as cp936, so when you use a Chinese file name you just need to transcode it.

    For example, if your Python source file is UTF-8 (Python 2):

    # -*- coding: utf-8 -*-

    he='开心.mp3'
    f=open(he.decode('utf-8').encode('cp936'),'w')
    f.close()
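
    In Python 3 this transcoding is unnecessary: file names are str (Unicode) and the interpreter passes them to the Windows filesystem API directly. A minimal sketch:

    he = '开心.mp3'
    f = open(he, 'w')    # no decode/encode needed in Python 3
    f.close()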

    -----------------------------------------------
    If GB2312 is enough, it is the best choice. GB2312 contains a bit over 7,000 characters and is an 8-bit double-byte Chinese encoding that software in China and abroad generally accepts and supports.

    But you are using GBK. GBK is an extension of GB2312, and a lot of software does not support it; with GBK, some non-GB2312 Chinese characters show up as blank squares when forum content is displayed.

    UTF-8 is better: UTF-8 is a transmission format of Unicode, and the mainstream browsers IE and Netscape both support it.

    [Look at the encodings accepted under IE's View->Encoding and under Netscape's View->Character coding: you will find GB2312 and UTF-8 but not GBK! That is why, between GBK and UTF-8, UTF-8 is the one to choose.]

    ------------------------------------------------
    Use GBK or UTF-8 as the source encoding?
    ------------------------------------------------
    Is # -*- coding: UTF-8 -*- just a comment?
    It declares the encoding used by your Python source file. By default (in Python 2) a program has to be written in ASCII; if you put Chinese in it, the interpreter normally raises an error, but once you declare the file's encoding, Python handles it and stops complaining.

    The declaration can also be written as:

    #coding=utf-8

    #coding:utf-8
    -------------------------------------------------
    Defining Python Source Code Encodings

    Abstract
        This PEP proposes to introduce a syntax to declare the encoding of
        a Python source file. The encoding information is then used by the
        Python parser to interpret the file using the given encoding. Most
        notably this enhances the interpretation of Unicode literals in
        the source code and makes it possible to write Unicode literals
        using e.g. UTF-8 directly in an Unicode aware editor.

    Problem
        In Python 2.1, Unicode literals can only be written using the
        Latin-1 based encoding "unicode-escape". This makes the
        programming environment rather unfriendly to Python users who live
        and work in non-Latin-1 locales such as many of the Asian
        countries. Programmers can write their 8-bit strings using the
        favorite encoding, but are bound to the "unicode-escape" encoding
        for Unicode literals.

    Proposed Solution
        I propose to make the Python source code encoding both visible and
        changeable on a per-source file basis by using a special comment
        at the top of the file to declare the encoding.

        To make Python aware of this encoding declaration a number of
        concept changes are necessary with respect to the handling of
        Python source code data.

    Defining the Encoding
        Python will default to ASCII as standard encoding if no other
        encoding hints are given.

        To define a source code encoding, a magic comment must
        be placed into the source files either as first or second
        line in the file, such as:

             # coding=<encoding name>

        or (using formats recognized by popular editors)

             #!/usr/bin/python
             # -*- coding: <encoding name> -*-

        or

             #!/usr/bin/python
             # vim: set fileencoding=<encoding name> :

        More precisely, the first or second line must match the regular
        expression "coding[:=]\s*([-\w.]+)". The first group of this
        expression is then interpreted as encoding name. If the encoding
        is unknown to Python, an error is raised during compilation. There
        must not be any Python statement on the line that contains the
        encoding declaration.

        To aid with platforms such as Windows, which add Unicode BOM marks
        to the beginning of Unicode files, the UTF-8 signature
        '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
        (even if no magic encoding comment is given).

        If a source file uses both the UTF-8 BOM mark signature and a
        magic encoding comment, the only allowed encoding for the comment
        is 'utf-8'.  Any other encoding will cause an error.

    Examples
        These are some examples to clarify the different styles for
        defining the source code encoding at the top of a Python source
        file:

        1. With interpreter binary and using Emacs style file encoding
          comment:

             #!/usr/bin/python
             # -*- coding: latin-1 -*-
             import os, sys
             ...

             #!/usr/bin/python
             # -*- coding: iso-8859-15 -*-
             import os, sys
             ...

             #!/usr/bin/python
             # -*- coding: ascii -*-
             import os, sys
             ...

        2. Without interpreter line, using plain text:

             # This Python file uses the following encoding: utf-8
             import os, sys
             ...

        3. Text editors might have different ways of defining the file's
          encoding, e.g.

             #!/usr/local/bin/python
             # coding: latin-1
             import os, sys
             ...

        4. Without encoding comment, Python's parser will assume ASCII
          text:

             #!/usr/local/bin/python
             import os, sys
             ...

        5. Encoding comments which don't work:

          Missing "coding:" prefix:

             #!/usr/local/bin/python
             # latin-1
             import os, sys
             ...

          Encoding comment not on line 1 or 2:

             #!/usr/local/bin/python
             #
             # -*- coding: latin-1 -*-
             import os, sys
             ...

          Unsupported encoding:

             #!/usr/local/bin/python
             # -*- coding: utf-42 -*-
             import os, sys
             ...

    Concepts
        The PEP is based on the following concepts which would have to be
        implemented to enable usage of such a magic comment:

        1. The complete Python source file should use a single encoding.
          Embedding of differently encoded data is not allowed and will
          result in a decoding error during compilation of the Python
          source code.

          Any encoding which allows processing the first two lines in the
          way indicated above is allowed as source code encoding, this
          includes ASCII compatible encodings as well as certain
          multi-byte encodings such as Shift_JIS. It does not include
          encodings which use two or more bytes for all characters like
          e.g. UTF-16. The reason for this is to keep the encoding
          detection algorithm in the tokenizer simple.

        2. Handling of escape sequences should continue to work as it does
          now, but with all possible source code encodings, that is
          standard string literals (both 8-bit and Unicode) are subject to
          escape sequence expansion while raw string literals only expand
          a very small subset of escape sequences.

        3. Python's tokenizer/compiler combo will need to be updated to
          work as follows:

          1. read the file

          2. decode it into Unicode assuming a fixed per-file encoding

          3. convert it into a UTF-8 byte string

          4. tokenize the UTF-8 content

          5. compile it, creating Unicode objects from the given Unicode data
             and creating string objects from the Unicode literal data
             by first reencoding the UTF-8 data into 8-bit string data
             using the given file encoding

          Note that Python identifiers are restricted to the ASCII
          subset of the encoding, and thus need no further conversion
          after step 4.

    Implementation
        For backwards-compatibility with existing code which currently
        uses non-ASCII in string literals without declaring an encoding,
        the implementation will be introduced in two phases:

        1. Allow non-ASCII in string literals and comments, by internally
          treating a missing encoding declaration as a declaration of
          "iso-8859-1". This will cause arbitrary byte strings to
          correctly round-trip between step 2 and step 5 of the
          processing, and provide compatibility with Python 2.2 for
          Unicode literals that contain non-ASCII bytes.

          A warning will be issued if non-ASCII bytes are found in the
          input, once per improperly encoded input file.

        2. Remove the warning, and change the default encoding to "ascii".

        The builtin compile() API will be enhanced to accept Unicode as
        input. 8-bit string input is subject to the standard procedure for
        encoding detection as described above.

        If a Unicode string with a coding declaration is passed to compile(),
        a SyntaxError will be raised.

        SUZUKI Hisao is working on a patch; see [2] for details. A patch
        implementing only phase 1 is available at [1].

    Phases
        Implementation of steps 1 and 2 above were completed in 2.3,
        except for changing the default encoding to "ascii".

        The default encoding was set to "ascii" in version 2.5.

    Scope
        This PEP intends to provide an upgrade path from the current
        (more-or-less) undefined source code encoding situation to a more
        robust and portable definition.

    References
        [1] Phase 1 implementation:
            http://python.org/sf/526840
        [2] Phase 2 implementation:
            http://python.org/sf/534304

    History
        1.10 and above: see CVS history
        1.8: Added '.' to the coding RE.
        1.7: Added warnings to phase 1 implementation. Replaced the
            Latin-1 default encoding with the interpreter's default
            encoding. Added tweaks to compile().
        1.4 - 1.6: Minor tweaks
        1.3: Worked in comments by Martin v. Loewis:
            UTF-8 BOM mark detection, Emacs style magic comment,
            two phase approach to the implementation

    Copyright
        This document has been placed in the public domain.
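
    To illustrate the regular expression quoted in the PEP above (this example is mine, not part of the PEP text):

    import re

    pattern = re.compile(r"coding[:=]\s*([-\w.]+)")
    for line in ("# -*- coding: latin-1 -*-",
                 "# coding=utf-8",
                 "# vim: set fileencoding=utf-8 :"):
        m = pattern.search(line)
        print(line, "->", m.group(1) if m else None)   # prints latin-1, utf-8, utf-8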


    [MySQL + Python] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 28-29: ordinal not in range(256)

    The character set of the database table is utf-8, so when the data being inserted contains Chinese characters a character-set conversion is needed. In Python the fix is the isinstance/encode check marked in the code below (highlighted in the original post).

     

    import MySQLdb   # the original post assumes MySQLdb is already imported

    def getinfo():
        connection = MySQLdb.connect(host='10.20.149.247', user='user', passwd='bkeep', db='ep')
        cursor = connection.cursor()
        sqlip = """ select distinct names_ip from xingyi_dns where types='A' """
        sql_update = """ update xingyi_dns set idc='%s' where names_ip = '%s' and types = 'A' """
        cursor.execute(sqlip)
        count = cursor.fetchall()
        # print type(count)
        for i in count:
            ip = i[0]
            idc_a = get_idc_ip(ip)                  # helper defined elsewhere in the original post
            if isinstance(idc_a, unicode):          # the fix: encode Unicode values to UTF-8
                idc_a = idc_a.encode('utf-8')       # before interpolating them into the SQL string
            cursor.execute(sql_update % (idc_a, ip))
            print ip

    getinfo()

     

    Running it again now succeeds.
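
    A hedged alternative (not from the original post): MySQLdb can also be told the connection character set up front, which avoids the latin-1 default that triggers the UnicodeEncodeError in the first place.

    import MySQLdb

    # charset/use_unicode make the driver exchange UTF-8 text with the server directly
    connection = MySQLdb.connect(host='10.20.149.247', user='user', passwd='bkeep',
                                 db='ep', charset='utf8', use_unicode=True)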



  • utf-8 ... The result, shown in the figure, is that one Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK. Open the CSV file with the Notepad that ships with Windows, choose Save As from the menu, and the dialog that pops up has an Encoding box at the bottom; if it shows UTF-8, the CSV fi...
  • Batch-convert file encodings from GBK/GB2312 to UTF-8 # 2. Supports converting all files under a given directory, including files in subdirectories # 3. Supports detecting the original encoding and skipping files that are already UTF-8 # 4. Supports converting only files with specified extensions # 5. ...
  • The Python 3 approach below recursively walks the current directory and its subdirectories and converts the *.java files in them from GBK to UTF-8. Note it should only be run once: after one pass, the files under the current directory and its subdirectories have all been converted from GBK to UTF-8. # Python 3: converting GBK to UTF-8, ...
  • #!/usr/bin/env python ... # -*- coding:utf-8 -*- temp = "连接" temp_unicode = temp.decode('utf-8') temp_gbk = temp_unicode.encode('gbk') print(temp_gbk) #!/usr/bin/env python ...
  • The Python 3 interpreter's default text representation is Unicode, held in the str type; binary data uses the bytes type. A string is turned into a byte string by encoding, and bytes become a string again by decoding. encode: str --> bytes, decode: bytes --> str. An example is given below ... (see also the sketch at the end of this list)
  • Python 3 has just two such data types, str and bytes. str: Unicode (the universal character set everyone can read). s = 'hello袁浩' # stored internally as Unicode code points. bytes: displayed as hex # going from str to bytes with a UTF-8 encoding is called encoding: b1 = bytes(s, 'utf8') # ...
  • 1. Python 2's default encoding is ASCII; in Python 3 strings are Unicode by default. 2. Unicode can be stored as UTF-32 (4 bytes), UTF-16 (2 or 4 bytes), or UTF-8 (1-4 bytes); UTF-16 is widely used in memory, but files still usually store UTF-8 because UTF-8 saves space. 3. ...
  • Python 3 defaults to UTF-8, so scraping a GBK web page produces mojibake. Fix: set test.encoding = "gbk" before reading test.text (without the conversion, .text comes out wrong). A second approach: test.content.decode("gbk"); decode turns ...
  • Python: converting a UTF-8 file into a GBK file. Requirement: convert a UTF-8-encoded file into a GBK-encoded file. Implementation: def ReadFile...
  • Contents: data-conversion reference examples. 1. bytes to str 2. str to a byte array 3. int to a hex string 4. hex string to int 5. hex string / int to a binary string 6. list to string 7. splitting a string on spaces into ...
  • Python 3 character-set conversion

    2019-01-03 12:02:00
    ... unicode_01 = utf_8_01.encode('gbk')  # encode into GBK bytes  print(unicode_01)  gbk = unicode_01.dec...
  • Source Insight does not support UTF-8, but the files are viewed as UTF-8 on Linux, where they display incorrectly, so here is a small Python script (py2.7) for batch conversion: #coding:utf-8 ''' GBK-to-UTF-8 tool author: 宁次 ...
  • Python encoding conversion

    2017-11-01 22:48:02
    Python commonly works with three encodings: Unicode, UTF-8 and GBK. Sometimes, for example, a script written in UTF-8 needs to run in a Windows terminal, and the Windows terminal's default encoding is GBK, so the encoding has to be converted ...
  • Python character-set conversion

    2018-06-25 18:01:04
    1. Python 3 encode and decode: # --*-- coding:utf8 --*-- at the top of a Python file tells the interpreter which encoding to use to read the file; encode() and decode() convert string <=> bytes, i.e. encoding and decoding move between strings and byte strings ...
  • Character-encoding operations: # -*- coding:utf-8 -*- import sys print(sys.getdefaultencoding()) s = "你好" # the Python 2 way of converting s to GBK: s_to_gbk = s.decode("utf-8").encode("gbk... when the file header declares utf-8 # 3. convert s ...
  • Python encoding conversion

    2020-10-09 20:14:15
    # Python 3 has two string-like types, bytes and str: str holds Unicode, bytes holds raw bytes # converting a string (Unicode) to bytes is encoding: a = "北京的金山上太阳照四方!" b = a.encode("utf-8") # to bytes print(b) # a byte string ...
  • # utf-8 > decode > unicode # utf-8 < encode < unicode # gbk > decode > unicode # gbk < encode < unicode import sys print(sys.getdefaultencoding()) # the string has to be manually encoded first, specifying ...
  • #!/usr/bin/python2 # -*- coding:utf-8 -*- temp = "猪" # decode: specify what the original encoding is and decode into Unicode: temp_unicode = temp.decode('utf-8') # encode: from Unicode into GBK: temp_gbk = temp_unicode.encode("gbk") print(temp_...
  • First of all, Python 3 strings are Unicode internally; PyCharm merely interprets the source as UTF-8, while the in-memory representation is still Unicode. a = "某gbk编码格式" a.decode("gbk").encode("utf-8") # first decode from GBK into Unicode, then convert to UTF-8 python...
  • Just moved to Python 3 recently, and a lot of code needs reworking... painful... On the plus side, Python 3 can use Chinese directly without declaring a character set, which is great. Problem: the interpreter is UTF-8 and the txt file is UTF-8. Solution: just add an encoding argument to open ...
  • Differences between Python 2 and Python 3

    2021-01-20 03:17:44
    3. Python 2's default encoding is ASCII; Python 3's default encoding is UTF-8. 4. In Python 3 a decoded string is automatically held in memory as Unicode, which Python 2 does not do. If the file header specifies a source encoding, both Python 2 and Python 3 decode according to it, and all systems support ...
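
    Several of the snippets above describe the same Python 3 round trip; a minimal sketch pulling it together (the sample string '连接' is taken from one of the snippets):

    s = '连接'
    utf8_bytes = s.encode('utf-8')    # 3 bytes per Chinese character -> 6 bytes
    gbk_bytes = s.encode('gbk')       # 2 bytes per Chinese character -> 4 bytes
    print(len(utf8_bytes), len(gbk_bytes))
    print(gbk_bytes.decode('gbk'))                    # bytes -> str again
    print(gbk_bytes.decode('gbk').encode('utf-8'))    # GBK bytes re-encoded as UTF-8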
