Add. 3rd Alibaba Cloud Panjiu Zhiwei Algorithm Competition (第三届阿里云磐久智维算法大赛)

5 months ago · 305130130b
parent df252a9622
commit 305130130b
27 changed files with 51587 additions and 4 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,2 @@
+
+.DS_Store
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/Dockerfile
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/Dockerfile
@@ -0,0 +1,25 @@
+# Base Images
+## Build from the Tianchi base image
+FROM registry.cn-shanghai.aliyuncs.com/tcc-public/python:3
+## Copy the files in the current directory into the image's root directory
+ADD . /
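+## Set the default working directory to the root directory (run.sh and the generated result files must both be placed there for the submission to run)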
+WORKDIR /
+
+## Install the required packages
+RUN pip config set global.index-url http://mirrors.aliyun.com/pypi/simple/
+RUN pip config set install.trusted-host mirrors.aliyun.com
+RUN pip3 install -r code/requirements.txt
+RUN pip install --upgrade pip
+RUN apt -y update
+RUN apt install -y zip
+RUN apt install vim -y
+RUN apt install screen -y
+RUN pip install catboost
+RUN pip install scikit-learn
+RUN pip install tqdm
+RUN pip install lightgbm
+RUN pip install gensim==4.1.2
+
+## After the image starts, always run sh run.sh
+CMD ["sh", "run.sh"]
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/LICENSE
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/LICENSE
@@ -0,0 +1,661 @@
+                    GNU AFFERO GENERAL PUBLIC LICENSE
+                       Version 3, 19 November 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The GNU Affero General Public License is a free, copyleft license for
+software and other kinds of works, specifically designed to ensure
+cooperation with the community in the case of network server software.
+
+  The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works.  By contrast,
+our General Public Licenses are intended to guarantee your freedom to
+share and change all versions of a program--to make sure it remains free
+software for all its users.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+them if you wish), that you receive source code or can get it if you
+want it, that you can change the software or use pieces of it in new
+free programs, and that you know you can do these things.
+
+  Developers that use our General Public Licenses protect your rights
+with two steps: (1) assert copyright on the software, and (2) offer
+you this License which gives you legal permission to copy, distribute
+and/or modify the software.
+
+  A secondary benefit of defending all users' freedom is that
+improvements made in alternate versions of the program, if they
+receive widespread use, become available for other developers to
+incorporate.  Many developers of free software are heartened and
+encouraged by the resulting cooperation.  However, in the case of
+software used on network servers, this result may fail to come about.
+The GNU General Public License permits making a modified version and
+letting the public access it on a server without ever releasing its
+source code to the public.
+
+  The GNU Affero General Public License is designed specifically to
+ensure that, in such cases, the modified source code becomes available
+to the community.  It requires the operator of a network server to
+provide the source code of the modified version running there to the
+users of that server.  Therefore, public use of a modified version, on
+a publicly accessible server, gives the public access to the source
+code of the modified version.
+
+  An older license, called the Affero General Public License and
+published by Affero, was designed to accomplish similar goals.  This is
+a different license, not a version of the Affero GPL, but Affero has
+released a new version of the Affero GPL which permits relicensing under
+this license.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                       TERMS AND CONDITIONS
+
+  0. Definitions.
+
+  "This License" refers to version 3 of the GNU Affero General Public License.
+
+  "Copyright" also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+  "The Program" refers to any copyrightable work licensed under this
+License.  Each licensee is addressed as "you".  "Licensees" and
+"recipients" may be individuals or organizations.
+
+  To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy.  The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+  A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+  To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy.  Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+  To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies.  Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+  An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License.  If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+  1. Source Code.
+
+  The "source code" for a work means the preferred form of the work
+for making modifications to it.  "Object code" means any non-source
+form of a work.
+
+  A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+  The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form.  A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+  The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities.  However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work.  For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+  The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+  The Corresponding Source for a work in source code form is that
+same work.
+
+  2. Basic Permissions.
+
+  All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met.  This License explicitly affirms your unlimited
+permission to run the unmodified Program.  The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work.  This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+  You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force.  You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright.  Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+  Conveying under any other circumstances is permitted solely under
+the conditions stated below.  Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+  No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+  When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing, against the work's
+users, your or third parties' legal rights to forbid circumvention of
+technological measures.
+
+  4. Conveying Verbatim Copies.
+
+  You may convey verbatim copies of the Program's source code as you
+receive it, in any medium, provided that you conspicuously and
+appropriately publish on each copy an appropriate copyright notice;
+keep intact all notices stating that this License and any
+non-permissive terms added in accord with section 7 apply to the code;
+keep intact all notices of the absence of any warranty; and give all
+recipients a copy of this License along with the Program.
+
+  You may charge any price or no price for each copy that you convey,
+and you may offer support or warranty protection for a fee.
+
+  5. Conveying Modified Source Versions.
+
+  You may convey a work based on the Program, or the modifications to
+produce it from the Program, in the form of source code under the
+terms of section 4, provided that you also meet all of these conditions:
+
+    a) The work must carry prominent notices stating that you modified
+    it, and giving a relevant date.
+
+    b) The work must carry prominent notices stating that it is
+    released under this License and any conditions added under section
+    7.  This requirement modifies the requirement in section 4 to
+    "keep intact all notices".
+
+    c) You must license the entire work, as a whole, under this
+    License to anyone who comes into possession of a copy.  This
+    License will therefore apply, along with any applicable section 7
+    additional terms, to the whole of the work, and all its parts,
+    regardless of how they are packaged.  This License gives no
+    permission to license the work in any other way, but it does not
+    invalidate such permission if you have separately received it.
+
+    d) If the work has interactive user interfaces, each must display
+    Appropriate Legal Notices; however, if the Program has interactive
+    interfaces that do not display Appropriate Legal Notices, your
+    work need not make them do so.
+
+  A compilation of a covered work with other separate and independent
+works, which are not by their nature extensions of the covered work,
+and which are not combined with it such as to form a larger program,
+in or on a volume of a storage or distribution medium, is called an
+"aggregate" if the compilation and its resulting copyright are not
+used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit.  Inclusion of a covered work
+in an aggregate does not cause this License to apply to the other
+parts of the aggregate.
+
+  6. Conveying Non-Source Forms.
+
+  You may convey a covered work in object code form under the terms
+of sections 4 and 5, provided that you also convey the
+machine-readable Corresponding Source under the terms of this License,
+in one of these ways:
+
+    a) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by the
+    Corresponding Source fixed on a durable physical medium
+    customarily used for software interchange.
+
+    b) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by a
+    written offer, valid for at least three years and valid for as
+    long as you offer spare parts or customer support for that product
+    model, to give anyone who possesses the object code either (1) a
+    copy of the Corresponding Source for all the software in the
+    product that is covered by this License, on a durable physical
+    medium customarily used for software interchange, for a price no
+    more than your reasonable cost of physically performing this
+    conveying of source, or (2) access to copy the
+    Corresponding Source from a network server at no charge.
+
+    c) Convey individual copies of the object code with a copy of the
+    written offer to provide the Corresponding Source.  This
+    alternative is allowed only occasionally and noncommercially, and
+    only if you received the object code with such an offer, in accord
+    with subsection 6b.
+
+    d) Convey the object code by offering access from a designated
+    place (gratis or for a charge), and offer equivalent access to the
+    Corresponding Source in the same way through the same place at no
+    further charge.  You need not require recipients to copy the
+    Corresponding Source along with the object code.  If the place to
+    copy the object code is a network server, the Corresponding Source
+    may be on a different server (operated by you or a third party)
+    that supports equivalent copying facilities, provided you maintain
+    clear directions next to the object code saying where to find the
+    Corresponding Source.  Regardless of what server hosts the
+    Corresponding Source, you remain obligated to ensure that it is
+    available for as long as needed to satisfy these requirements.
+
+    e) Convey the object code using peer-to-peer transmission, provided
+    you inform other peers where the object code and Corresponding
+    Source of the work are being offered to the general public at no
+    charge under subsection 6d.
+
+  A separable portion of the object code, whose source code is excluded
+from the Corresponding Source as a System Library, need not be
+included in conveying the object code work.
+
+  A "User Product" is either (1) a "consumer product", which means any
+tangible personal property which is normally used for personal, family,
+or household purposes, or (2) anything designed or sold for incorporation
+into a dwelling.  In determining whether a product is a consumer product,
+doubtful cases shall be resolved in favor of coverage.  For a particular
+product received by a particular user, "normally used" refers to a
+typical or common use of that class of product, regardless of the status
+of the particular user or of the way in which the particular user
+actually uses, or expects or is expected to use, the product.  A product
+is a consumer product regardless of whether the product has substantial
+commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+  "Installation Information" for a User Product means any methods,
+procedures, authorization keys, or other information required to install
+and execute modified versions of a covered work in that User Product from
+a modified version of its Corresponding Source.  The information must
+suffice to ensure that the continued functioning of the modified object
+code is in no case prevented or interfered with solely because
+modification has been made.
+
+  If you convey an object code work under this section in, or with, or
+specifically for use in, a User Product, and the conveying occurs as
+part of a transaction in which the right of possession and use of the
+User Product is transferred to the recipient in perpetuity or for a
+fixed term (regardless of how the transaction is characterized), the
+Corresponding Source conveyed under this section must be accompanied
+by the Installation Information.  But this requirement does not apply
+if neither you nor any third party retains the ability to install
+modified object code on the User Product (for example, the work has
+been installed in ROM).
+
+  The requirement to provide Installation Information does not include a
+requirement to continue to provide support service, warranty, or updates
+for a work that has been modified or installed by the recipient, or for
+the User Product in which it has been modified or installed.  Access to a
+network may be denied when the modification itself materially and
+adversely affects the operation of the network or violates the rules and
+protocols for communication across the network.
+
+  Corresponding Source conveyed, and Installation Information provided,
+in accord with this section must be in a format that is publicly
+documented (and with an implementation available to the public in
+source code form), and must require no special password or key for
+unpacking, reading or copying.
+
+  7. Additional Terms.
+
+  "Additional permissions" are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions.
+Additional permissions that are applicable to the entire Program shall
+be treated as though they were included in this License, to the extent
+that they are valid under applicable law.  If additional permissions
+apply only to part of the Program, that part may be used separately
+under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+  When you convey a copy of a covered work, you may at your option
+remove any additional permissions from that copy, or from any part of
+it.  (Additional permissions may be written to require their own
+removal in certain cases when you modify the work.)  You may place
+additional permissions on material, added by you to a covered work,
+for which you have or can give appropriate copyright permission.
+
+  Notwithstanding any other provision of this License, for material you
+add to a covered work, you may (if authorized by the copyright holders of
+that material) supplement the terms of this License with terms:
+
+    a) Disclaiming warranty or limiting liability differently from the
+    terms of sections 15 and 16 of this License; or
+
+    b) Requiring preservation of specified reasonable legal notices or
+    author attributions in that material or in the Appropriate Legal
+    Notices displayed by works containing it; or
+
+    c) Prohibiting misrepresentation of the origin of that material, or
+    requiring that modified versions of such material be marked in
+    reasonable ways as different from the original version; or
+
+    d) Limiting the use for publicity purposes of names of licensors or
+    authors of the material; or
+
+    e) Declining to grant rights under trademark law for use of some
+    trade names, trademarks, or service marks; or
+
+    f) Requiring indemnification of licensors and authors of that
+    material by anyone who conveys the material (or modified versions of
+    it) with contractual assumptions of liability to the recipient, for
+    any liability that these contractual assumptions directly impose on
+    those licensors and authors.
+
+  All other non-permissive additional terms are considered "further
+restrictions" within the meaning of section 10.  If the Program as you
+received it, or any part of it, contains a notice stating that it is
+governed by this License along with a term that is a further
+restriction, you may remove that term.  If a license document contains
+a further restriction but permits relicensing or conveying under this
+License, you may add to a covered work material governed by the terms
+of that license document, provided that the further restriction does
+not survive such relicensing or conveying.
+
+  If you add terms to a covered work in accord with this section, you
+must place, in the relevant source files, a statement of the
+additional terms that apply to those files, or a notice indicating
+where to find the applicable terms.
+
+  Additional terms, permissive or non-permissive, may be stated in the
+form of a separately written license, or stated as exceptions;
+the above requirements apply either way.
+
+  8. Termination.
+
+  You may not propagate or modify a covered work except as expressly
+provided under this License.  Any attempt otherwise to propagate or
+modify it is void, and will automatically terminate your rights under
+this License (including any patent licenses granted under the third
+paragraph of section 11).
+
+  However, if you cease all violation of this License, then your
+license from a particular copyright holder is reinstated (a)
+provisionally, unless and until the copyright holder explicitly and
+finally terminates your license, and (b) permanently, if the copyright
+holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+  Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+  Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License.  If your rights have been terminated and not permanently
+reinstated, you do not qualify to receive new licenses for the same
+material under section 10.
+
+  9. Acceptance Not Required for Having Copies.
+
+  You are not required to accept this License in order to receive or
+run a copy of the Program.  Ancillary propagation of a covered work
+occurring solely as a consequence of using peer-to-peer transmission
+to receive a copy likewise does not require acceptance.  However,
+nothing other than this License grants you permission to propagate or
+modify any covered work.  These actions infringe copyright if you do
+not accept this License.  Therefore, by modifying or propagating a
+covered work, you indicate your acceptance of this License to do so.
+
+  10. Automatic Licensing of Downstream Recipients.
+
+  Each time you convey a covered work, the recipient automatically
+receives a license from the original licensors, to run, modify and
+propagate that work, subject to this License.  You are not responsible
+for enforcing compliance by third parties with this License.
+
+  An "entity transaction" is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an
+organization, or merging organizations.  If propagation of a covered
+work results from an entity transaction, each party to that
+transaction who receives a copy of the work also receives whatever
+licenses to the work the party's predecessor in interest had or could
+give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if
+the predecessor has it or can get it with reasonable efforts.
+
+  You may not impose any further restrictions on the exercise of the
+rights granted or affirmed under this License.  For example, you may
+not impose a license fee, royalty, or other charge for exercise of
+rights granted under this License, and you may not initiate litigation
+(including a cross-claim or counterclaim in a lawsuit) alleging that
+any patent claim is infringed by making, using, selling, offering for
+sale, or importing the Program or any portion of it.
+
+  11. Patents.
+
+  A "contributor" is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based.  The
+work thus licensed is called the contributor's "contributor version".
+
+  A contributor's "essential patent claims" are all patent claims
+owned or controlled by the contributor, whether already acquired or
+hereafter acquired, that would be infringed by some manner, permitted
+by this License, of making, using, or selling its contributor version,
+but do not include claims that would be infringed only as a
+consequence of further modification of the contributor version.  For
+purposes of this definition, "control" includes the right to grant
+patent sublicenses in a manner consistent with the requirements of
+this License.
+
+  Each contributor grants you a non-exclusive, worldwide, royalty-free
+patent license under the contributor's essential patent claims, to
+make, use, sell, offer for sale, import and otherwise run, modify and
+propagate the contents of its contributor version.
+
+  In the following three paragraphs, a "patent license" is any express
+agreement or commitment, however denominated, not to enforce a patent
+(such as an express permission to practice a patent or covenant not to
+sue for patent infringement).  To "grant" such a patent license to a
+party means to make such an agreement or commitment not to enforce a
+patent against the party.
+
+  If you convey a covered work, knowingly relying on a patent license,
+and the Corresponding Source of the work is not available for anyone
+to copy, free of charge and under the terms of this License, through a
+publicly available network server or other readily accessible means,
+then you must either (1) cause the Corresponding Source to be so
+available, or (2) arrange to deprive yourself of the benefit of the
+patent license for this particular work, or (3) arrange, in a manner
+consistent with the requirements of this License, to extend the patent
+license to downstream recipients.  "Knowingly relying" means you have
+actual knowledge that, but for the patent license, your conveying the
+covered work in a country, or your recipient's use of the covered work
+in a country, would infringe one or more identifiable patents in that
+country that you have reason to believe are valid.
+
+  If, pursuant to or in connection with a single transaction or
+arrangement, you convey, or propagate by procuring conveyance of, a
+covered work, and grant a patent license to some of the parties
+receiving the covered work authorizing them to use, propagate, modify
+or convey a specific copy of the covered work, then the patent license
+you grant is automatically extended to all recipients of the covered
+work and works based on it.
+
+  A patent license is "discriminatory" if it does not include within
+the scope of its coverage, prohibits the exercise of, or is
+conditioned on the non-exercise of one or more of the rights that are
+specifically granted under this License.  You may not convey a covered
+work if you are a party to an arrangement with a third party that is
+in the business of distributing software, under which you make payment
+to the third party based on the extent of your activity of conveying
+the work, and under which the third party grants, to any of the
+parties who would receive the covered work from you, a discriminatory
+patent license (a) in connection with copies of the covered work
+conveyed by you (or copies made from those copies), or (b) primarily
+for and in connection with specific products or compilations that
+contain the covered work, unless you entered into that arrangement,
+or that patent license was granted, prior to 28 March 2007.
+
+  Nothing in this License shall be construed as excluding or limiting
+any implied license or other defenses to infringement that may
+otherwise be available to you under applicable patent law.
+
+  12. No Surrender of Others' Freedom.
+
+  If conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot convey a
+covered work so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you may
+not convey it at all.  For example, if you agree to terms that obligate you
+to collect a royalty for further conveying from those to whom you convey
+the Program, the only way you could satisfy both those terms and this
+License would be to refrain entirely from conveying the Program.
+
+  13. Remote Network Interaction; Use with the GNU General Public License.
+
+  Notwithstanding any other provision of this License, if you modify the
+Program, your modified version must prominently offer all users
+interacting with it remotely through a computer network (if your version
+supports such interaction) an opportunity to receive the Corresponding
+Source of your version by providing access to the Corresponding Source
+from a network server at no charge, through some standard or customary
+means of facilitating copying of software.  This Corresponding Source
+shall include the Corresponding Source for any work covered by version 3
+of the GNU General Public License that is incorporated pursuant to the
+following paragraph.
+
+  Notwithstanding any other provision of this License, you have
+permission to link or combine any covered work with a work licensed
+under version 3 of the GNU General Public License into a single
+combined work, and to convey the resulting work.  The terms of this
+License will continue to apply to the part which is the covered work,
+but the work with which it is combined will remain governed by version
+3 of the GNU General Public License.
+
+  14. Revised Versions of this License.
+
+  The Free Software Foundation may publish revised and/or new versions of
+the GNU Affero General Public License from time to time.  Such new versions
+will be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+  Each version is given a distinguishing version number.  If the
+Program specifies that a certain numbered version of the GNU Affero General
+Public License "or any later version" applies to it, you have the
+option of following the terms and conditions either of that numbered
+version or of any later version published by the Free Software
+Foundation.  If the Program does not specify a version number of the
+GNU Affero General Public License, you may choose any version ever published
+by the Free Software Foundation.
+
+  If the Program specifies that a proxy can decide which future
+versions of the GNU Affero General Public License can be used, that proxy's
+public statement of acceptance of a version permanently authorizes you
+to choose that version for the Program.
+
+  Later license versions may give you additional or different
+permissions.  However, no additional obligations are imposed on any
+author or copyright holder as a result of your choosing to follow a
+later version.
+
+  15. Disclaimer of Warranty.
+
+  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. Limitation of Liability.
+
+  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGES.
+
+  17. Interpretation of Sections 15 and 16.
+
+  If the disclaimer of warranty and limitation of liability provided
+above cannot be given local legal effect according to their terms,
+reviewing courts shall apply local law that most closely approximates
+an absolute waiver of all civil liability in connection with the
+Program, unless a warranty or assumption of liability accompanies a
+copy of the Program in return for a fee.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+state the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as published
+    by the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+  If your software can interact with users remotely through a computer
+network, you should also make sure that it provides a way for users to
+get its source.  For example, if your program is a web application, its
+interface could display a "Source" link that leads users to an archive
+of the code.  There are many ways you could offer source, and different
+solutions will be better for different programs; see section 13 for the
+specific requirements.
+
+  You should also get your employer (if you work as a programmer) or school,
+if any, to sign a "copyright disclaimer" for the program, if necessary.
+For more information on this, and how to apply and follow the GNU AGPL, see
+<https://www.gnu.org/licenses/>.
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/README.md
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/README.md
@ -1,18 +1,96 @@
 # 3rd Alibaba Cloud PanJiu AIOps Competition (Runner-up)

-### [Official site](<https://tianchi.aliyun.com/competition/entrance/531947/introduction>)


+## [Official site](<https://tianchi.aliyun.com/competition/entrance/531947/introduction>)

-### [Code](https://github.com/yz-intelligence/AI-Competition/tree/main/3rd_PanJiu_AIOps_Competition)
+## [Official data](https://tianchi.aliyun.com/competition/entrance/531947/information?lang=zh-cn)

+## Project layout

+```
+├── Dockerfile
+├── README.md
+├── code
+│   ├── catboost_fs.py +++++++++++++++++++++++++++++++ model training code
+│   ├── generate_feature.py ++++++++++++++++++++++++++ feature generation code
+│   ├── generate_pseudo_label.py  ++++++++++++++++++++ pseudo-label generation code
+│   ├── get_crashdump_venus_fea.py +++++++++++++++++++ feature generation code for the new data
+│   ├── requirements.txt +++++++++++++++++++++++++++++ Python package versions
+│   ├── stacking.py ++++++++++++++++++++++++++++++++++ model ensembling (stacking) code
+│   └── utils.py +++++++++++++++++++++++++++++++++++++ utility helpers
+├── data
+│   ├── preliminary_a_test +++++++++++++++++++++++++++ preliminary round, leaderboard A test set
+│   ├── preliminary_b_test +++++++++++++++++++++++++++ preliminary round, leaderboard B test set
+│   └── preliminary_train ++++++++++++++++++++++++++++ training data
+├── docker_push.sh +++++++++++++++++++++++++++++++++++++++++ script that builds and pushes the Docker image
+├── feature
+│   └── generation +++++++++++++++++++++++++++++++++++ generated features folder
+├── log ++++++++++++++++++++++++++++++++++++++++++++++++++++ log folder
+│   └── catboost.log +++++++++++++++++++++++++++++++++ model run log
+├── model ++++++++++++++++++++++++++++++++++++++++++++++++++ model files
+├── prediction_result ++++++++++++++++++++++++++++++++++++++ model prediction output folder
+│   ├── cat_prob_result.csv ++++++++++++++++++++++++++ CatBoost predicted probabilities
+│   ├── catboost_result.csv ++++++++++++++++++++++++++ CatBoost predicted labels
+│   └── stacking_result.csv ++++++++++++++++++++++++++ stacking result
+├── run.log ++++++++++++++++++++++++++++++++++++++++++++++++ code run log
+├── run.sh  ++++++++++++++++++++++++++++++++++++++++++++++++ script that runs the code
+├── tcdata  ++++++++++++++++++++++++++++++++++++++++++++++++ final-round test data folder (to reproduce, rename the corresponding preliminary-round files and place them here)
+│   ├── final_crashdump_dataset_b.csv ++++++++++++++++ final round, leaderboard B new data file
+│   ├── final_sel_log_dataset_b.csv ++++++++++++++++++ final-round test set log file
+│   ├── final_submit_dataset_b.csv +++++++++++++++++++ final-round test set IDs
+│   └── final_venus_dataset_b.csv ++++++++++++++++++++ final round, leaderboard B new data file
+├── user_data
+│   └── tmp_data +++++++++++++++++++++++++++++++++++++ temporary files
+└── 答辩PPT
+    └── 悦智AI实验室_20220525.pdf
+```

-### [Official data](https://tianchi.aliyun.com/competition/entrance/531947/information?lang=zh-cn)
+## Runtime environment

+Python 3.8 is used. Package versions are listed in code/requirements.txt and can be installed with:

+```
+pip install -r code/requirements.txt
+```

-### Final leaderboard: runner-up
+
+
+## Build the image and run the code
+
+### Build the image
+
+```
+docker build -t [your-image-repo]:[TAG] .
+```
+
+### Run the image
+
+```
+docker run [your-image-id] sh run.sh
+```
+
+### Push the image
+
+```
+docker push [your-repo-address]:[TAG]
+```
+
+### Run & push the image
+
+```
+bash docker_push.sh
+```
+
+## Run the code
+
+```
+bash run.sh
+```
+
+
+
+## Final leaderboard: runner-up

 <img src="assets/亚军盖章.jpg" alt="Runner-up certificate (stamped)" title="Runner-up certificate (stamped)" width="250"  height = "350" />

--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/catboost_fs.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/catboost_fs.py
@ -0,0 +1,328 @@
+import datetime
+import os
+import warnings
+
+import numpy as np
+import pandas as pd
+
+
+from generate_feature import get_beta_target, add_last_next_time4fault, get_feature, \
+    get_duration_minutes_fea, get_nearest_msg_fea, get_server_model_sn_fea_2, \
+    get_server_model_fea, get_msg_text_fea_all, get_key_word_cross_fea, get_server_model_time_interval_stat_fea, \
+    get_w2v_feats, get_key_for_top_fea, get_time_diff_feats_v2
+from model import run_cbt
+from utils import RESULT_DIR, TRAIN_DIR, \
+    TEST_A_DIR, KEY_WORDS, get_word_counter, search_weight, macro_f1, TIME_INTERVAL, PSEUDO_FALG, GENERATION_DIR
+
+warnings.filterwarnings('ignore')
+
+
+def get_label(PSEUDO_FALG):
+    preliminary_train_label_dataset = pd.read_csv(preliminary_train_label_dataset_path)
+    preliminary_train_label_dataset_s = pd.read_csv(preliminary_train_label_dataset_s_path)
+
+    if PSEUDO_FALG:
+        print('Loading labels, including pseudo-labels')
+        pseudo_labels = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_labels.csv'))
+        label = pd.concat([preliminary_train_label_dataset,
+                           pseudo_labels,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        label = pd.concat([preliminary_train_label_dataset,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    label['fault_time'] = label['fault_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+    label['fault_time'] = label['fault_time'].apply(lambda x: str(x))
+    return label
+
+
+def get_log_dateset(PSEUDO_FALG):
+    preliminary_sel_log_dataset = pd.read_csv(preliminary_sel_log_dataset_path)
+    preliminary_sel_log_dataset_a = pd.read_csv(preliminary_sel_log_dataset_a_path)
+    if PSEUDO_FALG:
+        print('Loading log data, including pseudo-label logs')
+        pseudo_sel_log_dataset = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_sel_log_dataset.csv'))
+        log_dataset = pd.concat([preliminary_sel_log_dataset,
+                                 pseudo_sel_log_dataset,
+                                 preliminary_sel_log_dataset_a],
+                                ignore_index=True,
+                                axis=0).sort_values(
+            ['sn', 'time', 'server_model']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        log_dataset = pd.concat([preliminary_sel_log_dataset,
+                                 preliminary_sel_log_dataset_a],
+                                ignore_index=True,
+                                axis=0).sort_values(
+            ['sn', 'time', 'server_model']).reset_index(drop=True)
+    log_dataset['time'] = log_dataset['time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+
+    return log_dataset
+
+def get_fea_distribute(feature_df, feature_importances, dataset_type, top=30):
+    print('Computing value distributions of the top features by importance, to check that train and test are distributed consistently')
+    fea_distribute_list = []
+    for i in feature_importances[:top]['fea'].to_list():
+        fea_distribute_tmp = (feature_df[i].value_counts() / len(feature_df)).reset_index().rename(
+            columns={'index': 'value'})
+        fea_distribute_list.append(fea_distribute_tmp)
+
+    fea_distribute = fea_distribute_list[-1]
+    for i in fea_distribute_list[:-1]:
+        fea_distribute = fea_distribute.merge(i, on='value', how='left')
+    fea_distribute['value'] = fea_distribute['value'].apply(lambda x: f'{dataset_type}_{int(x)}')
+    return fea_distribute
+
+
+
+def get_train_test(label, preliminary_submit_dataset_a, log_dataset):
+    print('Building the training and test sets')
+    train = label.merge(log_dataset, on='sn', how='left')
+    test = preliminary_submit_dataset_a.merge(log_dataset, on='sn', how='left')
+    #     train['time_interval']  = (pd.to_datetime( train['fault_time'])-train['time']  ).apply(lambda x:x.total_seconds())
+    #     test['time_interval']  = (pd.to_datetime( test['fault_time'])- test['time']  ).apply(lambda x:x.total_seconds())
+    #     train = train.query('time_interval > 0')
+    #     test = test.query('time_interval > 0')
+    print(f'Training set shape: {train.shape}, test set shape: {test.shape}')
+    train = train.drop_duplicates().reset_index(drop=True)
+    test = test.drop_duplicates().reset_index(drop=True)
+    train['time'] = pd.to_datetime(train['time'])
+    test['time'] = pd.to_datetime(test['time'])
+    return train, test
+
+
+start_time = datetime.datetime.now()
+
+additional_sel_log_dataset_path = os.path.join(TRAIN_DIR, 'additional_sel_log_dataset.csv')
+preliminary_train_label_dataset_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset.csv')
+preliminary_train_label_dataset_s_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset_s.csv')
+preliminary_sel_log_dataset_path = os.path.join(TRAIN_DIR, 'preliminary_sel_log_dataset.csv')
+
+preliminary_submit_dataset_a_path = os.path.join(TEST_A_DIR, 'final_submit_dataset_b.csv')
+preliminary_sel_log_dataset_a_path = os.path.join(TEST_A_DIR, 'final_sel_log_dataset_b.csv')
+
+print(preliminary_submit_dataset_a_path, preliminary_sel_log_dataset_a_path)
+
+preliminary_submit_dataset_a = pd.read_csv(preliminary_submit_dataset_a_path)
+preliminary_submit_dataset_a.head()
+
+log_dataset = get_log_dateset(PSEUDO_FALG)
+label = get_label(PSEUDO_FALG)
+
+
+next_time_list = [i / TIME_INTERVAL for i in [3, 5, 10, 15, 30, 45, 60, 90, 120, 240, 360, 480, 540, 600]] + [1000000]
+
+label, preliminary_submit_dataset_a = add_last_next_time4fault(label, preliminary_submit_dataset_a, TIME_INTERVAL,
+                                                               next_time_list)
+train, test = get_train_test(label, preliminary_submit_dataset_a, log_dataset)
+train = train.drop_duplicates(['sn', 'fault_time', 'time', 'msg', 'server_model']).reset_index(drop=True)
+
+train['time_interval'] = (pd.to_datetime(train['fault_time']) - pd.to_datetime(train['time'])).apply(
+    lambda x: x.total_seconds())
+test['time_interval'] = (pd.to_datetime(test['fault_time']) - pd.to_datetime(test['time'])).apply(
+    lambda x: x.total_seconds())
+
+all_data = pd.concat([train, test], axis=0, ignore_index=True)
+all_data = all_data.sort_values(['sn','server_model', 'fault_time', 'time'])
+w2v_feats = get_w2v_feats(all_data,
+                          f1_list = ['sn'],
+                          f2_list = ['msg_list', 'msg_0', 'msg_1', 'msg_2'])
+# Build time_diff_feats_v2
+time_diff_feats_v2 = get_time_diff_feats_v2(all_data)
+# Build server_model_time_interval_stat_fea
+server_model_time_interval_stat_fea = get_server_model_time_interval_stat_fea(all_data)
+
+msg_text_fea = get_msg_text_fea_all(all_data)
+# Build time-difference features
+duration_minutes_fea = get_duration_minutes_fea(train, test)
+
+# Build server_model features
+server_model_fea = get_server_model_fea(train, test)
+counter = get_word_counter(train)
+
+# Build nearest_msg features
+nearest_msg_fea = get_nearest_msg_fea(train, test)
+# Build server_model beta_target features
+beta_target_fea = get_beta_target(train, test)
+
+key = ['sn', 'fault_time', 'label', 'server_model']
+
+fea_num = len(KEY_WORDS)
+time_list = [i * TIME_INTERVAL for i in next_time_list]
+train = get_feature(train, time_list, KEY_WORDS, fea_num, key=['sn', 'fault_time', 'label', 'server_model'])
+test = get_feature(test, time_list, KEY_WORDS, fea_num, key=['sn', 'fault_time', 'server_model'])
+
+print('Adding time-difference features')
+train = train.merge(duration_minutes_fea, on=['sn', 'fault_time', 'server_model'])
+test = test.merge(duration_minutes_fea, on=['sn', 'fault_time', 'server_model'])
+
+print('Adding server_model features')
+train = train.merge(server_model_fea, on=['sn', 'server_model'])
+test = test.merge(server_model_fea, on=['sn', 'server_model'])
+
+print('Adding w2v_feats')
+train = train.merge(w2v_feats, on=['sn'])
+test = test.merge(w2v_feats, on=['sn'])
+
+print('Adding nearest_msg features')
+train = train.merge(nearest_msg_fea, on=['sn', 'server_model', 'fault_time'])
+test = test.merge(nearest_msg_fea, on=['sn', 'server_model', 'fault_time'])
+
+print('Adding beta_target features')
+train = train.merge(beta_target_fea, on=['sn', 'server_model', 'fault_time'])
+test = test.merge(beta_target_fea, on=['sn', 'server_model', 'fault_time'])
+
+server_model_sn_fea_2 = get_server_model_sn_fea_2(train, test)
+print('Adding server_model_sn_fea_2 features')
+train = train.merge(server_model_sn_fea_2, on=['sn', 'server_model'])
+test = test.merge(server_model_sn_fea_2, on=['sn', 'server_model'])
+
+print('Adding time_diff_feats_v2 features')
+train = train.merge(time_diff_feats_v2, on=['sn', 'server_model', 'fault_time'])
+test = test.merge(time_diff_feats_v2, on=['sn', 'server_model', 'fault_time'])
+
+# test.to_csv(os.path.join(GENERATION_DIR,'test.csv'),index =False)
+# train.to_csv(os.path.join(GENERATION_DIR,'train.csv'),index =False)
+
+# crashdump_venus_fea = pd.read_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea.csv') )
+# print('Adding crashdump_venus_fea features')
+# print(train.shape,test.shape,crashdump_venus_fea.shape)
+# train = train.merge(crashdump_venus_fea, on=['sn' , 'fault_time'],how = 'left')
+# test = test.merge(crashdump_venus_fea, on=['sn', 'fault_time' ],how = 'left')
+# print(train.shape,test.shape )
+
+# crashdump_venus_fea = pd.read_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea_v1.csv') )
+# print('Adding crashdump_venus_fea features')
+# print(train.shape,test.shape,crashdump_venus_fea.shape)
+# train = train.merge(crashdump_venus_fea, on=['sn' , 'fault_time'],how = 'left')
+# test = test.merge(crashdump_venus_fea, on=['sn', 'fault_time' ],how = 'left')
+# print(train.shape,test.shape )
+# test.to_csv(os.path.join(GENERATION_DIR,'test.csv'),index =False)
+# train.to_csv(os.path.join(GENERATION_DIR,'train.csv'),index =False)
+# print('Adding key_for_top_fea features')
+# train,test = get_key_for_top_fea(train,test)
+
+# print('Adding w2v_tfidf_doc2v_fea features')
+# w2v_tfidf_fea = pd.read_csv(os.path.join(GENERATION_DIR,'w2v_tfidf_fea.csv'))
+# drop_cols = [i for i in w2v_tfidf_fea if 'doc2vec' in i ]+[i for i in w2v_tfidf_fea if 'tfidf' in i ]
+# for col in drop_cols:
+#     del w2v_tfidf_fea[col]
+#
+# train = train.merge(w2v_tfidf_fea, on=['sn'  ], how='left')
+# test = test.merge(w2v_tfidf_fea, on=['sn' ], how='left')
+
+# print('Adding keyword cross features')
+# train,test = get_key_word_cross_fea(train,test)
+
+# print('Adding server_model_time_interval_stat_fea features')
+# train = train.merge(server_model_time_interval_stat_fea, on=['server_model' ],how ='left')
+# test = test.merge(server_model_time_interval_stat_fea, on=['server_model'  ],how ='left')
+
+
+use_less_cols_1 = ['last_last_msg_cnt', 'last_first_msg_cnt','time_diff_1_min',
+       'last_msg_list_unique_LabelEnc', 'last_msg_0_unique_LabelEnc',
+       'last_msg_1_unique_LabelEnc', 'last_msg_2_unique_LabelEnc',
+       'last_msg_list_list_LabelEnc', 'last_msg_0_list_LabelEnc',
+       'last_msg_1_list_LabelEnc', 'last_msg_2_list_LabelEnc',
+       'last_msg_0_first_LabelEnc', 'last_msg_1_first_LabelEnc',
+       'last_msg_2_first_LabelEnc', 'last_msg_0_last_LabelEnc',
+       'last_msg_1_last_LabelEnc', 'last_msg_2_last_LabelEnc',
+       'last_msg_last_LabelEnc', 'last_msg_first_LabelEnc']
+
+use_less_col = [i for i in train.columns if train[i].nunique() < 2] + use_less_cols_1
+
+
+print(f'use_less_col:{len(use_less_col)}')
+use_cols = [i for i in train.columns if i not in ['sn', 'fault_time', 'label', 'server_model'] + use_less_col]
+
+use_cols = sorted(use_cols)
+
+# Every label-encoded column is passed to the model as a categorical feature.
+cat_cols = [i for i in use_cols if '_LabelEnc' in i]
+print('Number of features used:', len(use_cols), 'number of categorical features:', len(cat_cols))
+# fs = FeatureSelector(data=train[use_cols], labels=train['label'])
+#
+# # Identify features whose share of missing values exceeds 90%
+# fs.identify_missing(missing_threshold=0.9)
+#
+# # # Inspect the selected features
+# # fs.ops['missing']
+# # Without one-hot encoding the features (one_hot=False), identify features with correlation above 0.99
+# fs.identify_collinear(correlation_threshold=0.99, one_hot=False)
+#
+# # # Inspect the selected features
+# # fs.ops['collinear']
+#
+# # Identify features that take a single unique value
+# fs.identify_single_unique()
+#
+# # # Inspect the selected features
+# # fs.ops['single_unique']
+#
+# train_removed = fs.remove(methods = ['missing', 'single_unique', 'collinear',], keep_one_hot=False)
+# use_cols = train_removed.columns
+# print('Number of features used after feature selection:',len(use_cols))
+
+
+oof_prob = np.zeros((train.shape[0], 4))
+test_prob = np.zeros((test.shape[0], 4))
+# seeds = [42,4242,40424,1024,2048]
+seeds = [42]
+for seed in seeds:
+    # Keep the per-seed outputs in their own names so the accumulators above are not overwritten.
+    oof_prob_seed, test_prob_seed, fea_imp_df, model_list = run_cbt(train[use_cols], train[['label']], test[use_cols], k=5,
+                                                                    seed=seed, cat_cols=cat_cols)
+    oof_prob += oof_prob_seed / len(seeds)
+    test_prob += test_prob_seed / len(seeds)
+
+
+weight = search_weight(train, train[['label']], oof_prob, init_weight=[1.0], class_num=4, step=0.001)
+oof_prob = oof_prob * np.array(weight)
+test_prob = test_prob * np.array(weight)
+
+target_df = train[['sn', 'fault_time', 'label']]
+submit_df = train[['sn', 'fault_time']].copy()
+submit_df['label'] = oof_prob.argmax(axis=1)
+
+score = macro_f1(target_df=target_df, submit_df=submit_df)
+print(f'********************** BEST MACRO_F1 : {score} **********************')
+score = round(score, 5)
+
+y_pred = test_prob.argmax(axis=1)
+result = test[['sn', 'fault_time']].copy()
+result['label'] = y_pred
+result = preliminary_submit_dataset_a.merge(result, on=['sn', 'fault_time'], how='left')[['sn', 'fault_time', 'label']]
+result['label'] = result['label'].fillna(0).astype(int)
+
+result.to_csv(os.path.join(RESULT_DIR,f'catboost_result.csv'), index=False)
+print(result['label'].value_counts())
+fea_imp_df = fea_imp_df.reset_index(drop = True)
+fea_imp_df.to_csv(os.path.join(RESULT_DIR,f'./cat_fea_imp_{int(score*100000)}.csv'),index = False)
+
+train_result_prob = pd.DataFrame(oof_prob).add_prefix('cat_class_')
+test_result_prob = pd.DataFrame(test_prob).add_prefix('cat_class_')
+train_result_prob['label'] = train['label']
+train_result_prob['sn'] = train['sn']
+train_result_prob['fault_time'] = train['fault_time']
+test_result_prob['sn'] = test['sn']
+test_result_prob['fault_time'] = test['fault_time']
+
+result_prob = pd.concat([train_result_prob,test_result_prob],ignore_index = True)
+result_prob.to_csv(os.path.join(RESULT_DIR,f'cat_prob_result.csv'),index = False)
+
+end_time = datetime.datetime.now()
+cost_time = end_time - start_time
+print('****************** CATBOOST COST TIME : ',str(cost_time),' ******************')
+
+'''
+
+v7: best so far, offline score 0.7303
+v8: v7 plus keyword cross features, fed to the model as categorical variables, 0.73114
+
+'''
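The seed loop in `catboost_fs.py` is meant to average out-of-fold and test probabilities over several seeds. A minimal sketch of that accumulation pattern follows. `fake_run` is a hypothetical stand-in for `run_cbt` (whose implementation is not part of this diff) and simply returns random probability matrices; only the accumulation logic is the point here.

```python
import numpy as np


def fake_run(n_train, n_test, n_classes, seed):
    """Hypothetical stand-in for run_cbt: returns random class probabilities."""
    rng = np.random.default_rng(seed)
    oof = rng.dirichlet(np.ones(n_classes), size=n_train)
    test = rng.dirichlet(np.ones(n_classes), size=n_test)
    return oof, test


def average_over_seeds(seeds, n_train, n_test, n_classes=4):
    """Average out-of-fold and test probabilities over several seeds."""
    oof_avg = np.zeros((n_train, n_classes))
    test_avg = np.zeros((n_test, n_classes))
    for seed in seeds:
        # Per-seed results get their own names so the accumulators survive.
        oof_seed, test_seed = fake_run(n_train, n_test, n_classes, seed)
        oof_avg += oof_seed / len(seeds)
        test_avg += test_seed / len(seeds)
    return oof_avg, test_avg


oof_avg, test_avg = average_over_seeds([42, 4242, 1024], n_train=6, n_test=3)
```

Because each per-seed row sums to 1, the averaged rows also sum to 1, which is a cheap sanity check on the accumulation.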
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/generate_feature.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/generate_feature.py
@ -0,0 +1,836 @@
+import datetime
+import os
+import pickle
+from collections import Counter
+
+import numpy as np
+import pandas as pd
+from gensim.models import Word2Vec
+from scipy import stats
+from tqdm import tqdm
+
+from utils import GENERATION_DIR, KEY_1, KEY_2, KEY_3, KEY_4, get_new_cols
+
+
+def cat2num(df, cat_cols, Transfer2num=True):
+    '''
+    Encode categorical columns.
+
+    :param df: input DataFrame
+    :param cat_cols: list of categorical feature columns
+    :param Transfer2num: if True, map each category to an integer code in a new
+        `{col}_LabelEnc` column; otherwise cast the columns to the pandas category dtype
+    :return: the DataFrame with the encoded columns
+    '''
+    if Transfer2num:
+
+        print('Converting categorical features to numeric codes')
+        for col in cat_cols:
+
+            if not os.path.exists(os.path.join(GENERATION_DIR, f'{col}_map.pkl')):
+                print(f'Encoding: {col}')
+                tmp_map = dict(zip(df[col].unique(), range(df[col].nunique())))
+                with open(os.path.join(GENERATION_DIR, f'{col}_map.pkl'), 'wb') as f:
+                    pickle.dump(tmp_map, f)
+            else:
+                # Reuse the cached mapping so codes stay stable across runs.
+                with open(os.path.join(GENERATION_DIR, f'{col}_map.pkl'), 'rb') as f:
+                    tmp_map = pickle.load(f)
+            # Values missing from the mapping are encoded as -1.
+            df[f'{col}_LabelEnc'] = df[col].map(tmp_map).fillna(-1).astype(int)
+    else:
+        print('Casting categorical features to the category dtype')
+        for col in cat_cols:
+            df[col] = df[col].astype('category')
+    print('Categorical feature conversion done.')
+    return df
+
+def add_minutes(x, minutes=5):
+    dt = datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
+    out_date = (dt + datetime.timedelta(minutes=minutes)
+                ).strftime('%Y-%m-%d %H:%M:%S')
+    return out_date
+
+
+def time_process(df, time_cols, minutes_):
+    df[f'time_{minutes_}'] = df[time_cols].apply(
+        lambda x: add_minutes(str(x), minutes_))
+    return df
+
+
+def get_fea(x, fea):
+    if fea in x:
+        return 1
+    else:
+        return 0
+
+
+def get_last_msg_cnt(x):
+    last_msg = x[-1]
+    cnt = x.count(last_msg)
+    return cnt
+
+
+def get_first_msg_cnt(x):
+    first_msg = x[0]
+    cnt = x.count(first_msg)
+    return cnt
+
+
+def add_last_next_time4fault(label, preliminary_submit_dataset_a,
+                             time_interval, next_time_list):
+    print(f'Adding time points before and after each fault time, scaled by time_interval={time_interval}')
+    for i in tqdm([-i for i in next_time_list] + next_time_list):
+        label = time_process(label, 'fault_time', i * time_interval)
+        preliminary_submit_dataset_a = time_process(
+            preliminary_submit_dataset_a, 'fault_time', i * time_interval)
+
+    return label, preliminary_submit_dataset_a
+
+
+def get_msg_text_fea(df, msg_type='last'):
+    print(f'Building msg text {msg_type} features')
+
+    df_fea = df.groupby(['sn', 'fault_time']).agg(
+        {'msg_list': 'sum', 'msg_0': 'sum', 'msg_1': 'sum', 'msg_2': 'sum'}).reset_index()
+    df_fea['msg_list_unique'] = df_fea['msg_list'].apply(lambda x: str(set(x)))
+    df_fea['msg_0_unique'] = df_fea['msg_0'].apply(lambda x: str(set(x)))
+    df_fea['msg_1_unique'] = df_fea['msg_1'].apply(lambda x: str(set(x)))
+    df_fea['msg_2_unique'] = df_fea['msg_2'].apply(lambda x: str(set(x)))
+
+    df_fea['msg_list_list'] = df_fea['msg_list'].apply(lambda x: str(x))
+    df_fea['msg_0_list'] = df_fea['msg_0'].apply(lambda x: str(x))
+    df_fea['msg_1_list'] = df_fea['msg_1'].apply(lambda x: str(x))
+    df_fea['msg_2_list'] = df_fea['msg_2'].apply(lambda x: str(x))
+
+    df_fea['msg_0_first'] = df_fea['msg_0'].apply(lambda x: x[0])
+    df_fea['msg_1_first'] = df_fea['msg_1'].apply(lambda x: x[0])
+    df_fea['msg_2_first'] = df_fea['msg_2'].apply(lambda x: x[0])
+
+    df_fea['msg_0_last'] = df_fea['msg_0'].apply(lambda x: x[-1])
+    df_fea['msg_1_last'] = df_fea['msg_1'].apply(lambda x: x[-1])
+    df_fea['msg_2_last'] = df_fea['msg_2'].apply(lambda x: x[-1])
+
+    df_fea['msg_last'] = df.groupby(['sn', 'fault_time']).apply(
+        lambda x: x['msg'].to_list()[-1]).values
+    df_fea['msg_first'] = df.groupby(['sn', 'fault_time']).apply(
+        lambda x: x['msg'].to_list()[0]).values
+
+    df_fea['last_msg_cnt'] = df_fea['msg_list'].apply(
+        lambda x: get_last_msg_cnt(x))
+    df_fea['first_msg_cnt'] = df_fea['msg_list'].apply(
+        lambda x: get_first_msg_cnt(x))
+    cat_cols = ['msg_list', 'msg_0', 'msg_1', 'msg_2',
+                'msg_list_unique', 'msg_0_unique', 'msg_1_unique', 'msg_2_unique',
+                'msg_list_list', 'msg_0_list', 'msg_1_list', 'msg_2_list',
+                'msg_0_first', 'msg_1_first', 'msg_2_first', 'msg_0_last', 'msg_1_last',
+                'msg_2_last', 'msg_last', 'msg_first']
+    num_cols = ['last_msg_cnt', 'first_msg_cnt']
+    id_cols = ['sn', 'fault_time']
+
+    df_fea = df_fea.rename(
+        columns={
+            i: f'{msg_type}_{i}' for i in (cat_cols + num_cols)})
+    cat_cols = [f'{msg_type}_{i}' for i in cat_cols]
+    for cat_col in cat_cols:
+        df_fea[cat_col] = df_fea[cat_col].astype(str)
+    df_fea = cat2num(df_fea, cat_cols, Transfer2num=True)
+    for i in cat_cols:
+        del df_fea[i]
+    return df_fea
+
+def add_w2v_feats(all_data, w2v_feats_df, f1, f2, emb_size=32, window=5, min_count=5):
+    print(f'Building {f1}_{f2}_w2v features')
+
+    df_fea = all_data.groupby(f1).agg({f2: 'sum'}).reset_index()
+    df_emb = df_fea[[f1]]
+    sentences = df_fea[f2].to_list()
+    model_path = os.path.join(GENERATION_DIR, f'{f1}_{f2}_w2v_model.pkl')
+    if not os.path.exists(model_path):
+        print(f'{f1}_{f2}_w2v_model not found, training......')
+        model = Word2Vec(sentences, vector_size=emb_size, window=window,
+                         min_count=min_count, sg=0, hs=1, seed=42)
+        with open(model_path, 'wb') as f:
+            pickle.dump(model, f)
+    else:
+        print(f'{f1}_{f2}_w2v_model found, loading......')
+        with open(model_path, 'rb') as f:
+            model = pickle.load(f)
+
+    # Sentence embedding = mean of the vectors of in-vocabulary tokens (zeros if none).
+    emb_matrix_mean = []
+    for sent in sentences:
+        vec = []
+        for w in sent:
+            if w in model.wv:
+                vec.append(model.wv[w])
+        if len(vec) > 0:
+            emb_matrix_mean.append(np.mean(vec, axis=0))
+        else:
+            emb_matrix_mean.append([0] * emb_size)
+    df_emb_mean = pd.DataFrame(emb_matrix_mean).add_prefix(f'{f1}_{f2}_w2v_')
+
+    df_emb = pd.concat([df_emb, df_emb_mean], axis=1)
+    w2v_feats_df = w2v_feats_df.merge(df_emb, on=f1, how='left')
+    return w2v_feats_df
+
+
+def get_w2v_feats(all_data, f1_list, f2_list):
+    all_data['msg_list'] = all_data['msg'].apply(lambda x: [i.strip() for i in x.split(' | ')])
+    all_data['msg_0'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 0)])
+    all_data['msg_1'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 1)])
+    all_data['msg_2'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 2)])
+    w2v_feats_df = all_data[f1_list].drop_duplicates()
+    for f1 in f1_list:
+        for f2 in f2_list:
+            w2v_feats_df = add_w2v_feats(all_data,w2v_feats_df,f1,f2,emb_size = 10,window = 5,min_count  =5,)
+    print(f'w2v_feats shape: {w2v_feats_df.shape}')
+    return w2v_feats_df
+
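The pooling step in `add_w2v_feats` turns each token list into one fixed-length vector by averaging the vectors of the tokens the model knows. A minimal sketch of that step, assuming a plain dict of toy vectors in place of a trained gensim model (`mean_pool` and `toy_vectors` are illustrative names, not part of the repository):

```python
import numpy as np


def mean_pool(sentences, vectors, emb_size):
    """Average the vectors of known tokens; return an all-zero row when none are known."""
    rows = []
    for sent in sentences:
        known = [vectors[w] for w in sent if w in vectors]
        rows.append(np.mean(known, axis=0) if known else np.zeros(emb_size))
    return np.vstack(rows)


# Toy 2-dimensional "embeddings"; a trained Word2Vec model's `wv` would play this role.
toy_vectors = {
    "Memory": np.array([1.0, 0.0]),
    "Drive Fault": np.array([0.0, 1.0]),
}
pooled = mean_pool([["Memory", "Drive Fault"], ["unseen token"]], toy_vectors, emb_size=2)
```

The zero-row fallback matters because `min_count=5` drops rare tokens from the vocabulary, so some groups may contain no in-vocabulary token at all.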
+
+
+def get_time_diff_feats_v2(all_data):
+    print('Building time-difference features time_diff_feats_v2')
+    all_data['duration_seconds'] = all_data['time_interval']
+    all_data['duration_minutes'] = all_data['time_interval'] / 60
+    df_merge_log = all_data[['sn', 'fault_time', 'label', 'time', 'msg',
+                             'server_model', 'time_interval', 'duration_seconds',
+                             'duration_minutes']]
+    df_merge_log['fault_id'] = df_merge_log['sn'] + '_' + df_merge_log['fault_time'] + '_' + df_merge_log[
+        'server_model']
+    f1_list = ['fault_id', 'sn', 'server_model']
+    f2_list = ['duration_minutes', 'duration_seconds']
+    time_diff_feats_v2 = df_merge_log[['sn', 'fault_time', 'fault_id', 'server_model']].drop_duplicates().reset_index(
+        drop=True)
+
+    for f1 in f1_list:
+        for f2 in f2_list:
+            func_opt = ['count', 'nunique', 'min', 'max', 'median', 'sum']
+            for opt in func_opt:
+                tmp = df_merge_log.groupby([f1])[f2].agg([(f'{f2}_in_{f1}_' + opt, opt)]).reset_index()
+                # print(f'{f1}_in_{f2}_{opt}:{tmp.shape}' )
+                time_diff_feats_v2 = time_diff_feats_v2.merge(tmp, on=f1, how='left')
+
+            temp = df_merge_log.groupby([f1])[f2].apply(lambda x: stats.mode(x)[0][0])
+            time_diff_feats_v2[f'{f2}_in_{f1}_mode'] = time_diff_feats_v2[f1].map(temp).fillna(np.nan)
+            secs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
+            for sec in secs:
+                temp = df_merge_log.groupby([f1])[f2].quantile(sec).reset_index(
+                    name=f'log_{f2}_in_{f1}_quantile_' + str(sec * 100))
+                # print(f'log_{f1}_in_{f2}_quantile_{str(sec * 100)}:{tmp.shape}' )
+                time_diff_feats_v2 = pd.merge(time_diff_feats_v2, temp, on=f1, how='left')
+    del time_diff_feats_v2['fault_id']
+    return time_diff_feats_v2
+
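`get_time_diff_feats_v2` builds per-group statistics (count, min, max, median, mode, quantiles) of the time gaps between log entries and the fault time. A compact sketch of the same idea on a toy frame follows; the column names mirror the source, but the data is invented, and `Series.mode` is used here as a swapped-in alternative to `scipy.stats.mode`, not as the author's method.

```python
import pandas as pd

# Invented toy data: three log lines for server "a", one for server "b".
log = pd.DataFrame({
    "sn": ["a", "a", "a", "b"],
    "duration_minutes": [1.0, 3.0, 3.0, 10.0],
})

# Basic per-group statistics, one column per aggregation.
stats_df = (
    log.groupby("sn")["duration_minutes"]
    .agg(["count", "nunique", "min", "max", "median", "sum"])
    .add_prefix("duration_minutes_in_sn_")
    .reset_index()
)

# Most frequent value per group; Series.mode returns sorted values, so take the first.
mode_map = log.groupby("sn")["duration_minutes"].agg(lambda s: s.mode().iloc[0])
stats_df["duration_minutes_in_sn_mode"] = stats_df["sn"].map(mode_map)

# One quantile column, merged back on the group key.
q50 = log.groupby("sn")["duration_minutes"].quantile(0.5).reset_index(
    name="duration_minutes_in_sn_quantile_50")
stats_df = stats_df.merge(q50, on="sn", how="left")
```

Computing all basic aggregations in one `agg` call avoids the repeated merges of the original loop, at the cost of slightly different column naming.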
+def get_feature(data, time_list, log_fea, fea_num, key):
+    print(f'Current data shape: {data.shape}')
+    fea_df_list = []
+    fea_cnt_list = ['OEM record c2', 'Processor CPU_Core_Error', '001c4c', 'System Event Sys_Event','OEM CPU0 MCERR',
+                    'OEM CPU0 CATERR', 'Reading 0 &lt; Threshold 2 degrees C', '0203c0a80101',
+                    'Unknown CPU0 MCERR', 'Unknown CPU0 CATERR','Memory', 'Correctable ECC logging limit reached',
+                    'Memory MEM_CHE0_Status', 'Memory Memory_Status',  'Memory #0x87', 'Memory CPU0F0_DIMM_Stat',
+                    'Drive Fault', 'NMI/Diag Interrupt', 'Failure detected',  'Power Supply AC lost', ]
+    for time_tmp in tqdm(time_list):
+        print(f'Aggregating logs within {time_tmp} min before and after the fault')
+        tmp1 = data[(pd.to_datetime(data['time']) < pd.to_datetime(data[f'time_{time_tmp}'])) & (pd.to_datetime(data['time']) > pd.to_datetime(data[f'time_-{time_tmp}']))].sort_values(
+            ['sn', 'fault_time'])
+        tmp1 = tmp1.groupby(key).apply(
+            lambda x: ' | '.join(x['msg'].to_list())).reset_index().rename(columns={0: 'msg'})
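+        tmp1['msg_len'] = tmp1['msg'].apply(lambda x: len(x.split(' | ')))
+        #         tmp1[f'msg_len_two'] = tmp1['msg'].apply(lambda x: len(x))
+        # Count of numeric tokens (disabled)
+        # tmp1[f'msg_num_two'] = tmp1['msg'].apply(
+        #     lambda x: len([int(s) for s in re.findall(r'\b\d+\b', x)]))
+        print(f'Extracting {fea_num} sparse features from logs within {time_tmp} min before and after the fault')
+        # Iterate over log_fea only. Looping over a list while appending to it would also visit
+        # 'msg_len' and the new '*_cnt' names and pass them to get_fea, overwriting those columns.
+        cnt_cols = []
+        for fea in log_fea:
+            tmp1[fea] = tmp1['msg'].apply(lambda x: get_fea(x, fea))
+            # Add count features
+            if fea in fea_cnt_list:
+                # Count phrase occurrences on the raw string: splitting into single words first
+                # can never match a multi-word phrase such as 'OEM record c2'.
+                tmp1[f'{fea}_cnt'] = tmp1['msg'].apply(lambda x: x.count(fea))
+                cnt_cols.append(f'{fea}_cnt')
+        feature = log_fea + ['msg_len'] + cnt_cols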
+        tmp1_new_col_map = {i: i + '_' + str(int(time_tmp)) for i in feature}
+        tmp1 = tmp1.rename(columns=tmp1_new_col_map)
+        del tmp1['msg']
+        fea_df_list.append(tmp1)
+    fea_df = fea_df_list[-1]
+    print(fea_df.shape)
+    for i in fea_df_list[:-1]:
+        fea_df = fea_df.merge(i, on=key, how='left')
+        print(fea_df.shape)
+    return fea_df
+
+
+def get_msg_location(x, num):
+    # Return the num-th field of a split message, or the placeholder '其它' ("other") if it is missing.
+    try:
+        return x[num]
+    except IndexError:
+        return '其它'
+
+
+def get_nearest_msg_fea(train, test):
+    print('Generating nearest_msg features')
+    df = pd.concat([train, test], axis=0, ignore_index=True)
+    # Despite the column name, duration_minutes holds seconds (fault_time minus log time).
+    df['duration_minutes'] = (pd.to_datetime(df['fault_time']) - pd.to_datetime(df['time'])).apply(
+        lambda x: x.total_seconds())
+    df = df.sort_values(
+        ['sn', 'server_model', 'fault_time', 'time']).reset_index(drop=True)
+    df['duration_minutes_abs'] = np.abs(df['duration_minutes'])
+
+    df['duration_minutes_abs_rank'] = df.groupby(['sn', 'server_model', 'fault_time'])['duration_minutes_abs'].rank(
+        method='first', ascending=False)
+
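+    # Descending sort followed by keep='first' retains, for each fault, the log entry
+    # with the largest absolute time gap to fault_time.
+    key = ['sn', 'server_model', 'fault_time', 'duration_minutes_abs']
+    df = df.sort_values(key, ascending=False)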
+    df = df.drop_duplicates(
+        ['sn', 'server_model', 'fault_time', ], keep='first')
+
+    df.loc[df['duration_minutes'] ==
+           df['duration_minutes_abs'], 'last_or_next'] = 1
+    df.loc[df['duration_minutes'] !=
+           df['duration_minutes_abs'], 'last_or_next'] = 0
+    df['msg_cnt'] = df['msg'].map(df['msg'].value_counts())
+    df['msg_0'] = df['msg'].apply(
+        lambda x: get_msg_location(
+            x.split(' | '), 0))
+    df['msg_0_cnt'] = df['msg_0'].map(df['msg_0'].value_counts())
+    df['msg_1'] = df['msg'].apply(
+        lambda x: get_msg_location(
+            x.split(' | '), 1))
+    df['msg_1_cnt'] = df['msg_1'].map(df['msg_1'].value_counts())
+    df['msg_2'] = df['msg'].apply(
+        lambda x: get_msg_location(
+            x.split(' | '), 2))
+    df['msg_2_cnt'] = df['msg_2'].map(df['msg_2'].value_counts())
+    cat_feats = ['msg', 'msg_0', 'msg_1',
+                 'msg_2']  # ,'server_model_day_date','server_model_dayofmonth','server_model_dayofweek','server_model_hour']
+    # for name in cat_feats:
+    #     le = LabelEncoder()
+    #     df[f'{name}_LabelEnc'] = le.fit_transform(df[name])
+    df = cat2num(df,cat_feats)
+    df = df.drop_duplicates().reset_index(drop=True)
+    df = df[['sn', 'server_model', 'fault_time', 'msg_cnt',
+             'msg_0_cnt', 'msg_1_cnt', 'msg_2_cnt',
+             #              'duration_minutes_abs','duration_minutes', 'duration_minutes_abs_rank',
+             'last_or_next', 'msg_LabelEnc', 'msg_0_LabelEnc', 'msg_1_LabelEnc', 'msg_2_LabelEnc']]
+    print(f'nearest_msg features done, shape: {df.shape}')
+    return df
+
+def get_server_model_time_interval_stat_fea(all_data):
+    server_model_time_interval_stat_fea = all_data.groupby('server_model').agg({'time_interval':['min','max','mean','median']}).reset_index()
+    server_model_time_interval_stat_fea = get_new_cols(server_model_time_interval_stat_fea,key = ['server_model' ])
+
+    server_model_time_interval_stat_fea.columns  = ['server_model', 'sm_time_interval_min', 'sm_ttime_interval_max',
+           'sm_ttime_interval_mean', 'sm_ttime_interval_median']
+    return server_model_time_interval_stat_fea
+
+def get_server_model_sn_fea_2(train, test):
+    df = pd.concat([train[['sn', 'server_model']],
+                   test[['sn', 'server_model']]], ignore_index=True)
+    df['server_model_count_sn_2'] = df.groupby(
+        ['server_model'])['sn'].transform('count')
+    df['server_model_nunique_sn_2'] = df.groupby(
+        ['server_model'])['sn'].transform('nunique')
+    df['sn_cnt_2'] = df['sn'].map(df['sn'].value_counts())
+    return df.drop_duplicates().reset_index(drop=True)
+
+
+def get_4_time_stat_fea(df):
+    print('     Generating time statistics features')
+    time_stat_fea_df = df.groupby(['sn', 'fault_time', 'server_model']).agg(
+        {'duration_minutes': ['min', 'max', 'mean', 'median', 'skew', 'sum', 'std', 'count'],
+         'log_duration_minutes': ['min', 'max', 'mean', 'median', 'skew', 'sum', 'std'],
+         'time_diff_1': ['min', 'max', 'mean', 'median', 'skew', 'sum', 'std'],
+         'log_time_diff_1': ['min', 'max', 'median'],
+         }).reset_index()
+    new_time_stat_cols = []
+    for i in time_stat_fea_df.columns:
+        if i[0] in ['sn', 'fault_time', 'server_model']:
+            new_time_stat_cols.append(i[0])
+        else:
+            new_time_stat_cols.append(f'{i[0]}_{i[1]}')
+            #             print(f'{i[0]}_{i[1]}')
+            time_stat_fea_df.loc[time_stat_fea_df[i[0]]
+                                 [i[1]] == -np.inf, (i[0], i[1])] = -20
+            time_stat_fea_df.loc[time_stat_fea_df[i[0]]
+                                 [i[1]] == np.inf, (i[0], i[1])] = 30
+    time_stat_fea_df.columns = new_time_stat_cols
+    time_stat_fea_df['duration_minutes_range'] = time_stat_fea_df['duration_minutes_max'] - time_stat_fea_df[
+        'duration_minutes_min']
+    time_stat_fea_df['log_duration_minutes_range'] = time_stat_fea_df['log_duration_minutes_max'] - time_stat_fea_df[
+        'log_duration_minutes_min']
+    time_stat_fea_df['time_diff_1_range'] = time_stat_fea_df['time_diff_1_max'] - \
+        time_stat_fea_df['time_diff_1_min']
+    time_stat_fea_df['log_time_diff_1_range'] = time_stat_fea_df['log_time_diff_1_max'] - time_stat_fea_df[
+        'log_time_diff_1_min']
+    time_stat_fea_df['duration_minutes_freq'] = time_stat_fea_df['duration_minutes_range'] / time_stat_fea_df[
+        'duration_minutes_count']
+    print(f'    Time statistics features done, shape: {time_stat_fea_df.shape}')
+    return time_stat_fea_df
+
+
+def get_time_std_fea(train, test):
+    print('Generating time std features')
+    df = pd.concat([train, test], axis=0, ignore_index=True)
+    # df['year'] = df['time'].dt.year
+    # df['month'] = df['time'].dt.month
+    df['hour'] = df['time'].dt.hour
+    # df['week'] = df['time'].dt.week
+    df['minute'] = df['time'].dt.minute
+    time_std = df.groupby(['sn', 'server_model']).agg(
+        {'hour': 'std', 'minute': 'std'}).reset_index()
+    time_std = time_std.rename(
+        columns={
+            'hour': 'hour_std',
+            'minute': 'minute_std'})
+    return time_std
+
+
+def get_key(all_data):
+    all_data['msg_list'] = all_data['msg'].apply(lambda x: [i.strip() for i in x.split(' | ')])
+    class_fea_cnt_list = []
+    for label in [0,1,2,3]:
+        class_df = all_data.query(f'label =={label}')
+        counter = Counter()
+        for i in class_df['msg_list']:
+            counter.update(i)
+        class_fea_cnt = pd.DataFrame({i[0]:i[1] for i in counter.most_common()},index = [f'fea_cnt_{label}']).T.reset_index().rename(columns = {'index':'fea'})
+        class_fea_cnt_list.append(class_fea_cnt)
+
+    fea_cnt_df = class_fea_cnt_list[0]
+    for tmp in class_fea_cnt_list[1:]:
+        fea_cnt_df = fea_cnt_df.merge(tmp,on = 'fea')
+
+    fea_cnt_df['fea_cnt_sum'] = fea_cnt_df.loc[:,['fea_cnt_0', 'fea_cnt_1', 'fea_cnt_2', 'fea_cnt_3']].sum(1)
+
+    all_fea_cnt = fea_cnt_df['fea_cnt_sum'].sum()
+
+    for i in ['fea_cnt_0', 'fea_cnt_1', 'fea_cnt_2', 'fea_cnt_3']:
+        fea_cnt_df[f'{i}_ratio'] = fea_cnt_df[i]/fea_cnt_df['fea_cnt_sum']
+        fea_cnt_df[f'{i}_all_ratio'] = fea_cnt_df[i]/all_fea_cnt
+
+    fea_cnt_df['fea_cnt_ratio_std'] = fea_cnt_df.loc[:,['fea_cnt_0_ratio','fea_cnt_1_ratio','fea_cnt_2_ratio','fea_cnt_3_ratio', ]].std(1)
+    fea_cnt_df['fea_cnt_std'] = fea_cnt_df.loc[:,['fea_cnt_0', 'fea_cnt_1','fea_cnt_2','fea_cnt_3',]].std(1)
+
+    fea_cnt_df['fea_cnt_all_ratio_std'] = fea_cnt_df.loc[:,['fea_cnt_0_all_ratio','fea_cnt_1_all_ratio',
+           'fea_cnt_2_all_ratio','fea_cnt_3_all_ratio',]].std(1)
+
+    fea_cnt_df = fea_cnt_df[~fea_cnt_df['fea_cnt_ratio_std'].isnull()].sort_values('fea_cnt_ratio_std',ascending = False)
+
+    fea_cnt_df['fea_max'] = np.argmax(fea_cnt_df.loc[:,['fea_cnt_0', 'fea_cnt_1', 'fea_cnt_2', 'fea_cnt_3',]].values,axis = 1)
+    key_0 = fea_cnt_df.query('fea_max ==0 ')['fea'].to_list()
+    key_1 = fea_cnt_df.query('fea_max ==1 ')['fea'].to_list()
+    key_2 = fea_cnt_df.query('fea_max ==2 ')['fea'].to_list()
+    key_3 = fea_cnt_df.query('fea_max ==3 ')['fea'].to_list()
+    # key_1 = ['OEM record c2','Processor CPU_Core_Error','001c4c','System Event Sys_Event','Power Supply PS0_Status','Temperature CPU0_Margin_Temp','Reading 51 &gt; Threshold 85 degrees C','Lower Non-critical going low','Temperature CPU1_Margin_Temp','System ACPI Power State #0x7d','Lower Critical going low']
+    # key_2 = ['OEM CPU0 MCERR','OEM CPU0 CATERR','Reading 0 &lt; Threshold 2 degrees C','0203c0a80101','Unknown CPU0 MCERR','Unknown CPU0 CATERR','Microcontroller #0x3b','System Boot Initiated','Processor #0xfa','Power Unit Pwr Unit Status','Hard reset','Power off/down','System Event #0xff','Memory CPU1A1_DIMM_Stat','000000','Power cycle','OEM record c3','Memory CPU1C0_DIMM_Stat','Reading 0 &lt; Threshold 1 degrees C','IERR']
+    # key_3 = ['Memory','Correctable ECC logging limit reached','Memory MEM_CHE0_Status','Memory Memory_Status','Memory #0x87','Memory CPU0F0_DIMM_Stat','Memory Device Disabled','Memory #0xe2','OS Stop/Shutdown OS Status','System Boot Initiated System Restart','OS Boot BIOS_Boot_Up','System Boot Initiated BIOS_Boot_UP','Memory DIMM101','OS graceful shutdown','OS Critical Stop OS Status','Memory #0xf9','Memory CPU0C0_DIMM_Stat','Memory DIMM111','Memory DIMM021',]
+    # key_4 = ['Drive Fault','NMI/Diag Interrupt','Failure detected','Power Supply AC lost','Power Supply PSU0_Supply','AC out-of-range, but present','Predictive failure','Drive Present','Temperature Temp_DIMM_KLM','Temperature Temp_DIMM_DEF','Power Supply PS1_Status','Identify Status','Power Supply PS2_Status','Temperature DIMMG1_Temp','Upper Non-critical going high','Temperature DIMMG0_Temp','Upper Critical going high','Power Button pressed','System Boot Initiated #0xb8','Deasserted']
+    return key_0,key_1,key_2,key_3
+
+def get_class_key_words_nunique(all_data):
+    print('Generating class_key_words_nunique features')
+
+    key_0,key_1,key_2,key_3 = get_key(all_data)
+
+    df = all_data[['sn', 'fault_time', 'msg_list']]
+    df_tmp = df.groupby(['sn' ]).agg({'msg_list':'sum'}).reset_index()
+    df_tmp['class_0_key_words_nunique'] = df_tmp['msg_list'].apply(lambda x:len(set(x)&set(key_0)))
+    df_tmp['class_1_key_words_nunique'] = df_tmp['msg_list'].apply(lambda x:len(set(x)&set(key_1)))
+    df_tmp['class_2_key_words_nunique'] = df_tmp['msg_list'].apply(lambda x:len(set(x)&set(key_2)))
+    df_tmp['class_3_key_words_nunique'] = df_tmp['msg_list'].apply(lambda x:len(set(x)&set(key_3)))
+    del df_tmp['msg_list']
+    return df_tmp
+
+
+def get_key_for_top_fea(train,test):
+    KEY_FOR_TOP_COLS = []
+    print('Adding key_for_top_fea features')
+    for TIME in [3, 5, 10, 15, 30, 45, 60, 90, 120, 240, 360, 480, 540, 600,60000000]:
+        for i in range(10):
+            train[f'KEY_FOR_TOP_{i}_{TIME}'] = train[f'{KEY_1[i]}_{TIME}'].astype(str)+'_'+train[f'{KEY_2[i]}_{TIME}'].astype(str)+'_'+train[f'{KEY_3[i]}_{TIME}'].astype(str)+'_'+train[f'{KEY_4[i]}_{TIME}'].astype(str)
+            test[f'KEY_FOR_TOP_{i}_{TIME}'] = test[f'{KEY_1[i]}_{TIME}'].astype(str)+'_'+test[f'{KEY_2[i]}_{TIME}'].astype(str)+'_'+test[f'{KEY_3[i]}_{TIME}'].astype(str)+'_'+test[f'{KEY_4[i]}_{TIME}'].astype(str)
+            KEY_FOR_TOP_COLS.append(f'KEY_FOR_TOP_{i}_{TIME}')
+    train = cat2num(train,KEY_FOR_TOP_COLS)
+    test = cat2num(test,KEY_FOR_TOP_COLS)
+    for KEY_FOR_TOP_COL in KEY_FOR_TOP_COLS:
+        del train[KEY_FOR_TOP_COL]
+        del test[KEY_FOR_TOP_COL]
+    return train,test
+
+def get_key_word_cross_fea(train,test):
+    print('Generating keyword cross features...')
+    KEY_WORDS_MAP  = {'CPU0':KEY_1,'CPU1':KEY_2,'CPU2':KEY_3,'CPU3':KEY_4}
+    KEY_WORDS_CROSS_COLS =[]
+    for KEY_WORDS in KEY_WORDS_MAP:
+        for i in [3, 5, 10, 15, 30, 45, 60, 90, 120, 240, 360, 480, 540, 600,60000000]:
+            KEY_WORDS_COLS = [f'{col}_{i}' for col in KEY_WORDS_MAP[KEY_WORDS]]
+            train[f'{KEY_WORDS}_WORDS_{i}'] = train[KEY_WORDS_COLS].astype(str).sum(1)
+            test[f'{KEY_WORDS}_WORDS_{i}'] = test[KEY_WORDS_COLS].astype(str).sum(1)
+            KEY_WORDS_CROSS_COLS.append(f'{KEY_WORDS}_WORDS_{i}')
+    train = cat2num(train,KEY_WORDS_CROSS_COLS)
+    test = cat2num(test,KEY_WORDS_CROSS_COLS)
+
+    for COLS in KEY_WORDS_CROSS_COLS:
+        del train[COLS]
+        del test[COLS]
+    print('Keyword cross features done.')
+    return train,test
+
+
+def get_time_quantile_fea(df):
+    print('    Generating time quantile features')
+    secs = [0.2, 0.4, 0.6, 0.8]
+    time_fea_list = []
+    for sec in tqdm(secs):
+        for time_fea_type in [
+                'duration_minutes', 'log_duration_minutes', 'time_diff_1', 'log_time_diff_1']:
+            temp = df.groupby(['sn', 'server_model', 'fault_time'])[time_fea_type].quantile(sec).reset_index(
+                name=f'{time_fea_type}_' + str(sec * 100))
+
+            time_fea_list.append(temp)
+    time_fea_df = time_fea_list[0]
+    for time_fea in time_fea_list[1:]:
+        time_fea_df = time_fea_df.merge(
+            time_fea, how='left', on=[
+                'sn', 'server_model', 'fault_time'])
+    print(f'    Time quantile features done, shape: {time_fea_df.shape}')
+    return time_fea_df
+
+
+def get_server_model_fea(train, test):
+    print('Generating server_model features')
+    df = pd.concat([train, test], axis=0, ignore_index=True)
+    df['server_model_count_sn'] = df.groupby(
+        ['server_model'])['sn'].transform('count')
+    df['server_model_nunique_sn'] = df.groupby(
+        ['server_model'])['sn'].transform('nunique')
+    #     df['server_model_count'] = df.groupby('server_model')['server_model'].transform('count')
+    #     df['server_model_cnt_quantile'] = df['server_model'].map(
+    #         df['server_model'].value_counts().rank() / len(df['server_model'].unique()))
+    #     df['server_model_cnt_rank'] = df[f'server_model_cnt_quantile'].rank(method='min')
+
+    df['sn_cnt'] = df['sn'].map(df['sn'].value_counts())
+    df['sn_freq'] = df['sn'].map(df['sn'].value_counts() / len(df))
+    df['server_model_cnt'] = df['server_model'].map(
+        df['server_model'].value_counts())
+    df['server_model_freq'] = df['server_model'].map(
+        df['server_model'].value_counts() / len(df))
+    select_cols = ['sn', 'server_model',
+                   'server_model_count_sn', 'server_model_nunique_sn',
+                   'sn_cnt', 'sn_freq', 'server_model_cnt', 'server_model_freq'
+                   #                    'server_model_count','server_model_cnt_quantile', 'server_model_cnt_rank'
+                   ]
+    # Copy so that cat2num adds columns to an independent frame rather than a slice of df.
+    server_model_fea = df[select_cols].copy()
+
+    cat_feats = [
+        'server_model']  # ,'server_model_day_date','server_model_dayofmonth','server_model_dayofweek','server_model_hour']
+    # for name in cat_feats:
+    #     le = LabelEncoder()
+    #     server_model_fea[f'{name}_LabelEnc'] = le.fit_transform(
+    #         server_model_fea[name])
+    server_model_fea = cat2num(server_model_fea, cat_feats, Transfer2num=True)
+    server_model_fea = server_model_fea.drop_duplicates().reset_index(drop=True)
+    print(f'server_model features done, shape: {server_model_fea.shape}')
+
+    return server_model_fea
+
+
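+def get_time_type_msg_unique_fea(df):
+    # Work on a copy: callers pass df.query(...) slices, and adding columns to a slice
+    # triggers SettingWithCopyWarning.
+    df = df.copy()
+    df['msg_list'] = df['msg'].apply(
+        lambda x: [i.strip() for i in x.split(' | ')])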
+
+    df['msg_0'] = df['msg'].apply(
+        lambda x: [
+            get_msg_location(
+                x.split(' | '),
+                0)])
+    df['msg_1'] = df['msg'].apply(
+        lambda x: [
+            get_msg_location(
+                x.split(' | '),
+                1)])
+    df['msg_2'] = df['msg'].apply(
+        lambda x: [
+            get_msg_location(
+                x.split(' | '),
+                2)])
+
+    df = df.groupby(['sn', 'fault_time']).agg(
+        {'msg_list': 'sum', 'msg_0': 'sum', 'msg_1': 'sum', 'msg_2': 'sum'}).reset_index()
+
+    df['msg_set'] = df['msg_list'].apply(lambda x: '|'.join(list(set(x))))
+
+    df['msg_0_set'] = df['msg_0'].apply(lambda x: '|'.join(list(set(x))))
+    df['msg_1_set'] = df['msg_1'].apply(lambda x: '|'.join(list(set(x))))
+    df['msg_2_set'] = df['msg_2'].apply(lambda x: '|'.join(list(set(x))))
+    df = df[['sn', 'fault_time', 'msg_set',
+             'msg_0_set', 'msg_1_set', 'msg_2_set']]
+    return df
+
+
+def get_msg_unique_fea(train, test, time_type='last'):
+    print('Generating msg_unique features')
+    common_cols = ['msg_set', 'msg_0_set', 'msg_1_set', 'msg_2_set']
+    df = pd.concat([train, test], axis=0, ignore_index=True)
+    df['time_interval'] = (
+        pd.to_datetime(
+            df['fault_time']) -
+        df['time']).apply(
+            lambda x: x.total_seconds())
+
+    last_fea = get_time_type_msg_unique_fea(df.query('time_interval >0'))
+    last_fea = last_fea.rename(columns={i: f'last_{i}' for i in common_cols})
+    next_fea = get_time_type_msg_unique_fea(df.query('time_interval <0'))
+    next_fea = next_fea.rename(columns={i: f'next_{i}' for i in common_cols})
+    all_fea = get_time_type_msg_unique_fea(df)
+    all_fea = all_fea.rename(columns={i: f'all_{i}' for i in common_cols})
+    msg_unique_fea = all_fea.merge(
+        last_fea, on=['sn', 'fault_time'], how='outer')
+    msg_unique_fea = msg_unique_fea.merge(
+        next_fea, on=['sn', 'fault_time'], how='outer')
+    return msg_unique_fea
+
+
+def get_duration_minutes_fea(train, test):
+    print('Generating duration_minutes features')
+    df = pd.concat([train, test], axis=0, ignore_index=True)
+    # Despite the column name, duration_minutes holds seconds (fault_time minus log time).
+    df['duration_minutes'] = (pd.to_datetime(df['fault_time']) - pd.to_datetime(df['time'])).apply(
+        lambda x: x.total_seconds())
+    # np.log yields NaN for negative durations and -inf for zero;
+    # get_4_time_stat_fea later replaces infinite aggregates with fixed values.
+    df['log_duration_minutes'] = np.log(df['duration_minutes'])
+
+    df = df.sort_values(['sn', 'label', 'server_model',
+                        'fault_time', 'time']).reset_index(drop=True)
+    df['time_diff_1'] = (df.groupby(['sn', 'server_model', 'fault_time'])['time'].diff(1)).apply(
+        lambda x: x.total_seconds())
+    df['time_diff_1'] = df['time_diff_1'].fillna(0)
+    df['log_time_diff_1'] = np.log(df['time_diff_1'])
+
+    # time_quantile_fea_df = get_time_quantile_fea(df)
+    # time_stat_fea_df = get_4_time_stat_fea(df)
+    # df_tmp = time_quantile_fea_df.merge(time_stat_fea_df, on= ['sn',   'server_model','fault_time'],how = 'left')
+    time_stat_fea_df = get_4_time_stat_fea(df)
+    df_tmp = time_stat_fea_df
+    print(f'duration_minutes features done, shape: {df_tmp.shape}')
+    return df_tmp
+
+
+def get_msg_text_fea_all(all_data):
+    all_data['label'] = all_data['label'].fillna(-1)
+    all_data['msg_list'] = all_data['msg'].apply(lambda x: [i.strip() for i in x.split(' | ')])
+    all_data['msg_0'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 0)])
+    all_data['msg_1'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 1)])
+    all_data['msg_2'] = all_data['msg'].apply(lambda x: [get_msg_location(x.split(' | '), 2)])
+
+    all_data = all_data.sort_values(['sn', 'fault_time', 'time']).reset_index(drop=True)
+    del all_data['label']
+    last_data = all_data.query('time_interval >0')
+    next_data = all_data.query('time_interval <=0')
+
+    # id_cols = ['sn', 'fault_time', 'label']
+
+    # all_msg_text_fea = get_msg_text_fea(all_data, msg_type='all')
+    last_msg_text_fea = get_msg_text_fea(last_data, msg_type='last')
+    # next_msg_text_fea = get_msg_text_fea(next_data, msg_type='next')
+    msg_text_fea = last_msg_text_fea
+    return msg_text_fea
+
+def get_test_key_words(train,test):
+
+    df = pd.concat([train[['sn', 'fault_time', 'label','msg']],test[['sn', 'fault_time',  'msg']]],ignore_index = True).drop_duplicates(['sn', 'fault_time',  'msg'])
+    df['label'] = df['label'].fillna(5)
+    df['msg_list'] = df['msg'].apply(lambda x:[i.strip() for i in x.split(' | ')])
+    words_cnt_df_list = []
+    for label in df['label'].unique():
+        label = int(label)
+        df_tmp = df.query(f'label == {label}')
+        counter = Counter()
+        for words in df_tmp['msg_list']:
+            words = [i.replace('_',' ') for i in words]
+            # word_list = []
+            # for i in words:
+            #     word_list+=i.split(' ')
+            # words = word_list
+            counter.update(words)
+        words_cnt_df = pd.DataFrame(counter,index = [0]).T.reset_index().rename(columns = {'index':'word',0:f'cnt_{label}'})
+        words_cnt_df_list.append(words_cnt_df)
+    words_cnt_df = words_cnt_df_list[0]
+    for i in words_cnt_df_list[1:]:
+        words_cnt_df = words_cnt_df.merge(i,on = 'word',how = 'outer' )
+
+    words_cnt_df = words_cnt_df.fillna(-1)
+    # Copy the filtered frame so the column assignments below do not write to a slice.
+    words_cnt_df1 = words_cnt_df.query('cnt_0 >10 and cnt_2 >10 and cnt_1 >10 and cnt_3>10 and cnt_5>10 ').copy()
+    cnt_class = ['cnt_0','cnt_1','cnt_2','cnt_3','cnt_5']
+    words_cnt_df1['word_cnt_sum'] = words_cnt_df1.loc[:,cnt_class].sum(1)
+    for i in cnt_class:
+        words_cnt_df1[f'{i}_ratio'] = words_cnt_df1[i]/words_cnt_df1['word_cnt_sum']
+    words_cnt_df1['word_cnt_ratio_std'] = words_cnt_df1.loc[:,['cnt_0_ratio','cnt_1_ratio', 'cnt_2_ratio', 'cnt_3_ratio']].std(1)
+    words_cnt_df1['cnt_1_0_diff'] = (words_cnt_df1['cnt_1_ratio'] - words_cnt_df1['cnt_0_ratio'])
+    test_key_words = words_cnt_df1.sort_values('cnt_5',ascending = False)['word'].to_list()[5:40]
+    return test_key_words
+
+def get_w2v_mean(w2v_model,sentences):
+    emb_matrix = list()
+    vec = list()
+    for w in sentences.split():
+        if w in w2v_model.wv:
+            vec.append(w2v_model.wv[w])
+    if len(vec) > 0:
+        emb_matrix.append(np.mean(vec, axis=0))
+    else:
+        emb_matrix.append([0] * w2v_model.vector_size)
+    return emb_matrix
+def get_tfidf_svd(tfv,svd,sentences, n_components=16):
+    X_tfidf = tfv.transform(sentences)
+    X_svd = svd.transform(X_tfidf)
+    return np.mean(X_svd, axis=0)
+def get_w2v_tfidf_fea(all_data):
+    print('w2v encoding')
+    df = all_data
+    df['msg_list'] = df['msg'].apply(lambda x: [i.strip().lower().replace(' ','_') for i in x.split(" | ")])
+    df = df.groupby(['sn']).agg({'msg_list': 'sum'}).reset_index()
+    df['text'] = df['msg_list'].apply(lambda x: ' '.join(x))
+
+    sentences_list = df['text'].values.tolist()
+    sentences = []
+    for s in sentences_list:
+        sentences.append([w for w in s.split()])
+    w2v_model = Word2Vec(sentences, vector_size=10, window=3, min_count=5, sg=0, hs=1, seed=2022)
+    df['text_w2v'] = df['text'].apply(lambda x: get_w2v_mean(w2v_model, x)[0])
+
+    print('TF-IDF encoding')
+    X = df['text'].to_list()
+    tfv = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_features=50000)
+    tfv.fit(X)
+    X_tfidf = tfv.transform(X)
+    svd = TruncatedSVD(n_components=16)  # dimensionality reduction
+    svd.fit(X_tfidf)
+    df['text_tfidf'] = df['text'].apply(lambda x: get_tfidf_svd(tfv, svd, x.split()))
+
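+    print("doc2vec encoding")
+    texts = df['text'].tolist()
+    # TaggedDocument expects a list of tokens; passing the raw string would train on single characters.
+    documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(texts)]
+    model = Doc2Vec(documents, window=5, min_count=3, workers=4)
+    # gensim 4.x exposes document vectors as `dv`; `docvecs` is the deprecated alias.
+    docvecs = model.dv
+    df['doc2vec'] = [docvecs[i] for i in range(len(docvecs))]
+
+    # Expand each embedding into one column per dimension, using the models' actual sizes
+    # (the Word2Vec model above is trained with vector_size=10, so indexing up to 32 would fail).
+    for i in range(w2v_model.vector_size):
+        df[f'msg_w2v_{i}'] = df['text_w2v'].apply(lambda x: x[i])
+    for i in range(16):
+        df[f'msg_tfv_{i}'] = df['text_tfidf'].apply(lambda x: x[i])
+    for i in range(model.vector_size):
+        df[f'msg_doc2vec_{i}'] = df['doc2vec'].apply(lambda x: x[i])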
+
+    save_cols = [i for i in df.columns if i not in ['msg_list', 'text', 'text_w2v', 'text_tfidf', 'doc2vec']]
+    return df[save_cols]
+
+# w2v_tfidf_fea = get_w2v_tfidf_fea(all_data)
+class BetaEncoder(object):
+
+    def __init__(self, group):
+
+        self.group = group
+        self.stats = None
+
+    # get counts from df
+    def fit(self, df, target_col):
+        # prior mean of the target
+        self.prior_mean = np.mean(df[target_col])
+        stats = df[[target_col, self.group]].groupby(self.group)
+        # per-group sum (n) and count (N)
+        stats = stats.agg(['sum', 'count'])[target_col]
+        stats.rename(columns={'sum': 'n', 'count': 'N'}, inplace=True)
+        stats.reset_index(level=0, inplace=True)
+        self.stats = stats
+
+    # extract posterior statistics
+    def transform(self, df, stat_type, N_min=1):
+
+        df_stats = pd.merge(df[[self.group]], self.stats, how='left')
+        n = df_stats['n'].copy()
+        N = df_stats['N'].copy()
+
+        # fill in missing
+        nan_indexs = np.isnan(n)
+        n[nan_indexs] = self.prior_mean
+        N[nan_indexs] = 1.0
+
+        # prior parameters
+        N_prior = np.maximum(N_min - N, 0)
+        alpha_prior = self.prior_mean * N_prior
+        beta_prior = (1 - self.prior_mean) * N_prior
+
+        # posterior parameters
+        alpha = alpha_prior + n
+        beta = beta_prior + N - n
+
+        # calculate statistics
+        if stat_type == 'mean':
+            num = alpha
+            dem = alpha + beta
+
+        elif stat_type == 'mode':
+            num = alpha - 1
+            dem = alpha + beta - 2
+
+        elif stat_type == 'median':
+            num = alpha - 1 / 3
+            dem = alpha + beta - 2 / 3
+
+        elif stat_type == 'var':
+            num = alpha * beta
+            dem = (alpha + beta) ** 2 * (alpha + beta + 1)
+
+        elif stat_type == 'skewness':
+            num = 2 * (beta - alpha) * np.sqrt(alpha + beta + 1)
+            dem = (alpha + beta + 2) * np.sqrt(alpha * beta)
+
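+        elif stat_type == 'kurtosis':
+            num = 6 * (alpha - beta) ** 2 * (alpha + beta + 1) - \
+                alpha * beta * (alpha + beta + 2)
+            dem = alpha * beta * (alpha + beta + 2) * (alpha + beta + 3)
+
+        else:
+            # Without this branch an unknown stat_type would fail later with a NameError on `num`.
+            raise ValueError(f'Unsupported stat_type: {stat_type}')
+
+        # replace missing
+        value = num / dem
+        value[np.isnan(value)] = np.nanmedian(value)
+        return value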
+
+
+def get_beta_target(train, test):
+    N_min = 1000
+    feature_cols = []
+
+    # encode variables
+    for c in ['server_model']:
+        # fit encoder
+        be = BetaEncoder(c)
+        be.fit(train, 'label')
+
+        # mean
+        feature_name = f'{c}_mean'
+        train[feature_name] = be.transform(train, 'mean', N_min)
+        test[feature_name] = be.transform(test, 'mean', N_min)
+        feature_cols.append(feature_name)
+
+        # mode
+        feature_name = f'{c}_mode'
+        train[feature_name] = be.transform(train, 'mode', N_min)
+        test[feature_name] = be.transform(test, 'mode', N_min)
+        feature_cols.append(feature_name)
+
+        # median
+        feature_name = f'{c}_median'
+        train[feature_name] = be.transform(train, 'median', N_min)
+        test[feature_name] = be.transform(test, 'median', N_min)
+        feature_cols.append(feature_name)
+
+        # var
+        feature_name = f'{c}_var'
+        train[feature_name] = be.transform(train, 'var', N_min)
+        test[feature_name] = be.transform(test, 'var', N_min)
+        feature_cols.append(feature_name)
+
+        #     # skewness
+        #     feature_name = f'{c}_skewness'
+        #     train[feature_name] = be.transform(train, 'skewness', N_min)
+        #     test[feature_name]  = be.transform(test,  'skewness', N_min)
+        #     feature_cols.append(feature_name)
+
+        # kurtosis
+        feature_name = f'{c}_kurtosis'
+        train[feature_name] = be.transform(train, 'kurtosis', N_min)
+        test[feature_name] = be.transform(test, 'kurtosis', N_min)
+        feature_cols.append(feature_name)
+    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
+    df = pd.concat([train, test]).reset_index(drop=True)
+    df = df[['sn', 'fault_time', 'server_model', 'server_model_mean',
+             'server_model_mode', 'server_model_median', 'server_model_var',
+             'server_model_kurtosis']].drop_duplicates().reset_index(drop=True)
+    return df
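Reviewer note on `BetaEncoder`: the smoothing in `transform(..., 'mean', N_min)` can be sketched in isolation as below. This is a minimal illustration, assuming a binary 0/1 target so that the Beta parameters stay non-negative; the helper name `smoothed_mean` and the toy frame are hypothetical and not part of this commit. Groups with fewer than `n_min` rows are padded with pseudo-counts drawn from the global mean, which pulls small groups toward the prior.

```python
import numpy as np
import pandas as pd


def smoothed_mean(df, group, target, n_min):
    """Beta-posterior mean of `target` per `group`, padded with prior pseudo-counts up to n_min."""
    prior_mean = df[target].mean()
    stats = df.groupby(group)[target].agg(n='sum', N='count')
    # Groups with fewer than n_min rows borrow the missing rows from the prior.
    n_prior = np.maximum(n_min - stats['N'], 0)
    alpha = prior_mean * n_prior + stats['n']
    beta = (1 - prior_mean) * n_prior + stats['N'] - stats['n']
    return alpha / (alpha + beta)


# Toy data (illustrative only): group A has 4 rows, group B has a single row.
toy = pd.DataFrame({
    'server_model': ['A', 'A', 'A', 'A', 'B'],
    'label': [1, 1, 1, 0, 0],
})
encoded = smoothed_mean(toy, 'server_model', 'label', n_min=4)
print(encoded)
```

With `n_min=4`, group A already has enough rows and keeps its raw mean, while the single-row group B is blended with the global mean instead of being encoded as its raw value.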
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/generate_pseudo_label.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/generate_pseudo_label.py
@ -0,0 +1,73 @@
+import pandas as pd
+import numpy as np
+import os
+from utils import TRAIN_DIR ,TEST_A_DIR,TEST_B_DIR,RESULT_DIR,DATA_DIR
+
+log_dataset_a = pd.read_csv(os.path.join(DATA_DIR,'preliminary_a_test/preliminary_sel_log_dataset_a.csv'))
+log_dataset_b = pd.read_csv(os.path.join(DATA_DIR,'preliminary_b_test/preliminary_sel_log_dataset_b.csv'))
+submit_dataset_a = pd.read_csv(os.path.join(DATA_DIR,'preliminary_a_test/preliminary_submit_dataset_a.csv'))
+submit_dataset_b = pd.read_csv(os.path.join(DATA_DIR,'preliminary_b_test/preliminary_submit_dataset_b.csv'))
+
+log_dataset_c = pd.concat([log_dataset_a,log_dataset_b],ignore_index = True,axis = 0)
+submit_dataset_c = pd.concat([submit_dataset_a,submit_dataset_b],ignore_index = True,axis = 0)
+
+log_dataset_c.to_csv(os.path.join(TEST_A_DIR,'final_log_dataset_c.csv'),index =False)
+submit_dataset_c.to_csv(os.path.join(TEST_A_DIR,'final_submit_dataset_c.csv'),index =False)
+
+
+#
+# cat_prob = pd.read_csv(os.path.join(RESULT_DIR,'../../../TianchiAIOps_bert_model/cat_prob_result.csv'))
+# lgb_prob = pd.read_csv(os.path.join(RESULT_DIR,'../../../TianchiAIOps_bert_model/lgb_prob_result.csv'))
+
+cat_prob = pd.read_csv(os.path.join(RESULT_DIR,'B_prob_7511.csv'))
+lgb_prob = pd.read_csv(os.path.join(RESULT_DIR,'baseline_prob_7495.csv'))
+cat_prob.columns  = ['cat_class_0', 'cat_class_1', 'cat_class_2', 'cat_class_3', 'label', 'sn',
+       'fault_time']
+lgb_prob.columns = ['lgb_class_0', 'lgb_class_1', 'lgb_class_2', 'lgb_class_3', 'label', 'sn',
+       'fault_time']
+
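+# Keep only unlabeled (test) rows; copy so the columns added below do not write to a slice.
+lgb_prob = lgb_prob[lgb_prob['label'].isnull()].copy()
+cat_prob = cat_prob[cat_prob['label'].isnull()].copy()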
+
+cat_prob['cat_prob'] = cat_prob.loc[:,['cat_class_0', 'cat_class_1', 'cat_class_2', 'cat_class_3']].max(1)
+cat_prob['cat_label'] = np.argmax(cat_prob.loc[:,['cat_class_0', 'cat_class_1', 'cat_class_2', 'cat_class_3']].values,axis = 1)
+
+lgb_prob['lgb_prob'] = lgb_prob.loc[:,['lgb_class_0', 'lgb_class_1', 'lgb_class_2', 'lgb_class_3']].max(1)
+lgb_prob['lgb_label'] = np.argmax(lgb_prob.loc[:,['lgb_class_0', 'lgb_class_1', 'lgb_class_2', 'lgb_class_3']].values,axis = 1)
+
+lgb_prob = lgb_prob[['sn','fault_time','lgb_label','lgb_prob']]
+cat_prob = cat_prob[['sn','fault_time','cat_label','cat_prob']]
+
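+# prob = cat_prob.merge(lgb_prob,on =['sn','fault_time'],
+#                how = 'left' )
+
+# concat stacks the two prediction tables row-wise, so each row carries only one model's columns.
+# Filling the missing probability with 1 and copying the label across means a row passes the
+# filter below as soon as its own model is confident; a sample that both models are confident
+# about appears twice.
+prob = pd.concat([cat_prob,lgb_prob],ignore_index = True)
+prob['cat_prob']=prob['cat_prob'].fillna(1)
+prob['lgb_prob']=prob['lgb_prob'].fillna(1)
+prob.loc[prob['cat_label'].isnull(),'cat_label'] = prob.loc[prob['cat_label'].isnull(),'lgb_label']
+prob.loc[prob['lgb_label'].isnull(),'lgb_label'] = prob.loc[prob['lgb_label'].isnull(),'cat_label']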
+
+
+pseudo_labels = prob.query('cat_prob >0.85 and lgb_prob >0.85 and lgb_label == cat_label  ')
+
+pseudo_labels = pseudo_labels[['sn','fault_time','cat_label']].rename(columns = {'cat_label':'label'}).reset_index(drop = True)
+pseudo_labels.to_csv(os.path.join(TRAIN_DIR,'pseudo_labels.csv'),index= False)
+print(f'Pseudo-label data shape: {pseudo_labels.shape}')
+
+pseudo_sel_log_dataset = pd.read_csv(os.path.join(TEST_A_DIR,'final_sel_log_dataset_c.csv'))
+pseudo_sel_log_dataset = pseudo_sel_log_dataset[pseudo_sel_log_dataset['sn'].isin(pseudo_labels['sn'].to_list())]
+pseudo_sel_log_dataset.to_csv(os.path.join(TRAIN_DIR,'pseudo_sel_log_dataset.csv'),index = False)
+print(f'Pseudo-label log data shape: {pseudo_sel_log_dataset.shape}')
+
+# Build the new test set
+final_submit_dataset_d= prob.merge(pseudo_labels,on =['sn','fault_time'],how = 'left' )
+final_submit_dataset_d = final_submit_dataset_d[final_submit_dataset_d['label'].isnull()][['sn','fault_time' ]].reset_index(drop = True)
+final_submit_dataset_d.to_csv(os.path.join(TEST_A_DIR,'final_submit_dataset_d.csv'),index= False)
+print(f'New test set shape: {final_submit_dataset_d.shape}')
+
+final_sel_log_dataset_d = pd.read_csv(os.path.join(TEST_A_DIR,'final_sel_log_dataset_c.csv'))
+final_sel_log_dataset_d = final_sel_log_dataset_d[final_sel_log_dataset_d['sn'].isin(final_submit_dataset_d['sn'].to_list())]
+
+final_sel_log_dataset_d.to_csv(
+    os.path.join(TEST_A_DIR,'final_sel_log_dataset_d.csv'),index = False)
+print(f'New test set log data shape: {final_sel_log_dataset_d.shape}')
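The script above keeps a test row as a pseudo-label when the predicted class is confident (probability above 0.85) and the CatBoost and LightGBM labels agree. The following is a minimal sketch of that agreement filter on toy data. It is a variant, not a reproduction: the script stacks the two frames with `pd.concat`, while this sketch uses the key-wise merge that the script leaves commented out. The helper names `top_class` and `select_pseudo_labels` and the toy frames are hypothetical and do not exist in the repository.

```python
import numpy as np
import pandas as pd

CLASS_NUM = 4
THRESHOLD = 0.85  # same cut-off as the script above


def top_class(prob_df, prefix):
    """Return the most likely class and its probability per (sn, fault_time)."""
    cols = [f'{prefix}_class_{i}' for i in range(CLASS_NUM)]
    out = prob_df[['sn', 'fault_time']].copy()
    out[f'{prefix}_prob'] = prob_df[cols].max(axis=1)
    out[f'{prefix}_label'] = np.argmax(prob_df[cols].values, axis=1)
    return out


def select_pseudo_labels(cat_df, lgb_df, threshold=THRESHOLD):
    """Keep rows where both models are confident and predict the same class."""
    merged = top_class(cat_df, 'cat').merge(
        top_class(lgb_df, 'lgb'), on=['sn', 'fault_time'], how='inner')
    confident = (merged['cat_prob'] > threshold) & (merged['lgb_prob'] > threshold)
    agree = merged['cat_label'] == merged['lgb_label']
    keep = merged.loc[confident & agree, ['sn', 'fault_time', 'cat_label']]
    return keep.rename(columns={'cat_label': 'label'}).reset_index(drop=True)


# Toy data (hypothetical): s1 is confident and agreed, s2 is not confident
# for the first model, s3 is confident but the two models disagree.
keys = pd.DataFrame({'sn': ['s1', 's2', 's3'],
                     'fault_time': ['2020-01-01 00:00:00'] * 3})
cat_toy = keys.assign(cat_class_0=[0.90, 0.50, 0.05],
                      cat_class_1=[0.05, 0.30, 0.90],
                      cat_class_2=[0.03, 0.10, 0.03],
                      cat_class_3=[0.02, 0.10, 0.02])
lgb_toy = keys.assign(lgb_class_0=[0.95, 0.90, 0.90],
                      lgb_class_1=[0.03, 0.05, 0.05],
                      lgb_class_2=[0.01, 0.03, 0.03],
                      lgb_class_3=[0.01, 0.02, 0.02])

pseudo = select_pseudo_labels(cat_toy, lgb_toy)
print(pseudo)
```

With an inner merge, every kept row has been checked against both models, at the cost of dropping rows that only one model scored.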
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/get_crashdump_venus_fea.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/get_crashdump_venus_fea.py
@ -0,0 +1,353 @@
+import datetime
+import os
+import gc
+import warnings
+import pandas as pd
+import pickle
+from gensim.models.word2vec import Word2Vec
+from gensim.models.doc2vec import Doc2Vec, TaggedDocument
+from sklearn.utils.class_weight import compute_class_weight
+from sklearn.preprocessing import LabelEncoder
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.decomposition import TruncatedSVD
+import numpy as np
+import pandas as pd
+from generate_feature import add_w2v_feats, cat2num
+from generate_feature import get_key
+
+from generate_feature import get_beta_target, add_last_next_time4fault, get_feature, \
+    get_duration_minutes_fea, get_nearest_msg_fea, get_server_model_sn_fea_2, \
+    get_server_model_fea, get_msg_text_fea_all, get_key_word_cross_fea, get_server_model_time_interval_stat_fea, \
+    get_w2v_feats, get_key, get_class_key_words_nunique
+from model import run_cbt, run_lgb
+from utils import RESULT_DIR, TRAIN_DIR, \
+    TEST_A_DIR, KEY_WORDS, TOP_KEY_WORDS, get_word_counter, search_weight, macro_f1, TIME_INTERVAL, PSEUDO_FALG, \
+    GENERATION_DIR
+
+warnings.filterwarnings('ignore')
+
+
+def get_fault_code_list(x):
+    try:
+        x = x.replace('.', ',').split(',')
+    except:
+        x = []
+    return x
+
+
+def get_module_cause_list(x):
+    try:
+        x = x.replace(',', '_').replace('，', '_')
+        x = list(set(x.split('_')))
+    except:
+        x = []
+    return x
+
+
+def get_label(PSEUDO_FALG):
+    preliminary_train_label_dataset = pd.read_csv(preliminary_train_label_dataset_path)
+    preliminary_train_label_dataset_s = pd.read_csv(preliminary_train_label_dataset_s_path)
+
+    if PSEUDO_FALG:
+        print('Loading pseudo-label LABEL')
+        pseudo_labels = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_labels.csv'))
+        label = pd.concat([preliminary_train_label_dataset,
+                           pseudo_labels,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        label = pd.concat([preliminary_train_label_dataset,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    label['fault_time'] = label['fault_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+    label['fault_time'] = label['fault_time'].apply(lambda x: str(x))
+    return label
+
+
+def get_module_cause_code(x, code_name):
+    code_list = []
+    for i in x:
+        if code_name in i:
+            code_list.append(i)
+    return code_list
+
+
+def get_alertname_code(x, alertname):
+    x = x.split(',')
+
+    try:
+        alertname_code = x[x.index(alertname) + 1]
+    except:
+        alertname_code = np.nan
+    return alertname_code
+
+
+def get_alertname_code_2(x, alertname):
+    # x =x.split(',')
+
+    try:
+        alertname_code = x[x.index(alertname) + 1]
+    except:
+        alertname_code = ' '
+    return alertname_code
+
+
+def get_last_msg_cnt(x):
+    last_msg = x[-1]
+    cnt = x.count(last_msg)
+    return cnt
+
+
+def get_first_msg_cnt(x):
+    first_msg = x[0]
+    cnt = x.count(first_msg)
+    return cnt
+
+
+def get_crashdump_venus_data():
+    final_venus_dataset = pd.read_csv(os.path.join(TEST_A_DIR, 'final_venus_dataset_b.csv'))
+    final_crashdump_dataset = pd.read_csv(os.path.join(TEST_A_DIR, 'final_crashdump_dataset_b.csv'))
+    final_crashdump_venus = final_crashdump_dataset.merge(final_venus_dataset, on=['sn', 'fault_time'],
+                                                          how='outer')
+
+    preliminary_venus_dataset = pd.read_csv(os.path.join(TRAIN_DIR, 'preliminary_venus_dataset.csv'))
+    preliminary_crashdump_dataset = pd.read_csv(os.path.join(TRAIN_DIR, 'preliminary_crashdump_dataset.csv'))
+    preliminary_crashdump_venus = preliminary_crashdump_dataset.merge(preliminary_venus_dataset,
+                                                                      on=['sn', 'fault_time'],
+                                                                      how='outer')
+
+    crashdump_venus = pd.concat([final_crashdump_venus, preliminary_crashdump_venus],
+                                ignore_index=True).drop_duplicates()
+    crashdump_venus = crashdump_venus.sort_values(['sn', 'fault_time']).reset_index(drop=True)
+    return crashdump_venus
+
+
+def get_crashdump_venus_fea(crashdump_venus):
+    print('Generating crashdump_venus features')
+    crashdump_venus['module_cause_list'] = crashdump_venus['module_cause'].apply(lambda x: get_module_cause_list(x))
+    crashdump_venus['fault_code_list'] = crashdump_venus['fault_code'].apply(lambda x: get_fault_code_list(x))
+
+    code_name_list = ['module', 'cod1', 'cod2', 'addr', 'port']
+    for code_name in code_name_list:
+        crashdump_venus[f'module_cause_{code_name}'] = crashdump_venus['module_cause_list'].apply(
+            lambda x: get_module_cause_code(x, code_name))
+        crashdump_venus[f'module_cause_{code_name}_len'] = crashdump_venus[f'module_cause_{code_name}'].apply(
+            lambda x: len(x))
+        crashdump_venus[f'module_cause_{code_name}'] = crashdump_venus[f'module_cause_{code_name}'].apply(
+            lambda x: '_'.join(set(x)))
+    code_name_list = ['cha', '0x', 'cod', 'core', 'cpu', 'm2m', 'pcu']
+    for code_name in code_name_list:
+        crashdump_venus[f'fault_{code_name}'] = crashdump_venus['fault_code_list'].apply(
+            lambda x: get_module_cause_code(x, code_name))
+        crashdump_venus[f'fault_{code_name}_len'] = crashdump_venus[f'fault_{code_name}'].apply(lambda x: len(x))
+        crashdump_venus[f'fault_{code_name}'] = crashdump_venus[f'fault_{code_name}'].apply(lambda x: '_'.join(set(x)))
+
+    cols_tmp = ['module_cause', 'fault_code', 'module_cause_module',
+                'module_cause_cod1', 'module_cause_cod2', 'module_cause_addr',
+                'module_cause_port', 'fault_cha', 'fault_0x', 'fault_cod', 'fault_core',
+                'fault_cpu', 'fault_m2m', 'fault_pcu', ]
+    new_cat_cols = []
+    crashdump_venus = cat2num(crashdump_venus, cols_tmp)
+    for name in cols_tmp:
+        # le = LabelEncoder()
+        # crashdump_venus[f'{name}_LabelEnc'] = le.fit_transform(crashdump_venus[name])
+        new_cat_cols.append(f'{name}_LabelEnc')
+
+    num_cols = ['fault_pcu_len', 'fault_m2m_len',
+                'fault_cpu_len', 'fault_0x_len', 'fault_cod_len',
+                'module_cause_module_len', 'module_cause_cod1_len',
+                'module_cause_cod2_len', 'module_cause_addr_len',
+                'module_cause_port_len', 'fault_cha_len', 'fault_core_len', ]
+
+    crashdump_venus = crashdump_venus[['sn', 'fault_time'] + new_cat_cols + num_cols]
+    crashdump_venus = crashdump_venus.rename(columns={'fault_time': 'crashdump_fault_time'})
+
+    crashdump_venus['crashdump_fault_time'] = pd.to_datetime(crashdump_venus['crashdump_fault_time'])
+    del crashdump_venus['crashdump_fault_time']
+    print(f'crashdump_venus features generated, shape {crashdump_venus.shape}')
+    return crashdump_venus
+
+
+def get_location_word(x, num):
+    try:
+        return x[num]
+    except:
+        return
+
+
+def get_label(PSEUDO_FALG):
+    preliminary_train_label_dataset = pd.read_csv(preliminary_train_label_dataset_path)
+    preliminary_train_label_dataset_s = pd.read_csv(preliminary_train_label_dataset_s_path)
+
+    if PSEUDO_FALG:
+        print('Loading pseudo-label LABEL')
+        pseudo_labels = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_labels.csv'))
+        label = pd.concat([preliminary_train_label_dataset,
+                           pseudo_labels,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        label = pd.concat([preliminary_train_label_dataset,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    label['fault_time'] = label['fault_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+    label['fault_time'] = label['fault_time'].apply(lambda x: str(x))
+    return label
+
+
+module_list = ['module0','module1','module2','module3','module4','module5','module7','module8','module9',
+             'module10','module11','module12','module13','module14','module17','module18','module19',
+             'in traffic control',
+             'irpp0','irpp1',
+             'pcie rootport 0:0.0','pcie rootport a2:0.0','pcie rootport 2b:3.0',
+             'port a','port c']
+module_list2 = ['module0','module1','module2','module3','module4','module5','module7','module8','module9',
+'module10','module11','module12','module13','module14','module17','module18','module19']
+other_module_list = ['in traffic control', 'irpp0', 'irpp1', 'pcie rootport 0:0.0',
+       'pcie rootport a2:0.0', 'pcie rootport 2b:3.0', 'port a', 'port c']
+module_content_list = ['module0_cod1', 'module0_cod2', 'module0_addr',
+       'module1_cod1', 'module1_cod2', 'module1_addr', 'module2_cod1',
+       'module2_cod2', 'module2_addr', 'module3_cod1', 'module3_cod2',
+       'module3_addr', 'module4_cod1', 'module4_cod2', 'module4_addr',
+       'module5_cod1', 'module5_cod2', 'module5_addr', 'module7_cod1',
+       'module7_cod2', 'module7_addr', 'module8_cod1', 'module8_cod2',
+       'module8_addr', 'module9_cod1', 'module9_cod2', 'module9_addr',
+       'module10_cod1', 'module10_cod2', 'module10_addr', 'module11_cod1',
+       'module11_cod2', 'module11_addr', 'module12_cod1', 'module12_cod2',
+       'module12_addr', 'module13_cod1', 'module13_cod2', 'module13_addr',
+       'module14_cod1', 'module14_cod2', 'module14_addr', 'module17_cod1',
+       'module17_cod2', 'module17_addr', 'module18_cod1', 'module18_cod2',
+       'module18_addr', 'module19_cod1', 'module19_cod2', 'module19_addr']
+fault_code_content_list = ['fault_code_cod1', 'fault_code_cod2',
+       'fault_code_cpu0', 'fault_code_cpu1']
+
+
+crashdump_venus = get_crashdump_venus_data()
+crashdump_venus['module_cause_list'] = crashdump_venus['module_cause'].fillna('_').apply(lambda x:x.split(','))
+crashdump_venus['module_cause'] = crashdump_venus['module_cause'].fillna('_').apply(lambda x:x.replace(':','_').replace(',','_'))
+for module in module_list:
+    crashdump_venus['module_cause'] = crashdump_venus['module_cause'].fillna('_').apply(
+    lambda x:x.replace(f'{module}_',f'{module}:').replace(f'_{module}',f',{module}'))
+crashdump_venus['module_cause'] = crashdump_venus['module_cause'].apply(lambda x:x.replace(':',','))
+
+for module in module_list:
+    crashdump_venus[module] = crashdump_venus['module_cause'].apply(lambda x:get_alertname_code(x,module))
+    crashdump_venus[module] = crashdump_venus.loc[:,module].fillna(' ').apply(lambda x:x.replace('_',' '))
+    crashdump_venus[module] = crashdump_venus[module].apply(lambda x:x.split(' '))
+crashdump_venus['module_cause_new'] = crashdump_venus.loc[:,module_list].sum(1)
+
+
+for module in module_list2:
+    crashdump_venus[f'{module}_cod1'] = crashdump_venus[module].apply(lambda x:[get_alertname_code_2(x,'cod1')])
+    crashdump_venus[f'{module}_cod2'] = crashdump_venus[module].apply(lambda x:[get_alertname_code_2(x,'cod2')])
+    crashdump_venus[f'{module}_addr'] = crashdump_venus[module].apply(lambda x:[get_alertname_code_2(x,'addr')])
+    del crashdump_venus[module]
+    gc.collect()
+
+crashdump_venus['fault_code_list'] = crashdump_venus['fault_code'].fillna(' ').apply(lambda x:x.split('.'))
+for i in ['cod1','cod2','cpu0','cpu1']:
+    crashdump_venus[f'fault_code_{i}'] = crashdump_venus['fault_code_list'].apply(lambda x:[get_alertname_code_2(x,i)])
+
+
+crashdump_venus['other_module_list'] = crashdump_venus.loc[:,other_module_list].sum(1)
+crashdump_venus['module_content_list'] = crashdump_venus.loc[:,module_content_list].sum(1)
+crashdump_venus['module_cause_new'] = crashdump_venus.loc[:,other_module_list+module_content_list].sum(1)
+crashdump_venus['fault_code_content_list'] = crashdump_venus.loc[:,fault_code_content_list].sum(1)
+crashdump_venus['all_crashdump_venus'] = crashdump_venus.loc[:,other_module_list+module_content_list+fault_code_content_list].sum(1)
+
+f1_list = ['sn']
+f2_list = ['other_module_list','module_content_list','module_cause_new','fault_code_content_list','all_crashdump_venus']
+w2v_feats_df = crashdump_venus[f1_list].drop_duplicates()
+w2v_feats_df_list = []
+for f1 in f1_list:
+    for f2 in f2_list:
+        w2v_fea_tmp = add_w2v_feats(crashdump_venus,w2v_feats_df,f1,f2,emb_size = 10,window = 5,min_count  =5,)
+        w2v_feats_df_list.append(w2v_fea_tmp)
+w2v_feats_df = w2v_feats_df_list[0]
+for i in w2v_feats_df_list[1:]:
+    w2v_feats_df = w2v_feats_df.merge(i,on = 'sn',how = 'left')
+
+for i in other_module_list+module_content_list+fault_code_content_list:
+    crashdump_venus[i] = crashdump_venus[i].astype(str)
+
+crashdump_venus = cat2num(crashdump_venus,other_module_list+module_content_list+fault_code_content_list)
+for i in other_module_list+module_content_list+fault_code_content_list:
+    del crashdump_venus[i]
+gc.collect()
+crashdump_venus = crashdump_venus.merge(w2v_feats_df,on ='sn',how ='left').rename(columns ={'fault_time':'crashdump_venus_fault_time'} )
+
+preliminary_train_label_dataset_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset.csv')
+preliminary_train_label_dataset_s_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset_s.csv')
+test = pd.read_csv(os.path.join(TEST_A_DIR, 'final_submit_dataset_b.csv'))[['sn', 'fault_time' ]]
+train = get_label(False)[['sn', 'fault_time', 'label',]]
+
+test_tmp = test[['sn', 'fault_time']]
+test_tmp = test_tmp.merge(crashdump_venus, on='sn').drop_duplicates(['sn', 'fault_time']).reset_index(drop=True)
+train_tmp = train[['sn', 'fault_time', 'label', ]]
+train_tmp = train_tmp.merge(crashdump_venus, on='sn').drop_duplicates(['sn', 'fault_time']).reset_index(drop=True)
+
+
+train_tmp['duration_fault_time'] = pd.to_datetime(train_tmp['fault_time']) - pd.to_datetime(train_tmp['crashdump_venus_fault_time'])
+test_tmp['duration_fault_time'] = pd.to_datetime(test_tmp['fault_time']) - pd.to_datetime(test_tmp['crashdump_venus_fault_time'])
+
+train_tmp['duration_fault_time'] = train_tmp['duration_fault_time'].apply(lambda x:x.total_seconds())
+test_tmp['duration_fault_time']  = test_tmp['duration_fault_time'].apply(lambda x:x.total_seconds())
+
+
+drop_cols = ['sn', 'fault_time', 'fault_code', 'module_cause', 'module','crashdump_venus_fault_time',
+       'module_cause_list', 'module_cause_new', 'fault_code_list','label','duration_fault_time',
+       'other_module_list', 'module_content_list', 'fault_code_content_list',
+       'all_crashdump_venus',]
+use_cols = [i for i in train_tmp.columns if i not in drop_cols]
+
+cat_cols = [f'{i}_LabelEnc' for i in other_module_list+module_content_list+fault_code_content_list]
+
+oof_prob = np.zeros((train_tmp.shape[0], 4))
+
+test_prob = np.zeros((test_tmp.shape[0], 4))
+# seeds = [42,4242,40424,1024,2048]
+seeds = [42 ]
+for seed in seeds:
+    # Keep each seed's predictions separate so the running average is not overwritten
+    oof_prob_seed, test_prob_seed, fea_imp_df, model_list = run_cbt(train_tmp[use_cols], train_tmp[['label']], test_tmp[use_cols], k=5,
+                                              seed=seed, cat_cols=cat_cols)
+    oof_prob += oof_prob_seed / len(seeds)
+    test_prob += test_prob_seed / len(seeds)
+
+
+weight = search_weight(train_tmp, train_tmp[['label']], oof_prob, init_weight=[1.0], class_num=4, step=0.001)
+oof_prob = oof_prob * np.array(weight)
+test_prob = test_prob * np.array(weight)
+
+
+target_df = train_tmp[['sn', 'fault_time', 'label']].drop_duplicates(['sn', 'fault_time'])
+submit_df = train_tmp[['sn', 'fault_time']].copy()
+submit_df['label'] = oof_prob.argmax(axis=1)
+submit_df = submit_df.drop_duplicates(['sn', 'fault_time'])
+# submit_df = pd.read_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea1.csv')).rename(columns = {'crashdump_venus_label':'label'})
+
+
+score = macro_f1(target_df=target_df, submit_df=submit_df)
+print(f'********************** BEST MACRO_F1 : {score} **********************')
+score = round(score, 5)
+
+print(fea_imp_df[:20])
+y_pred = test_prob.argmax(axis=1)
+result = test_tmp[['sn', 'fault_time']].copy()
+result['label'] = y_pred
+result = result.drop_duplicates(['sn', 'fault_time'])
+
+crashdump_venus_fea = pd.concat([submit_df,result],ignore_index = False,axis = 0)
+crashdump_venus_fea = crashdump_venus_fea.rename(columns = {'label':'crashdump_venus_label_v1'})
+crashdump_venus_fea.to_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea_v1.csv'),index= False)
+print(crashdump_venus_fea['crashdump_venus_label_v1'].value_counts())
+
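The seed loop in this script is meant to average predictions across several seeds. A minimal sketch of that accumulation pattern follows, keeping each seed's output in its own variable before adding it to the running average. `fake_run_model` is a hypothetical stand-in that returns random class probabilities; it is not the repository's `run_cbt`, whose return shapes are assumed here to be `(n_rows, 4)` arrays.

```python
import numpy as np


def fake_run_model(n_train, n_test, seed, class_num=4):
    """Hypothetical stand-in for a model run: random rows that sum to 1."""
    rng = np.random.default_rng(seed)
    oof = rng.dirichlet(np.ones(class_num), size=n_train)
    test = rng.dirichlet(np.ones(class_num), size=n_test)
    return oof, test


def average_over_seeds(n_train, n_test, seeds, class_num=4):
    """Average out-of-fold and test probabilities over all seeds."""
    oof_avg = np.zeros((n_train, class_num))
    test_avg = np.zeros((n_test, class_num))
    for seed in seeds:
        # Per-seed results go into separate names so the accumulators survive.
        oof_seed, test_seed = fake_run_model(n_train, n_test, seed, class_num)
        oof_avg += oof_seed / len(seeds)
        test_avg += test_seed / len(seeds)
    return oof_avg, test_avg


oof_avg, test_avg = average_over_seeds(6, 3, seeds=[42, 4242, 1024])
print(oof_avg.sum(axis=1))  # each row should still sum to 1 after averaging
```

Because every per-seed row sums to 1, the averaged rows do as well, which is a quick sanity check that nothing was double-counted.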
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/lgb_fs.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/lgb_fs.py
@ -0,0 +1,291 @@
+import os
+import warnings
+
+import numpy as np
+import pandas as pd
+import datetime
+from generate_feature import get_beta_target, add_last_next_time4fault, get_feature, \
+    get_duration_minutes_fea, get_nearest_msg_fea, get_server_model_sn_fea_2, \
+    get_server_model_fea, get_msg_text_fea_all, get_key_word_cross_fea, get_server_model_time_interval_stat_fea, \
+    get_w2v_feats
+from model import run_cbt,run_lgb
+from utils import RESULT_DIR, TRAIN_DIR, \
+    TEST_A_DIR, KEY_WORDS, get_word_counter, search_weight, macro_f1, TIME_INTERVAL,PSEUDO_FALG,GENERATION_DIR
+
+warnings.filterwarnings('ignore')
+
+
+def get_label(PSEUDO_FALG):
+    preliminary_train_label_dataset = pd.read_csv(preliminary_train_label_dataset_path)
+    preliminary_train_label_dataset_s = pd.read_csv(preliminary_train_label_dataset_s_path)
+
+    if PSEUDO_FALG:
+        print('Loading pseudo-label LABEL')
+        pseudo_labels = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_labels.csv'))
+        label = pd.concat([preliminary_train_label_dataset,
+                           pseudo_labels,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        label = pd.concat([preliminary_train_label_dataset,
+                           preliminary_train_label_dataset_s],
+                          ignore_index=True,
+                          axis=0).sort_values(
+            ['sn', 'fault_time']).reset_index(drop=True)
+    label['fault_time'] = label['fault_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+    label['fault_time'] = label['fault_time'].apply(lambda x: str(x))
+    return label
+
+
+def get_log_dateset(PSEUDO_FALG):
+    preliminary_sel_log_dataset = pd.read_csv(preliminary_sel_log_dataset_path)
+    preliminary_sel_log_dataset_a = pd.read_csv(preliminary_sel_log_dataset_a_path)
+    if PSEUDO_FALG:
+        print('Loading pseudo-label log data')
+        pseudo_sel_log_dataset = pd.read_csv(os.path.join(TRAIN_DIR, 'pseudo_sel_log_dataset.csv'))
+        log_dataset = pd.concat([preliminary_sel_log_dataset,
+                                 pseudo_sel_log_dataset,
+                                 preliminary_sel_log_dataset_a],
+                                ignore_index=True,
+                                axis=0).sort_values(
+            ['sn', 'time', 'server_model']).reset_index(drop=True)
+    else:
+        print('Not using pseudo-label data')
+        log_dataset = pd.concat([preliminary_sel_log_dataset,
+                                 preliminary_sel_log_dataset_a],
+                                ignore_index=True,
+                                axis=0).sort_values(
+            ['sn', 'time', 'server_model']).reset_index(drop=True)
+    log_dataset['time'] = log_dataset['time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
+
+    return log_dataset
+
+def get_fea_distribute(feature_df, feature_importances, dataset_type, top=30):
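+    print('Computing value distributions of the top features by importance, to check whether the training and test sets are distributed consistently')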
+    fea_distribute_list = []
+    for i in feature_importances[:top]['fea'].to_list():
+        fea_distribute_tmp = (feature_df[i].value_counts() / len(feature_df)).reset_index().rename(
+            columns={'index': 'value'})
+        fea_distribute_list.append(fea_distribute_tmp)
+
+    fea_distribute = fea_distribute_list[-1]
+    for i in fea_distribute_list[:-1]:
+        fea_distribute = fea_distribute.merge(i, on='value', how='left')
+    fea_distribute['value'] = fea_distribute['value'].apply(lambda x: f'{dataset_type}_{int(x)}')
+    return fea_distribute
+
+
+def get_train_test(label, preliminary_submit_dataset_a, log_dataset):
+    print('Building the training and test sets')
+    train = label.merge(log_dataset, on='sn', how='left')
+    test = preliminary_submit_dataset_a.merge(log_dataset, on='sn', how='left')
+    #     train['time_interval']  = (pd.to_datetime( train['fault_time'])-train['time']  ).apply(lambda x:x.total_seconds())
+    #     test['time_interval']  = (pd.to_datetime( test['fault_time'])- test['time']  ).apply(lambda x:x.total_seconds())
+    #     train = train.query('time_interval > 0')
+    #     test = test.query('time_interval > 0')
+    print(f'Training set shape: {train.shape}, test set shape: {test.shape}')
+    train = train.drop_duplicates().reset_index(drop=True)
+    test = test.drop_duplicates().reset_index(drop=True)
+    train['time'] = pd.to_datetime(train['time'])
+    test['time'] = pd.to_datetime(test['time'])
+    return train, test
+
+start_time = datetime.datetime.now()
+
+additional_sel_log_dataset_path = os.path.join(TRAIN_DIR, 'additional_sel_log_dataset.csv')
+preliminary_train_label_dataset_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset.csv')
+preliminary_train_label_dataset_s_path = os.path.join(TRAIN_DIR, 'preliminary_train_label_dataset_s.csv')
+preliminary_sel_log_dataset_path = os.path.join(TRAIN_DIR, 'preliminary_sel_log_dataset.csv')
+
+preliminary_submit_dataset_a_path = os.path.join(TEST_A_DIR, 'final_submit_dataset_b.csv')
+preliminary_sel_log_dataset_a_path = os.path.join(TEST_A_DIR, 'final_sel_log_dataset_b.csv')
+
+print(preliminary_submit_dataset_a_path, preliminary_sel_log_dataset_a_path)
+
+preliminary_submit_dataset_a = pd.read_csv(preliminary_submit_dataset_a_path)
+preliminary_submit_dataset_a.head()
+
+log_dataset = get_log_dateset(PSEUDO_FALG)
+label = get_label(PSEUDO_FALG)
+
+
+next_time_list = [i / TIME_INTERVAL for i in [3, 5, 10, 15, 30, 45, 60, 90, 120, 240, 360, 480, 540, 600]] + [1000000]
+
+label, preliminary_submit_dataset_a = add_last_next_time4fault(label, preliminary_submit_dataset_a, TIME_INTERVAL,
+                                                               next_time_list)
+train, test = get_train_test(label, preliminary_submit_dataset_a, log_dataset)
+train = train.drop_duplicates(['sn', 'fault_time', 'time', 'msg', 'server_model']).reset_index(drop=True)
+
+train['time_interval'] = (pd.to_datetime(train['fault_time']) - pd.to_datetime(train['time'])).apply(
+    lambda x: x.total_seconds())
+test['time_interval'] = (pd.to_datetime(test['fault_time']) - pd.to_datetime(test['time'])).apply(
+    lambda x: x.total_seconds())
+
+all_data = pd.concat([train, test], axis=0, ignore_index=True)
+all_data = all_data.sort_values(['sn','server_model', 'fault_time', 'time'])
+w2v_feats = get_w2v_feats(all_data,
+                          f1_list = ['sn'],
+                          f2_list = ['msg_list', 'msg_0', 'msg_1', 'msg_2'])
+
+# Get server_model_time_interval_stat_fea
+server_model_time_interval_stat_fea = get_server_model_time_interval_stat_fea(all_data)
+
+msg_text_fea = get_msg_text_fea_all(all_data)
+# Get time-difference features
+duration_minutes_fea = get_duration_minutes_fea(train, test)
+
+# Get server_model features
+server_model_fea = get_server_model_fea(train, test)
+counter = get_word_counter(train)
+
+# Get nearest_msg features
+nearest_msg_fea = get_nearest_msg_fea(train, test)
+# Get server_model beta_target features
+beta_target_fea = get_beta_target(train, test)
+
+key = ['sn', 'fault_time', 'label', 'server_model']
+
+fea_num = len(KEY_WORDS)
+time_list = [i * TIME_INTERVAL for i in next_time_list]
+train = get_feature(train, time_list, KEY_WORDS, fea_num, key=['sn', 'fault_time', 'label', 'server_model'])
+test = get_feature(test, time_list, KEY_WORDS, fea_num, key=['sn', 'fault_time', 'server_model'])
+
+print('Adding time-difference features')
+train = train.merge(duration_minutes_fea, on=['sn', 'fault_time', 'server_model'])
+test = test.merge(duration_minutes_fea, on=['sn', 'fault_time', 'server_model'])
+
+print('Adding server_model features')
+train = train.merge(server_model_fea, on=['sn', 'server_model'])
+test = test.merge(server_model_fea, on=['sn', 'server_model'])
+
+print('Adding w2v_feats')
+train = train.merge(w2v_feats, on=['sn' ])
+test = test.merge(w2v_feats, on=['sn', ])
+
+print('Adding nearest_msg features')
+train = train.merge(nearest_msg_fea, on=['sn', 'server_model', 'fault_time'])
+test = test.merge(nearest_msg_fea, on=['sn', 'server_model', 'fault_time'])
+
+print('Adding beta_target features')
+train = train.merge(beta_target_fea, on=['sn', 'server_model', 'fault_time'])
+test = test.merge(beta_target_fea, on=['sn', 'server_model', 'fault_time'])
+
+server_model_sn_fea_2 = get_server_model_sn_fea_2(train, test)
+print('Adding server_model_sn_fea_2 features')
+train = train.merge(server_model_sn_fea_2, on=['sn', 'server_model'])
+test = test.merge(server_model_sn_fea_2, on=['sn', 'server_model'])
+
+# crashdump_venus_fea = pd.read_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea.csv') )
+# print('Adding crashdump_venus_fea features')
+# print(train.shape,test.shape,crashdump_venus_fea.shape)
+# train = train.merge(crashdump_venus_fea, on=['sn' , 'fault_time'],how = 'left')
+# test = test.merge(crashdump_venus_fea, on=['sn', 'fault_time' ],how = 'left')
+# print(train.shape,test.shape )
+
+crashdump_venus_fea = pd.read_csv(os.path.join(GENERATION_DIR,'crashdump_venus_fea_v1.csv') )
+print('Adding crashdump_venus_fea features')
+print(train.shape,test.shape,crashdump_venus_fea.shape)
+train = train.merge(crashdump_venus_fea, on=['sn' , 'fault_time'],how = 'left')
+test = test.merge(crashdump_venus_fea, on=['sn', 'fault_time' ],how = 'left')
+print(train.shape,test.shape )
+test.to_csv(os.path.join(GENERATION_DIR,'test.csv'),index =False)
+train.to_csv(os.path.join(GENERATION_DIR,'train.csv'),index =False)
+
+# print('Adding msg_text_fea features')
+# train = train.merge(msg_text_fea, on=['sn', 'fault_time' ], how='left')
+# test = test.merge(msg_text_fea, on=['sn', 'fault_time'], how='left')
+
+# print('Adding keyword cross features')
+# train,test = get_key_word_cross_fea(train,test)
+
+# print('Adding server_model_time_interval_stat_fea features')
+# train = train.merge(server_model_time_interval_stat_fea, on=['server_model' ],how ='left')
+# test = test.merge(server_model_time_interval_stat_fea, on=['server_model'  ],how ='left')
+
+
+use_less_cols_1 = ['last_last_msg_cnt', 'last_first_msg_cnt','time_diff_1_min',
+       'last_msg_list_unique_LabelEnc', 'last_msg_0_unique_LabelEnc',
+       'last_msg_1_unique_LabelEnc', 'last_msg_2_unique_LabelEnc',
+       'last_msg_list_list_LabelEnc', 'last_msg_0_list_LabelEnc',
+       'last_msg_1_list_LabelEnc', 'last_msg_2_list_LabelEnc',
+       'last_msg_0_first_LabelEnc', 'last_msg_1_first_LabelEnc',
+       'last_msg_2_first_LabelEnc', 'last_msg_0_last_LabelEnc',
+       'last_msg_1_last_LabelEnc', 'last_msg_2_last_LabelEnc',
+       'last_msg_last_LabelEnc', 'last_msg_first_LabelEnc']
+
+use_less_col = [i for i in train.columns if train[i].nunique() < 2] + use_less_cols_1
+
+
+print(f'use_less_col:{len(use_less_col)}')
+use_cols = [i for i in train.columns if i not in ['sn', 'fault_time', 'label', 'server_model'] + use_less_col]
+cat_cols = ['server_model_LabelEnc', 'msg_LabelEnc', 'msg_0_LabelEnc', 'msg_1_LabelEnc', 'msg_2_LabelEnc',]
+use_cols = sorted(use_cols)
+print('Number of features used:',len(use_cols))
+
+# cat_cols = []
+# for i in use_cols:
+#     if '_LabelEnc' in i:
+#         cat_cols.append(i)
+
+oof_prob = np.zeros((train.shape[0], 4))
+test_prob = np.zeros((test.shape[0], 4))
+# seeds = [42,4242,40424,1024,2048]
+seeds = [42 ]
+for seed in seeds:
+    # Keep each seed's predictions separate so the running average is not overwritten
+    oof_prob_seed, test_prob_seed, fea_imp_df, model_list = run_lgb(train[use_cols], train[['label']], test[use_cols], k=5,
+                                              seed=seed, cat_cols=cat_cols)
+    oof_prob += oof_prob_seed / len(seeds)
+    test_prob += test_prob_seed / len(seeds)
+
+weight = search_weight(train, train[['label']], oof_prob, init_weight=[1.0], class_num=4, step=0.001)
+oof_prob = oof_prob * np.array(weight)
+test_prob = test_prob * np.array(weight)
+
+target_df = train[['sn', 'fault_time', 'label']]
+submit_df = train[['sn', 'fault_time']].copy()
+submit_df['label'] = oof_prob.argmax(axis=1)
+
+score = macro_f1(target_df=target_df, submit_df=submit_df)
+print(f'********************** BEST MACRO_F1 : {score} **********************')
+score = round(score, 5)
+
+y_pred = test_prob.argmax(axis=1)
+result = test[['sn', 'fault_time']].copy()
+result['label'] = y_pred
+result = preliminary_submit_dataset_a.merge(result, on=['sn', 'fault_time'], how='left')[['sn', 'fault_time', 'label']]
+result['label'] = result['label'].fillna(0).astype(int)
+
+result.to_csv(os.path.join(RESULT_DIR, f'lgb_result.csv'), index=False)
+
+fea_imp_df = fea_imp_df.reset_index(drop=True)
+fea_imp_df.to_csv(os.path.join(RESULT_DIR, f'./lgb_fea_imp_{int(score * 100000)}.csv'), index=False)
+
+train_result_prob = pd.DataFrame(oof_prob).add_prefix('lgb_class_')
+test_result_prob = pd.DataFrame(test_prob).add_prefix('lgb_class_')
+train_result_prob['label'] = train['label']
+train_result_prob['sn'] = train['sn']
+train_result_prob['fault_time'] = train['fault_time']
+test_result_prob['sn'] = test['sn']
+test_result_prob['fault_time'] = test['fault_time']
+
+result_prob = pd.concat([train_result_prob,test_result_prob],ignore_index = True)
+result_prob.to_csv(os.path.join(RESULT_DIR,f'lgb_prob_result.csv'),index = False)
+
+
+end_time = datetime.datetime.now()
+cost_time = end_time - start_time
+print('****************** LIGHTGBM COST TIME : ',str(cost_time),' ******************')
+
+'''
+
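+v7: best version, offline 7356
+v8: v7 + keyword cross features, offline 0.7357, online 7338
+v8.1: v7 + keyword cross features, fed to the model as categorical variables, 0.73361
+v8.2: v7 + keyword cross features, fed to the model as categorical variables, TOP_KEY_WORDS removed, 7117
+v8.3: v7 + keyword cross features, fed to the model as categorical variables, using TOP_KEY_WORDS_2, 7260
+v8.3: v7 + keyword cross features, fed to the model as categorical variables, adding TOP_KEY_WORDS_2, 7260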
+
+'''
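
The seed loop in this script is meant to average out-of-fold and test probabilities over several seeds. A minimal sketch of that averaging follows, assuming a hypothetical `fake_run_model` stand-in for `run_lgb` and made-up array sizes; it is an illustration, not the author's exact code.

```python
import numpy as np


def average_over_seeds(run_model, seeds, n_train, n_test, n_classes=4):
    """Average per-seed OOF and test probabilities into fresh accumulators."""
    oof_prob = np.zeros((n_train, n_classes))
    test_prob = np.zeros((n_test, n_classes))
    for seed in seeds:
        # Per-seed outputs get their own names; reusing the accumulator
        # names here would discard the running average.
        seed_oof, seed_test = run_model(seed)
        oof_prob += seed_oof / len(seeds)
        test_prob += seed_test / len(seeds)
    return oof_prob, test_prob


def fake_run_model(seed):
    # Hypothetical stand-in for run_lgb: random, row-normalised probabilities.
    rng = np.random.default_rng(seed)
    oof = rng.random((6, 4))
    test = rng.random((3, 4))
    return oof / oof.sum(axis=1, keepdims=True), test / test.sum(axis=1, keepdims=True)


oof_avg, test_avg = average_over_seeds(fake_run_model, [42, 4242, 1024], n_train=6, n_test=3)
```

Because each per-seed matrix has rows summing to 1, the averaged rows also sum to 1, which is a quick sanity check on the accumulation.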
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/log.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/log.py
@ -0,0 +1,50 @@
+
+import logging
+import os
+
+
+class Logger:
+    def __init__(self, name, log_path, mode='a'):
+        """
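+        Constructor of the run-log class.
+        :param name: name of the log file to write; the .log suffix is appended automatically
+        :param log_path: directory in which the log file is saved
+        :param mode: file write mode, a: append, w: overwrite
+        Usage:
+            1. Create a logger instance
+                logger = Logger("textCNN_train", log_path="../logs").get_log
+            2. Write key information to the log file through the instance
+                logger.info("")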
+        """
+        self.__name = name
+        self.logger = logging.getLogger(self.__name)
+        self.logger.setLevel(logging.DEBUG)
+        self.log_path = log_path
+        self.mode = mode
+
+        # Create a handler that writes to the log file
+        # log_path = os.path.dirname(os.path.abspath(__file__))
+        # Use utf-8 encoding so the log text is not garbled
+        logname = os.path.join(self.log_path, self.__name + '.log')  # path of the output log file
+        # Define the output format of the handlers
+        formatter = logging.Formatter(
+            '%(asctime)s-%(filename)s-[LOG]-[%(module)s-%(funcName)s-line:%(lineno)d]-%(levelname)s: %(message)s')
+
+        fh = logging.FileHandler(logname, mode=self.mode, encoding='utf-8')  # single log file, no rotation; a appends, w overwrites
+        fh.setLevel(logging.DEBUG)
+
+        # Create a handler that prints the log to the console
+        ch = logging.StreamHandler()
+        ch.setLevel(logging.DEBUG)
+
+        fh.setFormatter(formatter)
+        ch.setFormatter(formatter)
+
+        # Attach the handlers to the logger
+        self.logger.addHandler(fh)
+        self.logger.addHandler(ch)
+
+    @property
+    def get_log(self):
+        """Return the underlying logger instance."""
+        return self.logger
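
`logging.getLogger(name)` returns the same logger object for the same name, so building `Logger` twice with one name attaches the handlers twice and every message is emitted twice. A minimal sketch of a guard against that, assuming a hypothetical helper `get_logger_once` and a shortened format string:

```python
import logging


def get_logger_once(name):
    """Return a named logger, attaching a console handler only on first use."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Handlers are attached only when the logger has none yet.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(asctime)s-%(levelname)s: %(message)s'))
        logger.addHandler(handler)
    return logger


first = get_logger_once('demo_logger')
second = get_logger_once('demo_logger')
second.info('emitted once, not twice')
```

The same `if not logger.handlers` check could wrap the `addHandler` calls in the class above if repeated construction is expected.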
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/model.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/model.py
@ -0,0 +1,150 @@
+import warnings
+import datetime
+import lightgbm as lgb
+import numpy as np
+import pandas as pd
+from catboost import CatBoostClassifier
+from sklearn.model_selection import StratifiedKFold
+
+from utils import N_ROUNDS
+import pickle
+import os
+warnings.filterwarnings('ignore')
+
+
+def get_model_feature_importances(model):
+    feature_importances = pd.DataFrame()
+    feature_importances['fea'] = model.feature_names_
+    feature_importances['importances'] = model.feature_importances_
+    feature_importances = feature_importances.sort_values('importances', ascending=False).reset_index(drop=True)
+
+    return feature_importances
+
+
+def run_cbt(train, target, test, k, seed, NUM_CLASS=4, cat_cols=[]):
+    print('********************** RUN CATBOOST MODEL **********************')
+    print(f'******************  CURRENT SEED {seed} ********************** ')
+    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
+    oof_prob = np.zeros((train.shape[0], NUM_CLASS))
+    test_prob = np.zeros((test.shape[0], NUM_CLASS))
+    feature_importance_df = []
+    offline_score = []
+    model_list = []
+
+    ## K-Fold
+    for fold, (trn_idx, val_idx) in enumerate(folds.split(train, target)):
+        print("FOLD {} IS RUNNING...".format(fold + 1))
+        trn_x, trn_y = train.loc[trn_idx], target.loc[trn_idx]
+        val_x, val_y = train.loc[val_idx], target.loc[val_idx]
+        catboost_model = CatBoostClassifier(
+            iterations=N_ROUNDS,
+            od_type='Iter',
+            od_wait=120,
+            max_depth=8,
+            learning_rate=0.05,
+            l2_leaf_reg=9,
+            random_seed=seed,
+            fold_len_multiplier=1.1,
+            loss_function='MultiClass',
+            logging_level='Verbose',
+            # task_type="GPU"
+
+        )
+
+        start_time = datetime.datetime.now()
+
+        catboost_model.fit(trn_x,
+                           trn_y,
+                           eval_set=(val_x, val_y),
+                           use_best_model=True,
+                           verbose=800,
+                           early_stopping_rounds=100,
+                           cat_features=cat_cols,
+                           )
+        end_time = datetime.datetime.now()
+        model_train_cost_time = end_time - start_time
+        print('****************** MODEL TRAINING COST TIME : ',str(model_train_cost_time),' ******************')
+
+        start_time = datetime.datetime.now()
+        oof_prob[val_idx] = catboost_model.predict_proba(train.loc[val_idx])
+        end_time = datetime.datetime.now()
+        model_pred_cost_time = end_time - start_time
+        print('****************** MODEL PREDICTION COST TIME : ', str(model_pred_cost_time), ' ******************')
+        #         catboost_model = catboost_model.get_best_iteration()
+        test_prob += catboost_model.predict_proba(test) / folds.n_splits
+        print(catboost_model.get_best_score())
+        offline_score.append(catboost_model.get_best_score()['validation']['MultiClass'])
+
+        feature_importance_df.append(get_model_feature_importances(catboost_model))
+        model_list.append(catboost_model)
+        with open(os.path.join('../model', f'cat_model_flod_{fold}.pkl'), 'wb') as f:
+            pickle.dump(catboost_model, f)
+    print('\nOOF-MEAN-ERROR score:%.6f, OOF-STD:%.6f' % (np.mean(offline_score), np.std(offline_score)))
+    fea_imp_df = pd.concat(feature_importance_df, ignore_index=True).groupby('fea').agg(
+        {'importances': 'mean'}).reset_index().sort_values('importances', ascending=False).reset_index(drop=True)
+
+    return oof_prob, test_prob, fea_imp_df, model_list
+
+
+def run_lgb(train, target, test, k, seed=42, NUM_CLASS=4, cat_cols=[]):
+    # feats = [f for f in train.columns if f not in ['cust_no', 'label', 'I7', 'I9', 'B6']]
+    #     print('Current num of features:', len(feats))
+    print(f'********************** RUN LGBM MODEL **********************')
+    print(f'******************  CURRENT SEED {seed} ********************** ')
+    cols_map = {j: i for i, j in enumerate(train.columns)}
+    cat_cols = [cols_map[i] for i in cat_cols]
+    train = train.rename(columns=cols_map)
+    test = test.rename(columns=cols_map)
+    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
+    oof_prob = np.zeros((train.shape[0], NUM_CLASS))
+    test_prob = np.zeros((test.shape[0], NUM_CLASS))
+    fea_imp_df_list = []
+    offline_score = []
+    model_list = []
+    ## K-Fold
+    for fold, (trn_idx, val_idx) in enumerate(folds.split(train, target)):
+        params = {
+            "objective": "multiclass",
+            "num_class": NUM_CLASS,
+            "learning_rate": 0.01,
+            "max_depth": -1,
+            "num_leaves": 32,
+            "verbose": -1,
+            "bagging_fraction": 0.8,
+            "feature_fraction": 0.8,
+            "seed": seed,
+            'metric': 'multi_error'
+
+        }
+        print("FOLD {} IS RUNNING...".format(fold + 1))
+        trn_data = lgb.Dataset(train.loc[trn_idx], label=target.loc[trn_idx])
+        val_data = lgb.Dataset(train.loc[val_idx], label=target.loc[val_idx])
+
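+        # train
+        params['seed'] = seed
+        # Early stopping and periodic logging are passed as callbacks: LightGBM 4.0 removed the
+        # early_stopping_rounds and verbose_eval arguments of lgb.train().
+        lgb_model = lgb.train(
+            params,
+            trn_data,
+            num_boost_round=N_ROUNDS,
+            valid_sets=[trn_data, val_data],
+            categorical_feature=cat_cols,
+            callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=200)],
+
+        )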
+        # predict
+        oof_prob[val_idx] = lgb_model.predict(train.loc[val_idx], num_iteration=lgb_model.best_iteration)
+        test_prob += lgb_model.predict(test, num_iteration=lgb_model.best_iteration) / folds.n_splits
+        offline_score.append(lgb_model.best_score['valid_1']['multi_error'])
+        fea_imp = pd.DataFrame()
+        fea_imp['feature_name'] = lgb_model.feature_name()
+        fea_imp['importance'] = lgb_model.feature_importance()
+        fea_imp['feature_name'] = fea_imp['feature_name'].map({str(cols_map[i]): i for i in cols_map})
+        fea_imp = fea_imp.sort_values('importance', ascending=False)
+        fea_imp_df_list.append(fea_imp)
+
+        model_list.append(lgb_model)
+    print('\nOOF-MEAN-ERROR score:%.6f, OOF-STD:%.6f' % (np.mean(offline_score), np.std(offline_score)))
+    fea_imp_df = pd.concat(fea_imp_df_list, ignore_index=True).groupby('feature_name').agg(
+        {'importance': 'mean'}).reset_index().sort_values('importance', ascending=False)
+    return oof_prob, test_prob, fea_imp_df, model_list
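
`run_lgb` renames the columns to integer indices before training and maps the reported feature names back afterwards. The round trip can be sketched in plain Python; the column names below are hypothetical, and the assumption that the booster reports the integer names as strings is taken from the `str(cols_map[i])` lookup in the code above.

```python
# Hypothetical feature columns, one of them categorical.
columns = ['sn_cnt', 'msg_len', 'server_model_LabelEnc']

# Forward map: original column name -> integer index used for training.
cols_map = {name: idx for idx, name in enumerate(columns)}
cat_cols = [cols_map[c] for c in ['server_model_LabelEnc']]

# Inverse map: stringified index (as assumed to be reported by the booster) -> original name.
inverse_map = {str(idx): name for name, idx in cols_map.items()}

reported = ['0', '1', '2']  # assumed form of the reported feature names
restored = [inverse_map[n] for n in reported]
print(restored)
```

Using indices avoids problems with special characters in feature names, at the cost of needing the inverse map whenever importances are reported.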
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/requirements.txt
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/requirements.txt
@ -0,0 +1 @@
+scikit_learn==1.0.2
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/stacking.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/stacking.py
@ -0,0 +1,43 @@
+import os
+import numpy as np
+import pandas as pd
+from  utils import RESULT_DIR
+lgb_result = pd.read_csv(os.path.join(RESULT_DIR,'lgb_prob_result.csv'))
+lgb_result = lgb_result[lgb_result['label'].isnull()]
+print(lgb_result.columns)
+del lgb_result['label']
+
+cat_result = pd.read_csv(os.path.join(RESULT_DIR,'cat_prob_result.csv'))
+cat_result = cat_result[cat_result['label'].isnull()]
+del cat_result['label']
+
+# bert_result = pd.read_csv(os.path.join(RESULT_DIR,'bert_prob_result.csv'))
+
+model_weight = {'lgb':0.2,'cat':0.8,'bert':0.2}
+print(f'MODEL WEIGHT: {model_weight}')
+# for i in ['bert_class_0', 'bert_class_1', 'bert_class_2','bert_class_3']:
+#     bert_result[i] = bert_result[i]*model_weight['bert']
+
+for i in  ['cat_class_0', 'cat_class_1', 'cat_class_2', 'cat_class_3']:
+    cat_result[i] = cat_result[i]*model_weight['cat']
+
+for i in  ['lgb_class_0', 'lgb_class_1', 'lgb_class_2', 'lgb_class_3']:
+    lgb_result[i] = lgb_result[i]*model_weight['lgb']
+
+result= lgb_result.merge(cat_result,on =['sn', 'fault_time'],how ='left' )
+
+# result= bert_result.merge(cat_result,on =['sn', 'fault_time'],how ='left' )
+#
+# result['class_0'] =result.loc[:,['cat_class_0','bert_class_0']].sum(1)
+# result['class_1'] =result.loc[:,['cat_class_1','bert_class_1']].sum(1)
+# result['class_2'] =result.loc[:,['cat_class_2','bert_class_2']].sum(1)
+# result['class_3'] =result.loc[:,['cat_class_3','bert_class_3']].sum(1)
+
+result['class_0'] =result.loc[:,['lgb_class_0','cat_class_0']].sum(1)
+result['class_1'] =result.loc[:,['lgb_class_1','cat_class_1']].sum(1)
+result['class_2'] =result.loc[:,['lgb_class_2','cat_class_2']].sum(1)
+result['class_3'] =result.loc[:,['lgb_class_3','cat_class_3']].sum(1)
+
+result['label'] = np.argmax(result.loc[:,['class_0', 'class_1', 'class_2', 'class_3']].values,axis = 1)
+result = result[['sn', 'fault_time','label']]
+result.to_csv(os.path.join(RESULT_DIR,'stacking_result.csv'),index = False)
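
Despite the file name, the script above does a weighted average of class probabilities rather than training a meta-model. A minimal sketch of that blend with NumPy, using the `lgb`/`cat` weights from `model_weight` and two hypothetical probability rows:

```python
import numpy as np


def blend(prob_by_model, weights):
    """Weighted sum of per-model class probabilities, then argmax per row."""
    total = sum(weights[name] * prob for name, prob in prob_by_model.items())
    return total.argmax(axis=1)


# Hypothetical probabilities for two samples and four classes.
lgb_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.2, 0.6, 0.1]])
cat_prob = np.array([[0.2, 0.6, 0.1, 0.1],
                     [0.1, 0.1, 0.7, 0.1]])

labels = blend({'lgb': lgb_prob, 'cat': cat_prob}, {'lgb': 0.2, 'cat': 0.8})
print(labels)  # → [1 2]
```

In the first row the LightGBM model prefers class 0, but the larger CatBoost weight moves the blended prediction to class 1.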
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/utils.py
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/code/utils.py
@ -0,0 +1,232 @@
+import os
+import sys
+from log import Logger
+from collections import Counter
+from tqdm import tqdm
+import numpy as np
+import pandas as pd
+
+ROOT_DIR = os.path.join(sys.path[0], '../')
+LOG_DIR = os.path.join(ROOT_DIR, 'log')
+
+DATA_DIR = os.path.join(ROOT_DIR, 'data')
+TRAIN_DIR = os.path.join(DATA_DIR, 'preliminary_train')
+# Switch these paths when submitting the Docker image
+MODEL_PATH = os.path.join(ROOT_DIR, './model/deberta-base')
+MODEL_1_PATH = os.path.join(ROOT_DIR, './model')
+TEST_A_DIR = os.path.join(ROOT_DIR, './tcdata')
+# TEST_A_DIR = os.path.join(ROOT_DIR, './tcdata_test')
+PSEUDO_FALG = True
+TEST_B_DIR = os.path.join(ROOT_DIR, 'tcdata')
+
+
+
+RESULT_DIR = os.path.join(ROOT_DIR, 'prediction_result')
+
+FEATURE_DIR = os.path.join(ROOT_DIR, 'feature')
+GENERATION_DIR = os.path.join(FEATURE_DIR, 'generation')
+CORRELATION_DIR = os.path.join(FEATURE_DIR, 'correlation')
+
+
+USER_DATA_DIR = os.path.join(ROOT_DIR, 'user_data')
+USER_MODEL_DIR = os.path.join(USER_DATA_DIR, 'model_data')
+TMP_DIR = os.path.join(USER_DATA_DIR, 'tmp_data')
+N_ROUNDS = 10000
+TIME_INTERVAL = 60
+
+KEY_1 = ['OEM record c2', 'Processor CPU_Core_Error', '001c4c', 'System Event Sys_Event', 'Power Supply PS0_Status',
+         'Temperature CPU0_Margin_Temp', 'Reading 51 &gt; Threshold 85 degrees C', 'Lower Non-critical going low',
+         'Temperature CPU1_Margin_Temp', 'System ACPI Power State #0x7d', 'Lower Critical going low']
+KEY_2 = ['OEM CPU0 MCERR', 'OEM CPU0 CATERR', 'Reading 0 &lt; Threshold 2 degrees C', '0203c0a80101',
+         'Unknown CPU0 MCERR', 'Unknown CPU0 CATERR', 'Microcontroller #0x3b', 'System Boot Initiated',
+         'Processor #0xfa', 'Power Unit Pwr Unit Status', 'Hard reset', 'Power off/down', 'System Event #0xff',
+         'Memory CPU1A1_DIMM_Stat', '000000', 'Power cycle', 'OEM record c3', 'Memory CPU1C0_DIMM_Stat',
+         'Reading 0 &lt; Threshold 1 degrees C', 'IERR']
+KEY_3 = ['Memory', 'Correctable ECC logging limit reached', 'Memory MEM_CHE0_Status', 'Memory Memory_Status',
+         'Memory #0x87', 'Memory CPU0F0_DIMM_Stat', 'Memory Device Disabled', 'Memory #0xe2',
+         'OS Stop/Shutdown OS Status', 'System Boot Initiated System Restart', 'OS Boot BIOS_Boot_Up',
+         'System Boot Initiated BIOS_Boot_UP', 'Memory DIMM101', 'OS graceful shutdown', 'OS Critical Stop OS Status',
+         'Memory #0xf9', 'Memory CPU0C0_DIMM_Stat', 'Memory DIMM111', 'Memory DIMM021', ]
+KEY_4 = ['Drive Fault', 'NMI/Diag Interrupt', 'Failure detected', 'Power Supply AC lost', 'Power Supply PSU0_Supply',
+         'AC out-of-range, but present', 'Predictive failure', 'Drive Present', 'Temperature Temp_DIMM_KLM',
+         'Temperature Temp_DIMM_DEF', 'Power Supply PS1_Status', 'Identify Status', 'Power Supply PS2_Status',
+         'Temperature DIMMG1_Temp', 'Upper Non-critical going high', 'Temperature DIMMG0_Temp',
+         'Upper Critical going high', 'Power Button pressed', 'System Boot Initiated #0xb8', 'Deasserted']
+TOP_KEY_WORDS = ['0203c0a80101', 'Configuration Error', 'Correctable ECC', 'Deasserted', 'Device Enabled', 'Drive Present',
+                 'Event Logging Disabled SEL', 'Failure detected', 'IERR', 'Initiated by hard reset', 'Initiated by power up',
+                 'Initiated by warm reset', 'Log area reset/cleared', 'Memory', 'Memory #0xe2', 'Memory CPU0C0',
+                 'Microcontroller/Coprocessor BMC', 'OEM CPU0 CATERR', 'OEM CPU0 MCERR', 'OS Boot BIOS',
+                 'OS Critical Stop OS Status', 'Power Supply PS1', 'Power Supply PS2', 'Presence detected', 'Processor', 'Processor CPU', 'Processor CPU0',
+                 'Processor CPU1', 'S0/G0: working', 'S4/S5: soft-off', 'Slot / Connector PCIE', 'State Asserted', 'State Deasserted',
+                 'System ACPI Power State ACPI', 'System Boot Initiated', 'System Boot Initiated #0xe0', 'System Boot Initiated BIOS',
+                 'System Event', 'System Event #0x10', 'System Event #0xff', 'Timestamp Clock Sync', 'Transition to Running', 'Uncorrectable ECC',
+                 'Uncorrectable machine check exception', 'Unknown CPU0 CATERR', 'Unknown CPU0 MCERR', 'Unknown Chassis', 'Watchdog2 IPMI',
+                 ]
+TOP_KEY_WORDS_2 = ['Processor CPU0 Status', 'System Boot Initiated BIOS Boot Up', 'Uncorrectable ECC', 'Initiated by power up',
+                   'Configuration Error', 'Processor CPU CATERR', 'Processor CPU1 Status', 'Memory #0xe2', 'IERR', 'Initiated by warm reset',
+                   'State Asserted', 'S4/S5: soft-off', 'Memory #0xf9', 'S0/G0: working', 'boot completed - device not specified', 'Timestamp Clock Sync',
+                   'Presence detected', 'System Boot Initiated #0xe0', 'Drive Fault', 'Power Supply PS1 Status', 'Power off/down', 'OS Boot #0xe9',
+                   'Failure detected', 'Uncorrectable machine check exception', 'Transition to Running', 'Power Supply PS2 Status',
+                   'Memory Device Disabled', 'System Restart', 'System Event #0x10', 'Sensor access degraded or unavailable', 'Unknown #0x17',
+                   'Drive Present', 'Management Subsys Health System Health', 'Power Supply AC lost', 'Microcontroller #0x16']
+CHARATERS = ['#', '&', ]
+# KEY_WORDS = KEY_1 + KEY_2 + KEY_3 + KEY_4 + CHARATERS
+KEY_WORDS = KEY_1 + KEY_2 + KEY_3 + KEY_4 + CHARATERS + TOP_KEY_WORDS
+KEY_WORDS = list(set(KEY_WORDS))
+# cnt_1_0_diff_key_words = ['State Asserted','Processor CPU_CATERR','Unknown #0x17','Microcontroller #0x16','Transition to Running','State Deasserted','Processor #0xfa','Temperature CPU1_Margin_Temp','Temperature CPU0_Margin_Temp','Power cycle','Management Subsys Health System_Health','Sensor access degraded or unavailable','Power off/down','System ACPI Power State #0x7d']
+# key_words_0 = ['Temperature CPU0_Margin_Temp','Lower Critical going low','System ACPI Power State #0x7d','Temperature CPU1_Margin_Temp','Lower Non-critical going low','Uncorrectable machine check exception','Reading 0 &lt; Threshold 1 degrees C','000000','Unknown #0x19','Temperature DIMMG1_Temp','Reading 0 &lt; Threshold 0 degrees C','001c4c','IERR','Upper Critical going high','Unknown Chassis_control','Temperature DIMMG0_Temp','Upper Non-critical going high','Temperature Temp_DIMM_DEF','Power cycle','Processor CPU0_Status','Temperature Temp_DIMM_KLM','Processor CPU1_Status','Management Subsys Health System_Health']
+# key_words_1 = ['Processor #0xfa','State Deasserted','Power off/down','Power cycle','IERR','Unknown #0x17','Management Subsys Health System_Health','Processor CPU_CATERR','Reading 0 &lt; Threshold 1 degrees C','','Sensor access degraded or unavailable','Transition to Running','State Asserted','Microcontroller #0x16','Processor CPU0_Status','Processor CPU1_Status','Slot / Connector PCIE_Status','Fault Status','System ACPI Power State ACPI_PWR_Status','Management Subsystem Health System_Health','Configuration Error','Uncorrectable machine check exception','Timestamp Clock Sync']
+# key_words_2 = ['Memory #0xe2','Memory Device Disabled','Memory #0x87','Memory #0xf9','Correctable ECC','Memory CPU0D0_DIMM_Stat','Uncorrectable ECC','Memory CPU1B0_DIMM_Stat','System Boot Initiated BIOS_Boot_UP','System Restart','Presence Detected','Temperature CPU0_Temp','boot completed - device not specified','Log almost full','Device Present','Legacy OFF state','System Boot Initiated #0xe0','System Event #0x10','Legacy ON state','OS Boot #0xe0','Unknown #0xc5','System Boot Initiated #0xb8','Event Logging Disabled SEL_Status']
+# key_words_3 = ['Drive Fault','Failure detected','Drive Present','Temperature Temp_DIMM_KLM','Temperature Temp_DIMM_DEF','Power Supply PS4_Status','Upper Non-critical going high','Temperature DIMMG0_Temp','Temperature DIMMG1_Temp','Power Supply PS3_Status','Upper Critical going high','Predictive failure','Power Supply AC lost','Unknown #0x19','Power Unit Power Unit','AC out-of-range, but present','Power Supply PS1_Status','Power Supply PS2_Status','Log area reset/cleared','Microcontroller/Coprocessor BMC_Boot_Up','System Boot Initiated #0xb8','Power Button pressed','Device Present']
+# top_key_words = [ 'Configuration Error','Uncorrectable ECC','Processor CPU0_Status','Initiated by power up','','Presence Detected','Processor CPU1_Status','S0/G0: working','Processor CPU_CATERR','Presence detected','S4/S5: soft-off','Upper Critical going high','Memory #0xe2','IERR','Initiated by warm reset','State Asserted','Upper Non-critical going high','boot completed - device not specified','Memory Device Disabled','Timestamp Clock Sync','Lower Critical going low','Transition to Running','Memory #0xf9','Power Supply PS1_Status']
+# key_words_1_desc = ['#0xfa', '#0x','#0xff','CATERR','cycle','Unit','IERR','IPMI','#0x17', 'Running','#0x7c','Unknown','CPU', 'Sensor','CPU0','CPU1','Subsys']
+#
+# key_words = cnt_1_0_diff_key_words +key_words_0+key_words_1+key_words_2+key_words_3+top_key_words+key_words_1_desc
+# key_words = list(set(key_words))
+# KEY_WORDS = key_words+CHARATERS
+
+
+def create_dir(dir):
+    """
+    Create a directory.
+    :param dir: directory path
+    :return:
+    """
+    if not os.path.exists(dir):
+        os.mkdir(dir)
+        print(f'Directory {dir} did not exist; created it.')
+    else:
+        print(f'Directory {dir} already exists.')
+
+
+def create_all_dir():
+    """
+    Create all required directories.
+    :return:
+    """
+    create_dir(ROOT_DIR)
+    create_dir(LOG_DIR)
+
+    # create_dir(MODEL_DIR)
+    create_dir(RESULT_DIR)
+
+    create_dir(FEATURE_DIR)
+    create_dir(GENERATION_DIR)
+    create_dir(CORRELATION_DIR)
+
+    create_dir(DATA_DIR)
+    create_dir(TRAIN_DIR)
+    create_dir(TEST_A_DIR)
+    # create_dir(TEST_B_DIR)
+
+    create_dir(USER_DATA_DIR)
+    create_dir(USER_MODEL_DIR)
+    create_dir(TMP_DIR)
+
+
+def clean_str(string):
+    return string
+
+
+def my_tokenizer(s):
+    return s.split(' | ')
+
+
+def get_word_counter(data):
+    print('Building the token counter for the abnormal logs')
+
+    counter = Counter()
+    for string_ in tqdm(data['msg']):
+        string_ = string_.strip()
+        counter.update(my_tokenizer(clean_str(string_)))
+    return counter
+
+
+def macro_f1(target_df: pd.DataFrame, submit_df: pd.DataFrame):
+    """
+    Compute the competition score (class-weighted F1).
+    :param target_df: [sn,fault_time,label]
+    :param submit_df: [sn,fault_time,label]
+    :return:
+    """
+
+    weights = [5 / 11, 4 / 11, 1 / 11, 1 / 11]
+
+    # weights = [3 / 7, 2 / 7, 1 / 7, 1 / 7]
+    overall_df = target_df.merge(
+        submit_df, how='left', on=[
+            'sn', 'fault_time'], suffixes=[
+            '_gt', '_pr'])
+    overall_df = overall_df.fillna(-1)  # fillna returns a new frame; keep the result
+    macro_F1 = 0.
+    for i in range(len(weights)):
+        TP = len(overall_df[(overall_df['label_gt'] == i)
+                 & (overall_df['label_pr'] == i)])
+        FP = len(overall_df[(overall_df['label_gt'] != i)
+                 & (overall_df['label_pr'] == i)])
+        FN = len(overall_df[(overall_df['label_gt'] == i)
+                 & (overall_df['label_pr'] != i)])
+        precision = TP / (TP + FP) if (TP + FP) > 0 else 0
+        recall = TP / (TP + FN) if (TP + FN) > 0 else 0
+        F1 = 2 * precision * recall / \
+            (precision + recall) if (precision + recall) > 0 else 0
+        macro_F1 += weights[i] * F1
+    return macro_F1
+
+
+def search_weight(train, valid_y, raw_prob, init_weight=[
+                  1.0], class_num=4, step=0.001):
+    weight = init_weight.copy() * class_num
+    oof = train[['sn', 'fault_time']].copy()  # copy so the label column is not written into a slice of train
+    oof['label'] = raw_prob.argmax(axis=1)
+    f_best = macro_f1(train[['sn', 'fault_time', 'label']], oof)
+    print("Init Score:", f_best)
+
+    #     f_best = f1_score(y_true=valid_y, y_pred=raw_prob.argmax(axis=1),average='macro')
+    flag_score = 0
+    round_num = 1
+    while (flag_score != f_best):
+        print("round: ", round_num)
+        round_num += 1
+        flag_score = f_best
+        for c in range(class_num):
+            for n_w in range(0, 2000, 10):
+                num = n_w * step
+                new_weight = weight.copy()
+                new_weight[c] = num
+                prob_df = raw_prob.copy()
+                prob_df = prob_df * np.array(new_weight)
+
+                oof['label'] = prob_df.argmax(axis=1)
+                f = macro_f1(train[['sn', 'fault_time', 'label']], oof)
+                #                 f = f1_score(y_true=valid_y, y_pred=prob_df.argmax(axis=1),average='macro')
+                if f > f_best:
+                    weight = new_weight.copy()
+                    f_best = f
+                    print(f"class:{c}, new_weight:{num}, f1 score: {f}")
+    print(
+        f'********************** SEARCH BEST WEIGHT : {weight} **********************')
+    return weight
+
+
+def get_new_cols(df, key=['sn', 'fault_time']):
+    if isinstance(df.columns[0], tuple):
+
+        new_cols = []
+        for i in df.columns:
+            if i[0] in key:
+                new_cols.append(i[0])
+            else:
+                new_cols.append(f'{i[0]}_{i[1]}')
+        df.columns = new_cols
+        return df
+    else:
+        print('The DataFrame has no two-level column names; please check.')
+        return df
+
+
+if __name__ == '__main__':
+    # create_all_dir()
+    logger = Logger(name=os.path.basename(__file__).split(
+        '.py')[0], log_path=LOG_DIR, mode="w").get_log
+    print(len(KEY_WORDS))
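
The `macro_f1` function in this file is a class-weighted F1 with weights 5/11, 4/11, 1/11 and 1/11. A minimal NumPy sketch of the same formula on plain label arrays follows; the function name `weighted_f1` and the sample labels are hypothetical, and the DataFrame merge on `sn` and `fault_time` is left out.

```python
import numpy as np

WEIGHTS = [5 / 11, 4 / 11, 1 / 11, 1 / 11]


def weighted_f1(y_true, y_pred, weights=WEIGHTS):
    """Sum of per-class F1 scores, each scaled by its class weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    score = 0.0
    for cls, w in enumerate(weights):
        tp = np.sum((y_true == cls) & (y_pred == cls))
        fp = np.sum((y_true != cls) & (y_pred == cls))
        fn = np.sum((y_true == cls) & (y_pred != cls))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        score += w * f1
    return score


perfect = weighted_f1([0, 1, 2, 3], [0, 1, 2, 3])
partial = weighted_f1([0, 0, 1, 2, 3], [0, 1, 1, 2, 3])
print(perfect, partial)
```

In the second call one class-0 sample is predicted as class 1, so classes 0 and 1 each score F1 = 2/3 and the total is (5/11 + 4/11) * 2/3 + 2/11 = 8/11. Errors on classes 0 and 1 cost far more than errors on classes 2 and 3.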
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/data/数据集下载地址
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/data/数据集下载地址
@ -0,0 +1 @@
+https://tianchi.aliyun.com/competition/entrance/531947/information?lang=zh-cn
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/docker_push.sh
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/docker_push.sh
@ -0,0 +1,6 @@
+# Build the image and push it to your image registry
+rm -rf result.zip
+# build the image
+docker build -t [your registry address]:[TAG] .
+# push the image
+docker push [your registry address]:[TAG]
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/log/catboost.log
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/log/catboost.log
@ -0,0 +1,106 @@
+use_less_col:335
+Number of features used: 1762
+********************** RUN CATBOOST MODEL **********************
+******************  CURRENT SEED 42 ********************** 
+FOLD 1 IS RUNNING...
+0:	learn: 1.2939006	test: 1.2943096	best: 1.2943096 (0)	total: 135ms	remaining: 22m 27s
+800:	learn: 0.2057676	test: 0.2778188	best: 0.2778188 (800)	total: 1m 2s	remaining: 11m 55s
+1600:	learn: 0.1533318	test: 0.2698555	best: 0.2698522 (1599)	total: 2m 4s	remaining: 10m 54s
+Stopped by overfitting detector  (100 iterations wait)
+
+bestTest = 0.2677497192
+bestIteration = 2222
+
+Shrink model to first 2223 iterations.
+{'learn': {'MultiClass': 0.12163532058790176}, 'validation': {'MultiClass': 0.26774971916097773}}
+FOLD 2 IS RUNNING...
+0:	learn: 1.2947765	test: 1.2944610	best: 1.2944610 (0)	total: 81.8ms	remaining: 13m 38s
+800:	learn: 0.2009925	test: 0.2969940	best: 0.2969940 (800)	total: 1m 2s	remaining: 11m 53s
+Stopped by overfitting detector  (100 iterations wait)
+
+bestTest = 0.2898436422
+bestIteration = 1413
+
+Shrink model to first 1414 iterations.
+{'learn': {'MultiClass': 0.15671545706553627}, 'validation': {'MultiClass': 0.2898436422052235}}
+FOLD 3 IS RUNNING...
+0:	learn: 1.2956904	test: 1.2979653	best: 1.2979653 (0)	total: 83.6ms	remaining: 13m 55s
+800:	learn: 0.2010365	test: 0.3031897	best: 0.3031249 (796)	total: 1m 2s	remaining: 11m 56s
+1600:	learn: 0.1521093	test: 0.2952955	best: 0.2952927 (1598)	total: 2m 4s	remaining: 10m 54s
+Stopped by overfitting detector  (100 iterations wait)
+
+bestTest = 0.2948664255
+bestIteration = 1799
+
+Shrink model to first 1800 iterations.
+{'learn': {'MultiClass': 0.13764700334845772}, 'validation': {'MultiClass': 0.2948664254808659}}
+FOLD 4 IS RUNNING...
+0:	learn: 1.2944941	test: 1.2931731	best: 1.2931731 (0)	total: 83.8ms	remaining: 13m 58s
+800:	learn: 0.2055831	test: 0.2798750	best: 0.2798750 (800)	total: 1m 2s	remaining: 11m 54s
+1600:	learn: 0.1555797	test: 0.2733073	best: 0.2732265 (1590)	total: 2m 4s	remaining: 10m 54s
+Stopped by overfitting detector  (100 iterations wait)
+
+bestTest = 0.2729804824
+bestIteration = 1672
+
+Shrink model to first 1673 iterations.
+{'learn': {'MultiClass': 0.14819996336927216}, 'validation': {'MultiClass': 0.27298048242230794}}
+FOLD 5 IS RUNNING...
+0:	learn: 1.2909100	test: 1.2914652	best: 1.2914652 (0)	total: 86.9ms	remaining: 14m 29s
+800:	learn: 0.2014462	test: 0.2983963	best: 0.2983963 (800)	total: 1m 2s	remaining: 11m 55s
+1600:	learn: 0.1523926	test: 0.2909189	best: 0.2907775 (1582)	total: 2m 4s	remaining: 10m 54s
+Stopped by overfitting detector  (100 iterations wait)
+
+bestTest = 0.2898741689
+bestIteration = 1887
+
+Shrink model to first 1888 iterations.
+{'learn': {'MultiClass': 0.13391467495348316}, 'validation': {'MultiClass': 0.289874168865446}}
+
+OOF-MEAN-ERROR score:0.283063, OOF-STD:0.010657
+Inint Score: 0.7240031522090993
+round:  1
+class:0, new_weight:1.01, f1 score: 0.7242893038330873
+class:0, new_weight:1.02, f1 score: 0.7244468658037289
+class:0, new_weight:1.03, f1 score: 0.7247189260435818
+class:0, new_weight:1.05, f1 score: 0.7247883133652404
+class:0, new_weight:1.06, f1 score: 0.7253074441711662
+class:0, new_weight:1.07, f1 score: 0.7255838308898628
+class:0, new_weight:1.09, f1 score: 0.7258591461588992
+class:0, new_weight:1.1, f1 score: 0.7263732069942956
+class:0, new_weight:1.11, f1 score: 0.7269810148203093
+class:0, new_weight:1.12, f1 score: 0.727085092104794
+class:0, new_weight:1.19, f1 score: 0.7275673332111965
+class:0, new_weight:1.2, f1 score: 0.7277300984468054
+class:0, new_weight:1.21, f1 score: 0.7300337938032027
+class:0, new_weight:1.22, f1 score: 0.7302916982856817
+class:0, new_weight:1.32, f1 score: 0.7302972834627351
+class:0, new_weight:1.33, f1 score: 0.7305212560605624
+class:0, new_weight:1.34, f1 score: 0.7307742905548762
+class:0, new_weight:1.3800000000000001, f1 score: 0.731115696618317
+class:0, new_weight:1.3900000000000001, f1 score: 0.7311341774607671
+class:0, new_weight:1.4000000000000001, f1 score: 0.7321211157346706
+class:0, new_weight:1.41, f1 score: 0.732530278451288
+class:0, new_weight:1.42, f1 score: 0.7326514907204666
+class:0, new_weight:1.43, f1 score: 0.7326655042252155
+class:0, new_weight:1.44, f1 score: 0.7340465325949609
+class:0, new_weight:1.45, f1 score: 0.7349701799847135
+class:2, new_weight:0.47000000000000003, f1 score: 0.7351355277520346
+class:2, new_weight:0.51, f1 score: 0.7352366908078052
+class:2, new_weight:0.52, f1 score: 0.7354704485017871
+class:2, new_weight:0.53, f1 score: 0.7356003615547112
+class:2, new_weight:0.54, f1 score: 0.7358162977063339
+class:2, new_weight:0.55, f1 score: 0.7360528528073605
+class:2, new_weight:0.6, f1 score: 0.7360930396635706
+class:2, new_weight:0.62, f1 score: 0.7361315695490319
+class:3, new_weight:0.77, f1 score: 0.736236509770795
+class:3, new_weight:0.79, f1 score: 0.7362861930960579
+class:3, new_weight:0.8, f1 score: 0.73637330491084
+class:3, new_weight:0.81, f1 score: 0.7364039172775363
+class:3, new_weight:0.8200000000000001, f1 score: 0.7365143106346561
+class:3, new_weight:0.8300000000000001, f1 score: 0.7366247783303799
+round:  2
+class:2, new_weight:0.55, f1 score: 0.7366811046538598
+round:  3
+********************** SEARCH BEST WEIGHT : [1.45, 1.0, 0.55, 0.8300000000000001] **********************
+********************** BEST MACRO_F1 : 0.7366811046538598 **********************
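
The weights found by the search are applied by scaling each class column of the probability matrix before taking the argmax. A small sketch of that step, using the weights from the log above (with 0.8300000000000001 rounded to 0.83) and two hypothetical probability rows:

```python
import numpy as np

# Class weights reported by the search in the log above (last value rounded).
class_weight = np.array([1.45, 1.0, 0.55, 0.83])

# Hypothetical predicted probabilities for two samples.
prob = np.array([[0.35, 0.40, 0.15, 0.10],
                 [0.10, 0.20, 0.45, 0.25]])

raw_label = prob.argmax(axis=1)
weighted_label = (prob * class_weight).argmax(axis=1)
print(raw_label, weighted_label)  # → [1 2] [0 2]
```

The first row flips from class 1 to class 0 because 0.35 * 1.45 exceeds 0.40. This is consistent with class 0 carrying the largest weight in the competition metric, so the search favours recall on it.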
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/model/model.pkl
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/model/model.pkl
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/cat_prob_result.csv
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/cat_prob_result.csv
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/catboost_result.csv
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/catboost_result.csv
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/lgb_prob_result.csv
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/lgb_prob_result.csv
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/stacking_result.csv
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/prediction_result/stacking_result.csv
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/run.log
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/run.log
@ -0,0 +1,12 @@
+Archive:  model.zip
+   creating: model/deberta-base/
+  inflating: model/debert_model_v21_128_fs_flod_5.h5  
+  inflating: model/debert_model_v21_128_fs_flod_6.h5  
+  inflating: model/debert_model_v21_128_fs_flod_8.h5  
+  inflating: model/README.txt        
+  inflating: model/weight_cs6399_fold_8_v21_128_fs.npy  
+  inflating: model/weight_cs6558_fold_5_v21_128_fs.npy  
+  inflating: model/weight_cs6614_fold_6_v21_128_fs.npy  
+  inflating: model/weight_fs6138_fold_8_v21_128_fs.npy  
+  inflating: model/weight_fs6280_fold_5_v21_128_fs.npy  
+  inflating: model/weight_fs6359_fold_6_v21_128_fs.npy  
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/run.sh
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/run.sh
@ -0,0 +1,5 @@
+rm -rf model
+#unzip model.zip
+python3 code/get_crashdump_venus_fea.py
+python3 code/catboost_fs.py
+zip -j result.zip prediction_result/catboost_result.csv
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/tcdata/数据集下载地址
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/tcdata/数据集下载地址
@ -0,0 +1 @@
+https://tianchi.aliyun.com/competition/entrance/531947/information?lang=zh-cn
--- a/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/答辩PPT/悦智AI实验室_20220525.pdf
+++ b/机器学习竞赛实战_优胜解决方案/第三届阿里云磐久智维算法大赛/答辩PPT/悦智AI实验室_20220525.pdf