amix, Beginner
13 February 2004 - 13:06. There is 1 comment and 1 solution.

XML parser - problems with fread

Hi

I have a script that parses some fairly large XML files (500 MB and 1.2 GB respectively).
Anyway, I use the script to parse data from the XML documents and insert it into a database (all of this works fine). However, I have found a very serious bug that causes the data in the database to become corrupt!

The bug lies in fread, more specifically in this loop:
    // parse XML
    while ($data = fread($fp, 4096))
    {
        // error... :(
        if (!xml_parse($this->xml_parser, $data, feof($fp)))
        {
            $ec = xml_get_error_code($this->xml_parser);
            die('XML parser error (error code ' . $ec . '): ' . xml_error_string($ec) .
                "\nThe error was found on line: " . xml_get_current_line_number($this->xml_parser));
        }
    }

This is a very common way to parse XML, i.e. reading 4096 bytes at a time. Yet this method causes corruption: the data is not parsed correctly!

I have solved this, but not in a smart way! I chose to have it read the whole file in one go, i.e. replacing the loop condition with: while ($data = fread($fp, filesize($this->xml_file)))

Now I wonder why 4096 bytes at a time works on small XML files but not on large ones!? Also, does anyone have an idea how to solve the problem? I can load the 500 MB file, but not one of 1.2 GB :-/

I have tried making it fetch more bytes per read, but this did not improve the result.
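
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the PHP manual notes that the character data handler can be called several times for the text of a single element, and a chunk boundary that falls in the middle of a value is one way this can happen. A handler that treats every call as a complete value would then store fragments. The sketch below only makes that visible. collectTextPieces() is a hypothetical helper that feeds a string to the parser a few bytes at a time and records every piece the handler receives. How many pieces you get depends on the chunk size and on the XML library PHP was built with.

```php
<?php
// Hypothetical helper (not from the original script): feed $xml to the
// parser $chunkSize bytes at a time and return every non-blank piece of
// text that the character data handler receives.
function collectTextPieces(string $xml, int $chunkSize): array
{
    $pieces = [];
    $parser = xml_parser_create();
    xml_set_character_data_handler($parser, function ($parser, $data) use (&$pieces) {
        if (trim($data) !== '') {
            $pieces[] = $data; // one entry per handler call
        }
    });

    $length = strlen($xml);
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $chunk = substr($xml, $offset, $chunkSize);
        $isFinal = ($offset + $chunkSize >= $length);
        if (!xml_parse($parser, $chunk, $isFinal)) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return $pieces;
}

$xml = '<record><a>5656565656</a></record>';
$whole = collectTextPieces($xml, strlen($xml));
$small = collectTextPieces($xml, 2);

echo count($whole), " piece(s) when parsed in one go\n";
echo count($small), " piece(s) when parsed 2 bytes at a time\n";
// The pieces always join back into the same text, however they were split.
var_dump(implode('', $small) === implode('', $whole));
```

If this is what is going on, it would also fit the observation that small files look fine: with few chunk boundaries, a value is rarely cut in two.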

Thanks in advance.
amix, Beginner
13 February 2004 - 17:52 #1
Argh.
Fuck it. It is PHP that is feeble-minded.

<?php
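$currentTag = "";
$table = "";   // name of the target table, set in startElementHandler
$count = 0;    // index of the next field/value pair in the current record

$fields = array();
$values = array();

$xml_file="data.xml";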

function startElementHandler($parser, $name, $attributes)
{
      global $currentTag, $table;
      $currentTag = $name;

      if (strtolower($currentTag) == "table")
      {
            $table = $attributes["name"];
      }

}

function endElementHandler($parser, $name)
{
      global $fields, $values, $count, $currentTag;

      global $connection, $table;

      if (strtolower($name) == "record")
      {
            $query = "INSERT INTO $table";
            $query .= "(" . join(", ", $fields) . ")";
            $query .= " VALUES(\"" . join("\", \"", $values) . "\");";

          echo "$query\n";

            $fields = array();
            $values = array();
            $count = 0;
            $currentTag = "";
      }

}

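function characterDataHandler($parser, $data)
{
      global $fields, $values, $currentTag, $count;
      // Note: the parser may call this handler several times for the text of
      // a single element, so every call here adds a separate field/value pair.
      if (trim($data) != "")
      {
            $fields[$count] = $currentTag;

            $values[$count] = mysql_escape_string($data);
            $count++;
      }
}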

$xml_parser = xml_parser_create();

xml_parser_set_option($xml_parser,XML_OPTION_SKIP_WHITE, TRUE);


xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, FALSE);

xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

if (!($fp = fopen($xml_file, "rb")))
{
      die("File I/O error: $xml_file");
}

while ($data = fread($fp, 2))
{
      if (!xml_parse($xml_parser, $data, feof($fp)))
      {
            $ec = xml_get_error_code($xml_parser);
            die("XML parser error (error code " . $ec . "): " . xml_error_string($ec) .
"<br>Error occurred at line " . xml_get_current_line_number($xml_parser));
      }
}

xml_parser_free($xml_parser);


?>

<?xml version="1.0"?>
<table name="readings">
      <record>
            <a>56565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656</a>
            <b>12565656565656565656565656565656565656565656565656565656565656565622</b>
            <c>785656565656565656565656565656565656565656565656565656565656565656.5</c>
      </record>
      <record>
            <x>456565656565656565656565656565656565656565656565656565656565656565</x>
            <y>-565656565656565656565656565656565656565656565656565656565656565610</y>
      </record>
      <record>
            <x>156565656565656565656565656565656565656565656565656565656565656562</x>
            <b>105656565656565656565656565656565656565656565656565656565656565656459</b>
            <a>7565656565656565656565656565656565656565656565656565656565656565656</a>
            <y>95656565656565656565656565656565656565656565656565656565656565656</y>
      </record>
</table>

The output is:
INSERT INTO readings(a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c) VALUES("56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "12", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "22", "78", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", ".5"); INSERT INTO readings(x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y) VALUES("4", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "5", "-", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "10"); INSERT INTO readings(x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, b, b, b, b, b, b, b, 
b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y) VALUES("1", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "2", "1", "05", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "64", "59", "75", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "6", "9", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56");

And if you make it read more bytes at a time:
INSERT INTO readings(a, b, c) VALUES("56565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656", "12565656565656565656565656565656565656565656565656565656565656565622", "785656565656565656565656565656565656565656565656565656565656565656.5"); INSERT INTO readings(x, y) VALUES("456565656565656565656565656565656565656565656565656565656565656565", "-565656565656565656565656565656565656565656565656565656565656565610"); INSERT INTO readings(x, b, a, y) VALUES("156565656565656565656565656565656565656565656565656565656565656562", "105656565656565656565656565656565656565656565656565656565656565656459", "7565656565656565656565656565656565656565656565656565656565656565656", "95656565656565656565656565656565656565656565656565656565656565656");

That is pretty fucked up.

What I plan to do now is split my 500 files into small files and then read each of them in one go.
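
An alternative to splitting the files, assuming the fragments come from the character data handler being called several times per element: collect the text in a buffer and only store it when the element closes. This is a minimal sketch, not the solution that was eventually used. parseRecords() and its return format are made up for illustration, and it returns plain arrays instead of building SQL strings, so escaping is left to whatever does the actual insert.

```php
<?php
// Sketch: buffer the text of each element and store it when the element
// closes, so text delivered in several pieces is joined again.
// parseRecords() is a hypothetical helper, not part of the original script.
function parseRecords(string $xml, int $chunkSize): array
{
    $records = [];
    $current = [];
    $buffer = '';

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attributes) use (&$buffer) {
            $buffer = ''; // a new element starts: drop whitespace seen so far
        },
        function ($parser, $name) use (&$records, &$current, &$buffer) {
            if (strtolower($name) === 'record') {
                $records[] = $current; // record is complete
                $current = [];
            } elseif (trim($buffer) !== '') {
                $current[$name] = trim($buffer); // full text of this element
            }
            $buffer = '';
        }
    );
    xml_set_character_data_handler($parser, function ($parser, $data) use (&$buffer) {
        $buffer .= $data; // may run several times per element, so append
    });

    $length = strlen($xml);
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $isFinal = ($offset + $chunkSize >= $length);
        if (!xml_parse($parser, substr($xml, $offset, $chunkSize), $isFinal)) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return $records;
}

$xml = '<table name="readings"><record><a>5656</a><b>1222</b></record>'
     . '<record><x>45</x><y>-10</y></record></table>';

// The chunk size should no longer affect the result.
var_dump(parseRecords($xml, 2) === parseRecords($xml, strlen($xml)));
```

For a real file the same handlers can be fed from an fread loop with a fixed chunk size, so the whole file never has to fit in memory.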
amix, Beginner
16 February 2004 - 09:07 #2
I have solved it in a different way.